Fuzzy Clustering in the Analysis of
Fourier Transform Infrared Spectra
for Cancer Diagnosis
by Xiao Ying Wang, BSc
Thesis Submitted to the University of Nottingham
for the degree of Doctor of Philosophy
School of Computer Science and Information Technology
September 2006
Table of Contents
1 Introduction
1.1 Background and Motivation
1.2 Aims of this project
1.3 Overview of the Thesis
2 Literature Review
2.1 Clustering Techniques
2.2 Cluster Validity
2.3 Auto Clustering
2.4 Cluster Merging
2.5 Clustering in FTIR Spectroscopy
2.6 Summary
3 Medical Background
3.1 Introduction
3.2 Instrumentation
3.3 Sample Preparation and Data Collection
3.4 Data Pre-processing
3.5 Summary
4 A Comparison of Hierarchical, K-Means and Fuzzy C-Means Clustering of Oral Cancer Cells
4.1 Introduction
4.2 Oral Cancer Datasets Description
4.3 Experiments on Oral Cancer Datasets
4.4 Summary
5 Methods for Automatically Determining the Number of Clusters
5.1 Introduction
5.2 VFC-SA Clustering Algorithm
5.3 SAFC Clustering Algorithm
5.4 Evaluation of VFC-SA and SAFC Clustering of Oral Cancer Cells
5.5 Summary
6 Methods for the Examination of Tissue Sections
6.1 Introduction
6.2 Lymph Node Dataset Description
6.3 A Combination of Principal Component Analysis and Fuzzy C-Means Clustering
6.4 Comparison of K-Means and Fuzzy C-Means in Lymph Node Tissue Sections
6.5 Summary
7 A Cluster Merging Algorithm
7.1 Introduction
7.2 Feature Extraction
7.3 Fuzzy C-Means Based Clustering Algorithm
7.4 The Basis of a New Automated Method to Merge Clusters
7.5 Experimental Results
7.6 Discussion of Results
7.7 Summary
8 Conclusions
8.1 Contributions
8.2 Future Work
8.3 Dissemination
9 References
10 Appendix
LIST OF FIGURES
Figure 1.1 FTIR Microscopy spectra for paint analysis [24].
Figure 1.2 An overview of the project collaboration.
Figure 2.1 Two dimensional dataset with 3 clusters [27].
Figure 2.2 Dendrogram obtained from Figure 2.1 [27].
Figure 2.3 The k-means clustering algorithm.
Figure 2.4 The fuzzy c-means clustering algorithm [47].
Figure 2.5 Outline of the SA based clustering algorithm.
Figure 2.6 The perturbation process in Brown and Huntley [53].
Figure 2.7 Identification of the number of clusters by using a validity index [56].
Figure 2.8 Two dimensional dataset strips in four directions.
Figure 2.9 Effective merging radius for clusters i and j.
Figure 3.1 Typical location of lymph nodes that drain lymph from the breast.
Figure 3.2 Perkin Elmer Spotlight imager.
Figure 4.1 Tissue sample from Dataset 1; (a) 4× stained picture; (b) 32× unstained picture.
Figure 4.2 FTIR spectra from Dataset 1.
Figure 4.3 32× unstained picture from tissue sample Dataset 2.
Figure 4.4 32× unstained picture from tissue sample Dataset 3.
Figure 4.5 White light image of tissue sample Dataset 4.
Figure 4.6 Tissue section from Dataset 5 (a) white light image (b) spectroscopic-staining image.
Figure 4.7 White light image of tissue sample for Dataset 6 (a) part 1 (b) part 2.
Figure 4.8 White light image of tissue sample for Dataset 7.
Figure 5.1 VFC-SA clustering algorithm procedure.
Figure 5.2 The split centre procedure.
Figure 5.3 An illustration of Split Centre from the original algorithm with distinct clusters (where µ11 and µ12 represent the membership degree of w1 to the centres v1 and v2 respectively).
Figure 5.4 The new Split Centre applied to the same dataset as Figure 5.3, above (where w1 is now the data point that is closest to the mean value of the membership degree above 0.5).
Figure 5.5 The SAFC clustering algorithm.
Figure 5.6 Fuzzy c-means, VFC-SA and SAFC cluster results for Dataset 1.
Figure 5.7 Cluster results for Dataset 2 obtained from (a) fuzzy c-means, VFC-SA and 3/10 runs from SAFC (b) 7/10 runs from SAFC.
Figure 5.8 Cluster results for Dataset 3 obtained from (a) fuzzy c-means and VFC-SA (b) SAFC.
Figure 5.9 Cluster results for Dataset 4 obtained from (a) fuzzy c-means (b) VFC-SA and SAFC.
Figure 5.10 Cluster results for Dataset 5 obtained from (a) fuzzy c-means and 5/10 runs from VFC-SA (b) SAFC and 5/10 runs from VFC-SA.
Figure 5.11 Fuzzy c-means, VFC-SA and SAFC cluster results for Dataset 6.
Figure 5.12 Cluster results for Dataset 7 obtained from (a) fuzzy c-means, 9 runs from VFC-SA and SAFC (b) 1 run from VFC-SA.
Figure 6.1 (a) Photomicrograph of the H&E stained parallel lymph node tissue section used for IR analysis (b) selected area, LNII5, at high magnification (c) different tissue types description (d) LNII5 spectral image.
Figure 6.2 IR imaging of lymph node tissue section LNII5 by PCA (a) H&E stained image of LNII5 (b)-(k) false colour weighted images for PC1-PC10 respectively.
Figure 6.3 Clustering results from three separate runs with fuzzy c-means.
Figure 6.4 A three-dimensional scatter plot of the tissue section spectra projected onto the first 3 PCs.
Figure 6.5 IR imaging of lymph node tissue section LNII5 by fuzzy c-means (a) H&E stained image of LNII5 (b)-(i) fuzzy c-means false colour weighted clustering results; the number of clusters ranged from 2 to 9 respectively.
Figure 6.6 IR imaging of lymph node tissue section LNII5 by PCA-fuzzy c-means (a) H&E stained image of LNII5 (b)-(i) fuzzy c-means false colour weighted clustering results; the number of clusters ranged from 2 to 9 respectively.
Figure 6.7 LNII5 tissue section spectra plot in three dimensional PCs space (a) original plot with 5 clusters (b) rotated plot of picture (a).
Figure 6.8 Clustering results from k-means (a&b) and fuzzy c-means (c) in 2 clusters.
Figure 6.9 K-means clustering results in 3 to 9 clusters.
Figure 6.10 Fuzzy c-means clustering results in 3 to 9 clusters.
Figure 6.11 Variation in k-means clustering results for 5 clusters.
Figure 6.12 Variation in k-means clustering results for 6 clusters.
Figure 6.13 Variation in k-means clustering results for 7 clusters.
Figure 6.14 Variation in k-means clustering results for 9 clusters.
Figure 7.1 The fuzzy c-means based clustering algorithm.
Figure 7.2 An extracted spectral dataset after applying fuzzy c-means based clustering algorithm.
Figure 7.3 The procedure of determining a reference wave-number.
Figure 7.4 Mean infrared spectra obtained from different clusters.
Figure 7.5 Enlarged region of Figure 7.4.
Figure 7.6 The procedure of automated merge clusters.
Figure 7.7 Four mean spectra absorbance at reference wave-number.
Figure 7.8 The resultant absorbance distribution obtained after merging the two most similar clusters.
Figure 7.9 The merging situation when there are two dist left (type 1).
Figure 7.10 The merging situation when there are two dist left (type 2).
Figure 7.11 Entire automated merging clustering procedure.
Figure 7.12 The extracted spectral dataset after applying the proposed automated merging cluster method.
Figure 7.13 An example of an extracted dataset.
Figure 7.14 An example of a whole sub area of lymph node dataset.
Figure 7.15 (a) Extracted LNII7 clustering results after applying fuzzy c-means based clustering algorithm. (b) Extracted LNII7 merged clusters results.
Figure 7.16 (a) Dataset 3 clustering results obtained from SAFC algorithm. (b) Dataset 3 merged clusters results.
Figure 7.17 (a) Dataset 5 clustering results obtained from SAFC algorithm. (b) Dataset 5 merged clusters results.
Figure 7.18 Lymph node tissue section LNII7. Sampled area was 275µm × 818.75µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 5 cluster image (d) 6 cluster image (e) 9 cluster image (f) Final results obtained from automated merge clustering algorithm; this image contained two final clusters of IR spectra.
Figure 7.19 Lymph node tissue section LNII5. Sampled area was 30625µm × 95625µm in size. (a) Total absorbance IR image (b) H&E stained image. Results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 5 cluster image (d) merged cluster result from 5 cluster image (e) 4 cluster image (f) merged cluster result from 4 cluster image. Both merged cluster results contained three clusters of IR spectra.
Figure 7.20 Lymph node tissue section LN57. Sampled area was 550µm × 512.5µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 3 cluster image (d) 4 cluster image (e) 5 cluster image (f) Final result obtained from automated merge clustering algorithm. Image contained three final clusters of IR spectra.
LIST OF TABLES
Table 4.1 Distribution of the different tissue types identified clinically and as obtained by the various clustering techniques.
Table 4.2 Comparison results based on the number of disagreements between clinical study and the various clustering results.
Table 4.3 Clustering variations for k-means and fuzzy c-means within three datasets.
Table 4.4 Average number of disagreements obtained in the three clustering methods.
Table 5.1 Average of the VXB index values obtained when using the fuzzy c-means, VFC-SA and SAFC algorithms.
Table 5.2 Comparison of the number of clusters achieved by clinical analysis, VFC-SA and the SAFC methods.
Table 6.1 The ranges of the first 3 PCs in seven oral cancer FTIR datasets.
Table 6.2 Summary of fuzzy c-means clustering computation time.
Table 6.3 Summary of PCA-fuzzy c-means computation time.
Table 6.4 Computation time comparison between PCA, fuzzy c-means and PCA-fuzzy c-means analysis techniques.
Table 7.1 The number of clusters obtained at different stages of clustering.
ABSTRACT
This thesis focuses on the development of fuzzy clustering techniques to
investigate the use of infrared spectroscopy as a diagnostic probe for the
identification of the early stages of cancer. Several new clustering approaches are
developed and compared to existing clustering algorithms from the literature using
two different types of spectral datasets, with the aim of automatically identifying the
different types of tissue present within any given spectral dataset. The datasets have
been obtained from actual infrared spectroscopy performed at the University of
Nottingham on oral cancer and lymph node tissue sections.
Firstly, a simulated annealing based clustering algorithm is developed that can
automatically obtain the correct number of clusters from the given oral cancer
spectra datasets. Through the use of principal component analysis, a multivariate
statistical technique, the results on different spectral datasets are visualised and
compared to existing approaches, such as k-means and fuzzy c-means clustering.
The new simulated annealing approach, developed in this thesis, obtains better and
more consistent clustering results than the original variable string length simulated
annealing algorithm from the literature.
The thesis continues by developing a new technique for the purpose of merging
clusters with the same biochemical characteristics. This is to overcome the
problem observed in previous clustering algorithms in which an excessive number of
clusters occasionally occurred. In comparison with the above-mentioned simulated
annealing clustering algorithm, this novel clustering method can more often identify
the correct number of clusters in analysis of the same datasets. In addition, this
technique also allows for the classification of large amounts of spectral data in a
practical, acceptable time. This makes the developed techniques good candidates for
transfer to medical diagnosis in the real world.
ACKNOWLEDGEMENTS
Firstly, I would like to thank my supervisor, Dr. Jon Garibaldi, for providing
support, guidance and opportunities during my study. I am extremely grateful for the
valuable comments and help he has afforded throughout this research programme.
I would also like to thank Professor Edmund Burke for giving me the
opportunity to study for my PhD in the ASAP research group.
I also thank Dr. Benjamin Bird, Professor Michael George and Mr. John M.
Chalmers from the Department of Chemistry within the University for providing
infrared spectra data and diagnostic results. I am thankful for their valuable
discussion and support throughout the period of this work.
In particular, I would like to thank Dr. Glenn Whitwell and Dr. Turhan Ozen
for their precious comments, support and help during my study.
Finally, I would like to express my sincerest thanks and gratitude to my family
in China and especially to my mum Gui Ying, my dad Ya Min and sister Xiao Jia
who have provided their deep love and unfailing encouragement throughout my
recent and previous studies.
CHAPTER 1
Introduction
Currently, substantial effort is being devoted to investigating whether infrared
spectroscopy can be used as a diagnostic probe to identify the early stages of cancer.
This thesis investigates this possibility by developing and utilising fuzzy clustering
techniques to classify data from infrared spectroscopy. This chapter introduces the
background and motivation for this research and then details its aims and objectives.
Finally, an overview of the remainder of this thesis is provided.
1.1 Background and Motivation
Cancer has become one of the most frequent causes of mortality around the
world and research into its diagnosis and treatment has become an important issue
for the scientific community. In Britain, more than one in three people will be
diagnosed with cancer during their lifetime and, on current statistics, 25% of the
population will die from a cancer related illness [1]. Accurate diagnostic techniques
could enable various cancers to be detected in their infancy and, consequently, the
appropriate treatments could be undertaken earlier.
Currently, most cancers are diagnosed by physically removing a piece of
sample tissue from the patient and then observing this tissue section under
a microscope in order to obtain diagnostic results. In the following sections, the
example of cervical cancer will be used to illustrate the current diagnosis techniques
and the problems that exist during routine clinical practice which might result in
misdiagnosis of the cancer. Obviously, any misdiagnosis or delay in diagnosis can
have severe repercussions for the patient’s chances of a full recovery.
In most cases, pre-cancerous or cancerous cells of the cervix are first detected
with a technique known as the Pap smear (the name ‘Pap’ is taken from the surname
of the inventor of the screening test, Dr. George Papanicolaou) [2]. The procedure
is as follows: the physician first obtains the cervical cells and then gives the
specimen to a nurse, physician’s assistant or other specially trained medical
professional, who smears some of the cells onto glass slides and takes them to
the laboratory to be evaluated. There, a pathologist analyses the slides under
a microscope in order to categorise the cells as either abnormal (cancerous or
pre-cancerous) or normal (non-cancerous). If abnormal cells are found, the physician
requests further examinations of the patient. This often means that the existing cells,
which have been classified as abnormal, are sent to a senior cytologist for additional
investigation; in addition, secondary cell samples may be required from the
patient for re-examination.
In the above diagnostic procedure, several factors may lead to abnormalities
going undetected by the smears. For instance, if the doctor or nurse did not take a
good smear (which sometimes occurs with nervous patients), if the smear contains an
inadequate number of cells, or if other substances mask the cells required, then the
skilled laboratory technicians who examine the smears may make a mistake [3].
Also, in extremely rare cases, the sample labels may even get mixed
up [4]. In the early stages of cancer formation, the composition of cells may change
only very slightly. Whilst the smear may contain several thousand cells, if
only a few of these show signs of a change in composition, it can be very hard to
identify the onset of cancer. As it is impractical to scan every individual cell on the
smear, the technician will randomly select some cells from each sample and,
inevitably, abnormal cells may be missed [4].
As mentioned above, the main technique used to provide diagnosis is the
observation of morphological changes within the cells (by studying the form and
structure of the cells without consideration of function) [5]. In recent years, Fourier
Transform Infrared Micro-Spectroscopy (also referred to as FTIR Microscopy) has
been increasingly applied to the study of biomedical conditions and could become a
very powerful tool for the determination and monitoring of chemical composition
within biological systems [6]. It has also been used as a diagnostic tool for various
human cancers and other diseases [7-23]. This technology works by measuring the
wavelengths at which different functional groups of chemical samples absorb
infrared radiation (IR) and the intensities of these absorptions. The quantity of
absorption depends on the chemical bonds and the structure of the molecule and,
hence, small changes in this molecular structure can significantly affect the
absorption intensity. Since different chemical functional groups absorb light at
different wavelengths, the resultant FTIR Microscopy spectrum can be likened to a
molecular ‘fingerprint’. If the characteristic spectra of abnormal and normal tissue
components are known (in a ‘fingerprint library’), it may be possible to compare
each of the obtained spectra to the reference spectra within the fingerprint library and
thereby achieve an accurate diagnosis. Figure 1.1 shows an example of an FTIR
Microscopy spectrum from a non-biochemical application in which standard and
unknown paint samples are compared [24]. In the context of cancer diagnosis, the
FTIR Microscopy technique detects the molecular differences within the cell rather
than visual changes in the cell structure and hence FTIR may lead to an earlier
detection of cell abnormalities. In comparison with conventional histology, FTIR
Microscopy has several advantages:
a) It has the potential for fully automatic measurement and analysis.
b) It is very sensitive; very small samples are adequate.
c) It is potentially much quicker and cheaper for large scale screening
procedures.
d) It has the potential to detect changes in cellular composition prior to such
changes being detectable by other means.
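The comparison of an obtained spectrum against a fingerprint library, as described above, could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `best_library_match`, the labels `reference_A` and `reference_B`, the use of Pearson correlation as the similarity measure and the synthetic Gaussian bands are all hypothetical and are not taken from the thesis.

```python
import numpy as np

def best_library_match(spectrum, library):
    """Return (label, score) of the library spectrum most correlated with `spectrum`.

    `library` maps a hypothetical label to a reference absorbance vector
    sampled on the same axis as `spectrum` (an assumption of this sketch).
    """
    spectrum = np.asarray(spectrum, dtype=float)
    scores = {
        label: float(np.corrcoef(spectrum, np.asarray(ref, dtype=float))[0, 1])
        for label, ref in library.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]

# Synthetic data only: two Gaussian-shaped bands, not real tissue spectra.
axis = np.linspace(900.0, 1800.0, 200)
library = {
    "reference_A": np.exp(-((axis - 1650.0) / 30.0) ** 2),
    "reference_B": np.exp(-((axis - 1080.0) / 40.0) ** 2),
}
# An "unknown" spectrum: a scaled copy of reference_A plus a small ripple.
unknown = 0.8 * library["reference_A"] + 0.01 * np.sin(axis / 50.0)
label, score = best_library_match(unknown, library)
```

Correlation is insensitive to overall scaling of the absorbance, which is one reason it is a plausible choice here; other similarity measures would fit the same structure.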
There are two types of FTIR detection: FTIR mapping and FTIR imaging. In
FTIR mapping, the IR spectrum of the sample is collected one point at a time and
many separate collections must be made to examine different areas of the tissue
sample. However, in the more recently developed imaging technique [18], it is
possible to produce multiple IR spectra of the sample in a single collection. In
addition, the imaging technique allows images to be collected in less time and
with higher spatial resolution.
Figure 1.1 FTIR Microscopy spectra for paint analysis [24].
Clustering is a multivariate analysis technique that has been adopted in both
medical diagnosis studies and pattern recognition areas [25]. By examining the
underlying structure of a dataset, cluster analysis aims to categorise data (in this case,
IR spectra) into separate groups according to their characteristics. Clustering is
performed such that the spectra held within a cluster are as similar as possible and
those found in different clusters are as dissimilar as possible. Therefore, different
cell types found within biological tissue can be separated and characterised.
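As an illustration of the kind of clustering discussed above, a minimal fuzzy c-means sketch is given below. It follows the standard textbook update rules rather than any implementation from this thesis; the fuzzifier `m=2.0`, the random initialisation, the stopping tolerance and the synthetic five-feature "spectra" are all assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(data, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means sketch; returns (centres, memberships).

    data: (n_samples, n_features); memberships: (n_samples, n_clusters).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Random initial memberships, each row normalised to sum to 1.
    u = rng.random((data.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centres are membership-weighted means of the data.
        w = u ** m
        centres = (w.T @ data) / w.sum(axis=0)[:, None]
        # Distance from every sample to every centre (floored to avoid division by zero).
        dist = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Standard membership update: inversely related to relative distance.
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centres, u

# Synthetic data only: two well-separated groups of 20 five-point "spectra".
rng = np.random.default_rng(1)
spectra = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 5)),
    rng.normal(1.0, 0.1, size=(20, 5)),
])
centres, u = fuzzy_c_means(spectra, n_clusters=2)
labels = u.argmax(axis=1)  # hard assignment taken from the largest membership
```

Unlike a hard partition, each spectrum keeps a degree of membership in every cluster; the `argmax` step is only needed when a single label per spectrum is required.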
This research is based on the development of fuzzy clustering techniques to
investigate whether infrared spectroscopy can be utilised as a diagnostic probe to
identify cancer in its early stages and whether cancerous cells can be detected more
robustly than with the currently employed procedures. This study is motivated by the
investigation into the diagnosis of oral cancer tissue that was carried out by Mr John M.
Chalmers et al. from Derby General City Hospital, UK, in 2002 [26]. Following this
study, we have been collaborating with Professor Mike Chesters, Professor Mike W.
George and his PhD student Benjamin Bird from the Physical Chemistry Department
at The University of Nottingham to obtain a wider sample source and analysis results
of infrared spectroscopy. An overview of this collaborative project is provided in
Figure 1.2.
Figure 1.2 An overview of the project collaboration. [Diagram labels: Gloucestershire Royal Hospital; Chemistry department; Computer science; sample collection and diagnosis; infrared microscopy scan; clustering analysis to separate different types of tissues.]
1.2 Aims of this project
The ultimate goal of this research is to establish the techniques necessary to
develop automated diagnosis tools that will be practical and useful across a wide
range of medical domains. This thesis focuses on the development of clustering
techniques that are able to automatically classify the different types of tissue sections
and to investigate whether infrared spectroscopy can be used as a diagnostic
probe to identify early stages of cancer. In order to achieve this aim, the following
objectives were identified:
• Implement necessary pre-processing functions for the given FTIR spectra
which have irregularities in cell density across the tissue section so that these
spectra can then be analysed in a standard manner.
• Investigate the use of feature extraction techniques that may be necessary in
order to permit the clustering of large datasets in reasonable time.
• Compare different commonly used techniques in infrared spectra
analysis and choose a suitable clustering method as the main technique to
classify the infrared spectra.
• Develop an automated clustering technique which can automatically
identify different types of tissue in the given FTIR spectra datasets.
• Determine proper methods to evaluate the obtained clustering results.
• Present the obtained clustering results in a clear and easy to understand way.
1.3 Overview of the Thesis
This thesis continues with the following chapters: Chapter 2 (literature review)
includes a review of various clustering approaches and multivariate analysis methods
that have been applied previously in general clustering and infrared spectra analysis,
and introduces cluster validity as a technique to evaluate the quality of the clustering.
In addition, as one of the objectives of this research is to enable the clustering method to
automatically identify the number of clusters, previous related work in this area is
also reviewed.
Chapter 3 describes the medical background of this thesis. Two main types of
cancerous tissue sample, namely oral and breast cancer, are involved in this study. In
this chapter, the collection and preparation of these tissue samples and the
instrumentation used for this purpose are described. Finally, the necessary
pre-processing procedures used to obtain a standardised form of FTIR spectra are
presented.
Chapter 4 compares the performance of three commonly used clustering techniques,
namely hierarchical cluster analysis, k-means and fuzzy c-means, in
infrared analysis on seven oral cancer FTIR datasets. The diagnostic results from
clinical study are considered as the ‘gold standard’ in this thesis to evaluate the
clustering results from these three techniques. Corresponding experiments were
carried out and the results showed that fuzzy c-means is the most suitable clustering
method in this context.
Chapter 5 develops a simulated annealing based fuzzy clustering algorithm
(SAFC) to automatically detect the number of clusters from the given FTIR spectra
datasets. This method is an extension of an original method named variable string
length simulated annealing (VFC-SA) algorithm, incorporating four novel
amendments to improve its performance. The Xie-Beni validity index was employed
to evaluate the quality of the generated clusters. Experiments were performed by
applying fuzzy c-means to seven oral cancerous FTIR spectra datasets. The VFC-SA
and SAFC algorithms were applied and their performance was evaluated. The results
indicated that the SAFC algorithm produced better solutions in that it consistently
obtained better values of the Xie-Beni index.
Chapter 6 applies the techniques developed in Chapter 5 to samples taken from an
axillary lymph node tissue section using an infrared imaging technique. This tissue
section was firstly examined visually utilising principal component analysis (PCA)
and then clustering was performed using the standard fuzzy c-means method. Next,
both methods were combined so that PCA was used to extract the first ten principal
components of the FTIR spectra to be used as input to fuzzy c-means to perform the
clustering. Experimental results from these three methods were compared and the
results are discussed. In addition, the computational time required by the three
approaches was also compared. The results demonstrated that the combination of
PCA and fuzzy c-means obtained the same good results as fuzzy c-means but using
much less time. Another experiment was conducted using the combination of PCA
with the k-means algorithm and its performance was compared with the combination
of PCA and fuzzy c-means. The results showed that the combination of PCA and the
fuzzy c-means method obtained more stable clustering results (less variation in
clustering results over ten runs of the algorithm). In addition, the combination of
PCA and the k-means method could not separate the main types of tissue section
when the number of clusters was low.
In Chapter 7 the development of an automated clustering method to identify the
number of clusters for axillary lymph node tissue sections is described. The results
obtained in Chapter 5 had shown that the SAFC technique could occasionally
identify an excessive number of clusters. This may have been due to the complexity
of different cell types, such as those found in the axillary lymph node sections. In
order to attempt to solve these problems, an automated method to merge clusters was
developed. Experiments using six different FTIR spectra datasets were carried out
and the results are discussed. The results showed that clusters that have similar
biochemistry were successfully merged, thus indicating that the algorithm can
accurately determine the main tissue types within the given infrared spectral datasets.
Chapter 8 draws conclusions, lists the contributions, and suggests some
interesting potential avenues for future research arising from the work presented in
this thesis. The dissemination arising from this work is also listed.
CHAPTER 2
Literature Review
This Chapter provides a literature review of general clustering techniques and
clustering approaches in FTIR microscopy applications. As this thesis focuses on a
real medical application, robust clustering techniques that can automatically cluster
FTIR spectra data are also introduced. One of the contributions of this thesis is the
development of a new approach to merge clusters and, in light of this fact, previously
published literature on the merging of clusters is also discussed.
2.1 Clustering Techniques
Clustering is the process of grouping a set of unlabelled multidimensional
patterns (objects or data points), such that patterns in the same cluster have the most
similar characteristics, and patterns within different clusters have the most dissimilar
characteristics. In most cases a cluster is represented by a cluster centre or a
‘centroid’ [27,28]. Clustering has been applied to a wide range of applications, such
as pattern recognition, image segmentation, spatial data analysis, machine learning,
data mining, economic science, and internet portals [28-33]. Classification, another
data analysis method, is often confused with clustering. The distinction between the
two approaches is that classification is a supervised learning process which is trained
on a set of pre-labelled patterns in order to predict into which class new patterns
should be placed. In contrast, clustering is unsupervised, has no predefined classes
and does not involve training examples [34,35]. As mentioned above, the aim of
clustering is to group the patterns into clusters based on their similarity.
A basic outline of a general clustering process can be described as follows [27,35]:
1) Perform feature selection and/or feature extraction from the original
dataset. Feature selection is the process used to find the most representative
subset of the original features to be used within clustering. Feature
extraction uses one or more transformations of the original features to
produce new salient features [27]. The purpose of this step is to make the
clustering process work more ‘efficiently’ (in some way) as only the most
important characteristics need to be considered. The objective is usually to
reduce the time required for the clustering process without adversely
affecting the quality of the clusters obtained.
2) Select a proximity measure to be used. This is used to evaluate the
similarity (or dissimilarity) of two data points. This could be the Euclidean
distance, correlation coefficient or other measures.
3) Apply the clustering technique to classify the dataset. Many different
clustering techniques have been developed within the literature. This
literature review identifies and describes these approaches.
4) Validate the clusters. Cluster validation evaluates the clustering scheme
obtained from step (3). Cluster validity indices are often used to assess the
quality of the clusters.
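Step 2) of the outline above can be illustrated with a minimal Python sketch of two proximity measures. The function names are hypothetical, the correlation-based measure is written here as one minus the Pearson correlation coefficient (one common convention, assumed for illustration), and the toy vectors merely stand in for spectra.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_dissimilarity(a, b):
    """1 - Pearson correlation coefficient (0 means identical shape).

    Assumes neither vector is constant, otherwise a variance is zero.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Two toy "spectra" with the same shape but different intensity.
spectrum_1 = [1.0, 2.0, 3.0]
spectrum_2 = [2.0, 4.0, 6.0]
print(euclidean_distance(spectrum_1, spectrum_2))         # non-zero: intensities differ
print(correlation_dissimilarity(spectrum_1, spectrum_2))  # zero: shapes are identical
```

The example shows why the choice of measure matters: the two vectors are far apart in the Euclidean sense, yet indistinguishable under the correlation-based measure.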
In general, the different clustering techniques can be divided into two main
categories. They are hierarchical and partitional clustering [27,36]. In each category,
many subtypes and variants have been applied to diverse types of clustering
problems. In conventional clustering algorithms, each pattern has to be assigned
exclusively to one cluster. Where the physical boundaries of clusters are well-
defined, this approach can work well. However, when using data from real world
applications, the boundaries between clusters might be vague. For this reason, fuzzy
clustering extends the traditional clustering concept by allowing each pattern to be
assigned to every cluster with an associated membership value. Therefore, for
unclear cluster boundaries, fuzzy clustering may obtain more reasonable results. In
partitional clustering, the normal process is to optimise an objective function which
somehow reflects the quality of the clusters. In order to find better solutions, some
search based approaches have also been combined with these clustering algorithms in
order to maximise or minimise the objective function [37]. In the rest of this section,
the key literature is identified and described. This includes hierarchical clustering,
partitional clustering, fuzzy clustering and hybridisations of the above clustering
approaches with search based algorithms.
2.1.1 Hierarchical Clustering
Hierarchical clustering is a way to group the data in a nested series of clusters
[27]. The output of hierarchical clustering is a cluster tree, termed a ‘dendrogram’,
which represents the similarity level between all of the patterns. Figure 2.1 shows a
two-dimensional dataset which contains three clusters (data points have been labelled
A – G). The dendrogram corresponding to this figure has been displayed in Figure
2.2. Both figures have been taken from [27]. As shown in Figure 2.2, a specific
number of clusters can be generated through the vertical positioning of the cut-off
line (dashed line in the figure). All of the data points that remain connected
below the cut-off line belong to one cluster. The position of the cut-off
line is normally subjective and is decided based on the solution requirements. It
should be noted that if the cut-off line is placed higher on the diagram, the total
number of clusters is reduced, whereas, if the cut-off line is lowered, more clusters
are produced. Based on its algorithmic structure and operation, hierarchical
clustering can be further categorised into agglomerative algorithms and divisive
algorithms [27,35]. The agglomerative method initially considers each pattern as an
individual cluster and then, at each step, the two closest clusters (which are measured
based on the corresponding linkage method) are merged to form a new cluster and so
forth, until all the clusters are merged into one cluster. The dendrogram in the
agglomerative approach is generated in a bottom-up fashion. In contrast, the divisive
method starts by considering all patterns in one cluster and, at each step, splits the
cluster into two groups based on the similarity within the patterns, such that patterns
in the same group have the highest similarity and patterns in different groups have
the greatest dissimilarity. This process continues until each cluster only contains a
single pattern. The divisive approach is based on top-down dendrogram generation.
Figure 2.1 Two dimensional dataset with 3 clusters [27].
Figure 2.2 Dendrogram obtained from Figure 2.1 [27].
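The bottom-up construction and the effect of the cut-off line can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure of [27]: it assumes single linkage, uses hypothetical function names and toy points, and represents the dendrogram simply as the list of heights at which merges occur.

```python
import math

def single_link(c1, c2):
    """Distance between the two closest patterns of two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters until one remains; return merge heights."""
    clusters = [[p] for p in points]      # every pattern starts as its own cluster
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)                 # height of this merge in the dendrogram
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

def clusters_at_cut(heights, cut, n_points):
    """Number of clusters when the dendrogram is cut at height `cut`.

    Valid because single-linkage merge heights never decrease.
    """
    return n_points - sum(1 for h in heights if h <= cut)

points = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 0)]
heights = agglomerate(points)
for cut in (2.0, 5.0, 7.0):
    print(cut, clusters_at_cut(heights, cut, len(points)))
```

Raising the cut-off value from 2.0 to 7.0 reduces the number of clusters from three to one, which mirrors the behaviour of the cut-off line described above.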
As part of the agglomerative algorithm, the linkage method provides a way to
measure the similarity of clusters based on the patterns in the cluster [34]. The main
linkage methods include single linkage, complete linkage and minimum-variance
(Ward) algorithms [27,35]. Most of the other linkage methods are variants of these
three. In the single linkage algorithm, the distance between two clusters is measured
by the two closest patterns within the different clusters. By contrast, in the complete
linkage algorithm, the distance between two clusters is measured by the two furthest
patterns within the different clusters. The minimum-variance algorithm is distinct
from the other two methods because it uses variance analysis to measure the distance
between two clusters. In general, this method attempts to minimise the sum of
squared errors of any two hypothetical clusters which can be formed at each step
[38]. This is based on the Euclidean distance between centroids [39].
Depending on the linkage method employed, hierarchical clustering can
generate clusters having different characteristics. For example, the single linkage
algorithm has a tendency to produce a cluster with an elongated and irregular shape
whereas the complete linkage algorithm can produce tight, compact and roughly
hyper-spherical clusters [27], while the minimum-variance algorithm often generates
compact clusters of roughly equal size or dispersion [40].
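The three linkage measures can be compared on a small example. The sketch below uses hypothetical function names; the minimum-variance (Ward) measure is written in its standard form as the increase in the within-cluster sum of squared errors caused by a merge, which is an assumption about the formulation rather than a statement of the exact expression used in [38,39].

```python
import math

def centroid(cluster):
    """Mean position of the patterns in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def single_linkage(c1, c2):
    """Distance between the two closest patterns of the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the two furthest patterns of the two clusters."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def ward_increase(c1, c2):
    """Increase in the sum of squared errors if c1 and c2 were merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * math.dist(centroid(c1), centroid(c2)) ** 2

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))    # gap between nearest members
print(complete_linkage(cluster_a, cluster_b))  # span between furthest members
print(ward_increase(cluster_a, cluster_b))     # growth in squared error on merging
```

Because single linkage looks only at the nearest pair, a chain of closely spaced patterns can link two otherwise distant groups, which is consistent with the elongated clusters noted above.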
2.1.2 Partitional Clustering
In contrast to the hierarchical clustering algorithm, partitional clustering
obtains a single partition of the patterns instead of a clustering structure. It usually
generates clusters by evaluating a criterion function which is defined locally or
globally and attempts to recover the natural clusters present in the patterns. The
advantage of partitional clustering methods is that they are especially appropriate in
the analysis of large data sets, where a dendrogram based hierarchical clustering
method is computationally expensive and is impractical with more than a few
hundred patterns [27,41]. The partitional clustering algorithm typically selects a
criterion and then evaluates it for a certain number of clusters multiple times,
starting from different initial states. The best partition found during optimisation is
returned as the result of the clustering. A drawback of the approach is that the
number of clusters needs to be specified in advance.
The most commonly used criterion in partitional clustering is
the squared error criterion, which works best with clusters that are isolated and
compact [27,41]. Let us assume that a dataset $X = \{x_1, x_2, \ldots, x_n\}$ contains $n$ patterns
and is to be clustered into $c$ clusters. $V = \{v_1, v_2, \ldots, v_c\}$ is the corresponding set of
centres, $c_j$ is the number of patterns in cluster $j$ and each pattern may only belong
to one cluster. The squared error $e^2$ can now be expressed as follows:
$$e^2 = \sum_{j=1}^{c} \sum_{i=1}^{c_j} \| x_{ij} - v_j \|^2 \qquad (2.1)$$
where $x_{ij}$ is the $i$th pattern within the $j$th cluster; $v_j$ is the $j$th cluster centre, and
$\| x_{ij} - v_j \|$ is the Euclidean distance between $x_{ij}$ and $v_j$.
Based on the squared error criterion function, k-means clustering is one of the
earliest established algorithms in partitional clustering [42]. The aim of the k-means
algorithm is to minimise the squared error criterion objective function $e^2$ in Equation
2.1. The procedure for the k-means algorithm is shown in Figure 2.3 where, for the
$j$th centre, $v_j$ is calculated as:
$$v_j = \frac{1}{c_j} \sum_{i=1}^{c_j} x_{ij}, \quad j = 1, \ldots, c \qquad (2.2)$$
Randomly initialise the position of the c cluster centres.
1) Calculate the distance between all of the patterns and each centre.
2) Each pattern is assigned to a cluster based on the minimum distance.
3) Recalculate the centre positions using Equation (2.2).
4) Recalculate the distance between each pattern and each centre.
5) Reassign each pattern to a cluster.
6) If no data was reassigned, then stop, otherwise repeat from step 3).
Figure 2.3 The k-means clustering algorithm.
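The procedure in Figure 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name, the seeded initialisation (centres drawn from the data themselves), the iteration cap and the toy dataset are all hypothetical choices.

```python
import math
import random

def kmeans(patterns, c, seed=0, max_iter=100):
    """Plain k-means on a list of tuples; returns (centres, labels, squared error)."""
    rng = random.Random(seed)
    centres = rng.sample(patterns, c)   # assumed initialisation: c distinct patterns
    labels = None
    for _ in range(max_iter):
        # Assign every pattern to its nearest centre.
        new_labels = [min(range(c), key=lambda j: math.dist(x, centres[j]))
                      for x in patterns]
        if new_labels == labels:        # no pattern was reassigned: stop
            break
        labels = new_labels
        # Recalculate each centre as the mean of its members (Equation 2.2).
        for j in range(c):
            members = [x for x, l in zip(patterns, labels) if l == j]
            if members:                 # an empty cluster keeps its old centre
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Squared error criterion of Equation 2.1.
    sse = sum(math.dist(x, centres[l]) ** 2 for x, l in zip(patterns, labels))
    return centres, labels, sse

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels, sse = kmeans(data, c=2)
print(labels, sse)
```

On data as well separated as this, the partition found does not depend on the initial centres; on less clear-cut data, different seeds can lead to different local optima, which is why the algorithm is normally run several times.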
2.1.3 Fuzzy Clustering
Fuzzy clustering has become an interesting and important branch of partitional
clustering. It was originally developed in 1969 when Ruspini applied fuzzy set
theory to clustering [43]. One of the major differences between fuzzy clustering and
hard clustering is that fuzzy clustering allows each pattern to belong to more than
one cluster with varying degrees of certainty, based on their distance to the cluster
centres. This is called the ‘membership’ or ‘soft membership’ function.
The fuzzy c-means algorithm is one of the most popular fuzzy clustering
algorithms. It was first developed by Dunn in 1973 [44] and was subsequently
improved by Bezdek in 1981 [45]. In comparison with Dunn’s algorithm, Bezdek’s
fuzzy c-means algorithm introduces a fuzzifier parameter, 1 ≤ m < ∞. The purpose of
the fuzzy c-means algorithm is to minimise the fuzzy objective function as shown in
Equation (2.3).
$$J(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} (\mu_{ij})^m \| x_i - v_j \|^2 \qquad (2.3)$$
The resulting algorithm can recognise spherical patterns in multi-dimensional space
[46]. Once again, this can be formulated as follows: $X = \{x_1, x_2, \ldots, x_n\}$ represents a
collection of data and $V = \{v_1, v_2, \ldots, v_c\}$ is the set of corresponding cluster centres (as
previously defined in Section 2.1.2). In addition, $\mu_{ij}$ is the membership degree of
pattern $x_i$ to the cluster centre $v_j$ and $\mu_{ij}$ must satisfy the following conditions:
$$\mu_{ij} \in [0, 1], \quad i = 1, \ldots, n, \quad j = 1, \ldots, c \qquad (2.4)$$
$$\sum_{j=1}^{c} \mu_{ij} = 1 \qquad (2.5)$$
Parameter $m$ is called the ‘fuzziness index’ (or fuzzifier) and is used to control the
fuzziness of the membership of each data point. A larger value of $m$ makes the
method ‘more fuzzy’ whilst a smaller value makes the method ‘less fuzzy’. There is
no theoretical basis for the optimal selection of $m$, but a value of $m = 2.0$ is most
commonly used [45]. The Euclidean distance between $x_i$ and $v_j$ is represented by
$\| x_i - v_j \|$. $U = (\mu_{ij})_{n \times c}$ is a fuzzy partition matrix, which contains all of the
membership degree values of each data point to all cluster centres. The fuzzy c-means
procedure is shown in the following steps:
1) Fix the number of clusters, $c$, where $2 \leq c < n$, and initialise the fuzzy partition
matrix $U$ with random values such that it satisfies conditions (2.4) and (2.5).
2) Calculate the fuzzy centres $v_j$ using
$$v_j = \frac{\sum_{i=1}^{n} (\mu_{ij})^m x_i}{\sum_{i=1}^{n} (\mu_{ij})^m}, \quad \forall j = 1, \ldots, c \qquad (2.6)$$
3) Update the fuzzy partition matrix $U$ with
$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}} \qquad (2.7)$$
where $d_{ij} = \| x_i - v_j \|$, $i = 1, \ldots, n$ and $j = 1, \ldots, c$.
4) Repeat steps 2) and 3) until one of the termination criteria is satisfied.
Figure 2.4 The fuzzy c-means clustering algorithm [47].
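The steps of Figure 2.4 can be sketched in Python as shown below. The sketch assumes m > 1, uses the change in the objective function J as the termination criterion together with an iteration cap, and handles the special case of a pattern that coincides with a centre by giving it full membership of that centre. The function name, default parameter values and toy data are illustrative assumptions.

```python
import math
import random

def fuzzy_c_means(X, c, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Fuzzy c-means on a list of tuples; returns (U, V, J). Assumes m > 1."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Step 1: random partition matrix whose rows sum to 1 (conditions 2.4, 2.5).
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        U.append([u / total for u in row])
    V, J, prev_J = [], math.inf, math.inf
    for _ in range(max_iter):
        # Step 2: fuzzy centres (Equation 2.6).
        V = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(n)]
            V.append(tuple(sum(w[i] * X[i][k] for i in range(n)) / sum(w)
                           for k in range(dim)))
        # Step 3: membership update (Equation 2.7).
        for i in range(n):
            d = [math.dist(X[i], V[j]) for j in range(c)]
            if min(d) == 0.0:           # pattern sits exactly on a centre
                hit = d.index(0.0)
                U[i] = [1.0 if j == hit else 0.0 for j in range(c)]
            else:
                U[i] = [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                  for k in range(c)) for j in range(c)]
        # Objective function (Equation 2.3) and termination test.
        J = sum(U[i][j] ** m * math.dist(X[i], V[j]) ** 2
                for i in range(n) for j in range(c))
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return U, V, J

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
U, V, J = fuzzy_c_means(data, c=2)
for row in U:
    print([round(u, 3) for u in row])
```

Each row of the printed matrix holds the memberships of one pattern and sums to one, so a pattern near a cluster boundary would show two comparable values rather than a forced hard assignment.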
In Figure 2.4, the fuzzy c-means procedure continues until one of the
termination criteria is satisfied. One such criterion is that the difference
between the updated and previous values of the objective function J is less than a
predefined minimum threshold. Additionally, the maximum number of iteration cycles
can also be a termination criterion.
In this thesis, after the fuzzy c-means clustering process a pattern is set to a
specific cluster for which the degree of membership is maximal. This process is
known as ‘hardening’ the results. Studies have shown that hardening the results
obtained from fuzzy c-means produces different solutions from the hard clustering
results obtained directly from k-means, and that the fuzzy c-means solutions can be
better [33,48]. However, as with the k-means algorithm, fuzzy c-means needs the
number of clusters to be specified in advance as an input parameter to the
algorithm. In addition, both approaches can suffer premature convergence to local
optima. This is due to the fact that both these algorithms begin with random
initialisation of the cluster centres. If the initial cluster centres are not appropriate,
the iterative improvement of the centre positions can result in locally optimal
solutions being obtained.
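The ‘hardening’ step described above amounts to taking, for each pattern, the index of its largest membership value. A minimal sketch follows; the function name and the example matrix are hypothetical, and ties are resolved in favour of the lowest cluster index, which is an arbitrary choice made here for illustration.

```python
def harden(U):
    """Assign each pattern to the cluster for which its membership is maximal."""
    return [max(range(len(row)), key=row.__getitem__) for row in U]

# One row of memberships per pattern, one column per cluster.
memberships = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.5, 0.5],   # a tie: the first maximal entry is chosen
]
print(harden(memberships))  # → [0, 1, 0]
```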
2.1.4 Clustering Based on Simulated Annealing
The limitations of the k-means and fuzzy c-means algorithms (e.g. pre-
specification of the number of clusters; convergence to local optima) usually
result in non-optimal solutions (i.e. a locally optimal solution, not the global
optimum). A search-based clustering approach may help to avoid this problem.
Search techniques can be divided into deterministic and stochastic techniques. The
difference between these two methods is that deterministic search techniques
guarantee a global optimal partition after an exhaustive search throughout all of the
solution space but often require a prohibitively large amount of time to do so;
stochastic search techniques may generate a near-optimal partition quickly and some
guarantee convergence to the optimal partition asymptotically [27]. The solutions
produced from stochastic approaches can also avoid following locally optimal
directions in search space [27].
Simulated annealing (SA) [49,50] is a stochastic search technique [27] which
has been used for clustering since 1989 [51]. The simulated annealing process
essentially simulates the physical process of annealing solids which can be described
as follows. Firstly, a solid is heated to a high temperature and is then cooled
slowly so that the system at any time is approximately in thermodynamic
equilibrium. At equilibrium, there may be many configurations with each one
corresponding to a specific energy level. The chance of accepting a change from the
current configuration to a new configuration is related to the difference in energy
between the two states. In order to simulate this physical process within artificial
intelligence search frameworks, $E_n$ and $E_c$ are used to represent the new energy and
current energy respectively. The new configuration is always accepted if $E_n < E_c$, but if
$E_n \geq E_c$, it is only accepted with a probability given by
$\exp(-(E_n - E_c)/T)$, where $T$ is the current temperature. Hence, worse solutions are
accepted based on the change in solution quality which allows the search to avoid
becoming trapped in local minima. The temperature is then decreased gradually and
the annealing process is repeated until no more improvement is reached or any
termination criteria have been met. A general SA based clustering algorithm can be
described as follows [27].
1) Initialise the start and final temperatures, Tmax and Tmin respectively, randomly select
an initial partition P0 and calculate the corresponding squared error value Ep0.
2) Choose a neighbour partition P1 of P0 and calculate its squared error Ep1.
If Ep1 < Ep0, then accept P1 and replace P0 with P1. Otherwise,
if Ep1 ≥ Ep0, then accept P1 and replace P0 with P1 only when the acceptance
probability is satisfied.
Repeat this step for a certain number of iterations.
3) Decrease the temperature and go back to step 2) until Tmin is reached.
Figure 2.5 Outline of the SA based clustering algorithm.
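The outline in Figure 2.5 can be sketched in Python as follows. Several details are assumptions made for illustration rather than part of the outline in [27]: a neighbour is generated by moving one randomly chosen pattern to a randomly chosen cluster, the temperature is reduced geometrically, the best partition seen so far is recorded, and all names and default parameter values are hypothetical.

```python
import math
import random

def squared_error(patterns, labels, c):
    """Sum of squared distances of patterns to their own cluster mean."""
    total = 0.0
    for j in range(c):
        members = [x for x, l in zip(patterns, labels) if l == j]
        if members:
            centre = tuple(sum(col) / len(members) for col in zip(*members))
            total += sum(math.dist(x, centre) ** 2 for x in members)
    return total

def sa_clustering(patterns, c, t_max=10.0, t_min=0.01, cooling=0.9,
                  iters=50, seed=0):
    """Simulated annealing over hard partitions; returns (best labels, best error)."""
    rng = random.Random(seed)
    current = [rng.randrange(c) for _ in patterns]       # random initial partition
    e_current = squared_error(patterns, current, c)
    best, e_best = current[:], e_current
    t = t_max
    while t > t_min:
        for _ in range(iters):
            # Neighbour: move one random pattern to a random cluster (assumption).
            neighbour = current[:]
            neighbour[rng.randrange(len(patterns))] = rng.randrange(c)
            e_new = squared_error(patterns, neighbour, c)
            # Always accept improvements; accept worse moves with
            # probability exp(-(E_n - E_c) / T).
            if e_new < e_current or rng.random() < math.exp(-(e_new - e_current) / t):
                current, e_current = neighbour, e_new
                if e_current < e_best:
                    best, e_best = current[:], e_current
        t *= cooling                                     # geometric cooling (assumption)
    return best, e_best

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_labels, best_error = sa_clustering(data, c=2)
print(best_labels, best_error)
```

At high temperatures the exponential term is close to one for small increases in error, so the search moves freely; as the temperature falls, uphill moves become increasingly unlikely and the search behaves like a greedy descent.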
The optimal number of clusters through this process can be obtained by
choosing the partition which corresponds to the minimal squared error value.
Although SA can escape local optima, it has been shown that it can be slow to find
the best solution [27].
Many simulated annealing based clustering algorithms within the literature
have been used to find optimal or near-optimal solutions. Klein and Dubes [51] used
an SA algorithm for projection in clustering problems. A neighbour, which was
referred to as a ‘move’ in this paper, was generated from random changes in the
assignment of the patterns. In order to reduce the computation time of the search,
this paper employed a simple time saving measure by using the change in cost
function after one move instead of computing the entire cost function, but this
approach was not optimal. In clustering problems, a hard membership is used to
assign the relationship of a pattern to its cluster centre. That is, ‘1’ represents that a
data point belongs to the centre and ‘0’ that it does not. The clustering process in
this paper started by assigning a set of numbers as the number of clusters for the
datasets, and then by randomly reassigning the membership value of the pattern to
the centres, a new clustering structure was generated. Consequently, the cluster
validity index was then computed. A clustering partition which corresponds to the
optimal validity index value (e.g. minimal or maximal) was considered as output.
The results from SA clustering were compared with k-means algorithm from the best
100 runs. The results indicated that the SA clustering produced more reliable results
in some cases, but this came with a prohibitive computation cost.
Selim and Al-Sultan [52] also investigated the use of the SA algorithm in
general clustering problems. The main difference between this approach and the
previous one [51] was the proposed method to generate a neighbour, also called the
‘perturbation process’. The authors propose the following algorithm for obtaining a
neighbouring assignment. First, generate a starting assignment of patterns to
clusters. Then, for each pattern, randomly draw a number between 0 and 1. If this
number is greater than a pre-specified probability threshold, then the pattern is
assigned to a random cluster, otherwise its previous assignment is kept. Another
difference is that the best result is always saved during the SA clustering procedure
in this paper, whereas in the previous one it is not. In addition, the authors also
provided a detailed investigation into the input parameters of the algorithm through a
set of experiments on ten randomly generated datasets and four suggestions were
presented. Firstly, the slower the cooling rate, the better the solution obtained, but
the longer it takes. They recommended that it should generally be set between 0.7 and
0.9. However, when the size of the dataset increases, the cooling rate should be
slower. Secondly, the probability threshold (for changing cluster assignments)
should be set to a high value to keep the neighbouring assignment close enough; 0.95
was recommended. Thirdly, the larger the number of iterations, the better the results
will be, but the time will be longer too. Once again, this also depends on the size of
the dataset. The authors recommended 50 to 600 iterations, dependent on the size of
the problem. Finally, the initial temperature should depend on the magnitude of the
objective function of the clustering. The larger the value of the objective function, the
higher the initial temperature should be. Tmax = 10 was recommended.
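The neighbour generation described above can be sketched as follows. This is an illustrative reading of the description rather than the authors' code: the function name and arguments are hypothetical, and the default threshold of 0.95 is the value recommended in the text.

```python
import random

def perturb_assignment(labels, c, threshold=0.95, rng=random):
    """Return a neighbouring assignment of patterns to c clusters.

    For each pattern a number is drawn uniformly from [0, 1); only if it
    exceeds `threshold` is the pattern moved to a randomly chosen cluster.
    """
    new_labels = []
    for label in labels:
        if rng.random() > threshold:
            new_labels.append(rng.randrange(c))   # reassigned at random
        else:
            new_labels.append(label)              # previous assignment kept
    return new_labels

rng = random.Random(0)
start = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
print(perturb_assignment(start, c=3, threshold=0.95, rng=rng))
```

With a threshold of 0.95, each pattern is reassigned with a probability of only 0.05, so a neighbour typically differs from the current assignment in very few patterns, which is the sense in which the neighbouring assignment is kept close.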
Brown and Huntley [53] formalise clustering as an optimisation problem with a
user-defined objective function named the ‘internal clustering criterion’. They
suggested two criteria: the ‘within-cluster distance’ criterion and Barker’s criterion
[54] (designed to look for areas with high density). SA was used to find a near-
optimal clustering for each criterion to solve a multi-sensor fusion problem. A
perturbation process was used to generate different partitions so as to be evaluated by
SA and is described as follows:
Assume L represents the set of cluster labels used in partition p, and |L| is the
cardinality of L. $L^c$ represents the set of cluster labels not used in partition p. p’ is
the perturbed partition. n is the number of data points to be clustered and c is the
number of clusters. i represents one of the n data points.
1) Randomly select an object i from the n data points.
2) Randomly select an integer m from the range [0, |L|].
3) If m = 0 and there exist unused cluster labels in $L^c$, then randomly
select a label from $L^c$ and assign it to i to form p’.
4) Else randomly select a label from L and assign it to i to form p’.
5) If p’ = p, go back to step 2). Otherwise return p’.
Figure 2.6 The perturbation process in Brown and Huntley [53].
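A possible reading of the perturbation process in Figure 2.6 is sketched below. The sketch assumes that step 5 is intended to repeat the label selection until the perturbed partition actually differs from the current one, and that the complete set of permitted labels is supplied by the caller; the function name and the argument `all_labels` are hypothetical.

```python
import random

def perturb_partition(p, all_labels, rng):
    """Change the label of one randomly chosen data point in partition p."""
    used = sorted(set(p))                        # L: labels present in p
    unused = sorted(set(all_labels) - set(p))    # labels not yet used in p
    if len(used) + len(unused) < 2:
        raise ValueError("at least two labels are needed to perturb a partition")
    i = rng.randrange(len(p))                    # step 1: pick a data point
    while True:
        m = rng.randint(0, len(used))            # step 2: integer in [0, |L|]
        if m == 0 and unused:
            new_label = rng.choice(unused)       # step 3: open a new cluster
        else:
            new_label = rng.choice(used)         # step 4: reuse an existing label
        if new_label != p[i]:                    # step 5: retry if nothing changed
            perturbed = p[:]
            perturbed[i] = new_label
            return perturbed

rng = random.Random(1)
partition = [0, 0, 1, 1]
print(perturb_partition(partition, all_labels=range(3), rng=rng))
```

Drawing m = 0 is the only route by which a previously unused label enters the partition, so the chance of opening a new cluster falls as the number of labels already in use grows.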
Finally, the paper concluded that SA is both practical and useful in evaluating
internal clustering criteria. It should be noted that the number of clusters is fixed in
the whole of Brown and Huntley’s SA process.
Unlike Brown and Huntley’s approach, Al-Sultan and Selim [55] used a fuzzy
clustering algorithm in combination with an SA algorithm. A perturbation state was
generated by taking a small step along a feasible direction at the current point in
search space. However, as with the preceding approaches, the number of clusters
was fixed. The best partitions were always updated as specified by the SA
procedure. In addition, within the SA process, the fuzzy c-means objective function
J(U, V) was used to evaluate the states of the different partitions. It is known that the
fuzzy c-means algorithm is not guaranteed to generate a global optimal solution and
the combination of SA and fuzzy c-means also failed to obtain a global solution.
The work of Lukashin and Fuchs [56] uses the SA procedure to cluster
temporal gene expression profiles. The perturbation process is similar to the
approaches described previously whereby, at each iterative step, a randomly selected
vector was withdrawn from its old cluster and was reassigned to another randomly
chosen cluster (where a vector here means a gene with M time points or dimensions).
The sum of the within-cluster distance criterion was used as the objective function.
Lukashin and Fuchs also proposed a simple and robust clustering algorithm which
aims to find the correct number of clusters from the given datasets. The method is
established on an equation derived from the relationship between cutoff distance d (if
the Euclidean distance between two vectors is greater than or equal to d, then these
two vectors should not belong to the same cluster), the correct number of clusters c
and a pre-assigned allowance of false positives value p, namely
$$f(d, c) = p \qquad (2.8)$$
where
$$f(d, c) = \frac{1}{c} \sum_{k=1}^{c} \frac{\text{number of incorrect vector pairs in cluster } k}{\text{total number of vector pairs in cluster } k} \qquad (2.9)$$
is the fraction of incorrect vector pairs. The authors suggested that the value of p is
routinely set as 0.055. The value of parameter d was derived from a method the
authors called a ‘reverse engineering procedure’, such that the number of clusters c
can be easily worked out from Equation (2.8). In the reverse engineering procedure,
a dataset with a known clustering, and hence a known number of clusters, was used; an
SA algorithm was then applied to generate different distributions of clusters with
different values of c. The distance value d was then obtained from Equation (2.9). A
value of d=1.1 was chosen at p=0.055 where the known number of clusters was
reached. The authors verified that the value of d depended on the number of time
points rather than the number of clusters.
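The fraction of incorrect vector pairs can be computed directly from a clustering, as the following sketch shows. It assumes that the fraction is averaged over the clusters and that a pair is ‘incorrect’ when the Euclidean distance between its two vectors is greater than or equal to d, as described above; the function name, the treatment of single-member clusters and the toy data are hypothetical.

```python
import math
from itertools import combinations

def incorrect_pair_fraction(clusters, d):
    """Mean over clusters of the fraction of within-cluster pairs at distance >= d."""
    fractions = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        if not pairs:
            fractions.append(0.0)   # a single-member cluster has no pairs (assumption)
            continue
        incorrect = sum(1 for a, b in pairs if math.dist(a, b) >= d)
        fractions.append(incorrect / len(pairs))
    return sum(fractions) / len(fractions)

clusters = [
    [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)],   # the third vector lies far from the others
    [(10.0, 10.0), (10.0, 10.5)],
]
print(incorrect_pair_fraction(clusters, d=1.1))
```

In this toy case two of the three pairs in the first cluster are incorrect and none in the second, so the value returned is one third, far above the allowance of 0.055, which would indicate that more clusters are needed.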
Recently, Yang et al. used SA and fuzzy c-means clustering to determine the
optimal parameters (number of clusters c and fuzziness index m) in an application of
microarray data [57]. At the beginning of the perturbation process, the parameters c
and m were randomly initialised. The objective function within the SA process was
defined as the fuzzy c-means objective function J(U, V) + a chosen cluster validity
index value. If the minimal value is optimal for the cluster validity then “+” was
used; otherwise, if the maximal value is optimal for the cluster validity, “−” was
used. Therefore, the smaller the objective function value, the better the clustering
structure was. During the annealing process at each temperature, the values of c and
m were randomly chosen to generate a new energy. Yang et al. stated that the
experimental results showed that the proposed method ran very quickly and achieved
a pair of values of c and m that was, on average, very near to the known best
values without exploring the entire search space.
2.1.5 Clustering with Other Stochastic Search Approaches
Apart from the SA algorithm, other stochastic search approaches have also
been applied to clustering problems. Al-Sultan applied tabu search to hard clustering
[58] and more recently, also applied it to fuzzy clustering [59]. The experimental
results showed that the hybrid approaches of tabu search with k-means and fuzzy c-
means achieved better performance than the standard k-means or fuzzy c-means
clustering algorithms. Recently, Sung and Jin [60] combined tabu search with two
newly developed procedures, named packing and releasing, for the clustering
problem. The results showed that this new heuristic algorithm outperformed tabu
search, k-means and simulated-annealing algorithms. However, the performance of
tabu search was sensitive to the selection of various control parameters [27].
Genetic algorithms (GA) have also been applied to the clustering problem.
Hall et al. [61] proposed a genetically guided algorithm (GGA) to optimise the hard
(k-means) and fuzzy (fuzzy c-means) clustering objective functions used in cluster
analysis. As the k-means and fuzzy c-means clustering algorithms are sensitive to
the initial centre configurations, the GGA algorithm was used to determine good
initial centres. The results showed that by using outputs from the GGA as the
initialisation for the k-means and fuzzy c-means algorithms, better solutions can be
obtained in comparison to random initialisations. However, the main disadvantage
of the GGA approach is that the time required to discover the initial solution is high.
The experiments also showed that the time taken for GGA and k-means/fuzzy c-
means to find the partition associated with the lowest k-means/fuzzy c-means
clustering objective function is similar to the time taken by one hundred runs of k-
means/fuzzy c-means with random initialisations.
Recently, Maulik and Bandyopadhyay [62] proposed a GA-based clustering
technique, called GA-clustering. In this algorithm, each chromosome represented a
fixed number of cluster centre positions. The fitness function was defined as the
objective function of the k-means algorithm. The results showed that the GA-
clustering algorithm outperformed the k-means algorithm on the seven test data sets
used within the paper.
In 2002, Bandyopadhyay and Maulik [63] extended the work to automatically
discover the ‘correct’ number of clusters in the solution. This was achieved by
randomly initialising the number of clusters c from a pre-specified range of Cmin and
Cmax which represents the minimum and maximum number of clusters respectively.
These c cluster centres were randomly chosen from the original dataset and distributed in
the chromosome. The fitness of a chromosome was calculated using the Davies-
Bouldin index [64] where a smaller index value represents a better clustering result
(see the following Section for more details on validity indices). Genetic operations
based on selection, crossover and mutation were applied to the cluster centres within
the chromosome for a specified number of maximum generations. The best string,
corresponding to the smallest index value from all generations, was returned as the
solution. The authors examined the efficiency of their proposed method through
experiments on both artificial and real-life datasets. In addition, a satellite image was
also analysed. The results showed that several land-cover types could be
identified. Other automatic clustering approaches that have utilised GAs can be
found in [65,66].
Babu and Murty [67] used an Evolutionary Strategy (ES) in hard and fuzzy
clustering problems. In their paper, a centroid type clustering objective function was
used, which enabled the approach to handle real-valued parameter optimisation
problems. The experiments showed that whilst their approaches performed better
than the individual hard and fuzzy clustering algorithms, once again, the number of
clusters had to be specified in advance.
Lee and Antonsson [68] developed a clustering algorithm which utilised an ES
that can be used to automatically find the cluster centres and number of clusters. An
ES selection strategy whereby a population of 10 parents and 60 offspring was
propagated into the next generation was used. They used the mean square error
(MSE) to measure the fitness function. The proposed algorithm was used to solve
the two dimensional spatial data clustering problem and the experimental results
were promising, as the fitness measure values were often
better than the ones obtained from the known ‘true’ clustering. In addition, the
authors suggested that any fitness function could replace the MSE fitness
measure, and a comparison of the proposed method with other similar approaches was
recommended.
2.2 Cluster Validity
2.2.1 Introduction
In cluster analysis, one of the most important issues is the measure used to
evaluate the quality of the clustering results that are produced. This measure can
then be used to compare the solutions from different algorithms and can also be used
to steer optimisation search processes in order to find the partitioning that best fits
the underlying dataset [69]. Within clustering problems, this is quantified using a
cluster validity measure. Determining the correct number of clusters in a given
dataset is the most common application of cluster validity [70]. In general, cluster
validity is frequently used to answer the following questions [71]:
1) How many clusters should be used?
2) Is the defined cluster scheme suitable for the dataset?
3) Are there any better partitionings possible?
In general, cluster validity measures can be divided into three types, namely those
based on external criteria, internal criteria and relative criteria [41,69].
External criteria evaluate the results from a clustering algorithm against a pre-specified
clustering structure. For instance, an external criterion may measure the
degree of correspondence between the obtained number of clusters and the category labels
from a previously assigned structure.
Internal criteria evaluate the fitness between the clustering structure and the data
itself. In essence, it is a measure that can be derived only from the proximity matrix,
that somehow expresses the quality of the given partition.
Relative criteria evaluate two clustering structures in order to determine which one is
a relatively better representation of the given data. For instance, a relative criterion
may measure whether single linkage or complete linkage methods are more suitable
for the data.
The fundamental idea of the first two types of approach (external and internal
criteria) is to test whether the data points in the given dataset are randomly structured
or not, based on statistical testing. This usually requires some sort of calculation
involving pair-wise comparison between each pair of data points and each cluster,
which leads to a computationally expensive procedure. In addition, the indices
related to these approaches aim to measure the degree to which the dataset confirms a
pre-specified clustering scheme [69]. Conversely, the third approach does not involve
statistical tests and allows for the best clustering structure to be chosen from a set of
schemes, defined based on pre-specified criteria [69].
This thesis focuses on real world problems from the medical domain whereby
the different types of tissue (referring to the number of clusters in the clustering
problem) are usually unknown in advance. However, with the utilisation of a
validity index, the best clustering scheme (which includes the number of clusters)
may be identified. This can be implemented by applying the clustering algorithms
within a range of cluster numbers, and the partition with the best cluster validity
index value (either maximum or minimum) is returned. This procedure is described
in Figure 2.7 [71,72].
In a given dataset X, fix the other clustering parameters except for the number
of clusters, c.
1) Set the minimal and maximal cluster numbers cmin and cmax respectively.
2) For c = cmin : cmax
2.1) Initialise the cluster centres.
2.2) Apply the clustering algorithm with number of clusters c.
2.3) Calculate the validity index of the clustering scheme.
3) End for
4) Return the clustering structure that corresponds to the best validity index
value obtained throughout the procedure.
Figure 2.7 Identification of the number of clusters by using a validity index [56].
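The procedure in Figure 2.7 can be sketched as a short loop. This is only an illustration under stated assumptions: `cluster_fn` and `validity_fn` are hypothetical callables standing in for whichever clustering algorithm and validity index are chosen, and the `minimise` flag reflects the fact that some indices are better when smaller and others when larger.

```python
def select_number_of_clusters(X, cluster_fn, validity_fn, c_min, c_max, minimise=True):
    """Run cluster_fn for every c in [c_min, c_max] and keep the best partition.

    cluster_fn(X, c) -> object describing a partition (hypothetical interface;
                        it is expected to initialise its own cluster centres).
    validity_fn(X, partition) -> float validity index value.
    Returns a tuple (best_c, best_value, best_partition).
    """
    best = None
    for c in range(c_min, c_max + 1):
        partition = cluster_fn(X, c)       # steps 2.1 and 2.2 of Figure 2.7
        value = validity_fn(X, partition)  # step 2.3 of Figure 2.7
        if best is None:
            better = True
        elif minimise:
            better = value < best[1]
        else:
            better = value > best[1]
        if better:
            best = (c, value, partition)
    return best

# Placeholder callables, used only to show the calling convention.
toy_cluster = lambda X, c: c                 # the "partition" is just c itself
toy_validity = lambda X, part: (part - 3) ** 2
print(select_number_of_clusters(None, toy_cluster, toy_validity, 2, 6))
```

Because every candidate c requires a full clustering run, the cost of this loop grows with the width of the range, which is one reason an inexpensive validity index is attractive.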
As mentioned previously, if an automated approach is to remain practical for
real world medical diagnosis, it is not desirable for the method to have too high a
computational expense. Therefore, this thesis has specifically focussed on using the
relative criteria to evaluate the obtained schema so that the developed approaches
remain acceptable in terms of their required computation time. In the rest of Section
2.2, validity indices suitable for fuzzy clustering are presented.
2.2.2 Partition Coefficient and Partition Entropy Coefficient
In fuzzy clustering, the fuzzy partition matrix, $U = (\mu_{ij})_{n \times c}$ (see Section 2.1.3),
represents the membership degree of the data point i to the cluster centre j. The
higher the value of $\mu_{ij}$, the stronger the data point i belongs to cluster j. The
partition coefficient (PC) and partition entropy coefficient (PE) [73] are two validity
indices derived from the membership values $\mu_{ij}$.
The Partition Coefficient (PC) is defined as:
$$PC = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\mu_{ij}^{2} \qquad (2.9)$$
where n is the number of data in the dataset, c is the number of clusters and $PC \in [\frac{1}{c}, 1]$.
The range of PC has been obtained from two extreme clustering cases. The first case
occurs when each data point has a membership to its cluster centre of one. In such a
case the PC value would be equal to 1, indicating that all of the clusters have well-
defined borders. In the opposite case, each data point has an equal membership to all
of the cluster centres and the PC value would approximate to $\frac{1}{c}$, indicating that the
clustering is the most fuzzy. Therefore, as the clustering quality increases, the value
of PC also increases [69].
The Partition Entropy Coefficient (PE) is defined as:
$$PE = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\mu_{ij}\log_{a}(\mu_{ij}) \qquad (2.10)$$
where a is the base of the logarithm and $PE \in [0, \log_{a} c]$. In a similar fashion, there are
two extreme cases that form the range of possible values for PE. When the clusters
are well separated, the PE value is closer to 0; in contrast, when the clustering is
fuzzier, the PE value approaches $\log_{a} c$. Thus, the better the clustering achieved, the
smaller the value of the PE coefficient.
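Both indices can be computed directly from the membership matrix. The sketch below follows equations (2.9) and (2.10), assuming U is stored as an n-by-c array whose rows sum to one; the function names are illustrative, and zero memberships are treated as contributing nothing to the entropy (the usual convention that 0 log 0 = 0).

```python
import numpy as np

def partition_coefficient(U):
    """PC of equation (2.9); U has shape (n, c) with rows summing to one."""
    U = np.asarray(U, dtype=float)
    return (U ** 2).sum() / U.shape[0]

def partition_entropy(U, base=np.e):
    """PE of equation (2.10); terms with zero membership contribute zero."""
    U = np.asarray(U, dtype=float)
    safe = np.where(U > 0, U, 1.0)  # log(1) = 0, so zero memberships add nothing
    return -(U * np.log(safe)).sum() / (U.shape[0] * np.log(base))

# The two extreme cases discussed in the text, for n = 3 points and c = 2 clusters.
crisp = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
fuzzy = np.full((3, 2), 0.5)
print(partition_coefficient(crisp), partition_entropy(crisp, base=2))
print(partition_coefficient(fuzzy), partition_entropy(fuzzy, base=2))
```

The crisp matrix reproduces the upper bound of PC and the lower bound of PE, while the uniform matrix reproduces the opposite bounds.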
The main advantage of these indices is that they are easy to compute.
However, they are only really useful if there are a small number of well-separated
clusters [72] and they have further drawbacks as described below [69]:
1) The values of PC monotonically increase and PE monotonically decrease as
the number of clusters c increases.
2) By only using the membership values of the data (i.e. without using the
data itself), these indices also lack direct connection to the geometrical
properties inherent within the data.
2.2.3 Xie-Beni Validity Index
In order to overcome the problems that exist in the validity indices that were
introduced in Section 2.2.2, Xie and Beni defined a new validity index (XB) which not
only involved the membership values, but also included properties taken from the
data itself [74]. The XB index (also named the compactness and separation validity
function) is a representative index of relative validity indices [69]. In the following,
let us assume that $V_{XB}$ represents the overall XB index value, $\pi$ is the compactness of
data in the same cluster and s is the separation of the clusters. The XB validity
index can now be expressed as:
$$V_{XB} = \frac{\pi}{s} \qquad (2.11)$$
where
$$\pi = \frac{\sum_{j=1}^{c}\sum_{i=1}^{n}\mu_{ij}^{2}\,\|x_i - v_j\|^{2}}{n} \qquad (2.12)$$
and $s = (d_{\min})^{2}$; here $d_{\min}$ is the minimum distance between cluster centres, given by
$d_{\min} = \min_{i \neq j}\|v_i - v_j\|$. From the expressions (2.11) and (2.12), it can be seen that a
smaller value of π indicates that the clusters are more compact whilst a larger value
of s indicates the clusters are well separated. As a result, a smaller value of VXB
means that the clusters have a greater separation from each other and are more
compact within each cluster.
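Equations (2.11) and (2.12) translate into a few array operations. The following is a sketch rather than a reference implementation: it assumes the data X is an n-by-p array, U an n-by-c membership array and V a c-by-p array of cluster centres, and the function name is chosen here for illustration.

```python
import numpy as np

def xie_beni(X, U, V):
    """V_XB = pi / s for data X (n, p), memberships U (n, c), centres V (c, p)."""
    X, U, V = (np.asarray(a, dtype=float) for a in (X, U, V))
    # Squared distances ||x_i - v_j||^2, shape (n, c).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = (U ** 2 * d2).sum() / X.shape[0]              # pi, equation (2.12)
    centre_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = centre_d2[~np.eye(len(V), dtype=bool)].min()   # s = (d_min)^2
    return compactness / separation

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V_good = np.array([[0.0, 1.0], [10.0, 1.0]])   # centres at the group means
V_poor = np.array([[0.0, 1.0], [2.0, 1.0]])    # second centre badly placed
print(xie_beni(X, U, V_good), xie_beni(X, U, V_poor))
```

With the well-placed centres the compactness term is small and the separation large, so the index is much lower than for the badly placed centres.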
It should be noted that the XB validity index also has a disadvantage in that, as
the number of clusters c gets very large or close to the number of data n, the index
value monotonically decreases [69].
2.2.4 Sun-Wang-Jiang Validity Index
Conventional validity indices (such as VXB) have the problem that, as the total
number of clusters becomes closer to the number of data points, the compactness
value has the tendency to monotonically decrease [75]. Therefore, when the number
of clusters is set to a very large value, the validity index value may not correctly
reflect the quality of the clustering. In order to overcome this problem, Sun, Wang
and Jiang introduced a new validity index measure named the Sun-Wang-Jiang
(WSJ) validity index (VWSJ) which was based on an improvement of the
Rezaee-Lelieveldt-Reiber index [71].
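Assume dataset X is in p dimensions. Let $\sigma(v_j) = \{\sigma(v_j)^{1}, \sigma(v_j)^{2}, \ldots, \sigma(v_j)^{p}\}^{T}$ be
the fuzzy variation of the cluster j, where each component is
$$\sigma(v_j)^{p} = \frac{1}{n}\sum_{i=1}^{n}\mu_{ij}\,(x_i^{p} - v_j^{p})^{2}$$
Similarly, $\sigma(X) = \{\sigma(x)^{1}, \sigma(x)^{2}, \ldots, \sigma(x)^{p}\}^{T}$ is the variance of X, where each component is
$$\sigma(x)^{p} = \frac{1}{n}\sum_{i=1}^{n}(x_i^{p} - \bar{x}^{p})^{2}$$
and $\bar{x}^{p}$ is the pth dimension of the mean value $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.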
The VWSJ index is also divided into two separate compactness and separation
components and can be expressed in the following form:
$$V_{WSJ}(U, V, c) = Scat(c) + \frac{Sep(c)}{Sep(c_{\max})} \qquad (2.13)$$
The compactness function is
$$Scat(c) = \frac{\frac{1}{c}\sum_{j=1}^{c}\|\sigma(v_j)\|}{\|\sigma(X)\|}$$
and $Scat(c) \in [0, 1]$.
As the number of clusters, c, increases, the value of $Scat(c)$ generally decreases,
since the clusters are more compact.
The separation function is
$$Sep(c) = \frac{D_{\max}^{2}}{D_{\min}^{2}}\sum_{i=1}^{c}\left(\sum_{j=1}^{c}\|v_i - v_j\|^{2}\right)^{-1}$$
where $D_{\max} = \max_{i \neq j}\|v_i - v_j\|$ and $D_{\min} = \min_{i \neq j}\|v_i - v_j\|$. The benefit of this $Sep(c)$ function is
that it considers the geometry of the cluster structures and also avoids a structure
model whereby there are too many clusters. This can be explained by approximately
writing $Sep(c)$ as $\frac{D_{\max}^{2}}{D_{\min}^{2}}\,\frac{c}{c-1}\,E\left[\frac{1}{d_c^{2}}\right]$, where $d_c$ is the average distance from one cluster
centre to the remaining cluster centres. When the distribution of the clusters is a tetrahedron,
$D_{\max}^{2}/D_{\min}^{2}$ reaches its minimum, which is 1, and $E[1/d_c^{2}]$ also reaches its minimum. When
the arrangement of the cluster centres becomes irregular, both values become larger.
However, as the number of clusters increases, $D_{\max}^{2}/D_{\min}^{2}$ will tend to increase more and
$E[1/d_c^{2}]$ will likely become more stable. Therefore $D_{\max}^{2}/D_{\min}^{2}$ is the main factor that penalises
the situation where too many clusters occur.
$Sep(c_{\max})$ is used as a weighting factor which normalises the range of the
compactness and separation functions. A fuzzy partition which has the minimum
value of VWSJ is considered the best clustering arrangement. The WSJ validity index
shows excellent performance specifically for datasets in which the clusters overlap [72].
For further details of the method, see [72].
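The two components of the index can be sketched as follows. This is an illustrative reading of the definitions in this section, not the code from [72]: X is assumed to be an n-by-p array, U an n-by-c membership array and V a c-by-p array of centres, and `sep_c_max` is a hypothetical argument holding the separation value obtained for the partition with the maximum number of clusters.

```python
import numpy as np

def scat(X, U, V):
    """Scat(c): mean norm of the fuzzy cluster variations over the norm of the data variance."""
    n = X.shape[0]
    sigma_X = ((X - X.mean(axis=0)) ** 2).sum(axis=0) / n   # sigma(X), shape (p,)
    # sigma(v_j)^p = (1/n) * sum_i mu_ij * (x_i^p - v_j^p)^2, shape (c, p).
    sq_diff = (X[:, None, :] - V[None, :, :]) ** 2
    sigma_V = np.einsum("ij,ijp->jp", U, sq_diff) / n
    return np.linalg.norm(sigma_V, axis=1).mean() / np.linalg.norm(sigma_X)

def sep(V):
    """Sep(c) = (D_max^2 / D_min^2) * sum_i (sum_j ||v_i - v_j||^2)^-1."""
    d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    off_diag = d2[~np.eye(len(V), dtype=bool)]
    return off_diag.max() / off_diag.min() * (1.0 / d2.sum(axis=1)).sum()

def v_wsj(X, U, V, sep_c_max):
    """V_WSJ of equation (2.13); sep_c_max is Sep evaluated at c = c_max."""
    return scat(X, U, V) + sep(V) / sep_c_max

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V = np.array([[0.0, 1.0], [10.0, 1.0]])
print(scat(X, U, V), sep(V))
```

In a full model-selection run, `sep(V)` would be evaluated once for the partition with c_max clusters and then reused as `sep_c_max` for every candidate c.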
There are many other validity indices that have been developed in the past
based on different clustering criteria. They include Dunn’s index [44], Davies-
Bouldin index [64], Fukuyama and Sugeno index [76], Rhee and Ho index [77], the
Bandyopadhyay and Maulik index [78], the Xie and Zhao index [79], the Kim and
Lee index [75], the Kim and Ramakrishna index [80] and the Wu and Yang index
[81].
2.3 Auto Clustering
In this Section, auto clustering refers to clustering algorithms which have the
ability to automatically determine the number of clusters (i.e. without being provided
in advance by an input parameter). In most of the clustering algorithms that have
been described so far, the number of clusters needs to be specified in advance
[28,68,82]. The determination of the number of clusters is an important application
of clustering techniques as, if the number of clusters can be automatically
discovered, a clustering algorithm becomes more general across a wide range of
datasets and, potentially, also across a wide range of problem domains. In Sections
2.1.4 and 2.1.5, some stochastic search based auto clustering algorithms were
introduced. Besides these, various other clustering algorithms used to automatically
detect the number of clusters have also been proposed within the literature.
As mentioned in Section 2.2.1, cluster validity indices can help automatically
generate an optimal number of clusters from an unknown given dataset (the procedure
was identified previously in Figure 2.7). Based on this framework, many proposed
validity indices have been developed to achieve auto clustering. Ray and Turi [83]
described a cluster validity measure which is a ratio of the intra-cluster distance to
the inter-cluster distance where the intra-cluster distance refers to the average
distance of all data objects to their cluster centres and the inter-cluster distance refers
to the minimum distance between all the combinations of two clusters. The smaller
the intra-cluster distance, the more compact the cluster and, the larger the inter-cluster
distance, the greater the separation between the clusters. The objective is to
minimise the proposed validity index value so as to obtain the best clustering result.
This validity index was firstly used in a k-means clustering algorithm based iteration
process (the number of clusters was set from 2 to a user defined maximum number
Cmax). When the Cmax is reached, the cluster having maximum variance was split into
two clusters. The two new cluster centres were generated based on the original
cluster centre ‘minus’ and ‘plus’ a constant value in each dimension. The purpose of
this process was to minimise the average intra-cluster distance. After the new
clustering scheme was formed, the validity index was calculated again to determine
the optimal number of clusters. Experiments were conducted on two datasets
involving synthetic images and natural images. The results showed that the proposed
validity measure had a tendency to select smaller clusters for natural images but a
modified rule was sufficient to overcome this problem. However, this rule still failed
on the synthetic image when there were only two clusters [83].
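The ratio described above can be sketched in a few lines. Plain Euclidean distances are used here because that is how the measure is described in this section; this is an assumption, and the exact form used in [83] may differ (for example, in using squared distances). The function name and the hard label array are illustrative.

```python
import numpy as np

def intra_inter_ratio(X, labels, centres):
    """Average point-to-own-centre distance divided by the minimum centre-to-centre distance."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    labels = np.asarray(labels)
    # Intra-cluster term: mean distance of every point to its assigned centre.
    intra = np.linalg.norm(X - centres[labels], axis=1).mean()
    # Inter-cluster term: smallest distance over all pairs of distinct centres.
    gaps = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    inter = gaps[~np.eye(len(centres), dtype=bool)].min()
    return intra / inter

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([[0.0, 1.0], [10.0, 1.0]])
print(intra_inter_ratio(X, labels, centres))
```

A smaller ratio indicates compact clusters that are far apart, which is why the index is minimised.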
Recently, Sun et al [72] developed a new algorithm named Fuzzy c-means-
Based Splitting Algorithm (FBSA) which used the Sun-Wang-Jiang validity index
(Vwsj, see Section 2.2.4) to automatically determine the optimal number of clusters.
Within each iteration from Cmin to Cmax (minimum and maximum number of
clusters), the fuzzy c-means clustering algorithm was firstly applied to form a
clustering solution, and the corresponding validity index was then computed. Next,
the FBSA algorithm split the ‘worst’ cluster into two new clusters. Afterwards, the
validity index was calculated again and the best clustering with the optimal number
of clusters was selected according to the optimal validity index value. In order to identify the
‘worst’ cluster, a ‘score’ function, S(j), associated with each cluster, j, was defined:
$$S(j) = \frac{\sum_{i=1}^{n}\mu_{ij}}{\text{number of data objects in cluster } j} \qquad (2.14)$$
A smaller S(j) value reflects that the cluster j is strongly compact. Therefore, the
‘worst’ cluster corresponds to the cluster with the biggest value for S(j). After
identifying the ‘worst’ cluster, two new cluster centres were generated such that their
positions were as far away as possible from each other and the remaining centres.
Experimental results were obtained using four distinct datasets: the first was the
public-domain IRIS dataset, the second and third were generated from mixtures of
Gaussian distributions and the last came from a real survey of household expenditures
and budgets. The results showed that the combination of FBSA and VWSJ obtained
the best results in comparison with FBSA combined with other validity indices.
Kim et al [75] also proposed a validity index to automatically estimate the
optimal number of fuzzy clusters. This index was developed especially to cope with
the case whereby there exists highly overlapped and potentially vague data. This
index is based on the ratio of the overlap to the separation measure. The degree of
overlap between fuzzy clusters was calculated using the ‘inter-cluster overlap
measure’. If two clusters have a clear separation from one another, then the overlap
degree is very low. The distance between the fuzzy clusters yielded the separation
measure. Therefore, when two clusters have a larger separation distance, the clusters
are well separated. In a good fuzzy partition, the index would have a smaller value.
The fuzzy c-means clustering algorithm was applied several times using each value
within the range of Cmin to Cmax. The clustering result (among those obtained for Cmin to
Cmax) that yielded the minimum validity index was returned. Therefore, the
solution with the minimum validity index value also identifies the optimal number of
clusters, Cbest (where Cmin <= Cbest <= Cmax). The experiments were designed to
compare the performance of the proposed validity index and nine other previously
developed indices in determining the number of clusters. The results showed that the
proposed index was more reliable and efficient than other indices.
Another approach to automatically identifying the optimal number of clusters,
apart from methods involving a validity index, is the merge and split technique.
Iterative Self-Organizing Data Analysis (ISODATA) [84] is a typical algorithm that
uses the merge and split technique to determine the number of clusters in a dataset
[28]. It is based on a similar methodology to the k-means clustering algorithm and
follows the same initialisation and iterative procedure: assign each object to its
closest corresponding cluster centres and then calculate the new centre positions
(mean value of the objects in the cluster). However, ISODATA contains one key
enhancement in that it is also able to merge and split the clusters providing some
criteria are satisfied. The criteria are defined as follows: if the number of objects in a
cluster is less than a certain threshold or if the distance between two centres is less
than a certain threshold, then the clusters may be merged. In addition, if the variance
between two clusters is larger than a certain threshold or if the number of objects in a
cluster is bigger than a certain threshold, then clusters are split. For the k-means
clustering algorithm, the number of clusters is specified in the beginning and remains
fixed until the end of the clustering process. For the ISODATA algorithm, this is not
the case (although the algorithm still needs a prior statement as to the initial number
of clusters, this does not drastically affect the final clustering results obtained [85]).
Therefore, the approach uses the merge and split procedures to find out the optimal
number of clusters based on threshold values. Even though ISODATA has a clear
advantage over the k-means approach, it requires a number of parameters to be
declared in advance which directly affect the clustering results such as the maximal
variance between two clusters, the minimal number of objects within each cluster
etc. This makes the approach very subjective [28,86].
Tou [87] proposed a cluster algorithm named Dynamic Optimal Cluster-
seeking (DYNOC), which is similar to ISODATA in that it also has the ability to
merge and split clusters. In this algorithm, the objective function is to maximise an
index which is formed by the ratio of minimum inter-cluster distance to maximum
intra-cluster distance. Clusters which reach the global maximum index value are
considered the best. This algorithm also requires the user to specify several input
parameters to direct the performance of the split and merge techniques [28,86].
Chaudhuri et al [88] also developed a split and merge clustering technique to
obtain the number of clusters from a dataset. The main difference between this algorithm
and ISODATA is that it splits the clusters by observing the density in different
directions, examining the number of data within different regions made up of
strips. Figure 2.8 shows a two dimensional dataset with strips in four directions.
The method starts by including all of the data as one cluster and then iteratively
performs a split and merge process based on the number of data in the strip area
which is determined by certain threshold values. Although this algorithm achieved
good performance on two sets of experimental data, it once again has the drawback
that many input parameters have to be specified before the approach can be executed.
Figure 2.8 Two dimensional dataset strips in four directions.
Recently, Huang proposed a Synergistic Automatic Clustering Technique
(SYNERACT) as an alternative method to ISODATA [89]. It combined a
hierarchical descending approach with the k-means clustering algorithm to avoid the
limitations that exist in ISODATA. In general, the process of SYNERACT can be
described by the following: a hyperplane splits a cluster into two smaller clusters so
that one small cluster has a positive dot product value with a weight vector and the
other small cluster has zero or negative dot product value with the weight vector.
Next, the centre of each small cluster is calculated as the mean value of
all data points in that cluster. Then, an iterative optimisation clustering procedure is
employed to estimate the reasonable movement of a point to different clusters based
on the minimisation of an objective function. The new clusters that are successively
generated from each split stage are stored in a binary tree data structure. If no further
separation can be achieved from the hyperplane, then the process stops and the
algorithm completes. According to Huang, SYNERACT has produced similar
accuracies to ISODATA, but can do so much faster. In addition, it does not require
the user to specify the number of clusters and initial centre positions in advance
although two input parameters which control the splitting process still need to be
defined.
Tseng and Yang [66] described a split and merge technique based on a genetic
algorithm for the auto clustering problem. In the first stage of this proposed method,
a nearest-neighbour clustering method was used to group the original dataset into
smaller clusters. The authors stated that the main purpose of this stage is to reduce
the data size so that it may fit into the genetic algorithm framework. The second
stage employs the genetic algorithm to merge the small groups into large clusters and
a heuristic strategy was then used to discover a good clustering scheme. Some
drawbacks of this algorithm include the pre-specification of parameters and failure to
generate the correct clustering if one cluster is partially or completely within another
cluster [65].
Garai and Chaudhuri [65] included a split and merge process similar to that of Tseng
and Yang’s algorithm for the purpose of generating a good clustering scheme.
However, in order to overcome the limitations that existed in Tseng and Yang’s
algorithm, an Adjacent Cluster Checking Algorithm (ACCA) was developed to
verify the adjacency of any two small clusters and was primarily used to perform a
merge process to form better clustering.
Many other split and merge based clustering algorithms have been proposed
within the literature and include the ‘Density-Based Spatial Clustering’ algorithm
(DBSCAN) [90], the ‘Clustering Using Representatives’ algorithm (CURE) [91] and
the ‘Chameleon’ algorithm [92].
2.4 Cluster Merging
Cluster merging is sometimes used as an additional phase within a clustering
algorithm in order to discover a good number of clusters to represent the data. As
with the split and merge technique of the previous section (Section 2.3), the original
dataset is firstly split into small groups and then these clusters are iteratively merged
together until some termination conditions are reached. It is clear that different
merge criteria may lead the clustering to totally different results. In this Section, the
various cluster merging criteria are highlighted.
In the ISODATA-based algorithms mentioned in Section 2.3, such as DYNOC
[87], the merge criterion is based on the distance between two clusters: if this distance
is smaller than a certain threshold value, then these two clusters are merged. Within
the Chaudhuri algorithm [88], a merge circle defined in the boundary of two clusters
was used as the merging restriction. If the difference of the data points between the
intersection area of two clusters to the merge circle set is less than a threshold value,
then these two clusters can be merged.
In Tseng and Yang’s paper [66], the merging of clusters was facilitated through
the use of a genetic algorithm. The GA was initialised by randomly generating a
population of strings whereby each string is a binary encoding with an equal number
of bits to the number of clusters outputted from the nearest neighbour clustering
algorithm. Each string represents a subset of the clusters where a value of “1” in the
ith bit indicates the inclusion of the ith cluster in this subset and conversely, a value
of “0” indicates the exclusion of the cluster. After initialisation, each string, in turn,
was evaluated. This involved setting the subset of clusters as the initial cluster
centres (as identified through the “1” values within the binary encoding) and the
reallocation of each data point from the remaining clusters to its nearest selected
cluster (identified by the “0” bits within the string). Finally, the fitness was
determined using the intra-cluster and inter-cluster distances. The members of the
population were sorted by their fitness and were chosen for reproduction using
roulette wheel selection. Two point crossover was employed to interchange the two
substrings with a crossover probability of 80%. The bits of the strings were chosen
and their values flipped from “0” to “1” or from “1” to “0” with a mutation
probability of 5%. The best string was returned after a user specified number of
generations was reached.
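The evaluation step for one binary string can be sketched as below. This is an illustration of the decoding described above, not the implementation from [66]: the function name, the use of small-cluster centres as representatives and the hard label array are all assumptions made for the example, and the fitness calculation itself is omitted.

```python
import numpy as np

def decode_merge_string(bits, small_centres, small_labels, X):
    """Reassign points of unselected small clusters to the nearest selected cluster.

    bits[i] == 1 means small cluster i is kept as an initial cluster centre.
    small_labels[k] is the index of the small cluster that point k belongs to.
    """
    bits = np.asarray(bits, dtype=bool)
    X = np.asarray(X, dtype=float)
    small_centres = np.asarray(small_centres, dtype=float)
    selected = np.flatnonzero(bits)
    if selected.size == 0:
        raise ValueError("at least one bit must be set")
    new_labels = np.asarray(small_labels).copy()
    moved = ~bits[new_labels]  # points whose small cluster was not selected
    # Distances from each moved point to every selected centre.
    d = np.linalg.norm(X[moved][:, None, :] - small_centres[selected][None, :, :], axis=2)
    new_labels[moved] = selected[d.argmin(axis=1)]
    return new_labels

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
small_centres = np.array([[0.0, 0.0], [1.0, 0.0], [10.5, 0.0]])
small_labels = np.array([0, 1, 2, 2])
print(decode_merge_string([1, 0, 1], small_centres, small_labels, X))
```

In this toy case the point belonging to the unselected small cluster 1 is absorbed by the nearby selected cluster 0, so the string "101" describes a two-cluster solution.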
Garai and Chaudhuri [65] used a similar GA algorithm to merge the closest
smaller clusters into large clusters. The improvement of this paper over Tseng’s
paper is the ‘Adjacent Cluster Checking Algorithm’ (ACCA) which was primarily
performed as the merge process to overcome the problem when one cluster is
confined fully or partly within another cluster or clusters. This was implemented by
checking two threshold values: the number of boundary points Tb and data density
difference Td between two smaller nearby clusters for deciding the merging of a pair
of clusters. If the number of boundary points between the two clusters was greater than Tb and
the difference of density was less than Td , then these two smaller clusters were
verified as adjacent to each other. In the experiments performed, both of the
thresholds needed to be specified manually.
Kelly [93] proposed an algorithm for merging hyper-ellipsoidal clusters. In
this paper, an effective merging radius was introduced, where i and j represent two
clusters and rij was defined as the minimum effective cluster radius such that the
boundary of hyper-ellipsoids i and j intersect on the segment between their mean
vectors, as shown in Figure 2.9 from [93]. The smallest rij within all possible pairs
of clusters was chosen and the two corresponding clusters were merged. Two stop
criteria were given in the paper: 1) if the current cluster size is reduced to a given
size; 2) a threshold value R was specified such that the merge process was continued
until rij was greater than R.
Figure 2.9 Effective merging radius for clusters i and j.
Recently, Xie et al [79] developed a ‘multi-step maxmin and merge’ (3M)
algorithm. A new cluster validity based on compactness and separation measures
was described and used to evaluate the obtained clustering schema. The maximum
validity index that was achieved corresponded to the best clustering scheme. The
algorithm can be implemented in two steps. Firstly, a multi-step maxmin algorithm
(a modified version of the maxmin algorithm proposed by Tou and Gonzalez [94])
was used to group the original data into c clusters which corresponded to the
maximum index value. Secondly, a merge process was performed on the obtained c
clusters: the ‘worst’ cluster (corresponding to the minimum validity index) was
deleted, and all data points were reassigned to their nearest existing cluster centres;
the new centres were formed by calculating the median of each cluster. This iterative
merge process continued until the total number of clusters was two. The cluster
validity was computed at each step of iteration and the clustering scheme that
corresponded to the minimal index value was returned.
2.5 Clustering in FTIR Spectroscopy
2.5.1 Introduction
In order to analyse the FTIR microscopic data from existing tissue samples,
various techniques have been used. These include: point spectroscopy analysis
technique, greyscale functional group mapping, digital staining [14]; discriminant
function analysis [15] and multivariate statistical analysis methods such as principal
component analysis (PCA) [16]. Apart from these, multivariate clustering
techniques have often been used to separate sets of unlabelled infrared spectra data
into different clusters based on their characteristics (this is an unsupervised process).
Through the examination of the underlying structure of a set of spectra data, different
types of cells can be separated within biological tissue. There are many clustering
techniques that have been applied for the purpose of FTIR spectroscopic analysis.
These include hierarchical clustering analysis (HCA) [17-19], k-means clustering
[17] and fuzzy c-means clustering [16,17].
The application of commonly used techniques in FTIR spectroscopy analysis,
namely a multivariate analysis method (PCA) and multivariate clustering techniques
(HCA, k-means and fuzzy c-means), is now discussed.
2.5.2 Principal Component Analysis in Cluster Analysis
Principal component analysis (PCA) is a multivariate statistical technique that
has been widely applied in the field of data analysis and compression. This
technique linearly transforms a number of correlated variables into a new set of
uncorrelated variables, called principal components (PCs). PCA achieves this by
rotating the original axes to produce orthogonal axes that are uncorrelated with each
other [95]. The rotation procedure is a linear transformation of the original dataset
and, therefore, if all the variables are included in the rotation, then all information is
preserved. Within the new transformed variables, the first principal component
(PC1) identifies the dimension with the maximum variation in the original data and
the second principal component (PC2) is the dimension with the second largest
variation, and so forth. Therefore, the first few principal components usually contain
the most influential variations from the original data and it is this property that yields
the major advantage of the approach by allowing the dimensionality of the data to be
reduced whilst the most significant features of the data are retained. However, this
may not necessarily be the case; it depends on the application. The other main
application of PCA is to detect the underlying data structure, which is often used in
cluster analysis [96]. The techniques for using PCA in FTIR spectroscopy cluster
analysis are now introduced.
As mentioned above, most of the variation within the original data can be
represented by the first few PCs. Thus, if the data are plotted in the space spanned by
a small number of PCs, any cluster structure in the data may be detected or verified
by inspection. For example, if PC1 is plotted against PC2, the dimensionality of the
data is reduced to two. Goncalves et al [97] found that, by plotting the FTIR spectra
from sugarcane bagasse samples in the two-dimensional space of PC1 and PC2, the
different types of samples could be detected through visualisation. Indeed, on further
analysis they discovered that the first two PCs alone still retained 88% of the
variation of the original data. For how the percentage of variation is calculated, see
the end of Section 5.4 (page 107).
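The projection onto the first two PCs and the percentage of variation they retain can be sketched as follows. This is a minimal illustration only: the synthetic data, the function name pca_scores and the use of a singular value decomposition are assumptions made for the example, and the calculation presented in Section 5.4 may be expressed differently.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X (n spectra x p absorbances) onto the leading PCs."""
    Xc = X - X.mean(axis=0)                      # mean-centre each absorbance variable
    # SVD of the centred data: the rows of Vt are the principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # coordinates of each spectrum in PC space
    explained = s**2 / np.sum(s**2)              # fraction of total variance per PC
    return scores, explained

# Synthetic stand-in for spectra (assumption): two groups of 20 'spectra', 50 absorbances each
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(20, 50))
group_b = rng.normal(0.0, 0.1, size=(20, 50)) + 1.0
X = np.vstack([group_a, group_b])

scores, explained = pca_scores(X, n_components=2)
retained_percent = 100.0 * explained[:2].sum()   # variation retained by PC1 and PC2
print(f"PC1 and PC2 retain {retained_percent:.1f}% of the variation")
```

Plotting scores[:, 0] against scores[:, 1] would give the kind of two-dimensional view described above, in which separated groups appear as distinct clouds of points.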
In Lasch et al [20], PCA was employed for two purposes within the cluster
analysis of FTIR data. Firstly, it was used to generate a coloured image of different
types of tissue from the FTIR spectra. This was achieved initially by picking six
representative reference spectra from the FTIR maps, based on histological
information obtained after Haematoxylin and Eosin (H&E) staining of the tissue
section. Each representative reference spectrum was considered as an origin and the
geometrical distances from the origin to the rest of the spectra were then calculated in
the space of any two PCs. The distance from each spectrum to every origin was
normalised by dividing by the maximal distance from that origin, and a series of
values based on
each origin was then obtained. By combining these values with the original spatial
information, colour or greyscale maps could be produced. Thus, the different types of
tissue could be pictured through these image maps. Secondly, PCA was employed to
verify whether conventional light microscopic analysis matched the biochemical
information obtained from FTIR analysis. The first six PCs of the original data were
used as input data for the hierarchical clustering algorithm (based on Ward’s
algorithm [39]). The clustering results showed that the IR-based classification
matched the light microscopic investigations.
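The normalised distance maps described in the first part of this paragraph can be illustrated with a short sketch. It is not the implementation of Lasch et al [20]: the function name normalised_distance_map, the tiny 2 × 3 pixel map and the score values are hypothetical, chosen only to show the normalisation by the maximal distance from one reference spectrum.

```python
import numpy as np

def normalised_distance_map(scores, ref_index, map_shape):
    """Distance of every spectrum from one reference spectrum in a two-PC space,
    divided by the largest such distance and arranged on the spatial grid."""
    d = np.linalg.norm(scores - scores[ref_index], axis=1)
    d_max = d.max()
    if d_max > 0:                       # guard against a map of identical spectra
        d = d / d_max
    return d.reshape(map_shape)

# Hypothetical 2 x 3 pixel map: one row per pixel, columns are two PC scores
scores = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0],
                   [0.0, 1.0], [1.0, 0.0], [0.0, 2.0]])
grey_map = normalised_distance_map(scores, ref_index=0, map_shape=(2, 3))
print(grey_map)
```

Each reference spectrum would yield one such map with values between 0 and 1, which could then be displayed as a greyscale image or assigned to one colour channel.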
Kim et al [98] also recently employed PCA as a pre-processing step for cluster
analysis, in a similar fashion to Lasch et al [20], using FTIR spectra from seven
different plant species. The spectra were plotted in the two-dimensional space of PC1
vs. PC2 and the authors showed that the different categories of plants could be
identified simply by visualisation. Subsequently, the results based on PCA
were used as the input data to the hierarchical clustering technique. The authors
showed that the generated dendrogram was in agreement with the known taxonomy
of the plants and this indicated that the FTIR data reflected phylogenetic
relationships between the plants [98].
PCA techniques have also been used to evaluate tumour tissue from FTIR
spectroscopic maps. In Richter et al [16], PCA was used to extract the first 20 PCs of
the original dataset, which were then used in a fuzzy c-means clustering approach.
Richter et al identified that different tissue types could be displayed by showing all
spectral score maps in an entire sample image. The
score maps showed all spectra in the new coordinate system defined by the PCs,
whereby the first score map shows the coordinates of all spectra along PC1 and so
forth. The authors showed that less information can be discovered as the PC number
increases and, in fact, the score maps of the first 4 PCs covered 99% of the total
variance of the dataset. The authors also provided an interesting analysis of the
information that could be gathered from each of the principal components. Using the
first score map, only the whole sample shape could be identified. From the second
score map, different regions of the tissue began to emerge. However, after the fourth
score map, the maps yielded very little additional information. In order to improve
the interpretability of the PCA score maps, the authors also combined the scores of
the second, third and fourth PCs to form one
coloured image (each PC was shown as a separate colour). The experiments
indicated that fuzzy c-means clustering in combination with PCA was suitable to
separate different types of tissue based on their FTIR maps. The authors claimed
that, in comparison with PCA, the fuzzy c-means clustering algorithm offered a
clearer view of the main features of the tumour section [16].
2.5.3 Hierarchical Clustering Analysis
When IR spectra are analysed using multivariate clustering techniques, each
spectrum is considered as an individual data point. The p absorbance values (or
features) associated with each spectrum are considered as the different dimensions of
the data point. Therefore, the IR spectra in cluster analysis are represented as data
points in a p-dimensional space. Hierarchical clustering analysis (HCA) on IR spectra
can be
described as follows [17]: in a space containing n data points, a distance matrix
between all points within p dimensions is calculated. The distance matrix is of size
n×n and is symmetric about its diagonal. The two data points that are closest and,
therefore, most similar to each other are then merged together to form a new cluster.
The new distance matrix is now reduced to (n-1)×(n-1). The procedure continually
merges the two closest data points or clusters until all data points are merged into one
cluster. Once again, the output of HCA is the dendrogram (as described in Section
2.1.1) and a cut-off line can be drawn to define the number of clusters and to generate
the final partition. Whilst this step is normally performed subjectively, the number of
clusters is usually provided as an input parameter to the HCA algorithm.
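The agglomerative procedure and the cut into a pre-specified number of clusters can be sketched with SciPy's hierarchical clustering routines. This is an illustrative example only and is not the MATLAB implementation used in this thesis; the synthetic data, the choice of three clusters and the use of Ward's linkage (discussed in the following paragraph) are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for n = 30 spectra with p = 8 absorbances (assumption):
# three well-separated groups of ten spectra each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.05, size=(10, 8)) for c in (0.0, 1.0, 2.0)])

# Build the full merge tree: each of the n-1 rows of Z records one merge
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree so that at most 3 clusters remain (number of clusters given as input)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

The array Z holds the information that a dendrogram displays; scipy.cluster.hierarchy.dendrogram(Z) could be used to draw it where a plotting backend is available.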
In most HCA analyses of IR spectra, Ward’s algorithm (also named the
minimum-variance algorithm in Section 2.1.1) is used because it tends to
produce dense clusters [17]. Ward’s algorithm finds clusters by minimising the total
sum of squared error from all clusters [39]. Ward’s original method for calculating
the distance between two clusters (r and s) is relatively complex, and has been
simplified to an ‘equivalent’ distance within the MATLAB implementation used in
this thesis as follows [99]:
$$ d^2(r,s) = \frac{n_r \, n_s \, d_{rs}^2}{n_r + n_s} \tag{2.15} $$
where $d_{rs}^2 = \|\bar{x}_r - \bar{x}_s\|^2$ is the distance between clusters r and s. $\bar{x}_r$ and
$\bar{x}_s$ are the centres of clusters r and s, which are equal to the mean value of all the
data points in each of the respective clusters. The Euclidean distance is represented
by $\|\cdot\|$, and $n_r$ and $n_s$ represent the number of data points in clusters r and s
respectively.
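A small worked example of Equation (2.15) is given below as a sketch. The function names and the three data points are hypothetical. The example also checks the standard property that this quantity equals the increase in the within-cluster sum of squared error caused by merging the two clusters, which is why minimising it at each merge is consistent with the criterion of Ward’s algorithm.

```python
import numpy as np

def ward_distance_sq(cluster_r, cluster_s):
    """Equation (2.15): n_r * n_s * ||mean_r - mean_s||^2 / (n_r + n_s)."""
    n_r, n_s = len(cluster_r), len(cluster_s)
    d_rs_sq = np.sum((cluster_r.mean(axis=0) - cluster_s.mean(axis=0)) ** 2)
    return n_r * n_s * d_rs_sq / (n_r + n_s)

def sse(cluster):
    """Sum of squared distances of the points in a cluster from its centre."""
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

r = np.array([[0.0, 0.0], [2.0, 0.0]])    # centre (1, 0), n_r = 2
s = np.array([[1.0, 3.0]])                # centre (1, 3), n_s = 1

d2 = ward_distance_sq(r, s)               # 2 * 1 * 9 / 3 = 6.0
increase = sse(np.vstack([r, s])) - sse(r) - sse(s)   # 8 - 2 - 0 = 6.0
print(d2, increase)
```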
Apart from the Euclidean distance measure, another distance (or similarity)
measure, the correlation coefficient or correlation, is also often used in FTIR spectra
analysis. It is used to evaluate the strength and direction of a linear relationship
between two spectra and can be defined as below [19]:
$$ C_{RS} = \frac{\sum_{j=1}^{p} (S_j - \bar{S})(R_j - \bar{R})}{\sqrt{\sum_{j=1}^{p} (S_j - \bar{S})^2 \; \sum_{j=1}^{p} (R_j - \bar{R})^2}} \tag{2.16} $$
where S and R are two spectra which can be considered as two 1-dimensional
vectors with p absorbances (in cluster analysis, the absorbance values can be viewed
as the variables) and $\bar{S}$ and $\bar{R}$ are the mean values of each vector. The values
$C_{RS}$ for all pairs of spectra form the correlation matrix and, as it is symmetric, only
the upper half is needed. Out of the different correlation measures, the most
frequently used is Pearson’s correlation coefficient. It is defined as follows [83]:
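$$ \mathrm{PCC}_{RS} = \frac{\sum_{j=1}^{p} (R_j \cdot S_j) - p \cdot \bar{R} \cdot \bar{S}}{\sqrt{\left(\sum_{j=1}^{p} R_j^2 - p \cdot \bar{R}^2\right) \cdot \left(\sum_{j=1}^{p} S_j^2 - p \cdot \bar{S}^2\right)}} \tag{2.17} $$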
The output of the correlation coefficient is within the range [−1, 1] and indicates
the similarity of the two spectra. If the correlation coefficient is close to −1, the
two spectra are almost perfectly inversely related whereas, if its value is close to 1,
the two spectra have almost the same shape. If the correlation coefficient is 0, there is
no linear correlation between the two spectra. Therefore, the spectra with the highest
correlation coefficients are considered to be the most similar spectra and it is these
spectra that are merged at each iterative step of the hierarchical clustering algorithm.
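The behaviour of the correlation coefficient can be checked numerically with the sketch below. It assumes the standard mean-centred form of the coefficient and the algebraically equivalent raw-sum form; the function names and the example spectra are hypothetical.

```python
import numpy as np

def correlation_centred(S, R):
    """Mean-centred form of the correlation coefficient between two spectra."""
    ds, dr = S - S.mean(), R - R.mean()
    return np.sum(ds * dr) / np.sqrt(np.sum(ds**2) * np.sum(dr**2))

def correlation_raw_sums(S, R):
    """Raw-sum form: the same quantity written without centring the vectors first."""
    p = S.size
    num = np.sum(R * S) - p * R.mean() * S.mean()
    den = np.sqrt((np.sum(R**2) - p * R.mean()**2) * (np.sum(S**2) - p * S.mean()**2))
    return num / den

S = np.array([0.10, 0.40, 0.90, 0.30])   # hypothetical absorbance values
R_same = 2.0 * S + 0.5                   # same shape, rescaled and offset
R_opposite = -S                          # inverted shape

c_same = correlation_centred(S, R_same)           # close to +1
c_opposite = correlation_centred(S, R_opposite)   # close to -1
print(c_same, c_opposite, correlation_raw_sums(S, R_same))
```

Because the coefficient is unchanged by scaling and by a constant offset, it compares the shapes of two spectra rather than their absolute absorbance levels.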
Many HCA techniques have been proposed in the literature for FTIR
spectroscopy analysis, based on various linkage methods and various distance
measures (e.g. Euclidean distance, correlation). Wood et al [19] applied an HCA
method using the correlation coefficient as the distance measure to investigate FTIR
spectra from cervical cancer. Up to ten clusters were selected according to the major
anatomical features and the mean spectrum from each cluster was extracted for
comparative purposes. The clustering results showed that the approach was able to
separate normal and diseased tissues from each other, when compared with
conventional histological analysis.
Zhao et al [100] investigated the use of FTIR spectroscopy to characterise a
group of 20 different bacteria, using the Euclidean distance as the similarity measure
to construct a dendrogram. The same twenty bacteria were also classified by two
DNA-based techniques, namely 16S rDNA sequencing and fluorescent amplified
fragment length polymorphism (AFLP), and the results obtained from these three
methods were compared. It was found that all approaches generated similar
outcomes. However, in comparison with the other two techniques, FTIR was the
faster method. In addition, it was easy to use and inexpensive in terms of laboratory
use.
Salman et al [21] employed Ward’s algorithm to investigate the detection of cells
infected by different variations of the herpes virus using FTIR spectroscopic
methods. In order to obtain the best partition results, cluster analysis was performed
on different segments of the spectra. The results indicated that the spectral region
between 950 and 1350 cm-1 achieved the best results. The paper concluded
that normal and herpes-infected cells can be discriminated within the early stages
of the infection.
Naumann et al [101] used Ward’s algorithm to detect fungi in wood
through FTIR microscopy and imaging. A false colour image was used to display
the FTIR spectra cluster analysis results from wood fibres, empty vessel lumina and
mycelium of both fungal species. The image showed that the three wood blocks
could be distinguished. The paper also concluded that the FTIR microscopy
technique had the potential to identify different fungal species decaying
wood.
The combination of different distance measures with Ward’s algorithm has
also been explored in the past. Schultz et al [22] applied Euclidean distance in Ward’s
algorithm to study chronic lymphocytic leukaemia cells using FTIR
spectroscopy. Recently, Romeo and Diem [18] made use of the correlation
coefficient in Ward’s algorithm to reduce the artefacts present in the related area of
‘infrared transflection micro-spectra’ to improve the quality of the spectral analysis.
Lasch et al [17] employed a linear transformation of Pearson’s correlation coefficient
measure in Ward’s algorithm. IR spectra maps from a colorectal adenocarcinoma
section were investigated in this study.
Although in these experiments, HCA techniques performed well in cluster
analysis of FTIR spectra, the various authors identified some major drawbacks of the
method. Firstly, the size of the correlation matrix (or distance matrix) requires a
large amount of computer memory [18]; secondly, it has a high computational
requirement which becomes especially evident when analysis is performed on large
datasets. In Lasch et al’s paper [17], the authors reported that the application of
HCA to analyse 8281 spectra took 4.5 hours which, in a practical environment, is
unacceptable.
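As a rough check on the memory requirement noted above, the size of the distance matrix for the 8281 spectra reported by Lasch et al [17] can be estimated as follows. The figure of 8 bytes per value is an assumption (double-precision storage); the actual storage used in the cited work is not stated here.

```python
n = 8281                  # number of spectra reported by Lasch et al [17]
bytes_per_value = 8       # assumption: double-precision floating point values

full_matrix_bytes = n * n * bytes_per_value                # complete n x n matrix
upper_half_bytes = n * (n - 1) // 2 * bytes_per_value      # symmetric: upper half only

print(full_matrix_bytes / 1e6)   # about 548.6 (megabytes)
print(upper_half_bytes / 1e6)    # about 274.3 (megabytes)
```

Because the number of entries grows with the square of n, doubling the number of spectra roughly quadruples the storage, which is why the method becomes impractical for large spectral maps.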
2.5.4 K-means Clustering Analysis
The application of the k-means clustering algorithm to IR spectra can be
described as follows [17]. In a space containing n data points (spectra), each of which
is again associated with p dimensional (absorbance) values, an initial k data points
are randomly chosen. Each of these k points represents an initial cluster centre. The
distances from all data points to these k cluster centres are then calculated
and each data point is assigned to the cluster that has the minimal distance value.
Next, a set of new cluster centre positions is computed based on the newly formed
clusters (the mean value of the data points in each cluster is considered as the new
cluster centre position). The assignment and update steps are then repeated. During
the iterative process, as the cluster centres change, each data point may be reassigned
to a different cluster centre many times. The method stops when all of the cluster
centres are in
stable positions. Compared with HCA, k-means is not time-consuming and
its execution time is linearly proportional to the number of spectra n.
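The steps described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in any of the cited studies: the function name k_means, the iteration limit, the handling of empty clusters and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means on the rows of X (n spectra x p absorbances)."""
    rng = np.random.default_rng(seed)
    # Initial centres: k data points chosen at random
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every spectrum to every centre, shape (n, k)
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)             # assign to the nearest centre
        # New centre = mean of the assigned points (an empty cluster keeps its centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):    # centres are stable: stop
            break
        centres = new_centres
    return labels, centres

# Synthetic stand-in for spectra (assumption): two well-separated groups of 15
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(15, 4)),
               rng.normal(5.0, 0.05, size=(15, 4))])
labels, centres = k_means(X, k=2)
print(labels)
```

Each pass computes n × k distances, which is the source of the linear dependence of the execution time on the number of spectra.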
In a paper by Zhang et al [23], a k-means clustering technique was used for the
assignment of pixels in an image to identify cell and non-cell categories, where the
breast cell lines were evaluated using an FTIR microscopic imaging measurement.
Lasch et al [17] also employed a k-means clustering method to evaluate the FTIR
microspectroscopy imaging of colorectal adenocarcinoma tissue sections. Various
numbers of clusters were pre-specified (2, 4, 6, 8 and 11, respectively) for
use in the experiments. The results showed that the k-means technique can reproduce
the principal distinctions seen in histopathology and, in particular, when the number
of clusters was 6, all spectral clusters could be clearly assigned to a specific
histological structure. Nevertheless, when further increases were made in the number
of clusters, the k-means approach failed to further discriminate between the
information within the histological structures [17]. It should be noted again that the
k-means clustering algorithm always requires the user to specify the number of
clusters prior to execution of the algorithm.
2.5.5 Fuzzy C-Means Clustering Analysis
In most applications of the fuzzy c-means algorithm, each spectrum needs to
undergo a ‘hardening’ process to convert the results into crisp values. This is simply
performed by assigning each spectrum to a cluster according to its highest
membership degree or to a membership degree greater than a threshold value.
Studies have shown that by hardening the clustering results from fuzzy c-means,
better clustering results can be achieved than using the inherently hard results from
k-means [33,48]. Similarly to the k-means algorithm, the fuzzy c-means clustering
technique does not have high computational requirements, as its execution time is
also proportional to the number of spectra.
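The two ‘hardening’ rules mentioned above can be sketched as follows. The membership matrix U is hypothetical (one row per spectrum, one column per cluster, each row summing to 1), as are the function names and the use of -1 to mark spectra left unassigned; the threshold of 0.975 is the value quoted for Mansfield et al [102] in this section.

```python
import numpy as np

def harden_max(U):
    """Assign each spectrum to the cluster with its highest membership degree."""
    return U.argmax(axis=1)

def harden_threshold(U, threshold):
    """Assign a spectrum only if its highest membership reaches the threshold;
    spectra below the threshold are marked -1 (unassigned)."""
    labels = U.argmax(axis=1)
    labels[U.max(axis=1) < threshold] = -1
    return labels

# Hypothetical fuzzy memberships for three spectra and three clusters
U = np.array([[0.98, 0.01, 0.01],
              [0.40, 0.35, 0.25],
              [0.05, 0.90, 0.05]])

print(harden_max(U))                 # [0 0 1]
print(harden_threshold(U, 0.975))    # [ 0 -1 -1]
```

The second spectrum illustrates the difference between the two rules: it has no dominant cluster, so the maximum rule forces it into cluster 0 whereas the threshold rule leaves it unassigned.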
McIntosh et al [14] employed a fuzzy c-means clustering algorithm for the
analysis of IR microscopic maps of human skin. Five distinct clusters were
identified and each cluster was correlated to separate histological tissue components.
The authors report that the approach can clearly separate tumour-bearing skin from
normal skin. In the solution obtained, the five cluster centroid spectra clearly
reflected the chemical differences between the tissue types.
Mansfield et al [102] analysed the remnants of a work of art (a 16th Century
Flemish line drawing) using a near-IR spectroscopic imaging technique. Firstly, each
spectrum was associated with a pixel element of the image. Then the fuzzy c-means
algorithm was applied to both raw and normalised spectroscopic data (mean centred,
scaled) in order to isolate all four components from the background easily. In this
study, the authors used a threshold fuzzy membership value of 0.975 for ‘hardening’
the results. The authors report that fuzzy c-means cluster analysis is an excellent
exploratory methodology as it does not require knowledge of the sample’s
composition or spectral properties in advance. The use of spectral normalisation
routines in conjunction with fuzzy c-means cluster analysis provides a more detailed
picture of the range of spectral types in the test sample.
Richter et al [16] used fuzzy c-means to evaluate the tumour tissue from FTIR
spectroscopic maps. The highest membership grade was chosen when assigning
each spectrum to a specific cluster. Each cluster was then encoded into a different
colour to display the cluster results. The experiments showed that fuzzy c-means
provided a general view of the main features of the tumour thin section.
Recently, Lasch et al [17] utilised a fuzzy c-means technique to investigate
FTIR microspectroscopy imaging of colorectal adenocarcinoma tissue sections. In
this study, all membership values were encoded by colour intensities. Fuzzy
c-means cluster images were then plotted into two-dimensional space by PCA so as to
compare with other cluster analysis and histological results. The computational
experiments indicated that when the number of clusters was set to 3, 4 or 6, the fuzzy
c-means images could be assigned to the specific tissue structures. However, when
the number of clusters was increased further, the results became more and more
vague in terms of the relationship to the known histopathology.
2.6 Summary
In this chapter, an overview was provided of both the general clustering
techniques utilised within the literature and the clustering approaches used within
FTIR spectroscopic applications. The purpose of clustering is to group the objects
(spectra) so that objects in the same cluster are as similar as possible and objects in
different clusters are as dissimilar as possible. Thus, through the clustering of
different FTIR spectra, diverse types of cells can be separated.
In different clustering procedures it is commonly required that the quality of
the clustering scheme is verified. This is achieved by a cluster validity measure.
This chapter identified several of the most relevant cluster validity measures that
have been used within the literature in order to evaluate the partition results.
This chapter also identified the clustering literature in which algorithms are
proposed for the purpose of automatically and correctly identifying the most
appropriate number of clusters with which to represent a dataset. As the main focus
of this study is the clustering of different tissue types, particular consideration has
been given to algorithms that also aim to cluster tissue types. It was identified that
occasionally, during the automatic clustering procedure, an excessive number of
clusters was obtained.
Finally, a number of publications that utilise various clustering techniques for
the processing of FTIR spectral data were discussed. In all of these approaches, focus
was drawn to the many problems and disadvantages that exist in these techniques.
In the next Chapter, an overview of the medical background relevant to this
thesis is presented.
CHAPTER 3
Medical Background
This research is a collaboration between the Computer Science department
(Xiao Ying Wang supervised by Dr Jon Garibaldi) and the Chemistry department
(Benjamin Bird supervised by Professor Michael George) at the University of
Nottingham, UK, and is motivated by the study of Mr. John M. Chalmers et al which
focussed on the use of FTIR microscopy of oral tissue samples. In light of the fact
that the medical background information and technical considerations for the FTIR
microspectroscopy did not fall under the remit of the School of Computer Science for
this research, most of this chapter has been derived from either our joint publications
(including several conference papers, journal papers and a refereed book chapter),
our colleagues in the Chemistry department, or from the internal report of John M.
Chalmers et al.
3.1 Introduction
In this Chapter, the general medical background related to this thesis is
described and explained. Two types of cancer cells have been investigated in this
study, namely, oral cancer and breast cancer. In order to examine whether the
suspected patients have these types of cancer, tissue samples are collected and,
therefore, this Chapter will also describe the process of collecting, preparing and
conducting FTIR microscopy analysis on these samples.
Oral cancer is a disease that causes a large number of fatalities.
The latest statistics from the Cancer Research UK website (2005) show that nearly
4,500 oral cancer cases are diagnosed, and more than 1,600 deaths occur, in the UK
each year [103]. The late detection of this disease is often the cause of mortality.
Breast cancer is the second most common cancer in the UK. The latest statistics from
the Cancer Research UK website (2006) show that more than 41,700 women are
diagnosed with breast cancer and around 300 men also are diagnosed in the UK each
year. More than 12,400 deaths are caused by this disease every year in the UK [104].
The ability to accurately identify the malignancy is crucial for prognosis and
preparation of effective treatment.
Currently, the diagnosis process for oral cancer begins with visual checks
of the patient’s mouth and throat by a doctor. If there are any abnormal areas present,
a small piece of tissue is removed for further investigation by a pathologist
under a microscope (removing tissue to look for cancer cells is called a biopsy in
medical terms). The purpose of this procedure is to check whether cancerous cells
exist within the tissue area. However, this traditional histology (the study of plant
or animal tissue, which usually involves studying thin cross-sections of tissue under a
microscope [105]) remains a subjective technique and some problems are
occasionally encountered such as missed lesions, broken samples and unsatisfactory
levels of discrepancy. Discrepancies can be either inter-observer (discrepancy
between two different observers) or intra-observer (discrepancy between two
different examinations by the same observer).
For breast cancer, some preoperative imaging methodologies, such as x-ray
mammography and ultrasound, can identify areas of tumour growth in the breast
based on the identification of density changes within the tissue. However they
cannot be used to reliably diagnose whether the tumours are benign or cancerous in
nature [106]. Additionally, the diagnosis of breast cancer can often also be achieved
by assessing the lymph nodes in the ipsilateral axilla (the axilla on the same side of
the body as the tumour). The presence of metastasis (cancer spread from its original
location) is an indicator for local disease recurrence and thus a method for
identifying patients who are at high risk of developing a cancer variant that could
spread throughout the body. The well-established procedure to assess lymph node
metastases is axillary lymph node dissection (ALND). Nevertheless, this is a rather
substantial surgical procedure that can lead to several serious side effects, such as
shoulder dysfunction and lymphoedema (swelling, especially in subcutaneous
tissues, as a result of obstruction of lymphatic vessels or lymph nodes, with
accumulation of lymph in the affected region) [107]. The introduction of
mammography screening programmes, together with a greater public awareness of
breast cancer, has meant that the majority of patients who do not have axillary
lymph node metastases at presentation do not have to undergo ALND [108].
Intra-operative diagnosis has become increasingly important with the recent
introduction of sentinel lymph node biopsy [109]. The sentinel node can be
described as any lymph node that has a direct lymphatic connection to the tumour,
and would be the first invaded by cancer spreading from the breast [106], see figure
3.1 [110]. Surgical studies have clearly shown that if cancer cannot be found in the
sentinel lymph node, the chance of disease being found further down the chain of
lymph nodes that drain the breast is negligible [109]. Therefore accurate analysis of
the sentinel lymph node can alleviate the necessity to remove all suspected nodes
present.
Several techniques have been employed to facilitate fast intra-operative
diagnosis of sentinel nodes, such as imprint cytology and frozen section assessment
[111]. However, these approaches report a wide variation in their sensitivity and
specificity to detect cancerous lesions, with detection levels as low as 44% and as
high as 93% when compared against conventional histology [111-114]. In addition,
these techniques are heavily reliant upon the availability of an experienced
cytopathologist; thus, examination by a general pathologist may result in lower
accuracies than those from specialist clinics. This general lack of consistency
between different pathologists reduces the reliability of such intra-operative tools for
sentinel lymph node diagnosis.
Figure 3.1 Typical location of lymph nodes that drain lymph from the breast.
The difficulties that exist with the current cancer diagnosis techniques have
resulted in a variety of different spectroscopic methods being investigated in order to
determine whether such approaches could be used to generate a reliable aid for
diagnosis [115-117].
3.2 Instrumentation
Infrared spectroscopy has shown much potential as a tool for analysing
biological materials over the past decades [113,118,119]. When biological
molecules are exposed to radiation in the mid-infrared region of the electromagnetic
spectrum (400−4000cm-1), characteristic absorptions from the excitation and
vibration of bonds within the molecules can be exhibited [106]. FTIR
microspectrometry, obtained through the coupling of an infrared microscope to an
FTIR spectrometer, has proved to be a potent new technique as a diagnostic tool for
the determination of a variety of tissue structures [120]. FTIR microscopic spectra
can detect subtle changes in spectral peaks and their positions arising from the
biomolecule constituents, such as proteins, lipids and nucleic acids. Therefore, the
very small biochemical changes that occur between different cell types can be
noticed, even with very complex cells.
For the purposes of this research, three instruments have been utilised: two
IR spectrometers, namely a Nicolet Continuum and a Perkin Elmer Spotlight
Imager, and a Nicolet Nic-Plan microscope that was coupled to a synchrotron source
(see Section 3.2.3). Figure 3.2 shows the Perkin Elmer Spotlight Imager used in this
study. In the following Sections, each of these instruments is described in detail.
Figure 3.2 Perkin Elmer Spotlight Imager.
3.2.1 Nicolet Continuum FTIR Microspectrometer
The apparatus comprises a Nicolet Nexus 730 FTIR spectrometer (Nicolet
Instruments, Inc., Madison, USA), fitted with a potassium bromide (KBr) beam
splitter. This spectrometer is interfaced to a Nic-Plan IR microscope that has
its own liquid nitrogen cooled narrow-band (1800−900cm-1)
mercury-cadmium-telluride single element detector. Transmission spectra were
recorded at either 4cm-1 or 8cm-1 spectral resolution, typically with 512 or 1024
scans per spectrum. The FTIR microscope was operated using a 32× objective lens.
Background single-beam spectra were recorded through a blank barium fluoride
(BaF2) window [26].
3.2.2 Perkin Elmer Spotlight Imager
The Perkin Elmer Spotlight Imager (Perkin-Elmer Corp., Sheldon,
Connecticut) was also used in this study. It is an FTIR microscope that is similar
to the Nicolet Continuum instrument. The main difference between the two
instruments is that the Perkin Elmer Spotlight Imager comprises a dual set of
detectors. The microscope is equipped with both a 100µm single element detector
and an array detector. When operated in array mode, the system utilises a 16 × 1
element (400µm × 25µm) linear array of small area narrow band (4000−720cm-1)
detectors coupled with an electronic stage to raster across the sample in both the
horizontal (X) and vertical (Y) planes, so that a microscopic IR image can be
constructed. The advantage of the array mode is that it has the capability of scanning
16 different spatial areas at once, thus enabling larger sample areas to be examined
rapidly at the microscopic level. The microscope can be purged with dry air using a
specially
designed Perspex box to reduce spectral contributions from atmospheric CO2 and
water vapour. Spectra were collected in transmission mode, using a clean BaF2
window as background, with a spectral resolution of 8cm-1. Each pixel sampled a
6.25µm × 6.25µm area of the sample. An appropriate background spectrum was
collected from the sample in order to ratio against the single beam spectra. These
ratioed spectra were then converted to absorbance values, with each spectrum
containing 821 data points (4cm-1 data point interval within range of 4000−720cm-1)
[106].
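The figure of 821 data points and the conversion of ratioed spectra to absorbance can be illustrated with the sketch below. It uses the standard definition of absorbance as the negative base-10 logarithm of the transmittance; the function name and the example intensities are hypothetical, and the instrument software may apply additional processing that is not shown here.

```python
import numpy as np

# Wavenumber axis: 4000 down to 720 cm-1 at a 4 cm-1 data point interval
wavenumbers = np.arange(4000, 720 - 1, -4)
n_points = len(wavenumbers)          # (4000 - 720) / 4 + 1 = 821

def absorbance(sample_single_beam, background_single_beam):
    """Ratio the sample against the background, then convert to absorbance."""
    transmittance = sample_single_beam / background_single_beam
    return -np.log10(transmittance)

# Hypothetical single-beam intensities at two wavenumbers
sample = np.array([10.0, 50.0])
background = np.array([100.0, 100.0])
print(n_points, absorbance(sample, background))   # 821 and roughly [1.0, 0.301]
```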
3.2.3 FTIR microspectroscopy utilising a Synchrotron Radiation Source
An FTIR microspectrometer can measure tissue sections containing very small
cells (approximately 10µm diameter). This is because a Synchrotron Radiation
Source (SRS) (which is coupled to the FTIR spectrometer) enables the collection of IR
spectra at these spatial sizes with a significantly higher signal to noise ratio: it
enhances the brightness of the IR radiation, resulting in readings that are up to 1000
times stronger than [121]. Synchrotron light is produced when an electron is
accelerated to near-relativistic speeds by a magnetic field made up of multiple huge
dipole electromagnets. Initially electrons are emitted from a hot cathode and
accelerated to approximately 12 MeV by a linear accelerator. They are then injected
into a booster ring raising their energy to 600 MeV before finally being injected into
a storage ring where they circulate at approximately 2 GeV. Accelerating charged
species via magnets induces the release of radiation creating a circular beam that
encompasses a broad spectrum of wavelengths. The type of light that is produced is
dependent upon the magnetic field strength and the energy of the electron beam
[122]. Therefore the maximisation of both these factors produces the shortest
wavelength of light. To stop the reduction of the energy of the light after it is
emitted, Radio Frequency (RF) devices are used to add energy to the beam. Beam
blockers ensure that the appropriate wavelengths of radiation reach the different
experimental stations. The synchrotron radiation should be very stable with gradual
exponential decay over many hours. The beam that emerges at the IR station is
filtered to give the correct wavelengths, and then directed via mirrors to a
conventional infrared microscope. In this study, the IR beamline at the UK SRS
laboratory in Daresbury was used, where the synchrotron source has been coupled to a
Nicolet Nic-Plan FTIR microscope.
3.3 Sample Preparation and Data Collection
In this thesis, two types of cancer cells, namely oral cancer and breast cancer,
have been used for the investigation. The following describes the preparation of the
tissue samples and FTIR spectroscopic data collection.
3.3.1 Oral cancer tissue samples
Oral cancer tissue specimens were collected from three patients who had been
diagnosed with oral cancer. The samples were kindly provided by Derby
General Hospital with the full consent of the patients in question. For each patient,
several samples were collected from various areas, encompassing a mixture of
tissue types. Once taken, the samples were immediately frozen in
liquid nitrogen to preserve their biochemical condition. The samples were
subsequently cut using a freezing-microtome to obtain 5µm thick tissue sections.
These sections were then mounted on 0.5mm thickness BaF2 windows for infrared
analysis. The remainder of the tissue specimen was then used for analysis through
conventional methods involving Hematoxylin and Eosin (H&E) staining for the
identification of the regions of particular interest for histology. The results from
these two parallel sections can be used for comparative analysis by a pathologist.
Some of the sections underwent infrared analysis first and were simply
stained afterwards for histological examination [26,106].
3.3.2 Breast cancer tissue samples
Breast cancer tissue specimens were collected during routine surgical resection
for breast cancer with approval from Gloucestershire Research Ethics Committee and
fully informed and consenting patients. From each appropriate case, a small portion
of one axillary lymph node was collected; the tissue areas chosen for analysis
contained a variety of lymph node tissue types. These samples were also
frozen immediately and cut using a freezing-microtome, giving 7µm
thick tissue sections. These sections were then placed onto a barium fluoride disc and
stored in a cryovial ready for infrared analysis. As for the oral cancer
tissue samples, parallel sections were stained using H&E staining procedures so that
a consultant breast histopathologist could provide a comparative analysis.
3.4 Data Pre-processing
The FTIR spectra output by the IR spectrometer need to be pre-treated
before undergoing multivariate analysis; this is also called data pre-processing.
Pre-treatment includes the removal of absorption intrusions from atmospheric water
vapour and CO2, and baseline correction is often applied to correct the sloping and
curved baselines encountered in cell spectra. In addition, because the thickness of
each sample is irregular, normalisation is required to remove the resulting effects.
For the oral cancer tissue samples, some pre-processing was undertaken using routines
within the Nicolet OMNIC32TM software supplied with the FTIR spectrometer (Nicolet
Continuum). The baseline points chosen to flatten the spectra were:
4000, 3750, 1815, 930, 700 and 650cm-1. Normalisation was undertaken on each
spectrum by setting the intensity maximum of the Amide II band at ca. 1542 cm-1 to
1 absorbance unit. The data pre-processing described above is referred to as basic
pre-processing, and was performed using the Pirouette multivariate analysis software.
For the breast cancer tissue samples, the tissue area is greater than for
the oral cancer cases; therefore the Perkin-Elmer Spotlight imager spectrometer
was used to obtain the spectroscopic data. However, the software that comes with the
spectrometer (Infometrix Pirouette®, version 3.02, multivariate data analysis
software, Infometrix, Inc., Woodinville, WA, USA) is not well suited to making these
corrections. Originally, therefore, the necessary baseline correction was performed
spectrum by spectrum (as was normalisation), which was very time-consuming.
In order to solve the baseline correction and normalisation problems in the breast
cancer (lymph node) tissue samples, the author of this thesis implemented these two
corrections using Matlab version 6.5, release 13.0.1 (Mathworks, Natick, MA, USA).
The six baseline points chosen in the lymph node tissue samples were: 4000, 3744,
2200, 1836, 876, and 720cm-1. Two types of normalisation, namely peak area and
vector normalisation, were implemented to normalise the spectra. Peak area
normalisation was achieved by scaling each spectrum such that the sum of absorbance
over the indicated wavenumber range (4000−720cm-1) equals unity; vector normalisation
was achieved by scaling each spectrum such that the sum of squared deviations over the
same wavenumber range equals unity. In this thesis, peak area normalisation was used
in most cases as it is faster than vector normalisation. It should be
noted that, for the lymph node tissue samples, only the basic pre-processing was
undertaken on the spectra.
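The two corrections described above can be illustrated with a short sketch. It is written in Python rather than the Matlab used for the thesis work, and it rests on assumptions that the text does not confirm: the baseline is taken to be piecewise linear through the spectrum at the chosen baseline points, and the function names and toy spectrum are hypothetical.

```python
from bisect import bisect_right


def baseline_correct(wavenumbers, absorbance, baseline_points):
    """Subtract a piecewise-linear baseline anchored at the chosen wavenumbers.

    Assumption: the baseline passes through the spectrum at the sampled point
    nearest to each baseline wavenumber. At least two distinct anchors are needed.
    """
    # Index of the sampled point nearest to each requested baseline wavenumber.
    anchors = sorted({min(range(len(wavenumbers)),
                          key=lambda i: abs(wavenumbers[i] - p))
                      for p in baseline_points})
    corrected = []
    for i, value in enumerate(absorbance):
        # Pick the pair of neighbouring anchors that brackets point i.
        k = min(max(bisect_right(anchors, i) - 1, 0), len(anchors) - 2)
        lo, hi = anchors[k], anchors[k + 1]
        t = (wavenumbers[i] - wavenumbers[lo]) / (wavenumbers[hi] - wavenumbers[lo])
        baseline = absorbance[lo] + t * (absorbance[hi] - absorbance[lo])
        corrected.append(value - baseline)
    return corrected


def peak_area_normalise(absorbance):
    """Scale a spectrum so that its absorbances sum to one."""
    total = sum(absorbance)
    return [a / total for a in absorbance]


def vector_normalise(absorbance):
    """Scale a spectrum so that its sum of squared deviations from the mean is one."""
    mean = sum(absorbance) / len(absorbance)
    norm = sum((a - mean) ** 2 for a in absorbance) ** 0.5
    return [a / norm for a in absorbance]


# Toy four-point spectrum (hypothetical values), flattened between its end points.
wavenumbers = [4000, 3000, 2000, 1000]
spectrum = [1.0, 2.5, 3.0, 3.0]
flat = baseline_correct(wavenumbers, spectrum, [4000, 1000])
area_scaled = peak_area_normalise(flat)
vector_scaled = vector_normalise(flat)
```

For the lymph node spectra, the six baseline points listed above would be passed as `baseline_points`. Peak area normalisation needs only one sum per spectrum, whereas vector normalisation needs a mean and a sum of squares.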
3.5 Summary
In this chapter, the medical background of this research was presented, the
current diagnosis procedures for oral cancer (briefly) and breast cancer were
described and the difficulties existing in the processes were illustrated. These
difficulties are the main motivation behind a considerable research effort to
investigate whether infrared spectroscopy can be used as a diagnostic probe to
identify early stages of cancer since these techniques are sensitive to biological
changes within cells. Finally, this chapter identified the FTIR microspectroscopy
procedures that we have used to investigate the above mentioned two types of cancer
tissue samples. These include the instrumentation used to obtain the FTIR spectra
data, tissue sample preparation, spectral data collection and pre-processing.
In the remainder of this thesis, the investigations of FTIR spectroscopic data
from these cancerous tissue samples are presented. In the next Chapter, three
clustering techniques which are often used in FTIR spectra analysis, namely
hierarchical cluster analysis, k-means and fuzzy c-means, are applied to the seven
sets of oral cancer FTIR spectral data.
Chapter 4
76
CHAPTER 4
A Comparison of Hierarchical,
K-Means and Fuzzy C-Means
Clustering of Oral Cancer Cells
4.1 Introduction
In 2002, Chalmers et al reported the analysis of sets of FTIR spectra
taken from oral cancer tissue samples [26]. In general, the experiments analysed the
tissue samples in two parallel processes. In the first process, the samples were
scanned by FTIR spectroscopy and pre-processing procedures were applied to the
spectra output by the IR spectrometer (see section 3.4). Furthermore, a set of
additional pre-processing techniques, such as mean-centring, variance scaling and first
derivative, was applied to the FTIR spectra empirically, chosen for the specific
multivariate analysis, in order to classify the different tissue types. HCA
(average linkage) was mainly used to classify the spectral data from different types
of tissue area and PCA was used to distinguish these data by visual inspection. In the
second process, the samples were stained with a chemical solution and then
examined through conventional cytology to group the samples into different
functional groups. The outcomes from these two processes were then compared.
The clustering results showed that accurate clustering could only be achieved by
manually applying extra pre-processing techniques that varied according to the
particular sample characteristics and clustering algorithms. However, these
pre-processing procedures needed extra time, software tools and significant human
expertise. If a clustering technique could be developed which obtained clustering
results as good as, or even better than, conventional clinical analysis without the
need for such pre-processing procedures, it would make the diagnosis more efficient
and enable automation.
4.2 Oral Cancer Datasets Description
In the oral cancer FTIR spectra, there are a total of seven datasets taken from
three different patients. The spectral range in this study was limited to the
900−1800cm-1 interval. Figure 4.1 (a) shows a 4× magnification visual image of
one of the Hematoxylin and Eosin stained oral tissue sections. There are two types of
cells (stroma and tumour) in this section, whose regions are clearly identifiable by
their light and dark coloured stains respectively. Figure 4.1 (b) shows a 32×
magnified visual image of a portion of a parallel, unstained section; the
superimposed dashed white line separates the visually different morphologies. Five
single point spectra were recorded from each of the three distinct regions. The
locations of these are marked by “+” on Figure 4.1 (b) and numbered 1−5 for the
upper tumour region, 6−10 for the central stroma layer, and 11−15 for the lower
tumour region. The fifteen FTIR transmission spectra from these positions are
recorded as dataset 1, and the corresponding FTIR spectra (without extra
pre-processing) are shown in Figure 4.2.
Figure 4.1 Tissue sample from Dataset 1: (a) 4× stained picture; (b) 32×
unstained picture.
Figure 4.2 FTIR spectra from Dataset 1.
Figure 4.3 shows a 32× magnified visual image of the unstained section for dataset 2;
the superimposed dashed white line separates the visually different morphologies.
Ten single point spectra, numbered 16−25, were recorded from the tumour region on the
right hand side, and the remaining eight spectra, numbered 26−33, from the stroma
region on the left hand side.
Figure 4.3 32× unstained picture from tissue sample Dataset 2.
Figure 4.4 shows a 32× magnified visual image of the unstained section for dataset 3.
There are also two types of cells (stroma and tumour) in this section. Four spectra,
numbered 34−37, were recorded from the left tumour region, three spectra,
numbered 38−40, from the central stroma layer, and the remaining four spectra,
numbered 41−44, from the right tumour region.
Figure 4.4 32× unstained picture from tissue sample Dataset 3.
Figure 4.5 shows a white light image of three types of tissue from
dataset 4; the different morphologies can be seen in the picture. The
corresponding spectra numbers are also shown below. (The distinct grey-scale
contrast between the left half and right half of the image is artificial. It is a
consequence of the image being a composite of two independent pictures
corresponding to each half.) It may be noticed that the boundary between stroma and
early keratinisation follows a meandering path through area numbers 88, 72, 56 and
55; in a similar manner, the boundary between the marked tumour and stroma
regions does not follow a vertical line as indicated, but rather appears to meander
through the area covered by numbers 50−52, 65−67 and
80−82. A closer histopathological inspection highlighted that there had been
invasion of the stroma region by tumour in the vicinity of the boundary between
the two layers. At this stage of the study, we are only concerned with ascertaining
the spectral characteristics of essentially distinct classes of tissue cells, rather than
gradation processes or mixed types [26]. Therefore the spectra within the two boundary
regions and the invasion area were excluded. The excluded spectra numbers are: 46, 50, 51,
55, 56, 65, 66, 71, 72, and 81−88. That is, the number of spectra was reduced from
the original 48 to 31. Subsequently, the remaining spectral points were
renumbered sequentially from 45 to 75. The three categories of tissue type
in the new spectral numbering are as follows:
Tumour: 45−48, 56−59, 68−71.
Stroma: 49−51, 60−63.
Early keratinisation: 52−55, 64−67, 72−75.
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
Figure 4.5 White light image of tissue sample Dataset 4.
Figure 4.6 presents a tissue sample from dataset 5. Thirty spectra were
recorded over the grid marked on the white light image and their corresponding spectra
numbers are also displayed in Figure 4.6 (a). Figure 4.6(b) shows a
“spectroscopic-staining” image of the same tissue area, which accords well with that
from conventional histopathology H&E staining. Two types of tissue cells (stroma and
tumour) exist in this section; however, the analysis showed that the spectra in the
boundary region, coloured purple in Figure 4.6(b), were closer to tumour than to stroma.
Figure 4.6 Tissue section from dataset 5: (a) white light image; (b) spectroscopic-staining image.
Figure 4.7 displays a tissue section, captured in two images, from which fifteen
spectra were taken. Three visually different areas, numbered 131−135, 136−140 and
141−145, are associated with the characteristics of tumour, stroma and tumour respectively.
Figure 4.7 White light image of tissue sample for dataset 6: (a) part 1; (b) part 2.
Figure 4.8 shows a set of five white light images taken from an oral tissue
section from a third patient. Histopathological examination showed that this is a
complex region containing stroma, tumour and necrotic tissue. A linear scan
consisting of consecutive spectral points was recorded. As with dataset 4, spectra
which lie on a boundary between cell types, or whose spectral characteristics
are not clear, were eliminated from the originally recorded points, leaving 42
spectra. After readjustment of the numbering, the spectra are distributed as follows:
Tumour: 201−210, 225−235.
Stroma: 211−224.
Necrotic: 236−242.
Figure 4.8 White image of tissue sample for Dataset 7.
4.3 Experiments on Oral Cancer Datasets
In this Chapter, three data clustering techniques that have often been used in
FTIR spectroscopy analysis, namely hierarchical cluster analysis (HCA), k-means
and fuzzy c-means clustering, are used to classify the seven oral cancer FTIR spectral
datasets introduced above. The reference tissue classifications for these datasets had
been obtained through conventional cytology [26]; only basic pre-processing was
applied to the spectra, with no extra pre-processing. In
hierarchical clustering, four different linkage methods, namely “single”,
“average”, “complete” and “ward”, were each applied individually. Because the k-means
and fuzzy c-means algorithms are sensitive to their initial states, each was
run ten times. The parameter settings used for the three clustering algorithms
were as follows:
HCA: The Euclidean distance was used to calculate the distance
between data points.
K-means: The squared Euclidean distance was used to compute the
distance between each data point and its centroid; the maximum
number of iterations was 100.
Fuzzy c-means: The fuzziness index was equal to 2; the maximum number of
iterations was 100; the minimal amount of improvement was 10^-5.
As with k-means, the squared Euclidean distance was used to
calculate the distances between data points and centroids.
These algorithms were implemented in Matlab (version
6.5.0, release 13.0.1).
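To make the fuzzy c-means settings concrete, the following is a minimal Python sketch (not the Matlab code used in this study) of the standard algorithm with a fuzziness index of 2, at most 100 iterations and a minimal improvement of 10^-5. Two points are assumptions: the initial centres are drawn at random from the data, and the "minimal amount of improvement" is interpreted as the change in the objective function between iterations. The function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(data, centres, m=2.0):
    """Standard fuzzy c-means membership update for every point and cluster."""
    u = []
    for x in data:
        d = [sq_dist(x, v) for v in centres]
        if 0.0 in d:  # the point coincides with one or more centres
            hits = d.count(0.0)
            u.append([1.0 / hits if dj == 0.0 else 0.0 for dj in d])
            continue
        # d holds squared distances, so the exponent 2/(m-1) becomes 1/(m-1).
        inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
        total = sum(inv)
        u.append([w / total for w in inv])
    return u


def update_centres(data, u, m=2.0):
    """Each centre is the mean of the data weighted by membership to the power m."""
    centres = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centres.append([sum(wi * x[d] for wi, x in zip(w, data)) / total
                        for d in range(len(data[0]))])
    return centres


def fuzzy_c_means(data, c, m=2.0, max_iter=100, min_improvement=1e-5, seed=None):
    rng = random.Random(seed)
    # Assumption: initial centres are c randomly chosen data points.
    centres = [list(p) for p in rng.sample(data, c)]
    previous = float("inf")
    for _ in range(max_iter):
        u = memberships(data, centres, m)
        centres = update_centres(data, u, m)
        objective = sum(u[i][j] ** m * sq_dist(x, centres[j])
                        for i, x in enumerate(data) for j in range(c))
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
    return centres, memberships(data, centres, m)


# Two well separated groups of toy points (hypothetical data).
points = [[0.0, 0.0], [0.3, 0.1], [0.1, 0.4], [5.0, 5.2], [5.3, 4.9], [4.8, 5.5]]
centres, u = fuzzy_c_means(points, 2, seed=0)
hard_labels = [max(range(2), key=lambda j: row[j]) for row in u]
```

A crisp partition, as needed for comparison with the clinical classification, is obtained by assigning each spectrum to the cluster in which it has the highest membership, as in the last line.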
4.3.1 Results
The distribution of the numbers of different types of tissue identified clinically
and as obtained by the three clustering techniques is displayed in Table 4.1. As
mentioned previously, clustering is an unsupervised process; this means that the
result of a clustering is simply a grouping of the data into two or more unlabelled
categories. In the results presented below, the clusters were mapped to the actual
known classifications in such a way as to minimise, in each case, the number of
disagreements with the clinical study. The results are presented in comparison with a
previous study on the same data, in which the data were pre-processed empirically
before the diagnostic analysis. In this study, all three clustering analyses were
performed using MATLAB (version 6.5.0, release 13.0.1).
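The mapping of unlabelled clusters onto the known tissue types can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the procedure actually coded for the thesis: it tries every assignment of clusters to classes, assumes there are no more clusters than classes, and counts a disagreement for each point whose mapped cluster differs from its clinical label. The function name is hypothetical.

```python
from itertools import permutations


def min_disagreements(true_labels, cluster_ids):
    """Return (disagreements, mapping) for the best cluster-to-class mapping."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = None
    # Try every way of giving each cluster a distinct class label.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        wrong = sum(mapping[c] != t for c, t in zip(cluster_ids, true_labels))
        if best is None or wrong < best[0]:
            best = (wrong, mapping)
    return best


# A partition shaped like the single linkage result for dataset 2:
# one cluster of 17 points and one cluster holding a single stroma point.
clinical = ["tumour"] * 10 + ["stroma"] * 8
clusters = [0] * 17 + [1]
count, mapping = min_disagreements(clinical, clusters)
```

For this example the best mapping labels the large cluster as tumour, leaving seven stroma spectra in disagreement with the clinical study.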
Table 4.1 Distribution of the different tissue types identified
clinically and as obtained by the various clustering techniques.
Where a k-means or fuzzy c-means cell holds several values, these are the
different results obtained over the ten runs.
                                 Clinical  Hierarchical clustering
Dataset    Tissue type           study     Single  Average  Complete  Ward  k-means      fuzzy c-means
Dataset 1  Tumour                10        10      10       10        10    10           10
           Stroma                5         5       5        5         5     5            5
Dataset 2  Tumour                10        17      9        9         9     9            9
           Stroma                8         1       9        9         9     9            9
Dataset 3  Tumour                8         4       4        8         7     3, 6, 4      4
           Stroma                3         7       7        3         4     8, 5, 7      7
Dataset 4  Tumour                12        19      19       12        12    11, 19, 13   19, 11
           Stroma                7         5       5        7         7     8, 5, 6      5, 8
           Early keratinisation  12        7       7        12        12    12, 7, 12    7, 12
Dataset 5  Tumour                18        1       18       18        18    17, 14       16
           Stroma                12        29      12       12        12    13, 16       14
Dataset 6  Tumour                10        10      10       10        10    10           10
           Stroma                5         5       5        5         5     5            5
Dataset 7  Tumour                21        28      17       17        15    17           16
           Stroma                14        13      18       18        20    18           18
           Necrotic              7         1       7        7         7     7            8
From Table 4.1 it can be seen that, in most of the datasets, the number of data
points assigned to the various categories does not exactly match the results from the
clinical study. This is because some of the data that should have been classified in
the tumour cluster have been misclassified into the stroma cluster and vice versa. For
example, in dataset 2, using the hierarchical clustering single linkage method, the
number of data points considered as tumour is 17, while 1 is considered as stroma.
In fact, 10 data points belong to tumour and 8 belong to stroma, so 7 data points
have been misclassified. These points have been placed in the tumour
cluster as extra data points. The extra data points produced by these clustering
techniques are regarded as the number of disagreements of classification in comparison
with the results from the previous clinical study. The comparison results are shown in
Table 4.2.
Table 4.2 Comparison results based on the number of disagreements
between clinical study and the various clustering results.
Where a k-means or fuzzy c-means cell holds several values, these are the
different results obtained over the ten runs.
                                 Hierarchical clustering
Dataset    Tissue type           Single  Average  Complete  Ward  k-means   fuzzy c-means
Dataset 1  Tumour                0       0        0         0     0         0
           Stroma                0       0        0         0     0         0
Dataset 2  Tumour                7       0        0         0     0         0
           Stroma                0       1        1         1     1         1
Dataset 3  Tumour                0       0        0         0     0, 0, 0   0
           Stroma                4       4        5         3     5, 2, 4   4
Dataset 4  Tumour                7       7        3         3     3, 7, 3   3, 7
           Stroma                5       5        3         3     4, 5, 2   4, 5
           Early keratinisation  0       0        0         0     0, 0, 0   0, 0
Dataset 5  Tumour                12      0        0         0     0, 0      0
           Stroma                1       0        0         0     1, 4      2
Dataset 6  Tumour                0       0        0         0     0         0
           Stroma                0       0        0         0     0         0
Dataset 7  Tumour                7       0        0         0     0         0
           Stroma                0       4        4         6     4         4
           Necrotic              1       0        0         0     0         1
After running each clustering technique ten times, it can be seen that the
k-means and fuzzy c-means algorithms obtained more than one clustering result in
some datasets. This is because different initialisations may lead to different
partitions for both of these algorithms. From Tables 4.1 and 4.2, k-means shows more
variation (3 out of 7 datasets) than fuzzy c-means (1 out of 7 datasets); the
corresponding frequencies (out of 10 runs) are shown in Table 4.3.
Table 4.3 Clustering variations for k-means and fuzzy c-means
within three datasets.
Dataset     K-means               Fuzzy c-means
Dataset 3   2/10   3/10   5/10    -
Dataset 4   3/10   3/10   4/10    9/10   1/10
Dataset 5   5/10   5/10           -
4.3.2 Discussion
In order to further investigate the performance of the different clustering
methods, the average number of disagreements over all datasets was calculated, as
shown in Table 4.4. It can be seen that the hierarchical clustering single linkage
method has the worst performance, the average linkage method performs better than
single linkage, and the complete linkage and Ward methods perform the best
overall. However, hierarchical clustering techniques are computationally expensive
(proportional to n^2, where n is the number of spectra); therefore, they are not
suitable for very large datasets [17]. K-means and fuzzy c-means have fairly good
performance, and for both the computational effort is approximately linear in n.
Hence, compared with hierarchical clustering, these techniques will be far less
time-consuming on large datasets [17]. Moreover, although k-means has a slightly better
performance than fuzzy c-means (slightly fewer disagreements, on average), it can be
seen from the standard deviations in Table 4.4 that k-means exhibits more variation
in its results than fuzzy c-means. Hence, the overall conclusion is that fuzzy c-means
is the most suitable clustering method in this context.
Table 4.4 Average number of disagreements obtained in the three
clustering methods.
                                          Hierarchical clustering
                                          Single  Average  Complete  Ward  K-means    Fuzzy c-means
Average (S.D.) Number of
Disagreements per Run                     44      21       16        16    18.8±5.8   19.5±1.6
Average (S.D.) Number of
Disagreements per Run per Dataset         6.3     3.0      2.3       2.3   2.7±0.8    2.8±0.2
4.4 Summary
In their previous study, Chalmers et al investigated seven sections of tissue
samples containing oral cancer cells using two comparative parallel processes: that
is, histological analysis and FTIR spectroscopy with the subsequent application of
multivariate analysis [26]. Prior to the multivariate analysis, all spectral data had to
be empirically pre-processed. It was found that accurate clustering could only be
achieved by manually applying extra pre-processing techniques that varied according
to the particular sample characteristics. Furthermore, these pre-processing methods
required additional time, software tools and significant human expertise to perform.
In this Chapter, three clustering techniques commonly used in FTIR spectroscopic
data analysis, namely HCA, k-means and fuzzy c-means, were applied to the same
seven spectral datasets as reported by Chalmers et al, but without any extra
pre-processing procedure. Single, average and complete linkage and Ward’s method were
employed in the HCA clustering.
The experimental results showed that the single linkage method obtained the
worst clustering results, the average linkage method performed better than
single linkage but, overall, complete linkage and Ward’s method obtained the best
solutions. However, one major drawback of the HCA clustering algorithm is its
high computational expense. Therefore, for very large datasets (which normally
appear in practical FTIR spectral analysis), this method may not be suitable. On the
other hand, the k-means and fuzzy c-means algorithms also achieved good
performance. In addition, they require less computational resources than the
HCA method. However, from the clustering results it can be
seen that the k-means clustering algorithm generated less consistent clustering results
than fuzzy c-means. Overall, it may be suggested that fuzzy c-means is a more
suitable method for classifying the FTIR spectral data in this study.
Chapter 5
93
CHAPTER 5
Methods for Automatically
Determining the Number of Clusters
5.1 Introduction
In a real medical diagnostic application, for a previously unseen tissue sample,
the number of different types of cells is normally not known in advance. Based on
this fact, a clustering technique which can automatically obtain the appropriate
number of tissue types is required. Many clustering methods have been
developed in an attempt to automatically determine the optimal number of clusters.
Recently, Bandyopadhyay proposed a Variable String Length Simulated Annealing
(VFC-SA) algorithm [123], which applied a simulated annealing algorithm to the
fuzzy c-means clustering technique and used a cluster validity index measure as the
energy function. This has the advantage that, by using simulated annealing, the
algorithm can escape local optima and, therefore, may be able to find the globally
optimal solution(s). The Xie-Beni index was used as the cluster validity index to
evaluate the quality of the solutions; the author stated that this is because it has been
shown to be able to detect the correct number of clusters in several experiments
[124]. The smallest index value corresponds to the best clustering obtained from all
partitions that are generated by the clustering method. Hence this VFC-SA algorithm
can generally avoid the limitations which exist in the standard fuzzy c-means
algorithm. However, when we implemented this proposed algorithm, it was found
that sub-optimal solutions could be obtained in certain circumstances. In order to
overcome this limitation, we extended the original VFC-SA algorithm to produce the
Simulated Annealing Fuzzy Clustering (SAFC) algorithm. In this chapter, the
original VFC-SA and the extended SAFC algorithm are described in detail. The
experiments described in Chapter 4 were repeated on the same seven FTIR
spectral datasets containing oral cancer cells in order to evaluate the performance of
the VFC-SA and SAFC clustering algorithms in comparison with the original fuzzy
c-means algorithm.
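As an illustration of how a validity index can act as an energy function, the sketch below computes the Xie-Beni index in its commonly quoted form: the membership-weighted within-cluster scatter divided by n times the smallest squared distance between two centres. It is a Python sketch under the assumption that this form matches equation (2.11); the function names and toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(data, centres, u):
    """Xie-Beni index: compactness / (n * minimum centre separation).

    Smaller values indicate compact, well separated clusters.
    Requires at least two distinct centres.
    """
    n = len(data)
    compactness = sum(u[i][j] ** 2 * sq_dist(x, v)
                      for i, x in enumerate(data)
                      for j, v in enumerate(centres))
    separation = min(sq_dist(centres[j], centres[k])
                     for j in range(len(centres))
                     for k in range(j + 1, len(centres)))
    return compactness / (n * separation)


# Four one-dimensional points in two tight pairs, with crisp memberships.
data = [[0.0], [1.0], [4.0], [5.0]]
u = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
good = xie_beni(data, [[0.5], [4.5]], u)   # centres on the two pairs
poor = xie_beni(data, [[0.5], [1.5]], u)   # second centre misplaced
```

With the centres placed on the two pairs the index is 1/64; moving the second centre away from its pair raises the index, which is the behaviour that the annealing search exploits.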
5.2 VFC-SA Clustering Algorithm
In this algorithm, a variable number of cluster centres was encoded using a
variable length string to which simulated annealing was applied. At a given
temperature, the new state (string encoding) was accepted with a probability
$1/(1 + \exp(-(E_c - E_n)/T))$, where $E_n$ and $E_c$ represent the new energy and
current energy respectively, and $T$ is the current temperature.
The Xie-Beni index, $V_{XB}$, was used to evaluate a clustering. The
initial state of the VFC-SA was generated by randomly choosing $c$ points from the
dataset to be cluster centres, where $c$ is an integer within the range
$[c_{\min}, c_{\max}]$. The values $c_{\min} = 2$ and $c_{\max} = \sqrt{n}$ (where $n$ is
the number of data points) were used, following the suggestion proposed by Bezdek in
[73]. The initial temperature $T$ was set to a high temperature $T_{\max}$. A neighbour
of the solution was produced by making one of several possible random alterations to
the string describing the cluster centres (as described below) and then the energy of
the new solution was calculated. The new solution was kept if it satisfied the
simulated annealing acceptance requirement. This process was repeated for a certain
number of iterations, $k$, at the given temperature. A cooling rate, $r$, where
$0 < r < 1$, was used to decrease the current temperature by $T = rT$. This was
repeated until $T$ reached the termination temperature $T_{\min}$, at which point the
current solution was returned. The whole VFC-SA algorithm process is summarised in the
steps shown in Figure 5.1.
The process of altering the current cluster centres comprised three functions.
They are: perturbing an existing centre (Perturb Centre), splitting an existing centre
(Split Centre) and deleting an existing centre (Delete Centre). At each iteration, one
of the three functions was randomly chosen. When splitting or deleting a centre, the
cluster sizes were used to select a centre. The size, $|C_j|$, of a cluster, $j$, can be
expressed by:
$|C_j| = \sum_{i=1}^{n} \mu_{ij}, \quad \forall j = 1, \ldots, c$ (5.1)
where $c$ is the number of clusters. The three functions are described below.
1) Set parameters $T_{\max}, T_{\min}, c, k, r$.
2) Initialise the string by randomly choosing $c$ data points from the dataset to be
cluster centres.
3) Compute the corresponding membership values using equation (2.7):
$\mu_{ij} = 1 \Big/ \sum_{k=1}^{C} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}$
4) Calculate the initial energy $E_c$ using the $V_{XB}$ index from equation (2.11):
$V_{XB} = \pi / s$
5) Set the current temperature $T = T_{\max}$.
6) While $T \geq T_{\min}$
   6.1) For $i = 1$ to $k$
      6.1.1) Randomly alter a current centre in the string.
      6.1.2) Compute the corresponding membership values using equation (2.7).
      6.1.3) Compute the corresponding centres with equation (2.6):
      $v_j = \sum_{i=1}^{n} (\mu_{ij})^m x_i \Big/ \sum_{i=1}^{n} (\mu_{ij})^m, \quad \forall j = 1, \ldots, C$
      6.1.4) Calculate the new energy $E_n$ from the new string.
      6.1.5) If $E_n < E_c$, then accept the new string and set it as the current
      string.
      6.1.6) Else accept the new string with a certain probability.
   6.2) End for
   6.3) $T = rT$.
7) End while.
8) Return the current string as the final solution.
Figure 5.1 VFC-SA clustering algorithm procedure.
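The control flow of Figure 5.1 can be sketched generically in Python. This is only an outline under stated assumptions: the acceptance probability is taken to be the logistic form $1/(1 + \exp((E_n - E_c)/T))$, so that lower-energy strings are more likely to be kept, and the energy and neighbour functions are passed in as arguments rather than being the Xie-Beni index and the three centre operations. The names `accept_probability` and `anneal` are hypothetical.

```python
import math
import random


def accept_probability(e_new, e_cur, t):
    """Logistic acceptance rule (assumed form): 1 / (1 + exp((E_n - E_c) / T))."""
    z = (e_new - e_cur) / t
    if z > 700:  # exp would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def anneal(initial, energy, neighbour, t_max, t_min, k, r, seed=None):
    """Steps 4 to 8 of Figure 5.1 for an arbitrary state representation."""
    rng = random.Random(seed)
    current, e_cur = initial, energy(initial)
    t = t_max
    while t >= t_min:
        for _ in range(k):
            candidate = neighbour(current, rng)
            e_new = energy(candidate)
            # Better states are always kept; worse ones only with some probability.
            if e_new < e_cur or rng.random() < accept_probability(e_new, e_cur, t):
                current, e_cur = candidate, e_new
        t *= r  # geometric cooling, T = rT
    return current, e_cur


# Toy problem (hypothetical): find the integer that minimises (x - 3)^2.
best, best_energy = anneal(
    10,
    energy=lambda x: (x - 3) ** 2,
    neighbour=lambda x, rng: x + rng.choice((-1, 1)),
    t_max=10.0, t_min=0.001, k=50, r=0.5, seed=1,
)
```

In VFC-SA the state would be the variable length string of cluster centres, the energy would be the validity index computed after updating memberships and centres, and the neighbour function would apply one of the three centre operations.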
a) Perturb Centre
A random centre in the string is selected. This centre position is then modified
through addition of the change rate $cr[d] = r \cdot pr \cdot v_{current}[d]$, where
$v_{current}$ is the selected centre and $d = 1, \ldots, N$, where $N$ is the number of
dimensions. $r$ is a random number in the range [−1, 1] and $pr$ is the perturbation
rate, which was set through initial experimentation to 0.007 as this gave the best
trade-off between the quality of the solutions produced and the time taken to achieve
them. If $v_{current}[d]$ and $v_{new}[d]$ represent the current and new centre,
respectively, then Perturb Centre can be expressed as:
$v_{new}[d] = v_{current}[d] + cr[d]$.
b) Split Centre
The size of each cluster is calculated using equation (5.1). The centre of the
largest cluster is then replaced by two new centres created by the following
procedure. The point whose membership value to the selected centre is less than
but closest to 0.5 is identified as the reference, $w_{reference}$. Then the
distance between this reference point and the currently chosen centre is calculated
using: $dist[d] = |v_{current}[d] - w_{reference}[d]|$. Finally, the two new centres
are obtained by $v_{new}[d] = v_{current}[d] \pm dist[d]$.
c) Delete Centre
As opposed to Split Centre, the smallest cluster is identified and its centre
deleted from the string encoding.
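The three functions can be sketched in Python as follows. The sketch follows the descriptions above, with two assumptions that the text leaves open: a fresh random number r is drawn for each dimension in Perturb Centre, and the two new centres produced by Split Centre are appended to the end of the string. All function names are hypothetical.

```python
import random


def cluster_sizes(u):
    """Fuzzy size of each cluster: the sum of its memberships, as in equation (5.1)."""
    return [sum(row[j] for row in u) for j in range(len(u[0]))]


def perturb_centre(centres, rng, pr=0.007):
    """Move one randomly selected centre by r * pr * v[d] in each dimension."""
    new = [list(v) for v in centres]
    j = rng.randrange(len(new))
    # Assumption: r is drawn independently for every dimension d.
    new[j] = [v + rng.uniform(-1.0, 1.0) * pr * v for v in new[j]]
    return new


def split_centre(data, centres, u):
    """Replace the centre of the largest cluster by v + dist and v - dist."""
    sizes = cluster_sizes(u)
    j = max(range(len(centres)), key=sizes.__getitem__)
    # Reference point: membership to centre j below 0.5 but closest to it.
    candidates = [i for i in range(len(data)) if u[i][j] < 0.5]
    if not candidates:
        return [list(v) for v in centres]  # no reference point is available
    ref = data[max(candidates, key=lambda i: u[i][j])]
    dist = [abs(v - w) for v, w in zip(centres[j], ref)]
    plus = [v + d for v, d in zip(centres[j], dist)]
    minus = [v - d for v, d in zip(centres[j], dist)]
    return [list(v) for k, v in enumerate(centres) if k != j] + [plus, minus]


def delete_centre(centres, u):
    """Remove the centre of the smallest cluster from the string."""
    sizes = cluster_sizes(u)
    j = min(range(len(centres)), key=sizes.__getitem__)
    return [list(v) for k, v in enumerate(centres) if k != j]


# Toy example (hypothetical): two well separated one-dimensional clusters.
data = [[0.0], [1.0], [10.0]]
centres = [[0.5], [10.0]]
u = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
after_split = split_centre(data, centres, u)
after_delete = delete_centre(centres, u)
after_perturb = perturb_centre(centres, random.Random(1))
```

In this toy example the only point with a membership below 0.5 to the largest cluster lies in the other cluster, so the split places the new centres at 10.0 and −9.0, far from the data that the original centre represented.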
5.3 SAFC Clustering Algorithm
When the original VFC-SA algorithm was implemented on a wider set of test
cases than used by the original authors [123], it was found to suffer from several
difficulties. In order to overcome these difficulties, four extensions to the algorithm
have been proposed. In addition, some details were not explicit in the original
algorithm, so that there were ambiguities present. In this Section, the focus is placed
on the extensions to VFC-SA in order to describe the proposed SAFC algorithm.
Also, the entire algorithm is stated explicitly in order to resolve the ambiguities.
The first extension is in the initialisation of the string. Instead of the original
initialisation in which random data points were chosen as initial cluster centres, the
fuzzy c-means clustering algorithm was applied using a random integer
$c \in [c_{\min}, c_{\max}]$ as the number of clusters. The cluster centres obtained from
the fuzzy c-means clustering are then utilised as the initial cluster centres for SAFC.
This is because starting from the centres produced by a previous clustering leads to a
better initialisation.
The second extension is in Perturb Centre. The method of choosing a centre in
the VFC-SA algorithm is to randomly select a centre from the current string.
However, this means that even a ‘good’ centre can be altered. In contrast, if the
weakest (smallest) centre is chosen, the situation in which an already good (large)
centre is destabilized is avoided. Ultimately, this can lead to a quicker and more
productive search as the poorer regions of a solution can be concentrated upon.
The third extension is in Split Centre. If the boundary between the biggest
cluster and the other clusters is not obvious (not very marked), then the approach that
the original authors use is to choose a reference point with a membership degree that is
less than but closest to 0.5. That is to say, there are some data points whose
membership degree to the chosen centre is close to 0.5. However, there is another
situation that can also occur in the process of splitting a centre: the biggest cluster
is separate and distinct from the other clusters. For example, let there be two clusters
in a set of data points which are separated, with a clear boundary between them. The
corresponding cluster centres at a specific time in the search are v1 and v2, as shown
in Figure 5.3 (shown in two dimensions). The biggest cluster is chosen, say that of v1.
Then a data point whose membership degree is closest to but less than 0.5 can only
be chosen from the data points that belong to v2 (where the data points have
membership degrees less than 0.5 to v1). So, for example, the data point w1 (which is
closest to v1) is chosen as the reference data point. The new centres will then move
to vnew1 and vnew2. These centres are far from the ideal solution. Although
the new centres would subsequently be changed by the Perturb Centre function, it will
inevitably take longer to ‘repair’ the solution. In the modified approach, two
new centres are created within the biggest cluster. The same dataset as in Figure 5.3
is used to illustrate this process. A data point, w1, is chosen whose membership of the
cluster is closest to the mean of the membership degrees that are above 0.5.
Since the memberships of a data point over all clusters sum to one, a membership
greater than 0.5 must be that point's largest membership. Hence,
points with memberships above 0.5 can be deemed to be ‘close’ to the cluster centre.
The mean of the memberships above 0.5 thus represents a point which is close, but not
too close, to the cluster centre. Then two new centres vnew1 and vnew2 are created
according to the distance between v1 and w1. This is shown in Figure 5.4. The
new centres are clearly better than the ones in Figure 5.3 and therefore better
solutions are likely to be found in the same time (number of iterations).
A brief overview of the split centre procedure is as follows:
1) Calculate the size of each cluster and select the biggest cluster, whose
cluster centre is v1.
2) Check whether there is any data point within the biggest cluster whose
membership value to v1 is less than 0.5 but greater than 0.4.
2.1) If there is, then apply the approach that the original authors used to
find the reference data point, as illustrated in Figure 5.3.
2.2) Else, apply the extended split centre approach to find the reference
point, as illustrated in Figure 5.4.
Figure 5.2 The split centre procedure.
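The choice of reference point in steps 2, 2.1 and 2.2 can be sketched as below. The function name and the list-based input are hypothetical, and the sketch covers only the selection of the reference point, not the creation of the two new centres from the distance between v1 and w1.

```python
def choose_reference_point(memberships):
    """Return the index of the reference data point for Split Centre.

    `memberships[j]` is the membership degree of data point j to v1,
    the centre of the biggest cluster.
    """
    # Step 2: points whose membership to v1 lies strictly between 0.4 and 0.5.
    borderline = [j for j, u in enumerate(memberships) if 0.4 < u < 0.5]
    if borderline:
        # Step 2.1 (original approach): membership below but closest to 0.5.
        return max(borderline, key=lambda j: memberships[j])
    # Step 2.2 (extended approach): membership closest to the mean of the
    # memberships that are above 0.5.
    core = [j for j, u in enumerate(memberships) if u > 0.5]
    if not core:
        raise ValueError("no data point has a membership above 0.5")
    target = sum(memberships[j] for j in core) / len(core)
    return min(core, key=lambda j: abs(memberships[j] - target))
```

In the situation of Figure 5.3 no membership to v1 falls between 0.4 and 0.5, so the second branch would be taken.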
The fourth extension is in the final step of the algorithm (return the current
solution as the final solution). In the SAFC algorithm, the best centre positions (with
the best VXB index value) that have been encountered are stored throughout the
search. At the end of the search, rather than returning the current solution, the best
solution seen throughout the whole duration of the search is returned.
Aside from these four extensions, we also ensure that the number of clusters c
always remains within the range [c_min, c_max]. Therefore, when splitting a centre,
if the number of clusters has reached c_max then the operation is disallowed.
Likewise, when deleting a centre, the
operation is not allowed if the number of clusters in the current solution is c_min.
[Figure 5.3: diagram showing the centres v1 (deleted) and v2, the reference data point w1 with μ11 = 0.15 and μ12 = 0.85, and the new centres vnew1 and vnew2.]
Figure 5.3 An illustration of Split Centre from the original algorithm with distinct clusters (where μ11 and μ12 represent the membership degrees of w1 to
the centres v1 and v2 respectively).
[Figure 5.4: diagram showing the centres v1 (deleted) and v2, the reference data point w1, and the new centres vnew1 and vnew2.]
Figure 5.4 The new Split Centre applied to the same dataset as Figure 5.3, above (where w1 is now the data point that is closest to the mean value of the
membership degree above 0.5).
Based on all the extensions and enhancements to the VFC-SA algorithm, the
SAFC algorithm procedure can be described in the following steps:
1) Set parameters T_max, T_min, c, k, r.
2) Initialise the string by applying the fuzzy c-means algorithm to generate c cluster
centres from the original dataset.
3) Calculate the initial current energy E_c and best energy E_b by applying the VXB
index from equation (2.11) to the obtained cluster centres and membership values.
4) Set the current temperature T = T_max.
5) While T ≥ T_min
5.1) For i = 1 to k
5.1.1) Randomly alter the state of a current centre in the string.
5.1.2) Compute the corresponding membership values using equation
(2.7).
5.1.3) Compute the corresponding centres with equation (2.6).
5.1.4) Calculate the new energy E_n from the new string.
5.1.5) If E_n < E_c, then accept the new string and set it as the current
string.
5.1.6) Else, accept the new string with a certain probability.
5.1.7) If E_c < E_b, then E_b = E_c, and set the current string as the best string.
5.2) End for
5.3) T = rT.
6) End while.
7) Return the best string as the final solution.
Figure 5.5 The SAFC clustering algorithm.
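The control flow of Figure 5.5 can be sketched generically as below. This is a sketch under stated assumptions, not the SAFC implementation: `energy` and `neighbour` are hypothetical callables standing in for the VXB calculation and for steps 5.1.1 to 5.1.3 respectively, and the acceptance probability exp(-(E_n - E_c)/T) is the usual Metropolis rule, assumed here because the figure only says "a certain probability".

```python
import math
import random

def anneal(initial, energy, neighbour, t_max, t_min, k, r, rng=None):
    """Simulated annealing that returns the best string seen, not the last one."""
    rng = rng or random.Random()
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    t = t_max
    while t >= t_min:
        for _ in range(k):
            candidate = neighbour(current, rng)
            e_new = energy(candidate)
            # Accept improvements outright; accept worse strings with
            # probability exp(-(E_n - E_c) / T) (assumed Metropolis rule).
            if e_new < e_current or rng.random() < math.exp(-(e_new - e_current) / t):
                current, e_current = candidate, e_new
            # Fourth extension: remember the best string encountered so far.
            if e_current < e_best:
                best, e_best = current, e_current
        t *= r  # geometric cooling, T = rT
    return best, e_best
```

With T_max = 3, T_min = 10^-5, r = 0.9 and k = 40, this loop visits 120 temperature levels and evaluates 4800 candidate strings.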
5.4 Evaluation of VFC−SA and SAFC Clustering of Oral
Cancer Cells
In order to assess the relative performance of the VFC-SA and SAFC
algorithms in comparison with the standard fuzzy c-means algorithm, the following
experiments were conducted. The same seven clinical oral cancer datasets as used in
Chapter 4 were used in this investigation. The number of different types of cells in
each tissue section, as determined by clinical analysis, was taken as the reference
number of clusters and was also used as the number of clusters for fuzzy c-means. The
Xie-Beni (VXB) index value has been utilised throughout to evaluate the quality of the
classification for these three algorithms. The parameters for VFC-SA and SAFC
were: T_min = 10^-5, k = 40, r = 0.9. T_max was set to 3 in all cases. This is because the
maximum temperature has a direct impact on how much worse the XB index value
of a solution can be and still be accepted at the beginning of the search. If the T_max
value is set too high, the earlier stages of the search may be less productive because simulated
annealing will accept almost all of the solutions and, therefore, will behave like
random search. In the original VFC-SA algorithm, the initial value of
T_max was 100, but this led to a large amount of time being spent on random search. In the
present experiments, T_max was empirically set to three based on the
observation that the percentage of worse solutions that were accepted was then around
60%. In 1996, Rayward-Smith et al. discussed starting temperatures for simulated
annealing search procedures and concluded that a starting temperature that results in
60% of worse solutions being accepted yields a good balance between the usefulness
of the initial search and overall search time (i.e. high enough to allow some worse
solutions, but low enough to avoid conducting a random walk through the search
space and wasting search time) [125].
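One way such an empirical choice of starting temperature could be made is sketched below. It assumes the Metropolis acceptance probability exp(-Δ/T) for a worsening move of size Δ = E_n - E_c, which is not stated explicitly in this chapter; the function names and the idea of scanning a list of candidate temperatures are illustrative only.

```python
import math

def worse_acceptance_ratio(deltas, t):
    """Expected fraction of worsening moves accepted at temperature t.

    `deltas` are energy increases (E_n - E_c > 0) sampled from trial moves.
    Assumes the Metropolis acceptance probability exp(-delta / t).
    """
    worse = [d for d in deltas if d > 0]
    if not worse:
        raise ValueError("no worsening moves were supplied")
    return sum(math.exp(-d / t) for d in worse) / len(worse)

def pick_start_temperature(deltas, candidates, target=0.6):
    """Return the candidate temperature whose acceptance ratio is closest to target."""
    return min(candidates, key=lambda t: abs(worse_acceptance_ratio(deltas, t) - target))
```

A very high temperature drives the ratio towards one, which corresponds to the random-search behaviour described above.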
Solutions for the seven FTIR datasets were generated by using the fuzzy c-means,
VFC-SA and SAFC algorithms. Each method was allowed 10 runs on each
dataset. As mentioned at the beginning of this section, the number of clusters was
predetermined for fuzzy c-means through clinical analysis. The outputs of fuzzy
c-means (centres and membership degrees) were then used to compute the corresponding
VXB index value. VFC-SA and SAFC automatically found the number of clusters by
choosing the solution with the smallest VXB index value. Table 5.1 shows the average
VXB index values obtained after ten runs of each algorithm (best average is in bold).
Table 5.1 Average of the VXB index values obtained when using the
fuzzy c-means, VFC-SA and SAFC algorithms.
           Average VXB Index Value
Dataset    Fuzzy C-Means    VFC-SA      SAFC
1          0.048036         0.047837    0.047729
2          0.078896         0.078880    0.078076
3          0.291699         0.282852    0.077935
4          0.416011         0.046125    0.046108
5          0.295937         0.251705    0.212153
6          0.071460         0.070533    0.070512
7          0.140328         0.149508    0.135858
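For reference, the calculation of a Xie-Beni value from a set of centres and membership degrees can be sketched as follows. The sketch uses the commonly quoted form of the index (fuzzy within-cluster compactness divided by n times the minimum squared distance between centres) with a fuzzifier m = 2; it is assumed, not verified here, that this agrees in detail with equation (2.11). The array layout is also an assumption.

```python
import numpy as np

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni validity index (smaller is better).

    X: (n, d) data, V: (c, d) centres, U: (n, c) membership degrees.
    """
    # Squared distance from every data point to every centre, shape (n, c).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # Squared distance between every pair of centres, shape (c, c).
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)  # ignore each centre's distance to itself
    return compactness / (X.shape[0] * sep.min())
```

Because the minimum separation between centres appears in the denominator, solutions with different numbers of clusters can yield very different index values.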
From Table 5.1, it can be seen that in all seven datasets, the average
VXB values of the solutions found by SAFC are smaller than those of both VFC-SA and fuzzy
c-means. This means that the clusters obtained by SAFC have, on average, better
VXB index values than the other two approaches. Put another way, it may also
indicate that SAFC is able to escape sub-optimal solutions better than the other two
methods.
In datasets 1, 2, 4 and 6, the average VXB index value for SAFC is only
slightly smaller than that obtained using VFC-SA. Nevertheless, when the
Mann-Whitney test (with p<0.01) [126] was performed on the results of these two
algorithms, the VXB index for SAFC was found to be statistically significantly lower
than that for VFC-SA for all datasets.
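A comparison of this kind can be sketched with an exact, one-sided rank-sum test, as below. Whether the test reported above was one-sided or two-sided, exact or approximate, is not stated, so those choices are assumptions, as are the function names.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_less(x, y):
    """Exact one-sided p-value for 'values in x tend to be smaller than in y'."""
    ranks = average_ranks(list(x) + list(y))
    observed = sum(ranks[:len(x)])
    # Enumerate every way of assigning len(x) of the pooled ranks to group x.
    sums = [sum(s) for s in combinations(ranks, len(x))]
    return sum(1 for s in sums if s <= observed + 1e-9) / len(sums)
```

With ten runs per algorithm there are 184756 possible rank assignments, so complete separation of the two sets of index values would give a one-sided p-value of 1/184756.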
The number of clusters obtained by VFC-SA and SAFC for each dataset is
presented in Table 5.2. The brackets indicate the number of runs for which that
particular cluster number was returned. For example on dataset 5, the VFC-SA
algorithm found 2 clusters in 5 runs and 3 clusters in the other 5 runs. The number of
clusters identified by clinical analysis is also shown for comparative purposes.
From Table 5.2, it can be observed that in datasets 3, 4, 5 and 7, either one or
both of VFC-SA and SAFC obtained solutions with a different number of clusters
from that provided by clinical analysis. In fact, with datasets 5 and 7, VFC-SA even
produced a variable number of clusters within the 10 runs. Returning to the VXB
index values of Table 5.1, it was shown that all the average VXB index values
obtained by SAFC are better.
Table 5.2 Comparison of the number of clusters achieved by clinical
analysis, VFC-SA and the SAFC methods.
           Number of Clusters in Solution
Dataset    Clinical    VFC-SA        SAFC
1          2           2(10)         2(10)
2          2           2(10)         2(10)
3          2           2(10)         3(10)
4          3           2(10)         2(10)
5          2           2(5), 3(5)    3(10)
6          2           2(10)         2(10)
7          3           3(9), 4(1)    3(10)
It can be observed that the average VXB index values obtained by SAFC
for datasets 3, 4 and 5 are much smaller than those of fuzzy c-means. These
three datasets are also the datasets for which SAFC obtained a different number of
clusters from clinical analysis. In dataset 3, the average VXB index value for SAFC is
much smaller than for VFC-SA. This is because the number of clusters obtained by
these two algorithms is different (see Table 5.2). A different number of
clusters leads to a different cluster structure, and so there can be a big difference in the
validity index. In datasets 5 and 7, the differences in VXB index values are noticeable,
though not as big as in dataset 3. This is because in these two datasets, some runs of
VFC-SA obtained the same number of clusters as SAFC.
In order to examine the results further, the data has been plotted using the first
and second principal components in two dimensions. These have been extracted
using the principal component analysis (PCA) technique [95,127]. The data has been
plotted in this way because, although the FTIR spectra are limited to the range
900-1800 cm^-1, there are still 901 absorbance values (one per
wavenumber) for each spectrum. The first and second principal components are the
components that capture the most variance in the original data. Therefore, although the
data is multidimensional, the principal components can be plotted to give an
approximate visualisation of the solutions that have been achieved. Figures 5.6−5.12
show the results for datasets 1-7 respectively using fuzzy c-means, VFC-SA and
SAFC (the data in each cluster is depicted using different markers and each cluster
centre is represented by a star). The first and second principal components in datasets
1-7 contain 96.14, 96.30, 89.76, 93.57, 79.28, 94.17 and 82.64 percent of the
variance in the original data, respectively. The percentage of the total variability
explained by the first two principal components was obtained from the third output
(variances) of the function ‘princomp’ in Matlab. The formula can be expressed by:
percent = 100 × sum(first N variances) / sum(all variances) (5.2)
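Equation (5.2) can be reproduced outside Matlab with a short sketch such as the one below. It is an illustration only: it obtains the component variances from a singular value decomposition of the mean-centred data, which should correspond to the variances returned by ‘princomp’, and the function name is hypothetical.

```python
import numpy as np

def percent_variance_explained(X, n_components):
    """Percentage of total variance captured by the first n_components PCs.

    X: (n_samples, n_variables) data matrix, e.g. spectra by wavenumbers.
    """
    Xc = X - X.mean(axis=0)                  # centre each variable
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    variances = s ** 2 / (X.shape[0] - 1)    # variance along each PC
    return 100.0 * variances[:n_components].sum() / variances.sum()
```

For the two-dimensional plots in this section, `n_components` would be 2.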
It should be noted that when a figure depicts a cluster result from more than
one algorithm, such as in Figure 5.6, it means that the partition results obtained from
those algorithms are the same. It may be that the positions of the centres are slightly
different as the validity index values from each algorithm are not exactly the same.
In each case, the results of clinical analysis are shown, either in a legend or by
directly labelling the points.
[Figure 5.6: scatter plot of data points 1-15. Axes: 1st Principal Component (x), 2nd Principal Component (y). Legend: tumour, stroma, centre.]
Figure 5.6 Fuzzy C-Means, VFC-SA and SAFC cluster results for dataset 1.
[Figure 5.7: scatter plots (a) and (b) of data points 16-33. Axes: 1st Principal Component (x), 2nd Principal Component (y). Clinical labels: tumour, stroma.]
Figure 5.7 Cluster results for dataset 2 obtained from
(a) Fuzzy C-Means, VFC-SA and 3/10 runs from SAFC (b) 7/10 runs from SAFC.
[Figure 5.8: scatter plots (a) and (b) of data points 34-44. Axes: 1st Principal Component (x), 2nd Principal Component (y). Clinical labels: tumour, stroma.]
Figure 5.8 Cluster results for dataset 3 obtained from
(a) Fuzzy C-Means and VFC-SA (b) SAFC.
[Figure 5.9: scatter plots (a) and (b) of data points 45-75. Axes: 1st Principal Component (x), 2nd Principal Component (y). Clinical labels: stroma, early keratinisation, tumour.]
Figure 5.9 Cluster results for dataset 4 obtained from
(a) Fuzzy C-Means (b) VFC-SA and SAFC.
[Figure 5.10: scatter plots (a) and (b) of data points 101-130. Axes: 1st Principal Component (x), 2nd Principal Component (y). Clinical labels: tumour, stroma.]
Figure 5.10 Cluster results for dataset 5 obtained from
(a) Fuzzy C-Means and 5/10 runs from VFC-SA (b) SAFC and 5/10 runs from VFC-SA.
[Figure 5.11: scatter plot of data points 131-145. Axes: 1st Principal Component (x), 2nd Principal Component (y). Legend: stroma, tumour, centre.]
Figure 5.11 Fuzzy C-means, VFC-SA and SAFC cluster results for dataset 6.
[Figure 5.12: scatter plots (a) and (b) of data points 201-242. Axes: 1st Principal Component (x), 2nd Principal Component (y). Clinical labels: tumour, stroma, necrotic.]
Figure 5.12 Cluster results for dataset 7 obtained from (a) Fuzzy C-Means, 9 runs from
VFC-SA and SAFC (b) 1 run from VFC-SA.
From the clustering results displayed in these figures, it can be seen that all three
clustering algorithms generated the same partition results in datasets 1 and 6
(as shown in Figure 5.6 and Figure 5.11 respectively). In addition, the
two clusters obtained in each dataset also matched the clinical analysis results (see legends in
these two figures).
In Figure 5.7, the number of clusters obtained from VFC-SA and SAFC is the
same as in the clinical analysis, namely two clusters. However, it can be seen
that in Figure 5.7(a), the clustering output from VFC-SA and 3 out of 10 runs of
SAFC is the same as that from fuzzy c-means and, further, that there is one tumour
data point (19) that was misclassified as stroma. Figure 5.7(b) shows the clustering
results from the other 7 out of 10 runs of SAFC, in which this data point was
correctly categorised as tumour. This indicates that, when both algorithms are run the
same number of times, SAFC is more likely to obtain the correct classification.
Figure 5.8(a) displays the cluster results from dataset 3. It can be seen that
VFC-SA generated the same number of clusters as clinical analysis, and the partition
result is the same as that of fuzzy c-means. However, four data points in the tumour
cluster, namely 34, 35, 36 and 37, were misclassified as stroma. On the other
hand, although SAFC produced a different number of clusters from clinical analysis,
in this two-dimensional PC space (Figure 5.8(b)) it would appear more reasonable
to group this dataset into three clusters rather than two. In addition, if the clusters
which have the most similar biochemical characteristics could be merged together,
then the squared green cluster (points 35-37) and the diamond red cluster (points
41-44) would be merged. In this case, the accuracy of the clustering would be significantly
improved. In contrast, the clustering results shown in Figure 5.8(a) could not be
improved in the same way by such a merging technique.
Cluster results for dataset 4 are presented in Figure 5.9. The clinical analysis
was that there are three types of cells in this tissue section, as shown in Figure 5.9(a).
Applying the fuzzy c-means algorithm using the number of clusters from clinical
analysis, the clustering output obtained is also displayed in Figure 5.9(a). The early
keratinisation data points were split into two clusters, and the data points belonging to the
stroma and tumour clusters were mixed up. From the data distribution in PC1 and
PC2 space, it appears to be very difficult to separate the tumour and stroma data
points as the clinical study described (although, of course, the clinical partition might
become more apparent if further dimensions were to be considered). Figure 5.9(b)
shows the two clusters obtained from VFC-SA and SAFC, which appear more
representative of the existing data structure, although the data points belonging to
stroma and tumour were still joined together. This is also why both algorithms
produced a different number of clusters from clinical analysis.
Figure 5.10 displays the results of the clustering algorithms applied to dataset 5, in which
two types of tissue were identified by clinical study (as shown in Figure
5.10(a)). This figure also shows the clustering results from fuzzy c-means and 5 out
of 10 runs of the VFC-SA algorithm. Although the number of clusters obtained from
some runs of VFC-SA is the same as in clinical analysis, from Figure 5.10(a) it can be
seen that some data points which belong to the tumour cluster were misclassified as
stroma, for instance, points 115, 125, 126 and 130. On the other hand, although
SAFC and the other 5 runs of the VFC-SA algorithm produced a different number
of clusters from clinical analysis (Figure 5.10(b)), three clusters appear more natural
than two clusters by visual inspection. Similar to the hypothesis for Figure 5.8(b), if
a technique could merge the clusters with the most similar biochemical characteristics,
then in this case all the data points belonging to tumour would be combined, and
so the same clustering results as the clinical study would be obtained.
Finally, Figure 5.12 presents the results obtained on dataset 7 for the fuzzy
c-means, VFC-SA and SAFC algorithms. Figure 5.12(a) shows that in 9 out of 10 runs
of the VFC-SA algorithm and all 10 runs of the SAFC algorithm the same number of
clusters as in clinical analysis was obtained. However, three data points were
misclassified (points 235, 204 and 208). All three points should belong to the
tumour cluster, but in this case, data point 235 was marked as necrotic and data
points 204 and 208 were marked as stroma. Nevertheless, apart from these three points,
the rest of the data points were correctly categorised. Figure 5.12(b) shows the
clustering result from 1 out of 10 runs of the VFC-SA algorithm in which four clusters
were obtained. This is due to the fact that the data points that should belong to the
tumour and stroma clusters were split into three groups, with the third cluster being
on the border between the tumour and stroma clusters. Although this occurred quite
rarely (only in 1 out of 10 runs), it does indicate the variety of clusters obtained from
the VFC-SA clustering algorithm.
From Figures 5.6 − 5.12, it can be seen that in some datasets, such as
datasets 3 and 5, the VFC-SA algorithm (or some runs of the VFC-SA algorithm)
obtained the same number of clusters as clinical analysis, while the SAFC algorithm
did not. This does not necessarily mean that the clustering accuracy of VFC-SA on
these datasets is better than SAFC’s. Rather, the clustering results from
SAFC appear more reasonable through visual inspection. In addition, if a technique
which can merge the clusters with the most similar biochemical characteristics could be
developed and then be applied to the partition results from SAFC, the accuracy of
the clustering results would be significantly improved. In the clustering results from
dataset 2 (Figure 5.7), both the VFC-SA and SAFC algorithms achieved the same
number of clusters as clinical analysis. However, when these two algorithms were
run ten times, SAFC was more likely to achieve the same results as the clinical study. In
dataset 4, both the VFC-SA and SAFC algorithms obtained a different number of
clusters from clinical analysis. Nevertheless, two well separated clusters can be seen
when displaying this dataset in the space of the first two PCs. Thus, it is hard
to see how any technique might end up with three clusters (to match clinical
analysis) for this particular dataset.
Although the cluster numbers obtained by the SAFC algorithm that differ from
clinical analysis are visually plausible in Figures 5.6 − 5.12,
there are at least three possible explanations for the difference.
Firstly, the clinical analysis may not be correct – this could potentially be caused by
the different types of cells in the tissue sample not being noticed by the clinical
observers or the cells within each sample could have been mixed with others.
Secondly, it could be that although a smaller VXB index value was obtained, indicating
a ‘better’ solution in technical terms, the VXB index is not accurately capturing the real
validity of the clusters. Put another way, although the SAFC finds the better solution
in terms of VXB index, this is not actually the best set of clusters in practice. A third
possibility is that the FTIR spectroscopic data has not extracted the required
information necessary in order to permit a correct determination of cluster numbers –
i.e. there is a methodological problem with the technique itself. None of these
explanations of the difference between the clustering results obtained automatically
and those from clinical analysis detracts from the fact that SAFC produces better
solutions than VFC-SA in that it consistently finds better (statistically lower) values
of the objective function (VXB index).
5.5 Summary
In this Chapter, a new SAFC method has been proposed which has been
extended from the original VFC-SA algorithm in four ways. The newly proposed
algorithm’s performance has been evaluated on seven oral cancer FTIR spectral datasets
and compared to clinical analysis, the standard fuzzy c-means and the original
VFC-SA. The XB validity index was used as the evaluation method to measure the quality
of the clusters produced. The experimental results have shown that the SAFC
algorithm can escape the sub-optimal solutions obtained in the other two approaches
and hence produce better clusters. On the other hand, the numbers of clusters
obtained by SAFC in some datasets are not in agreement with those provided through
clinical analysis. This can be visualised by plotting the clustering results in the
first two dimensions of PC space. For the datasets where the number of clusters differs,
the SAFC results appear to reflect the structure of the underlying data more reasonably.
In addition, the difference can be explained in the following three ways. Firstly, the number
of clusters identified from clinical analysis may not be correct; secondly, the XB
validity index may not be suitable to apply to these clinical data; and thirdly, the
FTIR technique has not (for these datasets) captured sufficient information to
facilitate correct classification. However, more results and information are needed
before any definitive conclusion can be made in this case. Nevertheless, this SAFC
algorithm is a further step towards the automatic classification of data for real
medical application. The next Chapter presents the investigations carried out on
FTIR spectral data collected by imaging an area of tissue section in many positions
(i.e. at a high spatial resolution), which results in a larger dataset than used up to
now.
CHAPTER 6
Methods for the Examination of
Tissue Sections
6.1 Introduction
Infrared imaging techniques have become more frequently employed for the
investigation of tissue cells, with the main advantage that the scanning of the tissue
section is quicker and thus a larger area can be examined. However, in comparison
to previous oral cancer datasets, the size of the data collected using the infrared
imaging technique is several orders of magnitude larger and, therefore, any
developed technique must be capable of operating on such large datasets. This
Chapter focuses on the investigation of FTIR spectra data obtained by employing
infrared imaging technology to analyse lymph node tissue sections. For the purpose
of this Chapter’s study, a tissue area that incorporated a variety of lymph node tissue
types was used. The most important feature of this sample area was that it contained
sections of both cancerous invasion and healthy nodal tissue.
The first part of this Chapter focuses on the investigation into a technique
named PCA−fuzzy c-means, which combines both the PCA technique and the fuzzy
c-means clustering algorithm in order to speed up the clustering analysis process by
reducing the size of the lymph node tissue spectral dataset without losing significant
information from the original data. The PCA technique and fuzzy c-means clustering
algorithm were also individually applied to the same lymph node dataset for
comparative purposes. The clustering results obtained from these three techniques
were displayed in false-colour images and their processing times were calculated and
compared. As mentioned above, the infrared imaging method allows for
significantly larger datasets and, indeed, the image created and used throughout this
Chapter was composed of 7497 spectra. This is a significantly higher number than in
previous studies and is beneficial in order to be able to assess the diagnostic ability of
the clustering techniques. The clustering algorithms used in the previous chapters of
this thesis, such as HCA and SAFC, were not suitable for this study due to the high
computational requirement of operating on such large-scale datasets. In the second
part of the chapter, a PCA−k-means technique, which is similar to the PCA−fuzzy
c-means method, is also used to analyse the same lymph node tissue section. The
clustering results obtained from both algorithms are then encoded into false-colour
images and compared.
6.2 Lymph Node Dataset Description
Lymph nodes are round kidney-shaped organs distributed throughout the body
along the lymphatic vessel system. They can vary in size from a few millimetres to
more than 2cm and have two main functions. Lymph is filtered through them to
allow the removal of foreign particles by phagocyte cells. Additionally, foreign
antigens are trapped on the surface of antigen presenting cells and presented to
memory B-lymphocytes. These then migrate to the germinal centre of the follicles
and give rise to the synthesis of antibodies that combat disease invading the body
[106]. Figure 6.1(a) shows the H&E stained parallel tissue section used for infrared
analysis and allows the main structure of the node to be identified. The IR image
was collected from a particularly interesting site on the lymph node where several
different types of tissue are present and which, more importantly, includes areas of
both cancerous invasion and healthy nodal tissue. This selected tissue section has
been named LNII5. Figure 6.1(b) shows the LNII5 area at higher magnification
and this allows for easy identification of the surrounding capsule, cortex and
invading breast cancer tissue. In the centre of the cortex, with a lighter pigmentation,
is a stimulated proliferating follicle or germinal centre. Reticular cells that extend
into the sinuses can also be seen and characteristically form a delicate network
between the capsule and trabeculae (small, often microscopic, tissue elements in the
form of a small beam, strut or rod, generally having a mechanical function). At the
top left corner of the spectral image (see Figure 6.1(d)) lies a small pocket of fatty
tissue that normally surrounds the lymph node. This small area of fatty tissue has
been included in the infrared analysis. However, in the corresponding H&E stained
image, it has unfortunately been omitted due to the fact that the piece to be stained
was cut slightly below the fatty tissue. In Figure 6.1(c), the different types of tissue
are identified.
[Figure 6.1: four image panels, (a) to (d).]
Figure 6.1 (a) Photomicrograph of the H&E stained parallel lymph node tissue section used for IR analysis (b) selected area – LNII5 at high magnification (c)
different tissue types description (d) LNII5 spectral image.
[Tissue types labelled in the figure: cancerous cortex tissue; normal cortex tissue; capsule (fibrocollagenous tissue); secondary follicle (normal cortex tissue); reticulum (fibrocollagenous tissue).]
6.3 A Combination of Principal Component Analysis and
Fuzzy C-Means Clustering
6.3.1 Introduction
The principal component analysis (PCA) technique (see Chapter 2.3.2) has
been widely used for FTIR spectra analysis. As mentioned in Section 2.3.2, PCA
can detect structure in the relationships between variables of data and can also be
used to reduce the dimensionality of a dataset.
In our previous work, the seven oral cancer IR spectral datasets were comparatively
small, containing 15, 18, 11, 31, 30, 15 and 42 spectra in datasets 1 to 7
respectively. In each dataset, spectra were collected at scattered individual points
chosen from different types of tissue. In contrast, the experiments on the lymph node
tissue section reported in this chapter use 7497 spectra taken from a whole sub-area
of one axillary lymph node, LNII5. Each spectrum contains 821 absorbance measurements
evenly distributed across the indicated wave-number range (4000–720 cm⁻¹), with one
measurement at every fourth wave-number (4000, 3996, 3992, …, 720). Therefore, the
total size of the LNII5 tissue section dataset is 7497×821 values. If the
dimensionality of the original data can be reduced without losing too much useful
information, the clustering analysis will be more computationally efficient. For this
reason, PCA was used to transform the coordinates of the original data and, in this
thesis, the first 10 PCs were selected on which to perform the data analysis. This
number was chosen empirically on the basis that these PCs were found to contain
99.1% of the variance of the original data. Thus, these ten principal components
are still highly reflective of the original data. As PCA can also be used to detect
structure in the original data, separate experiments using PCA alone and the standard
fuzzy c-means clustering algorithm alone were also conducted. The clustering
results from the three techniques (PCA, fuzzy c-means and PCA – fuzzy c-means) are
compared using false-colour weighted images, and their corresponding computation
times are also presented.
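The dimensionality reduction step described above can be sketched as follows. This is a minimal illustration only, assuming NumPy and a PCA computed via the singular value decomposition of the mean-centred data; the function name `pca_reduce` and the synthetic data are hypothetical stand-ins, not the software or spectra used in this thesis.

```python
import numpy as np

def pca_reduce(spectra, n_components=10):
    """Project mean-centred spectra onto their first principal components."""
    centred = spectra - spectra.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # fraction of variance per PC
    scores = centred @ vt[:n_components].T   # coordinates in PC space
    return scores, explained[:n_components]

# Synthetic stand-in for a spectra matrix (rows = spectra, columns = wave-numbers).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 821))
spectra = latent @ loadings + 0.01 * rng.normal(size=(500, 821))

scores, explained = pca_reduce(spectra, n_components=10)
print(scores.shape)        # (500, 10)
print(explained.sum())     # fraction of the total variance retained by 10 PCs
```

The cumulative sum of `explained` plays the role of the 99.1% figure quoted above: the number of PCs is chosen so that this sum is close to one.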
6.3.2 Experiments and Results of PCA
The collected spectral dataset was subjected to PCA to obtain the principal
components. The correlation coefficient between each spectrum and each PC was
then calculated. A larger value indicates that the spectrum has a closer relationship
to this PC, and vice versa. After calculating the values for all the spectra and each
PC, normalisation was conducted to limit the values to the range [0, 1]. Each point
on the image was then false-coloured to represent the strength of the correlation
coefficient (as shown on the colour bars to the right of Figure 6.2). Since each
spectrum on the IR image has a unique spatial (x, y) position, false colour images can
be generated by plotting specially coloured pixels as a function of the spatial
coordinate. In the results, in order to distinguish between different clusters, each
cluster was assigned a unique colour. Images created for each of the first 10 PCs
using false colours are displayed in Figure 6.2. In the colour bar, red indicates that
the spectrum at the given point is very similar to the specified PC, and
blue indicates that it is highly dissimilar.
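The image construction described above can be sketched as follows. This is an illustrative reading of the procedure rather than the original implementation: it assumes the Pearson correlation between each spectrum and each PC loading vector, min-max normalisation applied separately for each PC, and pixels stored in row-major order. The function name `pc_correlation_images` and the random data are hypothetical.

```python
import numpy as np

def pc_correlation_images(spectra, n_pcs, height, width):
    """Return an array (n_pcs, height, width) of correlations scaled to [0, 1]."""
    centred = spectra - spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pcs = vt[:n_pcs]                                   # (n_pcs, n_wavenumbers)

    # Pearson correlation between every spectrum and every PC loading vector.
    zs = spectra - spectra.mean(axis=1, keepdims=True)
    zp = pcs - pcs.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(zs, axis=1)[:, None] * np.linalg.norm(zp, axis=1)[None, :]
    corr = (zs @ zp.T) / norms                         # (n_spectra, n_pcs)

    # Min-max normalisation per PC (an assumption; the text only states [0, 1]).
    lo, hi = corr.min(axis=0), corr.max(axis=0)
    scaled = (corr - lo) / (hi - lo)

    # One image per PC: each pixel holds the scaled correlation of its spectrum.
    return scaled.T.reshape(n_pcs, height, width)

rng = np.random.default_rng(1)
spectra = rng.normal(size=(120, 50))   # 120 pixels, 50 wave-numbers (synthetic)
images = pc_correlation_images(spectra, n_pcs=3, height=10, width=12)
print(images.shape)                    # (3, 10, 12)
```

Each resulting image can then be passed to any plotting routine with a blue-to-red colour map to obtain a false colour picture of the kind shown in Figure 6.2.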
Figure 6.2(a) shows the original H&E stained image of LNII5. The first PC
image is presented in Figure 6.2(b) and, from the picture, it can be seen that the fatty
tissue situated in the corner of the image has the closest correlated relationship with
PC1. In contrast, the rest of the tissue area is much less closely related to PC1. The
second PC image, as shown in Figure 6.2(c), demonstrates that the germinal centre,
cancerous and some normal tissue have a strong correlation with PC2, some reticulum
and normal areas have a weaker correlation with PC2 and, lastly, capsule, fatty tissue, some
reticulum and normal tissue have the least correlation with PC2. It is apparent that
different types of tissue have been mixed around in each cluster marked by different
colours. Furthermore, in the PC2 image, the main normal and cancerous tissue area
are not separated. For the third PC image displayed in Figure 6.2(d) it can be seen
that capsule, fatty tissue, germinal centre, some normal and cancerous tissue, and
reticulum are the closest to PC3, and the rest of the normal and cancerous tissue is far
from PC3. In this image, more tissue types are mixed together, and normal and
cancerous tissue are still not clearly separated from each other. A similar situation
also occurs in the rest of the images. As the PC number increases, less
and less information appears to be contained within the images.
6.3.3 Experiments and Results of Fuzzy C-Means
The fuzzy c-means clustering algorithm is sensitive to the initial position of the
cluster centres and, as these are positioned randomly at the outset of the algorithm,
the method was run ten times for each number of clusters, which was subjectively set
from 2 to 9. The squared Euclidean distance was used to calculate the distances between
the spectral data points and the cluster centres, and the fuzziness index m was set to a
value of 2. Initially, the algorithm terminated when the number of iterations
reached 100 or when the improvement in the objective function between successive
iterations was less than 10⁻⁵ (the minimal amount of improvement). Later, the minimum
acceptable improvement was altered to 10⁻⁷, as described below.
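The settings listed above can be made concrete with a short sketch of the standard fuzzy c-means iteration. This is a minimal illustration assuming NumPy; the function name `fuzzy_c_means`, the parameter name `min_improvement` and the choice of initialising the centres at randomly selected data points are assumptions made for the example, not a description of the software used in these experiments.

```python
import numpy as np

def fuzzy_c_means(data, n_clusters, m=2.0, max_iter=100,
                  min_improvement=1e-5, seed=None):
    """Standard fuzzy c-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Random initial centres (here: randomly chosen data points).
    centres = data[rng.choice(len(data), size=n_clusters, replace=False)]
    previous = np.inf
    for iteration in range(1, max_iter + 1):
        # Squared distances between every centre and every point: (c, n).
        diff = data[None, :, :] - centres[:, None, :]
        d2 = np.maximum((diff ** 2).sum(axis=2), 1e-300)
        # Membership update: u_ik proportional to d2_ik^(-1/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)
        um = u ** m
        objective = float((um * d2).sum())
        # Stop when the objective improves by less than the threshold.
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
        # Centre update: membership-weighted mean of the data.
        centres = (um @ data) / um.sum(axis=1, keepdims=True)
    return centres, u, objective, iteration

# Two well-separated synthetic groups, purely for demonstration.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                  rng.normal(5.0, 0.3, size=(50, 2))])
centres, u, objective, iterations = fuzzy_c_means(
    data, n_clusters=2, m=2.0, max_iter=100, min_improvement=1e-7, seed=1)
print(centres)
print(iterations)
```

The two termination conditions appear as the `max_iter` loop bound and the `min_improvement` test on successive objective function values.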
Figure 6.2 IR imaging of lymph node tissue section LNII5 by PCA: (a) H&E stained image of LNII5; (b)−(k) false colour weighted images for PC1−PC10
respectively.
6.3.3.1 Setting the minimal amount of improvement for fuzzy c-means
During the initial stage of the experiments, the results produced by fuzzy c-
means varied considerably on each run. Figure 6.3 shows examples from 3 separate
runs (the number of clusters was 2). In these examples, the fuzzy c-means clustering
did not perform well. The main types of tissue were mixed throughout the section.
In each colour region, fuzzy c-means could not clearly separate the main types of
tissue. For example, in Figure 6.3(a), the red region included cancerous and normal
tissues, reticulum and capsule whilst the blue area included fatty tissue, capsule,
reticulum, and cancerous tissue. Therefore, there were no clear tissue types within
any cluster. A similar situation is also shown in Figure 6.3(b) and (c).
Figure 6.3 Clustering results from three separate runs with fuzzy c-means.
Following this observation, the spectral data were plotted in the space of the first 3
PCs (which contain 93.2% of the variance of the original data), as shown in Figure
6.4. The ranges of the three components were found to be:
[−0.0075, 0.0751]
[−0.0117, 0.0069]
and [−0.0096, 0.0047] respectively.
The order of magnitude of −0.0075 is −10⁻² and that of 0.0751 is 10⁻¹, and so on. Thus,
the corresponding sizes of the ranges are of order 10⁻¹, 10⁻² and 10⁻². As the
component range becomes smaller, the data are more compact and, thus, the distances
between the data and their ideal centres are smaller. In fuzzy c-means, the objective
function J(U,V) (see Equation 2.3) is proportional to the squared Euclidean distance
between the data and the centres. In this case, the range sizes are of order 10⁻¹ and
10⁻², so the squared Euclidean distances are of order 10⁻² and 10⁻⁴ (or even smaller).
Hence, a small range size may lead to a very small objective function value. One of the
stopping criteria is met when the difference between two successive objective function
values is less than the minimal amount of improvement. Therefore, if the minimal amount
of improvement is not small enough relative to the objective function values (e.g. 10⁻⁵,
the initial setting), the algorithm stops before the centre positions have had the
chance to improve, and the performance of fuzzy c-means is poor. For this reason, a
value of 10⁻⁷ was used as the minimal amount of improvement for the remainder of the
experiments. With this setting, the performance of fuzzy c-means improved and it
consistently achieved stable clustering results.
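The scaling argument above can be written out explicitly. The following is a sketch that assumes Equation 2.3 is the standard fuzzy c-means objective; the scale factor α and the threshold symbol ε are introduced here purely for illustration.

```latex
% Standard fuzzy c-means objective (assumed to match Equation 2.3)
J(U,V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \lVert x_k - v_i \rVert^{2}

% Rescale every spectrum and every centre by a common factor \alpha > 0.
% The memberships u_{ik} depend only on ratios of distances, so they are unchanged.
x_k \mapsto \alpha x_k, \quad v_i \mapsto \alpha v_i
\;\Longrightarrow\;
\lVert \alpha x_k - \alpha v_i \rVert^{2} = \alpha^{2} \lVert x_k - v_i \rVert^{2},
\qquad J \mapsto \alpha^{2} J

% The stopping test therefore rescales in the same way.
\lvert J^{(t)} - J^{(t-1)} \rvert < \varepsilon \ \text{(scaled data)}
\iff
\lvert J^{(t)} - J^{(t-1)} \rvert < \varepsilon / \alpha^{2} \ \text{(unscaled data)}
```

As a rough reading, if the LNII5 component ranges are about one order of magnitude smaller than those of the oral cancer datasets (α ≈ 10⁻¹), then a threshold of 10⁻⁵ on LNII5 is only as strict as a threshold of about 10⁻³ would be on the oral cancer data, whereas 10⁻⁷ restores a strictness comparable to the original 10⁻⁵.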
This finding can also be demonstrated with the seven oral cancer FTIR spectral
datasets which were used in the previous oral cancer experiments (see Chapters 4
and 5). In this previous work, the minimal amount of improvement had been set to
10⁻⁵. In these datasets, fuzzy c-means performed well because the range sizes were
compatible with this setting. The range sizes in the first three PCs are shown in
Table 6.1.
It can be seen from Table 6.1 that the order of the first PC range is 10⁰ for all of
the datasets (the squared Euclidean distance is therefore also of order 10⁰). In the
second and third PCs, the range sizes are of order either 10⁰ or 10⁻¹, and so the
corresponding squared Euclidean distances are of order 10⁰ and 10⁻² respectively.
Compared to these, a value of 10⁻⁵ for the minimal amount of improvement is
sufficiently small to allow the centre positions to improve. This may also be the
reason why fuzzy c-means obtained good results for these datasets.
Figure 6.4 A three-dimensional scatter plot of the tissue section spectra projected onto the first 3 PCs.
Table 6.1 The ranges of the first 3 PCs in the seven oral cancer FTIR datasets.
Data set      Range of first PC   Range of second PC   Range of third PC
Data set 1    1.8458              0.6141               0.2575
Data set 2    2.8569              1.2562               0.4795
Data set 3    1.5960              0.5859               0.6600
Data set 4    2.1702              0.8569               0.4112
Data set 5    1.6224              0.8317               0.5496
Data set 6    1.7023              0.8902               0.3342
Data set 7    1.8750              1.5900               0.8741
6.3.3.2 Clustering results obtained from fuzzy c-means
After resetting the minimal amount of improvement, fuzzy c-means produced
consistent clustering results. The number of clusters was subjectively set from 2 to 9
and the corresponding clustering results, displayed as false colour weighted images,
are shown in Figure 6.5. It can be seen that as the number of clusters is
increased from 2 to 5, as shown in Figure 6.5(b) – (e), the number of tissue types that
can be discriminated increases. When comparing these images against the H&E
stained parallel section in Figure 6.5(a), the fuzzy c-means image for 5 clusters,
shown in Figure 6.5(e), gives very good agreement, given that the H&E image is from an
adjacent tissue section and small morphological changes are likely. Each colour
within the image can generally be assigned to a specific tissue type. The orange
cluster is the capsule, green the reticulum, maroon the healthy normal cortex tissue
surrounding the germinal centre in dark blue, and finally the invading cancerous
breast tissue is described by a light blue colour. The only misclassification
corresponds to spectra originating from fatty tissue located at the top left hand corner
of the image that have been grouped into the same cluster as the invading cancerous
tissue. Correct clustering of fatty tissue spectra into a single group was not
achievable via fuzzy c-means analysis. This is a direct consequence of their position
in multi-dimensional PC space, and is discussed in detail in Section 6.3.6. As
the number of clusters is further increased from 6 to 9, as shown in Figure 6.5(f) –
(i), these main tissue types are further subdivided. The capsule and reticulum
begin to show shared clusters and the formation of a lining that surrounds these
tissues. This is understandable as these types of tissue are very similar in nature.
The invading cancerous tissue also begins to display a second cluster that may
describe tissue with a different degree of malignancy that is not recognised via
conventional histology. When the number of clusters was again increased (above
nine), no further beneficial tissue discrimination could be made, with images
becoming needlessly complex and hard to interpret.
Figure 6.5 IR imaging of lymph node tissue section LNII5 by fuzzy c-means: (a) H&E stained image of LNII5; (b)−(i) fuzzy c-means false colour weighted
clustering results for 2 to 9 clusters respectively.
6.3.4 Experimental Results of the PCA – Fuzzy C-Means Technique
Finally, the collected spectra were subjected to PCA – fuzzy c-means
clustering analysis, where the dataset was initially compressed via PCA to its first ten
PCs, accounting for 99.1% of the variance contained in the original dataset. These
newly extracted variables were then clustered via the conventional fuzzy c-means
method. Although this algorithm consecutively performs two different
multivariate analyses, the total computation time is significantly shorter than that
of traditional fuzzy c-means analysis. This is a consequence of the dataset now
being described by only ten dimensions rather than 821 wave-number dimensions.
Results from the analysis were again visually displayed as false colour images and
are shown in Figure 6.6. When comparing these PCA – fuzzy c-means images directly
with those created via conventional fuzzy c-means clustering, no significant
loss of image quality can be observed. This demonstrates that data compression,
used in a statistically sound fashion, can be an effective tool for
reducing computation requirements and analysis times.
6.3.5 Computational Time
For the examination of very large spectral datasets, such as those collected
from an entire lymph node, it is important that the type of analysis used is both fast
and efficient. Therefore, in this Section, the computation times for the three
multivariate techniques used in these experiments are examined. For each technique,
analyses were repeated 10 times and the average computation times determined. All
calculations were carried out on a 1.8 GHz Intel Pentium IV PC with 1 GB of
RAM, running the Windows XP operating system.
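The timing protocol described above (repeat each analysis ten times and average) can be sketched as follows. This is a generic illustration using the Python standard library; the helper `average_runtime` and the placeholder workload `dummy_analysis` are hypothetical and are not the code used to obtain the timings reported in this Section.

```python
import statistics
import time

def average_runtime(func, repeats=10):
    """Run func() `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)

def dummy_analysis():
    # Placeholder workload standing in for PCA or a clustering run.
    return sum(i * i for i in range(10_000))

mean_seconds = average_runtime(dummy_analysis, repeats=10)
print(f"average over 10 runs: {mean_seconds:.6f} s")
```

Averaging over repeated runs is particularly relevant for fuzzy c-means, since the random initial centres mean that the number of iterations, and hence the run time, can differ from run to run.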
The computation time for PCA ($CT_{PCA}$) is divided into two parts. This can be
expressed as:
$$CT_{PCA} = T_{PCA} + T_{Plot10PCsIMG} \qquad (6.1)$$
where $T_{PCA}$ is the time taken by the principal component analysis, and
$T_{Plot10PCsIMG}$ is the time taken to plot the images of the first ten principal
components. In these experiments, it was found that, on average, $T_{PCA} = 60.3$
seconds and $T_{Plot10PCsIMG} = 84.6$ seconds. Therefore, the total computation time
for PCA ($CT_{PCA}$) was 144.9 seconds; approximately 2.4 minutes.
Figure 6.6 IR imaging of lymph node tissue section LNII5 by PCA–fuzzy c-means: (a) H&E stained image of LNII5; (b)−(i) fuzzy c-means false colour
weighted clustering results for 2 to 9 clusters respectively.
Computation time for fuzzy c-means ($CT_{FCM}$) is also composed of two parts,
and can be expressed as:
$$CT_{FCM} = T_{FCM(i)} + T_{imgPlot(i)} \qquad (6.2)$$
where $T_{FCM(i)}$ is the time spent on fuzzy c-means clustering, $i$ represents the
number of clusters ($i = 2, 3, 4, \ldots, 9$), and $T_{imgPlot(i)}$ is the time for
the imaging plot of $i$ clusters.
Since the number of clusters varied, the computation time
required for each was different. The average $CT_{FCM}$ for each number of clusters is
shown in Table 6.2. The total $CT_{FCM}$ was calculated to be 1684.9 seconds,
approximately 28 minutes.
Finally, the computation time for the PCA-fuzzy c-means technique
($CT_{PCA-FCM}$) comprises four parts. This can be expressed as:
$$CT_{PCA-FCM} = T_{PCA} + T_{extract10PCs} + T_{FCM(i)} + T_{imgPlot(i)} \qquad (6.3)$$
where $T_{PCA}$ is the principal component analysis time, $T_{extract10PCs}$ is the
time taken to extract the first ten principal components, $T_{FCM(i)}$ is the time
spent on fuzzy c-means clustering and $T_{imgPlot(i)}$ represents the time for the
imaging plot of $i$ clusters. It should be noted that the PCA and the extraction of
the first ten PCs (timed by $T_{PCA}$ and $T_{extract10PCs}$) were performed only
once, before applying the fuzzy c-means clustering algorithm. A summary of the
computation times for the PCA-fuzzy c-means analysis is displayed in Table 6.3. The
results in Table 6.3 show that the total computation time taken for the PCA-fuzzy
c-means technique ($CT_{PCA-FCM}$) was 151.29 seconds, approximately 2.5 minutes. A
comparison of computation time for all three techniques is shown in Table 6.4.
6.3.6 Discussion of Results
In comparing these three techniques from the computational time point of
view, it can be seen that PCA and PCA – fuzzy c-means took almost the same time
to complete the experiments, although PCA was slightly faster. In contrast, the
standard fuzzy c-means algorithm took the longest time. Nevertheless, the quality of
the results obtained from these techniques is not proportional to their computation
time. In the first PC image, fatty tissue was separated from the other types of tissue.
However, from PC2 to PC10, different tissue sections started to become mixed together.
In particular, when the PC number increased beyond three, less and less
information was revealed. In fuzzy c-means clustering analysis, as the number of
clusters is increased from two to five, more types of tissue can be discriminated. In
the five clusters image, Figure 6.5(e), most of the tissue sections can be correctly
assigned to their histological groups apart from the fatty tissue area. The reason for
this incorrect clustering can be explained by examination of the spectra in
multi-dimensional PC space. Figure 6.7(a) displays the original dataset plotted in
three dimensional PC space with five clusters. Figure 6.7(b) is a rotated picture of
(a) and best shows the differences between the outlying fatty tissue spectra
(encircled) and the remaining tissue spectra. In the fuzzy c-means algorithm, the
Euclidean distance is used to define membership values; this means that when the shapes
of the clusters are significantly different from spherical, the clustering results will
not be effective. In this dataset the first PC is descriptive of the lipid content in
the fatty tissue. The small number of spectra collected from this region of the tissue
section display a very large natural variation in the intensity of these lipid peaks.
This has caused the fatty tissue spectra to be distributed along this PC axis, so that
the fuzzy c-means clustering is less effective. Unfortunately this has led to
misclassification of the fatty tissue spectra into the same cluster as the cancerous
spectra (dark red).
Table 6.2 Summary of fuzzy c-means clustering computation time.
                     Computation time for fuzzy c-means technique
Number of clusters   Fuzzy c-means clustering (sec)   Image plot (sec)   Total (sec)
2                    30.4                             0.3                30.7
3                    174                              0.3                174.3
4                    128.6                            0.3                128.9
5                    169.4                            0.3                169.7
6                    441.7                            0.3                442
7                    325.3                            0.3                325.6
8                    413.4                            0.3                413.7
Total                1682.8                           2.1                1684.9
Table 6.3 Summary of PCA-fuzzy c-means computation time.
                     Computation time for PCA-fuzzy c-means technique
Number of clusters   Fuzzy c-means clustering (sec)   Image plot (sec)   Total (sec)
2                    0.67                             0.36               1.03
3                    4.15                             0.37               4.52
4                    3.34                             0.34               3.68
5                    4.31                             0.39               4.70
6                    9.83                             0.42               10.25
7                    7.90                             0.39               8.29
8                    12.63                            0.39               13.02
Subtotal             42.83                            2.66               45.49
TPCA                 -                                -                  60.30
Textract10PCs        -                                -                  0.01
Total                -                                -                  151.29
Table 6.4 Computation time comparison between PCA, fuzzy c-means and PCA-fuzzy c-means analysis techniques.
Technique            Computation time (mins)
PCA                  2.4
Fuzzy c-means        28
PCA-fuzzy c-means    2.5
Figure 6.7 LNII5 tissue section spectra plotted in three dimensional PC space: (a) original plot with 5 clusters; (b) rotated plot of picture (a).
Finally, the combined PCA – fuzzy c-means analysis achieved similar
discrimination to that achieved via fuzzy c-means analysis, but with a greatly
improved computational speed and without a significant loss of information from the
original dataset. Thus, it allows high quality cluster analysis of large FTIR spectral
datasets in dramatically reduced times.
6.4 Comparison of K−Means and Fuzzy C−Means in Lymph
Node Tissue Sections
Following the PCA – fuzzy c-means technique, PCA can equally be applied
prior to the k-means algorithm to generate a ‘PCA – k-means’ method. Thus, the
performance of the k-means and fuzzy c-means clustering algorithms on the large
lymph node tissue section LNII5 can be compared on an equal footing. In the following
experiments, the PCA technique was first applied to all the spectra within
LNII5, and the first ten PCs were then extracted for use by the k-means clustering
algorithm. The number of clusters was set in the same way as for the PCA – fuzzy
c-means method, ranging from 2 to 9. The following settings were used for the k-means
experiments: the squared Euclidean distance was used as the distance measure, the
initial cluster centre positions were randomly selected, the maximum number of
iterations was set to 100 and the experiments were run ten times. The purpose
of this Section is to compare the clustering results obtained from the k-means and
fuzzy c-means algorithms after PCA had been used to reduce the dimensionality of
the dataset. Therefore, from now on in this Section, the PCA – fuzzy c-means and PCA
– k-means techniques are simply referred to as fuzzy c-means and k-means.
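The k-means settings listed above can be sketched as follows. This is a minimal illustration of the standard (Lloyd) k-means iteration assuming NumPy; the function name `k_means`, the initialisation at randomly chosen data points and the synthetic data are assumptions made for the example, not the implementation used in these experiments.

```python
import numpy as np

def k_means(data, n_clusters, max_iter=100, seed=None):
    """Standard k-means with squared Euclidean distance and random initial centres."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=n_clusters, replace=False)].copy()
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Squared distance from every point to every centre: (n, k).
        d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments unchanged: converged
        labels = new_labels
        for j in range(n_clusters):
            members = data[labels == j]
            if len(members):           # an empty cluster keeps its old centre
                centres[j] = members.mean(axis=0)
    return centres, labels

# Two well-separated synthetic groups standing in for PC scores.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
                  rng.normal(5.0, 0.3, size=(50, 2))])
centres, labels = k_means(data, n_clusters=2, max_iter=100, seed=3)
print(centres)
```

Unlike fuzzy c-means, each point receives a single hard label, so different random initial centres can lead to visibly different partitions from run to run.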
6.4.1 Results obtained from K-means and Fuzzy C-means (using PCA)
After both techniques had been run ten times for each number of clusters, it
was found that fuzzy c-means had produced the more consistent clustering
results, although k-means was also able to produce fairly stable results. However, in
the cases where there were 2, 5, 6, 7 and 9 clusters, the k-means algorithm produced
variable clustering results. In order to facilitate the comparison, results obtained in
Section 6.3.4 are also redisplayed in this Section.
Figure 6.8(a) and (b) are two examples of the variation obtained in results from
k-means; Figure 6.8(c) is from fuzzy c-means. In Figure 6.8(a), it can be seen that
k-means separated the fat from the other types of tissue. In comparison, Figure 6.8(b)
and (c) are almost the same. Based on the histological stained picture analysis shown
in Figure 6.1(c), the red region covers the capsule and reticulum tissue area, whereas
the blue region includes the fatty tissue and nodal tissue (including both cancerous
and normal). Figure 6.9(a) – (g) displays results obtained using k-means clustering
when the number of clusters is set to 3 – 9, and the corresponding results produced
by fuzzy c-means are shown in Figure 6.10(a) – (g). Because variations
occurred in k-means clustering for several different numbers of clusters, Figure 6.9
displays the clustering results which appeared most frequently (over the ten runs)
using the k-means algorithm.
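One way of identifying the most frequently occurring result over repeated runs is sketched below. The text does not state how the most frequent k-means result was determined, so this is only an assumed approach: cluster labels are renumbered in order of first appearance so that two runs giving the same partition under different label names compare as equal, and only exactly identical partitions are counted together. The helper names are hypothetical.

```python
from collections import Counter

def canonical(labels):
    """Relabel clusters in order of first appearance."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)

def most_frequent_partition(runs):
    """Return the partition seen most often and the number of runs producing it."""
    counts = Counter(canonical(run) for run in runs)
    partition, frequency = counts.most_common(1)[0]
    return list(partition), frequency

runs = [
    [0, 0, 1, 1, 2],
    [2, 2, 0, 0, 1],   # same partition as the first run, different label names
    [0, 1, 1, 1, 2],
]
partition, frequency = most_frequent_partition(runs)
print(partition, frequency)   # [0, 0, 1, 1, 2] 2
```

For real image data, where two runs may differ in only a handful of pixels, an agreement measure with a tolerance would be needed rather than exact equality.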
Figure 6.8 Clustering results from k-means (a and b) and fuzzy c-means (c) for 2 clusters.
Figure 6.9 K-means clustering results for 3 − 9 clusters: (a) 3 clusters (b) 4 clusters (c) 5 clusters (d) 6 clusters (e) 7 clusters (f) 8 clusters (g) 9 clusters.
Figure 6.10 Fuzzy c-means clustering results for 3 − 9 clusters: (a) 3 clusters (b) 4 clusters (c) 5 clusters (d) 6 clusters (e) 7 clusters (f) 8 clusters (g) 9 clusters.
With the k-means algorithm, variable results also
occurred when the number of clusters was set to 5, 6, 7 and 9. These variations are
displayed in Figures 6.11 – 6.14.
Figure 6.11 Variation in k-means clustering results for 5 clusters.
Figure 6.12 Variation in k-means clustering results for 6 clusters.
Figure 6.13 Variation in k-means clustering results for 7 clusters.
Figure 6.14 Variation in k-means clustering results for 9 clusters.
6.4.2 Discussion of the Clustering Results for K-means and Fuzzy C-means
By observing the clustering results obtained from k-means, as displayed in
Figure 6.9(a), it can be seen that when the number of clusters was set to three, k-
means separated the fatty tissue successfully, but the normal tissue area (germinal
centre and normal cortex tissue) was grouped with cancerous cortex tissue. Capsule
and reticulum were still mixed together as when the number of clusters was two (as
shown in Figure 6.8(b)). This also happened in the corresponding clustering results
from the fuzzy c-means algorithm in Figure 6.10(a). However, the difference
between fuzzy c-means and k-means in this Figure was that, in the case of fuzzy c-
means, the normal germinal centre tissue was distinct from the cancerous cortex
tissue whereas, in the case of k-means, these two types of tissue were clustered
together. However, for fuzzy c-means some of the fatty tissue and a small area of
normal tissue (outside of the germinal centre) was still misclassified with cancerous
tissue.
When the number of clusters was increased to four, the k-means algorithm still
could not differentiate the cancerous tissue from the normal tissue (e.g. germinal
centre). This is demonstrated in Figure 6.9(b). On the other hand, some reticulum
was separated from capsule. However, these separated reticulum were mixed with
normal cortex tissue (outside of the germinal centre). This situation also occurred in
the fuzzy c-means results, as shown in Figure 6.10(b). Nevertheless, fuzzy c-means
was able to split the normal tissue (germinal centre and outside normal cortex tissue)
and cancerous tissue.
As the number of clusters was further increased to five, different variations
started appearing within the k-means clustering results. In Figure 6.9(c) and Figure
6.11(b), k-means still did not distinguish the normal germinal centre tissue
from the cancerous cortex tissue; in the other variation, displayed in Figure 6.11(a),
the normal and cancerous tissue were separated, but some reticulum was still mixed in
with the normal cortex tissue (outside of the germinal centre). The corresponding
results from fuzzy c-means in Figure 6.10(c) showed that, apart from the fact that
fatty tissue was still mixed with cancerous tissue, the other types of tissue
obtained a classification similar to the clinical analysis, as shown in Figure 6.1(c).
Starting from six clusters, the k-means algorithm began to separate normal and
cancerous tissue consistently, although there was a small amount of reticulum that
was also classified with the capsule. As the number of clusters was further
increased, the k-means approach started to classify more subtypes within the capsule
and fatty tissue area, whereas fuzzy c-means classified more subtypes in the
cancerous and capsule tissue area. Finally, increasing the number of clusters to nine
resulted in more and more tissue types being mixed together. The additional clusters
within these tissue sections may identify potential subtypes of tissue which cannot
currently be identified by pathological analysis and may potentially be useful for
diagnosis. Of course, they may also simply reflect noise in the clustering.
Overall, the fuzzy c-means algorithm was able to split the normal and
cancerous tissues in the early stage of the clustering process (when the number of
clusters was low), whereas the k-means algorithm could separate the main
types of tissue only when the number of clusters was increased. However, the fatty
tissue could not be separated from the cancerous region using fuzzy c-means with any
of the cluster numbers used within this experiment (2 to 9 clusters). In contrast,
k-means could separate the fatty tissue almost regardless of the number of clusters.
Nevertheless, from six to nine clusters, although k-means sometimes obtained better
clustering results than fuzzy c-means, it was not stable, in the sense that it also
sometimes produced worse results. Therefore, it would not appear to be consistent
enough for real world application. In addition, as the number of clusters increases,
more and more information is obtained about the tissue which cannot be identified by
the pathologist.
6.5 Summary
A tissue section (LNII5) which was collected from a particularly interesting
site on the lymph node was used in the study of IR imaging presented in this Chapter.
Due to the large size of the FTIR spectral data obtained from LNII5, a technique
named PCA – fuzzy c-means, which combines the PCA and fuzzy c-means methods, was
employed to speed up the cluster analysis with no significant information loss from
the original dataset. Experiments were conducted to apply the PCA, fuzzy c-means
and PCA – fuzzy c-means techniques individually to the LNII5 tissue section and the
results showed that fuzzy c-means and PCA – fuzzy c-means obtained almost the
same clustering results. These both performed better than PCA. However, PCA –
fuzzy c-means was roughly eleven times faster than the fuzzy c-means algorithm alone
(about 2.5 minutes compared with 28 minutes). This speed benefit was obtained through
the reduction in the size of the data using PCA. Moreover, when fuzzy c-means was
initially applied to the LNII5 dataset, its performance was poor. However, by
investigating the sizes of the ranges of the first three PCs, it was found that the
parameter which specifies the minimal amount of improvement was not small enough,
causing the fuzzy c-means algorithm to stop prematurely without making any further
improvement. Based on this finding, the setting for the minimal amount of
improvement was reduced and the performance of the fuzzy c-means algorithm then
improved significantly.
The k-means algorithm was also used to cluster the reduced dataset from PCA
and the results were compared to those of the PCA – fuzzy c-means technique. The results
demonstrated that, whilst PCA – fuzzy c-means can separate the main tissue
types in the early stage of clustering, the PCA – k-means approach was only able to
satisfactorily classify them when the number of clusters was increased beyond five.
As the number of clusters was increased further, more information was obtained
within the classification (in particular, the possible identification of tissue subtypes)
which cannot be recognised by the pathologist.
Chapter 7
145
CHAPTER 7
A Cluster Merging Algorithm
7.1 Introduction
The motivation behind this Chapter is to create an automated method to
analyse FTIR spectra and to separate them into a clinically meaningful number of
clusters. In Chapter 5, a fuzzy clustering algorithm featuring simulated annealing
(SAFC) was used to automatically detect the ‘optimal’ number of clusters found
within an FTIR dataset. The datasets used in that study comprised spectra that had
been collected from a variety of different tissue types. The SAFC algorithm begins
by generating a random number of clusters and then traverses the search space using
three different neighbourhood operations: i) perturb centre, ii) delete centre and iii)
split centre. The configuration with the minimum cluster validity index value is
returned. The results showed that this algorithm was able to obtain the same number
of clusters as clinical analysis in four out of seven datasets. However, smaller
Xie-Beni validity index values were achieved in some datasets even though the numbers
of clusters were different from the clinical analysis. Although the SAFC algorithm
performed well on the small datasets, it proved to be very time consuming on larger
datasets. With the aim of overcoming this problem, in this Chapter, a refined fuzzy
c-means based clustering algorithm was developed to find the ‘optimal’ number of
clusters.
Both the SAFC and fuzzy c-means based clustering algorithms can
automatically detect the number of clusters by selecting the clustering structure
that gives the minimum validity index value. However, both algorithms occasionally
identified an excessive number of clusters compared to clinical analysis. This was
partly due to the fuzzy c-means algorithm and cluster validity index, where all
distances between data points and cluster centres are calculated using their Euclidean
distances. This means that when the shapes of the clusters were significantly non-
spherical, the clustering and validity measures were not effective. However, the
complexity and range of the different cell types (e.g. healthy, pre-cancerous and
mature cancer) may also lead to an excessive number of clusters being identified.
The focus of this Chapter is on grouping the cells with the same clinical diagnosis
into one cluster so that the main types of the tissue can be explored through further
clinical analysis. In order to achieve this, it is necessary to combine the clusters with
most similar characteristics together, e.g. within the suspected pre-cancerous and
mature cancer cell types, as they may exhibit similar properties to one another even
though they are different stages of cancer. This information may be contained in the
existing infrared spectra.
In this Chapter, a new method is proposed to automatically merge clusters in an
iterative manner. The algorithm identifies the two most similar clusters generated by
the initial fuzzy c-means based clustering algorithm and merges them. The merged
cluster is then treated as a new cluster and takes part in the remainder of the
iterative merging process until a stopping criterion is reached. In these experiments,
either the Xie-Beni index (VXB) [74] or the Sun-Wang-Jiang index (VSWJ) [72] is used
as the cluster validity index, depending on the size of the dataset undergoing
clustering. From observation, it is apparent that VXB is more suitable for small FTIR
datasets (fewer than 1000 data points), and VSWJ is more suitable for large FTIR
datasets. This may be because the fuzzy c-means algorithm utilising the VSWJ index
often results in an unstable and excessive number of clusters in comparison with VXB
in small datasets, whereas VXB usually generates fewer clusters than expected in
large datasets. In the following Sections, the feature selection and reduction methods
are described. Subsequently the fuzzy c-means based clustering algorithm is
presented in brief and then the proposed cluster merging method is described in
detail. The new algorithm is then used to analyse the FTIR images of three selected
tissue sections, its results are analysed and conclusions are drawn.
7.2 Feature Extraction
The infrared spectra lymph node datasets utilised in this Chapter contain two
small datasets, which contain 276 and 343 spectra respectively, and three large
datasets, which have 5764, 7497 and 7216 spectra respectively. Each spectrum
comprises 821 absorbance values, one for each data point at intervals of 4 cm⁻¹.
Thus the total sizes of the data in these five datasets are 276×821 (2.3×10⁵),
343×821 (2.8×10⁵), 5764×821 (4.7×10⁶), 7497×821 (6.2×10⁶) and 7216×821 (5.9×10⁶),
respectively. As discussed in the previous Chapter, the application of clustering
algorithms to such large datasets can be very time consuming. In addition, it is
difficult to visualise the distribution of such data. In this Chapter, the same approach
as in Chapter 6 (PCA) is used to reduce the number of variables and also used to
permit visualisation. Once again, the first ten principal components (PCs) were used
as the input of the clustering method. For the five datasets used in this investigation,
the first two PCs represent, respectively, 78.9%, 73.8%, 70.9%, 89.1% and 94.1% of
the variance in the original datasets, whilst increasing this to the first ten PCs
improves the representation to 98.7%, 96.4%, 95.6%, 99.1% and 99.8%,
respectively. In order to visualise the data distribution, the original data can be
plotted in the first two PC dimensions. Details on feature reduction approaches have
been described previously by many authors, see, for example [128].
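The feature reduction step can be illustrated with a minimal sketch, assuming PCA is computed by singular value decomposition of the mean-centred spectra. The function name `pca_scores` and the synthetic data are hypothetical stand-ins; they are not the thesis implementation or the real FTIR measurements.

```python
import numpy as np

def pca_scores(spectra, n_components=10):
    """Project mean-centred spectra onto the leading principal components."""
    centred = spectra - spectra.mean(axis=0)
    # Economy-size SVD: the rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_components].T
    # Fraction of the total variance carried by each component.
    explained = s**2 / np.sum(s**2)
    return scores, explained

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 276 spectra x 821 absorbance values.
spectra = rng.normal(size=(276, 821)) * np.linspace(3.0, 0.1, 821)
scores, explained = pca_scores(spectra, n_components=10)
print(scores.shape)
print(explained[:2].sum(), explained[:10].sum())
```

The two printed fractions correspond to the kind of figures quoted above: the share of variance captured by the first two PCs (used for plotting) and by the first ten PCs (used as clustering input).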
7.3 Fuzzy C-Means Based Clustering Algorithm
The fuzzy c-means based clustering algorithm is a good example of using
cluster validity to determine the optimal number of clusters from a given dataset (as
shown in Chapter 2, Figure 2.7). It is composed of two parts, where the first part is
based on running the fuzzy c-means clustering method within a certain range of
number of clusters. The best data structure (C) is obtained by choosing the
corresponding optimal value of cluster validity V from all the possible clustering
structures [72]; the second part of this algorithm is based on taking the best data
structure from the first part, and then slightly perturbing each cluster centre for a
number of iterations while the validity index for each new data structure is
calculated. The data structure corresponding to the optimal validity index is then
returned. In the description of the algorithm shown in Figure 7.1, the minimal and
maximal numbers of clusters are referred to as cmin and cmax, respectively.
7.4 The Basis of a New Automated Method to Merge Clusters
The motivation for creating a new cluster merging method came from previous
attempts at clustering spectral data. In the previous analysis, a dataset that comprised
a variety of different tissue types was analysed via the SAFC and fuzzy c-means
based clustering algorithms. This dataset comprised 276 individual spectra that were
collected from regions of a tissue section diagnosed as ‘cancer’ (cancerous cortex
tissue: 159 spectra), ‘normal’ (benign cortex tissue: 72 spectra) and ‘reticulum’
(fibrocollagenous tissue: 45 spectra). In clinical diagnosis on this tissue section, the
histologist identified three different types of tissue. However, when the number of
clusters was set to a value of three for the SAFC and fuzzy c-means methods, the
clustering results did not match those of clinical diagnosis (possibly due to the
Euclidean distance measurement in fuzzy c-means). Some FTIR spectra from
cancerous regions were incorrectly grouped with spectra taken from regions of non-
cancerous tissue.
1) Set cmin and cmax (in the experiments, cmin = 2 and cmax = 10).
2) For c = cmin to cmax:
2.1) Initialise the cluster centres.
2.2) Apply the standard fuzzy c-means algorithm and obtain the new centres and the new
fuzzy partition matrix.
2.3) Calculate the cluster validity V.
3) Obtain the data structure (C) that corresponds to the optimal cluster validity index
value V.
4) Set the current C as the best data structure (Cbest).
5) For i = 1 to 100:
5.1) Randomly and slightly perturb the current C.
5.2) Calculate the new membership values and the validity index value V corresponding
to the new data structure (Cnew).
5.3) If the new V is smaller than the V value of the current Cbest, set Cnew as Cbest;
otherwise, go back to step 5.1.
5.4) Set Cbest as the current C, and go back to step 5.1.
5.5) End of for loop.
6) Return the best data structure Cbest with the optimal V value.
Figure 7.1 The fuzzy c-means based clustering algorithm.
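The two parts of Figure 7.1 can be sketched as follows. This is an illustrative sketch only, under these assumptions: the Xie-Beni index (generalised with the fuzzifier m) is used as V, centres are initialised from randomly chosen data points, and the perturbation is Gaussian noise scaled by a hypothetical `scale` parameter. None of these details, nor the function names, are specified in the text.

```python
import numpy as np

def memberships(x, centres, m=2.0):
    """Fuzzy membership of every point in every cluster (columns sum to 1)."""
    d2 = ((x[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)                     # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fuzzy_c_means(x, c, m=2.0, n_iter=100, tol=1e-6, rng=None):
    """Standard fuzzy c-means; returns centres and the (c, n) partition matrix."""
    rng = np.random.default_rng() if rng is None else rng
    centres = x[rng.choice(len(x), size=c, replace=False)]
    for _ in range(n_iter):
        um = memberships(x, centres, m) ** m
        new_centres = (um @ x) / um.sum(axis=1, keepdims=True)
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships(x, centres, m)

def xie_beni(x, centres, u, m=2.0):
    """Xie-Beni index: compactness over minimum centre separation (lower is better)."""
    d2 = ((x[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2)
    compactness = ((u ** m) * d2).sum()
    sep = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    sep[np.diag_indices(len(centres))] = np.inf    # ignore self-distances
    return compactness / (len(x) * sep.min())

def fcm_based_clustering(x, c_min=2, c_max=10, n_perturb=100, scale=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Part 1 (steps 1 to 4): keep the structure with the best validity over c.
    best_v, best_centres = np.inf, None
    for c in range(c_min, c_max + 1):
        centres, u = fuzzy_c_means(x, c, rng=rng)
        v = xie_beni(x, centres, u)
        if v < best_v:
            best_v, best_centres = v, centres
    # Part 2 (step 5): small random perturbations, accepted only if V improves.
    spread = x.std(axis=0)
    for _ in range(n_perturb):
        trial = best_centres + rng.normal(scale=scale, size=best_centres.shape) * spread
        v = xie_beni(x, trial, memberships(x, trial))
        if v < best_v:
            best_v, best_centres = v, trial
    labels = memberships(x, best_centres).argmax(axis=0)
    return best_centres, labels, best_v

# Synthetic two-dimensional example (assumption): three Gaussian blobs.
rng = np.random.default_rng(1)
blobs = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
x = np.vstack([rng.normal(loc=b, scale=0.3, size=(50, 2)) for b in blobs])
centres, labels, v = fcm_based_clustering(x, c_min=2, c_max=6)
print(len(centres), v)
```

Because the centres are initialised at random, the number of clusters returned may differ between seeds, which mirrors the run-to-run variation reported later in this Chapter.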
When the fuzzy c-means based clustering algorithm was subsequently applied,
four clusters were obtained. However, two of the four clusters corresponded to one
type of tissue (cancerous). In the remaining two clusters, the majority of the data
were classified into the correct group (although two spectra were
misclassified). The results are shown in Figure 7.2. The excessive number of
clusters may have been caused by the fact that the FTIR spectra taken from
cancerous regions were taken from diverse areas of tissue, which might have
contained cells at different stages of the cancer (e.g. pre-cancerous and mature
cancer). As mentioned before, at this stage, we only wish to cluster cells which have
the same clinical diagnosis.
Figure 7.2 An extracted spectral dataset after applying fuzzy c-means based clustering algorithm. [Scatter plot: x-axis 1st Principal Component, y-axis 2nd Principal Component; legend: reticulum, cancer, normal, cancer, centre; clusters labelled C1 to C4.]
In the literature, many split-and-merge techniques have been used to determine
the correct number of clusters (see Chapter 2, Section 2.4). However, in general, all
these algorithms perform the split and merge procedure based on the dataset itself.
Besides these, the merge criteria used within a variety of other clustering algorithms
have also been reviewed in Chapter 2 (Section 2.5). Two alternative types of merge
criteria that have previously appeared in the literature are illustrated by using the
example shown in Figure 7.2. The first criterion identifies and merges clusters that
lie ‘closest’ to each other in multi-dimensional space [39]. In contrast, the second
criterion identifies the ‘worst’ clusters and merges them together. This is achieved
by use of a cluster validity function, the most common of which measure the
compactness of the defined clusters [79]. Informally, a ‘good’ cluster is defined by
the property that data points within the cluster are tightly condensed around the
centre (high compactness). When applying these criteria to the dataset shown in
Figure 7.2, the two closest clusters are C2 and C3 (see the distance between each
cluster centre). In this dataset, C1 and C2 are more compact than the other two and
the worst two clusters are C3 and C4. However, the two clusters that should be
merged together are C1 and C3 (both are collected from cancer tissue). Hence,
neither of these approaches for merging clusters was suitable for solving the
problems encountered in FTIR clustering.
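The two existing merge criteria can be contrasted with a small sketch. It assumes Euclidean distance between centres for the first criterion and, for the second, takes compactness to be the mean squared distance of a cluster's points to its centre; the cited validity functions may define compactness differently. The data are hypothetical and constructed so that the two criteria select different pairs, as happens in Figure 7.2.

```python
import numpy as np
from itertools import combinations

def closest_pair(centres):
    """Criterion 1: the two clusters whose centres are nearest to each other."""
    return min(combinations(range(len(centres)), 2),
               key=lambda ij: np.linalg.norm(centres[ij[0]] - centres[ij[1]]))

def least_compact_pair(x, labels, centres):
    """Criterion 2: the two clusters with the largest mean squared scatter."""
    scatter = [np.mean(np.sum((x[labels == k] - centres[k]) ** 2, axis=1))
               for k in range(len(centres))]
    worst = np.argsort(scatter)[-2:]
    return tuple(sorted(int(k) for k in worst))

# Hypothetical data: four points placed symmetrically around each centre.
centres = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 4.0]])
spreads = [0.1, 0.2, 1.0, 1.5]
offsets = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
x = np.vstack([c + s * offsets for c, s in zip(centres, spreads)])
labels = np.repeat(np.arange(4), 4)

print(closest_pair(centres))                   # (0, 1)
print(least_compact_pair(x, labels, centres))  # (2, 3)
```

Neither criterion consults the spectra themselves, which is why both can miss a pair of clusters that are biochemically similar but neither adjacent nor diffuse in PCA space.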
An alternative solution was developed based on examining the original infrared
spectra rather than searching for a relationship using the clustering structures in the
PCA space. Plotting the mean spectra from the separate clusters allows the major
differences between them to be more clearly visualised. The similarity between
clusters is more obvious at the wave-number corresponding to the IR frequency that
provides the largest variance between spectra. The proposed automated cluster
merging method described in this Chapter is based on this observation and can be
divided into two main steps. The first step is to identify the frequency at which the
greatest variance between mean spectra is observed. The second step is to repeatedly
determine the most similar clusters and merge them, until a suitable termination
criterion has been reached. These two steps are described in detail below.
Step 1: Determine a Reference Frequency
The reference frequency is defined as the frequency at which the biggest
difference between any two mean spectra is found. The full procedure of
determining this frequency is shown in Figure 7.3.
1) Obtain the clustering results from the fuzzy c-means based clustering algorithm.
2) Calculate the mean spectrum Āi for each cluster,
$$\bar{A}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} A_{ij} \qquad (i = 1 \ldots c) \qquad (7.1)$$
where Ni is the number of spectra in cluster i; Aij is the absorbance of
spectrum j in cluster i; c is the number of clusters. The size of Āi is p, the
number of wave-numbers in each spectrum (each mean spectrum is a vector of p
elements).
3) Compute the vector of pair-wise absolute differences Dij between all mean
spectra,
$$D_{ij} = \left| \bar{A}_i - \bar{A}_j \right| \qquad (i = 1 \ldots c,\ j = 1 \ldots c) \qquad (7.2)$$
4) Find the largest single element, dmax, within the set of vectors D.
5) Determine the frequency corresponding to the maximal element dmax.
Figure 7.3 The procedure of determining a reference wave-number.
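The procedure in Figure 7.3 can be sketched in a few lines of NumPy. The function name `reference_wavenumber` and the toy spectra below are hypothetical; only the steps themselves (Equations 7.1 and 7.2, followed by locating dmax) come from the text.

```python
import numpy as np

def reference_wavenumber(spectra, labels, wavenumbers):
    """Return the wavenumber at which two cluster mean spectra differ most."""
    clusters = np.unique(labels)
    # Equation (7.1): one mean spectrum per cluster, shape (c, p).
    means = np.array([spectra[labels == k].mean(axis=0) for k in clusters])
    # Equation (7.2): pair-wise absolute differences, shape (c, c, p).
    diffs = np.abs(means[:, None, :] - means[None, :, :])
    # Steps 4 and 5: largest single element and its position on the wavenumber axis.
    i, j, k = np.unravel_index(np.argmax(diffs), diffs.shape)
    return wavenumbers[k], diffs[i, j, k], (clusters[i], clusters[j]), means

# Toy data (assumption): 6 spectra, 4 wavenumbers, 3 clusters.
wavenumbers = np.array([2916.0, 2920.0, 2924.0, 2928.0])
spectra = np.array([[1.0, 2.0, 4.4, 2.0],
                    [1.0, 2.0, 4.6, 2.0],
                    [1.0, 2.1, 3.4, 2.0],
                    [1.0, 2.1, 3.4, 2.0],
                    [1.1, 2.2, 2.8, 2.1],
                    [1.1, 2.2, 2.8, 2.1]])
labels = np.array([0, 0, 1, 1, 2, 2])
ref, d_max, pair, means = reference_wavenumber(spectra, labels, wavenumbers)
print(ref, pair)
```

In this toy case the largest difference lies between clusters 0 and 2 in the third column, so the reference wavenumber is 2924.0.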
Figure 7.4 Mean infrared spectra obtained from different clusters.
The mean spectra obtained for four clusters are displayed in Figure 7.4. The set of
differences, D, was calculated between each pair of mean spectra using Equation
(7.2). The largest difference dmax exists between C1 and C4, as shown in Figure 7.5.
The IR frequency that corresponds to dmax is 2924 cm⁻¹.
Figure 7.5 Enlarged region of Figure 7.4.
[Figures 7.4 and 7.5: x-axis Wavenumber/cm⁻¹; y-axis Absorbance (×10⁻³); legend: C1 (cancer), C2 (normal), C3 (cancer), C4 (reticulum). Figure 7.4 marks the region enlarged in Figure 7.5.]
Step 2: Automatically Merging Clusters
The next step is to identify the most similar clusters and then to merge them
together. Similarity is determined by using the absorbance intensity of each mean
spectrum at the reference frequency. Clusters are therefore merged depending upon
similarities in their IR spectra rather than clustering structure in multivariate space.
As this is an iterative process, the merging procedure ends when at least one of the
termination criteria has been satisfied. Assume that there are currently C mean spectra.
The detailed procedure is shown in Figure 7.6.
In this Section, the same example as used above is utilised to illustrate this
procedure. In Figure 7.7, Āi (i=1...4) are the mean absorbance values from the
normalised spectra of each obtained cluster. These are Ā1 = 0.0045, Ā2 = 0.0034,
Ā3 = 0.0041 and Ā4 = 0.0028, respectively. The line corresponds to the reference
frequency of 2924 cm⁻¹. After sorting Āi in ascending order, their new arrangement
is Ā4, Ā2, Ā3 and Ā1. The distances between adjacent average absorbance intensities
are represented as dist = {d1, d2, d3}. It is then trivial to calculate d1 = 0.0006,
d2 = 0.0007 and d3 = 0.0004. d3 is the minimum distance, distmin. The average of the
rest of dist, (0.0006+0.0007)/2 = 0.00065, is greater than distmin. This satisfies the
merging condition in step (4), and so the two clusters which correspond to distmin
(i.e. C1 and C3) are merged together. After this, the average of the mean spectral
absorbances of these two clusters (Ānew = 0.0043) replaces these values. The new array
of mean spectral absorbance intensities is then re-sorted to be Ā4,
1) Obtain the C absorbance values of the mean spectra at the reference frequency from
step 1 and sort them in ascending order.
2) Calculate the distances dist between adjacent sorted absorbance values
(note that the size of dist is now C-1).
3) Pick the smallest distance distmin and identify the two most similar clusters,
which correspond to this distance.
4) Merge these two clusters if they satisfy the merging condition: distmin ≤ average of
the rest of dist (without distmin). The average absorbance value of the two merged
clusters is then calculated and is considered as a new object in the remaining
merging iterations. Go back to 1).
5) When there are only two dist left, merge the two clusters which correspond to
distmin if either of the following merging conditions is satisfied: distmin ≤ 1/2 rest
of dist OR (distmin - 1/2 rest of dist)/distmin ≤ 0.1. Again, the average of these two
mean spectra absorbances is considered as a new object to replace them.
6) The merging process stops if there are only two clusters left or no merging
conditions are satisfied.
Figure 7.6 The procedure for automatically merging clusters.
Ā2 and Ānew, as displayed in Figure 7.8. The corresponding new distances are dnew1 =
0.0006 and dnew2 = 0.0009; see step (5). As distmin (0.0006) is not smaller than or
equal to 1/2 rest of dist (0.00045), the first condition is not satisfied; the second
condition also fails, since (0.0006 - 0.00045)/0.0006 = 0.25 is greater than 0.1.
Hence, in this situation, no merging conditions are satisfied, and so the
iterative process stops as defined in step (6).
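The worked example can be reproduced with a short sketch of the procedure in Figure 7.6. The function name `merge_clusters` and the dictionary input format are hypothetical; the merging conditions and the four absorbance values are taken from the text.

```python
def merge_clusters(values):
    """Iteratively merge clusters by mean absorbance at the reference frequency.

    `values` maps a cluster name to its mean absorbance; the result is a list
    of groups, each a tuple of the original cluster names.
    """
    # Step 1: sort the absorbance values in ascending order.
    groups = sorted((v, (name,)) for name, v in values.items())
    while len(groups) > 2:                      # step 6: stop at two clusters
        # Step 2: distances between adjacent sorted values.
        dists = [groups[i + 1][0] - groups[i][0] for i in range(len(groups) - 1)]
        # Step 3: smallest distance and the pair it belongs to.
        k = min(range(len(dists)), key=dists.__getitem__)
        d_min = dists[k]
        rest = dists[:k] + dists[k + 1:]
        if len(rest) >= 2:
            ok = d_min <= sum(rest) / len(rest)                   # step 4
        else:
            half = rest[0] / 2                                    # step 5
            ok = d_min <= half or (d_min - half) / d_min <= 0.1
        if not ok:
            break                               # step 6: no condition satisfied
        (a, ga), (b, gb) = groups[k], groups[k + 1]
        groups[k:k + 2] = [((a + b) / 2, ga + gb)]   # average replaces the pair
        groups.sort()
    return [g for _, g in groups]

example = {"C1": 0.0045, "C2": 0.0034, "C3": 0.0041, "C4": 0.0028}
print(merge_clusters(example))   # [('C4',), ('C2',), ('C3', 'C1')]
```

As in the text, C1 and C3 are merged in the first iteration and the procedure then stops with three groups, because neither condition of step (5) holds for the remaining two distances.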
Figure 7.7 Four mean spectra absorbance at reference wave-number. [Markers shown: Ā4, Ā2, Ā3, Ā1.]
Figure 7.8 The resultant absorbance distribution obtained after merging the two most similar clusters. [Markers shown: Ā4, Ā2, Ānew.]
The merging condition in step (5) differs from the one used when more than two
distances remain, as shown in step (4). This is because, when only two distances
(i.e. three clusters) are left, the condition in step (4) could merge two clusters
whose mean spectra absorbance distance is only slightly less than, and nearly equal
to, the other distance. For example, in Figure 7.9, if d2 is slightly less than d1,
then clusters b and c would be merged together. Visually, this is not convincing. In
order to achieve the same effect as in the previous merging scenarios, the merging
conditions described in step (5) are used instead. For example, in Figure 7.10, if d2
is smaller than half of d1 (similar to the case where there are three distances), or
the amount by which d2 exceeds half of d1 is no more than one tenth of d2, then
clusters b and c are merged together.
In summary, the whole algorithm for the automated merging of clusters is
shown in Figure 7.11.
Figure 7.9 The merging situation when there are two dist left (type 1). [Markers a, b, c; distances d1 and d2.]
Figure 7.10 The merging situation when there are two dist left (type 2). [Markers a, b, c; distances d1 and d2.]
Entire Automated Merging Clustering Procedure
1) The FTIR spectra are initially pre-processed to account for irregularities in cell
density across the tissue section. This includes baseline correction and peak area
normalisation.
2) PCA is applied to the processed dataset to reduce its dimensionality. Only the
first 10 PCs are extracted and utilised for subsequent fuzzy c-means cluster
analysis.
3) The fuzzy c-means based clustering algorithm is applied to this reduced dataset
and the optimal clustering structure is adopted by finding the best clustering Cbest with
the minimal VXB or VSWJ value.
4) The merge cluster algorithm then identifies the reference frequency at which the
variance is maximal for the calculated mean spectra.
5) The algorithm merges the calculated clusters until a stop criterion is reached.
6) The final clustering results are mapped back onto the original 2D FTIR image that was
collected. Each pixel/spectrum within this image is assigned a colour dependent
on the cluster to which it belongs.
Figure 7.11 Entire automated merging clustering procedure.
To demonstrate the new algorithm in its entirety, the same dataset is utilised
again. The clustering results are shown in Figure 7.12. Comparison with
Figure 7.2 demonstrates the successful merging of spectra from regions of
cancerous tissue.
Figure 7.12 The extracted spectral dataset after applying the proposed automated merging cluster method. [Scatter plot: x-axis 1st Principal Component, y-axis 2nd Principal Component; legend: normal, reticulum, cancer, centre.]
7.5 Experimental Results
In these experiments, the FTIR spectral datasets can be divided into two types,
based on the method of data collection. The first type is termed the ‘extracted
datasets’, to indicate that the spectral data within each dataset were individually but
not adjacently taken from different types of tissues. A sample of the extracted
dataset is shown in Figure 7.13. The two circled regions are cancer and normal
tissue sections, and the rest is reticulum. The plus symbols, ‘+’, represent the points
chosen for FTIR spectra analysis. There were three extracted datasets (one from the
lymph node dataset, named extracted LNII7, and two from the previous oral cancer
dataset 3 and dataset 5) used in these experiments.
Figure 7.13 An example of an extracted dataset.
The second type of dataset is termed the ‘whole sub-area datasets’, to
indicate that the spectral data within each dataset are taken from a whole sub area of
an axillary lymph node which contains several tissue sections. In contrast to the first
type of dataset, this type was captured from an entire sub-area of tissue. A
corresponding sample is shown in Figure 7.14. Within an entire lymph node tissue
sample, which contains only normal and cancer tissue sections, a whole sub area
which includes both tissue types was selected. Spectra were numbered following the
grid arrangement, sequentially from left to right, and from bottom to top, as shown
by the direction of the arrows in Figure 7.14. There were three whole sub area
lymph node datasets used in these experiments, namely LNII7, LNII5 and LN57.
Figure 7.14 An example of a whole sub area of lymph node dataset.
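Given the numbering convention described above (left to right, then bottom to top), per-spectrum cluster labels can be arranged back into an image as sketched below. The function name `labels_to_image` and the 2 × 3 grid are hypothetical; the real grid dimensions depend on the sampled area.

```python
import numpy as np

def labels_to_image(labels, n_rows, n_cols):
    """Arrange per-spectrum cluster labels on the acquisition grid.

    Spectra are assumed to be numbered left to right within a row, with rows
    running from the bottom of the sampled area to the top.
    """
    grid = np.asarray(labels).reshape(n_rows, n_cols)   # row 0 = bottom row
    return np.flipud(grid)                              # row 0 = top row, as displayed

# Hypothetical labels for a 2 x 3 grid: bottom row first, then the top row.
labels = [0, 0, 1,
          0, 1, 1]
image = labels_to_image(labels, n_rows=2, n_cols=3)
print(image.tolist())   # [[0, 1, 1], [0, 0, 1]]
```

Each entry of the resulting array can then be mapped to a colour to produce cluster images of the kind shown later in this Chapter.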
These two types of FTIR dataset were examined using the newly developed
algorithm. The collected IR spectral datasets from each tissue section (apart from
dataset 3 and dataset 5) were initially clustered using the fuzzy c-means based
clustering algorithm. The generated clusters were then combined using the newly
proposed automated merge clustering method to help define the main types of tissue
found in each section. It should be noted that the initial clustering results of dataset
3 and dataset 5 were taken from previous outputs of SAFC algorithm (see Chapter 5
for details). However, these two datasets have only been used for purposes of
verification of the automated merge clustering method. Due to the nature of the
random initialisation, the final clustering results may vary. Therefore, the fuzzy c-
means based clustering algorithm was applied ten times for each dataset. The results
are displayed in Figures 7.15 to 7.20, below.
7.5.1 Extracted Datasets
Figure 7.15 (a) shows the clustering results of extracted LNII7 obtained from
the fuzzy c-means based clustering algorithm (where the VXB index was used) in the
space of the first two PCs. Two types of tissue, namely cancer and normal, are
contained in this LNII7 tissue section (details of the LNII7 tissue are given later).
A total of 343 spectra were extracted; among these, 105 were defined as normal by the
pathologist and 238 as cancerous. After applying the fuzzy c-means based clustering
algorithm, three clusters were obtained, two of which correspond to cancerous
tissue, as shown in Figure 7.15(a). Figure 7.15(b) displays the results
after applying the newly developed automated merge clustering method. It clearly
shows that the separate cancer clusters have now been correctly merged.
Figure 7.15 (a) Extracted LNII7 clustering results after applying fuzzy c-means based clustering algorithm. (b) Extracted LNII7 merged clusters results. [Scatter plots: x-axis 1st Principal Component, y-axis 2nd Principal Component; legend (a): cancer, normal, cancer, centre; legend (b): normal, cancer, centre.]
In order to verify the automated merge clustering algorithm, the method was
further applied to the previous oral cancer datasets (see Chapter 5), for which the
SAFC clustering algorithm obtained three clusters, rather than the two determined by
the histological analysis. Two datasets suffer this problem, namely dataset 3 and
dataset 5. Figure 7.16(a) and Figure 7.17(a) display the initial clustering results
obtained from the SAFC algorithm for these two datasets, respectively. Figure
7.16(b) and Figure 7.17(b) show the corresponding results obtained after applying the
proposed merge clustering algorithm, in which the separate tumour clusters have
again been correctly merged. Although, in dataset 3, the spectral data point numbered
34 was still misclassified after merging clusters, the accuracy of the results was, in
general, much improved when compared to the previous SAFC clustering results (see
Figure 5.8).
Figure 7.16 (a) Dataset 3 clustering results obtained from SAFC algorithm. (b) Dataset 3 merged clusters results. [Scatter plots of spectra 34 to 44: x-axis 1st Principal Component, y-axis 2nd Principal Component; legend (a): tumour, tumour, stroma, centre; legend (b): stroma, tumour, centre.]
Figure 7.17 (a) Dataset 5 clustering results obtained from SAFC algorithm. (b) Dataset 5 merged clusters results. [Scatter plots of spectra 101 to 130: x-axis 1st Principal Component, y-axis 2nd Principal Component; legend (a): stroma, tumour, tumour, centre; legend (b): stroma, tumour, centre.]
7.5.2 Whole Sub Area Datasets
The tissue section displayed in Figure 7.18 was collected from a positive
axillary lymph node named LNII7 that displayed large areas of malignancy.
Invading cancer from the breast had almost fully infiltrated the lymph node, with only
small remnants of normal nodal tissue remaining. Figure 7.18 (a) displays the total
absorbance IR image collected from an area on the tissue section where both
cancerous and normal nodal tissue were present. The IR spectral dataset comprised
5764 spectra. Figure 7.18 (b) displays the H&E stained image for the same area
collected from the parallel stained tissue section. Initial clustering results from the
fuzzy c-means based clustering algorithm are shown in Figures 7.18 (c) – (e) and the
final merged cluster image is shown in Figure 7.18 (f).
Figure 7.18 Lymph node tissue section LNII7. Sampled area was 275 µm × 818.75 µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm (each colour represents a different cluster of IR spectra): (c) 5 cluster image (d) 6 cluster image (e) 9 cluster image (f) Final results obtained from automated merge clustering algorithm; this image contained two final clusters of IR spectra. [Annotated regions: normal cortex tissue; cancerous cortex.]
The second tissue section was also taken from a positive axillary lymph node,
named LNII5 (from the same tissue as previously described in Chapter 6). The area
studied by IR displayed several different types of tissue and the existence of a
secondary follicle comprising proliferating B-lymphocytes. The total
absorbance IR image and the H&E stained image are shown in Figure 7.19 (a) and
(b) respectively. The IR spectral dataset for the examined region comprised a total of
7497 spectra. Only two distinct clustering results were obtained from the initial
clustering algorithm, and are displayed in Figure 7.19 (c) and (e). Merging each of
these clustering structures resulted in three final clusters, as shown
in Figure 7.19 (d) and (f).
The final tissue section examined, named LN57, was collected from a benign
axillary lymph node. This node had been surrounded by large areas of fatty tissue
which had, in some regions, infiltrated close to the capsule of the node. An infrared
image was collected from this region which comprised 7216 spectra. The total IR
absorbance and H&E stained images from the examined sample area are shown in
Figure 7.20 (a) – (b). Initial clustering produced three different results, as shown in
Figure 7.20 (c) – (e). Further merging resulted in all images being made up of three
clusters as shown in Figure 7.20 (f).
Figure 7.19 Lymph node tissue section LNII5. Sampled area was 306.25 µm × 956.25 µm in size. (a) Total absorbance IR image (b) H&E stained image. Results after fuzzy c-means based clustering algorithm (each colour represents a different cluster of IR spectra): (c) 5 cluster image (d) merged cluster result from 5 cluster image (e) 4 cluster image (f) merged cluster result from 4 cluster image. Both merged cluster results contained three clusters of IR spectra. [Annotated regions: cancerous cortex tissue; secondary follicle (normal cortex tissue); reticulum (fibrocollagenous tissue); capsule (fibrocollagenous tissue); normal cortex tissue.]
Figure 7.20 Lymph node tissue section LN57. Sampled area was 550 µm × 512.5 µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm (each colour represents a different cluster of IR spectra): (c) 3 cluster image (d) 4 cluster image (e) 5 cluster image (f) Final result obtained from automated merge clustering algorithm. Image contained three final clusters of IR spectra. [Annotated regions: capsule; fatty tissue with capsule; cortex.]
To help understand these results from the initial fuzzy c-means based
clustering algorithm, the clustering results obtained have been listed in Table 7.1.
This shows the number of clusters that were initially determined by the fuzzy c-
means based algorithm, and the number of clusters that were finally obtained after
merging. The value enclosed in parentheses indicates the number of times that the
specified number of clusters was returned by the fuzzy c-means based clustering
algorithm. For example, when studying the results for lymph node LNII7, the initial
algorithm obtained five clusters in two of the runs and six clusters in another five of
the runs.
Table 7.1 The number of clusters obtained at different stages of clustering.
Dataset   Number of clusters after fuzzy c-means     Number of clusters after automated
          based clustering algorithm                 merging cluster algorithm
LNII7     5(2)                                       2
          6(5)                                       2
          9(3)                                       2
LNII5     5(9)                                       3
          4(1)                                       3
LN57      3(7)                                       3
          4(1)                                       3
          5(2)                                       3
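The "clusters(runs)" entries in Table 7.1 can be produced from the per-run cluster counts with a simple tally, sketched below. The function name `summarise_runs` is hypothetical, and the ordering of the ten LNII7 runs is an assumption; only the totals (five clusters twice, six clusters five times, nine clusters three times) come from the table.

```python
from collections import Counter

def summarise_runs(cluster_counts):
    """Format repeated-run results as 'clusters(runs)' entries, as in Table 7.1."""
    tally = Counter(cluster_counts)
    return " ".join(f"{c}({n})" for c, n in sorted(tally.items()))

# Number of clusters returned in each of ten runs on LNII7 (run order assumed).
lnii7_runs = [5, 5, 6, 6, 6, 6, 6, 9, 9, 9]
print(summarise_runs(lnii7_runs))   # 5(2) 6(5) 9(3)
```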
7.6 Discussion of Results
When scrutinising the results from the initial fuzzy c-means based clustering
algorithm or SAFC clustering algorithm, a number of different clustering results were
obtained for the same dataset. This may be due to the random initialisation at the
beginning of the clustering process. A suitable initial cluster configuration may
prevent the occurrence of this problem. However, this can be considered as a research
topic in its own right.
A tendency to produce an excessive number of clusters from initial clustering
(SAFC or fuzzy c-means based clustering algorithm) was observed. The additional
clusters created by the algorithm may describe potential subtypes of tissue that are
presently not identified by conventional histopathology. After the application of the
novel cluster merging algorithm, a more stable clustering result was obtained. This
result more clearly described the main types of tissue that existed within the samples
analysed, especially in whole sub area lymph node tissue sections.
When comparing the two types of experimental datasets, it is easier to interpret
the results obtained from the extracted FTIR spectral datasets, because different
types of spectral data points were withdrawn from a tissue section area for which the
type is certain. In contrast to this, the whole sub area lymph node FTIR spectra
datasets are more complex, since they are taken from entire sub-areas of tissue
samples. In the following, the clustering results obtained from the three lymph node
tissue sections LNII7, LNII5 and LN57 are analysed in more detail.
As shown in the H&E stained image for tissue section LNII7 (Figure 7.18b),
the invading cancerous tissue exists at the bottom of the collected sample area (pink
colouration) and the normal tissue at the top (purple colouration). When studying the
initial clustering results (Figures 7.18c-e), it can be seen that several extra clusters
have been created in both the normal and cancerous areas. It is possible that the
clusters found in the area diagnosed as being cancerous might be representative of
several different sub-classes of malignancy not normally recognised by histology. In
contrast, the extra clusters found in the normal area could be descriptive of normal
tissue that is beginning to take on cancerous characteristics. However, it would be
preferable to merge possible sub-types of tissue into one defining group. This would
allow a simplified characterisation of the tissue section to be made. After applying
the newly proposed merge method, the initial overly complex images were
merged into a single, simpler one (Figure 7.18f). This newly created image
is more representative of the main characteristics of the tissue section.
The area examined on tissue section LNII5 revealed the more complex
infrastructure of a lymph node (Figure 7.19b). Tissue types found within the sample
area include capsule, reticulum, normal cortex and cancerous cortex. It should also
be noted that a small spherical region known as a secondary follicle was present in
the centre of the normal cortex, where rapidly proliferating lymphocytes have
congregated. The initial clustering image shown in Figure 7.19c,
displaying five clusters of IR spectra, was obtained in nine out of ten repeats. When
compared against the H&E stained image for the same area (Figure 7.19b), the full
range of tissue types has been correctly characterised, including the secondary
follicle. When applying the automated merge method, the clusters which are merged
have similar biochemistry, producing three main clusters that describe the main types
of tissue found within the sample area (Figure 7.19 d). These include
fibrocollagenous tissue (capsule and reticulum), normal cortex tissue (normal cortex
and secondary follicle) and finally cancerous cortex tissue.
The second and rare type of output from initial clustering is shown in Figure
7.19 (e), in which four clusters were obtained for the same IR spectral dataset. The
main difference between this output and the first is the incorrect grouping of both
reticulum and normal cortex into the same cluster. This is a consequence of the
spectra for these types of tissue being close in proximity to each other in PCA space.
Due to random initialisation of cluster centres, the algorithm has on this occasion
calculated that four clusters would best describe the data structure, representing a
local minimum validity index value. Hence these two types of tissue spectra were
not separated but grouped into the same cluster. After the merging process,
reticulum was therefore grouped with normal cortex tissue rather than capsule tissue.
However, the merge method still discriminated the capsule, normal cortex and
cancerous cortex tissue.
Tissue section LN57 was collected from a benign lymph node that exhibited
large surrounding areas of fatty tissue. The sample area analysed via IR contained
three main types of tissue. These were normal cortex tissue (mainly made up of a
secondary follicle), capsule tissue, and small pockets of fatty tissue that had in some
regions infiltrated the capsule of the node. Three results obtained from the initial
clustering algorithm are shown in Figures 7.20 (c)–(e). The three clusters shown in
Figure 7.20 (c) characterise the main tissue types found in the examined sample area.
The green colour in the image describes the capsule of the lymph node, whereas the
red areas represent locations of fatty tissue invasion. Finally, the blue colour is
descriptive of normal cortex tissue.
The second type of output, as shown in Figure 7.20 (d), displays four clusters
of tissue spectra. Again the capsule has been separated into two clusters that
describe areas with or without fatty tissue (blue and yellow colours respectively).
But on this occasion an area can be seen that lies beneath the capsule of the lymph
node (cyan colour). This is likely to describe a region of the cortex called the
subcapsular sinus that lies directly underneath the capsule and allows lymph to enter
the node. The final type of output comprised five clusters, as displayed in Figure
7.20 (e). A similar scenario to the previous appears to have occurred. However, on
this occasion an additional cluster has been created that may further describe a layer
of fatty tissue within the capsule (red colour). After the automated merge clustering
method was applied to these different outputs, a final result comprising three clusters
was obtained, corresponding to the three tissue types present. It should be noted that
when the initial output was for three clusters, the merge method did not attempt to
further combine these, thus verifying the robustness of this cluster structure.
From these experiments, it can be seen that the proposed automated merge
cluster method can rapidly and efficiently identify the major biochemically distinct tissue types
in the existing samples. However, in order to transfer this algorithm into the
clinical setting an extensive and rigorous verification and evaluation programme
would have to be conducted.
7.7 Summary
In this Chapter, a fuzzy c-means based clustering algorithm was applied to
automatically generate the ‘optimal’ cluster structure for a given dataset (i.e. the
structure that yields the minimum validity index value). However, due to the
complexity of biological systems, an excessive number of clusters can sometimes be
obtained. In order to address this problem, an automated cluster merging algorithm
was developed and described in this Chapter. To demonstrate the proposed
algorithm, six FTIR spectra datasets (two from oral cancer tissue sections, and four
from axillary lymph node tissue sections) were analysed using this method. The
results indicated that the clusters that have similar biochemistry were successfully
merged and, therefore, demonstrated that the algorithm is successful in determining
the main tissue types within the different sections used. Further verification and
evaluation of this novel method would be required in order to transfer this algorithm
into the clinical setting.
Chapter 8
173
CHAPTER 8
Conclusions
Cancer has become a major adversary to human health, and the development
and enhancement of techniques for use in its diagnosis and treatment has
increasingly become a focus of worldwide research. Fourier Transform Infrared
(FTIR) spectroscopy is a powerful tool for determining the biochemical composition
within a biological system. This capability to provide an insight into the biochemical
changes that occur within cells has led, in recent years, to FTIR spectroscopy being
investigated in the study of various biomedical conditions.
In order to analyse the FTIR spectroscopic data from tissue samples,
multivariate clustering techniques have often been used to separate sets of unlabelled
infrared spectral data into different clusters based on their characteristics. The
purpose of clustering is to group the spectral data such that the data in the same
clusters are as similar as possible and data within different clusters are as dissimilar
as possible. Hence, different types of cells can be separated within biological tissue.
Among existing clustering techniques, it has been shown that fuzzy clustering
techniques such as fuzzy c-means can have clear advantages over crisp and
probabilistic clustering methods, and they have been widely used in medical
diagnosis and pattern recognition. This thesis focuses on the development of fuzzy
clustering techniques that are able to automatically classify the cells present in a
variety of tissue sections and to investigate whether infrared spectroscopy can be
used as a diagnostic probe to identify early stages of cancer. In this Chapter, the
contributions of this thesis are summarised in the next Section, followed by a
discussion of some of the avenues of possible future work. Finally, the
dissemination which has resulted from this research is listed.
8.1 Contributions
This thesis has made the following contributions:
8.1.1 Comparison of the performance of three frequently used clustering techniques,
namely hierarchical clustering analysis, k-means and fuzzy c-means, on oral cancer FTIR
spectral data.
Hierarchical clustering analysis, k-means and fuzzy c-means algorithms are
three frequently used clustering methods in infrared spectroscopy analysis.
However, a systematic comparison of these techniques on oral cancer FTIR spectra
had not been performed prior to the present work. Furthermore, in previous
analyses [17,19,20,22,26,98,100,101,129] extra pre-processing steps, such as
mean-centring, variance scaling and first derivatives, had been carried out in an ‘ad hoc’
manner prior to further analysis being undertaken. Another benefit of the clustering
methods developed in this thesis is that all FTIR spectral data undergo only basic
pre-processing, for example, water vapour removal, baseline correction and
normalisation to account for irregularities in cell density across the tissue section.
All these techniques are well established for FTIR spectra and can be easily
automated.
In Chapter 4, experiments based on these three techniques were carried out on
seven FTIR spectra datasets taken from patients with oral cancer, and the
results were compared and discussed. That Chapter also introduced a novel method of
analysis in which the number of spectra for which each clustering result disagreed
with the clinical diagnosis from pathologists was
used to evaluate the quality of the clustering methods. This makes the differences
between the techniques easier to identify.
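The disagreement count described above can be illustrated with a minimal Python sketch. The exact matching procedure of Chapter 4 is not reproduced in this Chapter, so the sketch assumes that each cluster is named after the most common clinical label among its members; the function name `count_disagreements` and the toy labels are hypothetical.

```python
import numpy as np

def count_disagreements(cluster_ids, clinical_labels):
    """Count spectra whose clinical label differs from the majority
    clinical label of the cluster they were assigned to (an assumed rule)."""
    cluster_ids = np.asarray(cluster_ids)
    clinical_labels = np.asarray(clinical_labels)
    disagreements = 0
    for c in np.unique(cluster_ids):
        members = clinical_labels[cluster_ids == c]
        # Name the cluster after its most common clinical label.
        values, counts = np.unique(members, return_counts=True)
        majority = values[np.argmax(counts)]
        disagreements += int(np.sum(members != majority))
    return disagreements

# Toy data: seven spectra, two clusters, two clinical tissue labels.
clusters = [0, 0, 0, 1, 1, 1, 1]
diagnosis = ["tumour", "tumour", "stroma", "stroma", "stroma", "stroma", "tumour"]
n_disagree = count_disagreements(clusters, diagnosis)
print(n_disagree)
```

A lower count indicates closer agreement with the pathologists' diagnosis, which gives a single number for comparing clustering techniques on the same dataset.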
8.1.2 Improvement of the previously introduced ‘Variable String Length Simulated
Annealing’ (VFC-SA) algorithm to enhance its stability and efficiency.
VFC-SA clustering is a method featuring a simulated annealing algorithm in
which a cluster validity index measure was used as the energy function in order to
automatically determine the best number of clusters. The advantage of using
simulated annealing is that it can escape local optima of cluster configurations as
present in the standard fuzzy c-means algorithm and hence may be able to find
globally optimal solutions. The Xie-Beni index was used as the cluster validity
index to evaluate the quality of the solutions. However, during the implementation
of this proposed algorithm, it was found that sub-optimal solutions could be obtained
in certain circumstances.
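As an illustration of the energy function mentioned above, the following Python sketch computes the Xie-Beni index in its commonly quoted form: the membership-weighted within-cluster scatter divided by the number of points times the smallest squared distance between two cluster centres. The array layout, the function name `xie_beni` and the fuzzifier default `m=2.0` are assumptions of this sketch rather than details taken from the thesis implementation.

```python
import numpy as np

def xie_beni(X, centres, U, m=2.0):
    """Xie-Beni validity index (lower values indicate a better partition).

    X: (n, d) data, centres: (c, d) cluster centres,
    U: (c, n) fuzzy memberships, m: fuzzifier exponent.
    """
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    U = np.asarray(U, dtype=float)
    # Squared Euclidean distance from every centre to every point, shape (c, n).
    d2 = ((centres[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # Smallest squared distance between two distinct centres.
    cc = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(cc, np.inf)
    separation = cc.min()
    return compactness / (X.shape[0] * separation)

# Two tight, well-separated groups with a crisp partition.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centres = np.array([[0.0, 0.5], [10.0, 0.5]])
U = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
xb = xie_beni(X, centres, U)
print(xb)
```

In a simulated annealing search, a candidate cluster configuration with a smaller index value would be treated as having lower energy.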
In order to overcome this limitation, in Chapter 5, the original VFC-SA
algorithm was extended in four novel ways in order to produce the ‘Simulated
Annealing Fuzzy Clustering’ (SAFC) algorithm. An evaluation of the performance
of fuzzy c-means, VFC-SA and SAFC clustering algorithms on seven oral cancerous
FTIR spectra datasets was carried out and this demonstrated that SAFC obtained the
smallest Xie-Beni validity index values in all seven datasets. Particularly in
comparison with the VFC-SA algorithm, SAFC generates better quality, more stable
results in roughly the same computational time.
8.1.3 An analysis of lymph node tissue sections obtained from an infrared imaging
technique by using principal component analysis, fuzzy c-means clustering and a
combination of both methods.
Following the relatively recent introduction of an infrared imaging technique
which allows a large number of FTIR spectra data to be obtained in a quick and
efficient way, it is now possible to analyse large areas of tissue section. In Chapter 6,
spectral data which had been collected using this technique from an area of lymph
node tissue section (named LNII5) which contained different types of cells, were
selected for investigation.
Principal component analysis (PCA) is a typical multivariate statistical
technique that has been widely applied in the field of data analysis and compression.
In this thesis, PCA is mainly used to reduce the number of dimensions for large FTIR
datasets. However, in Chapter 6 another feature of PCA, namely data structure
detection, was employed to explore the underlying formation of cells within the
selected lymph node tissue section LNII5. This was implemented by calculating, for
each spectrum, a value which measures its correlation with
each principal component (PC). A larger value indicates that the spectrum has a
closer relationship to this PC and vice versa. Finally, the whole set of values for
each PC was displayed in a false colour weighted image with the same pixel
dimensions as the original LNII5 image. The technique was applied utilising the first 10
PCs and the results were shown.
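A per-PC image of this kind can be sketched in Python as follows. The exact value computed in Chapter 6 is not restated here, so the sketch uses the ordinary PC score of each spectrum as a stand-in, and assumes that the spectra are stored one row per pixel in row-major image order; `pc_score_images` and the toy image size are hypothetical.

```python
import numpy as np

def pc_score_images(spectra, height, width, n_pcs=10):
    """Return an array of shape (n_pcs, height, width) holding, for each
    principal component, the score of every pixel's spectrum on that PC."""
    spectra = np.asarray(spectra, dtype=float)
    centred = spectra - spectra.mean(axis=0)
    # Rows of vt are the PC loading vectors, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_pcs].T            # shape (n_pixels, n_pcs)
    return scores.T.reshape(n_pcs, height, width)

# Toy example: a 6 x 5 pixel image with 20 wavenumbers per spectrum.
rng = np.random.default_rng(0)
toy_spectra = rng.normal(size=(6 * 5, 20))
images = pc_score_images(toy_spectra, height=6, width=5, n_pcs=3)
print(images.shape)
```

Each slice `images[i]` could then be passed to a plotting routine with a colour map to obtain a false colour image for PC `i + 1`.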
The standard fuzzy c-means clustering algorithm was also used to analyse this
same LNII5 tissue section. The number of clusters was set from 2 to 9 successively
and the clustering results were also displayed in false colour weighted images.
The third method used to analyse LNII5 combined PCA and fuzzy
c-means, and is termed PCA-fuzzy c-means. However, unlike in the first method,
PCA was used here in the more conventional manner to reduce the dimensionality of
the variables without significantly losing information from the original data (as there
are 821 wavenumbers within each spectrum from the given lymph node tissue
section). The transformed data in the space of the first 10 PCs were utilised as the input to
the fuzzy c-means clustering algorithm. The clustering results were again displayed
in false colour weighted images. The computational time of all three techniques was
also calculated and compared.
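The two-stage procedure can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: PCA is computed by singular value decomposition, the fuzzy c-means loop uses the standard update equations with random initial memberships, and the names `pca_reduce`, `fuzzy_c_means` and `min_improvement` are hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the mean-centred rows of X onto the first n_components PCs."""
    centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def fuzzy_c_means(X, c, m=2.0, max_iter=200, min_improvement=1e-5, seed=0):
    """Standard fuzzy c-means with Euclidean distance; returns (centres, U)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                    # memberships of each point sum to 1
    previous = np.inf
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d2 = ((centres[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)           # avoid division by zero
        objective = float((Um * d2).sum())
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)         # membership update for these centres
        # Stop once the objective changes by less than the threshold.
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
    return centres, U

# Toy data: two groups of 30 synthetic 'spectra' with 50 variables each.
rng = np.random.default_rng(1)
spectra = np.vstack([rng.normal(0.0, 0.1, (30, 50)),
                     rng.normal(1.0, 0.1, (30, 50))])
reduced = pca_reduce(spectra, n_components=3)
centres, U = fuzzy_c_means(reduced, c=2)
labels = U.argmax(axis=0)                 # hard label = largest membership
```

Because the distance calculations in the clustering loop operate on a handful of PC scores rather than on every wavenumber, each iteration is considerably cheaper than clustering the full spectra.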
The experimental results showed that the PCA and PCA-fuzzy c-means
methods took almost the same computational time, and both were much faster than
the standard fuzzy c-means algorithm. At the same time, the PCA-fuzzy c-means
method obtained very similar clustering results to standard fuzzy c-means. However,
from PC3 onwards, the PCA method did not yield any additional useful information.
On the other hand, as the number of clusters was increased, more types of tissue
could be discriminated in the fuzzy c-means and PCA-fuzzy c-means methods.
Hence, it can be stated that (in this context) PCA-fuzzy c-means offers advantages
over both PCA and fuzzy c-means, and so can be regarded as a good technique for
analysing large FTIR spectral datasets.
8.1.4 Identification of the relationship between the parameter in fuzzy c-means
which controls the minimal acceptable amount of improvement and the range of data
in the first three principal components space. Comparison of clustering performance
between k-means and fuzzy c-means algorithms on a lymph node tissue section after
dimensionality reduction.
During the experiments to classify the lymph node tissue section LNII5 using
the fuzzy c-means algorithm it was found, by comparing the data ranges in
the space of the first three principal components, that when the minimal amount of
improvement was not small enough the algorithm effectively stopped prematurely.
Based on this finding, the minimal amount of improvement setting was adjusted and
the performance of fuzzy c-means algorithm was then found to be significantly
improved.
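The link between the data range and the stopping threshold can be made explicit with a short derivation. It assumes that the stopping rule compares successive values of the standard fuzzy c-means objective function; the symbols below (c clusters, n spectra, memberships u_ik, centres v_i, fuzzifier m, threshold ε, scale factor s) are introduced here for illustration and may differ from the notation used elsewhere in the thesis.

```latex
J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \lVert x_k - v_i \rVert^{2},
\qquad \text{stop when } \bigl\lvert J_m^{(t)} - J_m^{(t-1)} \bigr\rvert < \varepsilon .

x_k \mapsto s\,x_k
\;\Rightarrow\;
v_i \mapsto s\,v_i, \quad
u_{ik} \text{ unchanged}, \quad
J_m \mapsto s^{2} J_m .
```

The memberships depend only on ratios of distances, so rescaling the data leaves them unchanged, whereas every improvement in the objective is multiplied by s². A fixed ε applied to data whose range is s times smaller therefore behaves like a threshold of ε / s² on the original data, and for small s the iteration can stop well before the partition has settled.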
In order to compare the clustering performance of k-means and fuzzy c-means on
the large FTIR spectra dataset LNII5, a similar process to the PCA-fuzzy c-means
technique was applied to the k-means algorithm, namely dimensionality reduction by
PCA prior to clustering. The corresponding experiments were
carried out and the results were discussed in Chapter 6.
8.1.5 Development of a novel automated method to appropriately merge clusters
for FTIR spectral clustering analysis.
The objective of this research is to facilitate the automation of tools to be used
to enhance clinical analysis. This would require robust and consistent identification
of the appropriate number of clusters for the given clinical context. Therefore, a
clustering algorithm that can automatically detect the number of clusters is required.
In Chapter 5, the SAFC algorithm was designed and implemented with this purpose
in mind. However, in clustering large datasets, it can be very time consuming. With
the aim of overcoming this problem, a refined fuzzy c-means based clustering
algorithm was developed to find the ‘optimal’ number of clusters.
Both the SAFC and the fuzzy c-means based clustering algorithms can
automatically detect the number of clusters. However, both algorithms occasionally
identified an excessive number of clusters compared to clinical analysis.
Furthermore, the excessive number of clusters generated from the fuzzy c-means
based clustering algorithm separated the same type of tissue into two or more
clusters. In contrast, when the number of clusters was fixed to be the same as that
obtained from clinical analysis, the clustering results did not match clinical
diagnosis. It is thought that this may be due to the fact that the distance measure in
the standard fuzzy c-means algorithm is a Euclidean distance. In light of this
observation, an automated method to merge clusters was developed in Chapter 7.
Six FTIR datasets were used to verify this newly proposed algorithm. The
experimental results indicated that successful merging of clusters that have similar
biochemistry was achieved, thus demonstrating that the method can successfully
determine the main tissue types within the given sections.
8.1.6 Summary
This thesis investigates and develops clustering techniques which can classify
different types of tissue FTIR spectra taken from a range of oral cancer and breast
cancer lymph node tissue sections. It includes various comparisons of alternative
clustering algorithms in small (oral cancer) and large (lymph node tissue) FTIR
spectral datasets, and describes the development of a novel clustering algorithm
which can automatically identify the same number of tissue types as that obtained
by clinical diagnosis. The experiments carried out in this research indicate that
infrared spectroscopy may indeed become a powerful diagnostic probe to identify the
early stages of cancer.
8.2 Future Work
There are many potential avenues of further research that arise from
the work carried out in this thesis. Some of the more obvious ones are outlined
below.
8.2.1 Clustering Algorithms
This thesis mainly focuses on modifications and enhancements of the fuzzy c-
means clustering algorithm to classify the FTIR spectral data, along with k-means,
hierarchical clustering and simulated annealing fuzzy clustering. Different clustering
methods have their own advantages and disadvantages for specific clustering criteria.
Therefore further investigation of the combination of other clustering algorithms
with diverse optimisation techniques which could inherit the advantages of each
individual method may also be an interesting future research direction. Other than
partitional clustering algorithms, density-based and model-based clustering algorithms
could also be examined.
In Chapter 7, an automatic clustering process based on cluster validity was
used to obtain the best clustering and the results showed that it works well.
However, this method has to run the fuzzy c-means algorithm once for each of a series of
candidate numbers of clusters. Development of an automatic clustering method which can
avoid running the same algorithm many times to discover the correct number of
clusters might be another future direction.
8.2.2 Distance Measures
The distance measure used throughout this work is the Euclidean distance.
This is suitable for measuring distances between data in multi-dimensional space. However, when the
shape of the clusters differs greatly from spherical, it may result in an
inappropriate clustering. In light of this fact, other distance measures could also be
examined. Examples of such alternative distance measures are the squared
Mahalanobis distance, mutual neighbour distance and the Chebychev distance.
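To make the difference between these measures concrete, the following Python sketch evaluates the Euclidean, squared Mahalanobis and Chebychev distances for the same pair of points. The function names and the toy covariance matrix are illustrative assumptions; the mutual neighbour distance is omitted because it depends on the whole dataset rather than on two points alone.

```python
import numpy as np

def euclidean(x, v):
    """Straight-line distance; treats all directions equally."""
    return float(np.sqrt(np.sum((x - v) ** 2)))

def squared_mahalanobis(x, v, cov):
    """Squared distance scaled by the covariance matrix cov, so that
    directions of high variance contribute less."""
    diff = x - v
    return float(diff @ np.linalg.solve(cov, diff))

def chebyshev(x, v):
    """Largest absolute difference over any single coordinate."""
    return float(np.max(np.abs(x - v)))

x = np.array([3.0, 4.0])
v = np.array([0.0, 0.0])
cov = np.array([[9.0, 0.0],      # large variance along the first axis
                [0.0, 1.0]])     # small variance along the second axis

d_euc = euclidean(x, v)                  # sqrt(9 + 16)
d_mah = squared_mahalanobis(x, v, cov)   # 9/9 + 16/1
d_che = chebyshev(x, v)                  # max(3, 4)
print(d_euc, d_mah, d_che)
```

With an elongated cluster, such as the one implied by `cov`, the Mahalanobis form penalises displacement along the narrow axis far more than along the wide one, which the Euclidean distance cannot do.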
8.2.3 Cluster Validity
In the process of automatic clustering, it has been shown that different cluster
validity indices may lead to different clustering results being obtained. In this thesis,
the Xie-Beni and the Sun-Wang-Jiang validity indices were applied, depending on
the size of the FTIR dataset. Various other cluster validity indices may also be
investigated; examples of some alternative indices can be
found at the end of Section 2.2.4.
8.2.4 Setting Initial Cluster Centres
As the initial cluster centres in fuzzy c-means clustering are set randomly, this
will often result in minor differences in clustering results in different runs of the
algorithm (indeed, sometimes major differences can be obtained). Therefore, in
order to obtain a representative set of results, a number of runs of this algorithm were
carried out. There remains scope for an investigation of a method which can set
‘good’ or ‘appropriate’ initial cluster centre positions that will then make the fuzzy c-
means clustering algorithm work more efficiently.
8.2.5 Data Sources
A wider source of infrared spectroscopy data, particularly from a wider range
of different types of cancer, is required in order to carry out further evaluation and
verification of the clustering algorithms developed in this thesis. In order to transfer
these algorithms into the clinical setting, a far more thorough set of validation
experiments will obviously need to be carried out. Naturally, it is currently
expensive (in terms of both equipment and manpower) to obtain such data; and until
a technique is proven it can be hard to obtain the necessary funds. This is, of course,
a ‘catch 22’ situation, but it is hoped that the contributions presented in this thesis
will add to the weight of supporting evidence necessary to convince others of the
potential of the FTIR technique.
8.2.6 Infrared Spectroscopy Expert System
In future, if sufficient infrared spectral data were collected and analysed, a
database of infrared spectroscopic analysis could be established. Thus, any new
unlabelled spectra could then be identified through comparison with the existing spectra
in the database and a corresponding expert system could also be generated.
8.2.7 Dedicated Software For Infrared Spectrometry Hardware
The work developed in this thesis, and future work based on it, will hopefully
have the potential to be built into a software ‘sub-system’ within an infrared
spectrometry machine, once adequate evaluation and verification processes have been
carried out. This would require a significant effort in software engineering, but could
then be sold / licensed to the manufacturers of the FTIR machines. Hence, in the
long term, there may be the prospect of an economic return from this research, in
addition to the obvious potential human benefits obtained from earlier and more
accurate diagnosis of cancers.
8.3 Dissemination
The research described in this thesis has been disseminated through a number
of book chapters, journal papers and international conference papers. Most of the
work described in each of the main body Chapters of this thesis has also been
disseminated in this manner. A formal list of the publications and presentations
derived from this work now follows.
8.3.1 Book Chapter
Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Novel
Developments in Fuzzy Clustering for the Classification of Cancerous Cells using
FTIR Spectroscopy, Book Chapter, Jose Valente de Oliveira and Witold Pedrycz
(eds), accepted for publication in the book Advances in Fuzzy Clustering and its
Applications, John Wiley and Sons, 2007. (Chapters 4, 5 and 7)
8.3.2 Journal Papers
Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., A Novel Fuzzy
Clustering Algorithm for the Analysis of Axillary Lymph Node Tissue Sections,
accepted to be published in Applied Intelligence, 2006. (Chapter 7)
Wang, X.Y., Garibaldi, J.M., Simulated Annealing Fuzzy Clustering in
Cancer Diagnosis, Informatica, vol 29, no. 1, pp 61-70, 2005. (Chapter 5)
8.3.3 Conference Papers
Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Fuzzy Clustering
in Biochemical Analysis of Cancer Cells, in the Proceedings of the Fourth Conference
of the European Society for Fuzzy Logic and Technology (EUSFLAT 2005) and
Eleventh Rencontres Francophones sur la Logique Floue et ses Applications (LFA
2005). pp. 1118-1123, Barcelona, Spain, September, 7-9, 2005. (Chapter 7)
Wang, X.Y., Garibaldi, J.M., A Comparison of Fuzzy and Non-Fuzzy
Clustering Techniques in Cancer Diagnosis, in the Proceedings of the Second
International Conference in Computational Intelligence in Medicine and Healthcare
(The Biopattern Conference), pp. 250-256, Costa da Caparica, Lisbon, Portugal, 29
June - 1 July 2005. (Chapter 6)
Wang, X.Y., Whitwell, G. and Garibaldi, J.M., The Application of a
Simulated Annealing Fuzzy Clustering Algorithm for Cancer Diagnosis, in the
Proceedings of the IEEE 4th International Conference on Intelligent System Design
and Application, pp 467-472, Budapest, Hungary, August 26-28, 2004, ISBN 963-
71546-30-2. (Chapter 5)
Wang, X.Y., Garibaldi, J.M. and Ozen, T., Application of The
Fuzzy C-Means Clustering Method on the Analysis of non Pre-processed FTIR Data for
Cancer Diagnosis, in the Proceedings of the 8th Australian and New Zealand
Conference on Intelligent Information Systems, pp. 233-238, Sydney, Australia,
December 10-12, 2003. (Chapter 4)
8.3.4 Presentations
Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., A Novel Fuzzy
Clustering Algorithm for the Analysis of Potentially Cancerous Lymph Node Cells,
(oral presentation) in Automated Scheduling Optimisation and Planning Research
Group seminar, December, 2005.
Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Fuzzy Clustering
in Biochemical Analysis of Cancer Cells, (oral presentation) in the Fourth
Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT
2005) and Eleventh Rencontres Francophones sur la Logique Floue et ses
Applications (LFA 2005). Barcelona, Spain, September 7-9, 2005.
Wang, X.Y., Garibaldi, J.M., A Comparison of Fuzzy and Non-Fuzzy
Clustering Techniques in Cancer Diagnosis, (oral presentation) in the Second
International Conference in Computational Intelligence in Medicine and Healthcare
(The Biopattern Conference), Lisbon, Portugal, 29 June - 1 July 2005.
Wang, X.Y., Whitwell, G. and Garibaldi, J.M., The Application of a
Simulated Annealing Fuzzy Clustering Algorithm for Cancer Diagnosis, (oral
presentation) in the IEEE 4th International Conference on Intelligent System Design
and Application, Budapest, Hungary, August 26-28, 2004.
Wang, X.Y., Garibaldi, J.M. and Ozen, T., Application of The Fuzzy C-
Means Clustering Method on the Analysis of non Pre-processed FTIR Data for
Cancer Diagnosis, (oral presentation) in Automated Scheduling Optimisation and
Planning Research Group seminar, December, 2003.
Wang, X.Y., Garibaldi, J. M., Fuzzy Clustering, (oral presentation) in
Automated Scheduling Optimisation and Planning Research Group seminar, April,
2003.
Bird, B., George, M. W., Wang, X-Y., Garibaldi, J. M., Stone, N., Smith, J.
and Barr, H., A Combined Infrared and Raman study of Axillary Lymph Nodes in
Breast Cancer. (Poster presentation) 3rd International Conference on Advanced
Vibrational Spectroscopy, Wisconsin, USA, August, 14-19, 2005.
Bird, B., George, M. W., Wang, X-Y., Garibaldi, J. M., Stone, N., Smith, J.
and Barr, H., A Combined Infrared and Raman study of Axillary Lymph Nodes in
Breast Cancer. (Oral Presentation) Mini-Symposium on Optoelectronics for Use in
the Diagnosis of Cancer, Grasmere, Lake District, UK, June 20-23, 2005.
Bird, B., Chesters, M. A., Chalmers, J., Tobin, M., Wang, X. Y., Garibaldi,
J. M., Hitchcock, A. and Symonds, I., Infrared Microspectroscopy as a Potential
Tool for Cervical Cancer Diagnosis. (Poster Presentation) Faraday Discussion 126,
Applications of Spectroscopy to Biomedical Problems, University of Nottingham,
UK, September 1-3, 2003.
Bird, B., Chesters, M. A., Chalmers, J., Tobin, M., Wang, X. Y., Garibaldi,
J. M., Hitchcock, A. and Symonds, I., Infrared Microspectroscopy as a Potential
Tool for Cervical Cancer Diagnosis. (Poster Presentation) 2nd International
Conference on Advanced Vibrational Spectroscopy, University of Nottingham, UK,
August 24-29, 2003.
References
188
References
[1] National Cancer Research Institute website, 2006, www.ncri.org.uk
[2] Zachariadou-Veneti, S., 2000, "A Tribute to George Papanicolaou (1883-1962)", Cytopathology, vol. 11, pp. 152-157.
[3] Stuttaford, T., 4 May 2001, "Greatest need is annual smear", The Times newspaper.
[4] 4 May 2001, "Imperfect cervical cancer tests are better than none at all", The Times newspaper.
[5] Robles, S. C., 2002, "Deconstructing the Myths of Cervical Cancer", Perspectives in Health, vol. 5, no. 2.
[6] Mantsch, H. and McElhaney, R. N., 1990, "Application of IR spectroscopy to biology and medicine", J Molec Struc, vol. 217, pp. 347-362.
[7] Wong, P. T. T. and Rigas, B., 1990, "Infrared spectra of microtome sections of human colon tissues", Applied Spectroscopy, vol. 44, pp. 1715-1718.
[8] Rigas, B., Morgello, S., Goldan, I. S., and Wong, P. T. T., 1990, "Human colorectal cancers display abnormal Fourier transform infrared spectra", in Proceedings of the National Academy of Sciences, USA, vol. 87, pp. 84-88.
[9] Wong, P. T. T., Goldstein, S. M., Grekin, R. C., Godwin, T. A., Pivik, C., and Rigas, B., 1993, "Distinct infrared spectroscopic patterns of human basal cell carcinoma of the skin", Cancer Research, vol. 53, no. 4, pp. 762-765.
[10] Morris, B. J., Lee, C., Nightingale, B. N., Molodysky, E., Morris, L. J., and Appio, R., 1995, "Fourier transform infrared spectroscopy of dysplastic, papillomavirus-positive cervicovaginal lavage specimens", Gynecologic Oncology, vol. 56, no. 2, pp. 245-249.
[11] Jackson, M. and Mantsch, H., 1996, Biomedical Infrared Spectroscopy, in Infrared Spectroscopy of Biomolecules, Mantsch, H. and Chapman, D. (eds), Wiley-Liss Inc., New York, pp. 311-340.
[12] Benedetti, E., Teodori, L., Trinca, M. L., Vergamini, P., Slavati, F., Mauro, F., and Spremolla, G., 1990, "A new approach to the study of human solid tumor cells by means of FT-IR microspectroscopy", Applied Spectroscopy, vol. 44, pp. 1276-1280.
[13] Benedetti, E., Papineschi, F., Vergamini, P., Consolini, R., and Spremolla, G., 1984, "Analytical infrared spectral differences between human normal and leukaemic cells (CLL) — I", Leukemia Research, vol. 8, no. 3, pp. 483-489.
[14] McIntosh, L., Mansfield, J., Crowson, A., Mantsch, H., and Jackson, M., 1999, "Analysis and Interpretation of Infrared Microscopic Maps: Visualization and Classification of Skin Components by Digital Staining and Multivariate Analysis", Biospectroscopy, vol. 5, pp. 265-275.
[15] Goodacre, R., Timmins, E., Burton, R., Kaderbhai, N., Woodward, A., Kell, D., and Rooney, P., 1998, "Rapid Identification of Urinary Tract Infection Bacteria Using Hyperspectral Whole-organism Fingerprinting and Artificial Neural Networks", Microbiology, vol. 144, pp. 1157-1170.
[16] Richter, T., Steiner, G., Abu-Id, M., Salzer, R., Bergmann, R., Rodig, H., and Johannsen, B., 2002, "Identification of Tumor Tissue by FTIR Spectroscopy in Combination with Positron Emission Tomography", Vibrational Spectroscopy, vol. 28, pp. 103-110.
[17] Lasch, P., Haensch, W., Naumann, D., and Diem, M., 2004, "Imaging of colorectal adenocarcinoma using FT-IR microspectroscopy and cluster analysis", Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, vol. 1688, no. 2, pp. 176-186.
[18] Romeo, M. J. and Diem, M., 2005, "Infrared Spectral Imaging of Lymph Nodes: Strategies for Analysis and Artifact Reduction", Vibrational Spectroscopy, vol. 38, pp. 115-119.
[19] Wood, B. R., Chiriboga, L., Yee, H., Quinn, M. A., McNaughton, D., and Diem, M., 2004, "Fourier transform infrared (FTIR) spectral mapping of the cervical transformation zone, and dysplastic squamous epithelium", Gynecologic Oncology, vol. 93, no. 1, pp. 59-68.
[20] Lasch, P., Wasche, W., McCarthy, W. J., Muller, G., and Naumann, D., 1998, "Imaging of Human Colon Carcinoma Thin Sections by FTIR Microspectrometry", Infrared Spectroscopy: New Tool in Medicine, vol. 3257, no. 3, pp. 187-197.
[21] Salman, A., Erukhimovitch, V., Talyshinsky, M., and Huleihil, M., 2002, "FTIR Spectroscopic Method for Detection of Cells Infected with Herpes Viruses", Biopolymers (Biospectroscopy), vol. 67, pp. 406-412.
[22] Schultz, C. P., Liu, K., Johnston, J. B., and Mantsch, H., 1996, "Study of Chronic Lymphocytic Leukemia Cells by FT-IR Spectroscopy and Cluster Analysis", Leukemia Research, vol. 20, no. 8, pp. 649-655.
[23] Zhang, L., Small, G. W., Haka, A. S., Kidder, L. H., and Lewis, E. N., 2003, "Classification of Fourier Transform Infrared Microscopic Imaging Data of Human Breast Cells by Cluster Analysis and Artificial Neural Networks", Applied Spectroscopy, vol. 57, no. 1, pp. 14-22.
[24] 2001, Handbook of Analytical Method for Materials, Materials Evaluation and Engineering, Inc.
[25] Berkhin, P., 2002, "Survey of Clustering Data Mining Techniques", San Jose, CA, USA, Accrue Software.
[26] Allibone, R., Chalmers, J. M., Chesters, M. A., Fisher, S., Hitchcock, A., Pearson, M., Rutten, F. J. M., Symonds, I., and Tobin, M., 2002, "FT-IR microscopy of oral and cervical tissue samples", Derby City General Hospital, Internal Report.
[27] Jain, A. K., Murty, M. N., and Flynn, P. J., 1999, "Data Clustering: A Review", ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.
[28] Omran, M., 2004, Particle Swarm Optimization Methods for Pattern Recognition and Image Processing, Ph.D Thesis, Faculty of Engineering, Built Environment and Information Technology, University of Pretoria.
[29] Han, J. and Kamber, M., 2001, Data Mining: Concepts and Techniques, Morgan Kaufmann. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA.
[30] Carpineto, C. and Romano, G., 1996, "A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval", Machine Learning, vol. 24, no. 2, pp. 95-122.
[31] Judd, D., Mckinley, P., and Jain, A., 1998, "Large-scale Parallel Data Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 871-876.
[32] Ozer, M., 2005, "Fuzzy c-means Clustering and Internet Portals: A Case Study", European Journal of Operational Research, vol. 164, no. 3, pp. 696-714.
[33] Hamerly, G. and Elkan, C., 2002, "Alternatives to the k-means algorithm that find better clusterings", in Proceedings of CIKM-02, 11th ACM International Conference on Information and Knowledge Management, pp. 600-607.
[34] Garrett-Mayer, E. and Parmigiani, G., 2004, "Clustering and Classification Methods for Gene Expression Data Analysis", Johns Hopkins University, Dept. of Biostatistics Working Papers, Johns Hopkins University, The Berkeley Electronic Press (bepress), http://www.bepress.com/jhubiostat/.
[35] Jiang, D., Tang, C., and Zhang, A., 2004, "Cluster Analysis for Gene Expression Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386.
[36] Anderberg, M.R., 1973, Cluster Analysis for Applications, Academic Press. New York.
[37] Blum, C. and Roli, A., 2003, "Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison", ACM Computing Surveys, vol. 35, no. 3, pp. 268-308.
[38] 2004, "Cluster Analysis", Copyright StatSoft, Inc., 1984-2004, http://www.statsoft.com/textbook/stcluan.html.
[39] Ward, J. H., 1963, "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244.
[40] 1999, "Characteristics of Methods for Clustering Observations", http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap8/sect4.htm, SAS/STAT User's Guide OnlineDoc, Version 8, SAS Institute Inc., Cary, NC, USA.
[41] Jain, A.K. and Dubes, R.C., 1988, Algorithms for Clustering Data, Prentice-Hall advanced reference series, Prentice-Hall. Englewood Cliffs, NJ, USA.
[42] MacQueen, J. B., 1967, "Some methods of classification and analysis of multivariate observations", in Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California, Berkeley, pp. 281-297.
[43] Ruspini, E. H., 1969, "A New Approach to Clustering", Information and Control, vol. 15, no. 1, pp. 22-32.
[44] Dunn, J. C., 1973, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, vol. 3, no. 3, pp. 32-57.
[45] Bezdek, J., 1981, Pattern Recognition With Fuzzy Objective Function Algorithms, Plenum. New York.
[46] Hoppner, F., Klawonn, F., Kruse, R., and Runkler, T., 1999, Fuzzy Cluster Analysis Methods for Classification, Data Analysis and Image Recognition, John Wiley and Sons Ltd.
[47] Lampinen, T., Koivisto, H., and Honkanen, T., 2002, "Profiling Network Applications with Fuzzy C-Means Clustering and Self-organizing Map", in Proceedings of 1st International Conference on Fuzzy Systems and Knowledge Discovery: Computational Intelligence for the E-Age, Orchid Country Club, Singapore, vol. 1, pp. 300-394.
[48] Zhao, Y. and Karypis, G., 2004, "Soft Clustering Criterion Functions for Partitional Document Clustering: A Summary of Results", in Proceedings of CIKM-04, 13th ACM International Conference on Information and Knowledge Management, pp. 246-247.
[49] Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P., 1983, "Optimization by Simulated Annealing", Science, vol. 220, no. 4598, pp. 671-680.
[50] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E., 1953, "Equation of State Calculations by Fast Computing Machines", Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092.
[51] Klein, R. and Dubes, R. C., 1989, "Experiments in Projection and Clustering by Simulated Annealing", Pattern Recognition, vol. 22, pp. 213-220.
[52] Selim, S. Z. and Al-Sultan, K., 1991, "A Simulated Annealing Algorithm for the Clustering Problem", Pattern Recognition, vol. 24, no. 10, pp. 1003-1008.
[53] Brown, D. E. and Huntley, C. L., 1992, "A Practical Application of Simulated Annealing to Clustering", Pattern Recognition, vol. 25, no. 4, pp. 401-412.
[54] Barker, A., 1989, Neural Networks for Data Fusion, Master Thesis, University of Virginia, Charlottesville, Virginia.
[55] Al-Sultan, K. and Selim, S. Z., 1993, "A Global Algorithm for the Fuzzy Clustering Problem", Pattern Recognition, vol. 26, no. 9, pp. 1357-1361.
[56] Lukashin, A. V. and Fuchs, R., 2001, "Analysis of Temporal Gene Expression Profiles: Clustering by Simulated Annealing and Determining the Optimal Number of Clusters", Bioinformatics, vol. 17, no. 5, pp. 405-414.
[57] Yang, W., Rueda, L., and Ngom, A., 2005, "A Simulated Annealing Approach to Find the Optimal Parameters for Fuzzy Clustering Microarray Data", in Proceedings of XXV International Conference of the Chilean Computer Science Society - SCCC 2005, Valdivia, Chile, pp. 45-54.
[58] Al-Sultan, K., 1995, "A Tabu Search Approach to the Clustering Problem", Pattern Recognition, vol. 28, no. 9, pp. 1443-1451.
[59] Al-Sultan, K. and Fedjki, C., 1997, "A Tabu Search-Based Algorithm for the Fuzzy Clustering Problem", Pattern Recognition, vol. 30, no. 12, pp. 2023-2030.
[60] Sung, C. S. and Jin, H. W., 2000, "A Tabu-Search-Based Heuristic for Clustering", Pattern Recognition, vol. 33, pp. 849-858.
[61] Hall, L. O., Ozyurt, B., and Bezdek, J. C., 1999, "Clustering with a Genetically Optimized Approach", IEEE Transactions on Evolutionary Computation, vol. 3, no. 2, pp. 103-112.
[62] Maulik, U. and Bandyopadhyay, S., 2000, "Genetic Algorithm-Based Clustering Technique", Pattern Recognition, vol. 33, pp. 1455-1465.
[63] Bandyopadhyay, S. and Maulik, U., 2002, "Genetic Clustering for Automatic Evolution of Clusters and Application to Image Classification", Pattern Recognition, vol. 35, pp. 1197-1208.
[64] Davies, D. L. and Bouldin, D. W., 1979, "A Cluster Separation Measure", IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 2, pp. 224-227.
[65] Garai, G. and Chaudhuri, B. B., 2004, "A Novel Genetic Algorithm for Automatic Clustering", Pattern Recognition Letters, vol. 25, pp. 173-187.
[66] Tseng, L. and Yang, S., 2001, "A Genetic Approach to the Automatic Clustering Problem", Pattern Recognition, vol. 34, pp. 415-424.
[67] Babu, G. P. and Murty, M. N., 1994, "Clustering with Evolution Strategies", Pattern Recognition, vol. 27, no. 2, pp. 321-329.
[68] Lee, C. and Antonsson, E., 2000, "Dynamic Partitional Clustering Using Evolution Strategies", in Proceedings of the 3rd Asia-Pacific Conference on Simulated Evolution and Learning, Nagoya, Japan.
[69] Halkidi, M., Batistakis, Y., and Vazirgiannis, M., 2001, "On Clustering Validation Techniques", Journal of Intelligent Information Systems, vol. 17, no. 2/3, pp. 107-145.
[70] Su, M., December 2005, "A New Index of Cluster Validity", http://www.cecs.missouri.edu/~skubic/8820/ClusterValid.pdf, Electrical and Computer Engineering Department, University of Missouri-Columbia, Columbia, MO, USA.
[71] Rezaee, M. R., Lelieveldt, B. P. F., and Reiber, J. H. C., 1998, "A New Cluster Validity Index for the Fuzzy C-Means", Pattern Recognition Letters, vol. 19, pp. 237-246.
[72] Sun, H., Wang, S., and Jiang, Q., 2004, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters", Pattern Recognition, vol. 37, pp. 2027-2037.
[73] Bezdek, J., 1998, Pattern Recognition in Handbook of Fuzzy Computation, IOP Publishing Ltd. Boston, NY.
[74] Xie, X. L. and Beni, G., 1991, "A Validity Measure for Fuzzy Clustering", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847.
[75] Kim, D. W., Lee, K. H., and Lee, D., 2004, "On Cluster Validity Index for Estimation of the Optimal Number of Fuzzy Clusters", Pattern Recognition, vol. 37, pp. 2009-2025.
[76] Fukuyama, Y. and Sugeno, M., 1989, "A New Method of Choosing the Number of Clusters for the Fuzzy C-Means Method", in Proceedings of Fifth Fuzzy System Symposium, pp. 247-250.
[77] Rhee, H. and Oh, K., 1996, "A Validity Measure for Fuzzy Clustering and Its Uses in Selecting Optimal Number of Clusters", in Proceedings of IEEE, pp. 1020-1025.
[78] Bandyopadhyay, S. and Maulik, U., 2001, "Nonparametric Genetic Clustering: Comparison of Validity Indices", IEEE Transactions on Systems, Man, and Cybernetics, vol. 31, no. 1, pp. 120-125.
[79] Xie, Y., Raghavan, V. V., and Zhao, X., 2002, "3M Algorithm: Finding an Optimal Fuzzy Cluster Scheme for Proximity Data", in Proceedings of the FUZZY-IEEE conference-2002 IEEE world congress on Computational Intelligence, Honolulu, HI, vol. 1, pp. 627-632.
[80] Kim, M. and Ramakrishna, R. S., 2005, "New Indices for Cluster Validity Assessment", Pattern Recognition Letters, vol. 26, pp. 2353-2363.
[81] Wu, K. and Yang, M., 2005, "A Cluster Validity Index for Fuzzy Clustering", Pattern Recognition Letters, vol. 26, no. 9, pp. 1275-1291.
[82] Hamerly, G. and Elkan, C., 2003, "Learning the K in K-means", in Proceedings of the 17th Annual Conference on Neural Information Processing Systems, British Columbia, Canada.
[83] Ray, S. and Turi, R., 1999, "Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation", in Proceedings of 4th International Conference on Advances in Pattern Recognition and Digital Techniques (ICAPRDT'99), New Delhi, India, pp. 137-143.
[84] Ball, G. and Hall, D., 1965, "A Novel Method of Data Analysis and Classification", Stanford University, Stanford, CA, Technical Report AD-699616.
[85] Tran, T., Wehrens, R., and Buydens, L., 2005, "Clustering Multispectral Images: A Tutorial", Chemometrics and Intelligent Laboratory Systems, vol. 77, pp. 3-17.
[86] Turi, R.H., 2001, Clustering-Based Colour Image Segmentation, Ph.D Thesis, School of Computer Science and Software Engineering, Monash University, Australia.
[87] Tou, J., 1979, "DYNOC - A Dynamic Optimal Cluster-Seeking Technique", International Journal of Parallel Programming, vol. 8, no. 6, pp. 541-547.
[88] Chaudhuri, D., Chaudhuri, B. B., and Murthy, C. A., 1992, "A New Split-and-Merge Clustering Technique", Pattern Recognition Letters, vol. 13, pp. 399-409.
[89] Huang, K., 2002, "A Synergistic Automatic Clustering Technique (Syneract) for Multispectral Image Analysis", Photogrammetric Engineering and Remote Sensing, vol. 1, no. 1, pp. 33-40.
[90] Ester, M., Kriegel, H., Sander, J., and Xu, X., 1996, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Menlo Park, CA, pp. 226-231.
[91] Guha, S., Rastogi, R., and Shim, K., 1998, "CURE: An Efficient Clustering Algorithm for Large Databases", in Proceedings of ACM SIGMOD International Conference on Management of Data, New York, pp. 73-84.
[92] Karypis, G., Han, E.-H., and Kumar, V., 1999, "Chameleon: Hierarchical Clustering Using Dynamic Modeling", IEEE Computer, vol. 32, pp. 68-75.
[93] Kelly, P. M., 1994, "An Algorithm for Merging Hyperellipsoidal Clusters", Technical report LA-UR-94-3306.
[94] Tou, J.T. and Gonzalez, R., 1972, Pattern Recognition Principles, Addison-Wesley. Reading, MA.
[95] Jolliffe, I.T., 1986, Principal Component Analysis, Springer-Verlag. New York.
[96] Jolliffe, I.T., 2002, Principal Component Analysis (Second Edition), Aberdeen, UK.
[97] Goncalves, A. R., Esposito, E., and Benar, P., 1998, "Evaluation of Panus Tigrinus in the Delignification of Sugarcane Bagasse by FTIR-PCA and Pulp Properties", Journal of Biotechnology, vol. 66, pp. 177-185.
[98] Kim, S. W., Ban, S. H., Chung, H., Cho, S., Chung, H. J., Choi, P. S., Yoo, O. J., and Liu, J. R., 2004, "Taxonomic Discrimination of Flowering Plants by Multivariate Analysis of Fourier Transform Infrared Spectroscopy Data", Plant Cell Reports, vol. 23, no. 4, pp. 246-250.
[99] 2003, "Matlab Statistics Toolbox: Linkage", Matlab version 6.5.0, release 13.0.1.
[100] Zhao, H., Kassama, Y., Young, M., Kell, D. B., and Goodacre, R., 2004, "Differentiation of Micromonospora Isolates from a Coastal Sediment in Wales on the Basis of Fourier Transform Infrared Spectroscopy, 16S rRNA Sequence Analysis, and the Amplified Fragment Length Polymorphism Technique", Applied and Environmental Microbiology, vol. 70, no. 11, pp. 6619-6627.
[101] Naumann, A., Navarro-Gonzalez, M., Peddireddi, S., Kues, U., and Polle, A., 2005, "Fourier Transform Infrared Microscopy and Imaging: Detection of Fungi in Wood", Fungal Genetics and Biology, vol. 42, pp. 829-835.
[102] Mansfield, J., Sowa, M., Majzels, C., Collins, C., Cloutis, E., and Mantsch, H., 1999, "Near Infrared Spectroscopic Reflectance Imaging: Supervised vs. Unsupervised Analysis Using An Art Conservation Application", Vibrational Spectroscopy, vol. 19, pp. 33-45.
[103] Cancer Research UK website for Oral Cancer, 2005, http://info.cancerresearchuk.org/cancerstats/types/oral/?a=5441
[104] Cancer Research UK website for Oral Cancer, 2005, http://info.cancerresearchuk.org/cancerstats/types/oral/?a=5441
[105] The Concise Biotech Dictionary website, 2004, www.thebiotechdictionary.com/term/histology
[106] Bird, B., 2006, FTIR Imaging: A Route Toward Automated Histopathology, Ph. D Thesis, The Department of Chemistry, The University of Nottingham, UK.
[107] Kissin, M. W., Querci-della-Rovere, G., Easton, D., and Westbury, G., 1986, "Risk of Lymphoedema Following the Treatment of Breast Cancer", British Journal of Surgery, vol. 73, pp. 580-584.
[108] Reddy, M. and Given-wilson, R., 2004, "Screening for Breast Cancer", Surgery, vol. 22, no. 7, pp. 155-160.
[109] Turner, R. R., Ollila, D. W., Krasne, D. L., and Giuliano, A. E., 1997, "Histopathologic validation of the sentinel lymph node hypothesis for breast cancer", Annals of Surgery, vol. 226, pp. 271-278.
[110] Bird, B, June 2005, "Fourier Transform Infrared (FTIR) Imaging - A Potential Tool for Cancer Diagnosis", School of Chemistry, The University of Nottingham, UK.
[111] van Diest, P. J., Torrenga, H., Borgstein, P. J., Pijpers, R., Bleichrodt, R. P., Rahusen, F. D., and Meijer, S., 1999, "Reliability of Intraoperative Frozen Section and Imprint Cytological Investigation of Sentinel Lymph Nodes in Breast Cancer", Histopathology, vol. 35, no. 1, pp. 14-18.
[112] Gulec, S. A., Su, J., O'Leary, J. P., and Stolier, A., 2001, "Clinical utility of frozen section in sentinel node biopsy in breast cancer", American Surgeon, vol. 67, no. 6, pp. 529-532.
[113] Salem, A. A., Douglas-Jones, A. G., Moneypenny, I. J., Sweetland, H. M., Webster, D. J., Newcombe, R. G., and Mansel, R. E., 2002, "Detection of Axillary Node Status During Breast Cancer Surgery", European Journal of Surgical Oncology, vol. 28, pp. 789-
[114] Swenson, K. K., Nissen, M. J., Ceronsky, C., Swenson, L., Lee, M. W., and Tuttle, T. M., 2002, "Comparison of Side Effects Between Sentinel Lymph Node and Axillary Lymph Node Dissection for Breast Cancer", Annals of Surgical Oncology, vol. 9, pp. 745-753.
[115] Johnson, K. S., Chicken, D. W., Pickard, D. C. O., Lee, A. C., Briggs, G., Falzon, M., Bigio, I. J., Keshtgar, M. R., and Bown, S. G., 2004, "Elastic scattering spectroscopy for intraoperative determination of sentinel lymph node status in the breast", Journal of Biomedical Optics, vol. 9, no. 6, pp. 1122-1128.
[116] Godavarty, A., Thompson, A. B., Roy, R., Eppstein, M. J., Zhang, C., Gurfinkel, M., and Sevick-Muraca, E. M., 2004, "Diagnostic imaging of breast cancer using fluorescence-enhanced optical tomography: phantom studies", Journal of Biomedical Optics, vol. 9, no. 3, pp. 486-496.
[117] Smith, J., Kendall, C., Sammon, A., Christie-Brown, J., and Stone, N., 2003, "Raman Spectral Mapping in the Assessment of Axillary Lymph Nodes in Breast Cancer", Technology in Cancer Research & Treatment, vol. 2, no. 4, pp. 327-332.
[118] Contractor, K., Burke, M., Singhal, H., Bonsal, U., Boyle, S., Williams, G., Bostwick, P., and Mitchel, R., 2002, "Contact Cytology in the Intraoperative Detection of Sentinel Node Metastasis", Journal of Surgical Oncology, vol. 28, pp. 787-
[119] Surewicz, W. K., Mantsch, H. H., and Chapman, D., 1993, "Determination of Protein Secondary Structure by Fourier Transform Infrared Spectroscopy: A Critical Assessment", Biochemistry, vol. 32, no. 2, pp. 389-395.
[120] Lasch, P., Schmitt, J., and Naumann, D., 2000, "Colorectal Adenocarcinoma Diagnosis by FT-IR Microspectrometry", Proceedings of SPIE, Biomedical Spectroscopy: Vibrational Spectroscopy and Other Novel Techniques, vol. 3918, pp. 45-56.
[121] Perelman, L., Backman, V., Wallace, M., Zonios, G., Manoharan, R., Nusrat, A., Shields, S., Seiler, M., Lima, C., Hamano, T., Itzkan, I., Van Dam, J., Crawford, J. M., and Feld, M. S., 1998, "Observation of periodic fine structure in reflectance from biological tissue: A new technique for measuring nuclear size distribution", Physical Review Letters, vol. 80, pp. 627-630.
[122] Mourant, J. R., Hielscher, A. H., Eick, A. A., Shen, D., Johnson, T. M., and Freyer, J. P., 1998, "Evidence for intrinsic differences in light-scattering properties of tumorigenic and nontumorigenic cells", Cancer Cytopathology, vol. 84, no. 6, pp. 366-374.
[123] Bandyopadhyay, S., 2003, "Simulated Annealing for Fuzzy Clustering: Variable Representation, Evolution of the Number of Clusters and Remote Sensing Applications", unpublished, private communication.
[124] Pal, N. R. and Bezdek, J., 1995, "On Cluster Validity for the Fuzzy C-Means Model", IEEE Trans. Fuzzy Systems, vol. 3, pp. 370-379.
[125] Rayward-Smith, V.J., Osman, I.H., Reeves, C.R., and Smith, G.D., 1996, Modern Heuristic Search Methods, John Wiley & Sons.
[126] Conover, W.J., 1999, Practical Nonparametric Statistics, John Wiley & Sons.
[127] Causton, D.R., 1987, A Biologist's Advanced Mathematics, Allen & Unwin. London.
[128] Liu, H., 1998, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic.
[129] Diem, M., Chiriboga, L., and Yee, H., 2000, "Infrared Spectroscopy of Human Cells and Tissue. VIII. Strategies for Analysis of Infrared Tissue Mapping Data and Applications to Liver Tissue", Biopolymers (Biospectroscopy), vol. 57, pp. 282-290.
Appendix
Medical Terminologies
The medical terms mentioned within the thesis are explained below. The explanations
were either provided by the Chemistry Department or taken from an online medical
dictionary.
● Biopsy:
The removal of a small portion of tissue from the body for microscopic examination.
● Cortex tissue:
The outer layer of an internal organ or body structure.
● Fibrocollagenous tissue:
A type of tissue that is fibrous and collagenous; pertaining to or composed of fibrous
tissue consisting mainly of collagen.
● Inter-observer:
Discrepancy between two different observers.
● Intra-observer:
Discrepancy between two different examinations by the same observer.
● Ipsilateral axilla:
The axilla located on the same side of the body, where the axilla is the cavity beneath
the junction of a forelimb and the body.
● Keratinisation:
The conversion of thin outer cells into a tough material.
● Lymphoedema:
Swelling, especially in subcutaneous tissues, as a result of obstruction of lymphatic
vessels or lymph nodes, with accumulation of lymph in the affected region.
● Metastasis:
The spread of cancer from its original location.
● Morphological changes:
Changes in the form and structure of cells, studied without consideration of function.
● Necrotic tissue:
Tissue that has died through injury or disease.
● Pap smear:
A method for the early detection of cancer especially of the uterine cervix that
involves the staining of exfoliated cells using a special technique which differentiates
diseased tissue. The name ‘Pap’ is taken from the surname of the inventor of the
screening test, Dr. George Papanicolaou.
● Phagocyte cell:
A type of cell in the body, such as a white blood cell, that can absorb waste material
and harmful microorganisms; it protects the body against infection by destroying
bacteria.
● Reticulum:
A fine network formed by cells, by certain structures within cells, or by connective-
tissue fibres between cells.
● Stroma tissue:
Connective tissue; also used to refer to normal tissue.
● Trabeculae:
Small, often microscopic, tissue elements in the form of a small beam, strut or rod,
generally having a mechanical function.
● Traditional/conventional histology:
The study of plant or animal tissue; this usually involves studying thin cross-sections
of tissue under a microscope.
● Tumour tissue:
An abnormal new growth of tissue.