Fuzzy Clustering in the Analysis of

Fourier Transform Infrared Spectra

for Cancer Diagnosis

by Xiao Ying Wang, BSc

Thesis Submitted to the University of Nottingham

for the degree of Doctor of Philosophy

School of Computer Science and Information Technology

September 2006

Table of Contents

1 Introduction
1.1 Background and Motivation
1.2 Aims of this project
1.3 Overview of the Thesis

2 Literature Review
2.1 Clustering Techniques
2.2 Cluster Validity
2.3 Auto Clustering
2.4 Cluster Merging
2.5 Clustering in FTIR Spectroscopy
2.6 Summary

3 Medical Background
3.1 Introduction
3.2 Instrumentation
3.3 Sample Preparation and Data Collection
3.4 Data Pre-processing
3.5 Summary

4 A Comparison of Hierarchical, K-Means and Fuzzy C-Means Clustering of Oral Cancer Cells
4.1 Introduction
4.2 Oral Cancer Datasets Description
4.3 Experiments on Oral Cancer Datasets
4.4 Summary

5 Methods for Automatically Determining the Number of Clusters
5.1 Introduction
5.2 VFC-SA Clustering Algorithm
5.3 SAFC Clustering Algorithm
5.4 Evaluation of VFC-SA and SAFC Clustering of Oral Cancer Cells
5.5 Summary

6 Methods for the Examination of Tissue Sections
6.1 Introduction
6.2 Lymph Node Dataset Description
6.3 A Combination of Principal Component Analysis and Fuzzy C-Means Clustering
6.4 Comparison of K-Means and Fuzzy C-Means in Lymph Node Tissue Sections
6.5 Summary

7 A Cluster Merging Algorithm
7.1 Introduction
7.2 Feature Extraction
7.3 Fuzzy C-Means Based Clustering Algorithm
7.4 The Basis of a New Automated Method to Merge Clusters
7.5 Experimental Results
7.6 Discussion of Results
7.7 Summary

8 Conclusions
8.1 Contributions
8.2 Future Work
8.3 Dissemination

9 References

10 Appendix

LIST OF FIGURES

Figure 1.1 FTIR Microscopy spectra for paint analysis [24].
Figure 1.2 An overview of the project collaboration.

Figure 2.1 Two dimensional dataset with 3 clusters [27].
Figure 2.2 Dendrogram obtained from Figure 2.1 [27].
Figure 2.3 The k-means clustering algorithm.
Figure 2.4 The fuzzy c-means clustering algorithm [47].
Figure 2.5 Outline of the SA based clustering algorithm.
Figure 2.6 The perturbation process in Brown and Huntley [53].
Figure 2.7 Identification of the number of clusters by using a validity index [56].
Figure 2.8 Two dimensional dataset strips in four directions.
Figure 2.9 Effective merging radius for clusters i and j.

Figure 3.1 Typical location of lymph nodes that drain lymph from the breast.
Figure 3.2 Perkin Elmer Spotlight imager.

Figure 4.1 Tissue sample from Dataset 1; (a) 4× stained picture; (b) 32× unstained picture.
Figure 4.2 FTIR spectra from Dataset 1.
Figure 4.3 32× unstained picture from tissue sample Dataset 2.
Figure 4.4 32× unstained picture from tissue sample Dataset 3.
Figure 4.5 White light image of tissue sample Dataset 4.
Figure 4.6 Tissue section from dataset 5 (a) white light image (b) spectroscopic-staining image.
Figure 4.7 White image of tissue sample for dataset 6 (a) part 1 (b) part 2.
Figure 4.8 White image of tissue sample for Dataset 7.

Figure 5.1 VFC-SA clustering algorithm procedure.
Figure 5.2 The split centre procedure.
Figure 5.3 An illustration of Split Centre from the original algorithm with distinct clusters (where µ11 and µ12 represent the membership degree of w1 to the centres v1 and v2 respectively).
Figure 5.4 The new Split Centre applied to the same dataset as Figure 5.3, above, (where w1 is now the data point that is closest to the mean value of the membership degree above 0.5).
Figure 5.5 The SAFC clustering algorithm.
Figure 5.6 Fuzzy C-Means, VFC-SA and SAFC cluster results for dataset 1.
Figure 5.7 Cluster results for dataset 2 obtained from (a) Fuzzy C-Means, VFC-SA and 3/10 runs from SAFC (b) 7/10 runs from SAFC.
Figure 5.8 Cluster results for dataset 3 obtained from (a) Fuzzy C-Means and VFC-SA (b) SAFC.
Figure 5.9 Cluster results for dataset 4 obtained from (a) Fuzzy C-Means (b) VFC-SA and SAFC.
Figure 5.10 Cluster results for dataset 5 obtained from (a) Fuzzy C-Means and 5/10 runs from VFC-SA (b) SAFC and 5/10 runs from VFC-SA.
Figure 5.11 Fuzzy C-means, VFC-SA and SAFC cluster results for dataset 6.
Figure 5.12 Cluster results for dataset 7 obtained from (a) Fuzzy C-Means, 9 runs from VFC-SA and SAFC (b) 1 run from VFC-SA.

Figure 6.1 (a) Photomicrograph of the H&E stained parallel lymph node tissue section used for IR analysis (b) selected area – LNII5 at high magnification (c) different tissue types description (d) LNII5 spectral image.
Figure 6.2 IR imaging of lymph node tissue section LNII5 by PCA (a) H&E stained image of LNII5 (b)-(k) false colour weighted images for PC1-PC10 respectively.
Figure 6.3 Clustering results from three separate runs with fuzzy c-means.
Figure 6.4 A three-dimensional scatter plot of the tissue section spectra projected onto the first 3 PCs.
Figure 6.5 IR imaging of lymph node tissue section LNII5 by fuzzy c-means (a) H&E stained image of LNII5 (b)-(i) fuzzy c-means false colour weighted clustering results, the number of clusters were from 2 – 9 respectively.
Figure 6.6 IR imaging of lymph node tissue section LNII5 by PCA–fuzzy c-means (a) H&E stained image of LNII5 (b)-(i) fuzzy c-means false colour weighted clustering results, the number of clusters were from 2 – 9 respectively.
Figure 6.7 LNII5 tissue section spectra plot in three dimensional PCs space (a) original plot with 5 clusters (b) rotated plot of picture (a).
Figure 6.8 Clustering results from k-means (a&b) and fuzzy c-means (c) in 2 clusters.
Figure 6.9 K-means clustering results in 3 – 9 clusters.
Figure 6.10 Fuzzy c-means clustering results in 3 – 9 clusters.
Figure 6.11 Variation in k-means clustering results for 5 clusters.

Figure 7.1 The fuzzy c-means based clustering algorithm.
Figure 7.2 An extracted spectral dataset after applying fuzzy c-means based clustering algorithm.
Figure 7.3 The procedure of determining a reference wave-number.
Figure 7.4 Mean infrared spectra obtained from different clusters.
Figure 7.5 Enlarged region of Figure 7.4.
Figure 7.6 The procedure of automated merge clusters.
Figure 7.7 Four mean spectra absorbance at reference wave-number.
Figure 7.8 The resultant absorbance distribution obtained after merging the two most similar clusters.
Figure 7.9 The merging situation when there are two dist left (type 1).
Figure 7.10 The merging situation when there are two dist left (type 2).
Figure 7.11 Entire automated merging clustering procedure.
Figure 7.12 The extracted spectral dataset after applying the proposed automated merging cluster method.
Figure 7.13 An example of an extracted dataset.
Figure 7.14 An example of a whole sub area of lymph node dataset.
Figure 7.15 (a) Extracted LNII7 clustering results after applying fuzzy c-means based clustering algorithm. (b) Extracted LNII7 merged clusters results.
Figure 7.16 (a) Dataset 3 clustering results obtained from SAFC algorithm. (b) Dataset 3 merged clusters results.
Figure 7.17 (a) Dataset 5 clustering results obtained from SAFC algorithm. (b) Dataset 5 merged clusters results.
Figure 7.18 Lymph node tissue section LNII7. Sampled area was 275µm × 818.75µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 5 cluster image (d) 6 cluster image (e) 9 cluster image (f) Final results obtained from automated merge clustering algorithm – this image contained two final clusters of IR spectra.
Figure 7.19 Lymph node tissue section LNII5. Sampled area was 30625µm × 95625µm in size. (a) Total absorbance IR image (b) H&E stained image. Results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 5 cluster image (d) merged cluster result from 5 cluster image (e) 4 cluster image (f) merged cluster result from 4 cluster image. Both merged cluster results contained three clusters of IR spectra.
Figure 7.20 Lymph node tissue section LN57. Sampled area was 550µm × 512.5µm in size. (a) Total absorbance IR image (b) H&E stained image. Clustering results after fuzzy c-means based clustering algorithm. Each colour represents a different cluster of IR spectra (c) 3 cluster image (d) 4 cluster image (e) 5 cluster image (f) Final result obtained from automated merge clustering algorithm. Image contained three final clusters of IR spectra.

LIST OF TABLES

Table 4.1 Distribution of the different tissue types identified clinically and as obtained by the various clustering techniques.
Table 4.2 Comparison results based on the number of disagreements between clinical study and the various clustering results.
Table 4.3 Clustering variations for k-means and fuzzy c-means within three datasets.
Table 4.4 Average number of disagreements obtained in the three clustering methods.

Table 5.1 Average of the VXB index values obtained when using the fuzzy c-means, VFC-SA and SAFC algorithms.
Table 5.2 Comparison of the number of clusters achieved by clinical analysis, VFC-SA and the SAFC methods.

Table 6.1 The ranges of the first 3 PCs in seven oral cancer FTIR datasets.
Table 6.2 Summary of fuzzy c-means clustering computation time.
Table 6.3 Summary of PCA-fuzzy c-means computation time.
Table 6.4 Computation time comparison between PCA, Fuzzy c-means and PCA-fuzzy c-means analysis techniques.

Table 7.1 The number of clusters obtained at different stages of clustering.

ABSTRACT

This thesis focuses on the development of fuzzy clustering techniques to

investigate the use of infrared spectroscopy as a diagnostic probe for the

identification of the early stages of cancer. Several new clustering approaches are

developed and compared to existing clustering algorithms from the literature using

two different types of spectral dataset, with the aim of automatically identifying the

different types of tissue present within any given spectral dataset. The datasets have

been obtained from actual infrared spectroscopy performed at the University of

Nottingham on oral cancer and lymph node tissue sections.

Firstly, a simulated annealing based clustering algorithm is developed that can

automatically obtain the correct number of clusters from the given oral cancer

spectra datasets. Through the use of principal component analysis, a multivariate

statistical technique, the results on different spectral datasets are visualised and

compared to existing approaches, such as k-means and fuzzy c-means clustering.

The new simulated annealing approach, developed in this thesis, obtains better and

more consistent clustering results than the original variable string length simulated

annealing algorithm from the literature.

The thesis continues by developing a new technique for the purpose of merging

the clusters with the same biochemical characteristics. This is to overcome the

problem observed in previous clustering algorithms in which an excessive number of

clusters occasionally occurred. In comparison with the above-mentioned simulated

annealing clustering algorithm, this novel clustering method can more often identify

the correct number of clusters in the analysis of the same datasets. In addition, this

technique also allows for the classification of large amounts of spectral data in a

practically acceptable time. This makes the developed techniques good candidates for

transferring to medical diagnosis in the real world.

ACKNOWLEDGEMENTS

Firstly, I would like to thank my supervisor, Dr. Jon Garibaldi, for providing

support, guidance and opportunities during my study. I am extremely grateful for the

valuable comments and help he has afforded throughout this research programme.

I would also like to thank Professor Edmund Burke for giving me the

opportunity to study for my PhD in the ASAP research group.

I also thank Dr. Benjamin Bird, Professor Michael George and Mr. John M.

Chalmers from the Department of Chemistry within the University for providing

infrared spectral data and diagnosis results. I am thankful for their valuable

discussion and support throughout the period of this work.

In particular, I would like to thank Dr. Glenn Whitwell and Dr. Turhan Ozen

for their precious comments, support and help during my study.

Finally, I would like to express my sincerest thanks and gratitude to my family

in China and especially to my mum Gui Ying, my dad Ya Min and sister Xiao Jia

who have provided their deep love and unfailing encouragement throughout my

recent and previous studies.

CHAPTER 1

Introduction

Currently, substantial effort is being devoted to investigating whether infrared

spectroscopy can be used as a diagnostic probe to identify the early stages of cancer.

This thesis investigates this possibility by developing and utilising fuzzy clustering

techniques to classify data from infrared spectroscopy. This Chapter introduces the

background and motivation for this research and then details its aims and objectives.

Finally, an overview of the remainder of this thesis is provided.

1.1 Background and Motivation

Cancer has become one of the most frequent causes of mortality around the

world and research into its diagnosis and treatment has become an important issue

for the scientific community. In Britain, more than one in three people will be

diagnosed with cancer during their lifetime and, on current statistics, 25% of the

population will die from a cancer related illness [1]. Accurate diagnostic techniques

could enable various cancers to be detected in their infancy and, consequently, the

appropriate treatments could be undertaken earlier.

Currently, most cancers are diagnosed by physically removing a piece of

sample tissue from the patient and then observing this tissue section under

a microscope in order to obtain diagnostic results. In the following sections, the

example of cervical cancer will be used to illustrate the current diagnosis techniques

and the problems that exist during routine clinical practice which might result in

misdiagnosis of the cancer. Obviously, any misdiagnosis or delay in diagnosis can

have severe repercussions for the patient’s chances of a full recovery.

In most cases, pre-cancerous or cancerous cells of the cervix are first detected

with a technique known as the Pap smear (the name ‘Pap’ is taken from the surname

of the inventor of the screening test, Dr. George Papanicolaou) [2]. The procedure

is as follows: the physician firstly obtains the cervical cells and then gives the

specimen to a nurse, physician’s assistant or other specially trained medical

professional who then smears some of the cells onto glass slides and takes them to

the lab to be evaluated. In the laboratory, a pathologist will analyse the slides under

a microscope in order to categorise the cells as either abnormal (cancerous or

pre-cancerous) or normal (non-cancerous). If abnormal cells are found, the physician

requests further examinations of the patient. This often means that the existing cells,

which have been classified as abnormal, are sent to a senior cytologist for additional

investigation and, additionally, secondary cell samples may be required from the

patient for re-examination.

In the above diagnostic procedure, some factors may lead to abnormalities

going undetected by the smear test. For instance, if the doctor or nurse does not take a

good smear (as sometimes occurs with nervous patients), if the smear contains an

inadequate number of cells, or if other substances mask the cells required, then the

skilled laboratory technicians who examine the smears may make a mistake [3].

Also, in extremely rare cases, it is possible that the sample labels can even get mixed

up [4]. In the early stages of cancer formation, the composition of cells may change

very slightly. Whilst the smear may contain several thousands of cells, in the case

that only a few of these show signs of a change in composition, it can be very hard to

identify the onset of cancer. As it is impractical to scan every individual cell on the

smear, the technician will randomly select some cells from each sample and,

inevitably, abnormal cells may be missed [4].

As mentioned above, the main technique used to provide diagnosis is the

observation of morphological changes within the cells (by studying the form and

structure of the cells without consideration of function) [5]. In recent years, Fourier

Transform Infrared Micro-Spectroscopy (also referred to as FTIR Microscopy) has

been increasingly applied to the study of biomedical conditions and could become a

very powerful tool for the determination and monitoring of chemical composition

within biological systems [6]. It has also been used as a diagnostic tool for various

human cancers and other diseases [7-23]. This technology works by measuring the

wavelengths at which different functional groups of chemical samples absorb

infrared radiation (IR) and the intensities of these absorptions. The quantity of

absorption depends on the chemical bonds and the structure of the molecule and,

hence, small changes in this molecular structure can significantly affect the

absorption intensity. Since different chemical functional groups absorb light at

different wavelengths, the resultant FTIR Microscopy spectrum can be likened to a

molecular ‘fingerprint’. If the characteristic spectra of abnormal and normal tissue

components are known (in a ‘fingerprint library’), it may be possible to compare

each of the obtained spectra to the reference spectra within the fingerprint library and

thereby achieve an accurate diagnosis. Figure 1.1 shows an example of an FTIR

Microscopy spectrum from a non-biochemical application in which standard and

unknown paint samples are compared [24]. In the context of cancer diagnosis, the

FTIR Microscopy technique detects the molecular differences within the cell rather

than visual changes in the cell structure and hence FTIR may lead to an earlier

detection of cell abnormalities. In comparison with conventional histology, FTIR

Microscopy has several advantages:

a) It has the potential for fully automatic measurement and analysis.

b) It is very sensitive; very small samples are adequate.

c) It is potentially much quicker and cheaper for large scale screening

procedures.

d) It has the potential to detect changes in cellular composition prior to such

changes being detectable by other means.
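
The 'fingerprint library' comparison described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name `best_library_match`, the use of Pearson correlation as the similarity measure, and the synthetic Gaussian absorption bands are all assumptions made purely for illustration. The sketch also assumes that every spectrum is sampled at the same wavenumbers.

```python
import numpy as np

def best_library_match(spectrum, library):
    """Return (label, score) for the reference spectrum most similar to `spectrum`.

    `library` maps a label to a reference absorbance vector sampled at the
    same wavenumbers as `spectrum` (an assumption of this sketch).
    """
    spectrum = np.asarray(spectrum, dtype=float)
    scores = {}
    for label, reference in library.items():
        reference = np.asarray(reference, dtype=float)
        # Pearson correlation ignores a constant offset and overall scaling,
        # so it compares the shape of the two spectra.
        scores[label] = float(np.corrcoef(spectrum, reference)[0, 1])
    best = max(scores, key=scores.get)
    return best, scores[best]

# Synthetic illustration only: Gaussian bands stand in for absorption peaks.
wavenumbers = np.linspace(900.0, 1800.0, 200)

def band(centre, width):
    return np.exp(-0.5 * ((wavenumbers - centre) / width) ** 2)

library = {
    "normal": band(1650.0, 30.0) + 0.5 * band(1080.0, 40.0),
    "abnormal": band(1650.0, 30.0) + 0.9 * band(1240.0, 40.0),
}
# An "unknown" spectrum: the abnormal reference, rescaled and offset.
unknown = 2.0 * library["abnormal"] + 0.1
label, score = best_library_match(unknown, library)
```

Correlation is only one possible similarity measure; a real library search would also need baseline correction and normalisation of the measured spectra before any comparison is meaningful.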

There are two types of FTIR detection: FTIR mapping and FTIR imaging. In

FTIR mapping, the IR spectrum of the samples is collected one point at a time and

many separate collections must be made to examine different areas of the tissue

sample. However, in the more recently developed imaging technique [18], it is

possible to produce multiple IR spectra of the samples in a single collection. In

addition, the imaging technique allows for the collection of images in less time and

with higher spatial resolution.

Figure 1.1 FTIR Microscopy spectra for paint analysis [24].

Clustering is a multivariate analysis technique that has been adopted in both

medical diagnosis studies and pattern recognition areas [25]. By examining the

underlying structure of a dataset, cluster analysis aims to categorise data (in this case,

IR spectra) into separate groups according to their characteristics. Clustering is

performed such that the spectra held within a cluster are as similar as possible and

those found in different clusters are as dissimilar as possible. Therefore, different

cell types found within biological tissue can be separated and characterised.
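
As a concrete illustration of clustering spectra by similarity, the following is a minimal sketch of the standard textbook fuzzy c-means iteration, in which each spectrum receives a degree of membership in every cluster. It is not the implementation used in this thesis: the function name, the random initialisation, the default fuzzifier m = 2 and the stopping tolerance are assumptions chosen for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Cluster the rows of X into c fuzzy clusters.

    Returns (V, U): V has shape (c, n_features) and holds the cluster
    centres; U has shape (c, n_samples) and holds the memberships, with
    each column summing to 1.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, normalised so each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Centres are membership-weighted means of the data.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Squared Euclidean distance from every centre to every sample.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # avoid division by zero
        # u_ik is proportional to d_ik^(-2/(m-1)), normalised over clusters.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return V, U
```

Calling `fuzzy_c_means(spectra, c=3)` on an array with one spectrum per row would return three centre spectra and a 3 × n membership matrix; taking `U.argmax(axis=0)` converts the fuzzy result into a hard assignment when one is needed.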

This research is based on the development of fuzzy clustering techniques to

investigate whether infrared spectroscopy can be utilised as a diagnostic probe to

identify cancer in its early stages and whether cancerous cells can be detected more

robustly than with the currently employed procedures. This study is motivated by the

investigation into the diagnosis of oral cancer tissue that was carried out by Mr John M.

Chalmers et al. from Derby General City Hospital, UK, in 2002 [26]. Following this

study, we have been collaborating with Professor Mike Chesters, Professor Mike W.

George and his PhD student Benjamin Bird from the Physical Chemistry Department

at The University of Nottingham to obtain a wider sample source and analysis results

of infrared spectroscopy. An overview of this collaborative project is provided in

Figure 1.2.

Figure 1.2 An overview of the project collaboration.

[Figure 1.2 diagram labels: Sample collection and diagnosis; Infrared microscopy scan; Gloucestershire Royal Hospital; Chemistry department; Computer science; Clustering analysis to separate different types of tissues.]

1.2 Aims of this project

The ultimate goal of this research is to establish the techniques necessary to

develop automated diagnosis tools that will be practical and useful across a wide

range of medical domains. This thesis focuses on the development of clustering

techniques that are able to automatically classify the different types of tissue sections

and to investigate whether infrared spectroscopy can be used as a diagnostic

probe to identify early stages of cancer. In order to achieve this aim, the following

objectives were identified:

• Implement necessary pre-processing functions for the given FTIR spectra

which have irregularities in cell density across the tissue section so that these

spectra can then be analysed in a standard manner.

• Investigate the use of feature extraction techniques that may be necessary in

order to permit the clustering of large datasets in reasonable time.

• Perform comparison of different (often used) techniques in infrared spectra

analysis and choose a suitable clustering method as the main technique to

classify the infrared spectra.

• Develop an automated clustering technique which can automatically

identify different types of tissue in the given FTIR spectra datasets.

• Determine proper methods to evaluate the obtained clustering results.

• Present the obtained clustering results in a clear and easy to understand way.

1.3 Overview of the Thesis

This thesis continues with the following chapters: Chapter 2 (literature review)

includes a review of various clustering approaches and multivariate analysis methods

that have been applied previously in general clustering and infrared spectra analysis,

and introduces cluster validity as a technique to evaluate the quality of the clustering.

In addition, as one of the objectives of this research is to enable the clustering method to

automatically identify the number of clusters, previous related work in this area is

also reviewed.

Chapter 3 describes the medical background of this thesis. Two main types of

cancerous tissue sample, namely oral and breast cancer, are involved in this study. In

this chapter, the collection and preparation of these tissue samples and the

instrumentation used for this purpose are described. Finally, the necessary

pre-processing procedures used to obtain a standardised form of FTIR spectra are

presented.

Chapter 4 compares the performance of three often used clustering techniques,

namely hierarchical cluster analysis, k-means and fuzzy c-means algorithms, in

infrared analysis of seven oral cancer FTIR datasets. The diagnostic results from

clinical study are considered as the ‘gold standard’ in this thesis to evaluate the

clustering results from these three techniques. Corresponding experiments were

carried out and the results showed that fuzzy c-means is the most suitable clustering

method in this context.

Chapter 5 develops a simulated annealing based fuzzy clustering algorithm

(SAFC) to automatically detect the number of clusters from the given FTIR spectra

datasets. This method is an extension of an original method named variable string

length simulated annealing (VFC-SA) algorithm, incorporating four novel

amendments to improve its performance. The Xie-Beni validity index was employed

to evaluate the quality of the generated clusters. Experiments were performed by

applying fuzzy c-means to seven oral cancerous FTIR spectra datasets. The VFC-SA

and SAFC algorithms were applied and their performance was evaluated. The results

indicated that the SAFC algorithm produced better solutions in that it consistently

obtained better values of the Xie-Beni index.

Chapter 6 applies the techniques developed in Chapter 5 to samples taken from an

axillary lymph node tissue section using an infrared imaging technique. This tissue

section was firstly examined visually utilising principal component analysis (PCA)

and then clustering was performed using the standard fuzzy c-means method. Next,

both methods were combined so that PCA was used to extract the first ten principal

components of the FTIR spectra to be used as input to fuzzy c-means to perform the

clustering. Experimental results from these three methods were compared and the

results are discussed. In addition, the computational time required by the three

approaches was also compared. The results demonstrated that the combination of

PCA and fuzzy c-means obtained the same good results as fuzzy c-means but using

much less time. Another experiment was conducted using the combination of PCA

with the k-means algorithm and its performance was compared with the combination

of PCA and fuzzy c-means. The results showed that the combination of PCA and the

fuzzy c-means method obtained more stable clustering results (less variation in

clustering results over ten runs of the algorithm). In addition, the combination of

PCA and the k-means method could not separate the main types of tissue section

when the number of clusters was low.

In Chapter 7 the development of an automated clustering method to identify the

number of clusters for axillary lymph node tissue sections is described. The results

obtained in Chapter 5 had shown that the SAFC technique could occasionally

identify an excessive number of clusters. This may have been due to the complexity

of different cell types, such as those found in the axillary lymph node sections. In

order to attempt to solve these problems, an automated method to merge clusters was

developed. Experiments using six different FTIR spectra datasets were carried out

and the results are discussed. The results showed that clusters that have similar

biochemistry were successfully merged, thus indicating that the algorithm can

accurately determine the main tissue types within the given infrared spectral datasets.

Chapter 8 draws conclusions, lists the contributions, and suggests some

interesting potential avenues for future research arising from the work presented in

this thesis. The dissemination arising from this work is also listed.

CHAPTER 2

Literature Review

This Chapter provides a literature review of general clustering techniques and

clustering approaches in FTIR microscopy applications. As this thesis focuses on a

real medical application, robust clustering techniques that can automatically cluster

FTIR spectra data are also introduced. One of the contributions of this thesis is the

development of a new approach to merge clusters and, in light of this fact, previously

published literature on the merging of clusters is also discussed.

2.1 Clustering Techniques

Clustering is the process of grouping a set of unlabelled multidimensional

patterns (objects or data points), such that patterns in the same cluster have the most

similar characteristics, and patterns within different clusters have the most dissimilar

characteristics. In most cases a cluster is represented by a cluster centre or a

‘centroid’ [27,28]. Clustering has been applied to a wide range of applications, such

as pattern recognition, image segmentation, spatial data analysis, machine learning,

data mining, economic science, and internet portals [28-33]. Classification, another

data analysis method, is often confused with clustering. The distinction between the

two approaches is that classification is a supervised learning process which is trained

on a set of pre-labelled patterns in order to predict into which class new patterns

should be placed. In contrast, clustering is unsupervised, has no predefined classes

and does not involve training examples [34,35]. As mentioned above, the aim of

clustering is to group the patterns into clusters based on their similarity.

A basic outline of a general clustering process can be described as follows [27,35]:

1) Perform feature selection and/or feature extraction from the original

dataset. Feature selection is the process used to find the most representative

subset of the original features to be used within clustering. Feature

extraction uses one or more transformations of the original features to

produce new salient features [27]. The purpose of this step is to make the

clustering process work more ‘efficiently’ (in some way) as only the most

important characteristics need to be considered. The objective is usually to

reduce the time required for the clustering process without adversely

affecting the quality of the clusters obtained.

2) Select a proximity measure to be used. This is used to evaluate the

similarity (or dissimilarity) of two data points. This could be the Euclidean

distance, correlation coefficient or other measures.

3) Apply the clustering technique to partition the dataset. Many different

clustering techniques have been developed within the literature. This

literature review identifies and describes these approaches.

4) Validate the clusters. Cluster validation evaluates the clustering scheme

obtained from step (3). Cluster validity indices are often used to assess the

quality of the clusters.
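
The two proximity measures named in step (2) can be sketched as follows. This is only an illustrative sketch: the function names euclidean_distance and pearson_correlation are hypothetical, and the correlation coefficient is assumed here to be the Pearson coefficient.

```python
import math


def euclidean_distance(a, b):
    """Dissimilarity: straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Similarity: Pearson correlation coefficient (assumed variant) of two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

A small distance indicates similar patterns, whereas a correlation close to 1 indicates similar patterns, so the two measures are read in opposite directions.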

In general, the different clustering techniques can be divided into two main

categories. They are hierarchical and partitional clustering [27,36]. In each category,

many subtypes and variants have been applied to diverse types of clustering

problems. In conventional clustering algorithms, each pattern has to be assigned

exclusively to one cluster. Where the physical boundaries of clusters are well-

defined, this approach can work well. However, when using data from real world

applications, the boundaries between clusters might be vague. For this reason, fuzzy

clustering extends the traditional clustering concept by allowing each pattern to be

assigned to every cluster with an associated membership value. Therefore, for

unclear cluster boundaries, fuzzy clustering may obtain more reasonable results. In

partitional clustering, the normal process is to optimise an objective function which

reflects the quality of the clusters. To find better solutions, some

search based approaches have also been combined with these clustering algorithms to

maximise or minimise the objective function [37]. In the rest of this section,

the key literature is identified and described. This includes hierarchical clustering,

partitional clustering, fuzzy clustering and hybridisations of the above clustering

approaches with search based algorithms.

2.1.1 Hierarchical Clustering

Hierarchical clustering is a way to group the data in a nested series of clusters

[27]. The output of hierarchical clustering is a cluster tree, termed a ‘dendrogram’,

which represents the similarity level between all of the patterns. Figure 2.1 shows a

two-dimensional dataset which contains three clusters (data points have been labelled

A – G). The dendrogram corresponding to this figure has been displayed in Figure

2.2. Both figures have been taken from [27]. As shown in Figure 2.2, a specific

number of clusters can be generated through the vertical positioning of the cut-off

line (dashed line in the figure). All of the data points joined beneath a vertical line

that is crossed by the cut-off line belong to one cluster. The position of the cut-off

line is normally subjective and is decided based on the solution requirements. It

should be noted that if the cut-off line is placed higher on the diagram, the total

number of clusters is reduced, whereas, if the cut-off line is lowered, more clusters

are produced. Based on its algorithmic structure and operation, hierarchical

clustering can be further categorised into agglomerative algorithms and divisive

algorithms [27,35]. The agglomerative method initially considers each pattern as an

individual cluster and then, at each step, the two closest clusters (which are measured

based on the corresponding linkage method) are merged to form a new cluster and so

forth, until all the clusters are merged into one cluster. The dendrogram in the

agglomerative approach is generated in a bottom-up fashion. In contrast, the divisive

method starts by considering all patterns in one cluster and, at each step, splits the

cluster into two groups based on the similarity within the patterns, such that patterns

in the same group have the highest similarity and patterns in the different groups have

the most dissimilarity. This process continues until each cluster only contains a

single pattern. The divisive approach is based on top-down dendrogram generation.

Figure 2.1 Two dimensional dataset with 3 clusters [27].

Figure 2.2 Dendrogram obtained from Figure 2.1 [27].

As part of the agglomerative algorithm, the linkage method provides a way to

measure the similarity of clusters based on the patterns in the cluster [34]. The main

linkage methods include single linkage, complete linkage and minimum-variance

(Ward) algorithms [27,35]. Most of the other linkage methods are variants of these

three. In the single linkage algorithm, the distance between two clusters is measured

by the distance between the two closest patterns in the different clusters. By contrast,

in the complete linkage algorithm, the distance between two clusters is measured by the

distance between the two furthest patterns in the different clusters. The minimum-variance

algorithm is distinct from the other two methods because it uses variance analysis to

measure the distance between two clusters. In general, this method attempts to minimise

the sum of squared errors of any two (hypothetical) clusters that could be formed at each step

[38]. This is based on the Euclidean distance between centroids [39].
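
The three linkage measures described above can be sketched as below. The function names are hypothetical, clusters are assumed to be lists of equal-length numeric tuples, and the Ward measure is written in its common form as the increase in the sum of squared errors caused by merging two clusters, which may differ in detail from the formulation in [38,39].

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest patterns of different clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two furthest patterns of different clusters."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def centroid(cluster):
    """Component-wise mean of the patterns in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def ward_increase(cluster_a, cluster_b):
    """Increase in the sum of squared errors if the two clusters were merged."""
    na, nb = len(cluster_a), len(cluster_b)
    gap = math.dist(centroid(cluster_a), centroid(cluster_b))
    return (na * nb) / (na + nb) * gap ** 2
```

An agglomerative step would merge the pair of clusters for which the chosen measure is smallest.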

Depending on the linkage method employed, hierarchical clustering can

generate clusters having different characteristics. For example, the single linkage

algorithm has a tendency to produce a cluster with an elongated and irregular shape

whereas the complete linkage algorithm can produce tight, compact and roughly

hyper-spherical clusters [27], while the minimum-variance algorithm often generates

compact clusters of roughly equal size or dispersion [40].

2.1.2 Partitional Clustering

In contrast to the hierarchical clustering algorithm, partitional clustering

obtains a single partition of the patterns instead of a clustering structure. It usually

generates clusters by evaluating a criterion function which is defined locally or

globally and attempts to recover the natural clusters present in the patterns. The

advantage of partitional clustering methods is that they are especially appropriate in

the analysis of large data sets, where a dendrogram based hierarchical clustering

method is computationally expensive and is impractical with more than a few

hundred patterns [27,41]. The partitional clustering algorithm typically selects a

criterion and then evaluates it for a given number of clusters multiple times, each

time starting from a different initial state. The best partition found during optimisation is

returned as the result of the clustering. A drawback of the approach is that the

number of clusters needs to be specified in advance.

The most commonly and frequently used criterion in partitional clustering is

the squared error criterion which works best with clusters that are isolated and

compact [27,41]. Let us assume that a dataset $X = \{x_1, x_2, \ldots, x_n\}$ contains n patterns

and is to be clustered into c clusters. $V = \{v_1, v_2, \ldots, v_c\}$ is the corresponding set of

centres, $c_j$ is the number of patterns in cluster j and each pattern may only belong

to one cluster. The squared error $e^2$ can now be expressed as follows:

$$e^2 = \sum_{j=1}^{c} \sum_{i=1}^{c_j} \| x_{ij} - v_j \|^2 \tag{2.1}$$

where $x_{ij}$ is the ith pattern within the jth cluster; $v_j$ is the jth cluster centre, and

$\| x_{ij} - v_j \|$ is the Euclidean distance between $x_{ij}$ and $v_j$.

Based on the squared error criterion function, k-means clustering is one of the

earliest established algorithms in partitional clustering [42]. The aim of the k-means

algorithm is to minimise the squared error criterion objective function $e^2$ in Equation

(2.1). The procedure for the k-means algorithm is shown in Figure 2.3 where, for the

jth centre, $v_j$ is calculated as:

$$v_j = \frac{1}{c_j} \sum_{i=1}^{c_j} x_{ij}, \quad j = 1, \ldots, c \tag{2.2}$$

Randomly initialise the position of the c cluster centres.

1) Calculate the distance between all of the patterns and each centre.

2) Each pattern is assigned to a cluster based on the minimum distance.

3) Recalculate the centre positions using Equation (2.2).

4) Recalculate the distance between each pattern and each centre.

5) Reassign each pattern to a cluster.

6) If no data was reassigned, then stop, otherwise repeat from step 3).

Figure 2.3 The k-means clustering algorithm.
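
The procedure in Figure 2.3 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this thesis: the names k_means and squared_error are hypothetical, the centres are assumed to be initialised from c randomly chosen patterns, and an empty cluster is assumed to keep its previous centre.

```python
import math
import random


def k_means(patterns, c, max_iter=100, seed=0):
    """Hard clustering of `patterns` (list of tuples) into c clusters."""
    rng = random.Random(seed)
    # Assumption: initial centres are c distinct patterns chosen at random.
    centres = [list(p) for p in rng.sample(patterns, c)]
    labels = None
    for _ in range(max_iter):
        # Assign each pattern to its nearest centre (minimum distance).
        new_labels = [
            min(range(c), key=lambda j: math.dist(p, centres[j]))
            for p in patterns
        ]
        if new_labels == labels:  # no pattern was reassigned: stop
            break
        labels = new_labels
        # Recalculate each centre as the mean of its members.
        for j in range(c):
            members = [p for p, lab in zip(patterns, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, labels


def squared_error(patterns, centres, labels):
    """Sum of squared distances between each pattern and its cluster centre."""
    return sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(patterns, labels))
```

Because the result depends on the initial centres, such a routine would normally be run several times with different seeds, keeping the partition with the smallest squared error.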

2.1.3 Fuzzy Clustering

Fuzzy clustering has become an interesting and important branch of partitional

clustering. It was originally developed in 1969 when Ruspini applied fuzzy set

theory to clustering [43]. One of the major differences between fuzzy clustering and

hard clustering is that fuzzy clustering allows each pattern to belong to more than

one cluster with varying degrees of certainty, based on its distance to the cluster

centres. These degrees are described by the ‘membership’ or ‘soft membership’ function.

The fuzzy c-means algorithm is one of the most popular fuzzy clustering

algorithms. It was first developed by Dunn in 1973 [44] and was subsequently

improved by Bezdek in 1981 [45]. In comparison with Dunn’s algorithm, Bezdek’s

fuzzy c-means algorithm introduces a fuzzifier parameter, 1 ≤ m < ∞. The purpose of

the fuzzy c-means algorithm is to minimise the fuzzy objective function as shown in

Equation (2.3).

$$J(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{c} (\mu_{ij})^m \| x_i - v_j \|^2 \tag{2.3}$$

The resulting algorithm can recognise spherical patterns in multi-dimensional space

[46]. Once again, this can be formulated as follows: $X = \{x_1, x_2, \ldots, x_n\}$ represents a

collection of data and $V = \{v_1, v_2, \ldots, v_c\}$ is the set of corresponding cluster centres (as

previously defined in Section 2.1.2). In addition, $\mu_{ij}$ is the membership degree of

pattern $x_i$ to the cluster centre $v_j$ and $\mu_{ij}$ must satisfy the following conditions:

$$\mu_{ij} \in [0, 1], \quad i = 1, \ldots, n, \quad j = 1, \ldots, c \tag{2.4}$$

$$\sum_{j=1}^{c} \mu_{ij} = 1 \tag{2.5}$$

Parameter m is called the ‘fuzziness index’ (or fuzzifier) and is used to control the

fuzziness of the membership of each data point. A larger value of m makes the

method ‘more fuzzy’ whilst a smaller value makes the method ‘less fuzzy’. There is

no theoretical basis for the optimal selection of m, but a value of m = 2.0 is most

commonly used [45]. The Euclidean distance between $x_i$ and $v_j$ is represented by

$\| x_i - v_j \|$. $U = (\mu_{ij})_{n \times c}$ is a fuzzy partition matrix, which contains all of the

membership degree values from each data point to all cluster centres. The fuzzy c-means

procedure is shown in the following steps:

1) Fix the number of clusters, c, where $2 \le c < n$, and initialise the fuzzy partition

matrix U with random values such that it satisfies conditions (2.4) and (2.5).

2) Calculate the fuzzy centres $v_j$ using

$$v_j = \frac{\sum_{i=1}^{n} (\mu_{ij})^m x_i}{\sum_{i=1}^{n} (\mu_{ij})^m}, \quad \forall j = 1, \ldots, c \tag{2.6}$$

3) Update the fuzzy partition matrix U with

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}} \tag{2.7}$$

where $d_{ij} = \| x_i - v_j \|$, $i = 1, \ldots, n$ and $j = 1, \ldots, c$.

4) Repeat steps (2) to (3) until one of the termination criteria is satisfied.

Figure 2.4 The fuzzy c-means clustering algorithm [47].
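
The steps in Figure 2.4 can be sketched as follows. This is a hedged illustration, not the implementation used in this thesis: the function name fuzzy_c_means, the default tolerance and iteration limit, and the small lower bound placed on distances to avoid division by zero are all assumptions, and the fuzzifier is assumed to satisfy m > 1 so that the exponent 2/(m-1) is defined.

```python
import math
import random


def fuzzy_c_means(patterns, c, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Return the fuzzy partition matrix U (n x c) and the cluster centres."""
    rng = random.Random(seed)
    n = len(patterns)
    # Step 1: random memberships, each row normalised so that it sums to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        U.append([u / total for u in row])
    previous_J = math.inf
    centres = []
    for _ in range(max_iter):
        # Step 2: centres as membership-weighted means of the patterns.
        centres = []
        for j in range(c):
            weights = [U[i][j] ** m for i in range(n)]
            total = sum(weights)
            centres.append(
                [sum(w * x for w, x in zip(weights, col)) / total
                 for col in zip(*patterns)]
            )
        # Step 3: update memberships from the distances to every centre.
        for i, p in enumerate(patterns):
            d = [max(math.dist(p, v), 1e-12) for v in centres]  # assumed guard
            U[i] = [
                1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ]
        # Step 4: stop when the objective function J changes by less than tol.
        J = sum(
            U[i][j] ** m * math.dist(patterns[i], centres[j]) ** 2
            for i in range(n) for j in range(c)
        )
        if abs(previous_J - J) < tol:
            break
        previous_J = J
    return U, centres
```

Each row of the returned matrix holds the membership degrees of one pattern, so every row sums to one.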

In Figure 2.4, the fuzzy c-means procedure continues until one of the

termination criteria is satisfied. One possible criterion is that the difference

between the updated and previous values of the objective function J is less than a predefined

minimum threshold. Additionally, the maximum number of iteration cycles can also

be a termination criterion.

In this thesis, after the fuzzy c-means clustering process each pattern is assigned to the

cluster for which its degree of membership is maximal. This process is

known as ‘hardening’ the results. Studies have shown that hardening the results

obtained from fuzzy c-means produces different solutions from the hard clustering

results obtained directly from k-means, and that the fuzzy c-means solutions can be

better [33,48]. However, as with the k-means algorithm, fuzzy c-means needs the

number of clusters to be specified in advance as an input parameter to the

algorithm. In addition, both approaches can suffer premature convergence to local

optima. This is due to the fact that both these algorithms begin with random

initialisation of the cluster centres. If the initial cluster centres are not appropriate,

the iterative improvement of the centre positions can result in locally optimal

solutions being obtained.
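
The ‘hardening’ step described above amounts to taking, for each pattern, the index of its largest membership degree. A minimal sketch, assuming the partition matrix is stored as a list of rows and that ties are resolved in favour of the lowest cluster index (the function name harden is hypothetical):

```python
def harden(U):
    """Assign each pattern to the cluster with maximal membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in U]
```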

2.1.4 Clustering Based on Simulated Annealing

The limitations of the k-means and fuzzy c-means algorithms (e.g. pre-

specification of the number of clusters; convergence to the local optima) usually

result in non-optimal solutions (i.e. a locally optimal solution, not the global

optimum). A search-based clustering approach may help to avoid this problem.

Search techniques can be divided into deterministic and stochastic techniques. The

difference between these two methods is that deterministic search techniques

guarantee a globally optimal partition after an exhaustive search of the whole

solution space, but often require a prohibitively large amount of time to do so, whereas

stochastic search techniques may generate a near-optimal partition quickly and some

guarantee convergence to the optimal partition asymptotically [27]. Stochastic

approaches can also accept moves that are not locally optimal, which helps them to

avoid becoming trapped in local optima [27].

Simulated annealing (SA) [49,50] is a stochastic search technique [27] which

has been used for clustering since 1989 [51]. The simulated annealing process

essentially simulates the physical process of annealing solids which can be described

as follows. Firstly, a solid is heated to a high temperature and is then cooled

slowly so that the system at any time is approximately in thermodynamic

equilibrium. At equilibrium, there may be many configurations with each one

corresponding to a specific energy level. The chance of accepting a change from the

current configuration to a new configuration is related to the difference in energy

between the two states. In order to simulate this physical process within artificial

intelligence search frameworks, we use $E_n$ and $E_c$ to represent the new energy and

current energy respectively. $E_n$ is always accepted if it satisfies $E_n < E_c$, but if $E_n \ge E_c$,

the new energy level is only accepted with a probability specified by

$\exp(-(E_n - E_c)/T)$, where T is the current temperature. Hence, worse solutions are

accepted with a probability that depends on the change in solution quality, which allows the search to avoid

becoming trapped in local minima. The temperature is then decreased gradually and

the annealing process is repeated until no further improvement is achieved or a

termination criterion has been met. A general SA based clustering algorithm can be

described as follows [27].

1) Initialise the start and final temperatures Tmax and Tmin respectively, randomly select

an initial partition P0 and calculate the corresponding squared error value Ep0.

2) Choose a neighbour partition of P0 (P1) and calculate its squared error Ep1.

If Ep1 < Ep0, then accept it and replace P0 with P1. Otherwise,

if Ep1 ≥ Ep0, then accept it (replacing P0 with P1) only when the acceptance

probability is satisfied.

Repeat this step for a certain number of iterations.

3) Decrease the temperature and go back to step 2) until Tmin is reached.

Figure 2.5 Outline of the SA based clustering algorithm.
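
The outline in Figure 2.5 can be sketched as below. Several details are assumptions made only for this illustration and are not specified by the outline: a neighbour is generated by reassigning one randomly chosen pattern, the temperature is reduced geometrically by a factor named cooling, and the best partition seen so far is recorded. The names sa_clustering and squared_error are hypothetical.

```python
import math
import random


def squared_error(patterns, labels, c):
    """Squared error of a hard partition, using each cluster's mean as centre."""
    total = 0.0
    for j in range(c):
        members = [p for p, lab in zip(patterns, labels) if lab == j]
        if not members:
            continue  # an empty cluster contributes nothing
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, centre) ** 2 for p in members)
    return total


def sa_clustering(patterns, c, t_max=10.0, t_min=0.01, cooling=0.9,
                  moves_per_temp=50, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(c) for _ in patterns]       # random partition P0
    e_current = squared_error(patterns, current, c)
    best, e_best = current[:], e_current                 # assumed bookkeeping
    t = t_max
    while t > t_min:
        for _ in range(moves_per_temp):
            # Assumed neighbour move: reassign one randomly chosen pattern.
            neighbour = current[:]
            neighbour[rng.randrange(len(patterns))] = rng.randrange(c)
            e_new = squared_error(patterns, neighbour, c)
            # Always accept improvements; accept worse partitions with
            # probability exp(-(E_new - E_current) / T).
            if e_new < e_current or rng.random() < math.exp(-(e_new - e_current) / t):
                current, e_current = neighbour, e_new
                if e_current < e_best:
                    best, e_best = current[:], e_current
        t *= cooling                                     # assumed geometric cooling
    return best, e_best
```

At high temperatures most moves are accepted, whereas close to Tmin the search behaves almost like a greedy descent.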

The optimal number of clusters through this process can be obtained by

choosing the partition which corresponds to the minimal squared error value.

Although SA can escape local optima, it has been shown that it can be slow to find

the best solution [27].

Many simulated annealing based clustering algorithms within the literature

have been used to find optimal or near-optimal solutions. Klein and Dubes [51] used

an SA algorithm for projection in clustering problems. A neighbour, which was

referred to as a ‘move’ in this paper, was generated from random changes in the

assignment of the patterns. In order to reduce the computation time of the search,

this paper employed a simple time saving measure by using the change in cost

function after one move instead of computing the entire cost function, but this

approach was not optimal. In clustering problems, a hard membership is used to

assign the relationship of a pattern to its cluster centre. That is, ‘1’ represents that a

data point belongs to the centre and ‘0’ that it does not. The clustering process in

this paper started by assigning a set of numbers as the number of clusters for the

datasets, and then by randomly reassigning the membership value of the pattern to

the centres, a new clustering structure was generated. Consequently, the cluster

validity index was then computed. A clustering partition which corresponds to the

optimal validity index value (e.g. minimal or maximal) was considered as output.

The results from SA clustering were compared with k-means algorithm from the best

100 runs. The results indicated that the SA clustering produced more reliable results

in some cases, but this came with a prohibitive computation cost.

Selim and Al-Sultan [52] also investigated the use of the SA algorithm in

general clustering problems. The main difference between this approach and the

previous one [51] was the proposed method to generate a neighbour, also called the

‘perturbation process’. The authors propose the following algorithm for obtaining a

neighbouring assignment. First, generate a starting assignment of patterns to

clusters. Then, for each pattern, randomly draw a number between 0 and 1. If this

number is greater than a pre-specified probability threshold, then the pattern is

assigned to a random cluster, otherwise its previous assignment is kept. Another

difference is that the best result is always saved during the SA clustering procedure

in this paper, whereas in the previous one it is not. In addition, the authors also

provided a detailed investigation into the input parameters of the algorithm through a

set of experiments on ten randomly generated datasets and four suggestions were

presented. Firstly, the slower the cooling rate, the better the solution obtained, but

the longer it takes. They recommended that it should generally be set between 0.7 and

0.9. However, when the size of the dataset increases, the cooling rate should be

slower. Secondly, the probability threshold (for changing cluster assignments)

should be set to a high value to keep the neighbouring assignment close enough; 0.95

was recommended. Thirdly, the larger the number of iterations, the better the results

will be, but the time will be longer too. Once again, this also depends on the size of

the dataset. The authors recommended 50 to 600 iterations, dependent on the size of

the problem. Finally, the initial temperature should depend on the magnitude of the

objective function of the clustering. The larger the value of the objective function, the

higher the initial temperature should be. Tmax = 10 was recommended.
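
The perturbation process of Selim and Al-Sultan, as summarised above, can be sketched as follows. The function name perturb_assignment and its parameter names are hypothetical; only the rule itself (draw a number between 0 and 1 for each pattern and reassign the pattern to a random cluster when the number exceeds the threshold) is taken from the description.

```python
import random


def perturb_assignment(labels, c, threshold=0.95, rng=random):
    """Return a neighbouring assignment of patterns to c clusters."""
    new_labels = []
    for lab in labels:
        if rng.random() > threshold:
            new_labels.append(rng.randrange(c))  # reassign to a random cluster
        else:
            new_labels.append(lab)               # keep the previous assignment
    return new_labels
```

With the recommended threshold of 0.95, roughly one pattern in twenty is redrawn on average, which keeps the neighbour close to the current assignment.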

Brown and Huntley [53] formalise clustering as an optimisation problem with a

user-defined objective function named the ‘internal clustering criterion’. They

suggested two criteria: the ‘within-cluster distance’ criterion and Barker’s criterion

[54] (designed to look for areas with high density). SA was used to find a near-

optimal clustering for each criterion to solve a multi-sensor fusion problem. A

perturbation process was used to generate different partitions so as to be evaluated by

SA and is described as follows:

Assume L represents the set of cluster labels used in partition p, and |L| is the

cardinality of L. $L^c$ represents the set of cluster labels not used in partition p, and p’ is

the perturbed partition. n is the number of data points to be clustered and c is the

number of clusters. i represents one of the n data points.

1) Randomly select an object i from n data points.

2) Randomly select an integer m from the range of [0, |L|].

3) If m = 0 and there exist unused cluster labels in $L^c$, then randomly

select a label from $L^c$ and assign it to i to form p’.

4) Else randomly select a label from L and assign it to i to form p’.

5) If p’ = p, go back to step 2). Otherwise return p’.

Figure 2.6 The perturbation process in Brown and Huntley [53].
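
The perturbation process of Figure 2.6 can be sketched as follows. The function name perturb_partition and the list-based representation of a partition are assumptions, and the loop is written on the assumption that the intent of step 5 is to repeat until the perturbed partition differs from the current one.

```python
import random


def perturb_partition(partition, all_labels, rng=random):
    """partition: one label per data point; all_labels: every permitted label."""
    if len(set(all_labels) | set(partition)) < 2:
        raise ValueError("at least two cluster labels are needed")
    used = sorted(set(partition))                        # the set L
    unused = sorted(set(all_labels) - set(partition))    # the unused labels
    i = rng.randrange(len(partition))                    # step 1: pick an object
    while True:
        m = rng.randint(0, len(used))                    # step 2: integer in [0, |L|]
        if m == 0 and unused:                            # step 3: open an unused label
            new_label = rng.choice(unused)
        else:                                            # step 4: reuse a label from L
            new_label = rng.choice(used)
        if new_label != partition[i]:                    # step 5: partition must change
            perturbed = list(partition)
            perturbed[i] = new_label
            return perturbed
```

The returned partition differs from the input in exactly one position, and the input list is left unmodified.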

Finally, the paper concluded that SA is both practical and useful in evaluating

internal clustering criteria. It should be noted that the number of clusters is fixed in

the whole of Brown and Huntley’s SA process.

Unlike Brown and Huntley’s approach, Al-Sultan and Selim [55] used a fuzzy

clustering algorithm in combination with a SA algorithm. A perturbation state was

generated by taking a small step along a feasible direction at the current point in

search space. However, as with the preceding approaches, the number of clusters

was fixed. The best partitions were always updated as specified by the SA

procedure. In addition, within the SA process, the fuzzy c-means objective function

J(U, V) was used to evaluate the states of the different partitions. It is known that the

fuzzy c-means algorithm is not guaranteed to generate a globally optimal solution and

the combination of SA and fuzzy c-means also failed to obtain a globally optimal solution.

The work of Lukashin and Fuchs [56] uses the SA procedure to cluster

temporal gene expression profiles. The perturbation process is similar to the

approaches described previously whereby, at each iterative step, a randomly selected

vector was withdrawn from its old cluster and was reassigned to another randomly

chosen cluster (where a vector here means a gene with M time points or dimensions).

The sum of the within-cluster distance criterion was used as the objective function.

Lukashin and Fuchs also proposed a simple and robust clustering algorithm which

aims to find the correct number of clusters from the given datasets. The method is

established on an equation derived from the relationship between cutoff distance d (if

the Euclidean distance between two vectors is greater than or equal to d, then these

two vectors should not belong to the same cluster), the correct number of clusters c

and a pre-assigned allowance of false positives value p, namely

$$f(d, c) = p \tag{2.8}$$

where

$$f(d, c) = \frac{1}{c} \sum_{k=1}^{c} \frac{\text{number of incorrect vector pairs in cluster } k}{\text{total number of vector pairs in cluster } k} \tag{2.9}$$

is the fraction of incorrect vector pairs. The authors suggested that the value of p is

routinely set as 0.055. The value of parameter d was derived from a method the

authors called a ‘reverse engineering procedure’, such that the number of clusters c

can be easily worked out from Equation (2.8). In the reverse engineering procedure,

a dataset with a previously known clustering (and hence a known number of clusters) was used; an

SA algorithm was then applied to generate different distributions of clusters with

different values of c. The distance value d was then obtained from Equation (2.9). A

value of d=1.1 was chosen at p=0.055 where the known number of clusters was

reached. The authors verified that the value of d depended on the number of time

points rather than the number of clusters.
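
The fraction of incorrect vector pairs used by Lukashin and Fuchs can be illustrated with a short sketch. Following the definition of the cutoff distance above, a pair is counted as incorrect when the Euclidean distance between its two vectors is greater than or equal to d. The function name incorrect_pair_fraction is hypothetical, and treating a single-vector cluster as contributing a fraction of zero is an assumption of this sketch.

```python
import math
from itertools import combinations


def incorrect_pair_fraction(clusters, d):
    """Average, over clusters, of the fraction of vector pairs at distance >= d."""
    fractions = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        if not pairs:
            fractions.append(0.0)  # assumption: a singleton has no incorrect pairs
            continue
        incorrect = sum(1 for p, q in pairs if math.dist(p, q) >= d)
        fractions.append(incorrect / len(pairs))
    return sum(fractions) / len(fractions)
```

For a fixed d, this value would be computed for partitions with different numbers of clusters and compared with the allowance p.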

Recently, Yang et al. used SA and fuzzy c-means clustering to determine the

optimal parameters (number of clusters c and fuzziness index m) in an application of

microarray data [57]. At the beginning of the perturbation process, the parameters c

and m were randomly initialised. The objective function within the SA process was

defined as the fuzzy c-means objective function J(U, V) + a chosen cluster validity

index value. If the minimal value is optimal for the cluster validity then “+” was

used; otherwise, if the maximal value is optimal for the cluster validity, “−” was

used. Therefore, the smaller the objective function value, the better the clustering

structure was. During the annealing process at each temperature, the values of c and

m were randomly chosen to generate a new energy. Yang et al. stated that the

experimental results showed that the proposed method ran very quickly and achieved

a pair of values of c and m that was, on average, very near to the known best

values without exploring the entire search space.

2.1.5 Clustering with Other Stochastic Search Approaches

Apart from the SA algorithm, other stochastic search approaches have also

been applied to clustering problems. Al-Sultan applied tabu search to hard clustering

[58] and more recently, also applied it to fuzzy clustering [59]. The experimental

results showed that the hybrid approaches of tabu search with k-means and fuzzy c-

means achieved better performance than the standard k-means or fuzzy c-means

clustering algorithms. Recently, Sung and Jin [60] combined tabu search with two

newly developed procedures, named packing and releasing, for the clustering

problem. The results showed that this new heuristic algorithm outperformed tabu

search, k-means and simulated-annealing algorithms. However, the performance of

tabu search was sensitive to the selection of various control parameters [27].

Genetic algorithms (GA) have also been applied to the clustering problem.

Hall et al. [61] proposed a genetically guided algorithm (GGA) to optimise the hard

(k-means) and fuzzy (fuzzy c-means) clustering objective function used in cluster

analysis. As the k-means and fuzzy c-means clustering algorithms are sensitive to

the initial centre configurations, the GGA algorithm was used to determine good

initial centres. The results showed that by using outputs from the GGA as the

initialisation for the k-means and fuzzy c-means algorithms, better solutions can be

obtained in comparison to random initialisations. However, the main disadvantage

of the GGA approach is that the time required to discover the initial solution is high.

The experiments also showed that the time taken for GGA and k-means/fuzzy c-

means to find the partition associated with the lowest k-means/fuzzy c-means

clustering objective function is similar to the time taken by one hundred runs of k-

means/fuzzy c-means with random initialisations.

Recently, Maulik and Bandyopadhyay [62] proposed a GA-based clustering

technique, called GA-clustering. In this algorithm, each chromosome represented a

fixed number of cluster centre positions. The fitness function was defined as the

objective function of the k-means algorithm. The results showed that the GA-

clustering algorithm outperformed the k-means algorithm on the seven test data sets

used within the paper.
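
The chromosome encoding and fitness evaluation described above can be sketched roughly as follows. This is only an illustration, not the implementation from [62]: the flat chromosome layout, the use of squared Euclidean distances and the function name `kmeans_objective` are all assumptions made for the example.

```python
import numpy as np

def kmeans_objective(chromosome, X, c):
    """Decode a flat chromosome into c centres and return the k-means objective.

    Assumption: the chromosome stores the c centre positions one after another,
    and the objective is the sum of squared distances to the nearest centre.
    """
    X = np.asarray(X, dtype=float)
    centres = np.asarray(chromosome, dtype=float).reshape(c, X.shape[1])
    # Squared Euclidean distance from every point to every centre: shape (n, c).
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the squared distance to its nearest centre.
    return float(d2.min(axis=1).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = kmeans_objective([0, 1, 10, 1], X, c=2)  # one centre per group
poor = kmeans_objective([0, 1, 0, 1], X, c=2)   # both centres in one group
```

A lower objective value corresponds to a fitter chromosome, so a GA would typically maximise some decreasing function of this value.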

In 2002, Bandyopadhyay and Maulik [63] extended the work to automatically

discover the ‘correct’ number of clusters in the solution. This was achieved by

randomly initialising the number of clusters c from a pre-specified range between Cmin and

Cmax, which represent the minimum and maximum number of clusters respectively.

These c cluster centres were randomly chosen from the original dataset and distributed in

the chromosome. The fitness of a chromosome was calculated using the Davies-

Bouldin index [64] where a smaller index value represents a better clustering result

(see the following Section for more details on validity indices). Genetic operations

based on selection, crossover and mutation were applied to the cluster centres within

the chromosome for a specified number of maximum generations. The best string,

corresponding to the smallest index value from all generations, was returned as the

solution. The authors examined the efficiency of their proposed method through

experiments on both artificial and real-life datasets. In addition, a satellite image was

also analysed. The results showed that several land-cover types were able to be

identified. Other automatic clustering approaches that have utilised GAs can be

found in [65,66].

Babu and Murty [67] used an Evolutionary Strategy (ES) in hard and fuzzy

clustering problems. In their paper, a centroid type clustering objective function was

used, which enabled the approach to handle real-valued parameter optimisation

problems. The experiments showed that whilst their approaches performed better

than the individual hard and fuzzy clustering algorithms, once again, the number of

clusters had to be specified in advance.

Lee and Antonsson [68] developed a clustering algorithm which utilised an ES

that can be used to automatically find the cluster centres and number of clusters. An

ES selection strategy was used whereby a population of 10 parents and 60 offspring

was propagated into the next generation. They used the mean square error

(MSE) as the fitness measure. The proposed algorithm was applied to

a two-dimensional spatial data clustering problem and the experimental results

were promising, as the fitness values were often

better than those obtained from the known ‘true’ clustering. In addition, the

authors suggested that any fitness function could replace the MSE fitness

measure and recommended a comparison of the proposed method with other similar

approaches.

2.2 Cluster Validity

2.2.1 Introduction

In cluster analysis, one of the most important issues is the measure used to

evaluate the quality of the clustering results that are produced. This measure can

then be used to compare the solutions from different algorithms and can also be used

to steer optimisation search processes in order to find the partitioning that best fits

the underlying dataset [69]. Within clustering problems, this is quantified using a

cluster validity measure. Determining the correct number of clusters in a given

dataset is the most common application of cluster validity [70]. In general, cluster

validity is frequently used to answer the following questions [71]:

1) How many clusters should be used?

2) Is the defined cluster scheme suitable for the dataset?

3) Are there any better partitionings possible?

In general, cluster validity measures can be divided into three types, namely those

based on external criteria, internal criteria and relative criteria [41,69].

External criteria evaluate the results of a clustering algorithm against a pre-specified

clustering structure. For instance, an external criterion may measure the

degree of correspondence between the obtained clusters and the category labels

of a previously assigned structure.

Internal criteria evaluate the fitness between the clustering structure and the data

itself. In essence, it is a measure that can be derived only from the proximity matrix,

that somehow expresses the quality of the given partition.

Relative criteria evaluate two clustering structures in order to determine which one is

a relatively better representation of the given data. For instance, a relative criterion

may measure whether single linkage or complete linkage methods are more suitable

for the data.

The fundamental idea of the first two types of approach (external and internal

criteria) is to test whether the data points in the given dataset are randomly structured

or not, based on statistical testing. This usually requires some sort of calculation

involving pair-wise comparison between each pair of data points and each cluster,

which leads to a computationally expensive procedure. In addition, the indices

related to these approaches aim to measure the degree to which the dataset confirms a

pre-specified clustering scheme [69]. Conversely, the third approach does not involve

statistical tests and allows for the best clustering structure to be chosen from a set of

schemes, defined based on pre-specified criteria [69].

This thesis focuses on real world problems from the medical domain whereby

the different types of tissue (referring to the number of clusters in the clustering

problem) are usually unknown in advance. However, with the utilisation of a

validity index, the best clustering scheme (which includes the number of clusters)

may be identified. This can be implemented by applying the clustering algorithms

within a range of cluster numbers, and the partition with the best cluster validity

index value (either maximum or minimum) is returned. This procedure is described

in Figure 2.7 [71,72].

In a given dataset X, fix the other clustering parameters except for the number

of clusters, c.

1) Set the minimal and maximal cluster numbers cmin and cmax respectively.

2) For c = cmin : cmax

2.1) Initialise the cluster centres.

2.2) Apply the clustering algorithm with number of clusters c.

2.3) Calculate the validity index of the clustering scheme.

3) End for

4) Return the clustering structure that corresponds to the best validity index

value obtained throughout the procedure.

Figure 2.7 Identification of the number of clusters by using a validity index [56].
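
The procedure in Figure 2.7 can be sketched as a simple loop. The sketch below is a minimal illustration under stated assumptions: `cluster_fn` and `validity_fn` are hypothetical callables standing in for any clustering algorithm and any validity index, and the `minimise` flag reflects the fact that some indices are better when smaller and others when larger.

```python
def select_partition(X, c_min, c_max, cluster_fn, validity_fn, minimise=True):
    """Run the clustering for each c in [c_min, c_max] and keep the best result.

    cluster_fn(X, c) -> partition and validity_fn(X, partition) -> float are
    placeholders supplied by the caller (assumed interfaces, not a fixed API).
    Returns a tuple (best_c, best_score, best_partition).
    """
    best = None
    for c in range(c_min, c_max + 1):
        partition = cluster_fn(X, c)        # steps 2.1 and 2.2
        score = validity_fn(X, partition)   # step 2.3
        if best is None:
            better = True
        elif minimise:
            better = score < best[1]
        else:
            better = score > best[1]
        if better:
            best = (c, score, partition)
    return best                             # step 4
```

For an index such as the partition coefficient, which is better when larger, the caller would pass `minimise=False`.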

As mentioned previously, if an automated approach is to remain practical for

real world medical diagnosis, it is not desirable for the method to have too high a

computational expense. Therefore, this thesis has specifically focussed on using the

relative criteria to evaluate the obtained schema so that the developed approaches

remain acceptable in terms of their required computation time. In the rest of Section

2.2, validity indices suitable for fuzzy clustering are presented.

2.2.2 Partition Coefficient and Partition Entropy Coefficient

In fuzzy clustering, the fuzzy partition matrix, $U = (\mu_{ij})_{n \times c}$ (see Section 2.1.3),

represents the membership degree of the data point i to the cluster centre j. The

higher the value of $\mu_{ij}$, the more strongly the data point i belongs to cluster j. The

partition coefficient (PC) and partition entropy coefficient (PE) [73] are two validity

indices derived from the membership values $\mu_{ij}$.

The Partition Coefficient (PC) is defined as:

$$PC = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\mu_{ij}^{2} \qquad (2.9)$$

where n is the number of data points in the dataset, c is the number of clusters and $PC \in [\frac{1}{c}, 1]$.

The range of PC has been obtained from two extreme clustering cases. The first case

occurs when each data point has a membership to its cluster centre of one. In such a

case the PC value would be equal to 1, indicating that all of the clusters have well-

defined borders. In the opposite case, each data point has an equal membership to all

of the cluster centres and the PC value would be equal to $\frac{1}{c}$, indicating that the

clustering is the most fuzzy. Therefore, as the clustering quality increases, the value

of PC also increases [69].

The Partition Entropy Coefficient (PE) is defined as:

$$PE = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\mu_{ij}\log_a(\mu_{ij}) \qquad (2.10)$$

where a is the base of the logarithm and $PE \in [0, \log_a c]$. In a similar fashion, there are

two extreme cases that form the range of possible values for PE. When the clusters

are well separated, the PE value is close to 0; in contrast, when the clustering is

fuzzier, the PE value approaches $\log_a c$. Thus, the better the clustering achieved, the

smaller the value of the PE coefficient.
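
Equations (2.9) and (2.10) translate directly into a few lines of code. The following is a minimal sketch, assuming the membership matrix U is stored as an n-by-c NumPy array and that terms with zero membership contribute nothing to the entropy (the usual convention that 0 log 0 = 0); the function names are illustrative only.

```python
import numpy as np

def partition_coefficient(U):
    """PC of equation (2.9): mean over data points of the squared memberships."""
    U = np.asarray(U, dtype=float)
    return float((U ** 2).sum() / U.shape[0])

def partition_entropy(U, base=np.e):
    """PE of equation (2.10), treating 0 * log(0) as 0 (an assumed convention)."""
    U = np.asarray(U, dtype=float)
    logs = np.zeros_like(U)
    mask = U > 0
    logs[mask] = np.log(U[mask]) / np.log(base)
    return float(-(U * logs).sum() / U.shape[0])

crisp = np.eye(3)                 # every point belongs fully to one cluster
uniform = np.full((4, 2), 0.5)    # every point is shared equally by 2 clusters
```

The two matrices at the end reproduce the extreme cases discussed in the text: the crisp partition gives PC = 1 and PE = 0, while the uniform partition gives PC = 1/c and PE = log c.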

The main advantage of these indices is that they are easy to compute.

However, they are only really useful if there are a small number of well-separated

clusters [72] and they have further drawbacks as described below [69]:

1) The values of PC monotonically increase and those of PE monotonically decrease as

the number of clusters c increases.

2) By only using the membership values of the data (i.e. without using the

data itself), these indices also lack direct connection to the geometrical

properties inherent within the data.

2.2.3 Xie-Beni Validity Index

In order to overcome the problems that exist in the validity indices that were

introduced in Section 2.2.2, Xie and Beni defined a new validity index (XB) which not

only involved the membership values, but also included properties taken from the

data itself [74]. The XB index (also named the compactness and separation validity

function) is a representative index of relative validity indices [69]. In the following,

let us assume that $V_{XB}$ represents the overall XB index value, $\pi$ is the compactness of

data in the same cluster and s is the separation of the clusters. The XB validity

index can now be expressed as:

$$V_{XB} = \frac{\pi}{s} \qquad (2.11)$$

where

$$\pi = \frac{\sum_{j=1}^{c}\sum_{i=1}^{n}\mu_{ij}^{2}\,\|x_i - v_j\|^{2}}{n} \qquad (2.12)$$

and $s = (d_{min})^2$; here $d_{min}$ is the minimum distance between cluster centres, given by

$d_{min} = \min_{i \neq j}\|v_i - v_j\|$. From the expressions (2.11) and (2.12), it can be seen that a

smaller value of $\pi$ indicates that the clusters are more compact whilst a larger value

of s indicates the clusters are well separated. As a result, a smaller value of $V_{XB}$

means that the clusters have a greater separation from each other and are more

compact within each cluster.
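
As a concrete illustration of equations (2.11) and (2.12), the XB index can be computed as below. This is a minimal sketch, assuming the data X (n-by-p), the cluster centres V (c-by-p) and the memberships U (n-by-c) are NumPy arrays and that there are at least two clusters; the function name is hypothetical.

```python
import numpy as np

def xie_beni(X, V, U):
    """XB index: compactness (2.12) divided by the squared minimum centre distance."""
    X, V, U = (np.asarray(a, dtype=float) for a in (X, V, U))
    n, c = U.shape
    # Squared distance from every data point to every centre: shape (n, c).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = (U ** 2 * d2).sum() / n
    # Squared distances between distinct centres only (off-diagonal entries).
    centre_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    separation = centre_d2[~np.eye(c, dtype=bool)].min()
    return float(compactness / separation)

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
V = np.array([[0.0, 1.0], [10.0, 1.0]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
xb = xie_beni(X, V, U)
```

In this toy example the compactness is 1 and the squared minimum centre distance is 100, so the index is small, as expected for two tight, well-separated groups.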

It should be noted that the XB validity index also has a disadvantage in that, as

the number of clusters c gets very large or close to the number of data n, the index

value monotonically decreases [69].

2.2.4 Sun-Wang-Jiang Validity Index

Conventional validity indices (such as $V_{XB}$) have the problem that, as the total

number of clusters approaches the number of data points, the compactness

value has the tendency to monotonically decrease [75]. Therefore, when the number

of clusters is set to a very large value, the validity index value may not correctly

reflect the quality of the clustering. In order to overcome this problem, Sun, Wang

and Jiang introduced a new validity index measure named the Sun-Wang-Jiang

(WSJ) validity index ($V_{WSJ}$) which was based on an improvement of the

Rezaee-Lelieveldt-Reiber index [71].

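Assume dataset X is in p dimensions. Let $\sigma(v_j) = \{\sigma(v_j)^1, \sigma(v_j)^2, \ldots, \sigma(v_j)^p\}^T$

be the fuzzy variation of the cluster j, where each component is

$$\sigma(v_j)^p = \frac{1}{n}\sum_{i=1}^{n}\mu_{ij}\,(x_i^p - v_j^p)^2$$

Similarly, $\sigma(X) = \{\sigma(x)^1, \sigma(x)^2, \ldots, \sigma(x)^p\}^T$ is the variance of X, where each component is

$$\sigma(x)^p = \frac{1}{n}\sum_{i=1}^{n}(x_i^p - \bar{x}^p)^2$$

and $\bar{x}^p$ is the pth dimension of the mean value $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.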

The $V_{WSJ}$ index is also divided into two separate compactness and separation

components and can be expressed in the following form:

$$V_{WSJ}(U, V, c) = Scat(c) + \frac{Sep(c)}{Sep(c_{max})} \qquad (2.13)$$

The compactness function is

$$Scat(c) = \frac{\frac{1}{c}\sum_{j=1}^{c}\|\sigma(v_j)\|}{\|\sigma(X)\|}$$

and $Scat(c) \in [0, 1]$. As the number of clusters, c, increases, the value of $Scat(c)$ generally decreases,

since the clusters are more compact.

The separation function is

$$Sep(c) = \frac{D_{max}^{2}}{D_{min}^{2}}\sum_{i=1}^{c}\left(\sum_{j=1}^{c}\|v_i - v_j\|^{2}\right)^{-1}$$

where $D_{max} = \max_{i \neq j}\|v_i - v_j\|$ and $D_{min} = \min_{i \neq j}\|v_i - v_j\|$. The benefit of this $Sep(c)$ function is

that it considers the geometry of the cluster structures and also avoids a structure

model whereby there are too many clusters. This can be explained by approximately

writing $Sep(c)$ as $\frac{D_{max}^{2}}{D_{min}^{2}}\,\frac{c}{c-1}\,E\left[\frac{1}{d_c^{2}}\right]$, where $d_c$ is the average distance from one cluster

centre to the remaining cluster centres. When the distribution of the clusters is a tetrahedron,

$\frac{D_{max}^{2}}{D_{min}^{2}}$ reaches its minimum, which is 1, and $E\left[\frac{1}{d_c^{2}}\right]$ also reaches its minimum. When

the arrangement of the cluster centres becomes irregular, both values become larger.

However, as the number of clusters increases, $\frac{D_{max}^{2}}{D_{min}^{2}}$ will tend to increase more and

$E\left[\frac{1}{d_c^{2}}\right]$ will likely become more stable. Therefore $\frac{D_{max}^{2}}{D_{min}^{2}}$ is the main factor that penalises

the situation where too many clusters occur.

$Sep(c_{max})$ is used as a weighting factor which normalises the range of the

compactness and separation functions. A fuzzy partition which has the minimum

value of $V_{WSJ}$ is considered the best clustering arrangement. The WSJ validity index

shows excellent performance, specifically for datasets in which the clusters overlap [72].

For further details of the method, see [72].
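
The components of equation (2.13) can be sketched in code as follows. This is an illustrative reading of the definitions in this section rather than the implementation from [72]: X is assumed to be an n-by-p array, V a c-by-p array of centres, U an n-by-c membership matrix, and `sep_cmax` is assumed to be the value of Sep computed beforehand for the partition with the maximum number of clusters.

```python
import numpy as np

def scat(X, V, U):
    """Compactness: mean norm of the fuzzy cluster variations over the norm of var(X)."""
    X, V, U = (np.asarray(a, dtype=float) for a in (X, V, U))
    n = X.shape[0]
    # Fuzzy variation of each cluster in each dimension: shape (c, p).
    sigma_v = np.stack([
        (U[:, j, None] * (X - V[j]) ** 2).sum(axis=0) / n
        for j in range(V.shape[0])
    ])
    sigma_x = ((X - X.mean(axis=0)) ** 2).sum(axis=0) / n
    return float(np.linalg.norm(sigma_v, axis=1).mean() / np.linalg.norm(sigma_x))

def sep(V):
    """Separation: (Dmax^2 / Dmin^2) times the sum of inverse summed squared distances."""
    V = np.asarray(V, dtype=float)
    c = V.shape[0]
    d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    off_diag = d2[~np.eye(c, dtype=bool)]
    return float(off_diag.max() / off_diag.min() * (1.0 / d2.sum(axis=1)).sum())

def v_wsj(X, V, U, sep_cmax):
    """V_WSJ of equation (2.13); sep_cmax is Sep evaluated at c_max (supplied by caller)."""
    return scat(X, V, U) + sep(V) / sep_cmax

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
V = np.array([[0.0, 1.0], [10.0, 1.0]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

In practice the index would be evaluated for every c in the search range and the partition with the smallest value retained.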

There are many other validity indices that have been developed in the past

based on different clustering criteria. They include Dunn’s index [44], Davies-

Bouldin index [64], Fukuyama and Sugeno index [76], Rhee and Ho index [77], the

Bandyopadhyay and Maulik index [78], the Xie and Zhao index [79], the Kim and

Lee index [75], the Kim and Ramakrishna index [80] and the Wu and Yang index

[81].

2.3 Auto Clustering

In this Section, auto clustering refers to clustering algorithms which have the

ability to automatically determine the number of clusters (i.e. without being provided

in advance by an input parameter). In most of the clustering algorithms that have

been described so far, the number of clusters needs to be specified in advance

[28,68,82]. The determination of the number of clusters is an important application

of clustering techniques as, if the number of clusters can be automatically

discovered, a clustering algorithm becomes more general across a wide range of

datasets and, potentially, also across a wide range of problem domains. In Sections

2.1.4 and 2.1.5, some stochastic search based auto clustering algorithms were

introduced. Besides these, various other clustering algorithms used to automatically

detect the number of clusters have also been proposed within the literature.

As mentioned in Section 2.2.1, cluster validity indices can help automatically

determine an optimal number of clusters for an unknown dataset (the procedure

was described previously in Figure 2.7). Based on this framework, many proposed

validity indices have been developed to achieve auto clustering. Ray and Turi [83]

described a cluster validity measure which is a ratio of the intra-cluster distance to

the inter-cluster distance where the intra-cluster distance refers to the average

distance of all data objects to their cluster centres and the inter-cluster distance refers

to the minimum distance between all the combinations of two clusters. The smaller

the intra-cluster distance, the more compact the cluster and, the larger the inter-cluster

distance, the greater the separation that exists between the clusters. The objective is to

minimise the proposed validity index value so as to obtain the best clustering result.
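
The ratio described above can be sketched as follows. The sketch follows the wording of this section (average distance to the assigned centre over the minimum distance between centres) and assumes a hard assignment of each point to its nearest centre; the original measure in [83] may be defined with squared distances, so this should be read as an illustration only.

```python
import numpy as np

def intra_inter_ratio(X, centres):
    """Ratio of average point-to-nearest-centre distance to minimum centre distance."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    # Distance from every point to every centre: shape (n, c).
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    intra = d.min(axis=1).mean()
    # Distances between distinct centres only (off-diagonal entries).
    cd = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    inter = cd[~np.eye(len(centres), dtype=bool)].min()
    return float(intra / inter)

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = intra_inter_ratio(X, [[0.0, 1.0], [10.0, 1.0]])  # one centre per group
poor = intra_inter_ratio(X, [[0.0, 0.0], [0.0, 2.0]])   # both centres in one group
```

The well-placed centres yield the smaller ratio, which is the behaviour the index is designed to reward.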

This validity index was first used in an iterative process based on the k-means clustering

algorithm (the number of clusters was varied from 2 to a user-defined maximum number

Cmax). When Cmax was reached, the cluster having maximum variance was split into

two clusters. The two new cluster centres were generated based on the original

cluster centre ‘minus’ and ‘plus’ a constant value in each dimension. The purpose of

this process was to minimise the average intra-cluster distance. After the new

clustering scheme was formed, the validity index was calculated again to determine

the optimal number of clusters. Experiments were conducted on two datasets

involving synthetic images and natural images. The results showed that the proposed

validity measure had a tendency to select smaller clusters for natural images but a

modified rule was sufficient to overcome this problem. However, this rule still failed

on the synthetic image when there were only two clusters [83].

Recently, Sun et al [72] developed a new algorithm named Fuzzy c-means-

Based Splitting Algorithm (FBSA) which used the Sun-Wang-Jiang validity index

(Vwsj, see Section 2.2.4) to automatically determine the optimal number of clusters.

Within each iteration from Cmin to Cmax (minimum and maximum number of

clusters), the fuzzy c-means clustering algorithm was firstly applied to form a

clustering solution, and the corresponding validity index was then computed. Next,

the FBSA algorithm split the ‘worst’ cluster into two new clusters. Afterwards, the

validity index was calculated again and the best clustering, with the optimal number

of clusters, was selected according to the optimal validity index value. In order to identify the

‘worst’ cluster, a ‘score’ function, S(j), associated with each cluster, j, was defined:

$$S(j) = \frac{\sum_{i=1}^{n}\mu_{ij}}{\text{number of data objects in cluster } j} \qquad (2.14)$$

A smaller S(j) value reflects that the cluster j is strongly compact. Therefore, the

‘worst’ cluster corresponds to the cluster with the biggest value for S(j). After

identifying the ‘worst’ cluster, two new cluster centres were generated such that their

positions were as far away as possible from each other and the remaining centres.

Experimental results were obtained using four distinct datasets: the first was the

public-domain IRIS dataset, the second and third were generated from mixtures of

Gaussian distributions and the last came from a real survey of household expenditures

and budgets. The results showed that the combination of FBSA and VWSJ obtained

the best results in comparison with FBSA combined with other validity indices.

Kim et al [75] also proposed a validity index to automatically estimate the

optimal number of fuzzy clusters. This index was developed especially to cope with

the case whereby there exists highly overlapped and potentially vague data. This

index is based on the ratio of the overlap to the separation measure. The degree of

overlap between fuzzy clusters was calculated using the ‘inter-cluster overlap

measure’. If two clusters have a clear separation from one another, then the overlap

degree is very low. The distance between the fuzzy clusters yielded the separation

measure. Therefore, when two clusters have a larger separation distance, the clusters

are well separated. In a good fuzzy partition, the index would have a smaller value.

The fuzzy c-means clustering algorithm was applied several times using each value

within the range of Cmin to Cmax. The clustering result (within these Cmin to Cmax

results) that yielded the minimum validity index was returned. Therefore, the

solution containing the minimum validity also identifies the optimal number of

clusters, Cbest , (where Cmin <= Cbest <= Cmax). The experiments were designed to

compare the performance of the proposed validity index and nine other previously

developed indices in determining the number of clusters. The results showed that the

proposed index was more reliable and efficient than other indices.

Another approach to automatically identifying the optimal number of clusters,

apart from methods involving a validity index, is the merge and split technique.

Iterative Self-Organizing Data Analysis (ISODATA) [84] is a typical algorithm that

uses the merge and split technique to determine the number of clusters in a dataset

[28]. It is based on a similar methodology to the k-means clustering algorithm and

follows the same initialisation and iterative procedure: assign each object to its

closest corresponding cluster centres and then calculate the new centre positions

(mean value of the objects in the cluster). However, ISODATA contains one key

enhancement in that it is also able to merge and split the clusters providing some

criteria are satisfied. The criteria are defined as follows: if the number of objects in a

cluster is less than a certain threshold or if the distance between two centres is less

than a certain threshold, then the clusters may be merged. In addition, if the variance

between two clusters is larger than a certain threshold or if the number of objects in a

cluster is bigger than a certain threshold, then clusters are split. For the k-means

clustering algorithm, the number of clusters is specified in the beginning and remains

fixed until the end of the clustering process. For the ISODATA algorithm, this is not

the case (although the algorithm still needs a prior statement as to the initial number

of clusters, this does not drastically affect the final clustering results obtained [85]).

Therefore, the approach uses the merge and split procedures to find out the optimal

number of clusters based on threshold values. Even though ISODATA has a clear

advantage over the k-means approach, it requires a number of parameters to be

declared in advance which directly affect the clustering results such as the maximal

variance between two clusters, the minimal number of objects within each cluster

etc. This makes the approach very subjective [28,86].

Tou [87] proposed a cluster algorithm named Dynamic Optimal Cluster-

seeking (DYNOC), which is similar to ISODATA in that it also has the ability to

merge and split clusters. In this algorithm, the objective function is to maximise an

index which is formed by the ratio of minimum inter-cluster distance to maximum

intra-cluster distance. Clusters which reach the global maximum index value are

considered the best. This algorithm also requires the user to specify several input

parameters to direct the performance of the split and merge techniques [28,86].

Chaudhuri et al [88] also developed a split and merge clustering technique to

obtain the number of clusters from a dataset. The main difference between this algorithm

and ISODATA is that it splits the clusters by observing the density in different

directions by examining the number of data within different regions made up of

strips. Figure 2.8 shows a two dimensional dataset with strips in four directions.

The method starts by including all of the data as one cluster and then iteratively

performs a split and merge process based on the number of data in the strip area

which is determined by certain threshold values. Although this algorithm achieved

good performance on two sets of experimental data, it once again has the drawback

that many input parameters have to be specified before the approach can be executed.

Figure 2.8 Two dimensional dataset strips in four directions.

Recently, Huang proposed a Synergistic Automatic Clustering Technique

(SYNERACT) as an alternative method to ISODATA [89]. It combined a

hierarchical descending approach with the k-means clustering algorithm to avoid the

limitations that exist in ISODATA. In general, the process of SYNERACT can be

described by the following: a hyperplane splits a cluster into two smaller clusters so

that one small cluster has a positive dot product value with a weight vector and the

other small cluster has zero or negative dot product value with the weight vector.

Next, the mean of the centre of each small cluster is calculated by the mean value of

all data points in each cluster. Then, an iterative optimisation clustering procedure is

employed to estimate the reasonable movement of a point to different clusters based

on the minimisation of an objective function. The new clusters that are successively

generated from each split stage are stored in a binary tree data structure. If no further

separation can be achieved from the hyperplane, then the process stops and the

algorithm completes. According to Huang, SYNERACT has produced similar

accuracy to ISODATA, but can do so much faster. In addition, it does not require

the user to specify the number of clusters and initial centre positions in advance

although two input parameters which control the splitting process still need to be

defined.

Tseng and Yang [66] described a split and merge technique based on a genetic

algorithm for the auto clustering problem. In the first stage of this proposed method,

a nearest-neighbour clustering method was used to group the original dataset into

smaller clusters. The authors stated that the main purpose of this stage is to reduce

the data size so that it may fit into the genetic algorithm framework. The second

stage employs the genetic algorithm to merge the small groups into large clusters and

a heuristic strategy was then used to discover a good clustering scheme. Some

drawbacks of this algorithm include the pre-specification of parameters and failure to

generate the correct clustering if one cluster is partially or completely within another

cluster [65].

Garai and Chaudhuri [65] used a split and merge process similar to that of Tseng

and Yang’s algorithm for the purpose of generating a good clustering scheme.

However, in order to overcome the limitations that existed in Tseng and Yang’s

algorithm, an Adjacent Cluster Checking Algorithm (ACCA) was developed to

verify the adjacency of any two small clusters and was primarily used to perform a

merge process to form better clustering.

Many other split and merge based clustering algorithms have been proposed

within the literature and include the ‘Density-Based Spatial Clustering of Applications with Noise’

algorithm (DBSCAN) [90], the ‘Clustering Using Representatives’ algorithm (CURE) [91] and

the ‘Chameleon’ algorithm [92].

2.4 Cluster Merging

Cluster merging is sometimes used as an additional phase within a clustering

algorithm in order to discover a good number of clusters to represent the data. As

with the split and merge technique of the previous section (Section 2.3), the original

dataset is firstly split into small groups and then these clusters are iteratively merged

together until some termination conditions are reached. It is clear that different

merge criteria may lead the clustering to totally different results. In this Section, the

various cluster merging criteria are highlighted.

In ISODATA based algorithms mentioned in Section 2.3, such as DYNOC

[87], the merge criterion is based on the distance between two clusters: if this distance

is smaller than a certain threshold value, then these two clusters are merged. Within

the Chaudhuri algorithm [88], a merge circle defined in the boundary of two clusters

was used as the merging restriction. If the difference of the data points between the

intersection area of two clusters to the merge circle set is less than a threshold value,

then these two clusters can be merged.

In Tseng and Yang’s paper [66], the merging of clusters was facilitated through

the use of a genetic algorithm. The GA was initialised by randomly generating a

population of strings whereby each string is a binary encoding with an equal number

of bits to the number of clusters outputted from the nearest neighbour clustering

algorithm. Each string represents a subset of the clusters where a value of “1” in the

ith bit indicates the inclusion of the ith cluster in this subset and conversely, a value

of “0” indicates the exclusion of the cluster. After initialisation, each string, in turn,

was evaluated. This involved setting the subset of clusters as the initial cluster

centres (as identified through the “1” values within the binary encoding) and the

reallocation of each data point from the remaining clusters (identified by the “0”

bits within the string) to its nearest selected cluster. Finally, the fitness was

determined using the intra-cluster and inter-cluster distances. The members of the

population were sorted by their fitness and were chosen for reproduction using

roulette wheel selection. Two point crossover was employed to interchange the two

substrings with a crossover probability of 80%. The bits of the strings were chosen

and their values flipped from “0” to “1” or from “1” to “0” with a mutation

probability of 5%. The best string was returned after a user specified number of

generations was reached.
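
The binary encoding and the genetic operators described above can be sketched as follows. This is a minimal illustration, not the implementation from [66]: the function names are hypothetical, the mutation rate is assumed to apply independently to each bit, and the fitness evaluation and roulette wheel selection are omitted.

```python
import random

def selected_clusters(bits):
    """Indices of the small clusters included in the subset (bits equal to 1)."""
    return [i for i, bit in enumerate(bits) if bit == 1]

def two_point_crossover(a, b, rng, p_cross=0.8):
    """Swap the segment between two random cut points with probability p_cross."""
    if len(a) < 2 or rng.random() >= p_cross:
        return a[:], b[:]
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(bits, rng, p_mut=0.05):
    """Flip each bit independently with probability p_mut (assumed per-bit rate)."""
    return [1 - bit if rng.random() < p_mut else bit for bit in bits]

rng = random.Random(0)   # seeded only to make the example repeatable
parent_a = [1] * 8
parent_b = [0] * 8
child_a, child_b = two_point_crossover(parent_a, parent_b, rng, p_cross=1.0)
```

The default values of 0.8 and 0.05 mirror the crossover and mutation probabilities quoted in the text.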

Garai and Chaudhuri [65] used a similar GA algorithm to merge the closest

smaller clusters into large clusters. The improvement of this paper over Tseng’s

paper is the ‘Adjacent Cluster Checking Algorithm’ (ACCA) which was primarily

performed as the merge process to overcome the problem when one cluster is

confined fully or partly within another cluster or clusters. This was implemented by

checking two threshold values: the number of boundary points Tb and data density

difference Td between two smaller nearby clusters for deciding the merging of a pair

of clusters. If the number of boundary points between the two clusters was greater than Tb and

the difference of density was less than Td , then these two smaller clusters were

verified as adjacent to each other. In the experiments performed, both of the

thresholds needed to be specified manually.

Kelly [93] proposed an algorithm for merging hyper-ellipsoidal clusters. In

this paper, an effective merging radius was introduced, where i and j represent two

clusters and rij was defined as the minimum effective cluster radius such that the

boundary of hyper-ellipsoids i and j intersect on the segment between their mean

vectors, as shown in Figure 2.9 from [93]. The smallest rij within all possible pairs

of clusters was chosen and the two corresponding clusters were merged. Two stop

criteria were given in the paper: 1) if the current cluster size is reduced to a given

size; 2) a threshold value R was specified such that the merge process was continued

until rij was greater than R.

Figure 2.9 Effective merging radius for clusters i and j.

Recently, Xie et al [79] developed a ‘multi-step maxmin and merge’ (3M)

algorithm. A new cluster validity based on compactness and separation measures

was described and used to evaluate the obtained clustering schema. The maximum

validity index that was achieved corresponded to the best clustering scheme. The

algorithm can be implemented in two steps. Firstly, a multi-step maxmin algorithm

(a modified version of the maxmin algorithm proposed by Tou and Gonzalez [94])

was used to group the original data into c clusters which corresponded to the

maximum index value. Secondly, a merge process was performed on the obtained c

clusters: the ‘worst’ cluster (corresponding to the minimum validity index) was

deleted, and all data points were reassigned to their nearest existing cluster centres,

the new centres were formed by calculating the median of each cluster. This iterative

merge process continued until the total number of clusters was two. The cluster

validity was computed at each step of iteration and the clustering scheme that

corresponded to the minimal index value was returned.

2.5 Clustering in FTIR Spectroscopy

2.5.1 Introduction

In order to analyse the FTIR microscopic data from existing tissue samples,

various techniques have been used. These include point spectroscopy analysis,
greyscale functional group mapping and digital staining [14]; discriminant
function analysis [15]; and multivariate statistical analysis methods such as principal
component analysis (PCA) [16]. Apart from these, multivariate clustering
techniques have often been used to separate sets of unlabelled infrared spectral data
into different clusters based on their characteristics (this is an unsupervised process).

Through the examination of the underlying structure of a set of spectral data, different
types of cells can be separated within biological tissue. Many clustering
techniques have been applied to FTIR spectroscopic analysis.


These include hierarchical clustering analysis (HCA) [17-19], k-means clustering

[17] and fuzzy c-means clustering [16,17].

The application of commonly used techniques in FTIR spectroscopy analysis,

namely a multivariate analysis method (PCA) and multivariate clustering techniques

(HCA, k-means and fuzzy c-means), is now discussed.

2.5.2 Principal Component Analysis in Cluster Analysis

Principal component analysis (PCA) is a multivariate statistical technique that

has been widely applied in the field of data analysis and compression. This

technique linearly transforms a number of correlated variables into a new set of

uncorrelated variables, called principal components (PCs). PCA achieves this by

rotating the original axes to produce orthogonal axes that are uncorrelated to each

other [95]. The rotation procedure is a linear transformation of the original dataset

and, therefore, if all the variables are included in the rotation, then all information is

preserved. Within the new transformed variables, the first principal component

(PC1) identifies the dimension with the maximum variation in the original data and

the second principal component (PC2) is the dimension with the second largest

variation, and so forth. Therefore, the first few principal components usually contain

the most influential variations from the original data and it is this property that yields

the major advantage of the approach by allowing the dimensionality of the data to be

reduced whilst the most significant features of the data are retained. However, this

may not necessarily be the case; it depends on the application. The other main

application of PCA is to detect the underlying data structure; this is often used in


cluster analysis [96]. The techniques for using PCA in FTIR spectroscopy cluster

analysis are now introduced.
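The variance-ranking property described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function name `first_pc`, the use of power iteration and the fixed iteration count are all assumptions made for illustration. The sketch estimates the direction of PC1 from the sample covariance matrix and reports the fraction of the total variance that PC1 captures.

```python
import math

def first_pc(data, iterations=200):
    """Return (direction, variance_fraction) of the first principal component.

    data: list of rows, one spectrum per row and one variable per column.
    Illustrative only; assumes the start vector is not orthogonal to PC1.
    """
    n, p = len(data), len(data[0])
    # Mean-centre each variable (column).
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges towards the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data variation")
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance along PC1; the trace is the total variance.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    total = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total
```

Multiplying the returned fraction by 100 gives a percentage of the kind quoted for the first few PCs in the studies discussed in this section.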

As mentioned above, most of the variation within the original data can be
represented by the first few PCs. Thus, if the data are plotted in the space of
selected PCs, the data structure which encloses different clusters may simply be detected or
verified. For example, if PC1 and PC2 are plotted, then the dimensionality of the
data has been reduced to a two-dimensional space. Goncalves et al [97] found that by
plotting the FTIR spectra from sugarcane bagasse samples in the two-dimensional space
of PC1 and PC2, the different types of samples could be detected through
visualisation. Indeed, on further analysis they discovered that, by reducing the data
to the first two PCs, the dimensionality could be reduced significantly and yet these
components still retained 88% of the variation of the original data. For how the
percentage of variation is calculated, see the end of Section 5.4 (page 107).

In Lasch et al [20], PCA was employed for two purposes within the cluster

analysis of FTIR data. Firstly, it was used to generate a coloured image of different

types of tissue from the FTIR spectra. This was achieved initially by picking six

representative reference spectra from the FTIR maps based on histological

information after Haematoxylin and Eosin (HE) staining of the tissue section. Each
representative reference spectrum was considered as an origin and the geometrical
distances from the origin to the rest of the spectra were then calculated in the space of
any two PCs. The distance from each spectrum to every origin was normalised by
dividing by the maximal distance from that origin and a series of values based on


each origin were then obtained. By combining these values with the original spatial

information, colour or grey scale maps could be yielded. Thus, the different types of

tissue could be pictured through these image maps. In order to verify whether

conventional light microscopic analysis matched the biochemical information

obtained from FTIR analysis, PCA was employed again. The first six PCs of the

original data were used as input data for the hierarchical clustering algorithm

(based on Ward’s algorithm [39]). The clustering results showed that IR-based

classification matched the visual light microscopic investigations.
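The distance normalisation used to build the image maps in the first part of this study can be sketched as follows. This is a hedged illustration rather than the procedure of Lasch et al [20]: the function name `normalised_distance_map` and the representation of each spectrum as a pair of PC scores are assumptions, and the sketch assumes that at least one spectrum differs from the reference.

```python
import math

def normalised_distance_map(scores, ref_index):
    """Distances from one reference spectrum to all spectra, scaled to [0, 1].

    scores: list of (PC score, PC score) pairs, one pair per spectrum.
    ref_index: position of the reference spectrum (the 'origin') in scores.
    """
    ref = scores[ref_index]
    # Euclidean distance from the origin to every spectrum in the chosen PC plane.
    distances = [math.dist(ref, s) for s in scores]
    # Divide by the maximal distance from that origin.
    largest = max(distances)
    return [d / largest for d in distances]
```

Each returned value could then be placed at the pixel position of its spectrum to give a grey scale map for that reference spectrum.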

Kim et al [98] also recently employed PCA as a pre-processing step for cluster

analysis, in a similar fashion to Lasch et al [20], using FTIR spectra from seven different
plant species. The spectra were plotted in the two-dimensional PC1 vs. PC2
space and the authors showed that the different categories of plants could be

identified simply using the visualisation. Subsequently, the results based on PCA

were used as the input data to the hierarchical clustering technique. The authors

showed that the generated dendrogram was in agreement with the known taxonomy

of the plants and this indicated that the FTIR data reflected phylogenetic

relationships between the plants [98].

PCA techniques have also been used in order to evaluate the tumour tissue

from FTIR spectroscopic maps. In Richter et al [16], PCA techniques were used to

extract the first 20 PCs of the original dataset, which were then used in a fuzzy c-means
clustering approach. Richter et al identified that different tissue types could

be displayed by showing all spectral score maps in an entire sample image. The


score maps showed all spectra in the new coordinate system defined by the PCs

whereby the first score map explains all of the new spectral coordinates in PC1 and

so forth. As the PC number increases, the authors showed that less information can be
discovered and, in fact, the score maps of the first four PCs covered 99% of the total variance

of the dataset. The authors also provided an interesting analysis into the information

that could be gathered from each of the principal components. Using the first score

map, only the whole sample shape could be identified. From the second score map,

different regions of the tissue began to emerge. However, after the fourth score map,

the effects started to become less and less and yielded very little additional

information. In order to improve the interpretability of the PCA score maps, the

authors also combined the scores of the second, third and fourth PCs to form one

coloured image (each PC was shown as a separate colour). The experiments

indicated that fuzzy c-means clustering in combination with PCA was suitable to

separate different types of tissue based on their FTIR maps. The authors claimed

that, in comparison with PCA, the fuzzy c-means clustering algorithm offered a

clearer view of the main features of the tumour section [16].

2.5.3 Hierarchical Clustering Analysis

When IR spectra are analysed using multivariate clustering techniques, each
spectrum is considered as an individual data point. The p absorbance values (or features)
associated with each spectrum are considered as the different dimensions of the data
point. Therefore, the IR spectra in cluster analysis are represented as data points in a
p-dimensional space. Hierarchical clustering analysis (HCA) on IR spectra can be


described as follows [17]: in a space containing n data points, a distance matrix

between all points within p dimensions is calculated. The size of the distance matrix

is n×n, and it is symmetric about its diagonal. The two data points that are closest and,
therefore, most similar to each other are then merged together to form a new cluster.
The new distance matrix is now reduced to (n-1)×(n-1). The procedure repeatedly
merges the two closest data points or clusters until all data points are merged into one cluster.
Once again, the output of HCA is the dendrogram (as described in Section 2.1.1) and
a cut-off line can be drawn to define the number of clusters and to generate the final
partition. Whilst this step is normally subjectively performed, the number of clusters
is usually provided as an input parameter to the HCA algorithm.
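The merging procedure described above can be sketched in a few lines. The sketch is illustrative only and makes several assumptions that do not come from [17]: the function name `agglomerate`, the use of single linkage (the smallest point-to-point Euclidean distance) as the distance between clusters, and stopping once k clusters remain, which corresponds to drawing the cut-off line at the level of the dendrogram where k clusters exist. The distances are recomputed at every step for clarity rather than stored in a distance matrix.

```python
import math

def agglomerate(points, k):
    """Naive agglomerative clustering; returns k clusters as lists of point indices."""
    # Start with every data point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage (an assumption): closest pair of points between two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        best = None
        # Find the two closest clusters.
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge them, reducing the number of clusters by one.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

The repeated search over all pairs of clusters makes the cost of this naive version grow rapidly with the number of spectra, which is consistent with the drawbacks of HCA reported later in this section.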

In most HCA studies of IR spectra, Ward’s algorithm (also named the
minimum-variance algorithm in Section 2.1.1) is used because it tends to

produce dense clusters [17]. Ward’s algorithm finds clusters by minimising the total

sum of squared error from all clusters [39]. Ward’s original method for calculating

distance between two clusters (r and s) is relatively complex, and has been simplified

to an ‘equivalent’ distance within the MATLAB implementation used in this thesis

as follows [99]:

$$d^2(r,s) = \frac{n_r n_s \, d_{rs}^2}{n_r + n_s} \qquad (2.15)$$

where $d_{rs}^2 = \|\bar{x}_r - \bar{x}_s\|^2$ is the (squared) distance between clusters r and s. $\bar{x}_r$ and $\bar{x}_s$ are the

centres of clusters r and s, which are equal to the mean value of all the data points in


each of the respective clusters. The Euclidean distance is represented by ||.|| and nr

and ns represent the number of data points in clusters r and s respectively.
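Equation (2.15) can be evaluated directly from the members of two clusters, as in the following minimal sketch. The function name `ward_distance_sq` is hypothetical and the code is not the MATLAB routine referred to above; it only computes the quantity as written in the equation.

```python
def ward_distance_sq(cluster_r, cluster_s):
    """Squared Ward distance of Equation (2.15) between two clusters of points."""
    n_r, n_s = len(cluster_r), len(cluster_s)
    p = len(cluster_r[0])
    # Cluster centres: the mean of all data points in each cluster.
    mean_r = [sum(x[j] for x in cluster_r) / n_r for j in range(p)]
    mean_s = [sum(x[j] for x in cluster_s) / n_s for j in range(p)]
    # Squared Euclidean distance between the two centres.
    d_rs_sq = sum((a - b) ** 2 for a, b in zip(mean_r, mean_s))
    return n_r * n_s * d_rs_sq / (n_r + n_s)
```

For the clusters {(0,0), (2,0)} and {(5,0)} the function returns 32/3, which is the amount by which the total sum of squared error increases when the two clusters are merged (from 2 to 38/3).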

Apart from the Euclidean distance measure, another distance (or similarity)

measure, the correlation coefficient or correlation, is also often used in FTIR spectra

analysis. It is used to evaluate the strength and direction of a linear relationship

between two spectra and can be defined as below [19]:

$$C_{RS} = \frac{\sum_{j=1}^{p} (S_j - \bar{S})(R_j - \bar{R})}{\sqrt{\sum_{j=1}^{p} (S_j - \bar{S})^2 \sum_{j=1}^{p} (R_j - \bar{R})^2}} \qquad (2.16)$$

where S and R are two spectra, which can be considered as two one-dimensional
vectors with p absorbances (in cluster analysis, the absorbance values can be viewed
as the variables), and $\bar{S}$ and $\bar{R}$ are the mean values of each vector. Computing $C_{RS}$ for
every pair of spectra gives the correlation matrix and, as it is symmetric, only the upper half is needed.

Out of the different correlation measures, the most frequently used is Pearson’s

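correlation coefficient. It is defined as follows [83]:

$$PCC_{RS} = \frac{\sum_{j=1}^{p} S_j \cdot R_j - p \cdot \bar{S} \cdot \bar{R}}{\sqrt{\left(\sum_{j=1}^{p} S_j^2 - p \cdot \bar{S}^2\right) \cdot \left(\sum_{j=1}^{p} R_j^2 - p \cdot \bar{R}^2\right)}} \qquad (2.17)$$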

The output of the correlation coefficient lies within the range [−1, 1] and indicates
the similarity of the two spectra. If the correlation coefficient is close to −1, the
two spectra are completely opposite whereas, if its value is close to 1, the two spectra
are almost identical in shape. If the correlation coefficient is 0, there is no linear correlation between the

two spectra. Therefore, the spectra with the highest correlation coefficients are

considered to be the most similar spectra and it is these spectra that are merged at

each iterative step of the hierarchical clustering algorithm.
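The correlation measure can be sketched in two algebraically equivalent ways: a mean-centred form and a form based on raw sums, the latter following the layout of Equation (2.17). The function names `correlation` and `pearson` are hypothetical, and the sketch assumes that neither spectrum is constant (otherwise the denominator is zero).

```python
import math

def correlation(S, R):
    """Correlation coefficient of two spectra, mean-centred form."""
    p = len(S)
    s_mean, r_mean = sum(S) / p, sum(R) / p
    num = sum((s - s_mean) * (r - r_mean) for s, r in zip(S, R))
    den = math.sqrt(sum((s - s_mean) ** 2 for s in S) *
                    sum((r - r_mean) ** 2 for r in R))
    return num / den

def pearson(S, R):
    """Pearson's correlation coefficient, raw-sum form as in Equation (2.17)."""
    p = len(S)
    s_mean, r_mean = sum(S) / p, sum(R) / p
    num = sum(s * r for s, r in zip(S, R)) - p * s_mean * r_mean
    den = math.sqrt((sum(s * s for s in S) - p * s_mean ** 2) *
                    (sum(r * r for r in R) - p * r_mean ** 2))
    return num / den
```

A spectrum compared with a scaled and shifted copy of itself gives a value of 1, and compared with its negation gives −1, in line with the interpretation given above.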

Different HCA techniques (e.g. based on various linkage methods) using various
distance measures (e.g. Euclidean distance, correlation) have been widely proposed in the
literature for FTIR spectroscopy analysis. Wood et al
[19] applied an HCA method using the correlation coefficient as the distance measure to

investigate FTIR spectra from cervical cancer. Up to ten clusters were selected

according to the major anatomical features and the mean spectrum from each cluster

was extracted for comparative purposes. The clustering results showed that the

approach was able to separate normal and diseased tissues from each other, in

comparison with conventional histological analysis.

Zhao et al [100] investigated the use of FTIR spectroscopy to characterise a

group of 20 different bacteria using the Euclidean distance as the similarity measure to

construct a dendrogram. Twenty different bacteria were to be classified by DNA

sequencing. Two other techniques, namely 16S rDNA sequencing and fluorescent

amplified fragment length polymorphism (AFLP), were also used to analyse the

same bacteria and the results obtained from these three methods were compared. It

was found that all approaches generated similar outcomes. However, in comparison

with the other two techniques, FTIR was the fastest method. In addition, it was easy to

use and inexpensive in terms of laboratory use.


Salman et al [21] employed Ward’s algorithm to look into the detection of cells

infected by different variations of the herpes virus using FTIR spectroscopic

methods. In order to obtain the best partition results, cluster analysis was performed

on different segments of the spectra. The results indicated that the spectral region
between 950 and 1350 cm⁻¹ achieved the best results. The paper concluded

that the normal and herpes infected cells can be discriminated within the early stages

of the infection.

Naumann et al [101] used Ward’s algorithm to detect fungi in wood
through FTIR microscopy and imaging. A false colour image was used to display
the FTIR spectra cluster analysis results from wood fibres, empty vessel lumina and
mycelium of both fungal species. The image showed that the
three wood blocks could be distinguished. The paper also concluded that the FTIR
microscopy technique had the potential to identify different fungal species decaying

wood.

The combination of different distance measures with Ward’s algorithm has
also been developed in the past. Schultz et al [22] applied Euclidean distance in Ward’s
algorithm to study chronic lymphocytic leukaemia cells using FTIR

spectroscopy. Recently, Romeo and Diem [18] made use of the correlation

coefficient in Ward’s algorithm to reduce the artefacts present in the related area of

‘infrared transflection micro-spectra’ to improve the quality of the spectral analysis.

Lasch et al [17] employed a linear transformation of Pearson’s correlation coefficient


measure in Ward’s algorithm. IR spectra maps from a colorectal adenocarcinoma

section were investigated in this study.

Although in these experiments, HCA techniques performed well in cluster

analysis of FTIR spectra, the various authors identified some major drawbacks of the

method. Firstly, the size of the correlation matrix (or distance matrix) requires a

large amount of computer memory [18]; secondly, it has a high computational

requirement which becomes especially evident when analysis is performed on large

datasets. In Lasch et al’s paper [17], the authors reported that the application of

HCA to analyse 8281 spectra took 4.5 hours which, in a practical environment, is

unacceptable.
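The memory requirement mentioned above can be made concrete with a rough calculation for the n = 8281 spectra of [17]. The storage size of 8 bytes per entry (double precision) is an assumption made here for illustration and is not stated in the cited papers.

```latex
\begin{align*}
n^2 &= 8281^2 = 68\,574\,961 \text{ entries}
      &&\Rightarrow\ 8 n^2 \approx 5.5 \times 10^{8} \text{ bytes (full matrix)} \\
\frac{n(n-1)}{2} &= \frac{8281 \times 8280}{2} = 34\,283\,340 \text{ entries}
      &&\Rightarrow\ 4 n (n-1) \approx 2.7 \times 10^{8} \text{ bytes (upper half only)}
\end{align*}
```

Under this assumption, even storing only the upper half of the symmetric matrix requires several hundred megabytes, and the requirement grows quadratically with the number of spectra.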

2.5.4 K-means Clustering Analysis

The application of the k-means clustering algorithm to IR spectra can be described
as follows [17]. In a space containing n data points (spectra), where each data point
once again has p dimensional (absorbance) values, an initial k data points are
randomly chosen. Each of these k points represents an initial cluster centre. The
distances from all data points to these chosen k cluster centres are then calculated
and each data point is assigned to the cluster that has the minimal distance value.
Next, a set of new cluster centre positions is computed based on the newly formed
clusters (the mean value of the data points in each cluster is considered as the new
cluster centre position). This procedure is then repeated. During the iterative
process, as the cluster centres change, each data point may be reassigned to a different
cluster centre many times. The method will stop when all of the cluster centres are in


stable positions. Compared with HCA, k-means is not time-consuming and
its execution time is linearly proportional to the number of spectra n.
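The steps described above can be sketched as follows. This is a minimal illustration, not the implementation of [17]: the function name `k_means`, the `seed` and `max_iter` parameters, and the choice to keep the previous centre when a cluster becomes empty are all assumptions.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means; returns (labels, centres) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Randomly choose k data points as the initial cluster centres.
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each data point to the cluster with the minimal distance.
        new_labels = [min(range(k), key=lambda c: math.dist(pt, centres[c]))
                      for pt in points]
        # Unchanged assignments mean the centres are in stable positions.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centre as the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres
```

Each pass computes n × k distances, which reflects the linear dependence of the execution time on the number of spectra for a fixed number of clusters and iterations.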

In a paper by Zhang et al [23], a k-means clustering technique was used for the

assignment of pixels in an image to identify cell and non-cell categories, where the

breast cell lines were evaluated using an FTIR microscopic imaging measurement.

Lasch et al [17] also employed a k-means clustering method to evaluate the FTIR

microspectroscopy imaging of colorectal adenocarcinoma tissue sections. Several
different numbers of clusters were pre-specified (2, 4, 6, 8 and 11) for
use in the experiments. The results showed that the k-means technique can categorise

the principal differentiation from histopathology and, especially when the number of

clusters was 6, all spectral clusters could be clearly assigned to a specific histological

structure. Nevertheless, when further increases were made in the number of clusters,

the k-means approach failed to further discriminate between the information within

the histological structures [17]. It should be noted again that the k-means clustering

algorithm always requires the user to specify the number of clusters prior to execution

of the algorithm.

2.5.5 Fuzzy C-Means Clustering Analysis

In most applications of the fuzzy c-means algorithm, each spectrum needs to

undergo a ‘hardening’ process to convert the results into crisp values. This is simply

performed by assigning each spectrum to a cluster according to its highest

membership degree or to a membership degree greater than a threshold value.

Studies have shown that by hardening the clustering results from fuzzy c-means,


better clustering results can be achieved than using the inherently hard results from

k-means [33,48]. Similarly to the k-means algorithm, the fuzzy c-means clustering
technique does not have high computational requirements as its execution time is
also proportional to the number of spectra.
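The 'hardening' step can be sketched as below. The function name `harden` and the convention of returning `None` for a spectrum whose highest membership falls below the threshold are assumptions made for illustration; the cited studies do not prescribe how such spectra are represented.

```python
def harden(memberships, threshold=None):
    """Convert fuzzy memberships into crisp cluster labels.

    memberships[i][c] is the membership degree of spectrum i in cluster c.
    Without a threshold, each spectrum goes to its highest-membership cluster.
    With a threshold, spectra whose highest membership is lower stay unassigned.
    """
    labels = []
    for row in memberships:
        best = max(range(len(row)), key=lambda c: row[c])
        if threshold is not None and row[best] < threshold:
            labels.append(None)  # assumption: unassigned spectra are marked None
        else:
            labels.append(best)
    return labels
```

With the memberships [[0.9, 0.1], [0.4, 0.6], [0.98, 0.02]], hardening by the highest membership gives the labels [0, 1, 0], whereas a strict threshold such as 0.975 leaves only the third spectrum assigned.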

McIntosh et al [14] employed a fuzzy c-means clustering algorithm for the

analysis of IR microscopic maps of human skin. Five distinct clusters were

identified and each cluster was correlated to separate histological tissue components.

The authors report that the approach can clearly separate tumour-bearing skin from

normal skin. On the obtained solution, the five centroid spectra from each cluster

were able to clearly separate the chemical differences between the tissue types.

Mansfield et al [102] analysed the remnants of a work of art (a 16th Century

Flemish line drawing) using a near-IR spectroscopic imaging technique. Firstly, each

spectrum was associated with a pixel element of the image. Then the fuzzy c-means

algorithm was applied to both raw and normalised spectroscopic data (mean centred,

scaled) in order to isolate all four components from background easily. In this study,

the authors used a threshold fuzzy membership value of 0.975 for “hardening” of the

results. The authors report that fuzzy c-means cluster analysis is an excellent

exploratory methodology as it does not require knowledge of the sample’s

composition or spectral properties in advance. The use of spectral normalization

routines in conjunction with fuzzy c-means cluster analysis provides a more detailed

picture of the range of spectral types in the test sample.


Richter et al [16] used fuzzy c-means to evaluate the tumour tissue from FTIR

spectroscopic maps. The highest membership grade was chosen when assigning

each spectrum to a specific cluster. Each cluster was then encoded into a different

colour to display the cluster results. The experiments showed that fuzzy c-means

provided a general view of the main features of the tumour thin section.

Recently, Lasch et al [17] utilised a fuzzy c-means technique to investigate
FTIR microspectroscopy imaging of colorectal adenocarcinoma tissue sections. In

this study, all membership values were encoded by colour intensities. Fuzzy c-

means cluster images were then plotted into two dimensional space by PCA so as to

compare with other cluster analysis and histological results. The computational

experiments indicated that when the number of clusters was set to 3, 4 or 6, the fuzzy

c-means images could be assigned to the specific tissue structures. However, when

the number of clusters was increased further, the results became more and more

vague in terms of the relationship to the known histopathology.

2.6 Summary

In this chapter, an overview of both the general clustering techniques utilised
within the literature and the clustering approaches used within FTIR spectroscopic
applications was provided. The purpose of clustering is to group the objects
(spectra) so that objects in the same cluster are as similar as possible and objects in
different clusters are as dissimilar as possible. Thus, through the clustering of
different FTIR spectra, diverse types of cells can be separated.

In different clustering procedures it is commonly required that the quality of

the clustering schema is verified. This is achieved by a cluster validity measure.

This chapter identified several of the most relevant cluster validity measures that

have been used within the literature in order to evaluate the partition results.

This chapter also identified the clustering literature in which algorithms are

proposed for the purpose of automatically and correctly identifying the most

appropriate number of clusters with which to represent a dataset. As the main focus

of this study is the clustering of different tissue types, particular consideration has

been given to algorithms that also aim to cluster tissue types. It was identified that

occasionally, during the automatic clustering procedure, an excessive number of

clusters was obtained.

Finally, a number of publications that utilise various clustering techniques for

the processing of FTIR spectra data were discussed. In all of these approaches, focus

was drawn to the many problems and disadvantages that exist in these techniques.

In the next Chapter, an overview of the medical background relevant to this

thesis is presented.


CHAPTER 3

Medical Background

This research is a collaboration between the Computer Science department

(Xiao Ying Wang supervised by Dr Jon Garibaldi) and the Chemistry department

(Benjamin Bird supervised by Professor Michael George) at the University of

Nottingham, UK, and is motivated by the study of Mr. John M. Chalmers et al which

focussed on the use of FTIR microscopy of oral tissue samples. In light of the fact

that the medical background information and technical considerations for the FTIR

microspectroscopy did not fall under the remit of the School of Computer Science for

this research, most of this chapter has been derived from either our joint publications

(including several conference papers, journal papers and a refereed book chapter),

our colleagues in the Chemistry department, or from the internal report of John M.

Chalmers et al.

3.1 Introduction

In this Chapter, the general medical background related to this thesis is

described and explained. Two types of cancer cells have been investigated in this


study, namely, oral cancer and breast cancer. In order to examine whether the

suspected patients have these types of cancer, tissue samples are collected and,

therefore, this Chapter will also describe the process of collecting, preparing and

conducting FTIR microscopy analysis on these samples.

Oral cancer is a disease which can result in a large number of fatalities.
The latest statistics from the Cancer Research UK website (2005) show that nearly
4,500 oral cancer cases are diagnosed, and more than 1,600 deaths occur, in the UK each
year [103]. The late detection of this disease is often the cause of mortality. Breast
cancer is the second most common cancer in the UK. The latest statistics from the
Cancer Research UK website (2006) show that more than 41,700 women and
around 300 men are diagnosed with breast cancer in the UK each
year. More than 12,400 deaths are caused by this disease every year in the UK [104].

The ability to accurately identify the malignancy is crucial for prognosis and

preparation of effective treatment.

Currently, the diagnosis process for oral cancer begins with visual checks
of the patient’s mouth and throat by a doctor. If there are any abnormal areas present,
then a small piece of tissue is removed for further investigation by a pathologist
under a microscope (removing tissue to look for cancer cells is, in medical terms,
called a biopsy). The purpose of this procedure is to check whether cancerous cells
exist within the tissue area. However, this traditional histology (the study of plant

or animal tissue, usually this involves studying thin cross-sections of tissue under a

microscope [105]) remains a subjective technique and some problems are


occasionally encountered such as missed lesions, broken samples and unsatisfactory

levels of discrepancy. Discrepancies can be either inter-observer (discrepancy

between two different observers) or intra-observer (discrepancy between two

different examinations by the same observer).

For breast cancer, some preoperative imaging methodologies, such as x-ray

mammography and ultrasound, can identify areas of tumour growth in the breast

based on the identification of density changes within the tissue. However they

cannot be used to reliably diagnose whether the tumours are benign or cancerous in

nature [106]. Additionally, the diagnosis of breast cancer can often also be achieved

by assessing the lymph nodes in the ipsilateral axilla (the axilla on the
same side of the body as the affected breast). The presence of metastasis (cancer spread from its original

location) is an indicator for local disease recurrence and thus a method for

identifying patients who are at high risk of developing a cancer variant that could

spread throughout the body. The well-established procedure to assess lymph node

metastases is axillary lymph node dissection (ALND). Nevertheless, this is a rather

substantial surgical procedure that can lead to several serious side effects, such as

shoulder dysfunction and lymphoedema (swelling, especially in subcutaneous

tissues, as a result of obstruction of lymphatic vessels or lymph nodes, with

accumulation of lymph in the affected region) [107]. The introduction of

mammography screening programmes, together with a greater public awareness of

breast cancer have meant that the majority of patients who do not have axillary

lymph node metastases at presentation do not have to undergo ALND [108].


Intra-operative diagnosis has become increasingly important with the recent

introduction of sentinel lymph node biopsy [109]. The sentinel node can be

described as any lymph node that has a direct lymphatic connection to the tumour,

and would be the first invaded by cancer spreading from the breast [106], see figure

3.1 [110]. Surgical studies have clearly shown that if cancer cannot be found in the

sentinel lymph node, the chance of disease being found further down the chain of

lymph nodes that drain the breast is negligible [109]. Therefore accurate analysis of

the sentinel lymph node can alleviate the necessity to remove all suspected nodes

present.

Present techniques have been employed to facilitate fast intra-operative

diagnosis of sentinel nodes, such as imprint cytology and frozen section assessment

[111]. However, these approaches report a wide variation in their sensitivity and

specificity to detect cancerous lesions, with detection levels as low as 44% and as

high as 93% when compared against conventional histology [111-114]. In addition,

these techniques are heavily reliant upon the availability of an experienced

cytopathologist; thus, examination by a general pathologist may result in lower

accuracies than those from specialist clinics. This general lack of consistency

between different pathologists leads to less reliability of such intra-operative tools for

sentinel lymph node diagnosis.


Figure 3.1 Typical location of lymph nodes that drain lymph from the breast.

The difficulties that exist with the current cancer diagnosis techniques have

resulted in a variety of different spectroscopic methods being investigated in order to

determine whether such approaches could be used to generate a reliable aid for

diagnosis [115-117].

3.2 Instrumentation

Infrared spectroscopy has shown much potential as a tool for analysing
biological materials over the past decades [113,118,119]. When biological
molecules are exposed to radiation in the mid-infrared region of the electromagnetic
spectrum (400−4000 cm⁻¹), characteristic absorptions from the excitation and
vibration of bonds within the molecules can be exhibited [106]. FTIR
microspectrometry, obtained through the coupling of an infrared microscope to an
FTIR spectrometer, has proved to be a potent new diagnostic technique for
the determination of a variety of tissue structures [120]. FTIR microscopic spectra
can detect subtle changes in spectral peaks and their positions from the biomolecule
constituents, such as proteins, lipids and nucleic acids. Therefore, the very small
biochemical changes that occur between different cell types can be noticed, even
with very complex cells.

For the purposes of this research, three types of instrumentation have been
utilised: two types of IR spectrometer, namely a Nicolet Continuum and a
Perkin Elmer Spotlight Imager, and a Nicolet Nic-Plan microscope that was coupled
to a synchrotron source (see Section 3.2.3). Figure 3.2 shows the Perkin Elmer
Spotlight Imager used in this study. In the following Sections, each of these
instruments is described in detail.

Figure 3.2 Perkin Elmer Spotlight Imager.


3.2.1 Nicolet Continuum FTIR Microspectrometer

The apparatus comprises a Nicolet Nexus 730 FTIR spectrometer (Nicolet
Instruments, Inc., Madison, USA), fitted with a potassium bromide (KBr) beam
splitter. This spectrometer is interfaced to a Nic-Plan IR-microscope that has
its own liquid nitrogen cooled narrow-band (1800−900 cm⁻¹) mercury-cadmium-telluride
single element detector. Transmission spectra were recorded either at 4 cm⁻¹
or 8 cm⁻¹ spectral resolution, typically with 512 or 1024 scans per spectrum. The FTIR
microscope was operated using a 32× objective lens. Background single-beam
spectra were recorded through a blank barium fluoride (BaF2) window [26].

3.2.2 Perkin Elmer Spotlight Imager

The Perkin Elmer Spotlight Imager (Perkin-Elmer Corp., Sheldon,
Connecticut) was also used in this study; it is an FTIR microscope that is similar
to the Nicolet Continuum instrument. The main difference between the two
instruments is that the Perkin Elmer Spotlight Imager comprises a dual set of
detectors. The microscope is equipped with both a 100 µm single element detector
and an array detector. When operated in array mode, the system utilises a 16 × 1
element (400 µm × 25 µm) linear array of small area narrow band (4000−720 cm⁻¹)
detectors coupled with an electronic stage to raster across the sample in both the
horizontal (X) and vertical (Y) planes, so that a microscopic IR image can be
constructed. The advantage of the array mode is that it has the capability of scanning
16 different spatial areas at once, thus enabling larger sample areas to be examined
rapidly at the microscopic level. The microscope can be purged with dry air using a specially


designed Perspex box to reduce spectral contributions from atmospheric CO2 and

water vapour. Spectra were collected in transmission mode, using a clean BaF2

window as background, with a spectral resolution of 8 cm⁻¹. Each pixel sampled a
6.25 µm × 6.25 µm area of the sample. An appropriate background spectrum was
collected from the sample in order to ratio against the single beam spectra. These
ratioed spectra were then converted to absorbance values, with each spectrum
containing 821 data points (a 4 cm⁻¹ data point interval within the range 4000−720 cm⁻¹)

[106].

3.2.3 FTIR microspectroscopy utilising a Synchrotron Radiation Source

An FTIR microspectrometer coupled to a Synchrotron Radiation Source (SRS) can
measure tissue sections containing very small cells (approximately 10µm in diameter).
The SRS greatly enhances the brightness of the IR radiation, resulting in readings
that are up to 1000 times stronger than those of a conventional source [121], so that
IR spectra can be collected at these spatial sizes with a significantly higher
signal-to-noise ratio. Synchrotron light is produced when an electron is

accelerated to near-relativistic speeds by a magnetic field made up of multiple huge

dipole electromagnets. Initially electrons are emitted from a hot cathode and

accelerated to approximately 12 MeV by a linear accelerator. They are then injected

into a booster ring raising their energy to 600 MeV before finally being injected into

a storage ring where they circulate at approximately 2 GeV. Accelerating charged

species via magnets induces the release of radiation creating a circular beam that

encompasses a broad spectrum of wavelengths. The type of light that is produced is

dependent upon the magnetic field strength and the energy of the electron beam

[122]. Therefore the maximisation of both these factors produces the shortest

wavelength of light. To stop the reduction of the energy of the light after it is

emitted, Radio Frequency (RF) devices are used to add energy to the beam. Beam

blockers ensure that the appropriate wavelengths of radiation reach the different

experimental stations. The synchrotron radiation should be very stable with gradual

exponential decay over many hours. The beam that emerges at the IR station is

filtered to give the correct wavelengths, and then directed via mirrors to a

conventional infrared microscope. In this study, the IR beamline at the UK
SRS laboratory in Daresbury was used, where the synchrotron source has been
coupled to a Nicolet Nic-Plan FTIR microscope.

3.3 Sample Preparation and Data Collection

In this thesis, two types of cancer cells, namely oral cancer and breast cancer,

have been used for the investigation. The following describes the preparation of the

tissue samples and FTIR spectroscopic data collection.

3.3.1 Oral cancer tissue samples

Oral cancer tissue specimens were collected from three patients who had been

diagnosed with oral cancer. The samples were kindly provided by Derby
General Hospital with the full consent of the patients in question. From each patient,
several samples were collected from various areas, encompassing a mixture of
tissue types. Once taken, the samples were immediately frozen in
liquid nitrogen to preserve their biochemical condition. They were
subsequently cut using a freezing microtome to obtain 5µm thick tissue sections.
These sections were then mounted on 0.5mm thick BaF2 windows for infrared

analysis. The remainder of the tissue specimen was then used for analysis through

conventional methods involving Hematoxylin and Eosin (H&E) staining for the

identification of the regions of particular interest for histology. The results from

these two parallel sections can be used for comparative analysis by a pathologist.

Some of the sections were analysed by infrared spectroscopy first and simply
stained afterwards for histological examination [26,106].

3.3.2 Breast cancer tissue samples

Breast cancer tissue specimens were collected during routine surgical resection

for breast cancer with approval from Gloucestershire Research Ethics Committee and

fully informed and consenting patients. From each appropriate case, a small portion
of one axillary lymph node was collected, and tissue areas containing a variety of
lymph node tissue types were chosen for analysis. As with the oral samples, the
specimens were frozen immediately and cut using a freezing microtome to obtain 7µm
thick tissue sections. These sections were then placed onto a barium fluoride disc and
stored in a cryovial ready for infrared analysis. As for the oral cancer
tissue samples, parallel sections were stained using H&E staining procedures so that
a consultant breast histopathologist could provide comparative analysis results.

3.4 Data Pre-processing

The FTIR spectra output by the IR spectrometer need to be pre-treated
before undergoing multivariate analysis; this is also called data pre-processing.
Pre-treatment includes the removal of absorption interference from atmospheric water
vapour and CO2, and baseline correction, which is often applied to correct the sloping and
curved baselines encountered in cell spectra. In addition, because the thickness varies
within each sample, normalisation is required to remove the resulting effects. For the oral
cancer tissue samples, some pre-processing was undertaken using routines within the
Nicolet OMNIC32™ software supplied with the FTIR spectrometer (Nicolet
Continuum). The baseline points chosen to flatten the spectra were
4000, 3750, 1815, 930, 700 and 650cm-1. Each spectrum was normalised
by setting the intensity maximum of the Amide II band at ca. 1542cm-1 to
1 absorbance unit. The pre-processing described above is referred to as basic
pre-processing, and was performed using the Pirouette multivariate analysis software.

For the breast cancer tissue samples, the tissue area is greater than for
the oral cancer samples; therefore the Perkin-Elmer Spotlight imager
was used to obtain the spectroscopic data. However, the software that comes with the
spectrometer (Infometrix Pirouette®, version 3.02, multivariate data analysis
software, Infometrix, Inc., Woodinville, WA, USA) is not well suited to making these
corrections. Originally, therefore, the necessary baseline correction was performed
spectrum by spectrum (as was normalisation), which was very time-consuming.

In order to solve the baseline correction and normalisation problems in the breast

cancer (lymph node) tissue samples, the author of this thesis implemented these two

corrections using Matlab version 6.5, release 13.0.1 (Mathworks, Natick, MA, USA).

The six baseline points chosen in the lymph node tissue samples were: 4000, 3744,

2200, 1836, 876, and 720cm-1. Two types of normalisation, namely peak area and
vector normalisation, were implemented. Peak area normalisation scales each
spectrum so that the sum of the absorbances over the indicated wavenumber range
(4000−720cm-1) equals unity; vector normalisation scales each spectrum so that the
sum of squared deviations over the same range equals unity. In this thesis, peak area
normalisation was used in most cases, as it is faster than vector normalisation. It
should be noted that only the basic pre-processing was applied to the spectra from the
lymph node tissue samples.
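The baseline correction and the two normalisations described above can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code used in this work; the piecewise-linear baseline drawn through the chosen baseline points, and all function names, are assumptions made for the example.

```python
import math

def baseline_correct(wavenumbers, absorbances, baseline_points):
    """Subtract a piecewise-linear baseline through the given wavenumbers.

    Assumption: the baseline is linear between consecutive baseline points.
    """
    # Index of the sampled wavenumber nearest to each baseline point.
    anchors = sorted({
        min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - p))
        for p in baseline_points
    })
    corrected = list(absorbances)
    for a, b in zip(anchors, anchors[1:]):
        for i in range(a, b + 1):
            t = (wavenumbers[i] - wavenumbers[a]) / (wavenumbers[b] - wavenumbers[a])
            baseline = absorbances[a] + t * (absorbances[b] - absorbances[a])
            corrected[i] = absorbances[i] - baseline
    return corrected

def peak_area_normalise(spectrum):
    """Scale so that the sum of the absorbances equals unity."""
    total = sum(spectrum)
    return [a / total for a in spectrum]

def vector_normalise(spectrum):
    """Scale so that the sum of squared deviations from the mean equals unity."""
    mean = sum(spectrum) / len(spectrum)
    norm = math.sqrt(sum((a - mean) ** 2 for a in spectrum))
    return [a / norm for a in spectrum]

# 821 data points at a 4cm-1 interval over 4000-720cm-1.
wavenumbers = [4000 - 4 * i for i in range(821)]
lymph_node_baseline_points = [4000, 3744, 2200, 1836, 876, 720]
```

Peak area normalisation needs only one pass to sum the absorbances, whereas vector normalisation needs the mean and then the squared deviations, which is consistent with the speed difference noted above.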

3.5 Summary

In this chapter, the medical background of this research was presented, the

current diagnosis procedures for oral cancer (briefly) and breast cancer were

described and the difficulties existing in the processes were illustrated. These

difficulties are the main motivation behind a considerable research effort to

investigate whether infrared spectroscopy can be used as a diagnostic probe to

identify early stages of cancer since these techniques are sensitive to biological

changes within cells. Finally, this Section identified the FTIR microspectroscopy

procedures that we have used to investigate the above mentioned two types of cancer

tissue samples. These include the instrumentation used to obtain the FTIR spectra

data, tissue sample preparation, spectral data collection and pre-processing.

In the remainder of this thesis, the investigations of FTIR spectroscopic data

from these cancerous tissue samples are presented. In the next Chapter, three

clustering techniques which are often used in FTIR spectra analysis, namely

hierarchical cluster analysis, k-means and fuzzy c-means, are applied to the seven

sets of oral cancer FTIR spectral data.

CHAPTER 4

A Comparison of Hierarchical,

K-Means and Fuzzy C-Means

Clustering of Oral Cancer Cells

4.1 Introduction

In 2002, John M. Chalmers et al reported the analysis of sets of FTIR spectra
taken from oral cancer tissue samples [26]. In general, the experiments analysed the
tissue samples in two parallel processes. In the first process, the samples were
scanned by FTIR spectroscopy and pre-processing procedures were applied to the
resulting spectra (see Section 3.4). In addition, various extra
pre-processing techniques, such as mean-centring, variance scaling and first
derivative calculation, were applied to the FTIR spectra empirically, chosen to suit the specific
multivariate analysis, in order to facilitate classification of the different tissue types. HCA

(average linkage) was mainly used to classify the spectral data from different types

of tissue area and PCA was used to distinguish these data by visual inspection. In the

second process, the samples were stained with a chemical solution and then

examined through conventional cytology to group the samples into different

functional groups. The outcomes from these two processes were then compared.

The clustering results showed that accurate clustering could only be achieved by

manually applying extra pre-processing techniques that varied according to the

particular sample characteristics and clustering algorithms. However, the pre-

processing procedures needed extra time, software tools and significant human

expertise. If a clustering technique could be developed that obtained clustering
results as good as, or even better than, conventional clinical analysis without the
need for these extra pre-processing procedures, it would make diagnosis more efficient
and enable automation.

4.2 Oral Cancer Datasets Description

In the oral cancer FTIR spectra, there are a total of seven datasets taken from

three different patients. The spectral range in this study was limited to a

900−1800cm-1 interval. Figure 4.1 (a) shows a 4× magnified visual image of
one of the Hematoxylin and Eosin stained oral tissue sections. There are two types of
cells (stroma and tumour) in this section, and their regions are clearly identifiable by
their light and dark coloured stains respectively. Figure 4.1 (b) shows a 32×

magnified visual image from a portion of a parallel, unstained section; the

superimposed dashed white line separates the visually different morphologies. Five

single point spectra were recorded from each of the three distinct regions. The

locations of these are marked by “+” on Figure 4.1 (b) and numbered as 1−5 for the

upper tumour region, 6−10 for the central stroma layer, and 11−15 for the lower

tumour region. The fifteen FTIR transmission spectra from these positions are

recorded as dataset 1, and the corresponding FTIR spectra (without extra pre-

processing) are shown in Figure 4.2.

Figure 4.1 Tissue sample from Dataset 1: (a) 4× stained picture; (b) 32×
unstained picture.

Figure 4.2 FTIR spectra from Dataset 1.

Figure 4.3 shows a 32× magnified visual image of the unstained section for dataset 2;
the superimposed dashed white line separates the visually different morphologies.
Ten single point spectra, numbered 16−25, were recorded from the tumour region on the
right-hand side, and the remaining eight spectra, numbered 26−33, from the stroma
region on the left-hand side.

Figure 4.3 32× unstained picture from tissue sample Dataset 2.

Figure 4.4 shows a 32× magnified visual image of the unstained section for dataset 3.
This section also contains two types of cells (stroma and tumour). Four spectra,
numbered 34−37, were recorded from the left tumour region, three spectra,
numbered 38−40, from the central stroma layer, and the remaining four spectra,
numbered 41−44, from the right tumour region.

Figure 4.4 32× unstained picture from tissue sample Dataset 3.

Figure 4.5 shows a white light image of three types of tissue sample from

dataset 4 and different morphologies can be visualised in the picture. The

corresponding spectra numbers are also shown below (The distinct grey-scale

contrast between the left half and right half of the image is artificial. It is a

consequence of the image being a composite of two independent pictures

corresponding to each half). It may be noticed that the boundary between stroma and
early keratinisation follows a meandering path through area numbers 88, 72, 56 and
55; in a similar manner, the boundary between the marked tumour and stroma

region does not follow a vertical line as indicated, but rather appears to meander

somewhere through the area contained within the area numbers 50−52, 65−67 and

80−82. A closer histopathological inspection highlighted that there had been

invasion of the stroma region by tumour within the vicinity of the boundary between

the two layers. At this stage of the study, we are only concerned with ascertaining

the spectral characteristics of essentially distinct classes of tissue cells, rather than
gradation processes or mixed types [26]. Therefore the spectra within the two boundary
regions and the invasion area were excluded. The excluded spectra are numbers 46, 50, 51,
55, 56, 65, 66, 71, 72 and 81−88. That is, the number of spectra was reduced from

the original 48 to 31. Subsequently, the corresponding spectral points were

renumbered sequentially from 45 to 75. The three different categories of tissue types

in the new spectral numbering are as follows:

Tumour: 45−48, 56−59, 68−71.

Stroma: 49−51, 60−63.

Early keratinisation: 52−55, 64−67, 72−75.
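The exclusion and renumbering arithmetic for dataset 4 can be checked with a short sketch. The variable names are illustrative only; the area numbers and tissue-type ranges are those quoted in the text.

```python
# Original area numbers for dataset 4: 48 areas numbered 45..92.
original = list(range(45, 93))

# Areas excluded because they lie in the boundary regions or the invasion area.
excluded = {46, 50, 51, 55, 56, 65, 66, 71, 72} | set(range(81, 89))

kept = [n for n in original if n not in excluded]

# Renumber the remaining spectra sequentially, starting again from 45.
renumbered = {old: new for new, old in enumerate(kept, start=45)}

# Tissue types expressed as inclusive ranges in the new numbering.
tissue_ranges = {
    "Tumour": [(45, 48), (56, 59), (68, 71)],
    "Stroma": [(49, 51), (60, 63)],
    "Early keratinisation": [(52, 55), (64, 67), (72, 75)],
}
counts = {
    name: sum(hi - lo + 1 for lo, hi in ranges)
    for name, ranges in tissue_ranges.items()
}
```

Here `kept` contains 31 spectra, the new numbers run from 45 to 75, and the per-type counts sum to 31.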

Area numbering in Figure 4.5 (three rows of sixteen areas): 45−60, 61−76, 77−92.

Figure 4.5 White light image of tissue sample Dataset 4.

Figure 4.6 presents a tissue sample from dataset 5. Thirty spectra were
recorded, one from each grid cell on the white light image, and the corresponding spectra
numbers are displayed in Figure 4.6 (a). Figure 4.6 (b) shows a
“spectroscopic-staining” image of the same tissue area, which accords well with that from
conventional histopathological H&E staining. Two types of tissue cells (stroma and tumour)
exist in this section; however, the analysis showed that the spectra in the boundary region,
coloured purple in Figure 4.6 (b), were closer to tumour than to stroma.

Figure 4.6 Tissue section from dataset 5: (a) white light image; (b) spectroscopic-staining image.

Figure 4.7 displays the tissue section for dataset 6, shown in two images, from which
fifteen spectra were recorded. Three visually different areas, numbered 131−135,
136−140 and 141−145, have the characteristics of tumour, stroma and tumour respectively.

Figure 4.7 White light image of tissue sample for dataset 6: (a) part 1; (b) part 2.

Figure 4.8 shows a set of five white light images taken from an oral tissue
section from a third patient (dataset 7). Histopathological examination showed that this is a
complex region containing stroma, tumour and necrotic tissue. A linear scan
consisting of consecutive spectral points was recorded. As with dataset 4, spectra
that lay on a boundary between cell types or whose spectral characteristics
were unclear were eliminated from the originally recorded points, leaving 42
spectra. After renumbering, the spectra are distributed as follows:

Tumour: 201−210, 225−235.

Stroma: 211−224.

Necrotic: 236−242.

Figure 4.8 White light image of tissue sample for Dataset 7.

4.3 Experiments on Oral Cancer Datasets

In this Chapter, three data clustering techniques that have often been used in
FTIR spectroscopy analysis, namely hierarchical cluster analysis (HCA), k-means
and fuzzy c-means clustering, are used to classify the seven oral cancer FTIR spectra
datasets introduced above. The tissue types in these datasets had been identified through
conventional cytology [26], and only basic pre-processing was applied to the spectra (no
extra pre-processing). In hierarchical clustering, four different linkage methods, namely
“single”, “average”, “complete” and “ward”, were applied individually. Because the k-means
and fuzzy c-means algorithms are sensitive to their initial states, each method was
run ten times. The parameter settings used for these three clustering algorithms
are as follows:

HCA: The Euclidean distance was used to calculate the distance between data points.

K-means: The squared Euclidean distance was used to compute the distance between each data point and its centroid; the maximum number of iterations was 100.

Fuzzy c-means: The fuzziness index was 2; the maximum number of iterations was 100; the minimal amount of improvement was 10⁻⁵. As for k-means, the squared Euclidean distance was used to calculate the distances between data points and centroids.

These algorithms were implemented in Matlab (version
6.5.0, release 13.0.1).
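To make the fuzzy c-means settings concrete, the following is a minimal sketch of the algorithm with the parameters listed above (fuzziness index 2, at most 100 iterations, minimal improvement 10⁻⁵, squared Euclidean distance). It is written in plain Python for illustration and is not the Matlab implementation used in this work; the function names, the random choice of initial centres from the data and the `seed` argument are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(data, c, m=2.0, max_iter=100, min_improvement=1e-5, seed=0):
    rng = random.Random(seed)
    # Assumption: initial centres are c distinct data points chosen at random.
    centres = [list(p) for p in rng.sample(data, c)]
    previous = float("inf")
    memberships = []
    for _ in range(max_iter):
        # Membership update. With squared distances D = d^2, the usual
        # exponent 2/(m-1) on d becomes 1/(m-1) on D.
        memberships = []
        for x in data:
            d = [max(sq_dist(x, v), 1e-12) for v in centres]
            inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
            total = sum(inv)
            memberships.append([w / total for w in inv])
        # Centre update: membership-weighted mean of the data.
        centres = []
        for j in range(c):
            weights = [row[j] ** m for row in memberships]
            wsum = sum(weights)
            centres.append([
                sum(w * x[k] for w, x in zip(weights, data)) / wsum
                for k in range(len(data[0]))
            ])
        # Objective function; stop when it improves by less than the threshold.
        objective = sum(
            (memberships[i][j] ** m) * sq_dist(data[i], centres[j])
            for i in range(len(data)) for j in range(c)
        )
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
    return centres, memberships
```

Changing `seed` changes the initial centres, which is why the algorithm was run ten times per dataset.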

4.3.1 Results

The numbers of spectra of each tissue type, as identified clinically
and as obtained by the three clustering techniques, are displayed in Table 4.1. As
mentioned previously, clustering is an unsupervised process; this means that the
clustering simply groups the data into two or more unlabelled
categories. In the results presented below, the clusters were mapped to the
known classifications in such a way as to minimise the number of disagreements
with the clinical study in each case. The results are compared with those of a
previous study on the same data, in which the data were pre-processed empirically before
analysis. As noted above, all three clustering analyses were performed using
Matlab (version 6.5.0, release 13.0.1).
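The mapping of unlabelled clusters to known tissue types can be illustrated with a small sketch that tries every assignment of class labels to clusters and keeps the one with the fewest disagreements. This exhaustive search is an assumption made for illustration (it is practical here because there are only two or three clusters); the function name and the toy data are hypothetical.

```python
from itertools import permutations

def best_mapping(cluster_ids, clinical_labels):
    """Return (mapping, disagreements) minimising per-point disagreements.

    Assumes there are no more clusters than clinical classes.
    """
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(clinical_labels))
    best = None
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        disagreements = sum(
            mapping[c] != t for c, t in zip(cluster_ids, clinical_labels)
        )
        if best is None or disagreements < best[1]:
            best = (mapping, disagreements)
    return best

# Toy example: seven spectra, two clusters, one spectrum placed differently
# from the clinical study.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
clinical = ["stroma", "stroma", "stroma", "tumour", "tumour", "tumour", "stroma"]
mapping, disagreements = best_mapping(cluster_ids, clinical)
```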

Table 4.1 Distribution of the different tissue types identified

clinically and as obtained by the various clustering techniques.

Single Average Complete Ward
Tumour 10 10 10 10 10
Stroma 5 5 5 5 5
Tumour 10 17 9 9 9
Stroma 8 1 9 9 9
Tumour 8 4 4 8 7 3 6 4
Stroma 3 7 7 3 4 8 5 7
Tumour 12 19 19 12 12 11 19 13 19 11
Stroma 7 5 5 7 7 8 5 6 5 8
Early keratinisation 12 7 7 12 12 12 7 12 7 12
Tumour 18 1 18 18 18 17
Stroma 12 29 12 12 12 13
Tumour 10 10 10 10 10
Stroma 5 5 5 5 5
Tumour 21 28 17 17 15
Stroma 14 13 18 18 20
Necrotic 7 1 7 7 7

168

1416105

fuzzy c-means

105

18

9947

9

1416

10

k-means

1059

51718Dataset 7

Dataset 6

7

Dataset 3

Dataset 4

Dataset 5

Clinical study

Dataset 2

Hierarchical clustering

Dataset 1

Datasets names

Tissue types

From Table 4.1 it can be seen that, in most of the datasets, the numbers of data
points assigned to the various categories do not exactly match the results from the clinical
study. This is because some of the data that should have been classified in the
tumour cluster have been misclassified into the stroma cluster and vice versa. For
example, in dataset 2, the hierarchical clustering single linkage method assigned
17 data points to tumour and 1 to stroma, whereas the clinical study identified
10 tumour and 8 stroma data points. Seven stroma data points
have therefore been misclassified, appearing as extra data points in the tumour
cluster. The extra data points produced by each clustering technique are
regarded as the number of disagreements in classification with respect to the
results from the previous clinical study. The comparison results are shown in Table 4.2.
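The dataset 2 example can be reproduced with a short sketch that counts, for each tissue type, the data points assigned to it that the clinical study placed elsewhere. The function name is hypothetical, and the label lists are constructed only to match the counts quoted above (17 and 1 from single linkage, 10 and 8 from the clinical study).

```python
from collections import Counter

def extra_points(assigned, clinical):
    """Per tissue type, count points assigned to it but clinically different."""
    extras = Counter({t: 0 for t in set(clinical) | set(assigned)})
    for a, c in zip(assigned, clinical):
        if a != c:
            extras[a] += 1
    return dict(extras)

# Dataset 2, single linkage: 10 tumour and 8 stroma clinically,
# 17 assigned to tumour and 1 to stroma by the clustering.
clinical = ["Tumour"] * 10 + ["Stroma"] * 8
assigned = ["Tumour"] * 10 + ["Tumour"] * 7 + ["Stroma"] * 1
disagreements = extra_points(assigned, clinical)
```

The result agrees with the dataset 2 single linkage entries of Table 4.2 (7 for tumour, 0 for stroma).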

Table 4.2 Comparison results based on the number of disagreements

between clinical study and the various clustering results.

Single Average Complete Ward
Tumour 0 0 0 0
Stroma 0 0 0 0
Tumour 7 0 0 0
Stroma 0 1 1 1
Tumour 0 0 0 0 0 0 0
Stroma 4 4 5 3 5 2 4
Tumour 7 7 3 3 3 7 3 3 7
Stroma 5 5 3 3 4 5 2 4 5
Early keratinisation 0 0 0 0 0 0 0 0 0
Tumour 12 0 0 0 0
Stroma 1 0 0 0 1
Tumour 0 0 0 0
Stroma 0 0 0 0
Tumour 7 0 0 0
Stroma 0 4 4 6
Necrotic 1 0 0 0

21

0400

fuzzy c-means

00

0

0104

1

04

0

k-means

000

004Dataset 7

Dataset 6

0

Dataset 3

Dataset 4

Dataset 5

Dataset 2

Hierarchical clustering

Dataset 1

Datasets names

Tissue types

After running k-means and fuzzy c-means ten times each, it can be seen that both
algorithms obtained more than one clustering result for
some datasets. This is because different initialisations may lead to different partitions
for both of these algorithms. From Tables 4.1 and 4.2, k-means produced varying results
on more datasets (3 out of 7) than fuzzy c-means (1 out of 7); the
frequency of each variant (out of 10 runs) is shown in Table 4.3.

Table 4.3 Clustering variations for k-means and fuzzy c-means

within three datasets.

Datasets names    K-means             Fuzzy c-means
Dataset 3         2/10  3/10  5/10    -
Dataset 4         3/10  3/10  4/10    9/10  1/10
Dataset 5         5/10  5/10          -

4.3.2 Discussion

In order to further investigate the performance of the different clustering

methods, the average number of disagreements for all datasets was calculated, as

shown in Table 4.4. It can be seen that the hierarchical clustering single linkage

method has the worst performance, the average linkage performance is better than

single linkage, while the complete linkage and Ward methods perform the best
overall. However, hierarchical clustering techniques are computationally expensive
(proportional to n², where n is the number of spectra), and are therefore not
suitable for very large datasets [17]. K-means and fuzzy c-means perform fairly
well, and for both the computational effort is approximately linear in n.

Hence, compared with hierarchical clustering, these techniques will be far less time-

consuming on large datasets [17]. Moreover, although k-means has a slightly better

performance than fuzzy c-means (slightly fewer disagreements, on average), it can be

seen from the standard deviations in Table 4.4 that k-means exhibits more variation

in its results than fuzzy c-means. Hence, the overall conclusion is that fuzzy c-means

is the most suitable clustering method in this context.

Table 4.4 Average number of disagreements obtained in the three

clustering methods.

                                                             Hierarchical clustering
                                                             Single  Average  Complete  Ward   K-means    Fuzzy c-means
Average (S.D.) Number of Disagreements per Run               44      21       16        16     18.8±5.8   19.5±1.6
Average (S.D.) Number of Disagreements per Run per Dataset   6.3     3.0      2.3       2.3    2.7±0.8    2.8±0.2
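The relationship between the two rows of Table 4.4 is a division by the number of datasets (seven), which the short check below illustrates. The dictionary keys are labels chosen for this example, and the per-run values are those read from the table.

```python
# Average number of disagreements per run, as read from Table 4.4.
per_run = {
    "Single": 44,
    "Average": 21,
    "Complete": 16,
    "Ward": 16,
    "K-means": 18.8,
    "Fuzzy c-means": 19.5,
}

NUM_DATASETS = 7

# Average number of disagreements per run per dataset, to one decimal place.
per_run_per_dataset = {
    method: round(value / NUM_DATASETS, 1) for method, value in per_run.items()
}
```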

4.4 Summary

In their previous study, Chalmers et al investigated seven sections of tissue

samples containing oral cancer cells using two comparative parallel processes: that

is, histological analysis and FTIR spectroscopy with the subsequent application of

multivariate analysis [26]. Prior to the multivariate analysis, all spectral data had to

be empirically pre-processed. It was found that accurate clustering could only be

achieved by manually applying extra pre-processing techniques that varied according

to the particular sample characteristics. Furthermore, these pre-processing methods

required additional time, software tools and significant human expertise to perform.

In this Chapter, three commonly used clustering techniques in FTIR spectroscopic

data analysis, namely, HCA, k-means and fuzzy c-means were applied to the same

seven spectral datasets as Chalmers et al reported but without any extra pre-

processing procedure. Single, average, complete linkage and Ward’s method were

employed in the HCA clustering techniques.

The experimental results showed that the single linkage method obtained the
worst clustering results and that the average linkage method performed better than
single linkage but, overall, complete linkage and Ward’s method obtained the best
solutions. However, one major drawback of the HCA clustering algorithm is its
high computational expense. Therefore, for very large datasets (which are common
in practical FTIR spectral analysis), this method may not be suitable. On the
other hand, the k-means and fuzzy c-means algorithms also achieved
good performance, and they require fewer computational resources than
the HCA method. However, the clustering results show
that the k-means algorithm generated less consistent clustering results
than fuzzy c-means. Overall, it may be suggested that fuzzy c-means is the more
suitable method for classifying the FTIR spectral data in this study.

CHAPTER 5

Methods for Automatically

Determining the Number of Clusters

5.1 Introduction

In a real medical diagnostic application, for a previously unseen tissue sample,

the number of different types of cells is normally not known in advance. Based on

this fact, a clustering technique which can automatically obtain the appropriate
number of tissue types is required. Many clustering methods have
been developed in an attempt to automatically determine the optimal number of clusters.

Recently, Bandyopadhyay proposed a Variable String Length Simulated Annealing

(VFC-SA) algorithm [123], which applied a simulated annealing algorithm to the

fuzzy c-means clustering technique and used a cluster validity index measure as the

energy function. This has the advantage that, by using simulated annealing, the

algorithm can escape local optima and, therefore, may be able to find the globally

optimal solution(s). The Xie-Beni index was used as the cluster validity index to

evaluate the quality of the solutions; the author stated that this is because it has been

shown to be able to detect the correct number of clusters in several experiments

[124]. The smallest index value corresponds to the best clustering obtained from all

partitions that are generated by the clustering method. Hence this VFC-SA algorithm

can generally avoid the limitations which exist in the standard fuzzy c-means

algorithm. However when we implemented this proposed algorithm, it was found

that sub-optimal solutions could be obtained in certain circumstances. In order to

overcome this limitation, we extended the original VFC-SA algorithm to produce the

Simulated Annealing Fuzzy Clustering (SAFC) algorithm. In this chapter, the

original VFC-SA and the extended SAFC algorithm are described in detail. The

experiments as described in Chapter 4 were performed on the same seven FTIR

spectra datasets containing oral cancer cells in order to evaluate the performance of

the VFC-SA and SAFC clustering algorithms in comparison to the original fuzzy c-

means algorithm.

5.2 VFC-SA Clustering Algorithm

In this algorithm, a variable number of cluster centres was encoded using a
variable length string to which simulated annealing was applied. At a given
temperature, the new state (string encoding) was accepted with a probability
$1/(1 + \exp(-(E_c - E_n)/T))$, where $E_n$ and $E_c$ represent the new energy and current
energy respectively, and $T$ is the current temperature.
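The behaviour of this acceptance probability can be seen in a small sketch. The function name is hypothetical, and the overflow guard is an implementation detail added for the example rather than part of the published algorithm.

```python
import math

def acceptance_probability(e_new, e_current, temperature):
    """Probability 1 / (1 + exp(-(E_c - E_n) / T)) of accepting the new state."""
    exponent = -(e_current - e_new) / temperature
    if exponent > 700.0:
        # exp() would overflow; the probability is effectively zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))

p_equal = acceptance_probability(1.0, 1.0, 1.0)      # equal energies
p_worse_hot = acceptance_probability(2.0, 1.0, 10.0) # worse state, high T
p_worse_cold = acceptance_probability(2.0, 1.0, 0.1) # worse state, low T
```

A worse state (higher energy) is accepted with probability below 0.5, and that probability shrinks as the temperature falls, which is what allows the search to escape local optima early on and settle later.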

The Xie-Beni index, $V_{XB}$, was used to evaluate the quality of a clustering. The
initial state of the VFC-SA was generated by randomly choosing $c$ points from the
dataset to be cluster centres, where $c$ is an integer within the range $[c_{min}, c_{max}]$.
The values $c_{min} = 2$ and $c_{max} = \sqrt{n}$ (where $n$ is the number of data points) were used,
following the suggestion proposed by Bezdek in [73]. The initial temperature $T$ was
set to a high temperature $T_{max}$. A neighbour of the current solution was produced by making
one of several possible random alterations to the string describing the cluster centres
(as described below), and the energy of the new solution was then calculated. The
new solution was kept if it satisfied the simulated annealing acceptance requirement.
This process was repeated for a certain number of iterations, $k$, at the given
temperature. A cooling rate, $r$, where $0 < r < 1$, was then used to decrease the current
temperature by $T = rT$. This was repeated until $T$ reached the termination
temperature $T_{min}$, at which point the current solution was returned. The
whole VFC-SA algorithm is summarised in the steps shown in Figure 5.1.

The process of altering the current cluster centres comprised three functions.

They are: perturbing an existing centre (Perturb Centre), splitting an existing centre

(Split Centre) and deleting an existing centre (Delete Centre). At each iteration, one

of the three functions was randomly chosen. When splitting or deleting a centre, the

cluster sizes were used to select a centre. The size, $|C_j|$, of a cluster, $j$, can be
expressed by:

$|C_j| = \sum_{i=1}^{n} \mu_{ij}, \quad \forall j = 1, \ldots, c$ (5.1)

where $c$ is the number of clusters. The three functions are described below.
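Equation (5.1) amounts to summing each column of the membership matrix. The sketch below, with hypothetical names and toy membership values, shows how the resulting sizes identify the largest cluster (the candidate for Split Centre) and the smallest cluster (the candidate for Delete Centre).

```python
def cluster_sizes(memberships):
    """memberships[i][j] is the membership of data point i in cluster j."""
    c = len(memberships[0])
    return [sum(row[j] for row in memberships) for j in range(c)]

# Toy membership matrix: three data points, two clusters.
u = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
]
sizes = cluster_sizes(u)
largest = max(range(len(sizes)), key=sizes.__getitem__)   # Split Centre candidate
smallest = min(range(len(sizes)), key=sizes.__getitem__)  # Delete Centre candidate
```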

1) Set parameters $T_{max}$, $T_{min}$, $c$, $k$, $r$.

2) Initialise the string by randomly choosing $c$ data points from the dataset to be cluster centres.

3) Compute the corresponding membership values using equation (2.7):
   $\mu_{ij} = 1 \Big/ \sum_{k=1}^{C} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}$.

4) Calculate the initial energy $E_c$ using the $V_{XB}$ index from equation (2.11): $V_{XB} = \pi / s$.

5) Set the current temperature $T = T_{max}$.

6) While $T \geq T_{min}$

   6.1) For $i = 1$ to $k$

      6.1.1) Randomly alter a current centre in the string.

      6.1.2) Compute the corresponding membership values using equation (2.7).

      6.1.3) Compute the corresponding centres with equation (2.6):
             $v_j = \sum_{i=1}^{n} (\mu_{ij})^m x_i \Big/ \sum_{i=1}^{n} (\mu_{ij})^m, \quad \forall j = 1, \ldots, C$.

      6.1.4) Calculate the new energy $E_n$ from the new string.

      6.1.5) If $E_n < E_c$, then accept the new string and set it as the current string.

      6.1.6) Else accept the new string with a certain probability.

   6.2) End for

   6.3) $T = rT$.

7) End while.

8) Return the current string as the final solution.

Figure 5.1 VFC-SA clustering algorithm procedure.
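The control flow of Figure 5.1 can be sketched as a generic annealing loop. This is an illustration under stated assumptions, not the implementation used in this work: the function name `anneal` is hypothetical, the `energy` and `neighbour` callables stand in for steps 6.1.1 to 6.1.4 (altering a centre, updating memberships and centres, and evaluating the Xie-Beni index), and the toy one-dimensional problem at the end is for demonstration only.

```python
import math
import random

def anneal(initial, energy, neighbour, t_max, t_min, k, r, seed=0):
    """Simulated annealing with the acceptance rule of steps 6.1.5 and 6.1.6."""
    rng = random.Random(seed)
    current, e_current = initial, energy(initial)
    t = t_max
    while t >= t_min:
        for _ in range(k):
            candidate = neighbour(current, rng)
            e_new = energy(candidate)
            if e_new < e_current:
                accept = True
            else:
                # -(E_c - E_n) / T, capped to avoid overflow in exp().
                exponent = min((e_new - e_current) / t, 700.0)
                accept = rng.random() < 1.0 / (1.0 + math.exp(exponent))
            if accept:
                current, e_current = candidate, e_new
        t *= r  # cooling step, T = rT
    return current, e_current

# Toy usage: minimise (x - 3)^2 starting from x = 0.
solution, final_energy = anneal(
    0.0,
    energy=lambda x: (x - 3.0) ** 2,
    neighbour=lambda x, rng: x + rng.uniform(-0.5, 0.5),
    t_max=1.0, t_min=1e-3, k=50, r=0.9,
)
```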

a) Perturb Centre

A random centre in the string is selected. The position of this centre is then modified
by adding the change rate $cr[d] = r \cdot pr \cdot v_{current}[d]$, where $v_{current}$ is the
selected centre and $d = 1, \ldots, N$, where $N$ is the number of dimensions. Here $r$ is a random
number in the range [−1, 1] and $pr$ is the perturbation rate, which was set to 0.007 through
initial experimentation, as this gave the best trade-off between the quality of the
solutions produced and the time taken to achieve them. If $v_{current}[d]$ and $v_{new}[d]$
represent the current and new centre, respectively, then Perturb Centre can be
expressed as: $v_{new}[d] = v_{current}[d] + cr[d]$.
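A minimal sketch of Perturb Centre follows. The function name is hypothetical, and drawing a single random number r per call (shared across all dimensions) is an assumption, since the text does not state whether r is drawn once or per dimension.

```python
import random

def perturb_centre(centre, pr=0.007, rng=random):
    """Return v_new[d] = v_current[d] + r * pr * v_current[d] for every d."""
    # Assumption: one random r in [-1, 1] is shared by all dimensions.
    r = rng.uniform(-1.0, 1.0)
    return [v + r * pr * v for v in centre]

rng = random.Random(1)
current = [1.0, -2.0, 100.0]
new = perturb_centre(current, rng=rng)
```

With pr = 0.007, each coordinate moves by at most 0.7% of its current value, so the perturbation is a small local move.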

b) Split Centre

The size of each cluster is calculated using equation (5.1). The centre of the
largest cluster is then replaced by two new centres created by the following
procedure. The point in the cluster whose membership value with respect to the
selected centre is less than, but closest to, 0.5 is identified as the reference point, $w_{reference}$. Then the
distance between this reference point and the currently chosen centre is calculated
using: $dist[d] = |v_{current}[d] - w_{reference}[d]|$. Finally, the two new centres are
obtained by $v_{new}[d] = v_{current}[d] \pm dist[d]$.
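Split Centre can be sketched as follows. The function name is hypothetical, the memberships passed in are those of each point with respect to the centre being split, and returning `None` when no point has a membership below 0.5 is an assumption made for this example only.

```python
def split_centre(centre, points, memberships):
    """Replace one centre by two, using the reference point whose membership
    is less than but closest to 0.5.

    memberships[i] is the membership of points[i] in the cluster being split.
    """
    candidates = [(m, p) for m, p in zip(memberships, points) if m < 0.5]
    if not candidates:
        return None  # assumption: no suitable reference point exists
    _, reference = max(candidates, key=lambda mp: mp[0])
    dist = [abs(v - w) for v, w in zip(centre, reference)]
    plus = [v + d for v, d in zip(centre, dist)]
    minus = [v - d for v, d in zip(centre, dist)]
    return plus, minus

centre = [1.0, 1.0]
points = [[1.0, 1.2], [2.0, 3.0], [4.0, 0.0]]
memberships = [0.9, 0.45, 0.2]
new_centres = split_centre(centre, points, memberships)
```

In this toy case the reference point is [2.0, 3.0] (membership 0.45), so the two new centres lie at the original centre plus and minus [1.0, 2.0].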

c) Delete Centre

As opposed to Split Centre, the smallest cluster is identified and its centre

deleted from the string encoding.

5.3 SAFC Clustering Algorithm

When the original VFC-SA algorithm was implemented and run on a wider set of test cases than that used by the original authors [123], it was found to suffer from several difficulties. In order to overcome these difficulties, four extensions to the algorithm are proposed. In addition, some details were not explicit in the original algorithm, which left ambiguities. This Section focuses on the extensions to VFC-SA that define the proposed SAFC algorithm. The entire algorithm is also stated explicitly in order to resolve the ambiguities.

The first extension is in the initialisation of the string. Instead of the original initialisation, in which random data points were chosen as initial cluster centres, the fuzzy c-means clustering algorithm is applied using a random integer $c \in [c_{min}, c_{max}]$ as the number of clusters. The cluster centres obtained from fuzzy c-means are then utilised as the initial cluster centres for SAFC. This is because starting from the output of a previous clustering run gives a better initialisation than starting from randomly chosen data points.

The second extension is in Perturb Centre. The method of choosing a centre in

the VFC-SA algorithm is to randomly select a centre from the current string.

However, this means that even a ‘good’ centre can be altered. In contrast, if the

weakest (smallest) centre is chosen, the situation in which an already good (large)

centre is destabilized is avoided. Ultimately, this can lead to a quicker and more

productive search as the poorer regions of a solution can be concentrated upon.
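The selection of the weakest centre can be illustrated as follows. Equation (5.1) for cluster size is not reproduced in this section, so the sketch assumes that a data point belongs to the cluster for which it has the largest membership and that the size of a cluster is the number of such points; the function name `weakest_centre_index` is hypothetical.

```python
def weakest_centre_index(memberships):
    """Index of the smallest cluster under a max-membership assignment.

    Assumption: cluster size = number of points whose largest membership
    is to that cluster (a stand-in for equation 5.1).
    """
    n_clusters = len(memberships[0])
    owner = [max(range(n_clusters), key=lambda j: u[j]) for u in memberships]
    sizes = [owner.count(j) for j in range(n_clusters)]
    return min(range(n_clusters), key=lambda j: sizes[j])
```

The same index could also serve the Delete Centre operation, which removes the centre of the smallest cluster.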


The third extension is in Split Centre. If the boundary between the biggest cluster and the other clusters is not clearly marked, then the approach of the original authors is to choose a reference point with a membership degree that is less than but closest to 0.5. This presumes that there are some data points whose membership degree to the chosen centre is close to 0.5. However, another situation can also occur when splitting a centre: the biggest cluster is separate and distinct from the other clusters. For example, let there be two clusters in a set of data points which are separated, with a clear boundary between them. The corresponding cluster centres at a specific time in the search are v1 and v2, as shown in Figure 5.3 (shown in two dimensions). The biggest cluster is chosen, say that of v1. Then a data point whose membership degree is closest to but less than 0.5 can only be chosen from the data points that belong to v2 (where the data points have membership degrees less than 0.5 to v1). So, for example, the data point w1 (which is the closest of these to v1) is chosen as the reference data point. The new centres will then move to vnew1 and vnew2. These centres are clearly far from the ideal solution. Although the new centres would subsequently be changed by the Perturb Centre function, it will inevitably take longer to ‘repair’ the solution.

In the modified approach, two new centres are created within the biggest cluster. The same dataset as in Figure 5.3 is used to illustrate this process. A data point w1 is chosen whose membership degree to v1 is closest to the mean of the membership degrees above 0.5. Since the memberships of a data point to all clusters sum to one, a membership greater than 0.5 must be the largest membership of that point. Hence, points with memberships above 0.5 can be deemed to be ‘close’ to the cluster centre. The mean of the memberships above 0.5 thus identifies a point which is close, but not too close, to the cluster centre. Two new centres vnew1 and vnew2 are then created according to the distance between v1 and w1. This is shown in Figure 5.4. The new centres are clearly better than those in Figure 5.3 and, therefore, better solutions are likely to be found in the same time (number of iterations).

A brief overview of the split centre procedure is as follows:

1) Calculate the size of each cluster and select the biggest cluster, whose cluster centre is v1.

2) Check whether there is any data point within the biggest cluster whose membership value to v1 is less than 0.5 but greater than 0.4.

2.1) If there is, then apply the approach that the original authors used to find the reference data point, as illustrated in Figure 5.3.

2.2) Else, apply the extended split centre approach to find the reference point, as illustrated in Figure 5.4.

Figure 5.2 The split centre procedure.
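The procedure of Figure 5.2 can be sketched as below. This is an illustration under stated assumptions, not the original implementation: cluster size is taken as the number of points whose largest membership is to that cluster (equation (5.1) is not reproduced here), a point is considered to be "within" a cluster under the same max-membership rule, and the name `split_centre` is hypothetical.

```python
def split_centre(data, centres, memberships):
    """Replace the centre of the biggest cluster by two new centres."""
    n_clusters = len(centres)
    points = range(len(data))
    # Assumption: max-membership assignment defines cluster ownership and size.
    owner = [max(range(n_clusters), key=lambda j: u[j]) for u in memberships]
    sizes = [owner.count(j) for j in range(n_clusters)]
    big = max(range(n_clusters), key=lambda j: sizes[j])
    u_big = [u[big] for u in memberships]

    borderline = [i for i in points if owner[i] == big and 0.4 < u_big[i] < 0.5]
    if borderline:
        # Step 2.1, original rule: membership below, but closest to, 0.5.
        below = [i for i in points if u_big[i] < 0.5]
        ref = max(below, key=lambda i: u_big[i])
    else:
        # Step 2.2, extended rule: membership closest to the mean of those
        # above 0.5 (falling back to the cluster's own points if none exceed 0.5).
        above = ([i for i in points if u_big[i] > 0.5]
                 or [i for i in points if owner[i] == big])
        mean_u = sum(u_big[i] for i in above) / len(above)
        ref = min(above, key=lambda i: abs(u_big[i] - mean_u))

    v, w = centres[big], data[ref]
    dist = [abs(vd - wd) for vd, wd in zip(v, w)]       # dist[d]
    new1 = [vd + dd for vd, dd in zip(v, dist)]         # v_current[d] + dist[d]
    new2 = [vd - dd for vd, dd in zip(v, dist)]         # v_current[d] - dist[d]
    others = [list(c) for j, c in enumerate(centres) if j != big]
    return others + [new1, new2]
```

For two well-separated clusters, the extended branch keeps both new centres near the cluster being split, which is the behaviour illustrated in Figure 5.4.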

The fourth extension is in the final step of the algorithm (return the current

solution as the final solution). In the SAFC algorithm, the best centre positions (with

the best VXB index value) that have been encountered are stored throughout the

search. At the end of the search, rather than returning the current solution, the best

solution seen throughout the whole duration of the search is returned.

Aside from these four extensions, we also ensure that the number of clusters never violates the constraint that the number of clusters C should be within the range $[c_{min}, c_{max}]$. Therefore, when splitting a centre, if the number of clusters has reached $c_{max}$ then the operation is disallowed. Likewise, when deleting a centre, the operation is not allowed if the number of clusters in the current solution is $c_{min}$.
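The range constraint amounts to a simple guard on the moves that may be applied to the current solution. A minimal sketch follows; the function name `allowed_moves` and the string labels are hypothetical.

```python
def allowed_moves(n_clusters, c_min, c_max):
    """List the neighbourhood moves that keep C within [c_min, c_max]."""
    moves = ["perturb"]                 # perturbing never changes C
    if n_clusters < c_max:
        moves.append("split")           # splitting adds one cluster
    if n_clusters > c_min:
        moves.append("delete")          # deleting removes one cluster
    return moves
```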

Figure 5.3 An illustration of Split Centre from the original algorithm with distinct clusters (where μ11 and μ12 represent the membership degrees of w1 to the centres v1 and v2 respectively; in the illustration, μ11 = 0.15 and μ12 = 0.85).

Figure 5.4 The new Split Centre applied to the same dataset as Figure 5.3 (where w1 is now the data point that is closest to the mean value of the membership degrees above 0.5).

Based on all the extensions and enhancements to the VFC-SA algorithm, the SAFC algorithm procedure can be described in the following steps:


1) Set parameters $T_{max}, T_{min}, c, k, r$.

2) Initialise the string by applying the fuzzy c-means algorithm to generate c cluster centres from the original dataset.

3) Calculate the initial current energy $E_c$ and best energy $E_b$ by applying the VXB index from equation (2.11) to the obtained cluster centres and membership values.

4) Set the current temperature $T = T_{max}$.

5) While $T \geq T_{min}$

5.1) For $i = 1$ to $k$

5.1.1) Randomly alter the state of a current centre in the string.

5.1.2) Compute the corresponding membership values using equation (2.7).

5.1.3) Compute the corresponding centres with equation (2.6).

5.1.4) Calculate the new energy $E_n$ from the new string.

5.1.5) If $E_n < E_c$, then accept the new string and set it as the current string.

5.1.6) Else, accept the new string with a certain probability.

5.1.7) If $E_c < E_b$, then $E_b = E_c$, and set the current string as the best string.

5.2) End for

5.3) $T = rT$.

6) End while.

7) Return the best string as the final solution.

Figure 5.5 The SAFC clustering algorithm.
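The control flow of Figure 5.5 can be sketched generically as follows. The acceptance rule for worse solutions is only described as "a certain probability"; the Metropolis rule exp(−(E_n − E_c)/T) used here is an assumption, as are the function name `anneal` and the callables `energy` and `neighbour`, which stand in for the VXB evaluation and the centre-altering moves respectively.

```python
import math
import random

def anneal(initial, energy, neighbour,
           t_max=3.0, t_min=1e-5, k=40, r=0.9, rng=random):
    """Simulated annealing that returns the best state seen (steps 3 to 7)."""
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    t = t_max
    while t >= t_min:                                   # step 5
        for _ in range(k):                              # step 5.1
            candidate = neighbour(current, rng)         # step 5.1.1
            e_new = energy(candidate)                   # step 5.1.4
            # Steps 5.1.5 and 5.1.6; the Metropolis rule is an assumption.
            if e_new < e_current or rng.random() < math.exp(-(e_new - e_current) / t):
                current, e_current = candidate, e_new
            if e_current < e_best:                      # step 5.1.7
                best, e_best = current, e_current
        t *= r                                          # step 5.3
    return best, e_best                                 # step 7
```

Returning `best` rather than `current` corresponds to the fourth extension: a good solution visited early in the search is not lost if the search later drifts away from it.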


5.4 Evaluation of VFC−SA and SAFC Clustering of Oral

Cancer Cells

In order to assess the relative performance of the VFC-SA and SAFC algorithms in comparison with the standard fuzzy c-means algorithm, the following experiments were conducted. The same seven clinical oral cancer datasets as used in Chapter 4 were used in this investigation. The number of different types of cells in each tissue section, as determined by clinical analysis, was taken as the reference number of clusters and was also used as the number-of-clusters parameter for fuzzy c-means. The Xie-Beni (VXB) index value has been utilised throughout to evaluate the quality of the classification for these three algorithms. The parameters for VFC-SA and SAFC were $T_{min} = 10^{-5}$, $k = 40$ and $r = 0.9$. $T_{max}$ was set to 3 in all cases. This is because the maximum temperature has a direct impact on how much worse the XB index value of a solution can be and still be accepted at the beginning of the search. If the $T_{max}$ value is set too high, the earlier stages of the search may be less productive because simulated annealing will accept almost all of the solutions and, therefore, will behave like a random search. In the original VFC-SA algorithm, the initial value for $T_{max}$ was 100, but this led to a large amount of time being spent on random search. In the present experiments, $T_{max}$ was empirically set to 3, based on the observation that, at this temperature, the percentage of worse solutions that were accepted was around 60%. In 1996, Rayward-Smith et al. discussed starting temperatures for simulated annealing search procedures and concluded that a starting temperature that results in 60% of worse solutions being accepted yields a good balance between the usefulness of the initial search and overall search time (i.e. high enough to allow some worse solutions, but low enough to avoid conducting a random walk through the search space and wasting search time) [125].
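The effect of these parameters on the length of the search can be checked with a short calculation. The sketch below simply replays the geometric cooling schedule $T = rT$; the function name `temperature_levels` is hypothetical.

```python
def temperature_levels(t_max=3.0, t_min=1e-5, r=0.9):
    """Count how many times the outer 'while T >= T_min' loop executes."""
    levels = 0
    t = t_max
    while t >= t_min:
        levels += 1
        t *= r
    return levels

levels = temperature_levels()        # T_max = 3, T_min = 1e-5, r = 0.9
evaluations = levels * 40            # k = 40 candidate solutions per level
```

With the stated settings the schedule passes through 120 temperature levels, i.e. 4800 candidate solutions per run.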

Solutions for the seven FTIR datasets were generated using the fuzzy c-means, VFC-SA and SAFC algorithms. Each method was run 10 times on each dataset. As mentioned at the beginning of this section, the number of clusters for fuzzy c-means was predetermined through clinical analysis. The outputs of fuzzy c-means (centres and membership degrees) were then used to compute the corresponding VXB index value. VFC-SA and SAFC automatically found the number of clusters by choosing the solution with the smallest VXB index value. Table 5.1 shows the average VXB index values obtained after ten runs of each algorithm (in every row the smallest average is the SAFC value).

Table 5.1 Average of the VXB index values obtained when using the fuzzy c-means, VFC-SA and SAFC algorithms.

            Average VXB Index Value
Dataset     Fuzzy C-Means    VFC-SA      SAFC
1           0.048036         0.047837    0.047729
2           0.078896         0.078880    0.078076
3           0.291699         0.282852    0.077935
4           0.416011         0.046125    0.046108
5           0.295937         0.251705    0.212153
6           0.071460         0.070533    0.070512
7           0.140328         0.149508    0.135858
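Index values such as those in Table 5.1 are computed from a set of centres and the corresponding membership degrees. Equation (2.11) is not reproduced in this section, so the sketch below assumes the commonly used form of the Xie-Beni index with fuzzifier m (compactness divided by n times the minimum squared separation between centres); the function name `xie_beni` is hypothetical.

```python
import math

def xie_beni(data, centres, memberships, m=2.0):
    """Xie-Beni validity index (assumed form of equation 2.11); lower is better.

    Requires at least two centres.
    """
    n = len(data)
    c = len(centres)
    # Compactness: membership-weighted squared distances to the centres.
    compactness = sum(
        (memberships[i][j] ** m) * math.dist(data[i], centres[j]) ** 2
        for i in range(n) for j in range(c)
    )
    # Separation: smallest squared distance between two distinct centres.
    separation = min(
        math.dist(centres[j], centres[k]) ** 2
        for j in range(c) for k in range(j + 1, c)
    )
    return compactness / (n * separation)
```

Because the separation term is in the denominator, solutions with compact, well-separated clusters receive small values, which is why the smallest value is selected.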


From Table 5.1, it can be seen that in all seven datasets, the average VXB values of the solutions found by SAFC are smaller than those found by both VFC-SA and fuzzy c-means. This means that the clusters obtained by SAFC have, on average, better VXB index values than those of the other two approaches. Put another way, it may also indicate that SAFC is able to escape sub-optimal solutions better than the other two methods.

In datasets 1, 2, 4 and 6, the average VXB index value for SAFC is only slightly smaller than that obtained using VFC-SA. Nevertheless, when the Mann-Whitney test (with p<0.01) [126] was performed on the results of these two algorithms, the VXB index for SAFC was found to be statistically significantly lower than that for VFC-SA for all datasets.
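To illustrate how a small difference in averages can still be statistically significant, the following sketch computes the Mann-Whitney U statistic and its exact two-sided p-value for two small samples. It assumes there are no tied values and is not the procedure or software used in the original experiments; the function names `mann_whitney` and `_count` are hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def _count(n, m, u):
    """Number of rank orderings of n + m distinct values giving statistic u."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # The largest value is either from the first sample (adds m to U) or not.
    return _count(n - 1, m, u - m) + _count(n, m - 1, u)

def mann_whitney(a, b):
    """Return (U, exact two-sided p-value); assumes no ties between values."""
    n, m = len(a), len(b)
    u_a = sum(1 for x in a for y in b if x > y)
    u = min(u_a, n * m - u_a)
    tail = sum(_count(n, m, i) for i in range(u + 1))
    p = min(1.0, 2.0 * tail / comb(n + m, n))
    return u, p
```

If every one of ten values from one algorithm is lower than every one of ten values from the other, then U = 0 and the exact two-sided p-value is 2/184756, well below 0.01, however small the absolute differences are.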

The number of clusters obtained by VFC-SA and SAFC for each dataset is presented in Table 5.2. The figures in brackets indicate the number of runs for which that particular number of clusters was returned. For example, on dataset 5, the VFC-SA algorithm found 2 clusters in 5 runs and 3 clusters in the other 5 runs. The number of clusters identified by clinical analysis is also shown for comparative purposes.

From Table 5.2, it can be observed that in datasets 3, 4, 5 and 7, one or both of VFC-SA and SAFC obtained solutions with a different number of clusters from that provided by clinical analysis. In fact, with datasets 5 and 7, VFC-SA even produced a variable number of clusters within the 10 runs. Returning to the VXB index values of Table 5.1, it was shown that all the average VXB index values obtained by SAFC are better.


Table 5.2 Comparison of the number of clusters achieved by clinical analysis, VFC-SA and the SAFC methods.

            Number of Clusters in Solution
Dataset     Clinical    VFC-SA        SAFC
1           2           2(10)         2(10)
2           2           2(10)         2(10)
3           2           2(10)         3(10)
4           3           2(10)         2(10)
5           2           2(5), 3(5)    3(10)
6           2           2(10)         2(10)
7           3           3(9), 4(1)    3(10)

It can be observed that the average VXB index values for SAFC on datasets 3, 4 and 5 are much smaller than those for fuzzy c-means. These three datasets are also the datasets for which SAFC obtained a different number of clusters from clinical analysis. In dataset 3, the average VXB index value for SAFC is much smaller than that for VFC-SA. This is because the numbers of clusters obtained from these two algorithms are different (see Table 5.2). A different number of clusters leads to a different cluster structure, and so there can be a big difference in the validity index. In datasets 5 and 7, the differences in VXB index values are noticeable, though not as big as in dataset 3. This is because in these two datasets, some runs of VFC-SA obtained the same number of clusters as SAFC.

In order to examine the results further, the data has been plotted in two dimensions using the first and second principal components, which were extracted using the principal component analysis (PCA) technique [95,127]. The data has been plotted in this way because, although the FTIR spectra are limited to the range 900−1800 cm⁻¹, there are still 901 absorbance values (one per wavenumber) for each data point. The first and second principal components are the components that account for the most variance in the original data. Therefore, although the data is multidimensional, the first two principal components can be plotted to give an approximate visualisation of the solutions that have been achieved. Figures 5.6−5.12 show the results for datasets 1-7 respectively using fuzzy c-means, VFC-SA and SAFC (the data in each cluster is depicted using different markers and each cluster centre is represented by a star). The first and second principal components in datasets 1-7 contain 96.14, 96.30, 89.76, 93.57, 79.28, 94.17 and 82.64 percent of the variance in the original data, respectively. The percentage of the total variability explained by the first two principal components was obtained from the third output (variances) of the function ‘princomp’ in Matlab. The formula can be expressed as:

percent = 100 × sum(first N variances) / sum(all variances) (5.2)
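Formula (5.2) translates directly into code. The sketch below takes a list of component variances (such as the third output of ‘princomp’) and returns the percentage explained by the first N components; the function name `percent_explained` is hypothetical.

```python
def percent_explained(variances, n_components=2):
    """Percentage of total variance carried by the first n_components PCs."""
    ordered = sorted(variances, reverse=True)   # PCs are ordered by variance
    return 100.0 * sum(ordered[:n_components]) / sum(ordered)
```

For example, component variances of 6, 3 and 1 give 100 × (6 + 3) / 10 = 90 percent for the first two components.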

It should be noted that when a figure depicts a cluster result from more than one algorithm, such as in Figure 5.6, it means that the partition results obtained from those algorithms are the same. The positions of the centres may still differ slightly, as the validity index values from each algorithm are not exactly the same. In each case, the results of clinical analysis are shown, either in a legend or by directly labelling the points.


[Figure 5.6 plot: data points 1-15 plotted against the 1st Principal Component (horizontal axis) and the 2nd Principal Component (vertical axis); legend: tumour, stroma, centre.]

Figure 5.6 Fuzzy C-Means, VFC-SA and SAFC cluster results for dataset 1.

[Figure 5.7 plots: two panels, (a) and (b), showing data points 16-33 plotted against the 1st and 2nd Principal Components; labels: tumour, stroma.]

Figure 5.7 Cluster results for dataset 2 obtained from (a) Fuzzy C-Means, VFC-SA and 3/10 runs from SAFC (b) 7/10 runs from SAFC.

[Figure 5.8 plots: two panels, (a) and (b), showing data points 34-44 plotted against the 1st and 2nd Principal Components; labels: tumour, stroma.]

Figure 5.8 Cluster results for dataset 3 obtained from (a) Fuzzy C-Means and VFC-SA (b) SAFC.

[Figure 5.9 plots: two panels, (a) and (b), showing data points 45-75 plotted against the 1st and 2nd Principal Components; labels: stroma, early keratinisation, tumour.]

Figure 5.9 Cluster results for dataset 4 obtained from (a) Fuzzy C-Means (b) VFC-SA and SAFC.

[Figure 5.10 plots: two panels, (a) and (b), showing data points 101-130 plotted against the 1st and 2nd Principal Components; labels: tumour, stroma.]

Figure 5.10 Cluster results for dataset 5 obtained from (a) Fuzzy C-Means and 5/10 runs from VFC-SA (b) SAFC and 5/10 runs from VFC-SA.

[Figure 5.11 plot: data points 131-145 plotted against the 1st and 2nd Principal Components; legend: stroma, tumour, centre.]

Figure 5.11 Fuzzy C-means, VFC-SA and SAFC cluster results for dataset 6.

[Figure 5.12 plots: two panels, (a) and (b), showing data points 201-242 plotted against the 1st and 2nd Principal Components; labels: tumour, stroma, necrotic.]

Figure 5.12 Cluster results for dataset 7 obtained from (a) Fuzzy C-Means, 9 runs from VFC-SA and SAFC (b) 1 run from VFC-SA.

From the clustering results displayed in these figures, it can be seen that all three of the clustering algorithms generated the same partition results in datasets 1 and 6 (as shown in Figure 5.6 and Figure 5.11 respectively). In addition, the two clusters obtained in each dataset also matched the clinical analysis results (see the legends in these two figures).


In Figure 5.7, the number of clusters obtained from VFC-SA and SAFC is the same as in the clinical analysis results, i.e. two clusters. However, it can be seen in Figure 5.7(a) that the clustering output from VFC-SA and 3 out of 10 runs of SAFC is the same as that from fuzzy c-means and, further, that there is one tumour data point (19) that was misclassified as stroma. Figure 5.7(b) shows the clustering results from the other 7 out of 10 runs of SAFC, in which this data point was correctly categorised as tumour. This indicates that, when both algorithms are run the same number of times, SAFC is more likely to obtain the correct classification.

Figure 5.8(a) displays the cluster results from dataset 3. It can be seen that VFC-SA generated the same number of clusters as clinical analysis, and the partition result is the same as that of fuzzy c-means. However, four data points in the tumour cluster, namely 34, 35, 36 and 37, were misclassified as stroma. On the other hand, although SAFC produced a different number of clusters from clinical analysis, in this two-dimensional PC space (Figure 5.8(b)) it would appear more reasonable to group this dataset into three clusters rather than two. In addition, if the clusters which have the most similar biochemical characteristics could be merged together, then the green cluster marked with squares (points 35-37) and the red cluster marked with diamonds (points 41-44) would be merged. In this case, the accuracy of the clustering would be significantly improved. In contrast, such a merging technique could not produce a similar result from the clustering shown in Figure 5.8(a).

Cluster results for dataset 4 are presented in Figure 5.9. Clinical analysis identified three types of cells in this tissue section, as shown in Figure 5.9(a). The clustering output obtained by applying the fuzzy c-means algorithm with the number of clusters from clinical analysis is also displayed in Figure 5.9(a). The early keratinisation data points were split into two clusters, and the data points belonging to the stroma and tumour clusters were mixed up. From the data distribution in PC1 and PC2 space, it appears to be very difficult to separate tumour and stroma data points in the way the clinical study described (although, of course, the clinical partition might become more apparent if further dimensions were to be considered). Figure 5.9(b) shows the two clusters obtained from VFC-SA and SAFC, which appear more representative of the existing data structure, although the data points belonging to stroma and tumour were still joined together. This is also why both algorithms produced a different number of clusters from clinical analysis.

Figure 5.10 displays the results of the clustering algorithms applied to dataset 5, in which two types of tissue were identified by the clinical study (as shown in Figure 5.10(a)). This figure also shows the clustering results from fuzzy c-means and 5 out of 10 runs of the VFC-SA algorithm. Although the number of clusters obtained from some runs of VFC-SA is the same as that from clinical analysis, from Figure 5.10(a) it can be seen that some data points which belong to the tumour cluster were misclassified as stroma, for instance, points 115, 125, 126 and 130. On the other hand, although SAFC and the other 5 runs of the VFC-SA algorithm produced a different number of clusters from clinical analysis (Figure 5.10(b)), three clusters appear more natural than two on visual inspection. Similar to the hypothesis for Figure 5.8(b), if a technique could merge the clusters with the most similar biochemical characteristics, then in this case all the data points belonging to tumour would be combined, and so the same clustering results as the clinical study would be obtained.

Finally, Figure 5.12 presents the results obtained on dataset 7 for the fuzzy c-means, VFC-SA and SAFC algorithms. Figure 5.12(a) shows that in 9 out of 10 runs of the VFC-SA algorithm and all 10 runs of the SAFC algorithm, the same number of clusters as clinical analysis was obtained. However, three data points were misclassified (points 235, 204 and 208). All three of these points should belong to the tumour cluster but, in this case, data point 235 was marked as necrotic and data points 204 and 208 were marked as stroma. Apart from these three points, the rest of the data points were correctly categorised. Figure 5.12(b) shows the clustering result from 1 out of 10 runs of the VFC-SA algorithm, in which four clusters were obtained. This is due to the fact that the data points that should belong to the tumour and stroma clusters were split into three groups, with the third cluster lying on the border between the tumour and stroma clusters. Although this occurred rarely (only in 1 out of 10 runs), it does indicate the variability of the clusters obtained from the VFC-SA clustering algorithm.

From Figures 5.6−5.12, it can be seen that in some datasets, such as datasets 3 and 5, the VFC-SA algorithm (or some runs of it) obtained the same number of clusters as clinical analysis, while the SAFC algorithm did not. This does not necessarily mean that the clustering accuracy of VFC-SA on these datasets is better than that of SAFC. Rather, the clustering results from SAFC appear more reasonable on visual inspection. In addition, if a technique which can merge the clusters with the most similar biochemical characteristics could be developed and then applied to the partition results from SAFC, the accuracy of the clustering results would be significantly improved. In the clustering results for dataset 2 (Figure 5.7), both the VFC-SA and SAFC algorithms achieved the same number of clusters as clinical analysis. However, when these two algorithms were each run ten times, SAFC was more likely to achieve the same results as the clinical study. In dataset 4, both the VFC-SA and SAFC algorithms obtained a different number of clusters from clinical analysis. Nevertheless, two well separated clusters can be seen when displaying this dataset in the space of the first two PCs. Thus, it is hard to see how any technique might end up with three clusters (to match clinical analysis) for this particular dataset.

Although, within Figures 5.6−5.12, the differing numbers of clusters obtained by the SAFC algorithm (compared to clinical analysis) have a good visual interpretation, there are at least three possible explanations for the difference. Firstly, the clinical analysis may not be correct; this could potentially be caused by different types of cells in the tissue sample not being noticed by the clinical observers, or by the cells within each sample having been mixed with others. Secondly, it could be that although a smaller VXB index value was obtained, indicating a ‘better’ solution in technical terms, the VXB index is not accurately capturing the real validity of the clusters. Put another way, although SAFC finds the better solution in terms of the VXB index, this is not actually the best set of clusters in practice. A third possibility is that the FTIR spectroscopic data has not captured the information necessary to permit a correct determination of cluster numbers, i.e. there is a methodological problem with the technique itself. None of these explanations of the difference between the clustering results obtained automatically and those from clinical analysis detracts from the fact that SAFC produces better solutions than VFC-SA, in that it consistently finds better (statistically lower) values of the objective function (VXB index).

5.5 Summary

In this Chapter, a new SAFC method has been proposed which extends the original VFC-SA algorithm in four ways. The performance of the newly proposed algorithm has been evaluated on seven oral cancer FTIR spectral datasets and compared to clinical analysis, the standard fuzzy c-means and the original VFC-SA. The XB validity index was used as the evaluation measure for the quality of the clusters produced. The experimental results have shown that the SAFC algorithm can escape the sub-optimal solutions obtained by the other two approaches and hence produce better clusters. On the other hand, the numbers of clusters obtained by SAFC for some datasets are not in agreement with those provided through clinical analysis. This can be visualised by plotting the clustering results in the first two dimensions of PC space. For the datasets in which the number of clusters differs, the SAFC results appear to reflect the structure of the underlying data more reasonably. In addition, the disagreement can be explained in the following three ways. Firstly, the number of clusters identified from clinical analysis may not be correct; secondly, the XB validity index may not be suitable for these clinical data; and thirdly, the FTIR technique has not (for these datasets) captured sufficient information to facilitate correct classification. However, more results and information are needed before any definitive conclusion can be drawn. Nevertheless, the SAFC algorithm is a further step towards the automatic classification of data for real medical applications. The next Chapter presents the investigations carried out on FTIR spectral data collected by imaging an area of a tissue section at many positions (i.e. at a high spatial resolution), which results in a larger dataset than those used up to now.


CHAPTER 6

Methods for the Examination of

Tissue Sections

6.1 Introduction

Infrared imaging techniques have become more frequently employed for the

investigation of tissue cells, with the main advantage that the scanning of the tissue

section is quicker and thus a larger area can be examined. However, in comparison

to previous oral cancer datasets, the size of the data collected using the infrared

imaging technique is several orders of magnitude larger and, therefore, any

developed technique must be capable of operating on such large datasets. This

Chapter focuses on the investigation of FTIR spectra data obtained by employing

infrared imaging technology to analyse lymph node tissue sections. For the purpose

of this Chapter’s study, a tissue area that incorporated a variety of lymph node tissue

types was used. The most important feature of this sample area was that it contained

sections of both cancerous invasion and healthy nodal tissue.


The first part of this Chapter focuses on the investigation into a technique

named PCA−fuzzy c-means, which combines both the PCA technique and the fuzzy

c-means clustering algorithm in order to speed up the clustering analysis process by

reducing the size of the lymph node tissue spectral dataset without losing significant

information from the original data. The PCA technique and fuzzy c-means clustering

algorithm were also individually applied to the same lymph node dataset for

comparative purposes. The clustering results obtained from these three techniques

were displayed in false-colour images and their processing times were calculated and

compared. As mentioned above, the infrared imaging method allows for

significantly larger datasets and, indeed, the image created and used throughout this

Chapter was composed of 7497 spectra. This is a significantly higher number than in

previous studies and is beneficial in order to be able to assess the diagnostic ability of

the clustering techniques. The clustering algorithms used in the previous chapters of

this thesis, such as HCA and SAFC, were not suitable for this study due to the high

computational requirement of operating on such large-scale datasets. In the second

part of the chapter, a PCA−k-means technique, which is similar to the PCA−fuzzy c-means method, is also used to analyse the same lymph node tissue section. The

clustering results obtained from both algorithms are then encoded into false-colour

images and are compared.

6.2 Lymph Node Dataset Description

Lymph nodes are round, kidney-shaped organs distributed throughout the body along the lymphatic vessel system. They can vary in size from a few millimetres to more than 2 cm and have two main functions. Lymph is filtered through them to allow the removal of foreign particles by phagocytic cells. Additionally, foreign antigens are trapped on the surface of antigen presenting cells and presented to memory B-lymphocytes. These then migrate to the germinal centre of the follicles and give rise to the synthesis of antibodies that combat disease invading the body [106]. Figure 6.1(a) shows the H&E stained parallel tissue section used for infrared analysis and allows the main structure of the node to be identified. The IR image was collected from a particularly interesting site on the lymph node where several different types of tissue are present and which, more importantly, includes areas of both cancerous invasion and healthy nodal tissue. This selected tissue section has been named LNII5. Figure 6.1(b) shows the selected area LNII5 at higher magnification, which allows for easy identification of the surrounding capsule, cortex and invading breast cancer tissue. In the centre of the cortex, with a lighter pigmentation, is a stimulated proliferating follicle or germinal centre. Reticular cells that extend into the sinuses can also be seen and characteristically form a delicate network between the capsule and trabeculae (small, often microscopic, tissue elements in the form of a small beam, strut or rod, generally having a mechanical function). At the top left corner of the spectral image (see Figure 6.1(d)) lies a small pocket of fatty tissue that normally surrounds the lymph node. This small area of fatty tissue has been included in the infrared analysis. However, it is unfortunately missing from the corresponding H&E stained image because the piece to be stained was cut slightly below the fatty tissue. In Figure 6.1(c), the different types of tissue are identified.

(a) (b)

(c) (d)

Figure 6.1 (a) Photomicrograph of the H&E stained parallel lymph node tissue section used for IR analysis (b) selected area – LNII5 at high magnification (c)

description of the different tissue types (d) LNII5 spectral image.

Tissue types labelled in (c): cancerous cortex tissue; normal cortex tissue; capsule (fibrocollagenous tissue); secondary follicle (normal cortex tissue); reticulum (fibrocollagenous tissue).

6.3 A Combination of Principal Component Analysis and

Fuzzy C-Means Clustering

6.3.1 Introduction

The principal component analysis (PCA) technique (see Section 2.3.2) has

been widely used for FTIR spectra analysis. As mentioned there, PCA

can detect structure in the relationships between variables of data and can also be

used to reduce the dimensionality of a dataset.

In our previous work, the seven oral cancer IR spectral datasets were comparatively

small; the numbers of spectra are 15, 18, 11, 31, 30, 15 and 42 in datasets 1 to 7

respectively. In each dataset, scattered individual spectral data points from

different types of tissue were chosen for the FTIR scan. However, the experiments

carried out on the lymph node tissue section reported in this chapter use 7497

spectra which have been taken from a whole sub-area of one axillary lymph node –

LNII5. Each spectrum contains 821 absorbance measurements evenly distributed

across the indicated wave-number range (4000–720 cm⁻¹) at an interval of four

wave-numbers (4000, 3996, 3992, …, 720). Therefore, the total size of

the LNII5 tissue section dataset is 7497 spectra × 821 wave-numbers. It is apparent that if we

can reduce the dimensionality from the original data without losing too much useful

information, the clustering analysis will be more computationally efficient. In this

respect, PCA was used to adjust the coordinates of the original data and, in this

thesis, the first 10 PCs were selected on which to perform the data analysis. This

number was chosen empirically on the basis that these PCs were found to contain

99.1% of the variance of the original data. Thus, these ten principal components

are still highly reflective of the original data. As PCA can also be used to detect

structure in the original data, experiments utilising PCA alone and the standard

fuzzy c-means clustering algorithm alone were also conducted. The clustering

results from these three techniques are compared using false-colour weighted images

and their corresponding computation times are also presented.
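
The dimensionality reduction described above can be sketched as follows. This is only a minimal illustration, assuming NumPy and a PCA computed through a singular value decomposition of the mean-centred data; the function name `pca_reduce`, the `variance_target` parameter and the synthetic data are hypothetical and are not taken from the experiments reported in this thesis.

```python
import numpy as np


def pca_reduce(spectra, variance_target=0.99):
    """Project the rows of `spectra` onto the fewest PCs reaching `variance_target`."""
    centred = spectra - spectra.mean(axis=0)
    # Economy-size SVD: the rows of vt are the principal axes (loadings).
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    # First index at which the cumulative variance reaches the target.
    n_pcs = int(np.searchsorted(cumulative, variance_target)) + 1
    n_pcs = min(n_pcs, len(s))
    scores = centred @ vt[:n_pcs].T          # reduced dataset, one row per spectrum
    return scores, vt[:n_pcs], cumulative[:n_pcs]


# Synthetic stand-in for a spectral dataset: 200 "spectra" of 50 variables
# generated from three latent factors plus a little noise (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
mixing = rng.normal(size=(3, 50))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

scores, loadings, cumulative = pca_reduce(data, variance_target=0.99)
print(scores.shape, cumulative)
```

In this sketch the number of retained PCs is chosen from a variance target, whereas in the work described here the choice of ten PCs was made empirically after inspecting the variance they contained.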

6.3.2 Experiments and Results of PCA

The collected spectral dataset was subjected to PCA to obtain the principal

components. The correlation coefficient between each spectrum and each PC was

then calculated. A larger value indicates that the spectrum has a closer relationship

to this PC, and vice versa. After calculating the coefficients between all the spectra and each

PC, normalisation was conducted to limit values within the range [0, 1]. Each point

on the image was then falsely coloured to represent the strength of the correlation

coefficient (as shown on the colour bars to the right of Figure 6.2). Since each

spectrum on the IR image has a unique spatial (x, y) position, false colour images can

be generated by plotting specially coloured pixels as a function of the spatial

coordinate. In the results, in order to distinguish between different clusters, each

cluster was assigned a unique colour. Images created for each of the first 10 PCs

using false colours are displayed in Figure 6.2. In the colour bar, red indicates that

the spectra corresponding to the given point are very similar to the specified PC, and

blue indicates that they are highly dissimilar.
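
The construction of these images can be sketched as below. This is a hedged illustration rather than the procedure actually used: it assumes that the correlation is the Pearson coefficient between each spectrum and each PC loading vector, that the normalisation is a per-PC min-max rescaling, and that spectra are stored in row-major pixel order. The function name `pc_correlation_images` and the random test data are hypothetical.

```python
import numpy as np


def pc_correlation_images(spectra, loadings, height, width):
    """Correlate every spectrum with every PC, rescale to [0, 1], reshape to images."""
    # Centre and scale each spectrum and each loading vector to unit length,
    # so that their dot product is the Pearson correlation coefficient.
    xs = spectra - spectra.mean(axis=1, keepdims=True)
    ps = loadings - loadings.mean(axis=1, keepdims=True)
    xs /= np.linalg.norm(xs, axis=1, keepdims=True)
    ps /= np.linalg.norm(ps, axis=1, keepdims=True)
    corr = xs @ ps.T                         # shape (n_spectra, n_pcs), values in [-1, 1]
    # Min-max normalisation per PC (an assumption; the text only states [0, 1]).
    low, high = corr.min(axis=0), corr.max(axis=0)
    scaled = (corr - low) / (high - low)
    # One image per PC; pixel (row, col) holds the value for spectrum row * width + col.
    return scaled.T.reshape(len(loadings), height, width)


rng = np.random.default_rng(1)
spectra = rng.normal(size=(30, 40))          # 6 x 5 pixel image, 40 variables per spectrum
loadings = rng.normal(size=(2, 40))          # two stand-in PC loading vectors
images = pc_correlation_images(spectra, loadings, height=6, width=5)
print(images.shape)
```

Each returned image could then be passed to any plotting routine with a red-to-blue colour map to obtain pictures comparable in spirit to those in Figure 6.2.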

Figure 6.2(a) shows the original H&E stained image of LNII5. The first PC

image is presented in Figure 6.2(b) and, from the picture, it can be seen that the fatty

tissue situated in the corner of the image has the strongest correlation with

PC1. In contrast, the rest of the tissue area is much less closely related to PC1. The

second PC image, as shown in Figure 6.2(c), demonstrates that the germinal centre,

cancerous and some normal tissue have strong correlation with PC2, some reticulum

and normal areas have less correlation with PC2 and, lastly, capsule, fatty tissue, some

reticulum and normal tissue have the least correlation with PC2. It is apparent that

different types of tissue have been mixed together in each cluster marked by a different

colour. Furthermore, in the PC2 image, the main normal and cancerous tissue areas

are not separated. For the third PC image displayed in Figure 6.2(d) it can be seen

that capsule, fatty tissue, germinal centre, some normal and cancerous tissue, and

reticulum are the closest to PC3, and the rest of the normal and cancerous tissue is far

from PC3. In this image, more tissue types are mixed together, and normal and

cancerous tissue are still not clearly separated from each other. A similar situation

also occurs in the rest of the images. As the PC number increases, less

and less information appears to be contained within the images.

6.3.3 Experiments and Results of Fuzzy C-Means

The fuzzy c-means clustering algorithm is sensitive to the initial position of the

cluster centres and, as these are positioned randomly at the outset of the algorithm,

the method was run ten times with the number of clusters subjectively set from 2 to

9. The squared Euclidean distance was used to calculate the distances between the

spectral data points and the cluster centres, and the fuzziness index m was set to a

value of 2. Initially, the algorithm terminated when the number of iterations

reached 100 or when the improvement in the objective function between successive

iterations was less than 10⁻⁵. Later, the minimum acceptable

improvement was altered to 10⁻⁷, as described below.
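
The settings listed above can be made concrete with the following sketch of the standard fuzzy c-means iteration. It is a minimal illustration assuming NumPy, not the implementation used for the experiments; the function name `fuzzy_c_means`, its parameter names (in particular `min_improvement` and the optional `init_centres`) and the synthetic two-blob data are hypothetical.

```python
import numpy as np


def fuzzy_c_means(x, n_clusters, m=2.0, max_iter=100,
                  min_improvement=1e-5, init_centres=None, seed=None):
    """Cluster the rows of x; returns (centres, memberships, objective, iterations)."""
    rng = np.random.default_rng(seed)
    if init_centres is None:
        # Random start: distinct data points are used as the initial centres.
        init_centres = x[rng.choice(len(x), size=n_clusters, replace=False)]
    centres = np.array(init_centres, dtype=float)
    previous = np.inf
    for iteration in range(1, max_iter + 1):
        # Squared Euclidean distance from every centre to every point: shape (c, n).
        d2 = ((x[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, np.finfo(float).eps)        # guard against division by zero
        # Membership update: u_ik is proportional to d2_ik ** (-1 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)
        um = u ** m
        objective = float((um * d2).sum())
        # Stop when the objective improves by less than the chosen threshold.
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
        # Centre update: membership-weighted mean of the data.
        centres = (um @ x) / um.sum(axis=1, keepdims=True)
    return centres, u, objective, iteration


rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
start = np.array([[1.0, 1.0], [4.0, 4.0]])           # fixed start for reproducibility
centres, u, objective, n_iter = fuzzy_c_means(x, 2, init_centres=start)
print(centres, n_iter)
```

The distance array in this sketch has shape (clusters, points, variables), which is convenient for small examples but memory-hungry for datasets of the size considered in this chapter; a production implementation would compute the distances without materialising that array.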

(a) (b) (c) (d) (e) (f)

(g) (h) (i) (j) (k)

Figure 6.2 IR imaging of lymph node tissue section LNII5 by PCA (a) H&E stained image of LNII5 (b)−(k) false colour weighted images for PC1−PC10

respectively.

6.3.3.1 Setting the minimal amount of improvement for fuzzy c-means

During the initial stage of the experiments, the results produced by fuzzy c-

means varied considerably on each run. Figure 6.3 shows examples from 3 separate

runs (the number of clusters was 2). In these examples, the fuzzy c-means clustering

did not perform well. The main types of tissue were mixed throughout the section.

In each colour region, fuzzy c-means could not clearly separate the main types of

tissue. For example, in Figure 6.3(a), the red region included cancerous and normal

tissues, reticulum and capsule whilst the blue area included fatty tissue, capsule,

reticulum, and cancerous tissue. Therefore, no cluster corresponded clearly to a single

tissue type. A similar situation is also shown in Figure 6.3(b) and (c).

(a) (b) (c)

Figure 6.3 Clustering results from three separate runs with fuzzy c-means.

Based on this observation, when the spectral data were plotted in the first 3

PCs (which contain 93.2% of the variance of the original data), as shown in Figure

6.4, it was found that the ranges of the components were:

[−0.0075, 0.0751]

[−0.0117, 0.0069]

and [−0.0096, 0.0047] respectively.

Taking the nearest power of ten, −0.0075 is of order −10⁻² and 0.0751 is of order 10⁻¹, and so on. Thus, the

corresponding sizes of the ranges (0.0826, 0.0186 and 0.0143) are of order 10⁻¹, 10⁻² and 10⁻². As the

component range becomes smaller, the data are more compact and, thus, the distances

between the data and their ideal centres are smaller. In fuzzy c-means, the objective

function J(U,V) (see Equation 2.3) is a membership-weighted sum of the squared Euclidean distances

between the data and the centres. In this case, the range sizes are of order 10⁻¹ and 10⁻², so the

squared Euclidean distances are of order 10⁻² and 10⁻⁴ (or even smaller). Hence, a small

range size may lead to a very small objective function value. One of the stopping

criteria is met when the difference between two successive objective function values is less than

the minimal amount of improvement. Therefore, if the minimal amount of

improvement is not small enough (e.g. 10⁻⁵, the initial setting), the algorithm stops before the

centre positions have finished improving, and the performance of fuzzy c-means was found

to be poor. For this reason, a value of 10⁻⁷ was used as the minimal amount of

improvement for the remainder of the experiments. With this setting, the

performance of fuzzy c-means improved and it consistently achieved stable clustering

results.
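
The order-of-magnitude argument above can be checked with a few lines of arithmetic. The sketch below uses the three PC ranges quoted in the text; the helper `order_of_magnitude`, which takes the nearest power of ten on a logarithmic scale, is a hypothetical name introduced only for this illustration.

```python
import math

# Ranges of the first three PCs of the LNII5 dataset, as quoted in the text.
pc_ranges = [(-0.0075, 0.0751), (-0.0117, 0.0069), (-0.0096, 0.0047)]


def order_of_magnitude(value):
    """Exponent of the power of ten nearest to |value| on a log scale."""
    return round(math.log10(abs(value)))


widths = [high - low for low, high in pc_ranges]
width_orders = [order_of_magnitude(w) for w in widths]
# Squaring a quantity of order 10**k gives a quantity of order 10**(2k).
squared_orders = [2 * k for k in width_orders]

for w, k, k2 in zip(widths, width_orders, squared_orders):
    print(f"width {w:.4f}: order 10^{k}, squared distance of order 10^{k2}")
```

A threshold of 10⁻⁵ is therefore only about one order of magnitude below the typical squared distance along the second and third PCs, which is consistent with the premature termination described above.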

This finding can also be demonstrated using the seven oral cancer FTIR spectra

data sets which were used in the previous oral cancer experiments (see Chapters 4

and 5). In this previous work, the minimal amount of improvement had been set to

10⁻⁵. In these data sets, fuzzy c-means performed well because the range sizes were

compatible with this threshold. The range sizes in the first three PCs are shown in Table 6.1.

It can be seen from Table 6.1 that the order of the first PC range is 10⁰ for all of

the datasets (the squared Euclidean distance is therefore also of order 10⁰). In the

second and third PCs, the range sizes are of order either 10⁰ or 10⁻¹, and so the corresponding

squared Euclidean distances are of order 10⁰ and 10⁻² respectively. Compared to these, a value

of 10⁻⁵ for the minimal amount of improvement is sufficiently small to allow the

centre positions to improve. This may also be the reason why fuzzy c-means

obtained good results for these datasets.

Figure 6.4 A three-dimensional scatter plot of the tissue section spectra projected onto the first 3 PCs.

Table 6.1 The ranges of the first 3 PCs in the seven oral cancer FTIR datasets.

Data set      First PC range   Second PC range   Third PC range
Data set 1    1.8458           0.6141            0.2575
Data set 2    2.8569           1.2562            0.4795
Data set 3    1.5960           0.5859            0.6600
Data set 4    2.1702           0.8569            0.4112
Data set 5    1.6224           0.8317            0.5496
Data set 6    1.7023           0.8902            0.3342
Data set 7    1.8750           1.5900            0.8741

6.3.3.2 Clustering results obtained from fuzzy c-means

After resetting the minimal amount of improvement, fuzzy c-means produced

consistent clustering results. The number of clusters was subjectively set from 2 to 9

and their corresponding clustering results displayed in false colour weighted images

are shown in Figure 6.5. It can be seen that as the number of clusters is

increased from 2 to 5, as shown in Figure 6.5(b) – (e), the number of tissue types that

can be discriminated increases. When comparing these images against the H&E

stained parallel section in Figure 6.5(a), the fuzzy c-means image for 5 clusters, as

shown in Figure 6.5(e), gives very good agreement, given that this is from an

adjacent tissue section and small morphological changes are likely. Each colour

within the image can generally be assigned to a specific tissue type. The orange

cluster is the capsule, green the reticulum, maroon the healthy normal cortex tissue

surrounding the germinal centre in dark blue, and finally the invading cancerous

breast tissue is described by a light blue colour. The only misclassification

corresponds to spectra originating from fatty tissue located at the top left hand corner

of the image that have been grouped into the same cluster as the invading cancerous

tissue. Correct clustering of fatty tissue spectra into a single group was not

achievable via fuzzy c-means analysis. This is a direct consequence of their position

in multi-dimensional PC space, and is discussed in detail in Section 6.3.6. As

the number of clusters is further increased from 6 to 9, as shown in Figure 6.5(f) –

(i), these main tissue types are then further subdivided. The capsule and reticulum

begin to show shared clusters and the formation of a lining that surrounds these

tissues. This is understandable as these types of tissue are very similar in nature.

The invading cancerous tissue also begins to display a second cluster that may

describe tissue with a different degree of malignancy, but not recognised via

conventional histology. When the number of clusters was again increased (above

nine), no further beneficial tissue discrimination could be made, with images

becoming needlessly complex and hard to interpret.

(a) (b) 2 clusters (c) 3 clusters (d) 4 clusters (e) 5 clusters

(f) 6 clusters (g) 7 clusters (h) 8 clusters (i) 9 clusters

Figure 6.5 IR imaging of lymph node tissue section LNII5 by fuzzy c-means (a) H&E stained image of LNII5 (b)−(i) fuzzy c-means false colour weighted

clustering results for 2 – 9 clusters respectively.

6.3.4 Experimental Results of the PCA – Fuzzy C-Means Technique

Finally, the collected spectra were subjected to PCA – fuzzy c-means

clustering analysis, where the dataset was initially compressed via PCA to its first ten

PCs, accounting for 99.1% of the variance contained in the original dataset. These

new extracted variables were then clustered via conventional fuzzy c-means

methodology. Although this algorithm consecutively performs two different

multivariate analyses, the total computation time is significantly faster than

traditional fuzzy c-means analysis. This is a consequence of the dataset now only

being described by ten dimensions rather than 821 wave-number dimensions.

Results from the analysis were again visually displayed as false colour images and

are shown in Figure 6.6. When comparing these PCA-fuzzy c-means images directly

with those created via conventional fuzzy c-means clustering, no significant or

worrying loss of image quality can be observed. This quite clearly demonstrates that

data compression used in a correct statistical fashion can be an effective tool for

reducing computation requirements and analysis times.

6.3.5 Computational Time

For the examination of very large spectral datasets, such as those collected

from an entire lymph node, it is important that the type of analysis used is both fast

and efficient. Therefore, in this Section, the computation times for the three

multivariate techniques used in these experiments are examined. For each technique,

analyses were repeated 10 times and the average computation times determined. All

calculations were carried out on a 1.8 GHz Intel Pentium IV PC with 1 GB of

RAM, running under the Windows XP operating system.
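
The averaging of repeated timings described above can be sketched as follows. This is a generic illustration using only the Python standard library; the helper name `average_runtime` and the dummy workload are hypothetical and do not correspond to the software used for the measurements reported here.

```python
import time


def average_runtime(func, repeats=10):
    """Run `func` `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def dummy_analysis():
    # Stand-in workload; a real measurement would call the clustering routine here.
    return sum(i * i for i in range(100_000))


mean_seconds = average_runtime(dummy_analysis, repeats=10)
print(f"average over 10 runs: {mean_seconds:.4f} s")
```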

The computation time for PCA ($CT_{PCA}$) is divided into two parts. This can be

expressed as:

$$CT_{PCA} = T_{PCA} + T_{IMGPlot10PCs} \qquad (6.1)$$

where $T_{PCA}$ is the time taken by the principal component analysis; and $T_{IMGPlot10PCs}$ is

the time taken to plot the first ten principal components images. In these

experiments, it was found that the average $T_{PCA} = 60.3$ seconds and

$T_{IMGPlot10PCs} = 84.6$ seconds. Therefore, the total computation time for PCA ($CT_{PCA}$)

was 144.9 seconds; approximately 2.4 minutes.

(a) (b) 2 clusters (c) 3 clusters (d) 4 clusters (e) 5 clusters

(f) 6 clusters (g) 7 clusters (h) 8 clusters (i) 9 clusters

Figure 6.6 IR imaging of lymph node tissue section LNII5 by PCA–fuzzy c-means (a) H&E stained image of LNII5 (b)−(i) fuzzy c-means false colour

weighted clustering results for 2 – 9 clusters respectively.

Computation time for fuzzy c-means ($CT_{FCM}$) is also composed of two parts,

and can be expressed as:

$$CT_{FCM} = T_{FCM(i)} + T_{imgPlot(i)} \qquad (6.2)$$

where $T_{FCM(i)}$ is the time spent on fuzzy c-means clustering, $i$ represents the number

of clusters ($i = 2, 3, 4, \ldots, 9$); and $T_{imgPlot(i)}$ is the time for the imaging plot of $i$ clusters.

Since the number of clusters varied, the computational time

required for each was different. The average $CT_{FCM}$ for each cluster number is

shown in Table 6.2. The total $CT_{FCM}$ was calculated to be 1684.9 seconds,

approximately 28 minutes.
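
As a quick arithmetic check, the totals quoted for PCA and for fuzzy c-means can be reproduced from the component times reported in this Section and the per-cluster values listed in Table 6.2. The variable names below are introduced purely for this illustration.

```python
# Component times for PCA (seconds), as quoted in the text.
t_pca = 60.3
t_img_plot_10pcs = 84.6
ct_pca = t_pca + t_img_plot_10pcs

# Fuzzy c-means clustering times per number of clusters (seconds), from Table 6.2,
# each followed by an image plot of 0.3 seconds.
t_fcm = {2: 30.4, 3: 174.0, 4: 128.6, 5: 169.4, 6: 441.7, 7: 325.3, 8: 413.4}
t_img_plot = 0.3
ct_fcm = sum(t + t_img_plot for t in t_fcm.values())

print(f"PCA:           {ct_pca:.1f} s = {ct_pca / 60:.1f} min")
print(f"Fuzzy c-means: {ct_fcm:.1f} s = {ct_fcm / 60:.1f} min")
```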

Finally, the computation time for the PCA-fuzzy c-means technique

($CT_{PCA\text{-}FCM}$) comprises four parts. This can be expressed as:

$$CT_{PCA\text{-}FCM} = T_{PCA} + T_{extract10PCs} + T_{FCM(i)} + T_{imgPlot(i)} \qquad (6.3)$$

where $T_{PCA}$ is the principal component analysis time, $T_{extract10PCs}$ is the time

taken to extract the first ten principal components, $T_{FCM(i)}$ is the time spent on fuzzy

c-means clustering and $T_{imgPlot(i)}$ represents the time for the imaging plot of $i$ clusters.

It should be noted that $T_{PCA}$ and $T_{extract10PCs}$ were only incurred once, before applying

the fuzzy c-means clustering algorithm. A summary of the computational times for

the PCA-fuzzy c-means analysis is displayed in Table 6.3. The results in Table 6.3

show that the total computation time taken for the PCA-fuzzy c-means technique

($CT_{PCA\text{-}FCM}$) was 151.29 seconds, approximately 2.5 minutes. A comparison of

computation time for all three techniques is shown in Table 6.4.

6.3.6 Discussion of Results

In comparing these three techniques from the computational time point of

view, it can be seen that PCA and PCA – fuzzy c-means took almost the same time

to complete the experiments, although PCA was slightly faster. In contrast, the standard

fuzzy c-means algorithm took the longest time. Nevertheless, the results obtained

from these techniques are not proportional to their computation time. In the first PC

image, fatty tissue was separated from the rest of the other types of tissue. However,

from PC2 to PC10, different tissue sections started to become mixed together. In

particular, as the PC number increased beyond three, less and less

information was revealed. In fuzzy c-means clustering analysis, as the number of

clusters is increased from two to five, more types of tissue can be discriminated. In

the five clusters image, Figure 6.5 (e), most of the tissue sections can be correctly

assigned to their histological groups apart from the fatty tissue area. The reason for

this incorrect clustering can be explained by examination of the spectra in multi-dimensional

PC space. Figure 6.7(a) displays the original dataset plotted in three-dimensional PC

space with five clusters. Figure 6.7(b) is a rotated view of (a) and

can best describe the differences between the outlier fatty tissue (encircled) and

remaining tissue spectra. In the fuzzy c-means algorithm, the Euclidean distance is

used to define membership values; this means that when the shapes of the clusters

are significantly different from spherical, the clustering results will not be effective.

In this dataset the first PC is descriptive of the lipid content in the fatty tissue. The

small number of spectra collected from this region on the tissue section display a

very large natural variation in the intensity of these lipid peaks. This has caused the

fatty tissue spectra to be distributed along this PC axis, so that the fuzzy c-means

clustering is less effective. Unfortunately this has led to misclassification of the fatty

tissue spectra into the same cluster as the cancerous spectra (dark red).

Table 6.2 Summary of fuzzy c-means clustering computation time.

Computation time for fuzzy c-means technique

Number of clusters   Fuzzy c-means clustering (sec)   Image plot (sec)   Total (sec)
2                    30.4                             0.3                30.7
3                    174                              0.3                174.3
4                    128.6                            0.3                128.9
5                    169.4                            0.3                169.7
6                    441.7                            0.3                442
7                    325.3                            0.3                325.6
8                    413.4                            0.3                413.7
Total                1682.8                           2.1                1684.9

Table 6.3 Summary of PCA-fuzzy c-means computation time.

Computation time for PCA-fuzzy c-means technique

Number of clusters   Fuzzy c-means clustering (sec)   Image plot (sec)   Total (sec)
2                    0.67                             0.36               1.03
3                    4.15                             0.37               4.52
4                    3.34                             0.34               3.68
5                    4.31                             0.39               4.70
6                    9.83                             0.42               10.25
7                    7.90                             0.39               8.29
8                    12.63                            0.39               13.02
Subtotal             42.83                            2.66               45.49
T_PCA                -                                -                  60.30
T_extract10PCs       -                                -                  0.01
Total                -                                -                  151.29

Table 6.4 Computation time comparison between PCA, fuzzy c-means and PCA-fuzzy c-means analysis techniques.

Technique            Computation time (mins)
PCA                  2.4
Fuzzy c-means        28
PCA-fuzzy c-means    2.5

(a) (b)

Figure 6.7 LNII5 tissue section spectra plotted in three-dimensional PC space (a) original plot with 5 clusters (b) rotated plot of picture (a).

Finally, the combined PCA – fuzzy c-means analysis achieved discrimination

similar to that achieved via fuzzy c-means analysis alone, but also showed a greatly

improved computational speed without a significant loss of information from the

original dataset. Thus, this allows high quality cluster analysis of large FTIR spectra

datasets in dramatically reduced times.

6.4 Comparison of K−Means and Fuzzy C−Means in Lymph

Node Tissue Sections

Based on the PCA – fuzzy c-means technique, PCA can now be easily applied

prior to the k-means algorithm to generate a ‘PCA – k-means’ method. Thus, the

performance of the k-means and fuzzy c-means clustering algorithms on the large

lymph node tissue section LNII5 can be compared fairly easily. In the following

experiments, the PCA technique is first applied to all the spectra within

LNII5, and then the first ten PCs are extracted to be used by the k-means clustering

algorithm. The number of clusters was set in the same way as for the PCA – fuzzy c-means

method, ranging from 2 to 9. The following settings were used for the k-means

experiments: the squared Euclidean distance was used as the distance measure, the

initial cluster centre positions were randomly selected, the maximum number of

iterations was set to 100 and the experiments were also run ten times. The purpose

of this Section is to compare the clustering results obtained from the k-means and

fuzzy c-means algorithms after PCA had been used to reduce the dimensionality of

the dataset. Therefore, from now on in this Section, PCA – fuzzy c-means and PCA

– k-means techniques are simply referred to as fuzzy c-means and k-means.
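
The k-means settings listed above can be sketched as below. This is a minimal NumPy illustration under the stated settings (squared Euclidean distance, randomly selected initial centres, at most 100 iterations, ten runs) and is not the implementation used in the experiments; the function name `k_means`, the use of the within-cluster sum of squares to compare runs, and the synthetic data are assumptions made for this example.

```python
import numpy as np


def k_means(x, n_clusters, max_iter=100, seed=None):
    """Plain k-means on the rows of x; returns (labels, centres, sse)."""
    rng = np.random.default_rng(seed)
    # Random initial centres: distinct data points chosen at random.
    centres = x[rng.choice(len(x), size=n_clusters, replace=False)].astype(float)
    labels = np.full(len(x), -1)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centre: shape (n, k).
        d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # assignments no longer change
        labels = new_labels
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) > 0:                    # keep the old centre if a cluster empties
                centres[k] = members.mean(axis=0)
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    sse = float(d2[np.arange(len(x)), labels].sum())
    return labels, centres, sse


rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])

# Ten runs from different random starts, as in the experimental protocol.
runs = [k_means(x, 2, seed=s) for s in range(10)]
best = min(runs, key=lambda run: run[2])
print("lowest within-cluster sum of squares:", best[2])
```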

6.4.1 Results obtained from K-means and Fuzzy C-means (using PCA)

After both techniques had been run ten times for each number of clusters, it

was found that fuzzy c-means had produced the more consistent clustering

results, although k-means was also able to produce fairly stable results. However, in

the cases where there were 2, 5, 6, 7 and 9 clusters, the k-means algorithm produced

variable clustering results. In order to facilitate the comparison, results obtained in

Section 6.3.4 are also redisplayed in this Section.

Figure 6.8(a) and (b) are two examples of the variation obtained in results from

k-means; Figure 6.8(c) is from fuzzy c-means. In Figure 6.8(a), it can be seen that k-

means separated the fat from the other types of tissue. In comparison, Figure 6.8(b)

and (c) are almost the same. Based on the histological stained picture analysis shown

in Figure 6.1(c), the red region covers capsule and reticulum tissue area, whereas the

blue region includes the fatty tissue and nodal tissue (including both cancerous and

normal). Figure 6.9 (a) – (g) displays results obtained using k-means clustering

when the number of clusters is set to 3 – 9, and the corresponding results produced

from fuzzy c-means are shown in Figure 6.10(a) – (g). Because variations

occurred in k-means clustering for several numbers of clusters, Figure 6.9

displays the cluster results which most frequently appeared (from the ten runs) using

the k-means algorithm.
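
Selecting the most frequently occurring result from repeated runs requires that two runs be recognised as identical even when their cluster labels are permuted. One possible way of doing this is sketched below; it is an assumption about how such a comparison could be automated, not a description of how the results in Figure 6.9 were chosen. The helper names `canonical` and `most_frequent_partition` are hypothetical.

```python
from collections import Counter


def canonical(labels):
    """Relabel clusters in order of first appearance, so permuted labellings match."""
    mapping = {}
    relabelled = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        relabelled.append(mapping[label])
    return tuple(relabelled)


def most_frequent_partition(runs):
    """Return the partition that occurs most often among `runs`, and its count."""
    counts = Counter(canonical(labels) for labels in runs)
    partition, count = counts.most_common(1)[0]
    return partition, count


# Three toy runs over four data points: the first two are the same partition
# with swapped labels, the third is different.
toy_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
partition, count = most_frequent_partition(toy_runs)
print(partition, count)
```

This comparison treats two runs as equal only if every data point is grouped identically; for images with thousands of pixels a tolerance on the number of differing pixels may be more practical.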

(a) (b) (c)

Figure 6.8 Clustering results from k-means (a&b) and fuzzy c-means (c) in 2 clusters.

(a) 3 clusters (b) 4 clusters (c) 5 clusters (d) 6 clusters (e) 7 clusters (f) 8 clusters (g) 9 clusters

Figure 6.9 K-means clustering results in 3 − 9 clusters.

(a) 3 clusters (b) 4 clusters (c) 5 clusters (d) 6 clusters (e) 7 clusters (f) 8 clusters (g) 9 clusters

Figure 6.10 Fuzzy c-means clustering results in 3 − 9 clusters.

Within the clustering process of the k-means algorithm, variable results also

occurred when the number of clusters was set to 5, 6, 7 and 9. These variations are

displayed in Figures 6.11 – 6.14.

(a) (b)

Figure 6.11 Variation in k-means clustering results for 5 clusters.

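(a) (b)

Figure 6.12 Variation in k-means clustering results for 6 clusters.

(a) (b)

Figure 6.13 Variation in k-means clustering results for 7 clusters.

(a) (b)

Figure 6.14 Variation in k-means clustering results for 9 clusters.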

6.4.2 Discussion of the Clustering Results for K-means and Fuzzy C-means

By observing the clustering results obtained from k-means, as displayed in

Figure 6.9(a), it can be seen that when the number of clusters was set to three, k-

means separated the fatty tissue successfully, but the normal tissue area (germinal

centre and normal cortex tissue) was grouped with cancerous cortex tissue. Capsule

and reticulum were still mixed together as when the number of clusters was two (as

shown in Figure 6.8(b)). This also happened in the corresponding clustering results

from the fuzzy c-means algorithm in Figure 6.10(a). However, the difference

between fuzzy c-means and k-means in this Figure was that, in the case of fuzzy c-

means, the normal germinal centre tissue was distinct from the cancerous cortex

tissue whereas, in the case of k-means, these two types of tissue were clustered

together. However, for fuzzy c-means some of the fatty tissue and a small area of

normal tissue (outside of the germinal centre) were still misclassified with cancerous

tissue.

When the number of clusters was increased to four, the k-means algorithm still

could not differentiate the cancerous tissue from the normal tissue (e.g. germinal

centre). This is demonstrated in Figure 6.9(b). On the other hand, some reticulum

was separated from capsule. However, this separated reticulum was mixed with

normal cortex tissue (outside of the germinal centre). This situation also occurred in

the fuzzy c-means results, as shown in Figure 6.10(b). Nevertheless, fuzzy c-means

was able to split the normal tissue (germinal centre and outside normal cortex tissue)

and cancerous tissue.

As the number of clusters was further increased to five, different variations

started appearing within the k-means clustering results. In Figure 6.9(c) and Figure

6.11(b) k-means still did not distinguish between the normal germinal centre tissue

and the cancerous cortex tissue; in the other variation, displayed in Figure 6.11(a), the

normal and cancerous tissue were separated, but some reticulum was still mixed in

with the normal cortex tissue (outside the germinal centre). The corresponding

results from fuzzy c-means in Figure 6.10(c) showed that, apart from the fact that

fatty tissue was still mixed with cancerous tissue, the remaining types of tissue

were classified similarly to the clinical analysis, as shown in Figure 6.1(c).

Starting from six clusters, the k-means algorithm began to separate normal and

cancerous tissue consistently, although there was a small amount of reticulum that

was also classified with the capsule. As the number of clusters was further

increased, the k-means approach started to classify more subtypes within the capsule

and fatty tissue area, whereas fuzzy c-means classified more subtypes in the

cancerous and capsule tissue area. Finally, increasing the number of clusters to nine

resulted in more and more tissue types being mixed together. The additional clusters

within these tissue sections may identify potential subtypes of tissue which cannot

currently be identified by pathological analysis and may potentially be useful for

diagnosis. Equally, they may simply be clustering noise.

Overall, the fuzzy c-means algorithm was able to split the normal and

cancerous tissues in the early stage of the clustering process (while the number of

clusters was low) and, when the number of clusters was increased, the main

types of tissue could also be separated by the k-means algorithm. However, the fatty

tissue could not be separated from the cancerous region using fuzzy c-means with any

of the cluster numbers used within this experiment (2 to 9 clusters). In contrast,

k-means could separate the fatty tissue almost regardless of the number of clusters.

Nevertheless, from six to nine clusters, although k-means sometimes obtained better

clustering results than fuzzy c-means, it was not stable, in the sense that it also

sometimes produced worse results. Therefore, it would not appear to be consistent

enough for real world application. In addition, as the number of clusters increases,

more and more information is obtained about the tissue which cannot be identified by

the pathologist.

6.5 Summary

A tissue section (LNII5) which was collected from a particularly interesting

site on the lymph node was used in the study of IR imaging presented in this Chapter.

Due to the large size of the FTIR spectral data obtained from LNII5, a technique

named PCA – fuzzy c-means which combined PCA and fuzzy c-means methods was

employed to speed up the cluster analysis with no significant information loss from

the original dataset. Experiments were conducted to apply the PCA, fuzzy c-means

and PCA – fuzzy c-means techniques individually to the LNII5 tissue section and the

results showed that fuzzy c-means and PCA – fuzzy c-means obtained almost the

same clustering results. These both performed better than PCA. However, PCA –

fuzzy c-means was more than ten times faster than the fuzzy c-means algorithm alone.

This speed benefit was obtained through the reduction in the size of the data using

PCA. Moreover, when fuzzy c-means was initially applied to the LNII5 dataset, it

was shown that the performance of fuzzy c-means was poor. However, by

investigating the sizes of the ranges of the first three PCs, it was found that the parameter

which specifies the minimal amount of improvement was not small enough, causing

the fuzzy c-means algorithm to stop prematurely without making any further

improvement. Based on this finding, the setting for the minimal amount of

improvement was reduced and the performance of the fuzzy c-means algorithm then

improved significantly.

The k-means algorithm was also used to cluster the reduced dataset from PCA

and the results were compared to the PCA – fuzzy c-means technique. The results

demonstrated that, whilst PCA – fuzzy c-means can separate the main different tissue

types in the early stage of clustering, the PCA – k-means approach was only able to

satisfactorily classify them when the number of clusters was increased beyond five.

As the number of clusters was increased further, more information was obtained

within the classification (in particular, the possible identification of tissue subtypes)

which cannot be recognised by the pathologist.

CHAPTER 7

A Cluster Merging Algorithm

7.1 Introduction

The motivation behind this Chapter is to create an automated method to

analyse FTIR spectra and to separate them into a clinically meaningful number of

clusters. In Chapter 5, a fuzzy clustering algorithm featuring simulated annealing

(SAFC) was used to automatically detect the ‘optimal’ number of clusters found

within an FTIR dataset. The dataset used in this study comprised spectra that had

been collected from a variety of different tissue types. The SAFC algorithm begins

by generating a random number of clusters and then traverses the search space using

three different neighbourhood operations: i) perturb centre, ii) delete centre and iii)

split centre. The configuration with the minimum cluster validity index value is

returned. The results showed that this algorithm was able to obtain the same number

of clusters as clinical analysis in four out of seven datasets. However, smaller Xie-

Beni validity index values were achieved in some datasets even though the numbers

of clusters were different from the clinical analysis. Although the SAFC algorithm

performed well on the small datasets, it proved to be very time consuming on larger

datasets. With the aim of overcoming this problem, in this Chapter, a refined fuzzy

c-means based clustering algorithm was developed to find the ‘optimal’ number of

clusters.

Both the SAFC and fuzzy c-means based clustering algorithms can

automatically detect the number of clusters based on the clustering structure and this

results in the minimum validity index value. However, both algorithms occasionally

identified an excessive number of clusters compared to clinical analysis. This was

partly due to the fuzzy c-means algorithm and cluster validity index, where all

distances between data points and cluster centres are calculated using their Euclidean

distances. This means that when the shapes of the clusters were significantly non-

spherical, the clustering and validity measures were not effective. However, the

complexity and range of the different cell types (e.g. healthy, pre-cancerous and

mature cancer) may also lead to an excessive number of clusters being identified.

The focus of this Chapter is on grouping the cells with the same clinical diagnosis

into one cluster so that the main types of the tissue can be explored through further

clinical analysis. In order to achieve this, it is necessary to combine the clusters with

the most similar characteristics, e.g. within the suspected pre-cancerous and

mature cancer cell types, as they may exhibit similar properties to one another even

though they are at different stages of cancer. This information may be contained in the

existing infrared spectra.

In this Chapter, a new method is proposed to automatically merge clusters in an

iterative manner. The algorithm identifies the two most similar clusters generated by

the initial fuzzy c-means based clustering algorithm and merges them. The merged

cluster will then be considered as a new cluster to rejoin the remainder of the

iterative merging process until a stop criterion is reached. In these experiments,

either the Xie-Beni index (VXB) [74] or the Sun-Wang-Jiang index (VSWJ) [72] is used as the

cluster validity index depending on the size of the dataset undergoing clustering.

From observation, it is apparent that VXB is more suitable for small FTIR datasets

(fewer than 1000 data points), and VSWJ is more suitable for large FTIR datasets. This

may be because the fuzzy c-means algorithm utilising the VSWJ index often results in

an unstable and excessive number of clusters in comparison with VXB in small

datasets, whereas VXB usually generates fewer clusters than expected in

large datasets. In the following Sections, the feature selection and reduction methods

are described. Subsequently the fuzzy c-means based clustering algorithm is

presented in brief and then the proposed cluster merging method is described in

detail. Then the new algorithm is used to analyse the FTIR images of three selected

tissue sections and its results are analysed and conclusions are drawn.

7.2 Feature Extraction

The infrared spectral lymph node datasets utilised in this Chapter comprise two

small datasets, which contain 276 and 343 spectra respectively, and three large

datasets, which have 5764, 7497 and 7216 spectra respectively. For each spectrum,

there are 821 absorbance values, one corresponding to a data point every 4 cm⁻¹.

Thus the total sizes of the data in these five datasets are 276×821 (2.3×10⁵),

343×821 (2.8×10⁵), 5764×821 (4.7×10⁶), 7497×821 (6.2×10⁶) and 7216×821 (5.9×10⁶),

respectively. As discussed in the previous Chapter, the application of clustering

algorithms to such large datasets can be very time consuming. In addition, it is

difficult to visualise the distribution of such data. In this Chapter, the same approach

as in Chapter 6 (PCA) is used to reduce the number of variables and also used to

permit visualisation. Once again, the first ten principal components (PCs) were used

as the input of the clustering method. For the five datasets used in this investigation,

the first two PCs represent, respectively, 78.9%, 73.8%, 70.9%, 89.1% and 94.1% of

the variance in the original datasets, whilst increasing this to the first ten PCs

improves the representation to 98.7%, 96.4%, 95.6%, 99.1% and 99.8%,

respectively. In order to visualise the data distribution, the original data can be

plotted in the first two PC dimensions. Details on feature reduction approaches have

been described previously by many authors, see, for example [128].
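The reduction step described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name pca_reduce, the use of NumPy and the synthetic input matrix are assumptions made purely for illustration. The sketch projects a spectra matrix onto its leading PCs and reports the fraction of variance carried by each retained PC.

```python
import numpy as np

def pca_reduce(spectra, n_components=10):
    """Return PC scores and the fraction of variance carried by each kept PC."""
    centred = spectra - spectra.mean(axis=0)
    # Thin SVD of the centred data: rows of vt are the principal axes and the
    # squared singular values are proportional to the variance along each axis.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = centred @ vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in for a 276 x 821 dataset (random numbers, not real spectra).
rng = np.random.default_rng(0)
latent = rng.normal(size=(276, 3)) @ rng.normal(size=(3, 821))
demo = latent + 0.01 * rng.normal(size=(276, 821))

scores, ratio = pca_reduce(demo)
print(scores.shape)                  # shape of the reduced dataset
print(ratio[:2].sum(), ratio.sum())  # variance kept by 2 PCs and by 10 PCs
```

The two printed sums correspond to the kind of figures quoted above for the first two and the first ten PCs, although the values for this synthetic matrix bear no relation to those of the FTIR datasets.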

7.3 Fuzzy C-Means Based Clustering Algorithm

The fuzzy c-means based clustering algorithm is a good example of using

cluster validity to determine the optimal number of clusters from a given dataset (as

shown in Chapter 2, Figure 2.7). It is composed of two parts, where the first part is

based on running the fuzzy c-means clustering method within a certain range of

number of clusters. The best data structure (C) is obtained by choosing the

corresponding optimal value of cluster validity V from all the possible clustering

structures [72]; the second part of this algorithm is based on taking the best data

structure from the first part, and then slightly perturbing each cluster centre for a

number of iterations while the validity index for each new data structure is

calculated. The data structure corresponding to the optimal validity index is then

returned. In the description of the algorithm shown in Figure 7.1, the minimal and

maximal numbers of clusters are referred to as cmin and cmax, respectively.

7.4 The Basis of a New Automated Method to Merge Clusters

The motivation to create a new merge clustering method was based on previous

attempts at clustering spectral data. In the previous analysis, a dataset comprising

a variety of different tissue types was analysed via SAFC and fuzzy c-means

based clustering algorithms. This dataset comprised 276 individual spectra that were

collected from regions of a tissue section diagnosed as ‘cancer’ (cancerous cortex

tissue: 159 spectra), ‘normal’ (benign cortex tissue: 72 spectra) and ‘reticulum’

(fibrocollagenous tissue: 45 spectra). In clinical diagnosis on this tissue section, the

histologist identified three different types of tissue. However, when the number of

clusters was set to a value of three for the SAFC and fuzzy c-means methods, the

clustering results did not match those of clinical diagnosis (possibly due to the

Euclidean distance measurement in fuzzy c-means). Some FTIR spectra from

cancerous regions were incorrectly grouped with spectra taken from regions of non-

cancerous tissue.

1) Set cmin and cmax (in the experiments, cmin = 2 and cmax = 10).
2) For c = cmin to cmax:
   2.1) Initialise the cluster centres.
   2.2) Apply the standard fuzzy c-means algorithm and obtain the new centres and the new fuzzy partition matrix.
   2.3) Calculate the cluster validity V.
3) Obtain the data structure (C) that corresponds to the optimal cluster validity index value V.
4) Set the current C as the best data structure (Cbest).
5) For i = 1 to 100:
   5.1) Randomly and slightly perturb the current C.
   5.2) Calculate the new membership values and the validity index value V corresponding to the new data structure (Cnew).
   5.3) If the new V is smaller than the V value of the current Cbest, set Cnew as Cbest; otherwise, go back to step 5.1.
   5.4) Set Cbest as the current C, and go back to step 5.1.
   5.5) End for loop.
6) Return the best data structure Cbest with the optimal V value.

Figure 7.1 The fuzzy c-means based clustering algorithm.
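The procedure of Figure 7.1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used in the experiments: the Xie-Beni index is taken in its usual compactness-over-separation form, the fuzzifier is fixed at m = 2, fuzzy c-means is stopped on the largest centre shift (the tolerance tol is a hypothetical parameter), and each perturbation adds Gaussian noise scaled to 1% of the per-feature standard deviation. All function and parameter names are hypothetical.

```python
import numpy as np

def memberships(data, centres, m=2.0):
    """Fuzzy membership matrix U (c x n) for fixed cluster centres."""
    d2 = ((data[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)            # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0)

def fcm(data, c, rng, m=2.0, tol=1e-6, max_iter=300):
    """Standard fuzzy c-means from randomly chosen data points (steps 2.1-2.2)."""
    centres = data[rng.choice(len(data), size=c, replace=False)]
    for _ in range(max_iter):
        um = memberships(data, centres, m) ** m
        new_centres = (um @ data) / um.sum(axis=1, keepdims=True)
        shift = np.abs(new_centres - centres).max()
        centres = new_centres
        if shift < tol:                   # assumed stopping rule
            break
    return centres, memberships(data, centres, m)

def xie_beni(data, centres, u, m=2.0):
    """Xie-Beni index: fuzzy compactness over minimum centre separation."""
    d2 = ((data[None, :, :] - centres[:, None, :]) ** 2).sum(axis=2)
    compactness = ((u ** m) * d2).sum()
    sep = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    sep[np.diag_indices_from(sep)] = np.inf   # ignore zero self-distances
    return compactness / (len(data) * sep.min())

def fcm_based_clustering(data, c_min=2, c_max=10, n_perturb=100,
                         scale=0.01, seed=0):
    rng = np.random.default_rng(seed)
    best_v, best_centres = np.inf, None
    for c in range(c_min, c_max + 1):     # steps 2-4: scan the cluster range
        centres, u = fcm(data, c, rng)
        v = xie_beni(data, centres, u)
        if v < best_v:
            best_v, best_centres = v, centres
    spread = data.std(axis=0)
    for _ in range(n_perturb):            # step 5: perturb the current best
        noise = rng.normal(scale=scale, size=best_centres.shape) * spread
        trial = best_centres + noise
        v = xie_beni(data, trial, memberships(data, trial))
        if v < best_v:                    # keep only improvements
            best_v, best_centres = v, trial
    return best_centres, memberships(data, best_centres), best_v

# Synthetic two-dimensional demonstration data (three Gaussian blobs).
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc=mu, scale=0.3, size=(40, 2))
                   for mu in ([0, 0], [5, 0], [0, 5])])
centres, u, v = fcm_based_clustering(blobs, c_min=2, c_max=5)
print(len(centres), v)
```

Because the initial centres are chosen at random, repeated runs with different seeds may return different numbers of clusters, which is the behaviour that motivates repeating the algorithm several times for each dataset.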

When the fuzzy c-means based clustering algorithm was subsequently applied,

four clusters were obtained. However, two of the four clusters corresponded to one

type of tissue (cancerous). In the remaining two clusters, the majority of the data

was classified into the correct group (although two spectra were

misclassified). The results are shown in Figure 7.2. The excessive number of

clusters may have been caused by the fact that the FTIR spectra taken from

cancerous regions were taken from diverse areas of tissue, which might have

contained cells at different stages of the cancer (e.g. pre-cancerous and mature

cancer). As mentioned before, at this stage, we only wish to cluster cells which have

the same clinical diagnosis.

Figure 7.2 An extracted spectral dataset after applying fuzzy c-means based clustering algorithm. [Scatter plot; vertical axis: 2nd Principal Component; legend: reticulum, cancer, normal, cancer, centre; the clusters are labelled C1 to C4.]

In the literature, many split-and-merge techniques have been used to determine

the correct number of clusters (see Chapter 2, Section 2.4). However, in general, all

these algorithms perform the split and merge procedure based on the dataset itself.

Besides these, the merge criteria used within a variety of other clustering algorithms

have also been reviewed in Chapter 2 (Section 2.5). Two alternative types of merge

criteria that have previously appeared in the literature are illustrated by using the

example shown in Figure 7.2. The first criterion identifies and merges clusters that

lie ‘closest’ to each other in multi-dimensional space [39]. In contrast, the second

criterion identifies the ‘worst’ clusters and merges them together. This is achieved

by use of a cluster validity function, the most common of which measure the

compactness of the defined clusters [79]. Informally, a ‘good’ cluster is defined by

the property that data points within the cluster are tightly condensed around the

centre (high compactness). When applying these criteria to the dataset shown in

Figure 7.2, the two closest clusters are C2 and C3 (see the distance between each

cluster centre). In this dataset, C1 and C2 are more compact than the other two and

the worst two clusters are C3 and C4. However, the two clusters that should be

merged together are C1 and C3 (both are collected from cancer tissue). Hence,

neither of these approaches for merging clusters was suitable for solving the

problems encountered in FTIR clustering.

An alternative solution was developed based on examining the original infrared

spectra rather than searching for a relationship using the clustering structures in the

PCA space. Plotting the mean spectra from the separate clusters allows the major

differences between them to be more clearly visualised. The similarity between

clusters is more obvious at the wave-number corresponding to the IR frequency that

provides the largest variance between spectra. The proposed automated cluster

merging method described in this Chapter is based on this observation and can be

divided into two main steps. The first step is to identify the frequency at which the

greatest variance between mean spectra is observed. The second step is to repeatedly

determine the most similar clusters and merge them, until a suitable termination

criterion has been reached. In the following Section these two steps are described in

detail.

Step 1: Determine a Reference Frequency

The reference frequency is defined as the frequency at which the biggest

difference between any two mean spectra is found. The full procedure of

determining this frequency is shown in Figure 7.3.

1) Obtain the clustering results from the fuzzy c-means based clustering algorithm.
2) Calculate the mean spectrum $\bar{A}_i$ for each cluster,

   $$\bar{A}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} A_{ij} \qquad (i = 1, \ldots, c) \tag{7.1}$$

   where $N_i$ is the number of spectra in cluster $i$; $A_{ij}$ is the absorbance of spectrum $j$ in cluster $i$; and $c$ is the number of clusters. The size of $\bar{A}_i$ is $p$, the number of wave-numbers in each spectrum (each mean spectrum is a vector of $p$ elements).
3) Compute the vectors of pair-wise absolute differences $D_{ij}$ between all mean spectra,

   $$D_{ij} = \left| \bar{A}_i - \bar{A}_j \right| \qquad (i = 1, \ldots, c;\ j = 1, \ldots, c) \tag{7.2}$$

4) Find the largest single element, $d_{max}$, within the set of vectors $D$.
5) Determine the frequency corresponding to the maximal element $d_{max}$.

Figure 7.3 The procedure of determining a reference wave-number.
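The steps of Figure 7.3 translate almost directly into array operations. The following is a minimal sketch, assuming that each spectrum has been assigned to a single cluster (for example, the cluster of maximum membership) and that the spectra are stored row-wise in a NumPy array; the function name and the toy data are hypothetical.

```python
import numpy as np

def reference_wavenumber(spectra, labels, wavenumbers):
    """Return (reference wave-number, d_max, mean spectra) as in Figure 7.3."""
    clusters = np.unique(labels)
    # Equation (7.1): one mean spectrum per cluster, shape (c, p).
    means = np.stack([spectra[labels == k].mean(axis=0) for k in clusters])
    # Equation (7.2): absolute differences for every pair (i, j), shape (c, c, p).
    diffs = np.abs(means[:, None, :] - means[None, :, :])
    # Largest difference over all pairs at each wave-number, then overall.
    largest_per_wavenumber = diffs.max(axis=(0, 1))
    idx = int(np.argmax(largest_per_wavenumber))
    return wavenumbers[idx], largest_per_wavenumber[idx], means

# Toy data: four "spectra" with four wave-numbers each, in two clusters.
toy_wavenumbers = np.array([1000.0, 1004.0, 1008.0, 1012.0])
toy_spectra = np.array([[1.0, 1.0, 1.0, 1.0],
                        [1.0, 1.0, 3.0, 1.0],
                        [1.0, 2.0, 9.0, 1.0],
                        [1.0, 2.0, 11.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

ref, d_max, means = reference_wavenumber(toy_spectra, toy_labels, toy_wavenumbers)
print(ref, d_max)
```

In the toy data the two cluster means differ most in the third column, so that wave-number is returned as the reference.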

Figure 7.4 Mean infrared spectra obtained from different clusters. [Plot of Absorbance (×10⁻³) against Wavenumber/cm⁻¹ over approximately 1000 to 3500 cm⁻¹; legend: C1 (cancer), C2 (normal), C3 (cancer), C4 (reticulum); the region enlarged in Figure 7.5 is marked.]

The mean spectra obtained for four clusters are displayed in Figure 7.4. The set of

differences, D, was calculated between each pair of mean spectra using Equation

(7.2). The largest difference dmax exists between C1 and C4, as shown in Figure 7.5.

The IR frequency that corresponds to dmax is 2924 cm⁻¹.

Figure 7.5 Enlarged region of Figure 7.4. [Plot of Absorbance (×10⁻³) against Wavenumber/cm⁻¹ over approximately 2800 to 3050 cm⁻¹; legend: C1 (cancer), C2 (normal), C3 (cancer), C4 (reticulum).]

Step 2: Automatically Merging Clusters

The next step is to identify the most similar clusters and then to merge them

together. Similarity is determined by using the absorbance intensity of each mean

spectrum at the reference frequency. Clusters are therefore merged dependent upon

similarities in their IR spectra rather than clustering structure in multivariate space.

As this is an iterative process, the merging procedure ends when at least one of the

termination criteria has been satisfied. Assume that there are currently C mean spectra.

The detailed procedure is shown in Figure 7.6.

In this Section, the same example as used above is utilised to illustrate this

procedure. In Figure 7.7, $\bar{A}_i$ (i = 1...4) are the mean absorbance values from the

normalised spectra of each obtained cluster. These are $\bar{A}_1$ = 0.0045, $\bar{A}_2$ = 0.0034,

$\bar{A}_3$ = 0.0041 and $\bar{A}_4$ = 0.0028, respectively. The line corresponds to the reference

frequency of 2924 cm⁻¹. After sorting $\bar{A}_i$ in ascending order, their new arrangement

is $\bar{A}_4$, $\bar{A}_2$, $\bar{A}_3$ and $\bar{A}_1$. The distances between adjacent average absorbance

intensities are represented as dist = (d1, d2, d3), where d1 = 0.0006, d2 = 0.0007 and d3 =

0.0004. d3 is the minimum distance, distmin. The average of the rest of dist,

(0.0006 + 0.0007)/2 = 0.00065, is greater than distmin. This satisfies the merging

condition in step (4), and so the two clusters which correspond to distmin (i.e. C1 and C3)

are merged together. After this, the average of the mean spectral absorbances of these

two clusters ($\bar{A}_{new}$ = 0.0043) replaces these values. The new array of the mean spectra

absorbance intensities is then re-sorted to be $\bar{A}_4$,

1) Obtain the C absorbance values of the mean spectra at the reference frequency from step 1 and sort them in ascending order.
2) Calculate the distances dist between adjacent sorted absorbance values (note that the size of dist is now C-1).
3) Pick the smallest distance distmin and find the two most similar clusters, which correspond to this distance.
4) Merge these two clusters if they satisfy the merging condition: distmin ≤ average of the rest of dist (without distmin). The average absorbance value of the two merged clusters is then calculated and is considered as a new object to join the rest of the merging iterations. Go back to 1).
5) When there are only two dist left, merge the two clusters which correspond to distmin if either of the following merging conditions is satisfied: distmin ≤ 1/2 rest of dist OR (distmin - 1/2 rest of dist)/distmin ≤ 0.1. Again, the average of these two mean spectra absorbances is considered as a new object to replace them.
6) The merging process stops if there are only two clusters left or no merging conditions are satisfied.

Figure 7.6 The procedure for the automated merging of clusters.

$\bar{A}_2$ and $\bar{A}_{new}$, as displayed in Figure 7.8. The corresponding new distances are dnew1 =

0.0006 and dnew2 = 0.0009; see step (5). As distmin (0.0006) is greater than

1/2 rest of dist (0.00045), the first condition is not satisfied; the second condition is not

satisfied either, since (0.0006 - 0.00045)/0.0006 = 0.25 is greater than 0.1. Hence, in this

situation, no merging conditions are satisfied, and so the

iterative process stops as defined in step (6).
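The merging procedure of Figure 7.6 and the worked example above can be reproduced with a short Python sketch. This is an illustrative reading of the figure rather than the original implementation: clusters are represented by hypothetical (members, value) pairs, the merged value is the plain average of the two values as in the example, and the threshold 0.1 is exposed as the parameter tol.

```python
def merge_clusters(values, tol=0.1):
    """Iteratively merge clusters following the steps of Figure 7.6.

    values maps a cluster name to its mean absorbance at the reference
    frequency. Returns a list of (member names, absorbance) pairs.
    """
    groups = [((name,), v) for name, v in values.items()]
    while len(groups) > 2:                       # step 6: stop at two clusters
        groups.sort(key=lambda g: g[1])          # step 1: ascending order
        dist = [groups[i + 1][1] - groups[i][1]  # step 2: adjacent distances
                for i in range(len(groups) - 1)]
        k = min(range(len(dist)), key=dist.__getitem__)   # step 3
        d_min = dist[k]
        rest = dist[:k] + dist[k + 1:]
        if len(rest) >= 2:                       # step 4: more than two dist
            ok = d_min <= sum(rest) / len(rest)
        else:                                    # step 5: only two dist left
            half = rest[0] / 2
            ok = d_min <= half or (d_min - half) / d_min <= tol
        if not ok:                               # step 6: no condition met
            break
        a, b = groups[k], groups[k + 1]
        groups[k:k + 2] = [(a[0] + b[0], (a[1] + b[1]) / 2)]
    return groups

# Mean absorbances at 2924 cm-1 from the worked example.
example = {"C1": 0.0045, "C2": 0.0034, "C3": 0.0041, "C4": 0.0028}
merged = merge_clusters(example)
print([members for members, _ in merged])

# A three-cluster case in which the first condition of step 5 is met.
three = merge_clusters({"a": 0.0, "b": 1.0, "c": 1.4})
print([members for members, _ in three])
```

For the worked example the sketch merges C1 with C3 and then stops with three groups, in agreement with the hand calculation.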

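Figure 7.7 Four mean spectra absorbance at reference wave-number. [Diagram: the four mean absorbance values $\bar{A}_4$, $\bar{A}_2$, $\bar{A}_3$ and $\bar{A}_1$, separated by the distances d1, d2 and d3.]

Figure 7.8 The resultant absorbance distribution obtained after merging the two most similar clusters. [Diagram: the values $\bar{A}_4$, $\bar{A}_2$ and $\bar{A}_{new}$, separated by the distances dnew1 and dnew2.]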

The merging condition in step (5) is different from the one used when there are more

than two distances left, as shown in step (4). This is because, when there are only

two distances (i.e. three clusters) left, using the same merging condition as in step (4)

may lead to two clusters being merged even though the distance between their mean

spectra absorbances is only slightly less than, and nearly equal to, the other

distance. For example, in Figure 7.9, if d2 is slightly less than d1, then clusters b

and c would be merged together. Visually, this is not convincing. In order to achieve

the same effect as in the previous merging scenarios, the merging conditions

described in step (5) were introduced. For example, in Figure 7.10, if d2 is smaller than

half of d1 (similar to the case where there are three distances) or the excess of

d2 over half of d1 is less than one tenth of d2, then clusters b and c are merged together.
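The two conditions in step (5) can be combined into a single inequality. The short derivation below is added for illustration; the symbol $d_{\mathrm{rest}}$, denoting the remaining distance, is introduced here and does not appear in Figure 7.6.

```latex
\[
\frac{d_{\min} - \tfrac{1}{2} d_{\mathrm{rest}}}{d_{\min}} \le 0.1
\;\Longleftrightarrow\;
0.9\, d_{\min} \le \tfrac{1}{2} d_{\mathrm{rest}}
\;\Longleftrightarrow\;
d_{\min} \le \tfrac{5}{9}\, d_{\mathrm{rest}} \approx 0.556\, d_{\mathrm{rest}}
\qquad (d_{\min} > 0).
\]
```

Since $\tfrac{1}{2} d_{\mathrm{rest}} < \tfrac{5}{9} d_{\mathrm{rest}}$, any pair that satisfies the first condition also satisfies the second, so step (5) merges whenever $d_{\min} \le \tfrac{5}{9} d_{\mathrm{rest}}$. In the worked example, $d_{\min} = 0.0006$ exceeds $\tfrac{5}{9} \times 0.0009 = 0.0005$, which is why no further merging takes place.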

In summary, the whole algorithm for the automated merging of clusters can be

described as shown in Figure 7.11.

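Figure 7.9 The merging situation when there are two dist left (type 1). [Diagram: clusters a, b and c, with distances d1 and d2.]

Figure 7.10 The merging situation when there are two dist left (type 2). [Diagram: clusters a, b and c, with distances d1 and d2.]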

Entire Automated Merging Clustering Procedure

1) The FTIR spectra are initially pre-processed to account for irregularities in cell density across the tissue section. This includes baseline correction and peak area normalisation.
2) PCA is applied to the processed dataset to reduce its dimensionality. Only the first 10 PCs are extracted and utilised for subsequent fuzzy c-means cluster analysis.
3) The fuzzy c-means based clustering algorithm is applied to this reduced dataset and the optimal clustering structure is adopted by finding the best clustering Cbest with the minimal VXB or VSWJ value.
4) The merge cluster algorithm then identifies the reference frequency at which the variance is maximal for the calculated mean spectra.
5) The algorithm merges the calculated clusters until a stop criterion is reached.
6) The final clustering results are correlated to the original 2D-FTIR image that was collected. Each pixel/spectrum within this image is designated a colour dependent on the cluster to which it belongs.

Figure 7.11 Entire automated merging clustering procedure.

To demonstrate the new algorithm in its entirety, the same dataset is utilised

again. The clustering results are shown in Figure 7.12. This can be compared to

Figure 7.2 and demonstrates the successful merging of spectra from regions of

cancerous tissue.

Figure 7.12 The extracted spectral dataset after applying the proposed automated merging cluster method. [Scatter plot; vertical axis: 2nd Principal Component; legend: normal, reticulum, cancer, centre.]

7.5 Experimental Results

In these experiments, the FTIR spectral datasets can be divided into two types,

based on the method of data collection. The first type is termed the ‘extracted

datasets’, to indicate that the spectral data within each dataset were taken individually,

and not adjacently, from different types of tissue. A sample of the extracted

dataset is shown in Figure 7.13. The two circled regions are cancer and normal

tissue sections, and the rest is reticulum. The plus symbols, ‘+’, represent the points

chosen for FTIR spectra analysis. There were three extracted datasets (one from the

lymph node dataset, named extracted LNII7, and two from the previous oral cancer

datasets, namely dataset 3 and dataset 5) used in these experiments.

Figure 7.13 An example of an extracted dataset. [Image of the entire tissue sample with labelled regions: Cancer, Normal, Reticulum.]

The second type of dataset is termed the ‘whole sub-area datasets’, to

indicate that the spectral data within each dataset are taken from a whole sub area of

axillary lymph node which contains several tissue sections. In contrast to the first

type of dataset, this type was captured from an entire sub tissue area. A

corresponding sample is shown in Figure 7.14. Within an entire lymph node tissue

sample, which contains only normal and cancer tissue sections, a whole sub area

which includes both tissue types was selected. Spectra were numbered following the

grid arrangement, sequentially from left to right, and from bottom to top, as shown

by the direction of the arrows in Figure 7.14. There were three whole sub area

lymph node datasets used in these experiments, namely LNII7, LNII5 and LN57.
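The numbering convention described above (left to right, then bottom to top) determines how a vector of cluster labels is mapped back onto the image grid. The following is a minimal sketch, assuming one hard cluster label per spectrum and a known grid size; the function name and the parameters n_rows and n_cols are hypothetical and are not taken from the original software.

```python
import numpy as np

def labels_to_image(labels, n_rows, n_cols):
    """Arrange per-spectrum labels on the sampling grid for display.

    Spectra are assumed to be numbered left to right within a row and
    row by row from the bottom of the sampled area to the top.
    """
    grid = np.asarray(labels).reshape(n_rows, n_cols)  # row 0 = bottom row
    return grid[::-1]           # flip so that the top row is displayed first

# Toy grid of 2 rows x 3 columns with labels equal to the spectrum number.
demo_labels = np.arange(6)
image = labels_to_image(demo_labels, n_rows=2, n_cols=3)
print(image)
```

With this arrangement the first spectrum appears in the bottom-left corner of the displayed array and the last spectrum in the top-right corner.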

Figure 7.14 An example of a whole sub area of lymph node dataset. [Image of the entire lymph node tissue sample with labelled regions: Cancer, Normal; the selected whole sub area is marked.]

These two types of FTIR dataset were examined using the newly developed

algorithm. The collected IR spectral datasets from each tissue section (apart from

dataset 3 and dataset 5) were initially clustered using the fuzzy c-means based

clustering algorithm. The generated clusters were then combined using the newly

proposed automated merge clustering method to help define the main types of tissue

found in each section. It should be noted that the initial clustering results of dataset

3 and dataset 5 were taken from previous outputs of SAFC algorithm (see Chapter 5

for details). However, these two datasets have only been used for purposes of

verification of the automated merge clustering method. Due to the nature of the

random initialisation, the final clustering results may vary. Therefore, the fuzzy c-

means based clustering algorithm was applied ten times for each dataset. The results

are displayed in Figures 7.15 to 7.20, below.

7.5.1 Extracted Datasets

Figure 7.15 (a) shows the clustering results of extracted LNII7 obtained from

the fuzzy c-means based clustering algorithm (where the VXB index was used) in the space of

the first two PCs. Two types of tissue, namely cancer and normal, are contained in

this LNII7 tissue section (details of the LNII7 tissue are shown later). A total of 343

spectra were extracted; among these, 105 were defined as normal by the

pathologist and 238 as cancerous. After applying the fuzzy c-means based clustering

algorithm, three clusters were obtained, two of which should belong

to a single cancer cluster, as shown in Figure 7.15(a). Figure 7.15(b) displays the results

after applying the newly developed automated merge clustering method. It clearly

shows that the separate cancer clusters have now been correctly merged.

Figure 7.15 (a) Extracted LNII7 clustering results after applying fuzzy c-means based clustering algorithm. (b) Extracted LNII7 merged clusters results. [Scatter plots; vertical axis: 2nd Principal Component; legend (a): cancer, normal, cancer, centre; legend (b): normal, cancer, centre.]

In order to verify the automated merge clustering algorithm, the method was

further applied to the previous oral cancer datasets (see Chapter 5), for which the

SAFC clustering algorithm obtained three clusters, rather than the two determined by

the histological analysis. Two datasets suffer this problem, namely dataset 3 and

dataset 5. Figure 7.16(a) and Figure 7.17(a) display the initial clustering results

obtained from the SAFC algorithm for these two datasets, respectively. Figure

7.16(b) and Figure 7.17(b) are the corresponding results obtained after applying the

proposed merge clustering algorithm, in which the separate tumour clusters have

again been correctly merged. Although in dataset 3, the spectral data point numbered

34 was still misclassified in the result obtained after merging clusters, when

compared to previous SAFC clustering results (see Figure 5.8), the accuracy of the

results was, in general, much improved.

Figure 7.16 (a) Dataset 3 clustering results obtained from SAFC algorithm. (b) Dataset 3 merged clusters results. [Scatter plots of the spectra numbered 34 to 44; vertical axis: 2nd Principal Component; legend (a): tumour, tumour, stroma, centre; legend (b): stroma, tumour, centre.]

Figure 7.17 (a) Dataset 5 clustering results obtained from SAFC algorithm. (b) Dataset 5 merged clusters results. [Scatter plots of the spectra numbered 101 to 130; vertical axis: 2nd Principal Component; legend (a): stroma, tumour, tumour, centre; legend (b): stroma, tumour, centre.]

7.5.2 Whole Sub Area Datasets

The tissue section displayed in Figure 7.18 was collected from a positive

axillary lymph node named LNII7 that displayed large areas of malignancy.

Invading cancer from the breast had almost fully infiltrated the lymph node with only

small remnants of normal nodal tissue remaining. Figure 7.18 (a) displays the total

absorbance IR image collected from an area on the tissue section where both

cancerous and normal nodal tissue were present. The IR spectral dataset comprised

5764 spectra. Figure 7.18 (b) displays the H&E stained image for the same area

collected from the parallel stained tissue section. Initial clustering results from the

fuzzy c-means based clustering algorithm are shown in Figures 7.18 (c) – (e) and the

final merged cluster image is shown in Figure 7.18 (f).

Figure 7.18 Lymph node tissue section LNII7. Sampled area was 275 µm × 818.75 µm in size. (a) Total absorbance IR image. (b) H&E stained image. (c)-(e) Clustering results after fuzzy c-means based clustering algorithm, where each colour represents a different cluster of IR spectra: (c) 5 cluster image, (d) 6 cluster image, (e) 9 cluster image. (f) Final results obtained from automated merge clustering algorithm; this image contained two final clusters of IR spectra. [Region labelled in the image: Cancerous cortex.]

The second tissue section was also taken from a positive axillary lymph node

named LNII5 (from the same tissue as previously described in Chapter 6). The area

studied by IR now displayed several different types of tissue and the existence of a

secondary follicle that comprised proliferating B-lymphocytes. Both the total

absorbance IR image and the H&E stained image are shown in Figure 7.19 (a) and

(b) respectively. The IR spectral dataset for the examined region comprised a total of

7497 spectra. Only two clustering results were obtained from the initial clustering

algorithm, and are displayed in Figure 7.19 (c) and (e). The corresponding merged

cluster image from both clustering structures resulted in three final clusters, as shown

in Figure 7.19 (d) and (f).

The final tissue section examined, named LN57, was collected from a benign

axillary lymph node. This node had been surrounded by large areas of fatty tissue

which had, in some regions, infiltrated close to the capsule of the node. An infrared

image was collected from this region which comprised 7216 spectra. The total IR

absorbance and H&E stained images from the examined sample area are shown in

Figure 7.20 (a) – (b). Initial clustering produced three different results, as shown in

Figure 7.20 (c) – (e). Further merging resulted in all images being made up of three

clusters as shown in Figure 7.20 (f).

Figure 7.19 Lymph node tissue section LNII5. Sampled area was 306.25 µm × 956.25 µm in size. (a) Total absorbance IR image. (b) H&E stained image. (c)-(f) Results after fuzzy c-means based clustering algorithm, where each colour represents a different cluster of IR spectra: (c) 5 cluster image, (d) merged cluster result from 5 cluster image, (e) 4 cluster image, (f) merged cluster result from 4 cluster image. Both merged cluster results contained three clusters of IR spectra. [Regions labelled in the image: Cancerous cortex tissue; Secondary follicle (normal cortex tissue); Reticulum (fibrocollagenous tissue); Capsule (fibrocollagenous tissue).]

Figure 7.20 Lymph node tissue section LN57. Sampled area was 550 µm × 512.5 µm in size. (a) Total absorbance IR image. (b) H&E stained image. (c)-(e) Clustering results after fuzzy c-means based clustering algorithm, where each colour represents a different cluster of IR spectra: (c) 3 cluster image, (d) 4 cluster image, (e) 5 cluster image. (f) Final result obtained from automated merge clustering algorithm. Image contained three final clusters of IR spectra. [Regions labelled in the image: Capsule; Fatty tissue with capsule; Cortex.]

To help understand these results from the initial fuzzy c-means based

clustering algorithm, the clustering results obtained have been listed in Table 7.1.

This shows the number of clusters that were initially determined by the fuzzy c-

means based algorithm, and the number of clusters that were finally obtained after

merging. The value enclosed in parentheses indicates the number of times that the

specified number of clusters was returned by the fuzzy c-means based clustering

algorithm. For example, when studying the results for lymph node LNII7, the initial

algorithm obtained five clusters in two of the runs and six clusters in another five of

the runs.

Table 7.1 The number of clusters obtained at different stages of clustering.

                  Number of clusters after
Tissue section    fuzzy c-means based        automated merging
                  clustering algorithm       cluster algorithm
LNII7             5 (2)                      2
                  6 (5)                      2
                  9 (3)                      2
LNII5             5 (9)                      3
                  4 (1)                      3
LN57              3 (7)                      3
                  4 (1)                      3
                  5 (2)                      3

7.6 Discussion of Results

When scrutinising the results from the initial fuzzy c-means based clustering

algorithm or SAFC clustering algorithm, a number of different cluster results were

obtained. This may be due to the random initialisation at the beginning of the

clustering process. A suitable initial cluster configuration may prevent the

occurrence of this problem. However, this can be considered as a research topic in

its own right.

A tendency to produce an excessive number of clusters from initial clustering

(SAFC or fuzzy c-means based clustering algorithm) was observed. The additional

clusters created by the algorithm may describe potential subtypes of tissue that are

presently not identified by conventional histopathology. After the application of the

novel cluster merging algorithm, a more stable clustering result was obtained. This

more clearly described the main types of tissue that existed within the samples

analysed, especially in whole sub area lymph node tissue sections.

When comparing the two types of experimental datasets, it is easier to interpret

the results obtained from the extracted FTIR spectral datasets, because different

types of spectral data points were withdrawn from a tissue section area for which the

type is certain. In contrast to this, the whole sub area lymph node FTIR spectra

datasets are more complex, since they are taken from entire sub-areas of tissue

samples. In the following, the clustering results obtained from the three lymph node

tissue sections LNII7, LNII5 and LN57 are analysed in more detail.

As shown in the H&E stained image for tissue section LNII7 (Figure 7.18b),

the invading cancerous tissue exists at the bottom of the collected sample area (pink

colouration) and the normal tissue at the top (purple colouration). When studying the

initial clustering results (Figures 7.18c-e), it can be seen that several extra clusters

have been created in both the normal and cancerous areas. It is possible that the

clusters found in the area diagnosed as being cancerous might be representative of

several different sub-classes of malignancy not normally recognised by histology. In

contrast, the extra clusters found in the normal area could be descriptive of normal

tissue that is beginning to take on cancerous characteristics. However, it would be

preferable to merge possible sub-types of tissue into one defining group. This would

allow a simplified characterisation of the tissue section to be made. After applying

the newly proposed merge method, the initial overly complex images have now been

merged into a single, simpler one (Figure 7.18f). This newly created image

is now more representative of the main characteristics of the tissue section.

The area examined on tissue section LNII5 revealed the more complex

infrastructure of a lymph node (Figure 7.19b). Tissue types found within the sample

area include capsule, reticulum, normal cortex and cancerous cortex. It should also

be noted that a small spherical region known as a secondary follicle was present in

the centre of the normal cortex and rapidly proliferating lymphocytes have

congregated at this location. The initial clustering image shown in Figure 7.19c,

displaying five clusters of IR spectra, was obtained in nine out of ten repeats. When

compared against the H&E stained image for the same area (Figure 7.19b), the full

range of tissue types has been correctly characterised, including the secondary

follicle. When applying the automated merge method, the clusters which are merged

have similar biochemistry, producing three main clusters that describe the main types

of tissue found within the sample area (Figure 7.19d). These include

fibrocollagenous tissue (capsule and reticulum), normal cortex tissue (normal cortex

and secondary follicle) and finally cancerous cortex tissue.

The second and rare type of output from initial clustering is shown in Figure

7.19 (e), in which four clusters were obtained for the same IR spectral dataset. The

main difference between this output and the first is the incorrect grouping of both

reticulum and normal cortex into the same cluster. This is a consequence of the

spectra for these types of tissue being close in proximity to each other in PCA space.

Due to random initialisation of cluster centres, the algorithm has on this occasion

calculated that four clusters would best describe the data structure, representing a

local minimum validity index value. Hence these two types of tissue spectra were

not separated but grouped into the same cluster. After the merging process,

reticulum was now grouped with normal cortex tissue rather than capsule tissue.

However, the merge method still discriminated the capsule, normal cortex and

cancerous cortex tissue.

Tissue section LN57 was collected from a benign lymph node that exhibited

large surrounding areas of fatty tissue. The sample area analysed via IR contained

three main types of tissue. These were normal cortex tissue (mainly made up of a

secondary follicle), capsule tissue, and small pockets of fatty tissue that had in some

regions infiltrated the capsule of the node. Three results obtained from the initial

clustering algorithm are shown in Figures 7.20 (c)–(e). The three clusters shown in

Figure 7.20 (c) characterise the main tissue types found in the examined sample area.

The green colour in the image describes the capsule of the lymph node, whereas the

red areas represent locations of fatty tissue invasion. Finally, the blue colour is

descriptive of normal cortex tissue.

The second type of output, as shown in Figure 7.20 (d), displays four clusters

of tissue spectra. Again the capsule has been separated into two clusters that

describe areas with or without fatty tissue (blue and yellow colours respectively).

But on this occasion an area can be seen that lies beneath the capsule of the lymph

node (cyan colour). This is likely to describe a region of the cortex called the

subcapsular sinus that lies directly underneath the capsule and allows lymph to enter

the node. The final type of output comprised five clusters, as displayed in Figure

7.20 (e). A similar scenario to the previous appears to have occurred. However, on

this occasion an additional cluster has been created that may further describe a layer

of fatty tissue within the capsule (red colour). After the automated merge clustering

method was applied to these different outputs, a final result comprising three clusters

was obtained, corresponding to the three tissue types present. It should be noted that

when the initial output was for three clusters, the merge method did not attempt to

further combine these, thus verifying the robustness of this cluster structure.

From these experiments, it can be seen that the proposed automated merge

cluster method can rapidly and efficiently obtain major types of biochemical tissue

from the existing samples. However, in order to transfer this algorithm into the

clinical setting an extensive and rigorous verification and evaluation programme

would have to be conducted.

7.7 Summary

In this Chapter, a fuzzy c-means based clustering algorithm was applied to

automatically generate the ‘optimal’ cluster structure for a given dataset (i.e. the

structure that yields the minimum validity index value). However, due to the

complexity of biological systems, an excessive number of clusters can sometimes be

obtained. In order to address this problem, an automated cluster merging algorithm

was developed and described in this Chapter. To demonstrate the proposed

algorithm, six FTIR spectra datasets (two from oral cancer tissue sections, and four

from axillary lymph node tissue sections) were analysed using this method. The

results indicated that the clusters that have similar biochemistry were successfully

merged and, therefore, demonstrated that the algorithm is successful in determining

the main tissue types within the different sections used. Further verification and

evaluation of this novel method would be required in order to transfer this algorithm

into the clinical setting.

CHAPTER 8

Conclusions

Cancer has become a major adversary to human health, and the development

and enhancement of techniques for use in its diagnosis and treatment has

increasingly become a focus of worldwide research. Fourier Transform Infrared

(FTIR) spectroscopy is a powerful tool for determining the biochemical composition

within a biological system. This capability to provide an insight into the biochemical

changes that occur within cells has led, in recent years, to FTIR spectroscopy being

investigated in the study of various biomedical conditions.

In order to analyse the FTIR spectroscopic data from tissue samples,

multivariate clustering techniques have often been used to separate sets of unlabelled

infrared spectral data into different clusters based on their characteristics. The

purpose of clustering is to group the spectral data such that the data in the same

clusters are as similar as possible and data within different clusters are as dissimilar

as possible. Hence, different types of cells can be separated within biological tissue.

Among existing clustering techniques, it has been shown that fuzzy clustering

techniques such as fuzzy c-means can have clear advantages over crisp and

probabilistic clustering methods, and they have been widely used in medical

diagnosis and pattern recognition. This thesis focuses on the development of fuzzy

clustering techniques that are able to automatically classify the cells present in a

variety of tissue sections and to investigate whether infrared spectroscopy can be

used as a diagnostic probe to identify early stages of cancer. In this Chapter, the

contributions of this thesis are summarised in the next Section, followed by a

discussion of some of the avenues of possible future work. Finally, the

dissemination which has resulted from this research is listed.

8.1 Contributions

This thesis has made the following contributions:

8.1.1 Comparison of the performance of three often used clustering techniques, namely

hierarchical clustering analysis, k-means and fuzzy c-means, on oral cancer FTIR

spectral data.

Hierarchical clustering analysis, k-means and fuzzy c-means algorithms are

three frequently used clustering methods in infrared spectroscopy analysis.

However, a systematic comparison of these techniques on oral cancer FTIR spectra

had not been performed prior to the present work. Furthermore, in previous

analyses [17,19,20,22,26,98,100,101,129] extra pre-processing steps, such as

mean-centring, variance scaling and first derivatives, had been carried out in an ‘ad hoc’

manner prior to further analysis being undertaken. Another benefit of the clustering

methods developed in this thesis is that all FTIR spectral data undergo only basic

pre-processing, for example, water vapour removal, baseline correction and

normalisation to account for irregularities in cell density across the tissue section.

All these techniques are well established for FTIR spectra and can be easily

automated.

In Chapter 4, experiments based on these three techniques were carried out on

seven FTIR spectra datasets taken from patients with oral cancer. In addition, the

results were compared and discussed. In that Chapter, a novel method of analysis

was introduced in which the number of spectra for which each clustering result

disagreed with the clinical diagnosis from pathologists was used to evaluate the

quality of the clustering methods. This makes the differences between the

techniques easier to identify.
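The disagreement count described above can be illustrated with a minimal sketch. It assumes that each spectrum carries one cluster index and one clinical label, and that clusters are matched to clinical classes by the one-to-one assignment that gives the fewest mismatches. This matching rule, the function name `count_disagreements` and the tissue labels are illustrative assumptions, not the exact procedure used in Chapter 4.

```python
from itertools import permutations

import numpy as np


def count_disagreements(cluster_labels, clinical_labels):
    """Smallest number of spectra whose cluster disagrees with the clinical
    label, taken over all one-to-one cluster-to-class assignments."""
    cluster_labels = np.asarray(cluster_labels)
    clinical_labels = np.asarray(clinical_labels)
    clusters = np.unique(cluster_labels)
    classes = np.unique(clinical_labels)
    if len(clusters) > len(classes):
        raise ValueError("more clusters than clinical classes")
    best = len(cluster_labels)
    # Cluster indices are arbitrary, so try every assignment of clusters
    # to clinical classes and keep the one with the fewest mismatches.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        mapped = np.array([mapping[c] for c in cluster_labels])
        best = min(best, int(np.sum(mapped != clinical_labels)))
    return best


# Hypothetical labels for six spectra (not data from the thesis).
clinical = ["tumour", "tumour", "stroma", "stroma", "stroma", "normal"]
clustering = [0, 0, 1, 1, 0, 2]
disagreements = count_disagreements(clustering, clinical)
```

The exhaustive search over assignments is only practical for the small numbers of tissue classes considered here; a larger number of classes would call for an assignment solver instead.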

8.1.2 Improvement of the previously introduced ‘Variable String Length Simulated

Annealing’ (VFC-SA) algorithm to enhance its stability and efficiency.

VFC-SA clustering is a method featuring a simulated annealing algorithm in

which a cluster validity index measure was used as the energy function in order to

automatically determine the best number of clusters. The advantage of using

simulated annealing is that it can escape local optima of cluster configurations as

present in the standard fuzzy c-means algorithm and hence may be able to find

globally optimal solutions. The Xie-Beni index was used as the cluster validity

index to evaluate the quality of the solutions. However, during the implementation

of this proposed algorithm, it was found that sub-optimal solutions could be obtained

in certain circumstances.
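As an illustration of the quantity being minimised, the Xie-Beni index divides the fuzzy within-cluster compactness by the number of spectra multiplied by the smallest squared distance between two cluster centres, so smaller values indicate compact, well-separated clusters. The following is a minimal NumPy sketch, assuming Euclidean distance and memberships raised to the fuzziness exponent m (m = 2 by default). The function name and array layout are hypothetical and are not taken from the thesis implementation.

```python
import numpy as np


def xie_beni_index(data, centres, memberships, m=2.0):
    """Xie-Beni validity index: compactness divided by separation.

    data:        (n_samples, n_features)
    centres:     (n_clusters, n_features), at least two clusters
    memberships: (n_clusters, n_samples), each column summing to 1
    """
    data = np.asarray(data, dtype=float)
    centres = np.asarray(centres, dtype=float)
    memberships = np.asarray(memberships, dtype=float)
    # Squared Euclidean distance from every centre to every sample.
    diff = centres[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    compactness = np.sum((memberships ** m) * sq_dist)
    # Smallest squared distance between two distinct centres.
    centre_diff = centres[:, None, :] - centres[None, :, :]
    centre_sq = np.sum(centre_diff ** 2, axis=2)
    np.fill_diagonal(centre_sq, np.inf)
    separation = centre_sq.min()
    return compactness / (data.shape[0] * separation)


# Toy example: two tight, well-separated groups with crisp memberships.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
crisp = [[1, 1, 0, 0], [0, 0, 1, 1]]
good = xie_beni_index(points, [[0.0, 0.5], [10.0, 0.5]], crisp)
poor = xie_beni_index(points, [[0.0, 0.5], [1.0, 0.5]], crisp)
```

In this toy case the well-placed centres give a much smaller index than the poorly placed ones, which is the behaviour an energy function for simulated annealing would rely on.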

In order to overcome this limitation, in Chapter 5, the original VFC-SA

algorithm was extended in four novel ways in order to produce the ‘Simulated

Annealing Fuzzy Clustering’ (SAFC) algorithm. An evaluation of the performance

of fuzzy c-means, VFC-SA and SAFC clustering algorithms on seven oral cancerous

FTIR spectra datasets was carried out and this demonstrated that SAFC obtained the

smallest Xie-Beni validity index values in all seven datasets. Particularly in

comparison with the VFC-SA algorithm, SAFC generates better quality, more stable

results in roughly the same computational time.

8.1.3 An analysis of lymph node tissue sections obtained from an infrared imaging

technique by using principal component analysis, fuzzy c-means clustering and a

combination of both methods.

Following the relatively recent introduction of an infrared imaging technique

which allows a large number of FTIR spectra to be obtained quickly and

efficiently, it is now possible to analyse large areas of tissue section. In Chapter 6,

spectral data which had been collected using this technique from an area of lymph

node tissue section (named LNII5) which contained different types of cells, were

selected for investigation.

Principal component analysis (PCA) is a typical multivariate statistical

technique that has been widely applied in the field of data analysis and compression.

In this thesis, PCA is mainly used to reduce the number of dimensions for large FTIR

datasets. However, in Chapter 6 another feature of PCA, namely data structure

detection, was employed to explore the underlying formation of cells within the

selected lymph node tissue section LNII5. This was implemented by calculating a

value that measures the correlation between each spectrum and each principal

component (PC). A larger value indicates that the spectrum is more closely related

to that PC, and vice versa. Finally, the whole set of values for each PC was

displayed in a false colour weighted image with the same pixel dimensions as the

original LNII5 image. The technique was applied utilising the first 10

PCs and the results were shown.

The standard fuzzy c-means clustering algorithm was also used to analyse this

same LNII5 tissue section. The number of clusters was set from 2 to 9 successively

and the clustering results were also displayed in false colour weighted images.

The third method used to analyse LNII5 was to combine both PCA and fuzzy

c-means together, termed PCA-fuzzy c-means. However, unlike in the first method,

PCA was used here in the more conventional manner to reduce the dimensionality of

the variables without significantly losing information from the original data (as there

are 821 wave-numbers within each spectrum from the given lymph node tissue

section). The transformed data in the space of the first 10 PCs were utilised as the input to

the fuzzy c-means clustering algorithm. The clustering results were again displayed

in false colour weighted images. The computational time of all three techniques was

also calculated and compared.
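The PCA-fuzzy c-means pipeline summarised above can be sketched as follows. This is a minimal illustration, assuming mean-centred PCA computed by singular value decomposition and the standard fuzzy c-means updates with Euclidean distance. The function names, the tolerance and the synthetic input are assumptions made for the example; they are not the thesis implementation or the LNII5 data.

```python
import numpy as np


def pca_scores(spectra, n_components):
    """Project mean-centred spectra onto their leading principal components."""
    centred = spectra - spectra.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def fuzzy_c_means(data, n_clusters, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Standard fuzzy c-means with Euclidean distance.

    Returns (centres, memberships); memberships has shape
    (n_clusters, n_samples) and every column sums to 1.
    """
    rng = np.random.default_rng(seed)
    u = rng.random((n_clusters, data.shape[0]))
    u /= u.sum(axis=0)  # random initial memberships
    previous = np.inf
    for _ in range(max_iter):
        weights = u ** m
        centres = (weights @ data) / weights.sum(axis=1, keepdims=True)
        sq_dist = ((centres[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
        sq_dist = np.maximum(sq_dist, 1e-12)  # guard against division by zero
        objective = float((weights * sq_dist).sum())
        # Membership update: u_ik proportional to d_ik^(-2 / (m - 1)).
        inv = sq_dist ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)
        # Stop when the objective improves by less than the tolerance.
        if abs(previous - objective) < tol:
            break
        previous = objective
    return centres, u


# Synthetic stand-in for spectra: two groups of 30 "spectra", 50 variables.
rng = np.random.default_rng(1)
spectra = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 50)),
    rng.normal(1.0, 0.1, size=(30, 50)),
])
scores = pca_scores(spectra, n_components=10)
centres, memberships = fuzzy_c_means(scores, n_clusters=2)
labels = memberships.argmax(axis=0)  # hard labels for a false colour image
```

Clustering the 10 score columns rather than all original variables is what reduces the cost of each distance calculation, which is consistent with the speed-up reported for PCA-fuzzy c-means.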

The experimental results showed that the PCA and PCA-fuzzy c-means

methods took almost the same computational time, and both were much faster than

the standard fuzzy c-means algorithm. At the same time, the PCA-fuzzy c-means

method obtained very similar clustering results to standard fuzzy c-means. However,

from PC3 onwards, the PCA method did not yield any additional useful information.

On the other hand, as the number of clusters was increased, more types of tissue

could be discriminated in the fuzzy c-means and PCA-fuzzy c-means methods.

Hence, it can be stated that (in this context) PCA-fuzzy c-means has advantages

over both PCA and fuzzy c-means, and so can be taken as a good technique to be

used to analyse large FTIR spectral datasets.

8.1.4 Identification of the relationship between the parameter in fuzzy c-means

which controls the minimal acceptable amount of improvement and the range of data

in the first three principal components space. Comparison of clustering performance

between k-means and fuzzy c-means algorithms on a lymph node tissue section after

dimensionality reduction.

During the experiments to classify the lymph node tissue section LNII5 using

the fuzzy c-means algorithm it was found, by comparing the sizes of the data ranges

in the space of the first three principal components, that the algorithm effectively

stopped prematurely when the minimal amount of improvement was not small enough.

Based on this finding, the minimal amount of improvement setting was adjusted and

the performance of the fuzzy c-means algorithm was then found to be significantly

improved.

In order to compare the clustering performance of k-means and fuzzy c-means on

the large FTIR spectra dataset LNII5, a process similar to the PCA-fuzzy c-means

technique was applied to the k-means algorithm, namely dimensionality reduction by

PCA prior to clustering. The corresponding experiments were

carried out and results were discussed in Chapter 6.

8.1.5 Development of a novel automated method to appropriately merge clusters

for FTIR spectral clustering analysis.

The objective of this research is to facilitate the automation of tools to be used

to enhance clinical analysis. This would require robust and consistent identification

of the appropriate number of clusters for the given clinical context. Therefore, a

clustering algorithm that can automatically detect the number of clusters is required.

In Chapter 5, the SAFC algorithm was designed and implemented with this purpose

in mind. However, in clustering large datasets, it can be very time consuming. With

the aim of overcoming this problem, a refined fuzzy c-means based clustering

algorithm was developed to find the ‘optimal’ number of clusters.

Both the SAFC and the fuzzy c-means based clustering algorithms can

automatically detect the number of clusters. However, both algorithms occasionally

identified an excessive number of clusters compared to clinical analysis.

Furthermore, the excessive number of clusters generated from the fuzzy c-means

based clustering algorithm separated the same type of tissue into two or more

clusters. In contrast, when the number of clusters was fixed to be the same as that

obtained from clinical analysis, the clustering results did not match clinical

diagnosis. It is thought that this may be due to the fact that the distance measure in

the standard fuzzy c-means algorithm is a Euclidean distance. In light of this

observation, an automated method to merge clusters was developed in Chapter 7.

Six FTIR datasets were used to verify this newly proposed algorithm. The

experimental results indicated that successful merging of clusters that have similar

biochemistry was achieved, thus demonstrating that the method can successfully

determine the main tissue types within the given sections.

8.1.6 Summary

This thesis investigates and develops clustering techniques which can classify

different types of tissue FTIR spectra taken from a range of oral cancer and breast

cancer lymph node tissue sections. It includes various comparisons of alternative

clustering algorithms in small (oral cancer) and large (lymph node tissue) FTIR

spectral datasets, and describes the development of a novel clustering algorithm

which can automatically identify the same number of tissue types as that obtained

by clinical diagnosis. The experiments carried out in this research indicate that

infrared spectroscopy may indeed become a powerful diagnostic probe to identify the

early stages of cancer.

8.2 Future Work

There are many potential avenues of further research that arise from

the work carried out in this thesis. Some of the more obvious ones are outlined

below.

8.2.1 Clustering Algorithms

This thesis mainly focuses on modifications and enhancements of the fuzzy c-

means clustering algorithm to classify the FTIR spectral data, along with k-means,

hierarchical clustering and simulated annealing fuzzy clustering. Different clustering

methods have their own advantages and disadvantages for specific clustering criteria.

Therefore further investigation of the combination of other clustering algorithms

with diverse optimisation techniques which could inherit the advantages of each

individual method may also be an interesting future research direction. Other than

partitional clustering algorithms, density-based and model-based clustering algorithms

could be examined.

In Chapter 7, an automatic clustering process based on cluster validity was

used to obtain the best clustering and the results showed that it works well.

However, this method has to run the fuzzy c-means algorithm once for each of a

range of candidate numbers of clusters. Development of an automatic clustering method which can

avoid running the same algorithm many times to discover the correct number of

clusters might be another future direction.
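The validity-driven search described above amounts to a loop over candidate numbers of clusters. The sketch below shows only that selection loop. The clustering step and the validity measure are passed in as functions, and the two toy functions used in the demonstration (`chunk_clusters` and `compactness_separation`) are hypothetical stand-ins for fuzzy c-means and a validity index, chosen so that the example stays short and self-contained.

```python
import numpy as np


def select_cluster_count(data, candidate_counts, cluster_fn, validity_fn):
    """Run cluster_fn once per candidate count and keep the partition
    with the smallest validity index value."""
    best = None
    for count in candidate_counts:
        partition = cluster_fn(data, count)
        score = validity_fn(data, partition)
        if best is None or score < best[1]:
            best = (count, score, partition)
    return best


def chunk_clusters(data, n_clusters):
    """Toy stand-in for a clustering run: sort 1-D data and split it into
    contiguous groups of near-equal size."""
    return np.array_split(np.sort(np.asarray(data, dtype=float)), n_clusters)


def compactness_separation(data, partition):
    """Compactness over separation for a crisp partition (smaller is better)."""
    means = np.array([group.mean() for group in partition])
    compactness = sum(
        float(((group - mean) ** 2).sum())
        for group, mean in zip(partition, means)
    )
    gaps = (means[:, None] - means[None, :]) ** 2
    np.fill_diagonal(gaps, np.inf)
    return compactness / (len(data) * gaps.min())


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
best_count, best_score, best_partition = select_cluster_count(
    values, range(2, 5), chunk_clusters, compactness_separation
)
```

The cost of this approach is one full clustering run per candidate count, which is the overhead that the future direction outlined above would aim to avoid.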

8.2.2 Distance Measures

The distance measurement used throughout this work is Euclidean distance.

This is suitable to measure the data in multi-dimensional space. However, when the

shape of the clusters differs greatly from spherical, it may result in an

inappropriate clustering. In light of this fact, other distance measures could also be

examined. Examples of such alternative distance measures are the squared

Mahalanobis distance, mutual neighbour distance and the Chebychev distance.
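To make the difference between these measures concrete, the following minimal NumPy sketch evaluates the Euclidean, squared Mahalanobis and Chebychev distances between a point and a cluster centre. The function names and the example covariance matrix are assumptions made for illustration; the mutual neighbour distance is not covered because it depends on the neighbourhood structure of the whole dataset rather than on two points alone.

```python
import numpy as np


def euclidean_distance(x, v):
    """Straight-line distance; treats every direction equally."""
    return float(np.sqrt(np.sum((x - v) ** 2)))


def squared_mahalanobis_distance(x, v, covariance):
    """Distance scaled by a covariance matrix, so elongated (non-spherical)
    clusters can be accommodated."""
    diff = x - v
    return float(diff @ np.linalg.solve(covariance, diff))


def chebychev_distance(x, v):
    """Largest absolute difference along any single coordinate."""
    return float(np.max(np.abs(x - v)))


point = np.array([3.0, 4.0])
centre = np.array([0.0, 0.0])
stretched = np.diag([9.0, 4.0])  # hypothetical cluster covariance

d_euclid = euclidean_distance(point, centre)
d_cheby = chebychev_distance(point, centre)
d_mahal = squared_mahalanobis_distance(point, centre, stretched)
d_mahal_identity = squared_mahalanobis_distance(point, centre, np.eye(2))
```

With an identity covariance the squared Mahalanobis distance reduces to the squared Euclidean distance, which is why the Euclidean measure implicitly favours spherical clusters.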

8.2.3 Cluster Validity

In the process of automatic clustering, it has been shown that different cluster

validity indices may lead to different clustering results being obtained. In this thesis,

the Xie-Beni and the Sun-Wang-Jiang validity indices were applied, depending on

the size of the FTIR dataset. Various other cluster validity indices may also be

investigated. Examples of some alternative indices that might be investigated can be

found at the end of Section 2.2.4.

8.2.4 Setting Initial Cluster Centres

As the initial cluster centres in fuzzy c-means clustering are set randomly, this

will often result in minor differences in clustering results in different runs of the

algorithm (indeed, sometimes major differences can be obtained). Therefore, in

order to obtain a representative set of results, a number of runs of this algorithm were

carried out. There remains scope for an investigation of a method which can set

‘good’ or ‘appropriate’ initial cluster centre positions that will then make the fuzzy c-

means clustering algorithm work more efficiently.
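One possible starting point for such an investigation is a seeding rule in the style of k-means++, in which each new centre is drawn from the data with probability proportional to its squared distance from the nearest centre already chosen. This is named here only as an example of a spread-out initialisation; it is not a method evaluated in this thesis, and the function name is hypothetical. The sketch assumes the data contain at least as many distinct points as clusters.

```python
import numpy as np


def spread_out_centres(data, n_clusters, seed=0):
    """k-means++-style seeding: later centres are drawn with probability
    proportional to squared distance from the nearest centre so far."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    centres = [data[rng.integers(len(data))]]
    for _ in range(1, n_clusters):
        chosen = np.array(centres)
        # Squared distance from each point to its nearest chosen centre.
        sq_dist = ((data[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dist.min(axis=1)
        index = rng.choice(len(data), p=nearest / nearest.sum())
        centres.append(data[index])
    return np.array(centres)


# Toy data: two tight groups (not FTIR spectra).
toy = [[0.0, 0.0], [0.0, 0.1], [10.0, 0.0], [10.0, 0.1]]
initial_centres = spread_out_centres(toy, n_clusters=2)
```

Because an already chosen point has zero selection probability, the returned centres are always distinct data points, and distant groups are strongly favoured for the later draws.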

8.2.5 Data Sources

A wider source of infrared spectroscopy data, particularly from a wider range

of different types of cancer, is required in order to carry out further evaluation and

verification of the clustering algorithms developed in this thesis. In order to transfer

these algorithms into the clinical setting, a far more thorough set of validation

experiments will obviously need to be carried out. Naturally, it is currently

expensive (in terms of both equipment and manpower) to obtain such data; and until

a technique is proven it can be hard to obtain the necessary funds. This is, of course,

a ‘catch 22’ situation, but it is hoped that the contributions presented in this thesis

will add to the weight of supporting evidence necessary to convince others of the

potential of the FTIR technique.

8.2.6 Infrared Spectroscopy Expert System

In future, if sufficient infrared spectral data were collected and analysed, a

database of infrared spectroscopic analysis could be established. Thus, any new

unlabeled spectra could then be identified through comparison with the existing spectra

in the database and a corresponding expert system could also be generated.

8.2.7 Dedicated Software For Infrared Spectrometry Hardware

The work developed in this thesis, and future works based on it, will hopefully

have the potential to be built into a software ‘sub-system’ within an infrared

spectrometry machine, once adequate evaluation and verification processes have been

carried out. This would require a significant effort in software engineering, but could

then be sold / licensed to the manufacturers of the FTIR machines. Hence, in the

long term, there may be the prospect of an economic return from this research, in

addition to the obvious potential human benefits obtained from earlier and more

accurate diagnosis of cancers.

8.3 Dissemination

The research described in this thesis has been disseminated through a number

of book chapters, journal papers and international conference papers. Most of the

work described in each of the main body Chapters of this thesis has also been

disseminated in this manner. A formal list of the publications and presentations

derived from this work now follows.

8.3.1 Book Chapter

Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Novel

Developments in Fuzzy Clustering for the Classification of Cancerous Cells using

FTIR Spectroscopy, Book Chapter, Jose Valente de Oliveira and Witold Pedrycz

(eds), accepted for publication in the book Advances in Fuzzy Clustering and its

Applications, John Wiley and Sons, 2007. (Chapters 4, 5 and 7)

8.3.2 Journal Papers

Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., A Novel Fuzzy

Clustering Algorithm for the Analysis of Axillary Lymph Node Tissue Sections,

accepted to be published in Applied Intelligence, 2006. (Chapter 7)

Wang, X.Y., Garibaldi, J.M., Simulated Annealing Fuzzy Clustering in

Cancer Diagnosis, Informatica, vol 29, no. 1, pp 61-70, 2005. (Chapter 5)

8.3.3 Conference Papers

Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Fuzzy Clustering

in Biochemical Analysis of Cancer Cells, in the Proceedings of Fourth Conference

of the European Society for Fuzzy Logic and Technology (EUSFLAT 2005) and

Eleventh Rencontres Francophones sur la Logique Floue et ses Applications (LFA

2005). pp. 1118-1123, Barcelona, Spain, September, 7-9, 2005. (Chapter 7)

Wang, X.Y., Garibaldi, J.M., A Comparison of Fuzzy and Non-Fuzzy

Clustering Techniques in Cancer Diagnosis, in the Proceedings of second

international conference in Computational Intelligence in Medicine and Healthcare

(The Biopattern Conference), pp. 250-256, Costa da Caparica, Lisbon, Portugal, 29

June - 1 July 2005. (Chapter 6)

Wang, X.Y., Whitwell, G. and Garibaldi, J.M., The Application of a

Simulated Annealing Fuzzy Clustering Algorithm for Cancer Diagnosis, in the

Proceedings of the IEEE 4th International Conference on Intelligent System Design

and Application, pp 467-472, Budapest, Hungary, August 26-28, 2004, ISBN 963-

71546-30-2. (Chapter 5)

Wang, X.Y., Garibaldi, J.M. and Ozen, T., Application of The Fuzzy C-

Means Clustering Method on the Analysis of non Pre-processed FTIR Data for

Cancer Diagnosis, in the Proceedings of the 8th Australian and New Zealand

Conference on Intelligent Information Systems, pp. 233-238, Sydney, Australia,

December 10-12, 2003. (Chapter 4)

8.3.4 Presentations

Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., A Novel Fuzzy

Clustering Algorithm for the Analysis of Potentially Cancerous Lymph Node Cells,

(oral presentation) in Automated Scheduling Optimisation and Planning Research

Group seminar, December, 2005.

Wang, X.Y., Garibaldi, J.M., Bird, B. and George, M. W., Fuzzy Clustering

in Biochemical Analysis of Cancer Cells, (oral presentation) in the Fourth

Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT

2005) and Eleventh Rencontres Francophones sur la Logique Floue et ses

Applications (LFA 2005). Barcelona, Spain, September 7-9, 2005.

Wang, X.Y., Garibaldi, J.M., A Comparison of Fuzzy and Non-Fuzzy

Clustering Techniques in Cancer Diagnosis, (oral presentation) in the Second

International Conference in Computational Intelligence in Medicine and Healthcare

The Biopattern Conference, Lisbon, Portugal, 29 June - 1 July 2005.

Wang, X.Y., Whitwell, G. and Garibaldi, J.M., The Application of a

Simulated Annealing Fuzzy Clustering Algorithm for Cancer Diagnosis, (oral

presentation) in the IEEE 4th International Conference on Intelligent System Design

and Application, Budapest, Hungary, August 26-28, 2004.

Wang, X.Y., Garibaldi, J.M. and Ozen, T., Application of The Fuzzy C-

Means Clustering Method on the Analysis of non Pre-processed FTIR Data for

Cancer Diagnosis, (oral presentation) in Automated Scheduling Optimisation and

Planning Research Group seminar, December, 2003.

Wang, X.Y., Garibaldi, J. M., Fuzzy Clustering, (oral presentation) in

Automated Scheduling Optimisation and Planning Research Group seminar, April,

2003.

Bird, B., George, M. W., Wang, X-Y., Garibaldi, J. M., Stone, N., Smith, J.

and Barr, H., A Combined Infrared and Raman study of Axillary Lymph Nodes in

Breast Cancer. (Poster presentation) 3rd International Conference on Advanced

Vibrational Spectroscopy, Wisconsin, USA, August, 14-19, 2005.

Bird, B., George, M. W., Wang, X-Y., Garibaldi, J. M., Stone, N., Smith, J.

and Barr, H., A Combined Infrared and Raman study of Axillary Lymph Nodes in

Breast Cancer. (Oral Presentation) Mini-Symposium on Optoelectronics for Use in

the Diagnosis of Cancer, Grasmere, Lake District, UK, June 20-23, 2005.

Bird, B., Chesters, M. A., Chalmers, J., Tobin, M., Wang, X. Y., Garibaldi,

J. M., Hitchcock, A. and Symonds, I., Infrared Microspectroscopy as a Potential

Tool for Cervical Cancer Diagnosis. (Poster Presentation) Faraday Discussion 126,

Applications of Spectroscopy to Biomedical Problems, University of Nottingham,

UK, September 1-3, 2003.

Bird, B., Chesters, M. A., Chalmers, J., Tobin, M., Wang, X. Y., Garibaldi,

J. M., Hitchcock, A. and Symonds, I., Infrared Microspectroscopy as a Potential

Tool for Cervical Cancer Diagnosis. (Poster Presentation) 2nd International

Conference on Advanced Vibrational Spectroscopy, University of Nottingham, UK,

August 24-29, 2003.

References

[1] National Cancer Research Institute website, 2006, www.ncri.org.uk

[2] Zachariadou-Veneti, S., 2000, "A Tribute to George Papanicolaou (1883-1962)", Cytopathology, vol. 11, pp. 152-157.

[3] Stuttaford, T., 4 May 2001, "Greatest need is annual smear", The Times newspaper.

[4] 4 May 2001, "Imperfect cervical cancer tests are better than none at all", The Times newspaper.

[5] Robles, S. C., 2002, "Deconstructing the Myths of Cervical Cancer", Perspectives in Health, vol. 5, no. 2.

[6] Mantsch, H. and McElhaney, R. N., 1990, "Application of IR spectroscopy to biology and medicine", J Molec Struc, vol. 217, pp. 347-362.

[7] Wong, P. T. T. and Rigas, B., 1990, "Infrared spectra of microtome sections of human colon tissues", Applied Spectroscopy, vol. 44, pp. 1715-1718.

[8] Rigas, B., Morgello, S., Goldan, I. S., and Wong, P. T. T., 1990, "Human colorectal cancers display abnormal Fourier transform infrared spectra", in Proceedings of the National Academy of Sciences, USA, vol. 87, pp. 84-88.

[9] Wong, P. T. T., Goldstein, S. M., Grekin, R. C., Godwin, T. A., Pivik, C., and Rigas, B., 1993, "Distinct infrared spectroscopic patterns of human basal cell carcinoma of the skin", Cancer Research, vol. 53, no. 4, pp. 762-765.

[10] Morris, B. J., Lee, C., Nightingale, B. N., Molodysky, E., Morris, L. J., and Appio, R., 1995, "Fourier transform infrared spectroscopy of dysplastic, papillomavirus-positive cervicovaginal lavage specimens", Gynecological Oncology, vol. 56, no. 2, pp. 245-249.

[11] Jackson, M. and Mantsch, H., 1996, Biomedical Infrared Spectroscopy, in Infrared Spectroscopy of Biomolecules, Mantsch, H. and Chapman, D. (eds), Wiley-Liss Inc., New York, pp. 311-340.

[12] Benedetti, E., Teodori, L., Trinca, M. L., Vergamini, P., Slavati, F., Mauro, F., and Spremolla, G., 1990, "A new approach to the study of human solid tumor cells by means of FT-IR microspectroscopy", Applied Spectroscopy, vol. 44, pp. 1276-1280.

[13] Benedetti, E., Papineschi, F., Vergamini, P., Consolini, R., and Spremolla, G., 1984, "Analytical infrared spectral differences between human normal and leukaemic cells (CLL) — I", Leukemia Research, vol. 8, no. 3, pp. 483-489.

[14] McIntosh, L., Mansfield, J., Crowson, A., Mantsch, H., and Jackson, M., 1999, "Analysis and Interpretation of Infrared Microscopic Maps: Visualization and Classification of Skin Components by Digital Staining and Multivariate Analysis", Biospectroscopy, vol. 5, pp. 265-275.

[15] Goodacre, R., Timmins, E., Burton, R., Kaderbhai, N., Woodward, A., Kell, D., and Rooney, P., 1998, "Rapid Identification of Urinary Tract Infection Bacteria Using Hyperspectral Whole-organism Fingerprinting and Artificial Neural Networks", Microbiology, vol. 144, pp. 1157-1170.

[16] Richter, T., Steiner, G., Abu-Id, M., Salzer, R., Bergmann, R., Rodig, H., and Johannsen, B., 2002, "Identification of Tumor Tissue by FTIR Spectroscopy in Combination with Positron Emission Tomography", Vibrational Spectroscopy, vol. 28, pp. 103-110.

[17] Lasch, P., Haensch, W., Naumann, D., and Diem, M., 2004, "Imaging of colorectal adenocarcinoma using FT-IR microspectroscopy and cluster analysis", Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, vol. 1688, no. 2, pp. 176-186.

[18] Romeo, M. J. and Diem, M., 2005, "Infrared Spectral Imaging of Lymph Nodes: Strategies for Analysis and Artifact Reduction", Vibrational Spectroscopy, vol. 38, pp. 115-119.

[19] Wood, B. R., Chiriboga, L., Yee, H., Quinn, M. A., McNaughton, D., and Diem, M., 2004, "Fourier transform infrared (FTIR) spectral mapping of the cervical transformation zone, and dysplastic squamous epithelium", Gynecologic Oncology, vol. 93, no. 1, pp. 59-68.

[20] Lasch, P., Wasche, W., McCarthy, W. J., Muller, G., and Naumann, D., 1998, "Imaging of Human Colon Carcinoma Thin Sections by FTIR Microspectrometry", Infrared Spectroscopy: New Tool in Medicine, vol. 3257, no. 3, pp. 187-197.

[21] Salman, A., Erukhimovitch, V., Talyshinsky, M., and Huleihil, M., 2002, "FTIR Spectroscopic Method for Detection of Cells Infected with Herpes Viruses", Biopolymers (Biospectroscopy), vol. 67, pp. 406-412.

[22] Schultz, C. P., Liu, K., Johnston, J. B., and Mantsch, H., 1996, "Study of Chronic Lymphocytic Leukemia Cells by FT-IR Spectroscopy and Cluster Analysis", Leukemia Research, vol. 20, no. 8, pp. 649-655.

[23] Zhang, L., Small, G. W., Haka, A. S., Kidder, L. H., and Lewis, E. N., 2003, "Classification of Fourier Transform Infrared Microscopic Imaging Data of Human Breast Cells by Cluster Analysis and Artificial Neural Networks", Applied Spectroscopy, vol. 57, no. 1, pp. 14-22.

[24] 2001, Handbook of Analytical Methods for Materials, Materials Evaluation and Engineering, Inc.

[25] Berkhin, P., 2002, "Survey of Clustering Data Mining Techniques", San Jose, CA, USA, Accrue Software.

[26] Allibone, R., Chalmers, J. M., Chesters, M. A., Fisher, S., Hitchcock, A., Pearson, M., Rutten, F. J. M., Symonds, I., and Tobin, M., 2002, "FT-IR microscopy of oral and cervical tissue samples", Derby City General Hospital, Internal Report.

[27] Jain, A. K., Murty, M. N., and Flynn, P. J., 1999, "Data Clustering: A Review", ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.

[28] Omran, M., 2004, Particle Swarm Optimization Methods for Pattern Recognition and Image Processing, Ph.D. Thesis, Faculty of Engineering, Built Environment and Information Technology, University of Pretoria.

[29] Han, J. and Kamber, M., 2001, Data Mining: Concepts and Techniques, Morgan Kaufmann. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA.

[30] Carpineto, C. and Romano, G., 1996, "A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval", Machine Learning, vol. 24, no. 2, pp. 95-122.

[31] Judd, D., McKinley, P., and Jain, A., 1998, "Large-scale Parallel Data Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 871-876.

[32] Ozer, M., 2005, "Fuzzy c-means Clustering and Internet Portals: A Case Study", European Journal of Operational Research, vol. 164, no. 3, pp. 696-714.

[33] Hamerly, G. and Elkan, C., 2002, "Alternatives to the k-means algorithm that find better clusterings", in Proceedings of CIKM-02, 11th ACM International Conference on Information and Knowledge Management, pp. 600-607.

[34] Garrett-Mayer, E. and Parmigiani, G., 2004, "Clustering and Classification Methods for Gene Expression Data Analysis", Johns Hopkins University, Dept. of Biostatistics Working Papers, Johns Hopkins University, The Berkeley Electronic Press (bepress), http://www.bepress.com/jhubiostat/.

[35] Jiang, D., Tang, C., and Zhang, A., 2004, "Cluster Analysis for Gene Expression Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386.

[36] Anderberg, M.R., 1973, Cluster Analysis for Applications, Academic Press. New York.

[37] Blum, C. and Roli, A., 2003, "Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison", ACM Computing Surveys, vol. 35, no. 3, pp. 268-308.

[38] 2004, "Cluster Analysis", Copyright StatSoft, Inc., 1984-2004, http://www.statsoft.com/textbook/stcluan.html.

[39] Ward, J. H., 1963, "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244.

[40] 1999, "Characteristics of Methods for Clustering Observations", http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap8/sect4.htm, SAS/STAT User's Guide OnlineDoc, Version 8, SAS Institute Inc., Cary, NC, USA.

[41] Jain, A.K. and Dubes, R.C., 1988, Algorithms for Clustering Data, Prentice-Hall advanced reference series, Prentice-Hall. Englewood Cliffs, NJ, USA.

[42] MacQueen, J. B., 1967, "Some methods of classification and analysis of multivariate observations", in Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California, Berkeley, pp. 281-297.

[43] Ruspini, E. H., 1969, "A New Approach to Clustering", Information and Control, vol. 15, no. 1, pp. 22-32.

[44] Dunn, J. C., 1973, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, vol. 3, no. 3, pp. 32-57.

[45] Bezdek, J., 1981, Pattern Recognition With Fuzzy Objective Function Algorithms, Plenum. New York.

[46] Hoppner, F., Klawonn, F., Kruse, R., and Runkler, T., 1999, Fuzzy Cluster Analysis Methods for Classification, Data Analysis and Image Recognition, John Wiley and Sons Ltd.

[47] Lampinen, T., Koivisto, H., and Honkanen, T., 2002, "Profiling Network Applications with Fuzzy C-Means Clustering and Self-organizing Map", in Proceedings of 1st International Conference on Fuzzy Systems and Knowledge Discovery: Computational Intelligence for the E-Age, Orchid Country Club, Singapore, vol. 1, pp. 300-394.

[48] Zhao, Y. and Karypis, G., 2004, "Soft Clustering Criterion Functions for Partitional Document Clustering: A Summary of Results", in Proceedings of CIKM-04, 13th ACM International Conference on Information and Knowledge Management, pp. 246-247.

[49] Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P., 1983, "Optimization by Simulated Annealing", Science, vol. 220, no. 4598, pp. 671-680.

[50] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E., 1953, "Equation of State Calculations by Fast Computing Machines", Journal of Chemical Physics, vol. 21, no. 6, pp. 1087-1092.

[51] Klein, R. and Dubes, R. C., 1989, "Experiments in Projection and Clustering by Simulated Annealing", Pattern Recognition, vol. 22, pp. 213-220.

[52] Selim, S. Z. and Al-Sultan, K., 1991, "A Simulated Annealing Algorithm for the Clustering Problem", Pattern Recognition, vol. 24, no. 10, pp. 1003-1008.

[53] Brown, D. E. and Huntley, C. L., 1992, "A Practical Application of Simulated Annealing to Clustering", Pattern Recognition, vol. 25, no. 4, pp. 401-412.

[54] Barker, A., 1989, Neural Networks for Data Fusion, Master's Thesis, University of Virginia, Charlottesville, Virginia.

[55] Al-Sultan, K. and Selim, S. Z., 1993, "A Global Algorithm for the Fuzzy Clustering Problem", Pattern Recognition, vol. 26, no. 9, pp. 1357-1361.

[56] Lukashin, A. V. and Fuchs, R., 2001, "Analysis of Temporal Gene Expression Profiles: Clustering by Simulated Annealing and Determining the Optimal Number of Clusters", Bioinformatics, vol. 17, no. 5, pp. 405-414.

[57] Yang, W., Rueda, L., and Ngom, A., 2005, "A Simulated Annealing Approach to Find the Optimal Parameters for Fuzzy Clustering Microarray Data", in Proceedings of XXV International Conference of the Chilean Computer Science Society - SCCC 2005, Valdivia, Chile, pp. 45-54.

[58] Al-Sultan, K., 1995, "A Tabu Search Approach to the Clustering Problem", Pattern Recognition, vol. 28, no. 9, pp. 1443-1451.

[59] Al-Sultan, K. and Fedjki, C., 1997, "A Tabu Search-Based Algorithm for the Fuzzy Clustering Problem", Pattern Recognition, vol. 30, no. 12, pp. 2023-2030.

[60] Sung, C. S. and Jin, H. W., 2000, "A Tabu-Search-Based Heuristic for Clustering", Pattern Recognition, vol. 33, pp. 849-858.

[61] Hall, L. O., Ozyurt, B., and Bezdek, J. C., 1999, "Clustering with a Genetically Optimized Approach", IEEE Transactions on Evolutionary Computation, vol. 3, no. 2, pp. 103-112.

[62] Maulik, U. and Bandyopadhyay, S., 2000, "Genetic Algorithm-Based Clustering Technique", Pattern Recognition, vol. 33, pp. 1455-1465.

[63] Bandyopadhyay, S. and Maulik, U., 2002, "Genetic Clustering for Automatic Evolution of Clusters and Application to Image Classification", Pattern Recognition, vol. 35, pp. 1197-1208.

[64] Davies, D. L. and Bouldin, D. W., 1979, "A Cluster Separation Measure", IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 2, pp. 224-227.

[65] Garai, G. and Chaudhuri, B. B., 2004, "A Novel Genetic Algorithm for Automatic Clustering", Pattern Recognition Letters, vol. 25, pp. 173-187.

[66] Tseng, L. and Yang, S., 2001, "A Genetic Approach to the Automatic Clustering Problem", Pattern Recognition, vol. 34, pp. 415-424.

[67] Babu, G. P. and Murty, M. N., 1994, "Clustering with Evolution Strategies", Pattern Recognition, vol. 27, no. 2, pp. 321-329.

[68] Lee, C. and Antonsson, E., 2000, "Dynamic Partitional Clustering Using Evolution Strategies", in Proceedings of the 3rd Asia-Pacific Conference on Simulated Evolution and Learning, Nagoya, Japan.

[69] Halkidi, M., Batistakis, Y., and Vazirgiannis, M., 2001, "On Clustering Validation Techniques", Journal of Intelligent Information Systems, vol. 17, no. 2/3, pp. 107-145.

[70] Su, M., December 2005, "A New Index of Cluster Validity", http://www.cecs.missouri.edu/~skubic/8820/ClusterValid.pdf, Electrical and Computer Engineering Department, University of Missouri-Columbia, Columbia, MO, USA.

[71] Rezaee, M. R., Lelieveldt, B. P. F., and Reiber, J. H. C., 1998, "A New Cluster Validity Index for the Fuzzy C-Means", Pattern Recognition Letters, vol. 19, pp. 237-246.

[72] Sun, H., Wang, S., and Jiang, Q., 2004, "FCM-Based Model Selection Algorithms for Determining the Number of Clusters", Pattern Recognition, vol. 37, pp. 2027-2037.

[73] Bezdek, J., 1998, Pattern Recognition in Handbook of Fuzzy Computation, IOP Publishing Ltd. Boston, NY.

[74] Xie, X. L. and Beni, G., 1991, "A Validity Measure for Fuzzy Clustering", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847.

[75] Kim, D. W., Lee, K. H., and Lee, D., 2004, "On Cluster Validity Index for Estimation of the Optimal Number of Fuzzy Clusters", Pattern Recognition, vol. 37, pp. 2009-2025.

[76] Fukuyama, Y. and Sugeno, M., 1989, "A New Method of Choosing the Number of Clusters for the Fuzzy C-Means Method", in Proceedings of Fifth Fuzzy System Symposium, pp. 247-250.

[77] Rhee, H. and Oh, K., 1996, "A Validity Measure for Fuzzy Clustering and Its Uses in Selecting Optimal Number of Clusters", in Proceedings of IEEE, pp. 1020-1025.

[78] Bandyopadhyay, S. and Maulik, U., 2001, "Nonparametric Genetic Clustering: Comparison of Validity Indices", IEEE Transactions on Systems, Man, and Cybernetics, vol. 31, no. 1, pp. 120-125.

[79] Xie, Y., Raghavan, V. V., and Zhao, X., 2002, "3M Algorithm: Finding an Optimal Fuzzy Cluster Scheme for Proximity Data", in Proceedings of the FUZZY-IEEE conference-2002 IEEE world congress on Computational Intelligence, Honolulu, HI, vol. 1, pp. 627-632.

[80] Kim, M. and Ramakrishna, R. S., 2005, "New Indices for Cluster Validity Assessment", Pattern Recognition Letters, vol. 26, pp. 2353-2363.

[81] Wu, K. and Yang, M., 2005, "A Cluster Validity Index for Fuzzy Clustering", Pattern Recognition Letters, vol. 26, no. 9, pp. 1275-1291.

[82] Hamerly, G. and Elkan, C., 2003, "Learning the K in K-means", in Proceedings of the 17th Annual Conference on Neural Information Processing Systems, British Columbia, Canada.

[83] Ray, S. and Turi, R., 1999, "Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation", in Proceedings of 4th International Conference on Advances in Pattern Recognition and Digital Techniques (ICAPRDT'99), New Delhi, India, pp. 137-143.

[84] Ball, G. and Hall, D., 1965, "A Novel Method of Data Analysis and Classification", Stanford University, Stanford, CA, Technical Report AD-699616.

[85] Tran, T., Wehrens, R., and Buydens, L., 2005, "Clustering Multispectral Images: A Tutorial", Chemometrics and Intelligent Laboratory Systems, vol. 77, pp. 3-17.

[86] Turi, R.H., 2001, Clustering-Based Colour Image Segmentation, Ph.D. Thesis, School of Computer Science and Software Engineering, Monash University, Australia.

[87] Tou, J., 1979, "DYNOC - A Dynamic Optimal Cluster-Seeking Technique", International Journal of Parallel Programming, vol. 8, no. 6, pp. 541-547.

[88] Chaudhuri, D., Chaudhuri, B. B., and Murthy, C. A., 1992, "A New Split-and-Merge Clustering Technique", Pattern Recognition Letters, vol. 13, pp. 399-409.

[89] Huang, K., 2002, "A Synergistic Automatic Clustering Technique (Syneract) for Multispectral Image Analysis", Photogrammetric Engineering and Remote Sensing, vol. 1, no. 1, pp. 33-40.

[90] Ester, M., Kriegel, H., Sander, J., and Xu, X., 1996, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Menlo Park, CA, pp. 226-231.

[91] Guha, S., Rastogi, R., and Shim, K., 1998, "CURE: An Efficient Clustering Algorithm for Large Databases", in Proceedings of ACM SIGMOD International Conference on Management of Data, New York, pp. 73-84.

[92] Karypis, G., Han, E.-H., and Kumar, V., 1999, "Chameleon: Hierarchical Clustering Using Dynamic Modeling", IEEE Computer, vol. 32, pp. 68-75.

[93] Kelly, P. M., 1994, "An Algorithm for Merging Hyperellipsoidal Clusters", Technical report LA-UR-94-3306.

[94] Tou, J.T. and Gonzalez, R., 1972, Pattern Recognition Principles, Addison-Wesley. Reading, MA.

[95] Jolliffe, I.T., 1986, Principal Component Analysis, Springer-Verlag. New York.

[96] Jolliffe, I.T., 2002, Principal Component Analysis (Second Edition), Aberdeen, UK.

[97] Goncalves, A. R., Esposito, E., and Benar, P., 1998, "Evaluation of Panus Tigrinus in the Delignification of Sugarcane Bagasse by FTIR-PCA and Pulp Properties", Journal of Biotechnology, vol. 66, pp. 177-185.

[98] Kim, S. W., Ban, S. H., Chung, H., Cho, S., Chung, H. J., Choi, P. S., Yoo, O. J., and Liu, J. R., 2004, "Taxonomic Discrimination of Flowering Plants by Multivariate Analysis of Fourier Transform Infrared Spectroscopy Data", Plant Cell Reports, vol. 23, no. 4, pp. 246-250.

[99] 2003, "Matlab Statistics Toolbox: Linkage", Matlab version 6.5.0, release 13.0.1.

[100] Zhao, H., Kassama, Y., Young, M., Kell, D. B., and Goodacre, R., 2004, "Differentiation of Micromonospora Isolates from a Coastal Sediment in Wales on the Basis of Fourier Transform Infrared Spectroscopy, 16S rRNA Sequence Analysis, and the Amplified Fragment Length Polymorphism Technique", Applied and Environmental Microbiology, vol. 70, no. 11, pp. 6619-6627.

[101] Naumann, A., Navarro-Gonzalez, M., Peddireddi, S., Kues, U., and Polle, A., 2005, "Fourier Transform Infrared Microscopy and Imaging: Detection of Fungi in Wood", Fungal Genetics and Biology, vol. 42, pp. 829-835.

[102] Mansfield, J., Sowa, M., Majzels, C., Collins, C., Cloutis, E., and Mantsch, H., 1999, "Near Infrared Spectroscopic Reflectance Imaging: Supervised vs. Unsupervised Analysis Using An Art Conservation Application", Vibrational Spectroscopy, vol. 19, pp. 33-45.

[103] Cancer Research UK website for Oral Cancer, 2005, http://info.cancerresearchuk.org/cancerstats/types/oral/?a=5441

[104] Cancer Research UK website for Oral Cancer, 2005, http://info.cancerresearchuk.org/cancerstats/types/oral/?a=5441

[105] The Concise Biotech Dictionary website, 2004, www.thebiotechdictionary.com/term/histology

[106] Bird, B., 2006, FTIR Imaging: A Route Toward Automated Histopathology, Ph.D. Thesis, The Department of Chemistry, The University of Nottingham, UK.

[107] Kissin, M. W., Querci-della-Rovere, G., Easton, D., and Westbury, G., 1986, "Risk of Lymphoedema Following the Treatment of Breast Cancer", British Journal of Surgery, vol. 73, pp. 580-584.

[108] Reddy, M. and Given-Wilson, R., 2004, "Screening for Breast Cancer", Surgery, vol. 22, no. 7, pp. 155-160.

[109] Turner, R. R., Ollila, D. W., Krasne, D. L., and Giuliano, A. E., 1997, "Histopathologic validation of the sentinel lymph node hypothesis for breast cancer", Annals of Surgery, vol. 226, pp. 271-278.

[110] Bird, B., June 2005, "Fourier Transform Infrared (FTIR) Imaging - A Potential Tool for Cancer Diagnosis", School of Chemistry, The University of Nottingham, UK.

[111] van Diest, P. J., Torrenga, H., Borgstein, P. J., Pijpers, R., Bleichrodt, R. P., Rahusen, F. D., and Meijer, S., 1999, "Reliability of Intraoperative Frozen Section and Imprint Cytological Investigation of Sentinel Lymph Nodes in Breast Cancer", Histopathology, vol. 35, no. 1, pp. 14-18.

[112] Gulec, S. A., Su, J., O'Leary, J. P., and Stolier, A., 2001, "Clinical utility of frozen section in sentinel node biopsy in breast cancer", American Surgeon, vol. 67, no. 6, pp. 529-532.

[113] Salem, A. A., Douglas-Jones, A. G., Moneypenny, I. J., Sweetland, H. M., Webster, D. J., Newcombe, R. G., and Mansel, R. E., 2002, "Detection of Axillary Node Status During Breast Cancer Surgery", European Journal of Surgical Oncology, vol. 28, pp. 789-

[114] Swenson, K. K., Nissen, M. J., Ceronsky, C., Swenson, L., Lee, M. W., and Tuttle, T. M., 2002, "Comparison of Side Effects Between Sentinel Lymph Node and Axillary Lymph Node Dissection for Breast Cancer", Annals of Surgical Oncology, vol. 9, pp. 745-753.

[115] Johnson, K. S., Chicken, D. W., Pickard, D. C. O., Lee, A. C., Briggs, G., Falzon, M., Bigio, I. J., Keshtgar, M. R., and Bown, S. G., 2004, "Elastic scattering spectroscopy for intraoperative determination of sentinel lymph node status in the breast", Journal of Biomedical Optics, vol. 9, no. 6, pp. 1122-1128.

[116] Godavarty, A., Thompson, A. B., Roy, R., Eppstein, M. J., Zhang, C., Gurfinkel, M., and Sevick-Muraca, E. M., 2004, "Diagnostic imaging of breast cancer using fluorescence-enhanced optical tomography: phantom studies", Journal of Biomedical Optics, vol. 9, no. 3, pp. 486-496.

[117] Smith, J., Kendall, C., Sammon, A., Christie-Brown, J., and Stone, N., 2003, "Raman Spectral Mapping in the Assessment of Axillary Lymph Nodes in Breast Cancer", Technology in Cancer Research & Treatment, vol. 2, no. 4, pp. 327-332.

[118] Contractor, K., Burke, M., Singhal, H., Bonsal, U., Boyle, S., Williams, G., Bostwick, P., and Mitchel, R., 2002, "Contact Cytology in the Intraoperative Detection of Sentinel Node Metastasis", Journal of Surgical Oncology, vol. 28, pp. 787-

[119] Surewicz, W. K., Mantsch, H. H., and Chapman, D., 1993, "Determination of Protein Secondary Structure by Fourier Transform Infrared Spectroscopy: A Critical Assessment", Biochemistry, vol. 32, no. 2, pp. 389-395.

[120] Lasch, P., Schmitt, J., and Naumann, D., 2000, "Colorectal Adenocarcinoma Diagnosis by FT-IR Microspectrometry", Proceedings of SPIE, Biomedical Spectroscopy: Vibrational Spectroscopy and Other Novel Techniques, vol. 3918, pp. 45-56.

[121] Perelman, L., Backman, V., Wallace, M., Zonios, G., Manoharan, R., Nusrat, A., Shields, S., Seiler, M., Lima, C., Hamano, T., Itzkan, I., Van Dam, J., Crawford, J. M., and Feld, M. S., 1998, "Observation of periodic fine structure in reflectance from biological tissue: A new technique for measuring nuclear size distribution", Physical Review Letters, vol. 80, pp. 627-630.

[122] Mourant, J. R., Hielscher, A. H., Eick, A. A., Shen, D., Johnson, T. M., and Freyer, J. P., 1998, "Evidence for intrinsic differences in light-scattering properties of tumorigenic and nontumorigenic cells", Cancer Cytopathology, vol. 84, no. 6, pp. 366-374.

[123] Bandyopadhyay, S., 2003, "Simulated Annealing for Fuzzy Clustering: Variable Representation, Evolution of the Number of Clusters and Remote Sensing Applications", unpublished, private communication.

[124] Pal, N. R. and Bezdek, J., 1995, "On Cluster Validity for the Fuzzy C-Means Model", IEEE Trans. Fuzzy Systems, vol. 3, pp. 370-379.

[125] Rayward-Smith, V.J., Osman, I.H., Reeves, C.R., and Smith, G.D., 1996, Modern Heuristic Search Methods, John Wiley & Sons.

[126] Conover, W.J., 1999, Practical Nonparametric Statistics, John Wiley & Sons.

[127] Causton, D.R., 1987, A Biologist's Advanced Mathematics, Allen & Unwin. London.

[128] Liu, H., 1998, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic.

[129] Diem, M., Chiriboga, L., and Yee, H., 2000, "Infrared Spectroscopy of Human Cells and Tissue. VIII. Strategies for Analysis of Infrared Tissue Mapping Data and Applications to Liver Tissue", Biopolymers (Biospectroscopy), vol. 57, pp. 282-290.

Appendix

Medical Terminologies

The medical terms used in this thesis are explained below. The explanations were either provided by the Chemistry Department or taken from an online medical dictionary.

● Biopsy:

The removal of a small portion of tissue from the body for microscopic examination.

● Cortex tissue:

The outer layer of an internal organ or body structure.

● Fibrocollagenous tissue:

A type of tissue that is fibrous and collagenous; pertaining to or composed of fibrous tissue consisting mainly of collagen.

● Inter-observer:

Discrepancy between two different observers.

● Intra-observer:

Discrepancy between two different examinations by the same observer.

● Ipsilateral axilla:

The axilla located on, or affecting, the same side of the body, where the axilla is the cavity beneath the junction of a forelimb and the body.

● Keratinisation:

The conversion of thin outer cells into a tough material.

● Lymphoedema:

Swelling, especially in subcutaneous tissues, as a result of obstruction of lymphatic

vessels or lymph nodes, with accumulation of lymph in the affected region.

● Metastasis:

The spread of cancer from its original location.

● Morphological changes:

Changes in the form and structure of cells, considered without regard to their function.

● Necrotic tissue:

Tissue that has died through injury or disease.

● Pap smear:

A method for the early detection of cancer especially of the uterine cervix that

involves the staining of exfoliated cells using a special technique which differentiates

diseased tissue. The name ‘Pap’ is taken from the surname of the inventor of the

screening test, Dr. George Papanicolaou.

● Phagocyte cell:

A type of cell in the body, such as a white blood cell, that can absorb waste material and harmful microorganisms; it protects the body against infection by destroying bacteria.

● Reticulum:

A fine network formed by cells, by certain structures within cells, or by connective-

tissue fibres between cells.

● Stroma tissue:

Connective tissue; also used to refer to normal tissue.

● Trabeculae:

Small, often microscopic, tissue elements in the form of a small beam, strut or rod,

generally having a mechanical function.

● Traditional/conventional histology:

The study of plant or animal tissue, usually by examining thin cross-sections of tissue under a microscope.

● Tumour tissue:

An abnormal new growth of tissue.
