
Constantinos Koutsojannis and Spiros Sirmakessis (Eds.)

Tools and Applications with Artificial Intelligence


Studies in Computational Intelligence, Volume 166

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 144. Andreas Fink and Franz Rothlauf (Eds.)
Advances in Computational Intelligence in Transport, Logistics, and Supply Chain Management, 2008
ISBN 978-3-540-69024-5

Vol. 145. Mikhail Ju. Moshkov, Marcin Piliszczuk and Beata Zielosko
Partial Covers, Reducts and Decision Rules in Rough Sets, 2008
ISBN 978-3-540-69027-6

Vol. 146. Fatos Xhafa and Ajith Abraham (Eds.)
Metaheuristics for Scheduling in Distributed Computing Environments, 2008
ISBN 978-3-540-69260-7

Vol. 147. Oliver Kramer
Self-Adaptive Heuristics for Evolutionary Computation, 2008
ISBN 978-3-540-69280-5

Vol. 148. Philipp Limbourg
Dependability Modelling under Uncertainty, 2008
ISBN 978-3-540-69286-7

Vol. 149. Roger Lee (Ed.)
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, 2008
ISBN 978-3-540-70559-8

Vol. 150. Roger Lee (Ed.)
Software Engineering Research, Management and Applications, 2008
ISBN 978-3-540-70774-5

Vol. 151. Tomasz G. Smolinski, Mariofanna G. Milanova and Aboul-Ella Hassanien (Eds.)
Computational Intelligence in Biomedicine and Bioinformatics, 2008
ISBN 978-3-540-70776-9

Vol. 152. Jarosław Stepaniuk
Rough – Granular Computing in Knowledge Discovery and Data Mining, 2008
ISBN 978-3-540-70800-1

Vol. 153. Carlos Cotta and Jano van Hemert (Eds.)
Recent Advances in Evolutionary Computation for Combinatorial Optimization, 2008
ISBN 978-3-540-70806-3

Vol. 154. Oscar Castillo, Patricia Melin, Janusz Kacprzyk and Witold Pedrycz (Eds.)
Soft Computing for Hybrid Intelligent Systems, 2008
ISBN 978-3-540-70811-7

Vol. 155. Hamid R. Tizhoosh and M. Ventresca (Eds.)
Oppositional Concepts in Computational Intelligence, 2008
ISBN 978-3-540-70826-1

Vol. 156. Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Innovations in Bayesian Networks, 2008
ISBN 978-3-540-85065-6

Vol. 157. Ying-ping Chen and Meng-Hiot Lim (Eds.)
Linkage in Evolutionary Computation, 2008
ISBN 978-3-540-85067-0

Vol. 158. Marina Gavrilova (Ed.)
Generalized Voronoi Diagram: A Geometry-Based Approach to Computational Intelligence, 2009
ISBN 978-3-540-85125-7

Vol. 159. Dimitri Plemenos and Georgios Miaoulis (Eds.)
Artificial Intelligence Techniques for Computer Graphics, 2009
ISBN 978-3-540-85127-1

Vol. 160. P. Rajasekaran and Vasantha Kalyani David
Pattern Recognition using Neural and Functional Networks, 2009
ISBN 978-3-540-85129-5

Vol. 161. Francisco Baptista Pereira and Jorge Tavares (Eds.)
Bio-inspired Algorithms for the Vehicle Routing Problem, 2009
ISBN 978-3-540-85151-6

Vol. 162. Costin Badica, Giuseppe Mangioni, Vincenza Carchiolo and Dumitru Dan Burdescu (Eds.)
Intelligent Distributed Computing, Systems and Applications, 2008
ISBN 978-3-540-85256-8

Vol. 163. Pawel Delimata, Mikhail Ju. Moshkov, Andrzej Skowron and Zbigniew Suraj
Inhibitory Rules in Data Analysis, 2009
ISBN 978-3-540-85637-5

Vol. 164. Nadia Nedjah, Luiza de Macedo Mourelle, Janusz Kacprzyk, Felipe M.G. Franca and Alberto Ferreira de Souza (Eds.)
Intelligent Text Categorization and Clustering, 2009
ISBN 978-3-540-85643-6

Vol. 165. Djamel A. Zighed, Shusaku Tsumoto, Zbigniew W. Ras and Hakim Hacid (Eds.)
Mining Complex Data, 2009
ISBN 978-3-540-88066-0

Vol. 166. Constantinos Koutsojannis and Spiros Sirmakessis (Eds.)
Tools and Applications with Artificial Intelligence, 2009
ISBN 978-3-540-88068-4


Constantinos Koutsojannis and Spiros Sirmakessis (Eds.)

Tools and Applications with Artificial Intelligence


Prof. Spiros Sirmakessis
Research Academic Computer Technology Institute
"D. Maritsas" Building
N. Kazantzaki str.
Patras University Campus
26500 Patras
Greece

Prof. Constantinos Koutsojannis
Computer Engineering and Informatics Department
University of Patras
26500 Patras
Greece

ISBN 978-3-540-88068-4 e-ISBN 978-3-540-88069-1

DOI 10.1007/978-3-540-88069-1

Studies in Computational Intelligence ISSN 1860-949X

Library of Congress Control Number: 2008935499

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com


Preface

In recent years, the use of Artificial Intelligence (AI) techniques has greatly increased. The term "intelligence" seems to be a "must" in a large number of European and international project calls. AI techniques have been used in almost every domain. Application-oriented systems usually incorporate some kind of "intelligence" by using techniques stemming from intelligent search, knowledge representation, machine learning, knowledge discovery, intelligent agents, computational intelligence, etc. The Workshop on "Applications with Artificial Intelligence" sought quality papers on computer applications that incorporate some kind of AI technique. The objective of the workshop was to bring together scientists, engineers and practitioners who design or develop applications that use intelligent techniques, or who work on intelligent techniques and apply them to application domains (such as medicine, biology and education), to present and discuss their research work and exchange ideas. Results of project-based work were very welcome.

Topics of interest for the workshop, as far as application domains were concerned, included (but were not limited to) the following:

• Biology
• Computer Networks
• Computer Security
• E-commerce
• Education
• Engineering
• Finance
• Health Care & Medicine
• Logistics
• Multimedia
• Psychology
• Power Systems
• Sociology
• Web Applications

Topics of interest, as far as intelligent techniques were concerned, included (but were not limited to) the following:

• Evolutionary Algorithms
• Fuzzy Logic
• Hybrid Techniques
• Heuristic Search
• Intelligent Agents
• Knowledge-Based Systems
• Knowledge Representation
• Knowledge Discovery
• Machine Learning
• Neural Networks
• Planning
• Semantic Web Techniques

We would like to express our appreciation to all authors of submitted papers, to the members of the program committee and to all the people who have worked for this event.

This workshop could not have been held without the outstanding efforts of Marios Katsis. Finally, recognition and acknowledgement are due to all members of the Internet and Multimedia Research Unit at the Research Academic Computer Technology Institute and to the eBusiness Lab staff.

Constantinos Koutsojannis

Spiros Sirmakessis


Organization

Program Committee

Constantinos Koutsojannis (Co-chair)

Department of Physiotherapy, Technological Educational Institution of Patras, Greece

Spiros Sirmakessis (Co-chair) Department of Applied Informatics in Administration and Economy, Technological Educational Institution of Messolongi, Greece

Spiros Likothanasis Computer Engineering and Informatics Department, University of Patras, Greece

Maria Rigkou Computer Engineering and Informatics Department, University of Patras, Greece

Martin Rajman Global Computing Center, Ecole Polytechnique Federale de Lausanne, Switzerland

Michalis Xenos Hellenic Open University, Greece

Thrasyvoulos Tsiatsos Department of Informatics, Aristotle University of Thessaloniki, Greece

Georgios Miaoulis Department of Informatics, TEI of Athens, Greece

Nikitas Karanikolas Department of Informatics, TEI of Athens, Greece

Dimitri Plemenos XLIM laboratory, University of Limoges, France

Pierre-François Bonnefoi XLIM laboratory, University of Limoges, France

Grigorios Beligiannis Department of Business Administration in Food and Agricultural Enterprises, University of Ioannina, Greece

Vassilios Tampakas Department of Accounting, TEI of Patras, Greece


Contents

Evaluation of Morphological Features for Breast Cells Classification Using Neural Networks
Harsa Amylia Mat Sakim, Nuryanti Mohd Salleh, Mohd Rizal Arshad, Nor Hayati Othman ..... 1

A BP-Neural Network Improvement to Hop-Counting for Localization in Wireless Sensor Networks
Yurong Xu, James Ford, Eric Becker, Fillia S. Makedon ..... 11

SudokuSat—A Tool for Analyzing Difficult Sudoku Puzzles
Martin Henz, Hoang-Minh Truong ..... 25

Understanding and Forecasting Air Pollution with the Aid of Artificial Intelligence Methods in Athens, Greece
Kostas D. Karatzas, George Papadourakis, Ioannis Kyriakidis ..... 37

Multiple Target Tracking for Mobile Robots Using the JPDAF Algorithm
Aliakbar Gorji, Mohammad Bagher Menhaj, Saeed Shiry ..... 51

An Artificial Market for Emission Permits
Ioannis Mourtos, Spyros Xanthopoulos ..... 69

Application of Neural Networks for Investigating Day-of-the-Week Effect in Stock Market
Virgilijus Sakalauskas, Dalia Kriksciuniene ..... 77

WSRPS: A Weaning Success Rate Prediction System Based on Artificial Neural Network Algorithms
Austin H. Chen, Guan Ting Chen ..... 89

Dealing with Large Datasets Using an Artificial Intelligence Clustering Tool
Charalampos N. Moschopoulos, Panagiotis Tsiatsis, Grigorios N. Beligiannis, Dimitrios Fotakis, Spiridon D. Likothanassis ..... 105

An Application of Fuzzy Measure and Integral for Diagnosing Faults in Rotating Machines
Masahiro Tsunoyama, Hirokazu Jinno, Masayuki Ogawa, Tatsuo Sato ..... 121

A Multi-agent Architecture for Sensors and Actuators' Fault Detection and Isolation in Case of Uncertain Parameter Systems
Salma Bouslama Bouabdallah, Ramla Saddam, Moncef Tagina ..... 135

A User-Friendly Evolutionary Tool for High-School Timetabling
Charalampos N. Moschopoulos, Christos E. Alexakos, Christina Dosi, Grigorios N. Beligiannis, Spiridon D. Likothanassis ..... 149

HEPAR: An Intelligent System for Hepatitis Prognosis and Liver Transplantation Decision Support
Constantinos Koutsojannis, Andrew Koupparis, Ioannis Hatzilygeroudis ..... 163

Improving Web Content Delivery in eGovernment Applications
Kostas Markellos, Marios Katsis, Spiros Sirmakessis ..... 181

Improving Text-Dependent Speaker Recognition Performance
Donato Impedovo, Mario Refice ..... 199

Author Index ..... 213


Evaluation of Morphological Features for Breast Cells Classification Using Neural Networks

Harsa Amylia Mat Sakim1, Nuryanti Mohd Salleh1, Mohd Rizal Arshad1, and Nor Hayati Othman2

1 School of Electrical and Electronic Engineering, Universiti Sains Malaysia, Engineering Campus, 14300 Nibong Tebal, Seberang Perai Selatan, Pulau Pinang, Malaysia [email protected] 2 School of Medical Sciences, Universiti Sains Malaysia, Health Campus, 16150 Kubang Kerian, Kelantan, Malaysia [email protected]

Abstract. Rapid technology advancement has contributed to achievements in medical applications. Detecting cancer at its earliest stage is very important for effective treatment. Innovation in the diagnostic features of tumours may play a central role in the development of new treatment methods. The purpose of this study is therefore to evaluate proposed morphological features for classifying breast cancer cells. In this paper, the morphological features were evaluated using neural networks. The features were presented to several neural network architectures to investigate the most suitable neural network type for classifying the features effectively. The performance of the networks was compared based on the resulting mean squared error, accuracy, false positives, false negatives, sensitivity and specificity. The optimum network for classification of breast cancer cells was found to be the Hybrid Multilayer Perceptron (HMLP) network. The HMLP network was then employed to investigate the diagnostic capability of the features individually and in combination. The features were found to have important diagnostic capabilities. Training the network with a larger number of dominant morphological features was found to significantly increase the diagnostic capabilities. A combination of the proposed features gave the highest accuracy of 96%.

Keywords: Morphological features, Breast cancer, Fine needle aspirates, Neural network, Classification.

1 Introduction

Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related death among women. It is estimated that approximately one in 12 women will develop breast cancer in their lifetime. The majority of breast cancers (95%) are sporadic; only a small proportion, particularly those diagnosed in young women, are due to a highly penetrant autosomal-dominant trait. There has been considerable progress in the identification and localization of the morphological features of tumors responsible for hereditary breast cancer. Early detection is key to recognizing the stage of the disease so that a proper treatment can be implemented [6].

An alternative diagnostic method to mammography is the fine needle aspiration (FNA) technique. FNA provides a way to examine a small amount of tissue from the tumor. Conventionally, FNA smear slides are viewed under the microscope to determine malignancy. By carefully examining both the characteristics of individual cells and important contextual features, for instance the size of cell clumps, pathologists have been able to diagnose successfully using FNA [7]. Specific morphological patterns or features are investigated before diagnoses are given. However, many different features are thought to be correlated with malignancy, and the process remains highly subjective, depending upon the skill and experience of the pathologist. Pathologists commonly investigate the morphological features of the cells in order to identify features that are thought to have diagnostic abilities. The diagnosis process takes a long time and is costly.

Innovations in neural network methods and their application in the medical field have made prediction of cancer cells easier. Some of these methods have been reported to be more accurate than conventional methods [3]. Consequently, breast cancer diagnosis systems based on artificial intelligence have been widely implemented as an alternative. This work was therefore carried out to evaluate the diagnostic ability of the morphological features.

2 Morphological Features

The morphological features provide information about the size, shape and texture of a cell. The nuclear size is expressed by the radius and area of the cell. The nuclear shape is expressed by the compactness, roundness, smoothness, length of the major and minor axes, fractal dimension, symmetry and concavity. Both size and shape are expressed by the perimeter feature. On the other hand, nuclear texture is verified by finding the variance of the grey scale intensities in the component pixels [1].

In this paper, the inputs to the neural network are the morphological features of breast cancer cells obtained from fine needle aspiration biopsy. In general, common features of malignant patterns are:

1. High cell yield.
2. A single population of atypical epithelial cells.
3. Irregular angulated clusters of atypical cells.
4. Reduced cohesiveness of epithelial cells.
5. Nuclear enlargement and irregularity of variable degree.
6. Single cells with intact cytoplasm.
7. Absence of single bare nuclei of benign type.
8. Necrosis - important if present.

On the other hand, benign tissues will have overall low cell yield. Sheets of ductal cells with small uniform nuclei would appear with myoepithelial nuclei visible among epithelial cells in aggregates. There will also be single, bare, oval nuclei separated from epithelial aggregates [4].

However, only five morphological features were used in this evaluation. The features comprise Cellularity of Cells, Cells in Cluster, Cells in Discrete, Xlength and Ylength. Cellularity, or cell density, is a measure of the distribution of breast cells on a smear. A smear with poor cell yield generally indicates normal breast gland tissue. Cellularity has been divided into three categories, ranked from 1 (poor density) up to 3 (highly dense). Cells visualized on a smear can be in clusters (grouped) or in discrete (individually placed, separated from other cells). Irregularity in distribution generally indicates abnormality. Estimated counts of cells in discrete and cells in clusters are included as two separate inputs to the neural network. Xlength refers to the shortest and Ylength to the longest cells visualized. Both measurements were made in micrometers (μm).

3 Methodology

In this study, a preliminary investigation was carried out to evaluate the most suitable network type for classifying breast cancer cells. The morphological feature data obtained from FNA smears were used as inputs to train the artificial neural networks. The best network architecture was determined based on results on the testing set in terms of:

1. Classification accuracy, false positive, false negative, sensitivity and specificity percentages.

2. Mean Squared Error (MSE) nearest or equal to 0. For ease of comparison, the MSE values were quoted in decibels (dB); a minimal conversion sketch is given below.
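The chapter does not state the exact dB convention, so the sketch below simply assumes the usual 10·log10 definition, which reproduces the sign and magnitude of the values reported later (negative dB for MSE < 1):

```python
import math

def mse_to_db(mse: float) -> float:
    """Convert a mean squared error to decibels.

    Assumption: 10 * log10(MSE); the chapter quotes MSE in dB without
    spelling out the formula, so this is only an illustrative convention.
    """
    return 10.0 * math.log10(mse)

# Example: an MSE of 0.00083 corresponds to roughly -30.8 dB, the order of
# magnitude reported for the HMLP network during training (Table 3).
print(round(mse_to_db(0.00083), 2))
```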

The optimum network identified was then used to exploit the morphological features effectively. During this stage, individual features and various combinations of features were applied to the network to produce classifications. Each feature was then omitted in turn to evaluate its impact on the network's diagnostic performance.

3.1 Training and Testing Data

Morphological features obtained from fine needle aspiration (FNA) of breast cells were used to train and test the neural networks. A total of 804 FNA samples of breast cells were analyzed, and each sample was treated as an individual case. There were 538 (67%) benign and 266 (33%) malignant cases. These cases were randomly selected and grouped into two data sets: a training set and a testing set. A total of 404 cases were used for training, while 400 cases were used for testing. The training set was used to estimate the network parameters and to train the network to identify the morphological features effectively. The testing set was used to evaluate the performance of the model and to assess how well the neural network architecture distinguishes between benign and malignant breast cells. The composition of the training and testing data sets is given in Table 1.

Table 1. Divisions of morphological features data sets

Total cases = 804

Data Set     Training   Testing
Benign          290        248
Malignant       114        152
Total           404        400
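The split described above is a simple random selection; a minimal sketch of one way it might be performed is shown below. The exact procedure and random seed used by the authors are not stated, so the helper name and interface are purely illustrative:

```python
import random

def split_cases(cases, n_train=404, seed=None):
    """Randomly split a list of cases into a training and a testing set.

    A sketch only: the chapter says 404 of the 804 cases were randomly
    selected for training and the remaining 400 for testing, but it does not
    give the selection procedure itself.
    """
    rng = random.Random(seed)
    shuffled = cases[:]            # copy so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical usage: `all_cases` would hold the 804 FNA feature records.
# train_set, test_set = split_cases(all_cases)
```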

Page 13: Tools And Applications With Artificial Intelligence Studies In Computational Intelligence

4 H.A.M. Sakim et al.

3.2 Optimum Neural Network Investigation

To investigate the optimum neural network architecture for breast cancer cell diagnosis, the morphological feature data were first applied to six types of neural network and their performance compared. The trained networks were the Multilayer Perceptron (MLP), Multilayer Perceptron Sigmoid (MLPSig), Hybrid Multilayer Perceptron (HMLP), Radial Basis Function (RBF), Hybrid Radial Basis Function (HRBF) [2] and Self-Organizing Map (SOM) networks.

The neural networks established consisted of three layers: an input layer, a hidden layer and an output layer. To compare the different types of network, only one hidden layer was employed for all network types. There was only one output node, which corresponds to the classification of breast cancer cells. The network's output was scaled to range from 0 to 1 and interpreted using a cut-off point of 0.5: a high output (≥0.5) was considered to indicate malignancy (a cancerous cell), while a low output (<0.5) indicated benignity (a normal cell). The input layer had five nodes, corresponding to the five morphological features under study. Specific integer values were proposed to encode each of the morphological features as the network's inputs.
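For illustration, the sketch below runs a generic three-layer, five-input, single-output forward pass and applies the 0.5 cut-off described above. The weights are random placeholders and the network is an ordinary MLP, not the authors' trained HMLP:

```python
import numpy as np

# A minimal sketch of the 5-input, one-hidden-layer, single-output structure
# described above; weights are random stand-ins, not trained parameters.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 5 inputs -> hidden layer -> single output in [0, 1]."""
    h = sigmoid(x @ w_hidden + b_hidden)
    return sigmoid(h @ w_out + b_out)

n_inputs, n_hidden = 5, 7               # 7 hidden nodes, as found optimal for HMLP
w_hidden = rng.normal(size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=n_hidden)
b_out = 0.0

x = np.array([3, 2, 4, 20.0, 45.0])     # hypothetical encoded feature vector
y = forward(x, w_hidden, b_hidden, w_out, b_out)
print("malignant" if y >= 0.5 else "benign")
```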

The corresponding diagnostic results given by the pathologist were assigned the value 1 for malignant and 0 for benign. The actual output of the neural network is compared to these target values (0s and 1s) for the comparative measurements. The number of training epochs and the number of nodes in the hidden layer need to be investigated to find the optimum network structure.

Table 2. Variables proposed for classification network

Input Marker           Categories                               Neural Network Inputs
Cellularity of Cells   High                                     3
                       Moderate                                 2
                       Poor                                     1
Cells in Discrete      >100                                     3
                       50-100                                   2
                       <50                                      1
Cells in Cluster       >51                                      4
                       31-50                                    3
                       10-30                                    2
                       <10                                      1
X Length               Mean of shortest length of cells in µm
Y Length               Mean of longest length of cells in µm
Output                 Malignant                                1
                       Benign                                   0
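To make the encoding of Table 2 concrete, the sketch below maps raw smear measurements onto the integer inputs. The function name and the handling of counts that fall exactly on a range boundary are assumptions, not the authors' code:

```python
def encode_features(cellularity: str, n_discrete: int, n_cluster: int,
                    xlength_um: float, ylength_um: float):
    """Map raw smear measurements onto the network inputs of Table 2.

    A sketch under the encodings listed in Table 2; boundary handling for
    counts such as exactly 51 clustered cells is an assumption.
    """
    cellularity_code = {"poor": 1, "moderate": 2, "high": 3}[cellularity]

    if n_discrete > 100:
        discrete_code = 3
    elif n_discrete >= 50:
        discrete_code = 2
    else:
        discrete_code = 1

    if n_cluster > 51:
        cluster_code = 4
    elif n_cluster >= 31:
        cluster_code = 3
    elif n_cluster >= 10:
        cluster_code = 2
    else:
        cluster_code = 1

    # Xlength and Ylength enter the network as measured values in micrometers.
    return [cellularity_code, discrete_code, cluster_code, xlength_um, ylength_um]

# Example: a highly cellular smear with 120 discrete cells and 40 clustered cells.
print(encode_features("high", 120, 40, 18.5, 52.0))
```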


In general, each network type was initially trained with a fixed number of hidden nodes while the number of epochs was varied. Then, training and testing were repeated for increasing numbers of hidden nodes. In the literature, there is no specific rule for determining the number of nodes in the hidden layer. Nevertheless, in a work by Choong et al. [5] on predicting recurrence for node-positive patients, a neural network with the least number of layers and nodes was defined as the optimum network. Table 2 shows the proposed features for classification.
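The search over hidden-node counts and epochs can be pictured as a simple nested loop. In the sketch below, train_and_test is a hypothetical placeholder for training one network configuration and returning its testing accuracy and MSE, and the tie-breaking preference for fewer hidden nodes follows the criterion of Choong et al. cited above:

```python
# A sketch of the hidden-node / epoch search described above.
# `train_and_test(n_hidden, n_epochs)` is assumed to train one network and
# return (testing accuracy, testing MSE); it is not the authors' code.
def select_architecture(train_and_test, hidden_nodes_range, epoch_range):
    best = None
    for n_hidden in hidden_nodes_range:
        for n_epochs in epoch_range:
            acc, mse = train_and_test(n_hidden, n_epochs)
            # prefer higher accuracy, then lower MSE, then fewer hidden nodes
            candidate = (acc, -mse, -n_hidden, n_hidden, n_epochs)
            if best is None or candidate > best:
                best = candidate
    acc, neg_mse, _, n_hidden, n_epochs = best
    return n_hidden, n_epochs, acc, -neg_mse

# Hypothetical usage:
# n_hidden, n_epochs, acc, mse = select_architecture(train_and_test,
#                                                    range(1, 21), range(1, 51))
```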

4 Results and Discussion

The summary of the performance of all networks trained using the morphological data is tabulated in Table 3. The table shows the best number of hidden nodes (HN) and epochs for each type of network investigated. From this table, it can be seen that all the networks investigated were able to give more than 90% accuracy on both the training and testing data, except the SOM network, whose testing accuracy fell below 90% (84%). During training, the HMLP network performed best, with 96% accuracy and an MSE of -30.81 dB. During testing, its accuracy was the highest at 99%, with the lowest MSE of -36.66 dB.

Table 3. Performance of all networks trained using morphological features

Network    No. of   No. of        Training                 Testing
Type         HN     Epochs   MSE (-dB)  ACC (%)      MSE (-dB)  ACC (%)
MLP          10       11       27.43      94           33.35      98
MLPSig        8        6       29.26      95           34.21      98
HMLP          7        6       30.81      96           36.66      99
RBF           7        4       23.23      94           26.42      98
HRBF         13        2       27.84      94           32.58      98
SOM          10       50       14.09      92           12.05      84

Based on the results tabulated in Table 3, the HMLP network with 7 hidden nodes, trained for 6 epochs, was used to analyze the morphological features. When the input features were omitted one at a time, the classification results are as shown in Table 4. FP%, FN%, SN%, SP% and ACC% denote the percentages of false positives, false negatives, sensitivity, specificity and accuracy, respectively; a sketch of how these percentages can be computed from a confusion matrix follows Table 4. From Table 4, some observations could be made:

1. The network specificity and accuracy were reduced the most when Ylength was omitted (using all features but Ylength). The increase in MSE and in the false positive percentage was very large; the network could not identify benign cases correctly. Ylength is therefore considered an important diagnostic feature.

2. Omitting Xlength affects the classification such that only benign cases were correctly identified. The network could not detect any of the malignant cases (100% false negatives), thus reducing overall accuracy. The MSE increased the most.


3. Omitting Cluster enabled the network to identify all malignant cases. However, there was a reduction in benign cases identified, thus reducing overall accuracy.

4. Omitting Discrete has a positive effect on sensitivity but a negative effect on specificity; the overall accuracy was not affected. Comparatively, Discrete can be considered not to be a significant feature, although the MSE did increase, so the feature may still be useful for difficult cases.

Table 4. Results of classification when input features were omitted one at a time

Input Nodes             FP (%)  FN (%)  SN (%)  SP (%)  ACC (%)  Final MSE (-dB)
All features included      3       6      94      97      96        29.52
Cellularity omitted        3      18      82      97      92        23.90
Cluster omitted           26       0     100      74      83        18.31
Discrete omitted           4       3      97      96      96        28.13
Xlength omitted            0     100       0     100      67         3.94
Ylength omitted           93       3      97       7      37         9.42
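For reference, the sketch below shows how the percentages used in Tables 3 to 6 can be computed from a confusion matrix. The choice of denominators (benign cases for FP% and SP%, malignant cases for FN% and SN%, all cases for ACC%) is an assumption that is consistent with the tabulated values but is not spelled out in the text:

```python
def rates(tp: int, tn: int, fp: int, fn: int):
    """FP%, FN%, SN%, SP% and ACC% from a confusion matrix.

    Assumed denominators: benign cases for FP%/SP%, malignant cases for
    FN%/SN%, and all cases for ACC%.
    """
    malignant = tp + fn
    benign = tn + fp
    total = malignant + benign
    return {
        "FP%": 100.0 * fp / benign,
        "FN%": 100.0 * fn / malignant,
        "SN%": 100.0 * tp / malignant,
        "SP%": 100.0 * tn / benign,
        "ACC%": 100.0 * (tp + tn) / total,
    }

# Example: labelling every case benign over the full 804-case set
# (538 benign, 266 malignant) gives SN 0%, SP 100% and ACC about 67%,
# matching the rows of Table 4 in which no malignant case is detected.
print(rates(tp=0, tn=538, fp=0, fn=266))
```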

Various combinations of input markers were then presented to the HMLP network. The classification results on all the data are shown in Table 5. Applying single inputs one at a time to the HMLP network gave the classification results shown in Table 6. Some observations made were:

1. The combination of the features Xlength and Ylength improved classification accuracy. The network achieved 100% sensitivity with no false negatives: all malignant cases were correctly identified.

2. The combination of the features Cluster, Xlength and Ylength was able to correctly classify most of the patients. Compared to the network with Xlength and Ylength only, more benign cases were correctly identified, although some malignant cases were missed.

3. The network with the combination of Discrete, Xlength and Ylength was able to identify many of the malignant cases. However, many benign cases were missed, thus reducing overall accuracy.

4. The combination of Cellularity and Xlength gave the lowest classification accuracy in the two-input-node category, while the combination of Cellularity, Discrete and Xlength gave the lowest classification accuracy in the three-input-node category. However, both combinations were more sensitive towards malignant cases than towards benign cases.


Table 5. Results of classification when using combination of input features

Input Nodes              FP (%)  FN (%)  SN (%)  SP (%)  ACC (%)  Final MSE (-dB)
3 Input Nodes
Cell, Clust & Discrete      0     100       0     100      67         7.80
Cell, Clust & XL           85       4      96      15      42        10.80
Cell, Clust & YL            0     100       0     100      67         4.95
Cell, Discrete & XL        96       3      97       4      35         8.58
Cell, Discrete & YL         0     100       0     100      67         6.73
Cell, XL & YL              34       0     100      66      77        16.18
Clust, Discrete & XL       47       3      97      53      67        14.43
Clust, Discrete & YL       26     100       0      74      50        -0.63
Clust, XL & YL              3      13      87      97      94        24.35
Discrete, XL & YL          22       2      98      78      85        19.08
2 Input Nodes
Cell & Clust                0     100       0     100      67         8.15
Cell & Discrete             0     100       0     100      67         8.26
Cell & XL                  92       4      96       8      37         9.00
Cell & YL                   0     100       0     100      67         7.39
Clust & Discrete            0     100       0     100      67         6.58
Clust & XL                 42       3      97      58      71        14.73
Clust & YL                 26     100       0      74      49        -0.07
Discrete & XL              60       3      97      40      59        13.52
Discrete & YL              16     100       0      84      56         1.44
XL & YL                    36       0     100      64      76        16.41

Cell=Cellularity, Clust=Cluster, XL=Xlength, YL=Ylength.

Table 6. Results of classification when using one feature at a time

Input Nodes        FP (%)  FN (%)  SN (%)  SP (%)  ACC (%)  Final MSE (-dB)
Cellularity only      0     100       0     100      67         9.05
Cluster only          0     100       0     100      67         6.13
Discrete only         0     100       0     100      67         7.02
Xlength only         53       3      97      47      63        13.87
Ylength only          0     100       0     100      67         2.17



5. The feature Xlength on its own has a positive effect on sensitivity and a negative effect on specificity. Combining this feature with others is important for detecting malignancy.

6. Single-input applications, for instance Cellularity, Cluster, Discrete or Ylength alone, were not sensitive towards malignant cases (they identified only benign cases).

5 Conclusion

In this study, the morphological features were found to have diagnostic capability, indicating tumor aggressiveness at the cellular level. Although each feature on its own gave some correct classifications, the individual features were relatively insignificant: when employed individually they did not give high classification results, and most features identified only benign cases well. Collectively, however, the features gave high accuracies. For example, the feature Xlength on its own classified many of the malignant cases, but at the expense of many misclassified benign cases; when combined with Ylength, all the malignant cases could be classified and more benign cases were identified.

Although the feature Ylength correctly classified all benign cases on its own, it identified none of the malignant cases. Moreover, employing all the morphological features except Ylength degraded overall network performance, so Ylength can be considered relatively the most significant morphological feature. When the combination of Cellularity, Cluster, Xlength and Ylength was used as input to the HMLP network, the highest accuracy of 96% was achieved. Although this accuracy was the same as when all features were employed, the sensitivity and specificity were higher. This shows that a combination of markers can be more useful in classification even when the individual markers do not seem important on their own. Therefore, morphological features that are thought to be unimportant should also be included so that their impact in combination with other markers can be studied.

In other words, training the network with too few morphological features can significantly reduce its prediction capability; classification performance improves when more dominant features are employed. This indicates that morphological features play an important role in network classification.

Acknowledgments. The data for this study were provided by the Pathology Department, School of Medical Sciences, USM, Kubang Kerian, Kelantan, Malaysia.

References

[1] Demir, C., Yener, B.: Automated Cancer Diagnosis Based on Histopathological Images: A Systematic Survey. Technical Report, Rensselaer Polytechnic Institute, Department of Computer Science, TR-05-09 (2005)

[2] Mat Sakim, H.A., Mat Isa, N.A., Naguib, R.N.G., Sherbet, G.V.: Image Cytometry Data From Breast Lesions Analyzed using Hybrid Networks. In: Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, September 1-4 (2005)


[3] Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey (1994)

[4] Trott, P.A.: Breast Cytopathology: A Diagnostic Atlas. Chapman & Hall Medical, Boca Raton (1996)

[5] Choong, P.L., DeSilva, C.J.S., Attikiouzel, Y.: Predicting local and distant metastasis for breast cancer patients using the Bayesian Neural Network. In: Int. Conference on Digital Signal Processing, vol. 1, pp. 83–88 (1997)

[6] Lakhani, S.R.: The Pathology of Familial Breast Cancer: Morphological Aspects. Breast Cancer Research 1(1), 31–35 (1999)

[7] Street, W.N., Wolberg, W.H., Mangasarian, O.L.: Nuclear Feature Extraction for Breast Tumor Diagnosis. In: IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology 1905, San Jose, California, pp. 861–870 (1993)


A BP-Neural Network Improvement to Hop-Counting for Localization in Wireless Sensor Networks

Yurong Xu1,2, James Ford1,2, Eric Becker1, and Fillia S. Makedon1,2

1 Heracleia Human Centered Computing Lab, Dept. of Computer Science and Engineering, Univ. of Texas at Arlington, Arlington, TX 76013 2 Dept. of Computer Science, Dartmouth College, Hanover NH 03755, USA {yurong,jford}@cs.dartmouth.edu, [email protected], [email protected]

Abstract. GDL (Geographic Distributed Localization) is a distributed localization technique that lets nodes in a network determine their relative locations without special hardware or configuration. In contrast to previously proposed measurement techniques, which require additional hardware (such as special antennas), this technique is based on hop counting and requires only network connectivity. In this paper, we propose to incorporate neural networks into the standard hop-counting-based approach to improve the accuracy of localization over the existing method. We use one neural network at each node to compute its hop coordinates, after a training procedure. Using network simulations, we show that incorporating our technique improves the accuracy of the localization results generated by GDL by 15%. The application of this method is not limited to the GDL algorithm; it can also be used in other hop-counting-based localization algorithms, such as MDS-MAP. It is also designed to be suitable for implementation on current embedded systems, such as standard motes like Moteiv's Invent.

Keywords: Neural Networks, Embedded Computation, Measurement, Hop-counting, Localization.

1 Introduction

Wireless Sensor Networks (WSNs) [1], an emerging type of ad-hoc network, integrate sensing, processing and wireless communication in a distributed system. WSNs have numerous applications, such as surveillance, healthcare, industrial automation, and military uses.

Many location-aware applications of WSNs require the location of nodes to be known. In addition, different protocols require that each node be aware of its location as accurately as possible. One such protocol is geographic routing, which is important in many WSNs [2].

Many physical features of a WSN have been utilized by different localization algorithms to generate the locations of nodes: the angle of arrival (AOA) [17], time of arrival (TOA) [19], time difference of arrival (TDOA) [7], the Received Signal Strength Indicator (RSSI) [8], Radio Interferometry (RI) [23], and hop information and connectivity. Usually, sensor nodes must deploy additional hardware for AOA, TOA and TDOA, or special antennas for AOA. As for RSSI, it depends on a rough relationship between distance and signal strength, but this relationship is affected unpredictably [8] by the antenna, battery, electric components inside the node and the outside environment. RI requires accurate synchronization. In contrast with the other physical features that can be used in localization algorithms, hop information and connectivity are much more stable and can be obtained in most currently deployed WSNs without any additional hardware.

Extensive simulation experiments on different localization algorithms [9, 10, 14-16] based only on hop-counting or connectivity have shown that there is usually a sizeable error in the locations generated by the algorithms in comparison with the actual locations, even after refinement processes such as the ones in [15] and [16]. While approximate localization is acceptable for a number of purposes, including routing, increasing the accuracy of localization would be beneficial for many applications.

This paper aims to identify the problems leading to low accuracy in hop-counting, and focuses on how to correct the inaccuracies that occur in hop-counting with the help of a neural network approach. For each node in the network, we employ a back-propagation (BP) neural network to let nodes learn and remember their local network structure; using this information, nodes can then distinguish different neighbors one hop away and improve the measurement accuracy of hop-counting. We call this method neural-network-based hop coordinates (NN hop coordinates). It not only counts the number of hops, but also offsets this count based on the local network structure.

The method can be used with any localization algorithm based on hop-counting. When combined with three well-known localization algorithms of this type (MDS-MAP [15], MDS-MAP(P) [16] and GDL [22]), it achieves about 86%, 25% and 15% improvements, respectively, in our simulations, compared with results from the same algorithms without NN-based hop coordinates.

2 Related Work

Node localization in WSNs has been an active research field for some time, and many localization algorithms have been proposed. Based on the different physical features of WSNs, we can roughly separate these into four categories: (1) some algorithms, such as [3], [9], [10], [14], [15] and [16], merely utilize the connectivity or hop count information in a WSN; (2) other algorithms, such as [17], use AOA in addition to hop-counting; (3) some, such as [19], use TOA information; and (4) some (e.g. [7]) use TDOA. Hop information and connectivity are available in currently implemented WSNs, while AOA requires additional devices such as ultrasound receivers [13] or directional antennas, and TOA and TDOA require nodes to have a second communication device with a different propagation speed, such as an ultrasound transceiver [13], in addition to their RF transceivers, to measure the delivery time. Recently, Spotlight, a light-based localization system that relies on additional hardware, was proposed [20], introducing a fifth category.

A number of methods exist (e.g. [11]) that do not need additional hardware but do require "anchor" nodes (nodes whose positions are known in advance). In this paper, we only consider localization algorithms that use hop information and connectivity alone, and that thus need no additional hardware and no special nodes such as anchor nodes. This excludes algorithms such as BVR [18], but still leaves us with several algorithms. DV-Hop and DV-distance by Niculescu and Nath [9], and Hop-TERRAIN by Savarese et al. [10], use hop-counting to calculate the locations of nodes. However, these methods are outperformed by the MDS-MAP series of algorithms described below and by GDL [21].

MDS-MAP [15] uses a multidimensional scaling (MDS) calculation to generate a localization map based on connectivity in WSNs. Shang et al. later presented a second algorithm called MDS-MAP(P) [16], which extends MDS-MAP into a distributed algorithm. MDS-MAP(P) first computes local maps for each node with MDS using local shortest paths, then merges local maps into a global map.

GDL, introduced in [22], provides another variation of hop-counting, which averages the counted hops from the neighbor nodes of each node to improve the accuracy of measurement. A problem with GDL is that this method does not consider the possibility of an irregular distribution of neighbors around a node. One extreme example is a very sparse network, such as a line of nodes variably spaced more than half a hop radius apart but within hop distance: since each node then has only two neighbors with different hop counts, the averaging calculations used by GDL are not beneficial.

Though the main idea of hop coordinates derives from hop-counting, our technique can not only be applied directly in localization algorithms based on hop-counting, but can also be applied in localization algorithms that need only connectivity, such as DV-hop, the MDS-MAP algorithms, and GDL. In the simulation experiments in this work, we apply hop coordinates in the MDS-MAP algorithms.

3 Problems

3.1 Hop-Counting Technique

Hop-counting needs no special hardware to calculate localization information. It is based on having a pre-appointed but arbitrary bootstrap node send out a hop-counting message with a variable hops = 0 inside; using that message, each node determines its hop distance from the bootstrap node and forwards the message after incrementing the variable hops.

Suppose that a node A is calculating this hop distance, and that node B is one of the neighbors in N_A of node A. Then the basic hop-counting procedure for node A can be shown as follows:


Procedure 1 (compute hops) in node A

    Input:  message(hop_B) from node B ∈ N_A
    Output: hop_A

    for each message(hop_B) from B ∈ N_A and ~TIMEOUT
        if (hop_B < hop_A)
            hop_A = hop_B + 1;
            forward(message(hop_A)) to MAC;
        else
            drop(message(hop_B));
        endif
    endfor
    return hop_A;

Here, hop_A and hop_B represent the hop distances from nodes A and B to some bootstrap node, N_A is the set of nodes that can be reached by node A as neighbors in k hop(s), and "MAC" refers to the medium access control layer.
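The same flood can be simulated centrally as a breadth-first traversal of the connectivity graph. The sketch below is only an illustration: the graph-dictionary input format and function name are assumptions, and in a real WSN the result emerges from distributed message forwarding over the MAC layer:

```python
from collections import deque

def hop_counts(neighbors: dict, bootstrap) -> dict:
    """Hop distance of every node from the bootstrap node.

    A centralized sketch of the flood in Procedure 1: `neighbors` maps each
    node to the set of nodes within one hop (an assumed input format).
    """
    hops = {bootstrap: 0}
    queue = deque([bootstrap])
    while queue:
        node = queue.popleft()
        for nbr in neighbors[node]:
            if nbr not in hops:            # first message to arrive wins, as in the flood
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return hops

# Example on a small line topology A - B - C - D with bootstrap A.
topology = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(hop_counts(topology, "A"))   # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
```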

3.2 Extending the Hop-Counting Technique

The basic idea behind hop-counting is that there is a special bootstrap node in the network, which sends out a message to flood the network, and all other nodes in the network use this message to count the number of hops to the bootstrap node. This is an efficient way to measure the distance between two nodes in a wireless sensor network, but a problem is that it only distinguishes nodes based on the number of hops between them, so many nodes share the same hop distance from the bootstrap node. One example is shown in Fig. 1.


Fig. 1. Nodes A and B share the same number of hops to the bootstrap node ("hop count") in a network, despite being significantly far apart. Here, the dots represent nodes in the network, the oversized node is the bootstrap node, and the number near each node is the hop count for that node.


From Fig. 1, we can see that nodes A and B are identified as having the same hop number based on the number of distinct hops it takes to reach them from the bootstrap node; in actuality, nodes A and B have different physical distances to the bootstrap node.

Unfortunately, this problem occurs frequently in a typical network, as illustrated in Fig. 2.


Fig. 2. Different nodes share same hop count. After hop-counting, there are many different nodes sharing the same hop counts, some examples of which are indicated by arrows.

In our previous work on GDL, we used an averaging technique to improve the hop-counting technique; here, we introduce a different approach based on BP neural networks.

4 Neural Networks

In this work, we use a standard three-layer neural network trained via a back-propagation algorithm. The three layers are the input layer, the hidden layer and the output layer, as shown in Fig. 3.

Given input nodes $x_k$, hidden nodes $h_k$ and output nodes $y_k$, we can calculate the values of the latter two of these based on the earlier ones plus sets of weights, $w^{(1)}$ and $w^{(2)}$, between layers:

$$h_j = F\Big[\sum_i w^{(1)}_{ij} \cdot x_i\Big]$$

$$y_k = F\Big[\sum_j w^{(2)}_{jk} \cdot h_j\Big]$$

Given a training set $\{x_n\}$ and desired target outputs $\{t_n\}$, to find $w^{(1)}$ and $w^{(2)}$ we minimize the network error, defined by

$$E = \sum_n \big(y(x_n) - t_n\big)^2$$


Fig. 3. Neural network with a hidden layer. Inputs (leftmost column) drive the input layer, which in turn causes propagation of information through weighted links (W(1)) to the hidden layer; this layer then influences an output layer with additional weighted links (W(2)), ultimately driving a set of outputs (rightmost column of nodes).

We use a standard back-propagation algorithm to train this neural network. Back-propagation uses supervised learning, in which the network is trained using data for which the inputs as well as the outputs are known. The algorithm is a generalization of the least-mean-squares approach, and it modifies the network weights to minimize the mean squared error between the desired and actual outputs of the network. The update of $w^{(2)}$ proceeds as follows:

$$\frac{\partial E}{\partial w^{(2)}_{jk}} = \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial w^{(2)}_{jk}} = \sum_n 2\,\big(y_k(x_n) - t_{n,k}\big)\, F'\Big[\sum_j w^{(2)}_{jk} h_j\Big]\, h_j$$

$$w^{(2)}_{jk} \leftarrow w^{(2)}_{jk} - a \cdot \frac{\partial E}{\partial w^{(2)}_{jk}}$$

(Here, $a$ is the learning rate.)

A similar procedure is used to calculate $w^{(1)}$. Once trained, the network weights are fixed and used to compute output values for new input samples.
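A minimal NumPy sketch of this training loop is given below; the sigmoid activation, layer sizes, learning rate and toy data are illustrative assumptions rather than the authors' configuration:

```python
import numpy as np

# A minimal back-propagation sketch for the three-layer network above.
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((64, 4))                     # 64 training samples, 4 inputs
T = rng.random((64, 2))                     # desired outputs in [0, 1], 2 outputs

w1 = rng.normal(scale=0.5, size=(4, 6))     # input -> hidden weights
w2 = rng.normal(scale=0.5, size=(6, 2))     # hidden -> output weights
a = 0.01                                    # learning rate

for epoch in range(500):
    h = sigmoid(X @ w1)                     # hidden activations
    y = sigmoid(h @ w2)                     # network outputs
    # dE/dw2 with E = sum (y - t)^2 and F'(z) = y * (1 - y) for the sigmoid
    delta_out = 2.0 * (y - T) * y * (1.0 - y)
    grad_w2 = h.T @ delta_out
    # propagate the error back through w2 to obtain dE/dw1
    delta_hidden = (delta_out @ w2.T) * h * (1.0 - h)
    grad_w1 = X.T @ delta_hidden
    w2 -= a * grad_w2
    w1 -= a * grad_w1

print("final squared error:",
      float(np.sum((sigmoid(sigmoid(X @ w1) @ w2) - T) ** 2)))
```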

5 NN-Based Hop Coordinates

5.1 Definition of Hop Coordinates

Unlike most other localization algorithms, which only count the number of hops or utilize connectivity between nodes, the NN-based hop coordinates introduced in this paper not only count the number of hops from some bootstrap node, but also offset these counts with local network information pertaining to each node.


Definition 1. A hop coordinate [22] is constructed from two parts: a number of hops and an offset. The first part is a positive integer which equals the number of hops in a minimum-hop route from some bootstrap node to a given node. The second part can be seen as a decimal fraction generated from local network information. We write the hop coordinate of a node A as the pair (hop_A, offset_A).

We will assume that hop_A is always equal to or larger than 0 and that 0 ≤ offset_A < 1.

In the following, N_A is the set of nodes that can be reached by node A in one hop, and |N_A| is the number of nodes in N_A.

5.2 Constructing the Neural Network

Before we describe our approach, let us first reexamine the example of hop-counting problems shown in Fig. 1. From that picture we can see that even though nodes A and B share the same hop count, they do not share the same neighbors, which are shown inside dashed circles.

For a given node, which we call node A, let N_A be the set of its neighbor nodes (nodes within one hop). We can prove as follows that there is no node B that is both a neighbor of node A and has a hop count that differs from node A's by more than one:

Theorem 1. For a given node A, if N_A is the set of neighbor nodes of node A, there is no node B which belongs to N_A and satisfies |hop_B - hop_A| > 1.

Proof: Suppose there is a node B that belongs to N_A with |hop_B - hop_A| > 1. Without loss of generality, assume that hop_B - hop_A > 1. Then, because node B is node A's neighbor, node B must have received a hop-counting message from node A, based on the hop-counting algorithm (Procedure 1); since the current hop_B is greater than the received hop_A + 1, hop_B must have been updated to hop_A + 1. This contradicts the assumption that |hop_B - hop_A| > 1. For the case hop_A - hop_B > 1, the proof is similar. □

So, based on Theorem 1, for a node A, there are only three types of nodes in N_A:

Nodes with a hop count equal to hop_A + 1, which we designate as N_A^+1.
Nodes with a hop count equal to hop_A - 1, which we designate as N_A^-1.
Nodes with a hop count equal to hop_A, which we designate as N_A^0.

We use the number of nodes of each type, i.e. |N_A^-1|, |N_A^+1| and |N_A|, together with hop_A, as inputs for the neural network in node A.


Fig. 4. Neural Networks for NN-based Hop Coordinates. Inputs (leftmost column) drive an input layer, which in turn causes propagation of information through weighted links (W(1)), to a hidden layer; this layer then influences an output layer with additional weighted links (W(2)), ultimately driving a set of outputs (rightmost column of nodes).

The previous presentation of the NN hop coordinates method assumes we can make use of floating-point arithmetic; however, a standard WSN node usually has only an 8-bit or 16-bit microprocessor without floating-point support [1]. We thus need to use integer values to represent offset_A and hop_A (here A is any arbitrary node). In practice, we can conveniently use one byte each to represent hop_A and offset_A, allowing values from 0 to 255, which we can interpret as described later in the paper. We designate these bytes as the high byte (HB) for the hop count and the low byte (LB) for the offset. If an application needs more resolution for hop coordinates, we can use two bytes, or a double word, for each.
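One possible byte packing consistent with this description is sketched below; the quantization of the offset into 1/256 steps is an assumption, since the exact mapping is not reproduced in this excerpt:

```python
def encode_hop_coordinate(hop: int, offset: float) -> tuple[int, int]:
    """Pack a hop coordinate into two bytes: HB = hop count, LB = offset.

    Assumption: the offset in [0, 1) is quantized into 256 steps; the paper
    uses one byte for each part but the exact quantization is not shown here.
    """
    if not (0 <= hop <= 255 and 0.0 <= offset < 1.0):
        raise ValueError("hop must fit in a byte and offset must be in [0, 1)")
    return hop, int(offset * 256)

def decode_hop_coordinate(hb: int, lb: int) -> float:
    """Recover an approximate fractional hop distance from the two bytes."""
    return hb + lb / 256.0

hb, lb = encode_hop_coordinate(7, 0.42)
print(hb, lb, decode_hop_coordinate(hb, lb))   # 7 107 7.41796875
```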

6 Simulation

6.1 Simulation Process

There are three steps in our simulation.

Step 1: Create the training data set with NS-2. We implement the hop coordinates algorithm as a routing agent and the bootstrap node program as a protocol agent in NS-2 version 2.29 [4] with the 802.15.4 MAC layer [6] and the CMU wireless extension [5]. The configuration for NS-2 is RF range = 15 meters, propagation = TwoRayGround, and antenna = Omni Antenna.

We use uniform placement: n nodes are placed on a grid with 100% of r randomized placement error, where r is the width of a small square inside the grid. We constructed a total of 60 placements with n = 100, 400, 900, and 1600, and with r = 2, 4, 6, 8, 10, and 12 meters, respectively. The reason we use uniform placement with 100% r error is that such a placement usually includes both node holes and islands, as shown in Fig. 5.


Fig. 5. A typical placement for simulation, constructed with n = 400, r = 4. Green ovals: holes; blue ones: islands.

A 6-element tuple is used to represent a training sample for an arbitrary node A: (|N_A|, |N_A^-1|, |N_A^+1|, hop_A, LB, HB). The elements represent, respectively, the number of neighbors, the number of neighbors with hop count hop_A - 1, the number of neighbors with hop count hop_A + 1, hop_A, and the LB and HB values described in Section 5.2. In our experiments, HB and LB are computed from the result of dividing the physical distance from node A to the bootstrap node by the transmission range R. We trained with around 64 samples per node as a training set.
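A sketch of how such tuples could be assembled from a known topology is shown below; the neighbors/positions/hops inputs stand in for data extracted from the NS-2 runs, and the 1/256 offset quantization mirrors the byte-encoding assumption made earlier:

```python
import math

def training_tuples(neighbors, positions, hops, bootstrap, R=15.0):
    """Build (|N_A|, |N_A^-1|, |N_A^+1|, hop_A, LB, HB) tuples for every node.

    A sketch only: `neighbors`, `positions` and `hops` are placeholders for
    data the authors obtain from NS-2 simulations.
    """
    bx, by = positions[bootstrap]
    samples = {}
    for node, nbrs in neighbors.items():
        h = hops[node]
        n_minus = sum(1 for n in nbrs if hops[n] == h - 1)
        n_plus = sum(1 for n in nbrs if hops[n] == h + 1)
        # target: true distance to the bootstrap node, in units of the range R
        x, y = positions[node]
        true_hops = math.dist((x, y), (bx, by)) / R
        hb = int(true_hops)                      # integer part  -> high byte
        lb = int((true_hops - hb) * 256)         # fractional part -> low byte
        samples[node] = (len(nbrs), n_minus, n_plus, h, lb, hb)
    return samples
```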

Step 2: Compute hop coordinates with the neural network. After we finish training the neural network in each node, we fix the internal layers in each node and use the network to compute the NN-based hop coordinates for the node, by inputting the 4-tuple (|N_A|, |N_A^-1|, |N_A^+1|, hop_A) to compute (LB, HB).

Step 3: Compute predicted maps using localization algorithms. As described in the related work summary above, the GDL and MDS-MAP series of localization algorithms outperform other localization algorithms, so in our simulation we use GDL and the MDS-MAP series algorithms to evaluate our technique.

By collecting the NN-based hop coordinates from its neighbors, a node can compute the shortest paths between all pairs of nodes among its neighbors using Dijkstra's algorithm. Then, we apply MDS-MAP [15], MDS-MAP(P) [16] and GDL [22] to the above data to compute relative maps. Given four fixed anchor nodes, we transform these relative maps into absolute maps.

Thus, by comparing with the physical position of each node in the network, we can obtain and compare the accuracy measurements with and without NN-based hop coordinates.
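The all-pairs shortest-path step can be sketched as repeated Dijkstra runs over the weighted connectivity graph; the input format below is an assumption for illustration, and the subsequent MDS/GDL map computation is not shown:

```python
import heapq

def shortest_path_matrix(weights):
    """All-pairs shortest-path distances via repeated Dijkstra runs.

    `weights[u][v]` is the edge length between neighboring nodes u and v
    (for example, derived from their fractional hop coordinates); this input
    format is an assumption for the sketch. The resulting distance matrix is
    the kind of input MDS-style algorithms take.
    """
    nodes = list(weights)
    dist = {u: {v: float("inf") for v in nodes} for u in nodes}
    for source in nodes:
        dist[source][source] = 0.0
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[source][u]:
                continue                      # stale heap entry
            for v, w in weights[u].items():
                if d + w < dist[source][v]:
                    dist[source][v] = d + w
                    heapq.heappush(heap, (d + w, v))
    return dist
```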

6.2 Simulation Results

We obtained results for the accuracy of localization by the MDS-MAP and GDL algorithms, with and without NN-based hop coordinates, on networks with n = 100, 400, 900, and 1600 nodes, and r = 2, 4, 6, 8, 10, and 12 meters, as shown in Fig. 6a for r = 2 through Fig. 6f for r = 12.

From Fig. 6(a, b, c), we can see that when r = 2, 4, 6, the accuracy of localization with NN-based hop coordinates is higher than that of MDS-MAP, but close to that of MDS-MAP(P) and GDL. Since MDS-MAP only utilizes connectivity information, NN-based hop coordinates help to improve its localization accuracy. In GDL and MDS-MAP(P), in contrast, the fact that both already incorporate techniques to improve accuracy, and that the network density is high, means that NN-based hop coordinates provide little additional benefit.

From Fig. 6 (d,e,f), we can see that when 6r > meters, improvements of accuracy in coordinates are significant in MDS-MAP, MDS-MAP(P) and GDL. As a result of

0

10

20

30

40

50

60

70

80

90

100

100 400 900 1600# of nodes

Err

or(

%)

MDS-MAPMDS-MAP(P)GDL w ith NN Hop-CoordinatesGDL w ithout NN Hop-Coordinates

0

10

20

30

40

50

60

70

80

90

100

100 400 900 1600# of nodes

Err

or(

%)

MDS-MAPMDS-MAP(P)GDL w ith NN Hop-CoordinatesGDL w ithout NN Hop-Coordinates

Fig. 6a. Comparison of the accuracy of localization of MDS-MAP, MDS-MAP(P) and GDL algorithms with or without the NN hop coordinates technique, on networks with n = 100, 400, 900, 1600 nodes, and r=2, 4, 6, 8, 10, or 12 meters, respectively. Here r=2. Notes: X-axis: # of nodes, Y-axis: the error of localization in percent.

Fig. 6b. Comparison of the accuracy of localization (r=4)

0

10

20

30

40

50

60

70

80

90

100

100 400 900 1600# of nodes

Err

or(%

)

MDS-MAPMDS-MAP(P)GDL w ith NN Hop-CoordinatesGDL w ithout NN Hop-Coordinates

0

10

20

30

40

50

60

70

80

90

100

100 400 900 1600# of nodes

Err

or(%

)

MDS-MAPMDS-MAP(P)GDL w ith NN Hop-CoordinatesGDL w ithout NN Hop-Coordinates

Fig. 6c. Comparison of the accuracy of localization (r=6)

Fig. 6d. Comparison of the accuracy of localization (r=8)

Page 29: Tools And Applications With Artificial Intelligence Studies In Computational Intelligence

A BP-Neural Network Improvement to Hop-Counting for Localization in WSNs 21


Fig. 6e. Comparison of the accuracy of localization (r=10)

Fig. 6f. Comparison of the accuracy of localization (r=12)

the NN training process, NN-based hop-coordinates provide a larger accuracy improvement than the techniques used in the MDS-MAP and GDL algorithms.

Then, we compute the overall average accuracy based on Fig. 6 for r = 2, 4, 6, 8, 10, 12 meters. The results are shown in Fig. 7. From it, we can see that for the MDS-MAP, MDS-MAP(P) and GDL algorithms, the localization accuracy with our technique is improved by about 86%, 26% and 15%, respectively, compared with the same algorithms using only hop-counting, when 2 ≤ r ≤ 12.

(Fig. 7 plots the localization error in percent against r in meters for MDS-MAP, MDS-MAP(P), and GDL with and without NN Hop-Coordinates.)

Fig. 7. Comparison of overall average accuracy of localization with or without hop coordinates

6.3 Costs

As a setup phase, our technique requires a fairly long training time, but afterwards it has a low computational cost: in essence, it requires only the multiplication of two small matrices. The approach described here requires approximately 350 bytes of memory to store the neural network parameters in each node.

7 Summary and Conclusions

In this paper, we presented a new localization technique, Neural Network based Hop-Coordinates, which improves the accuracy of localization for algorithms that use only connectivity information. In simulations, our technique improves the accuracy of the locations generated by MDS-MAP(P) and GDL by 26% and 15% on average, respectively, in comparison with the results obtained without applying our technique. We also achieve good results in comparison to GDL combined with another variation of hop coordinates that uses the average of the neighborhood hop count as the hop coordinate of each node [22]. Our technique works well in networks with varying numbers of nodes, especially when nodes are sparse, which is otherwise typically the worst case for GDL.

Through simulation, we showed how our technique can work with different algorithms, particularly the MDS-MAP series algorithms and GDL, which require only connectivity. Our technique can also be applied without modification to other localization algorithms that rely on hop-counting, such as DV-hop, DV-distance, and Hop-TERRAIN.

In terms of cost, this method has a low computational cost after its training process, and requires around 350 bytes of memory in a microprocessor without a floating-point unit (FPU).

References

[1] Akyildiz, I., et al.: A Survey on Sensor Networks. IEEE Communication 40(8), 102–114 (2002)

[2] Culler, D., et al.: Towards a Sensor Network Architecture: Lowering the Waistline. In: Proc. of SenSys. 2005 (2005)

[3] He, T., Huang, C., Blum, B., Stankovic, J., Abdelzaher, T.: Range-Free Localization Schemes in Large Scale Sensor Networks. In: Proc. of Mobile Computing and Networking (MobiCom 2003) (2003)

[4] McCanne, S., Floyd, S.: NS-2 Network Simulator, http://www.isi.edu/nsnam/ns/

[5] The CMU MONARCH Group: Wireless and Mobility Extensions to ns-2, http://www.monarch.cs.cmu.edu/cmu-ns.html

[6] Jianliang, Z., et al.: 802.15.4 extension to NS-2, http://www-ee.ccny.cuny.edu/ zheng/pub

[7] Priyantha, N., Balakrishnan, H., Demaine, E., Teller, S.: Anchor-free distributed localization in sensor networks. In: Proceedings of ACM SenSys 2003, pp. 340–341 (2003)

[8] Zhou, G., He, T., Krishnamurthy, S., Stankovic, J.A.: Impact of radio irregularity on wireless sensor networks. In: MobiSys 2004: Proceedings of the 2nd international conference on Mobile systems, applications, and services, pp. 125–138 (2004)

[9] Niculescu, D., Nath, B.: Ad-hoc positioning system. In: IEEE GlobeCom (2001)


[10] Savarese, C., Rabaey, J., Langendoen, K.: Robust positioning algorithm for distributed ad-hoc wireless sensor networks. In: USENIX Technical Annual Conf., Monterey, CA (2002)

[11] Stoleru, R., He, T., Stankovic, J., Luebke, D.: A high-accuracy, low-cost localization system for wireless sensor network. In: Proc. of SenSys 2005 (2005)

[12] Ratnasamy, S., Karp, B., Yin, L., Yu, F., Estrin, D., Govindan, R., Shenker, S.: GHT: A Geographic Hash Table for Data-Centric Storage. In: ACM WSNA (2002)

[13] Priyantha, N., Chakaborty, A., Balakrishnan, H.: The Cricket Location-support System. In: Proceedings of MobiCom (2000)

[14] Savarese, C., Rabaey, J., Langendoen, K.: Robust Positioning Algorithms for Distributed Ad-Hoc Wireless Sensor Networks. In: Proc. of USENIX, Technical Conference (2002)

[15] Shang, Y., Ruml, W., Zhang, Y., Fromherz, M.: Localization from Mere Connectivity. In: Proc. of Intl. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc) (2003)

[16] Shang, Y., Ruml, W.: Improved MDS-Based Localization. In: InfoCom 2004 (2004)

[17] Bruck, J., Gao, J., Jiang, A.: Localization and Routing in Sensor Networks by Local Angle Information. In: Proc. of the 6th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc 2005) (2005)

[18] Fonseca, R., Ratnasamy, S., Culler, D., Shenker, S., Stoica, I.: Beacon vector routing: Scalable point-to-point in wireless sensornets. In: Second Symposium on Network Systems Design and Implementation (NSDI) (2005)

[19] Capkun, S., Hamdi, M., Hubaux, J.: GPS-free positioning in mobile ad-hoc networks. In: Proceedings of Hawaii International Conference on System Sciences (HICSS 2001), pp. 3481–3490 (2001)

[20] Maróti, M., Völgyesi, P., Dóra, S., Kusý, B., Nádas, A., Lédeczi, Á., Balogh, G., Molnár, K.: Radio interferometric geolocation. In: Proc. of SenSys 2005 (2005)

[21] Xu, Y., Ford, J., Makedon, F.S.: A Variation on Hop-counting for Geographic Routing (EmNetS-III), Cambridge, MA (2006)

[22] Xu, Y., Ford, J., Makedon, F.S.: A Distributed Localization Algorithm for Wireless Sensor Networks Using Only Hop-Counting. In: IC3N 2006, Arlington, VA (2006)

[23] Maróti, M., Völgyesi, P., Dóra, S., Kusý, B., Nádas, A., Lédeczi, Á., Balogh, G., Molnár, K.: Radio interferometric geolocation. In: Proc. of SenSys 2005 (2005)


SUDOKUSAT—A Tool for Analyzing Difficult Sudoku Puzzles

Martin Henz and Hoang-Minh Truong

National University of Singapore, School of Computing, Computing 1, Law Link, Singapore 117590, Singapore

Abstract. Sudoku puzzles enjoy world-wide popularity, and a large community of puzzlers is hoping for ever more difficult puzzles. A crucial step for generating difficult Sudoku puzzles is the fast assessment of the difficulty of a puzzle. In a study in 2006, it has been shown that SAT solving provides a way to efficiently differentiate between Sudoku puzzles according to their difficulty, by analyzing which resolution technique solves a given puzzle. This paper shows that one of these techniques—unit resolution with failed literal propagation—does not solve a recently published Sudoku puzzle called AI Escargot, claimed to be the world's most difficult. The technique is also unable to solve any of a list of difficult puzzles published after AI Escargot, whereas it solves all previously studied Sudoku puzzles. We show that the technique can serve as an efficient and reliable computational method for distinguishing the most difficult Sudoku puzzles. As a proof-of-concept for an efficient difficulty checker, we present the tool SUDOKUSAT that categorizes Sudoku puzzles with respect to the resolution technique required for solving them.

1 Sudoku

Sudoku puzzles have fascinated puzzle solvers since their invention in 1979. A Sudoku puzzle is a 9 × 9 grid of cells, which is composed of nine 3 × 3 non-overlapping boxes of cells. The objective is to fill the grid with digits from 1 to 9 so that each row, column and box contains any digit at most once. Some cells are already filled with digits; these cells are called hints. To be a proper Sudoku puzzle, the grid must admit a unique way to fill the remaining cells to meet the objective (Uniqueness Property). Figure 1 shows a Sudoku puzzle that can be solved within a few minutes by an experienced puzzler.
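The row/column/box constraint and the hint format can be captured in a few lines; the helper below (a sketch, not part of SUDOKUSAT) checks whether a completely filled 9 × 9 grid satisfies the Sudoku objective.

```python
def is_valid_solution(grid):
    """grid: 9x9 list of lists holding the digits 1..9.
    True iff every row, column and 3x3 box contains each digit exactly once."""
    rows = [list(r) for r in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
             for br in range(3) for bc in range(3)]
    return all(sorted(unit) == list(range(1, 10)) for unit in rows + cols + boxes)
```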

Whereas precursors of Sudoku—based on magic squares and Latin squares—were known since the late 19th century, modern Sudoku puzzles were first published anonymously by Dell Magazines in 1979, and are widely attributed to Howard Garns. The puzzles became popular in Japan starting in the mid 1980s and world-wide in 2005, giving rise to a large international community of puzzlers who enjoy solving puzzles of various levels of difficulty.

The Sudoku community spends much effort to generate new and ever more difficult puzzles. A milestone in this quest was reached by Arto Inkala in 2006.



Fig. 1. A Typical Sudoku Puzzle

2 AI Escargot

In November 2006, Arto Inkala published a Sudoku puzzle that he called "AI Escargot"1, shown in Figure 2, which he claimed to be "the most difficult Sudoku puzzle known so far". In his recently published book (Ink07), Inkala notes:

In addition to my own evaluation method, I gave AI Escargot to others for evaluation in October 2006. It was soon found to be the most difficult Sudoku puzzle known according to at least three established evaluation methods. In the case of two methods, in fact, AI was the first puzzle that they were unable to solve. These methods were RMS (Ravel Minimal Step) and Sudoku Explainer 1.1. The third and fourth method was Gfsr and Suexrat9 method. The fifth method I used to evaluate AI Escargot also found it to be the most difficult, but that method is my own.

What happened next was what I had anticipated: AI Escargot motivated other makers of difficult Sudoku puzzles to continue the work. It continues to be the most difficult Sudoku puzzle, when the results of all evaluation methods are combined, but some puzzles can now be rated more difficult by a particular method.

The quest for difficult Sudoku puzzles has continued since the publication of AI Escargot, and a web site lists the most difficult puzzles discovered since the publication of AI

1 The letters “A” and “I” represent Dr. Inkala’s initials.



Fig. 2. AI Escargot

Escargot (rav07). A common approach for generating difficult Sudoku puzzles is generate-and-test: generate large numbers of promising candidate puzzles2, and test each of these candidates for the Uniqueness Property and difficulty. In order to efficiently and effectively search for difficult puzzles, the rating has to be both fast and correspond to the human intuition of Sudoku difficulty. In the quote above, Inkala mentions five techniques of rating:

Ravel Minimal Step (RMS). The minimal number of "difficult steps" needed to solve the puzzle. RMS is not able to rate AI Escargot. Even if appropriate "difficult steps" were added to the rating scheme, a time-consuming search-based technique is required for RMS, since the choice of steps and their order greatly affects the number of steps needed.

Sudoku Explainer. This Java software by Nicolas Juillerat uses a complex algorithm to assess the difficulty of a puzzle. Apparently, the more recent version Sudoku Explainer 1.2 is able to solve AI Escargot, but assessing its difficulty reportedly takes several minutes.

Gfsr. At the time of submission, the authors were not able to find details on this evaluation method.

Suexrat9. Suexrat9 returns the average number of depth-first search steps, where only simple reasoning steps are carried out at each node. This method is slowed down by the need for multiple runs of a depth-first solver, and the result does not correspond well to the experience of human solvers, who typically apply techniques radically different from depth-first search.

2 Enumerating possible Sudoku grids is an interesting task in itself; for details, see (FJ05).

Inkala's method. At the time of submission, the authors were not able to obtain details on this method.

It is not clear whether any of these techniques are able to serve as an effective testing step in generate-and-test. In this paper, we propose to use a SAT-based technique called failed literal propagation (FLP) as an effective test. In order to argue the case, we briefly discuss the computational techniques that have been applied to Sudoku puzzles in the next section. Section 4 zooms into SAT-based solving techniques. Section 5 discusses how a SAT-based solver can be augmented with a Sudoku-specific reasoning method to optimize the test for the Uniqueness Property. Section 6 presents SUDOKUSAT, a SAT-based tool for analyzing the difficulty of Sudoku puzzles. We have used SUDOKUSAT for assessing the difficulty of AI Escargot and other difficult puzzles, and Section 7 establishes the first known Sudoku puzzles that cannot be solved with FLP, which gives rise to the hope—expressed in Section 8—that SAT solving may prove to be a valuable tool in the quest for the most difficult Sudoku puzzles.

3 Sudoku in AI

The problem of finding solutions of n² × n² Sudoku puzzles is NP-hard (Yat03). All known 9 × 9 puzzles can be quickly solved using a variety of techniques, ranging from general search algorithms such as Dancing Links (Knu00), over constraint programming techniques (Sim05; Hol06) and SAT-based techniques (see next section), to a host of Sudoku-specific algorithms such as Sudoku Explainer (Jui06) and Sudoku Assistant/Solver (Han07).

Researchers in Artificial Intelligence have used constraint programming and SAT solving to assess the difficulty of Sudoku puzzles. Simonis shows that finite-domain constraint programming (CP) can be employed to assess the difficulty of Sudoku puzzles (Sim05). CP models use finite domain variables for each cell representing the digit placed in the cell, as well as finite domain variables that indicate in which cell of a given row/column/box a particular digit is placed. Both kinds of variables are connected with all-different constraints. Channeling constraints connect variables of different kinds. Simonis presents several constraints that exploit the interaction between rows, columns and boxes. Of particular value for solving Sudoku puzzles is a technique called shaving (TL00), which speculatively assigns a value to a variable. When propagation detects an inconsistency as a result of the assignment, the value is removed from the variable domain. Simonis proceeds to compare the published difficulty of a range of puzzles with the kinds of propagation techniques required to solve them without search. Since he did not have access to the difficult instances published after 2006, the effectiveness of CP to assess the difficulty of Sudoku puzzles remains a subject of future work.

Weber presented the first SAT-based tool for solving Sudoku (Web05), by translating a puzzle using the theorem prover Isabelle/HOL into clause form, solving the resulting SAT problem using the SAT solver zChaff (MMZ+01), and reporting that the encoding and solving is done "within milliseconds". Lynce and Ouaknine apply various SAT-based resolution techniques to Sudoku (LO06). They confirm that the SAT-equivalent of shaving, called failed literal propagation, provides a very efficient and effective method for solving the puzzles. Similar to Simonis, Lynce and Ouaknine did not have access to the difficult puzzles published after 2006, and thus, the significance of the different resolution techniques for assessing Sudoku difficulty remains unclear. Comparing constraint programming and SAT solving on Sudoku, we conclude that SAT solving provides simpler and equally powerful solving techniques, and we therefore focus on SAT solving in this work.

4 SAT-Based Sudoku Solving

Satisfiability is the problem of determining if the variables in a given Boolean formula can be assigned truth values such that the formula evaluates to TRUE. In SAT problems, the formula is a conjunction of disjunctions of literals, where a literal is either a variable or the negation of a variable. The disjunctions are called clauses, and the formula a clause set. If the clause set is satisfiable, complete SAT solvers provide an assignment of the variables which satisfies it. When we encode a given Sudoku puzzle as a SAT problem and give it to a complete SAT solver, the solver returns the solution to the puzzle, encoded in a variable assignment.

We use the following extended encoding of Sudoku, given by Lynce and Ouaknine (LO06). With each cell of the grid, this encoding associates 9 variables, one for each possible digit. The variable $s_{xyz}$ is assigned TRUE if and only if the cell in row x and column y contains the digit z. The following conjunctions of clauses combined make up the extended encoding (for a detailed explanation, see (LO06)):

– $\bigwedge_{x=1}^{9}\bigwedge_{y=1}^{9}\bigvee_{z=1}^{9} s_{xyz}$
– $\bigwedge_{y=1}^{9}\bigwedge_{z=1}^{9}\bigwedge_{x=1}^{8}\bigwedge_{i=x+1}^{9} (\neg s_{xyz} \vee \neg s_{iyz})$
– $\bigwedge_{x=1}^{9}\bigwedge_{z=1}^{9}\bigwedge_{y=1}^{8}\bigwedge_{i=y+1}^{9} (\neg s_{xyz} \vee \neg s_{xiz})$
– $\bigwedge_{z=1}^{9}\bigwedge_{i=0}^{2}\bigwedge_{j=0}^{2}\bigwedge_{x=1}^{3}\bigwedge_{y=1}^{3}\bigwedge_{k=y+1}^{3} (\neg s_{(3i+x)(3j+y)z} \vee \neg s_{(3i+x)(3j+k)z})$
– $\bigwedge_{z=1}^{9}\bigwedge_{i=0}^{2}\bigwedge_{j=0}^{2}\bigwedge_{x=1}^{3}\bigwedge_{y=1}^{3}\bigwedge_{k=x+1}^{3}\bigwedge_{l=1}^{3} (\neg s_{(3i+x)(3j+y)z} \vee \neg s_{(3i+k)(3j+l)z})$
– $\bigwedge_{x=1}^{9}\bigwedge_{y=1}^{9}\bigwedge_{z=1}^{8}\bigwedge_{i=z+1}^{9} (\neg s_{xyz} \vee \neg s_{xyi})$
– $\bigwedge_{y=1}^{9}\bigwedge_{z=1}^{9}\bigvee_{x=1}^{9} s_{xyz}$
– $\bigwedge_{x=1}^{9}\bigwedge_{z=1}^{9}\bigvee_{y=1}^{9} s_{xyz}$
– $\bigwedge_{z=1}^{9}\bigwedge_{i=0}^{2}\bigwedge_{j=0}^{2}\bigvee_{x=1}^{3}\bigvee_{y=1}^{3} s_{(3i+x)(3j+y)z}$
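A compact way to materialize an equivalent clause set is to enumerate, for every cell and for every row/column/box unit, one "at least once" clause and all pairwise "at most once" clauses. The sketch below (not SUDOKUSAT's encoder) produces integer clauses in DIMACS style; the variable numbering is an assumption.

```python
from itertools import combinations

def var(x, y, z):
    """Map s_xyz (x, y, z in 1..9) to a positive integer variable."""
    return 81 * (x - 1) + 9 * (y - 1) + z

def extended_encoding():
    """Clauses of the extended encoding; negative literals denote negation."""
    digits = range(1, 10)
    clauses = []
    # every cell holds at least one and at most one digit
    for x in digits:
        for y in digits:
            clauses.append([var(x, y, z) for z in digits])
            clauses += [[-var(x, y, z1), -var(x, y, z2)]
                        for z1, z2 in combinations(digits, 2)]
    # every digit appears at least once and at most once per row, column and box
    rows = [[(x, y) for y in digits] for x in digits]
    cols = [[(x, y) for x in digits] for y in digits]
    boxes = [[(3 * i + dx, 3 * j + dy) for dx in (1, 2, 3) for dy in (1, 2, 3)]
             for i in range(3) for j in range(3)]
    for unit in rows + cols + boxes:
        for z in digits:
            clauses.append([var(x, y, z) for (x, y) in unit])
            clauses += [[-var(x1, y1, z), -var(x2, y2, z)]
                        for (x1, y1), (x2, y2) in combinations(unit, 2)]
    return clauses

print(len(extended_encoding()))   # clause count before the hints are added as unit clauses
```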

Complete solving techniques for SAT problems are based on simplification and search. A basic simplification technique is the unit clause rule (DP60). The rule eliminates a unit clause of the form ω_i = (l_j), where l_j is a literal, by removing all clauses in which l_j appears, and removing ¬l_j from all clauses containing it. Unit resolution applies the unit clause rule exhaustively until the clause set contains no unit clause. The most common complete solving technique for SAT problems is the Davis-Putnam-Logemann-Loveland (DPLL) algorithm (DLL62), which combines tree search, where each branching assigns a selected literal to a truth value, with unit resolution. Humans solve Sudoku puzzles using complex kinds of reasoning such as the "Swordfish" pattern (Han07), as opposed to tree search. Thus, in order to use SAT solving to assess the difficulty of a puzzle, Lynce and Ouaknine examine—in addition to unit resolution—the following three propagation-based techniques.

Failed literal propagation (FLP). The so-called failed literal rule is applied to a clause set f and a literal l, by creating a new clause set f′ = f ∧ (l). If unit resolution applied to f′ yields a contradiction (an empty clause), f is simplified by adding ¬l to it and performing unit resolution. Failed literal propagation repeatedly applies this failed literal rule to all literals, until no more simplification is achieved.

Binary failed literal propagation (BFLP). This technique extends failed literal propagation by considering pairs of literals l1 and l2. If unit resolution after adding the clauses (l1) and (l2) to the clause set leads to a contradiction, the clause (¬l1 ∨ ¬l2) can be added to the clause set.

Hyper-binary resolution (HBR). This technique infers from clause subsets of the form (¬l1 ∨ xi) ∧ (¬l2 ∨ xi) ∧ · · · ∧ (¬lk ∨ xi) ∧ (l1 ∨ l2 ∨ · · · ∨ lk ∨ xj) the clause (xi ∨ xj). Similar to the previous two techniques, unit resolution is performed after each application of the rule, and the rule is applied exhaustively.
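To make the first of these rules concrete, the sketch below implements unit resolution and failed literal propagation over integer clauses like those generated above. It is a straightforward reference implementation, not the optimized propagation used in SUDOKUSAT.

```python
def unit_resolution(clauses, assignment):
    """Exhaustively apply the unit clause rule.
    Returns (extended assignment, conflict flag); assignment is a set of true literals."""
    assignment = set(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue                                 # clause already satisfied
            open_lits = [lit for lit in clause if -lit not in assignment]
            if not open_lits:
                return assignment, True                  # empty clause: contradiction
            if len(open_lits) == 1:                      # unit clause
                assignment.add(open_lits[0])
                changed = True
    return assignment, False

def failed_literal_propagation(clauses):
    """Apply the failed literal rule to all literals until no more simplification."""
    assignment, conflict = unit_resolution(clauses, set())
    progress = True
    while progress and not conflict:
        progress = False
        for v in {abs(lit) for clause in clauses for lit in clause}:
            for lit in (v, -v):
                if lit in assignment or -lit in assignment:
                    continue
                _, failed = unit_resolution(clauses, assignment | {lit})
                if failed:                               # lit is a failed literal
                    assignment.add(-lit)
                    assignment, conflict = unit_resolution(clauses, assignment)
                    if conflict:
                        return assignment, True
                    progress = True
    return assignment, conflict
```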

5 Grid Analysis

During the search for Sudoku puzzles using generate-and-test, millions of puzzles need to be assessed as to whether they admit a unique solution. The only known technique for this problem is to find the set of all solutions and check whether that set is a singleton. The DPLL algorithm family is an efficient technique for finding all solutions of Sudoku puzzles.

We are proposing to combine DPLL with the following Sudoku-specific propagation rules, which are given by a technique called Grid Analysis (GA), employed by human puzzlers and Sudoku software such as Sudoku Assistant/Solver.

– When a candidate digit is possible in a certain set of cells that form the intersection of a set of n rows and n columns, but is not possible elsewhere in that same set of rows, then it is also not possible elsewhere in that same set of columns.

– Analogously, when a candidate digit is possible in a certain set of cells that form the intersection of a set of n rows and n columns, but is not possible elsewhere in that same set of columns, then it is also not possible elsewhere in that same set of rows.

At a particular node in the DPLL search tree, the first rule can be implemented by iterating through all possible digits k and numbers n, where 1 ≤ n ≤ 8. For each value of k and n, we check whether there are n rows in which there are less than n candidate cells for k, whether all candidate cells belong to n columns, and whether each of these columns contains at least one candidate cell. If these conditions are met, we delete k from all cells in the n columns that do not belong to any of the n rows, by assigning the corresponding variable s_ijk to FALSE. The second rule can be implemented similarly. The technique can be further improved by switching the roles of value and row/column index, as explained in (Sim05).
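The first rule is essentially a row-based "fish" elimination over the candidate grid. The sketch below operates on a 9 × 9 array of candidate sets rather than on propositional variables, which is a simplification of the description above; the column-based second rule is symmetric.

```python
from itertools import combinations

def grid_analysis_rows(cand):
    """cand[r][c] is the set of candidate digits of cell (r, c).
    Applies the first Grid Analysis rule; returns True if something was eliminated."""
    changed = False
    for k in range(1, 10):
        # columns in which k is still a candidate, per row
        row_cols = [{c for c in range(9) if k in cand[r][c]} for r in range(9)]
        for n in range(2, 9):
            usable = [r for r in range(9) if 0 < len(row_cols[r]) <= n]
            for rows in combinations(usable, n):
                cols = set().union(*(row_cols[r] for r in rows))
                if len(cols) != n:
                    continue                  # candidates not confined to n columns
                for c in cols:                # delete k elsewhere in those columns
                    for r in set(range(9)) - set(rows):
                        if k in cand[r][c]:
                            cand[r][c].discard(k)
                            changed = True
    return changed
```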


6 SUDOKUSAT

In order to experiment with FLP, BFLP, HBR and GA for solving Sudoku puzzles, we implemented a SAT solver, in which DPLL, FLP, BFLP and HBR can be used as alternative solving techniques. For each technique, the user can choose to add GA. In the case of FLP, BFLP and HBR, GA is performed after propagation. Whenever GA obtains a simplification, propagation is repeated, leading to a fixpoint with respect to a combination of FLP/BFLP/HBR and GA. Similarly, during DPLL, GA is combined

Fig. 3. Interface of SUDOKUSAT

Fig. 4. Display of solved puzzle and statistics on the search tree size, runtime and remaining propositional variables


with unit resolution to achieve a combined fixpoint. Figure 3 shows the user interface of SUDOKUSAT after entering the hints of a problem and setting the solver options.

Figure 4 shows the user interface of SUDOKUSAT after solving the puzzle, displaying statistics on the size of the search tree (in case DPLL was chosen), the overall runtime, and the percentage of remaining variables (in case FLP, BFLP or HBR was chosen).

Note that Holst reports on a tool for solving Sudoku puzzles, based on constraint logic programming (Hol06).

7 Results

The experiments reported by Lynce and Ouaknine (LO06) use a set of 24,260 very difficult Sudoku puzzles collected by Gordon Royle (Roy07). These puzzles are minimal in the sense that they have the smallest known number of hints, while still admitting a unique solution. The puzzles contain 17 hints, as there are no known Sudoku puzzles with only 16 hints. A common assumption at the time of publication of previous studies on using AI techniques for Sudoku (LO06; Web05; Sim05; Hol06) was that the most difficult Sudoku puzzles would be minimal. Lynce and Ouaknine show that unit resolution alone can only solve about half of their studied puzzles, whereas the remaining three techniques solve them all.

Inkala argues that the number of hints is not indicative of the difficulty of the puzzle (Ink07), and presents AI Escargot, a puzzle with 23 hints, which was acknowledged as the most difficult Sudoku puzzle in November 2006 (AFP06). The 20 most difficult Sudoku puzzles known at the time of this writing all have 21 hints (rav07).

We have applied the three techniques given in Section 4 to these 20 instances and AI Escargot and obtained the following results:

– Not surprisingly, UR fails to solve any of the puzzles.
– None of the puzzles can be solved using FLP.
– All puzzles are solved using BFLP, HBR, and Grid Analysis.

In our view, the most significant of these results is the second. Lynce and Ouaknine (LO06) report that FLP solves all puzzles listed by Royle (Roy07), a fact that we verified with SUDOKUSAT. However, FLP is not able to solve any of the new difficult puzzles. The runtime for attempting to solve them varies between 0.28 and 1.59 seconds, with an average of 0.79, using a PC with a Pentium 4 processor running at 3 GHz under Windows Vista. Figure 5 shows the detailed runtimes for the 20 most difficult puzzles.

For AI Escargot, we obtained the runtime results given in Fig. 6. Note that the most efficient method for solving AI Escargot appears to be DPLL, combined with the Grid Analysis technique presented in Section 5.

The runtime results indicate that our implementation of FLP in SUDOKUSAT provides an efficient and reliable test for identifying the most difficult Sudoku puzzles, and that DPLL together with Grid Analysis is a promising technique for quickly checking that a candidate puzzle has a unique solution.


000001020300040500000600007002000001080090030400000800500002000090030400006700000 1.062003000009400000020080600100200004000090800007005030000000900800000005030070010006 0.843100300000020090400005007000800000100040000020007060003000400800000020090006005007 0.578100007000020030500004900000008006001090000020700000300000001008030050060000400700 0.937003400080000009200000060001007010000060002000500800040010000900800000030004500007 0.766000001020300040500000600007002000006050030080400000900900002000080050400001700000 0.313003900000040070001600002000800000002070050030009000400200001008000040050000600900 1.594100007000020080000004300500005400001080000020900000300000005007000020060003100900 1.39002900000030070000500004100008000002090000030600500400000003008000060070100200500 0.344000001020300040500000600007001000006040080090500000300800002000050090400006700000 0.391900004000050080070001200000002600009030000040700000500000009002080050030000700600 1.218003400000050009200700060000200000700090000010008005004000300008010002900000070060 0.281000001020300040500000600001001000007050030040800000900400002000090050800006700000 0.422400009000030010020006700000001000004050200070800000600000004008070030010000500900 0.625100050080000009003000200400004000900030000007800600050002800060500010000070004000 1.078100050000006009000080200004040030008007000060900000100030800002000004050000010700 0.656003200000040090000600008010200000003010006040007000500000001002090040060000500700 0.796020400000006080100700003000000060300000200005090007040300000800001000090040500002 0.594002600000030080000500009100006000002080000030700001400000004005010020080000700900 1.188500000009020100070008000300040702000000050000000006010003000800060004020900000005 0.813

Fig. 5. The 20 most difficult Sudoku puzzles (rav07) and the runtimes in seconds required for (unsuccessfully) attempting to solve them using FLP. The puzzles are given by concatenating the rows of cells containing their hints, where a cell with no hint is represented by 0.

Technique    Runtime (s)   Size of search tree
DPLL         0.13          50 nodes
DPLL + GA    0.076         20 nodes
BFLP         5.200         1 node (no search needed)
BFLP + GA    4.672         1 node (no search needed)
HBR          48.000        1 node (no search needed)

Fig. 6. Runtimes for solving AI Escargot using SUDOKUSAT with the successful techniques (Pentium 4, 3 GHz, Windows Vista). Neither FLP nor FLP + GA are able to solve AI Escargot, and in both cases, 76% of the propositional variables remain.

8 Conclusions

We have argued in this paper that SAT solvers provide effective measures of the difficulty of Sudoku puzzles by demonstrating that the recently discovered puzzle AI Escargot—claimed to be the most difficult Sudoku puzzle in the world—is the first known Sudoku puzzle that cannot be solved by unit resolution combined with failed literal propagation (FLP). We also show that 20 puzzles discovered after AI Escargot cannot be solved using FLP, whereas the technique is able to solve all previously studied Sudoku puzzles. These results lead us to suggest the following method for generating difficult puzzles:


1. Generate a candidate puzzle p using known enumeration techniques (e.g. (FJ05)).
2. Test if p has a solution and satisfies the Uniqueness Property. For this step, the runtime results obtained for AI Escargot suggest a combination of DPLL with grid analysis. If the test fails, go to 1.
3. Test if FLP solves p. If not, save p as a difficult puzzle.
4. Go to 1.

As a prototype implementation of Steps 2 and 3, we presented SUDOKUSAT, a Sudoku solver that allows a puzzle designer to assess the difficulty of candidate Sudoku puzzles by applying different resolution techniques in the solving process. We hope that this tool, or an adaptation of it, allows Sudoku enthusiasts to generate further difficult Sudoku instances for the benefit of Sudoku puzzlers around the world.

References

[AFP06] AFP: Mathematician claims to have penned hardest Sudoku. USA Today (November 2006)

[DLL62] Davis, M., Logemann, G., Loveland, D.: A machine program for theorem proving. Communications of the Association for Computing Machinery 5(7), 394–397 (1962)

[DP60] Davis, M., Putnam, H.: A computing procedure for quantification theory. Journal of the Association for Computing Machinery 7, 201–215 (1960)

[FJ05] Felgenhauer, B., Jarvis, F.: Enumerating possible Sudoku grids (2005),http://www.afjarvis.staff.shef.ac.uk/sudoku/sudoku.pdf

[Han07] Hanson, B.: Sudoku Assistant/Solver. Online software written in JavaScript (2007),http://www.stolaf.edu/people/hansonr/sudoku/

[Hol06] Holst, C.K.: Sudoku—an exercise in constraint programming in Visual Prolog 7. In: 1st Visual Prolog Applications and Language Conference, VIP-ALC 2006, Faro, Portugal (April 2006)

[Ink07] Inkala, A.: AI Escargot—The Most Difficult Sudoku Puzzle. Lulu, Finland (2007). ISBN 978-1-84753-451-4

[Jui06] Juillerat, N.: Sudoku Explainer. Version 1.2 (December 2006), http://diuf.unifr.ch/people/juillera/Sudoku/Sudoku.html

[Knu00] Knuth, D.E.: Dancing links. In: Davies, J., Roscoe, B., Woodcock, J. (eds.) Millennial Perspectives in Computer Science, pp. 187–214. Palgrave, Houndmills (2000)

[LO06] Lynce, I., Ouaknine, J.: Sudoku as a SAT problem. In: Berstein, S.Z. (ed.) Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, AIMATH 2006, Fort Lauderdale, Florida, USA (January 2006)

[MMZ+01] Moskewicz, M., Madigan, C., Zhao, Y., Zhang, L., Malik, S.: Chaff: Engineering an efficient SAT solver. In: Proceedings of the 38th Design Automation Conference, Las Vegas, USA (June 2001)

[rav07] ravel. The hardest sudokus. Sudoku Players’ Forums (January 25, 2007), http://www.sudoku.com/forums/viewtopic.php?t=4212&start=587

[Roy07] Royle, G.: Minimum sudoku (2007), http://people.csse.uwa.edu.au/gordon/sudokumin.php

[Sim05] Simonis, H.: Sudoku as a constraint problem. In: Proceedings of the CP Workshop on Modeling and Reformulating Constraint Satisfaction Problems, Sitges, Spain, pp. 13–27 (October 2005)


[TL00] Torres, P., Lopez, P.: Overview and possible extensions of shaving techniques for job-shop problems. In: 2nd International Workshop on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, CP-AI-OR 2000, IC-PARC, UK, pp. 181–186 (March 2000)

[Web05] Weber, T.: A SAT-based Sudoku solver. In: Sutcliffe, G., Voronkov, A. (eds.) 12th International Conference on Logic for Programming, Artificial Intelligence and Reasoning, LPAR 2005, pp. 11–15, Montego Bay, Jamaica (December 2005) (short paper proceedings)

[Yat03] Yato, T.: Complexity and completeness of finding another solution and its application to puzzles. Master's thesis, Univ. of Tokyo, Dept. of Information Science, Faculty of Science, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan (January 2003), http://www-imai.is.s.u-tokyo.ac.jp/~yato/data2/MasterThesis.ps


Understanding and Forecasting Air Pollution with the Aid of Artificial Intelligence Methods in Athens, Greece

Kostas D. Karatzas, George Papadourakis, and Ioannis Kyriakidis

Dept. of Mechanical Engineering, Aristotle University, GR-54124 Thessaloniki, Greece
[email protected]
Dept. of Applied Informatics & Multimedia, T.E.I. of Crete, Stavromenos GR-71004 Heraklion, Crete, Greece
[email protected]
Dept. of Applied Informatics & Multimedia, T.E.I. of Crete, Stavromenos GR-71004 Heraklion, Crete, Greece
[email protected]

Abstract. The understanding and management of air quality problems is a suitable domain for the application of artificial intelligence (AI) methods towards knowledge discovery for the purposes of modelling and forecasting. As air quality has a direct impact on the quality of life and on the general environment, the ability to extract knowledge concerning relationships between parameters that influence environmental policy making and pollution abatement measures is becoming more and more important. In the present paper an arsenal of AI methods consisting of Neural Networks and Principal Component Analysis is supported by a set of mathematical tools including statistical analysis, fast Fourier transformations and periodograms, for the investigation and forecasting of photochemical pollutant time series in Athens, Greece. Results verify the ability of the methods to analyze and model this knowledge domain and to forecast the levels of key parameters that provide direct input to the environmental decision making process.

1 Introduction

The worldwide concern about air pollution problems has created the need for air quality forecasts for the purposes of minimizing the adverse effects that bad atmospheric quality poses upon humans and their environment. As air pollution is a continuous environmental problem, it reinforces the need for more accurate and, at the same time, more easily applied models and for a better understanding of the cause-effect relationships. Methods for AQ forecasting are required in particular for predicting the worst air quality situations. The present paper focuses on forecasting of hourly pollution episodes associated with benzene. Benzene (C6H6) is a volatile organic compound (VOC) consisting of a stable chemical ring structure that constitutes the base of the aromatic hydrocarbon family (thus with a characteristic smell). Benzene is characterized as a health hazard: according to the WHO, its unit risk factor is 4·10−6, meaning that in a population of one million persons, 4 additional cases of cancer can be expected following a continuous inhalation of 1 µg/m3 over a lifetime, or one additional cancer in a population of 250 000 people.


Human activities, and mostly traffic, are the main benzene emission sources. As these emissions occur within populated areas of high density, it is essential to develop forecasting models that will allow prediction of air quality levels related to hourly benzene concentrations, being useful in everyday environmental management for city authorities [1].

Artificial Neural Networks (ANNs) are computational data structures that try to mimic the biological processes of the human brain and nervous system. ANNs were advanced in the late 1980s, popularizing non-linear regression techniques like Multilayer Perceptrons (MLP) [2] and self-organising maps (SOM) [3]. It has been shown that ANNs can be trained successfully to approximate virtually any smooth, measurable function [4]. ANNs are highly adaptive to non-parametric data distributions and make no prior hypotheses about the relationships between the variables. ANNs are also less sensitive to error term assumptions and they can tolerate noise, chaotic components and heavy tails better than most other methods. Other advantages include greater fault tolerance, robustness, and adaptability, especially compared to expert systems, due to the large number of interconnected processing elements that can be trained to learn new patterns [5]. These features provide ANNs with the potential to model complex non-linear phenomena like air pollution [6-10].

2 Background

2.1 General Intro to Air Quality Analysis and Forecasting

Earlier research work has already dealt with using knowledge discovery techniques mainly for air quality associated incident forecasting, on the basis of incident definitions (i.e. episodes, pollution threshold value exceedances etc.) that are associated with air pollution levels. One should make a distinction between (a) the usage of deterministic (usually three-dimensional) air pollution models that are used for the assessment of air pollution levels and for the forecasting of air quality on a grid basis concerning an area of interest, and (b) the usage of mathematical methods that deal directly with time series "generated" by monitoring stations. For the latter, several models have been built for predicting incidents that may occur in the near future. For instance, conventional statistical regression models [11], [12], [13], and time-series analysis have been applied to predict ozone levels [14]. Neural networks have been used for short-term ozone prediction [15], also in combination with adaptive nonlinear state space-based prediction [16]. In addition, methods like case-based reasoning [17] and classification and regression trees [18], [19], have been employed for predicting air pollutant concentrations [20].

2.2 The Greater Athens Area and Its AQ and Monitoring Stations

The city of Athens today is located in a basin of approximately 450 km² (Figure 1). This basin is surrounded on three sides by fairly high mountains (Mt. Parnis, Mt. Pendeli, Mt. Hymettus), while to the SW it is open to the sea. Industrial activities take place both in the Athens basin and in the neighbouring Mesogia plain. The Athens basin is characterized by a high concentration of population (about 40% of the Greek population), accumulation of industry (about 50% of the Greek industrial activities) and high motorization (about 50% of the registered Greek cars). Anthropogenic emissions in conjunction with unfavourable topographical and meteorological conditions are responsible for the high air pollution levels in the area [20].

The visual results of atmospheric pollution, "nephos", a brown cloud over the city, made their appearance in the 70's. Alarmingly elevated pollutant concentrations already threaten public health and at the same time cause irreparable damage to invaluable ancient monuments. Already by the mid 80's it became apparent to the Athenian authorities that road traffic is the main contributor to the anthropogenic emissions in Athens: practically all carbon monoxide (CO) emissions, 3/4 of the nitrogen oxides (NOx) emissions and nearly 2/3 of the volatile organic compounds (VOCs) emissions were found to be associated with road traffic. It was realized that the largest part of these emissions was associated with private use passenger cars. Therefore, decisive interventions in the market and use of cars were recognized as essential for improving air quality in Athens [20]. Benzene and CO are similar in connection with their emissions. It is expected that in Athens traffic (the basic cause of CO) is also the basic cause of benzene, as suggested also by [21]. Benzene concentrations were found to be alarmingly high in Athens [22], a fact that renders this pollutant appropriate for modelling and forecasting.

Fig. 1. Topography of the Greater Athens Area and location of the Air Quality monitoring stations. Patisia station is indicated with red.

2.3 Data Presentation

In this paper we use a 2-year-long data set of hourly air pollutant concentrations and meteorological information coming from the Patisia monitoring station, one of the air quality stations operating in the Greater Athens Area. Parameters addressed include benzene (C6H6, μg/m3), carbon monoxide (CO, mg/m3), sulfur dioxide (SO2, μg/m3), relative humidity (RH, %), air temperature (Ta, °C) and wind speed (WS, m/s). In addition, the sine and cosine transformation functions are applied to the wind direction (WD = 1 + sin(θ + π/4)), as they were found to lead to better results [6].

The data set used had missing values that were automatically flagged with the value -9999. The criterion applied for excluding a day from the time series was that more than 30% of its values were missing. To calculate the missing values of the remaining days, the Lagrange interpolating polynomial (of 6th degree) was applied.
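A possible realisation of this preprocessing step is sketched below with NumPy/SciPy: days with more than 30% flagged values are rejected, and the remaining gaps of a day are filled from a 6th-degree Lagrange polynomial. The choice of support points for the polynomial is an assumption, not taken from the paper.

```python
import numpy as np
from scipy.interpolate import lagrange

MISSING = -9999

def fill_missing_day(day_values, degree=6):
    """day_values: the 24 hourly values of one day, with gaps flagged as -9999.
    Returns the completed day, or None if more than 30% of the values are missing."""
    values = np.asarray(day_values, dtype=float)
    gaps = values == MISSING
    if gaps.mean() > 0.30:
        return None                                    # exclude the whole day
    hours = np.arange(len(values))
    valid_h, valid_v = hours[~gaps], values[~gaps]
    # fit the 6th-degree polynomial on degree+1 evenly spread valid hours
    support = np.linspace(0, len(valid_h) - 1, degree + 1).astype(int)
    poly = lagrange(valid_h[support], valid_v[support])
    values[gaps] = poly(hours[gaps])
    return values
```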

3 Methodologies Applied

3.1 Covariance Calculation

In probability theory and statistics, covariance is a measure of how much two random variables vary together, as distinct from variance, which measures how much a single variable varies. The covariance between two real-valued random variables X and Y is defined as [23]:

$Cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$
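Computed directly from this definition, the covariance of the benzene series with each candidate predictor can be obtained as in the short sketch below (division by N, i.e. the population covariance); the synthetic data are only for illustration.

```python
import numpy as np

def covariance(x, y):
    """Population covariance as defined above (division by N)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean((x - x.mean()) * (y - y.mean())))

# Illustration with synthetic, correlated series.
rng = np.random.default_rng(1)
so2 = rng.normal(size=1000)
benzene = 0.5 * so2 + rng.normal(scale=0.2, size=1000)
print(covariance(benzene, so2))
```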

Covariance calculation results are presented in Table 1. From Table 1 it is clear that the highest covariation with benzene is shown by SO2 and CO, indicating that these pollutants are simultaneously present in the atmosphere, and thus providing a hint of common emission sources and an association with the life cycle of the city operation. In addition, there is minor covariation from RH and WS.

Table 1. Covariance of hourly benzene concentrations to the other characteristics for the time period 2004-2005 (Patisia station)

Covariance of benzene to:   2004      2004 (%)   2005      2005 (%)
SO2                         40.5245   60         61.532    79.7
CO                           7.3824   11          7.1475    9.3
RH                          13.7737   20          4.1642    5.4
Ta                          -2.9822    4         -0.6281    0.8
WS                          -3.5233    5         -3.7368    4.8
WD                          -0.0040    0         -0.0097    0.0

3.2 Principal Component Analysis

Principal component analysis (PCA) is a computational intelligence method originating from multivariate statistical analysis that allows for the identification of the major drivers within a certain multidimensional data set and thus may be applied for data compression, identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in high dimensional data, where the luxury of graphical representation is not available, PCA is a powerful tool for such an analysis. The other main advantage of PCA is that once these patterns have been identified the data can be compressed, using dimensionality reduction, without significant loss of information [24].

Table 2 shows the PCA analysis results, which represent the relative importance of each feature studied in comparison to hourly benzene concentrations. Kaiser's rule of keeping all eigenvalues greater than one [25] indicates that the WD feature should be eliminated. The remaining features account for 99.95% of the variation of the data set. In addition to the PCA analysis, before eliminating a feature with a small eigenvalue, its covariance with the dependent variable must be checked. If a feature with a small eigenvalue has a high covariance with the dependent variable, it should not be eliminated. Furthermore, in order to obtain a substantial PCA analysis, as a rule of thumb, the results should contain at least three high eigenvalue features [26]. According to Table 1, the covariance between benzene concentrations and wind direction (WD) is found to be close to zero. This fact reinforces the elimination of the WD feature.

Table 2. The eigenvalues obtained from the PCA analysis

Parameters   Eigenvalues (%)   Eigenvalues
SO2          68.518            672.62
CO           28.096            275.81
RH            2.8455            27.933
Ta            0.34162            3.3536
WS            0.14718            1.4448
WD            0.050801           0.49869
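The eigenvalues of Table 2 correspond to those of the covariance matrix of the feature matrix. A minimal NumPy sketch of this computation is given below; the random data and the assumed column order (SO2, CO, RH, Ta, WS, WD) are placeholders.

```python
import numpy as np

def pca_eigenvalues(X):
    """X: samples x features matrix.  Returns the eigenvalues of the feature
    covariance matrix in decreasing order, and their share of the total variance."""
    C = np.cov(np.asarray(X, float), rowvar=False)
    vals = np.linalg.eigvalsh(C)[::-1]          # descending eigenvalues
    return vals, 100.0 * vals / vals.sum()

X = np.random.default_rng(2).normal(size=(1000, 6))   # placeholder data set
vals, percent = pca_eigenvalues(X)
print(np.round(vals, 3), np.round(percent, 2))
```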

3.3 Periodogram and Fourier Transform

The discrete Fourier transform of a stationary discrete time series x(n) is a set of N discrete harmonics X(k), where

$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$

and it represents the frequency response of x(n). X(k) is complex. The magnitude of X(k) squared, a real value, is called the strength, and a diagram of all the strength harmonics is called a periodogram. When x(n) is real, the periodogram is symmetric and only half of the harmonics are needed. Here x(n) is the hourly benzene concentration time series for the years 2004 and 2005.


Initially, a test is performed in order to check if x(n) is stationary. In order for x(n) to be stationary, it should possess no time dependent trend. Figure 2 illustrates the benzene hourly concentration time series, the regression line and its moving average. It is clear that these data are non-stationary because the moving average is changing around the horizontal axis and the regression line is:

$y = a + bx = 7{,}6809 - 7{,}4315 \cdot 10^{-5} \cdot x \approx 7{,}6809$

In order to transform the data from non-stationary to stationary, the moving average must be subtracted from the time series. Figure 3 shows the transformed stationary time series of benzene, the regression line and the moving average. It is obvious that the moving average takes values close to zero and the regression line is:

$y = a + bx = 0{,}0109 - 2{,}3791 \cdot 10^{-6} \cdot x \approx 0$

In order to compute the periodogram of the benzene hourly concentration time series, the Fast Fourier Transform (FFT) [27] is utilized. The periodogram of benzene is illustrated in Figure 4. The highest values of strength in the periodogram determine the periodicity of the time series. Table 3 presents the most significant strength values and their corresponding periodicities.
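The detrending and periodogram computation can be reproduced along the following lines with NumPy; the moving-average window length is an assumption, since it is not stated above, and the series used here is a placeholder.

```python
import numpy as np

def periodogram(x, window=24):
    """Subtract the moving average (detrending, as above) and return
    (frequency in cycles/hour, strength = |X(k)|^2) for half the spectrum."""
    x = np.asarray(x, float)
    trend = np.convolve(x, np.ones(window) / window, mode="same")
    detrended = x - trend
    X = np.fft.rfft(detrended)
    freqs = np.fft.rfftfreq(len(detrended), d=1.0)   # hourly sampling
    return freqs, np.abs(X) ** 2

# Report the four strongest periodicities (in hours) of a placeholder series.
series = np.random.default_rng(3).normal(size=24 * 730)
freqs, strength = periodogram(series)
top = np.argsort(strength[1:])[::-1][:4] + 1          # skip the zero frequency
print([(round(f, 5), round(1.0 / f, 1)) for f in freqs[top]])
```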

The periodicities of Table 3 confirm the original hypothesis that the basic source of benzene is traffic. This is indicated by the periodicity of traffic per week (7 days: highest traffic on working days and less on weekends). Furthermore, the periodicity of 8 hours is very important because it indicates the 8-hour working shift of the working population, since in that period extra traffic occurs. The 24-hour periodicity indicates that there is higher traffic during day time than night time. Overall, benzene

Fig. 2. Non-stationary benzene hourly concentration time series, regression line and moving average


Fig. 3. Stationary benzene hourly concentration time series, regression line and moving average

Fig. 4. Periodogram of benzene hourly concentration time series (Patisia station)

Table 3. The most significant strength values and their corresponding periodicity

C/N   Strength     Cycles/Hour   Periodicity
1     2,36x10^7    0,0059844     7 days
2     1,92x10^7    0,125         8 hours
3     1,35x10^7    0,041667      24 hours
4     8,30x10^6    0,16667       6 hours


hourly concentrations seem to follow the life cycle of the city operation, as the latter is dictated by transportation. Taking this into account, it is evident that population exposure is expected to be non-negligible.

3.4 Artificial Neural Networks

In order to model and forecast benzene hourly concentrations, ANNs were applied. For this purpose, the type of neural network utilized was the multilayer perceptron using the backpropagation training algorithm, due to its successful performance in similar air quality related applications. The algorithm was implemented using the Matlab Neural Network Toolbox. Several (a total of 12) network configurations were implemented and tested during the training phase in order to avoid local minima [28].

Data to be imported to the ANN were first normalized by applying the hyperbolic tangent sigmoid transfer function, which was also applied on the hidden layers. On the output layer the linear transfer function was used. This is a common structure for function approximation (or regression) problems [29], but it has also shown good results in similar studies [6]. For the training phase the Levenberg-Marquardt backpropagation algorithm [30] is applied to 50% of the data. For the validation phase 25% of the data is used, and the remaining 25% is used for testing.

Two distinct methods were used to separate the data. In the first one, the data were separated successively, as shown in Figure 5: of every four hourly values, two are used for training, one for validation and one for testing. In the second method, the first 50% of the data is used for training, the next 25% for validation and finally the last 25% for testing, as shown in Figure 6 (a code sketch of both methods is given after Fig. 6).

Tr. Tst Tr. Val. … Tr. Tst Tr. Val.

Fig. 5. First separation Method (successively)

Training (50%) Validation (25%) Testing (25%)

Fig. 6. Second separation Method
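A sketch of the two data separation strategies is shown below (NumPy indexing on arrays of hourly samples); it mirrors Figs. 5 and 6 but is not the original Matlab code.

```python
import numpy as np

def split_successive(X, y):
    """Method 1 (Fig. 5): in every block of four hours, positions 0 and 2 go to
    training, position 1 to testing and position 3 to validation."""
    pos = np.arange(len(y)) % 4
    train, test, val = (pos == 0) | (pos == 2), pos == 1, pos == 3
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

def split_blocks(X, y):
    """Method 2 (Fig. 6): first 50% training, next 25% validation, last 25% testing."""
    n = len(y)
    a, b = n // 2, (3 * n) // 4
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])
```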

4 Results

Table 4 presents the performance results of the different network configurations for data separation using method 1. Configurations 10 and 12 give the best results. Thus, network configuration 10 is chosen, since it has one hidden layer, so it is simpler and faster in the training phase.


Table 4. Performance results of the different network configurations (Method 1)

Patisia Station   Hidden layer One   Hidden layer Two   Hidden layer Three   n   R   R2

Config. 1 5 - - 61 0,777 0,60

Config. 2 5 5 - 59 0,789 0,62

Config. 3 5 5 5 97 0,788 0,62

Config. 4 10 - - 66 0,785 0,62

Config. 5 10 10 - 52 0,793 0,63

Config. 6 10 10 10 33 0,786 0,62

Config. 7 15 - - 51 0,79 0,62

Config. 8 15 15 - 25 0,789 0,62

Config. 9 15 15 15 19 0,791 0,63

Config. 10 20 - - 66 0,794 0,63

Config. 11 20 20 - 16 0,79 0,62

Config. 12 20 20 20 14 0,794 0,63

Table 5. Performance results table of different network configurations (Method 2)

Patisia Station   Hidden layer One   Hidden layer Two   Hidden layer Three   n   R   R2

Config. 1 5 - - 25 0,764 0,58

Config. 2 5 5 - 45 0,775 0,60

Config. 3 5 5 5 24 0,777 0,60

Config. 4 10 - - 16 0,776 0,60

Config. 5 10 10 - 27 0,78 0,61

Config. 6 10 10 10 27 0,78 0,61

Config. 7 15 - - 43 0,785 0,62

Config. 8 15 15 - 21 0,777 0,60

Config. 9 15 15 15 19 0,778 0,61

Config. 10 20 - - 63 0,785 0,62

Config. 11 20 20 - 14 0,78 0,61

Config. 12 20 20 20 16 0,779 0,61


Table 5 presents the performance results of the different network configurations for data separation using method 2. Configurations 7 and 10, with one hidden layer, give the best results. Configuration 7 is chosen since it achieves the best performance with fewer neurons.

Table 6. Statistical indicators of the neural network with the benzene maximum allowed value set to 5 μg/m3 (Patisia station)

Indicator                                   Model 1   Model 2
Observed mean                               7,184     7,184
Predicted mean                              6,802     6,930
Observed standard deviation                 5,003     5,003
Predicted standard deviation                2,507     2,520
Normalised mean difference (NMD)            0,053     0,035
Root mean square error (RMSE)               3,449     3,412
RMSE systematic                             0,030     0,030
RMSE unsystematic                           0,038     0,038
Correlation coefficient                     0,78      0,785
Index of agreement (IA)                     0,772     0,777
Mean absolute error (MAE)                   2,176     2,163
Mean percentage error (MPE)                 0,510     0,523
Mean bias error (MBE)                       0,382     0,255
Number of observed alarms                   8164      8164
Number of predicted alarms                  9178      9425
Number of hours with correct alarm (A)      7559      7635
Correct alarm rate (%)                      82,4      81
Number of hours with false alarm (C)        1619      1790
False alarm rate (%)                        17,6      19
B                                           605       523
Critical success index (CSI)                0,773     0,767


Table 7. Statistical indicators of the neural network with the benzene maximum allowed value set to 10 μg/m3 (Patisia station)

Indicator                                   Model 1   Model 2
Observed mean                               7,184     7,184
Predicted mean                              6,802     6,930
Observed standard deviation                 5,003     5,003
Predicted standard deviation                2,507     2,520
Normalised mean difference (NMD)            0,053     0,035
Root mean square error (RMSE)               3,449     3,412
RMSE systematic                             0,030     0,030
RMSE unsystematic                           0,038     0,038
Correlation coefficient                     0,780     0,785
Index of agreement (IA)                     0,772     0,777
Mean absolute error (MAE)                   2,176     2,163
Mean percentage error (MPE)                 0,510     0,523
Mean bias error (MBE)                       0,382     0,255
Number of observed alarms                   3103      3103
Number of predicted alarms                  1972      2125
Number of hours with correct alarm (A)      1682      1794
Correct alarm rate (%)                      85,3      84,4
Number of hours with false alarm (C)        290       331
False alarm rate (%)                        14,7      15,6
B                                           1421      1309
Critical success index (CSI)                0,496     0,522

For the evaluation of the forecasting performance of the ANN models, a set of statistical indicators is used to provide a numerical description of the goodness of fit, as proposed by Willmott (1982, 1985) [1]. Table 6 presents the results of forecasting performance with the maximum allowed benzene concentration level set to 5 μg/m3, and Table 7 with the relevant criterion set to 10 μg/m3. It should be noted that these values were selected on the basis of the limit values currently posed by the EU [31] for the mean annual concentration (Directive 2000/69, Table 8), due to the absence of a relevant criterion for the hourly concentration level.

Observed and predicted standard deviations quantify the amount of the variance the model captures compared to the observed data. Two often mentioned measurements of residual error are the mean absolute error (MAE) and the root mean squared error (RMSE), which summarise the difference between the observed and modelled concentrations.

The latter can be decomposed into the systematic RMSE, which is due to the model performance and the predictors included, and the unsystematic RMSE, which is due to residuals, i.e. factors that cannot be controlled. A good model is considered to have an unsystematic RMSE much larger than its systematic RMSE. It is observed that our two models show little difference.

The false alarm rate is the percentage of times that an alert day was forecast but did not actually occur. A reliable model requires not only a large percentage of successful forecasts but also a percentage of false positives as small as possible (Jorquera et al., 1998) [1].

The critical success index (CSI, or threat score) verifies how well the high pollution events were predicted, and it is unaffected by the number of correct forecasts of non-events.
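To make the indicators reproducible, the sketch below computes them for paired observed/predicted hourly series. The chapter does not spell out the formulas, so the index of agreement and the systematic/unsystematic RMSE decomposition follow Willmott's standard (regression-based) definitions, and the alarm scores are derived from the A/B/C counts of Tables 6-7 (A = correct alarms, B = missed exceedances, C = false alarms); the threshold and the synthetic data are illustrative.

```python
import numpy as np

def forecast_indicators(obs, pred, limit):
    """Goodness-of-fit and alarm statistics for an hourly air-quality forecast.

    obs, pred : observed and predicted hourly concentrations
    limit     : maximum allowed concentration used for the alarm criterion
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    mae = np.mean(np.abs(pred - obs))
    mbe = np.mean(pred - obs)                                   # mean bias error
    obar = obs.mean()
    # index of agreement (Willmott): 1 - sum(P-O)^2 / sum(|P-Obar| + |O-Obar|)^2
    ia = 1.0 - np.sum((pred - obs) ** 2) / np.sum((np.abs(pred - obar) + np.abs(obs - obar)) ** 2)
    # systematic / unsystematic RMSE via the OLS fit P_hat = a + b*O (Willmott)
    b, a = np.polyfit(obs, pred, 1)
    phat = a + b * obs
    rmse_s = np.sqrt(np.mean((phat - obs) ** 2))
    rmse_u = np.sqrt(np.mean((pred - phat) ** 2))
    # alarm contingency counts: A = correct alarms, B = missed exceedances, C = false alarms
    obs_alarm, pred_alarm = obs > limit, pred > limit
    A = int(np.sum(obs_alarm & pred_alarm))
    B = int(np.sum(obs_alarm & ~pred_alarm))
    C = int(np.sum(~obs_alarm & pred_alarm))
    return {"RMSE": rmse, "RMSE_s": rmse_s, "RMSE_u": rmse_u, "MAE": mae, "MBE": mbe, "IA": ia,
            "correct_alarm_rate": A / max(A + C, 1),            # share of predicted alarms that verify
            "false_alarm_rate": C / max(A + C, 1),
            "CSI": A / max(A + B + C, 1)}

# illustrative call on synthetic hourly data (the study uses measured benzene series)
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 3.5, size=8760)
pred = 0.8 * obs + rng.normal(0.0, 2.0, size=8760)
print(forecast_indicators(obs, pred, limit=5.0))
```

With these definitions, the correct alarm rate, false alarm rate and CSI reproduce the relations among the A, B and C counts reported in Tables 6 and 7.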

Table 8. Benzene limit values according to Directive 2000/69/EC

Suggested benzene concentration limit values (μg/m3)

                                          2005  2006  2007  2008  2009  2010
Maximum allowed year mean value (μg/m3)   10    9     8     7     6     5

Overall, both models perform satisfactorily, with a correct alarm rate of more than 80%, among the highest reported in the related literature [6-10]. The “B” values (the number of observed limit exceedances that have not been predicted) are slightly lower for Model 2.

5 Summary and Conclusions

The primary goal of this study was the development of an efficient air quality forecasting model for benzene hourly concentration values. For this purpose, air pollutant concentrations and meteorological information have been used. As a first step, periodogram and covariance analysis supports the hypothesis that the basic source of benzene is traffic. Thus, related abatement measures should combine technological interventions in the vehicle system with traffic volume and speed management and restrictions. Following that analysis, a set of ANN models was constructed and tested, incorporating two different data separation (data model) strategies. According to the performance of the models, Model 1 (with twenty neurons in one hidden layer) has the best results (79,4% forecast ability). However, Model 2 has slightly better percentages of correct and false alarms than Model 1.


In summary, Model 1 has a better forecast ability. On the other hand, Model 2 allows a more reliable prediction of exceedances of the European Union limits, since it combines a large percentage of successful forecasts with a small percentage of falsely forecasted events.

Acknowledgements

The authors acknowledge the Ministry of Environment, Physical Planning and Public Works, Dept. of Air Quality, Athens, Greece, for providing the air quality and meteorological information.

References

[1] Slini, T., Karatzas, K., Papadopoulos, A.: Regression analysis and urban air quality forecasting: An application for the city of Athens. Global Nest: the Int. J. 4, 153–162 (2002)

[2] Gallant, S.I.: Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1(2), 179–191 (1990)

[3] Kohonen, T.: Self-Organizing Maps. Series in Information Sciences, vol. 30, 2nd edn. Springer, Heidelberg (1997)

[4] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989)

[5] Lippman, R.P.: An introduction to computing with neural nets. IEEE ASSP Magazine (1987)

[6] Slini, T., Karatzas, K., Mousiopoulos, N.: Correlation of air pollution and meteorological data using neural networks. Int. J. Environment and Pollution 20(1-6), 218–229 (2004)

[7] Chelani, A.B., Gajghate, D.G., Hasan, M.Z.: Prediction of ambient PM10 and toxic metals using artificial neural networks. J. of Air and Waste Management Ass. 52, 805–810 (2002)

[8] Comrie, A.C.: Comparing neural networks and regression models for ozone forecasting. J. of Air and Waste Management Ass. 47, 653–663 (1997)

[9] Kolhmainen, M., Martikainen, H., Ruuskanen, J.: Neural networks and periodic components used in air quality forecasting. Atmospheric Environment 35, 815–825 (2001)

[10] Perez, P., Trier, A.: Prediction of NO and NO2 concentrations near a street with heavy traffic in Santiago, Chile. Atmospheric Environment 35(21), 1783–1789 (2001)

[11] Bordignon, S., Gaetan, C., Lisi, F.: Nonlinear models for groundlevel ozone forecasting. Statistical Methods and Applications 11, 227–246 (2002)

[12] Huang, L.S., Smith, R.L.: Meteorologically-dependent trends in urban ozone, National Institute of Statistical Sciences, Technical Report 72 (1997)

[13] Kim, H.Y., Guldmann, J.M.: Modeling air quality in urban areas: A cell-based statistical approach. Geographical Analysis, 33 (2001)

[14] Chen, L.-J., Islam, S., Biswas, P.: Nonlinear dynamics of hourly ozone concentrations: Nonparametric short term prediction. Atmospheric Environment 32, 1839–1848 (1998)

[15] Shiva Nagendra, S.M., Khare, M.: Artificial neural network approach for modelling nitrogen dioxide dispersion from vehicular exhaust emissions. Ecological Modelling 190, 99–115 (2006)


[16] Zolghadri, A., Monsion, M., Henry, D., Marchionini, C., Petrique, O.: Development of an operational model-based warning system for tropospheric ozone concentrations in Bordeaux, France. Environmental Modelling & Software 19, 369–382 (2004)

[17] Kalapanidas, E., Avouris, N.: Short-term air quality prediction using a case-based classifier. Environmental Modelling & Software 16, 263–272 (2001)

[18] Barrero, M.A., Grimalt, J.O., Cantón, L.: Prediction of daily ozone concentration maxima in the urban atmosphere. Chemometrics and Intelligent Laboratory Systems 80, 67–76 (2006)

[19] Briggs, D.J., de Hoogh, C., Gulliver, J., Wills, J., Elliott, P., Kingham, S., Smallbone, K.: A regression-based method for mapping traffic-related air pollution: application and testing in four contrasting urban environments. The Science of the Total Environment 253, 151–167 (2000)

[20] Athanasiadis, I.N., Karatzas, K.D., Mitkas, P.A.: Classification techniques for air quality forecasting. In: 17th European Conference on Artificial Intelligence (2006)

[21] Kassomenos, P., Karakitsios, S., Papaloukas, C.: Estimation of daily traffic emissions in a South-European urban agglomeration during a workday. Evaluation of several “what if” scenarios. Science of The Total Environment 370, 480–490 (2006)

[22] Chatzis, C., Alexopoulos, E.C.: Indoor and outdoor personal exposure to benzene in Athens, Greece. Science of The Total Environment 349, 72–80 (2005)

[23] Roussos, P.L., Giannis, T.: Applied statistics in the social sciences, 2nd revised edn. Ellhnika Grammata, Athens (2006)

[24] Smith, L.I.: A tutorial on Principal Components Analysis (February 26, 2002)

[25] Lance, C., Butts, M., Michels, L.: The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods 9, 202–220 (2006)

[26] Garson, G.D.: Professor of Public Administration Editor, Social Science Computer Review, College of Humanities and Social Sciences, http://www2.chass.ncsu.edu/garson/pa765/factor.htm

[27] Wikipedia, The Free Encyclopedia, “Fourier transform”, http://en.wikipedia.org/wiki/Fourier_transform

[28] StatSoft, Inc. © Copyright, 1984-2003, “Neural Networks”, http://www.statsoft.com/textbook/stneunet.html

[29] Matlab Help. Keywords: Neural Network Toolbox

[30] Saini, L.M., Soni, M.K.: Artificial neural network based peak load forecasting using Levenberg-Marquardt and quasi-Newton methods. Generation, Transmission and Distribution, IEE Proceedings 149(5), 578–584 (2002)

[31] European Environment Agency and World Health Organisation, Air and Health - Local authorities, health and environment (1999), http://reports.eea.europa.eu/2599XXX/en/page008.html


Multiple Target Tracking for Mobile Robots Using the JPDAF Algorithm

Aliakbar Gorji, Mohammad Bagher Menhaj, and Saeed Shiry

Amirkabir University of Technology, 424 Hafez Ave., 15914 Tehran, Iran [email protected], [email protected], [email protected]

Abstract. Mobile robot localization is taken into account as one of the most important topics in robotics. In this paper, the localization problem is extended to the cases in which estimating the position of multiple robots is considered. To do so, the Joint Probabilistic Data Association Filter approach is applied to track the positions of multiple robots. To characterize the motion of each robot, we first develop a simple near constant velocity model and then a variable velocity based model is devised. The latter improves the performance of tracking when the robots change their velocity and conduct maneuvering movements. This in turn gives an advantage in exploring the movement of maneuvering objects, which is common in many robotics problems such as soccer or rescue robots. To highlight the merit of the proposed algorithm, we performed some case studies. Simulation results confirm the efficiency of the algorithm in tracking multiple mobile robots with maneuvering movements.

1 Introduction

The problem of estimating the position of a mobile robot in a physical environment is an important problem in mobile robotics. Indeed, because of the intrinsic error embedded in the dead-reckoning of a mobile robot, the data gathered from the odometry are not trustworthy and, therefore, some operations should be conducted on the crude data to reach more accurate knowledge about the position of a mobile robot. Nowadays, various algorithms have been proposed to improve the accuracy of localization, as briefly discussed below.

In the field of mobile robot localization, the Kalman filter approach [1] plays a significant role. The Kalman approach combines the odometry data and the measurements collected by the sensors of a mobile robot to provide an accurate estimation of the robot's position. Although the Kalman approach has provided a general strategy to solve localization problems, and it is still in use in many localization problems such as mobile robot tracking [2] and Simultaneous Localization and Mapping [3], it suffers from some weaknesses. The most significant problem of the Kalman method is its weakness in the state estimation of many real systems in the presence of severe nonlinearities. Actually, in general nonlinear systems, the posterior probability density function of the states given the measurements cannot be represented by a Gaussian


density function and, therefore, the Kalman method results in unsatisfactory responses. Moreover, the Kalman filter is very sensitive to the initial conditions of the states. Accordingly, Kalman methods do not provide accurate results in many common global localization problems.

More recently, particle filters have been introduced to estimate non-Gaussian, nonlinear dynamic processes [4]. The key idea of particle filters is to represent the state by sets of samples (or particles). The major advantage of this technique is that it can represent multi-modal state densities, a property which has been shown to increase the robustness of the underlying state estimation process. Moreover, particle filters are less sensitive to the initial values of the states than the Kalman method is. This issue is very important in applications in which the initial values of the states cannot be properly chosen, such as global localization problems. The aforementioned power of particle filters in the state estimation of nonlinear systems has caused them to be applied with great success to different state estimation problems including visual tracking [5], mobile robot localization [6], control and scheduling [7], target tracking [8] and dynamic probabilistic networks [9].

Multiple robot tracking is another attractive issue in the field of mobile robotics. This problem can be introduced as the tracking of mobile robots' positions by another robot, the so-called reference robot. Although this problem appears similar to the common localization problems, the traditional approaches cannot be used, because the reference robot, in order to predict the positions of the other robots, does not have access to the odometry data of these robots used in localization algorithms. Moreover, using the particle filter to estimate the position of all robots, also known as the combined state space approach, is not tractable because the complexity of this approach grows exponentially as the number of objects being tracked increases.

To remedy the problem of the lack of odometry data, various motion models have been proposed in which no knowledge about external inputs, such as odometry, is necessary. The near constant velocity model is the simplest one. This model has been widely used in many tracking applications such as aircraft tracking [10], people tracking [11] and some other target tracking applications [12]. Besides this model, some other structures have been proposed that are usable in special applications. The near constant acceleration model is a modified version of the previous model which can be applied to the tracking of maneuvering objects [13]. Also, interactive multiple model (IMM) filters represent another tool to track the movement of highly maneuvering objects where the formerly proposed approaches are not reliable [14].

Recently, many strategies have been proposed in the literature to address the difficulties associated with multi-target tracking. The key idea in the literature is to estimate each object's position separately. To do so, the measurements should be assigned to each target. Therefore, data association has been introduced to relate each target to its associated measurements. By combining the data association concept with filtering and state estimation algorithms, the multiple target tracking issue can be dealt with more easily than by applying a state estimation procedure to the combined state space model. Among the various methods, the Joint Probabilistic Data Association Filter (JPDAF) algorithm is considered one of the most attractive approaches. The traditional JPDAF [15] uses Extended Kalman Filtering (EKF) jointly with the data association to estimate the states of multiple objects. However, the performance of the algorithm degrades as the nonlinearities become more severe. To enhance the performance of


the JPDAF against severe nonlinearity of the dynamical and sensor models, strategies have recently been proposed to combine the JPDAF with particle techniques, known as the Monte Carlo JPDAF, to accommodate general non-linear and non-Gaussian models [16]. The newer versions of the above mentioned algorithm use an approach named soft gating to reduce the computational cost, which is one of the most serious problems in online tracking algorithms [16]. Besides the Monte Carlo JPDAF algorithm, some other methods have been proposed in the literature. In [16], a comprehensive survey is presented of the most well-known algorithms applicable in the field of multiple target tracking.

The Monte Carlo JPDAF algorithm has been used in many applications. In [11], the algorithm, presented as the sample-based JPDAF, has been applied to the multiple people tracking problem. In [17], the JPDAF has been used for multi-target detection and tracking using a laser scanner. However, the problem of tracking multiple robots by a reference robot has not been discussed in the literature. This issue is very important in many multiple robot problems such as robot soccer. In this paper, the Monte Carlo JPDAF algorithm is considered for the multiple robot tracking problem.

The rest of the paper is organized as follows. Section 2 describes the Monte Carlo JPDAF algorithm for multiple target tracking. The JPDAF for multiple robot tracking is the topic of Section 3. In this section, we will show how the motion of robots in a real environment can be described. To do so, near constant velocity and acceleration models are introduced to describe the movement of mobile robots. Then, it will be shown how a reference robot can track the motion of the other robots. In Section 4, various simulation results are presented to support the performance of the algorithm. In this section, the maneuvering movement of the mobile robots is also considered. The results show the suitable performance of the JPDAF algorithm in tracking the motion of mobile robots in the simulated environment. Finally, Section 5 concludes the paper.

2 The JPDAF Algorithm for Multiple Target Tracking

In this section, the JPDAF algorithm, considered one of the most successful and widely applied strategies for multi-target tracking under data association uncertainty, is presented. The Monte Carlo version of the JPDAF algorithm uses the common particle filter approach to estimate the posterior density function of the states given the measurements, $P(x_t|y_{1:t})$. Now, consider the problem of tracking $N$ objects. $x_t^k$ denotes the state of these objects at time $t$, where $k=1,2,\ldots,N$ is the target number. Furthermore, the movement of each target can be considered by a general nonlinear state space model as follows:

$$x_{t+1} = f(x_t) + v_t$$
$$y_t = g(x_t) + w_t \qquad (1)$$

where the signals $v_t$ and $w_t$ are white noise with covariance matrices $Q$ and $R$, respectively. Furthermore, in the above equation, $f$ and $g$ are general nonlinear functions representing the dynamic behavior of the target and the sensor model. The JPDAF algorithm recursively updates the marginal filtering distribution for each target, $P(x_t^k|y_{1:t})$, $k=1,2,\ldots,N$, instead of computing the joint filtering distribution


$P(X_t|y_{1:t})$, with $X_t = x_t^1, \ldots, x_t^N$. To compute the above distribution function, the following points should be considered:

• It is very important to know how to assign each target state to a measurement. Indeed, in each iteration the sensor provides a set of measurements. The source of these measurements can be the targets or disturbances, also known as clutters. Therefore, a special procedure is needed to assign each target to its associated measurement. This procedure is designated as Data Association and is considered the key stage of the JPDAF algorithm, described later.

• Because the JPDAF algorithm recursively updates the estimated states, a recursive solution should be applied to update the states at each sample time. Traditionally, Kalman filtering has been a strong tool for recursively estimating the states of the targets in a multi-target tracking scenario. Recently, particle filters joint with the data association strategy have provided better estimates, especially when the sensor model is nonlinear.

With regard to the above points, the following sections describe how particle filters combined with the data association concept can deal with the multi-target tracking problem.

2.1 The Particle Filter for Online State Estimation

Consider the problem of online state estimation as computing the posterior probability density function $P(x_t^k|y_{1:t})$. To provide a recursive formulation for computing this density function, the following stages are presented:

• Prediction stage: the prediction step proceeds independently for each target as:

$$P(x_t^k \mid y_{1:t-1}) = \int P(x_t^k \mid x_{t-1}^k)\,P(x_{t-1}^k \mid y_{1:t-1})\,dx_{t-1}^k \qquad (2)$$

• Update stage: the update step can be described as follows:

$$P(x_t^k \mid y_{1:t}) \propto P(y_t \mid x_t^k)\,P(x_t^k \mid y_{1:t-1}) \qquad (3)$$

The particle filter estimates the probability density function $P(x_t^k \mid y_{1:t-1})$ by sampling from a specific distribution function as follows:

$$P(x_t^k \mid y_{1:t-1}) = \sum_{i=1}^{N} w_t^i\,\delta(x_t - x_t^i) \qquad (4)$$

where $i = 1,2,\ldots,N$ is the sample number, $w_t^i$ is the normalized importance weight and $\delta$ is the Dirac delta function. In the above equation, the state $x_t^i$ is sampled from the proposal density function $q(x_t^k \mid x_{t-1}^k, y_{1:t})$. By substituting the above equation into equation (2), and using the fact that the states are drawn from the proposal function $q$, the recursive equation for the prediction step can be written as follows:

$$\alpha_t^i = \alpha_{t-1}^i\,\frac{P(x_t^k \mid x_{t-1}^{k,i})}{q(x_t^k \mid x_{t-1}^{k,i}, y_{1:t})} \qquad (5)$$


where $x_{t-1}^{k,i}$ is the $i$th sample of $x_{t-1}^k$. Now, by using equation (3), the update stage can be expressed as the recursive adjustment of the importance weights as follows:

$$w_t^i = \alpha_t^i\,P(y_t \mid x_t^k) \qquad (6)$$

By repeating the above procedure at each time step, the sequential importance sampling (SIS) algorithm for online state estimation is obtained, as presented below:

The SIS Algorithm

1. For i = 1:N, initialize the states $x_0^i$, the prediction weights $\alpha_0^i$ and the importance weights $w_0^i$.
2. At each time step t, proceed through the following stages.
3. Sample the states from the proposal density function as follows:
$$x_t^i \sim q(x_t \mid x_{t-1}^i, y_{1:t}) \qquad (7)$$
4. Update the prediction weights according to equation (5).
5. Update the importance weights according to equation (6).
6. Normalize the importance weights as follows:
$$w_t^i = \frac{w_t^i}{\sum_{i=1}^{N} w_t^i} \qquad (8)$$
7. Set t = t + 1 and go to step 3.
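To make the recursion concrete, the sketch below runs the SIS steps (equations (5)-(8)) on a simple scalar nonlinear model of the form of equation (1). The model functions f and g, the noise levels and the particle count are illustrative, and the transition prior is used as the proposal q, in which case the ratio in equation (5) equals one and the prediction weights stay constant.

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative scalar model x_{t+1} = f(x_t) + v_t, y_t = g(x_t) + w_t
f = lambda x: 0.5 * x + 25 * x / (1 + x ** 2)
g = lambda x: x ** 2 / 20
Q, R = 1.0, 1.0

def sis_step(particles, alpha, y):
    """One SIS iteration (steps 3-6) with the transition prior as the proposal."""
    # step 3: sample from q(x_t | x_{t-1}, y_{1:t}) = p(x_t | x_{t-1})
    particles = f(particles) + rng.normal(0.0, np.sqrt(Q), particles.size)
    # step 4: alpha_t = alpha_{t-1} * p/q = alpha_{t-1} for the prior proposal (eq. 5)
    # step 5: w_t^i = alpha_t^i * p(y_t | x_t^i)                               (eq. 6)
    lik = np.exp(-0.5 * (y - g(particles)) ** 2 / R) / np.sqrt(2 * np.pi * R)
    w = alpha * lik
    # step 6: normalize                                                        (eq. 8)
    w /= w.sum()
    return particles, alpha, w

# run on a short simulated measurement sequence
N, T = 500, 50
x_true = 0.0
particles = rng.normal(0.0, 1.0, N)
alpha = np.full(N, 1.0 / N)
for t in range(T):
    x_true = f(x_true) + rng.normal(0.0, np.sqrt(Q))
    y = g(x_true) + rng.normal(0.0, np.sqrt(R))
    particles, alpha, w = sis_step(particles, alpha, y)
    estimate = np.sum(w * particles)
print("final estimate:", estimate, "truth:", x_true)
```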

In the SIS algorithm above, for simplicity, the index k has been omitted. The main failure of the SIS algorithm is the degeneracy problem. That is, after a few iterations one of the normalized importance ratios tends to 1, while the remaining ratios tend to zero. This problem causes the variance of the importance weights to increase stochastically over time [18]. To avoid the degeneracy of the SIS algorithm, a selection (resampling) stage may be used to eliminate samples with low importance weights and multiply samples with high importance weights. There are many approaches to implement the resampling stage [18]. Among them, residual resampling provides a straightforward strategy to solve the degeneracy problem in the SIS algorithm. Using residual resampling jointly with the SIS algorithm yields the SIR algorithm given below:

The SIR Algorithm

1. For i = 1:N, initialize the states $x_0^i$, the prediction weights $\alpha_0^i$ and the importance weights $w_0^i$.
2. At each time step t, perform the following stages.
3. Run the SIS algorithm to sample the states $x_t^i$ and compute the normalized importance weights $w_t^i$.
4. Check the resampling criterion:
   • If $N_{eff} > thresh$, continue with the SIS algorithm. Else
   • Implement the residual resampling stage to multiply/suppress the samples $x_t^i$ with high/low importance weights.
   • Set the new normalized importance weights to $w_t^i = 1/N$.
5. Set t = t + 1 and go to step 3.

In the above algorithm, $N_{eff}$ is a criterion for checking the degeneracy problem, which can be written as:

$$N_{eff} = \frac{1}{\sum_{i=1}^{N} (w_t^i)^2} \qquad (9)$$

In [18], a comprehensive discussion is given of how to implement the residual resampling stage.
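A short sketch of the degeneracy check of equation (9) and of one common formulation of residual resampling (deterministic floor(N·w_i) copies of each particle plus a multinomial draw on the residual weights); [18] discusses the variants, and the threshold N/2 used below is only an example.

```python
import numpy as np

def effective_sample_size(w):
    """N_eff = 1 / sum(w_i^2) for normalized weights, equation (9)."""
    return 1.0 / np.sum(np.asarray(w) ** 2)

def residual_resample(particles, w, rng):
    """Residual resampling: deterministic copies plus a multinomial draw on the residuals."""
    N = len(w)
    counts = np.floor(N * w).astype(int)            # guaranteed copies of each particle
    residual = N * w - counts                       # leftover probability mass
    n_rest = N - counts.sum()
    if n_rest > 0:
        residual /= residual.sum()
        counts += rng.multinomial(n_rest, residual)
    idx = np.repeat(np.arange(N), counts)
    return particles[idx], np.full(N, 1.0 / N)      # reset weights to 1/N (step 4)

# usage inside the SIR loop (illustrative threshold thresh = N/2)
rng = np.random.default_rng(2)
particles = rng.normal(size=100)
w = rng.random(100); w /= w.sum()
if effective_sample_size(w) < len(w) / 2:
    particles, w = residual_resample(particles, w, rng)
print(effective_sample_size(w))
```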

Besides the SIR algorithm, some other approaches have been proposed in the literature to enhance the quality of the SIR algorithm, such as Markov Chain Monte Carlo particle filters, hybrid SIR and auxiliary particle filters [19]. Although these methods are more accurate than the common SIR algorithm, their computational cost is the most significant reason why, in many real-time applications such as online tracking, the traditional SIR algorithm is applied for recursive state estimation.

2.2 Data Association

In the last section, the SIR algorithm was presented for online state estimation. However, the major problem of the proposed algorithm is how one can compute the likelihood function $P(y_t \mid x_t^k)$. To do so, an association should be made between the measurements and the targets. Briefly, the association can be defined as target-to-measurement and measurement-to-target association:

Definition 1: We will denote a target-to-measurement association (T→M) by $\lambda = \{r, m_c, m_T\}$, where $r = r_1, \ldots, r_K$ and $r_k$ is defined as follows:

$$r_k = \begin{cases} 0 & \text{if the $k$th target is undetected} \\ j & \text{if the $k$th target generates the $j$th measurement} \end{cases}$$

where $j = 1,2,\ldots,m$, $m$ is the number of measurements at each time step, $k = 1,2,\ldots,K$ and $K$ is the number of targets.

Definition 2: In a similar fashion, the measurement-to-target association (M→T) is defined as $\lambda = \{r, m_c, m_T\}$, where $r = r_1, r_2, \ldots, r_m$ and $r_j$ is defined as follows:

$$r_j = \begin{cases} 0 & \text{if the $j$th measurement is not associated with any target} \\ k & \text{if the $j$th measurement is associated with the $k$th target} \end{cases}$$


In both of the above equations, $m_T$ is the number of measurements due to the targets and $m_c$ is the number of clutters. It is easy to show that both definitions are equivalent, but the dimension of the target-to-measurement association is smaller than that of the measurement-to-target association. Therefore, in this paper, the target-to-measurement association is used. Now, the likelihood function for each target can be written as:

$$P(y_t \mid x_t^k) = \beta^0 + \sum_{i=1}^{m} \beta^i\,P(y_t^i \mid x_t^k) \qquad (10)$$

In the above equation, $\beta^i$ is defined as the probability that the $i$th measurement is assigned to the $k$th target. Therefore, $\beta^i$ is written as:

$$\beta^i = P(r_k = i \mid y_{1:t}) \qquad (11)$$

Before presenting the computation of the above equation, the following definition is explained:

Definition 3: We define the set $\Lambda$ as the set of all possible assignments that can be made between the measurements and the targets. For example, consider a 3-target tracking problem and assume that the following candidate associations are recognized between the targets and the measurements:

$$r_1 = 1, 2, \qquad r_2 = 3, \qquad r_3 = 0.$$

Now, the set $\Lambda$ is shown in Table 1.

Table 1. All possible joint associations for the example of Definition 3

Target 1   Target 2   Target 3
0          0          0
0          3          0
1          0          0
1          3          0
2          0          0
2          3          0

In Table 1, 0 means that the target has not been detected. By using the above definition, equation (11) can be rewritten as:

$$P(r_k = i \mid y_{1:t}) = \sum_{\lambda_t \in \Lambda_t,\; r_k = i} P(\lambda_t \mid y_{1:t}) \qquad (12)$$

The right-hand side of the above equation is written as follows:

$$P(\lambda_t \mid y_{1:t}) = \frac{P(y_t \mid y_{1:t-1}, \lambda_t)\,P(y_{1:t-1}, \lambda_t)}{P(y_{1:t})} \propto P(y_t \mid y_{1:t-1}, \lambda_t)\,P(\lambda_t) \qquad (13)$$

In the above equation, we have used the fact that the association vector $\lambda_t$ does not depend on the history of measurements. Each of the density functions on the right-hand side of the above equation can be presented as follows:


• $P(\lambda_t)$: the above density function can be written as:

$$P(\lambda_t) = P(r_t, m_c, m_T) = P(r_t \mid m_c, m_T)\,P(m_c)\,P(m_T) \qquad (14)$$

The computation of each density function in the right side of the above equation is straightforward and can be found in [16].

• $P(y_t \mid y_{1:t-1}, \lambda_t)$: because the targets are considered independent, the above density function can be written as follows:

$$P(y_t \mid y_{1:t-1}, \lambda_t) = (V_{max})^{-m_c} \prod_{j \in \Gamma} P^{r_t^j}(y_t^j \mid y_{1:t-1}) \qquad (15)$$

where $V_{max}$ is the maximum volume which is in the line of sight of the sensors, $\Gamma$ is the set of all valid measurement-to-target associations and $P^{r_t^j}(y_t^j \mid y_{1:t-1})$ is the predictive likelihood for the $(r_t^j)$th target. Now, consider $r_t^j = k$. The predictive likelihood for the $k$th target can be formulated as follows:

$$P^k(y_t^j \mid y_{1:t-1}) = \int P(y_t^j \mid x_t^k)\,P(x_t^k \mid y_{1:t-1})\,dx_t^k \qquad (16)$$

Both density functions on the right-hand side of the above equation are estimated using the samples drawn from the proposal density function. However, the main problem is how one can determine the association between the measurements ($y_t$) and the targets. To do so, the soft gating method is proposed as the following algorithm:

Soft Gating

1. Consider $x_t^{i,k}$, $i = 1,2,\ldots,N$, as the samples drawn from the proposal density function.
2. For j = 1:m, perform the following steps for the $j$th measurement.
3. For k = 1:K, perform the following steps.
4. Compute $\mu_j^k$ as follows:
$$\mu_j^k = \sum_{i=1}^{N} \alpha_t^{i,k}\, g(x_t^{i,k}) \qquad (17)$$
where $g$ is the sensor model and $\alpha_t^{i,k}$ is the normalized prediction weight as presented in the SIR algorithm.
5. Compute $\sigma_j^k$ as follows:
$$\sigma_j^k = R + \sum_{i=1}^{N} \alpha_t^{i,k}\,\big(g(x_t^{i,k}) - \mu_j^k\big)\big(g(x_t^{i,k}) - \mu_j^k\big)^T \qquad (18)$$
6. Compute the distance of the $j$th measurement to the $k$th target as follows:
$$d_j^2 = \frac{1}{2}\,(y_t^j - \mu_j^k)^T (\sigma_j^k)^{-1} (y_t^j - \mu_j^k) \qquad (19)$$
7. If $d_j^2 < \varepsilon$, assign the $j$th measurement to the $k$th target.
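A sketch of the soft gating procedure for one time step: for every target it computes the predicted measurement mean and covariance of equations (17)-(18) from the weighted particles and gates each measurement by the scaled Mahalanobis distance of equation (19). The sensor model g, the noise covariance R, the threshold ε and the example numbers are illustrative placeholders.

```python
import numpy as np

def soft_gating(particles, alpha, measurements, g, R, eps):
    """particles[k]: (N, dx) samples for target k; alpha[k]: (N,) normalized weights.
    measurements: (m, dz) array. Returns candidates[k] = gated measurement indices,
    plus the per-target predicted mean/covariance (mu, sigma) of eq. (17)-(18)."""
    K = len(particles)
    candidates = [[] for _ in range(K)]
    stats = []
    for k in range(K):
        z_pred = np.array([g(x) for x in particles[k]])          # predicted measurements
        mu = alpha[k] @ z_pred                                    # eq. (17)
        diff = z_pred - mu
        sigma = R + (alpha[k][:, None] * diff).T @ diff           # eq. (18)
        sigma_inv = np.linalg.inv(sigma)
        stats.append((mu, sigma))
        for j, y in enumerate(measurements):
            d2 = 0.5 * (y - mu) @ sigma_inv @ (y - mu)            # eq. (19)
            if d2 < eps:                                          # gate
                candidates[k].append(j)
    return candidates, stats

# tiny illustrative example: two 2-D position targets, position-only sensor
rng = np.random.default_rng(3)
g = lambda x: x[:2]                       # hypothetical sensor: observe (x, y)
R = 0.1 * np.eye(2)
particles = [rng.normal([0, 0, 0, 0], 0.3, (200, 4)),
             rng.normal([5, 5, 0, 0], 0.3, (200, 4))]
alpha = [np.full(200, 1 / 200), np.full(200, 1 / 200)]
measurements = np.array([[0.1, -0.2], [5.2, 4.9], [9.0, 9.0]])    # the last one is clutter
print(soft_gating(particles, alpha, measurements, g, R, eps=9.0)[0])
```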


It is easy to show that the predictive likelihood presented in equation (16) can be approximated as follows:

$$P^k(y_t^j \mid y_{1:t-1}) \approx \mathcal{N}(\mu^k, \sigma^k) \qquad (20)$$

where $\mathcal{N}(\mu^k, \sigma^k)$ is a normal distribution with mean $\mu^k$ and covariance matrix $\sigma^k$.

By computing the predictive likelihood, the likelihood density function is easily estimated. In the next subsection, the JPDAF algorithm is presented for multi-target tracking.

2.3 The JPDAF Algorithm

The mathematical foundations of the JPDAF algorithm were discussed in the last sections. Now, we are ready to propose the full JPDAF algorithm for the problem of multiple target tracking. To do so, each target is characterized by the dynamical model introduced by equation (1). The JPDAF algorithm is presented below:

The JPDAF Algorithm

1. Initialization: initialize the states for each target as $x_0^{i,k}$, where $i = 1,2,\ldots,N$ and $k = 1,2,\ldots,K$, the predictive importance weights $\alpha_0^{i,k}$ and the importance weights $w_0^{i,k}$.
2. At each time step t, proceed through the following stages.
3. For i = 1:N, do the following steps.
4. For k = 1:K, conduct the following procedures for each target.
5. Sample the new states from the proposal density function as follows:
$$x_t^{i,k} \sim q(x_t \mid x_{t-1}^{i,k}, y_{1:t}) \qquad (21)$$
6. Update the predictive importance weights as follows:
$$\alpha_t^{i,k} = \alpha_{t-1}^{i,k}\,\frac{P(x_t^{i,k} \mid x_{t-1}^{i,k})}{q(x_t^{i,k} \mid x_{t-1}^{i,k}, y_{1:t})} \qquad (22)$$
Then, normalize the predictive importance weights.
7. Use the sampled states and the new observations, $y_t$, to constitute the association vector for each target as $R_k = \{\,j \mid 0 \le j \le m,\ y^j \rightarrow k\,\}$, where $(\rightarrow k)$ refers to the association between the $k$th target and the $j$th measurement. To do so, use the soft gating procedure described in the last subsection.
8. Constitute all possible associations for the targets and form the set $\Gamma$, as described by Definition 3.
9. Use equation (13) to compute $\beta^l$ for each measurement, where $l = 1,2,\ldots,m$ and $m$ is the number of measurements.
10. Use equation (10) to compute the likelihood for each target as $P(y_t \mid x_t^{i,k})$.
11. Compute the importance weights for each target as follows:
$$w_t^{i,k} = \alpha_t^{i,k}\,P(y_t \mid x_t^{i,k}) \qquad (23)$$
Then, normalize the importance weights as follows:
$$w_t^{i,k} = \frac{w_t^{i,k}}{\sum_{i=1}^{N} w_t^{i,k}} \qquad (24)$$
12. Proceed to the resampling stage. To do so, implement a procedure similar to the one described in the SIR algorithm. Afterwards, for each target, the resampled states can be presented as follows:
$$\{w_t^{i,k}, x_t^{i,k}\} \rightarrow \Big\{\frac{1}{N}, x_t^{m(i),k}\Big\} \qquad (25)$$
13. Set the time as t = t + 1 and go to step 3.
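The enumeration of joint association events in steps 8-9 is the least obvious part of the algorithm, so the sketch below works it out for a small example: it enumerates the gated candidates per target (Definition 3), discards events in which one measurement serves two targets, weights each event by the product of Gaussian predictive likelihoods as in equations (15) and (20), and sums the event probabilities into the per-target β weights of equations (11)-(12). The detection and clutter priors of equation (14) are collapsed into two constants (p_detect and a clutter density), which is a simplification; the helper gauss_pdf, all parameter values and the example numbers are illustrative, not taken from the chapter.

```python
import itertools
import numpy as np

def gauss_pdf(y, mu, sigma):
    """Multivariate normal density, used for the predictive likelihood of eq. (20)."""
    d = y - mu
    k = mu.size
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d))
                 / np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma)))

def beta_weights(candidates, mus, sigmas, measurements, p_detect=0.9, clutter=1e-3):
    """Per-target association probabilities (steps 7-9).

    candidates[k] : gated measurement indices for target k (soft gating output)
    mus, sigmas   : predictive mean/covariance of each target, eq. (20)
    Returns beta[k, r] with r = 0 meaning 'undetected' and r >= 1 meaning measurement r-1.
    """
    K, m = len(candidates), len(measurements)
    lik = {(k, j): gauss_pdf(measurements[j], mus[k], sigmas[k])
           for k in range(K) for j in candidates[k]}
    options = [[0] + [j + 1 for j in candidates[k]] for k in range(K)]
    beta = np.zeros((K, m + 1))
    for event in itertools.product(*options):         # joint association events (Definition 3)
        used = [r - 1 for r in event if r > 0]
        if len(used) != len(set(used)):                # a measurement cannot serve two targets
            continue
        p = 1.0
        for k, r in enumerate(event):
            # detected: P_D * predictive likelihood / clutter density; undetected: 1 - P_D
            p *= p_detect * lik[(k, r - 1)] / clutter if r > 0 else (1.0 - p_detect)
        for k, r in enumerate(event):
            beta[k, r] += p
    return beta / beta.sum(axis=1, keepdims=True)      # normalization, eq. (11)-(12)

# small example: two targets whose predicted measurements sit near measurements 0 and 1
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
sigmas = [0.2 * np.eye(2), 0.2 * np.eye(2)]
measurements = np.array([[0.1, -0.2], [5.2, 4.9], [9.0, 9.0]])     # the last one is clutter
print(beta_weights([[0], [1]], mus, sigmas, measurements))
```

The resulting row beta[k, :] supplies the $\beta^0, \beta^1, \ldots, \beta^m$ coefficients of equation (10) for target k, after which steps 10-12 proceed exactly as in the SIR algorithm.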

The JPDAF algorithm above can be used for the multiple target tracking problem. In the next section, we show how the JPDAF algorithm can be used for multiple robot tracking. In addition, some well-known motion models are presented to track the motion of a mobile robot.

3 The JPDAF Algorithm for Multiple Robot Tracking

In the last section, the JPDAF algorithm was completely discussed. Now, we want to apply the JPDAF algorithm to the problem of multiple robot tracking. To do so, consider a simple 2-wheel differential mobile robot whose dynamical model is represented as follows:

$$x_{t+1} = x_t + \frac{\Delta s_r + \Delta s_l}{2}\,\cos\!\Big(\theta_t + \frac{\Delta s_r - \Delta s_l}{2b}\Big)$$
$$y_{t+1} = y_t + \frac{\Delta s_r + \Delta s_l}{2}\,\sin\!\Big(\theta_t + \frac{\Delta s_r - \Delta s_l}{2b}\Big)$$
$$\theta_{t+1} = \theta_t + \frac{\Delta s_r - \Delta s_l}{b} \qquad (26)$$

where $[x_t, y_t]$ is the position of the robot, $\theta_t$ is the heading angle of the robot, $\Delta s_r$ and $\Delta s_l$ are the distances traveled by each wheel, and $b$ refers to the distance between the two wheels of the robot. The above equation describes a simple model presenting the motion of a differential mobile robot. For single mobile robot localization, the most straightforward way is to use this model and the data collected from the sensors mounted on the left and right wheels, measuring $\Delta s_r$ and $\Delta s_l$ at each time step. But this method alone does not yield satisfactory results, because the data gathered from the sensors are corrupted by additive noise and, therefore, the estimated trajectory does not match the actual trajectory. To solve this problem, measurements obtained from the sensors are used to modify the estimated states. Therefore, Kalman and particle filters have been widely applied to the problem of mobile robot localization [2, 6]. Now, suppose the case in which the positions of other robots should be identified by a reference robot. In this situation, the dynamical model discussed previously is not applicable, because the reference robot does not have access to the internal sensors of the other


robots, such as the sensors measuring the movement of each wheel. Therefore, a motion model should be defined for the movement of each mobile robot. The simplest choice is the near constant velocity model, presented as follows:

$$X_{t+1} = A X_t + B u_t, \qquad X_t = [x_t, \dot{x}_t, y_t, \dot{y}_t]^T \qquad (27)$$

where the system matrices are defined as:

$$A = \begin{pmatrix} 1 & T_s & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T_s \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} T_s^2/2 & 0 \\ T_s & 0 \\ 0 & T_s^2/2 \\ 0 & T_s \end{pmatrix} \qquad (28)$$

where $T_s$ refers to the sample time. In the above equations, $u_t$ is white noise with zero mean and an arbitrary covariance matrix. Because the model is considered a near constant velocity one, the covariance of the additive noise cannot be too large. Indeed, this model is suitable for the movement of targets with a nearly constant velocity, which is common in many applications such as aircraft path tracking [5] and people tracking [11].
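A minimal helper, assuming a sample time and a small noise level chosen only for illustration, that builds the matrices of equation (28) and propagates the state of equation (27) one step:

```python
import numpy as np

def ncv_matrices(Ts):
    """System matrices of the near constant velocity model, eq. (28)."""
    A = np.array([[1, Ts, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, Ts],
                  [0, 0,  0, 1]], dtype=float)
    B = np.array([[Ts**2 / 2, 0],
                  [Ts,        0],
                  [0, Ts**2 / 2],
                  [0,        Ts]], dtype=float)
    return A, B

# propagate X = [x, x_dot, y, y_dot] one step with white acceleration noise u_t
rng = np.random.default_rng(4)
Ts = 1.0
A, B = ncv_matrices(Ts)
X = np.array([5.0, 0.1, 5.0, -0.1])
u = rng.normal(0.0, 0.05, 2)         # small noise covariance, as the text requires
X_next = A @ X + B @ u
print(X_next)
```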

The movement of a mobile robot can be modeled by the near constant velocity model in many conditions. However, in some special cases the robots' movement cannot be characterized by a simple near constant velocity model. For example, in a robot soccer problem, the robots conduct a maneuvering movement to reach a special target, ordinarily the ball. In these cases, the robot changes its orientation by varying the input forces imposed on the right and left wheels. Therefore, the motion trajectory of the robot is so complex that the elementary constant velocity model does not yield satisfactory responses. To overcome this problem, the variable velocity model is proposed. The key idea behind this model is using the robot's acceleration as another state variable, which should be estimated as well as the velocity and position of the robot. Therefore, the new state vector is defined as $X_t = [x_t, \dot{x}_t, a_t^x, y_t, \dot{y}_t, a_t^y]$, where $a_t^x$ and $a_t^y$ are the robot's accelerations along the x and y axes, respectively. Now, the near constant acceleration model can be written as in equation (27), except that the system matrices are defined as follows [13]:

$$A = \begin{pmatrix} 1 & 0 & T_s & 0 & a_1 & 0 \\ 0 & 1 & 0 & T_s & 0 & a_1 \\ 0 & 0 & 1 & 0 & a_2 & 0 \\ 0 & 0 & 0 & 1 & 0 & a_2 \\ 0 & 0 & 0 & 0 & e^{-T_s} & 0 \\ 0 & 0 & 0 & 0 & 0 & e^{-T_s} \end{pmatrix}, \qquad B = \begin{pmatrix} b_1 & 0 \\ 0 & b_1 \\ b_2 & 0 \\ 0 & b_2 \\ b_3 & 0 \\ 0 & b_3 \end{pmatrix} \qquad (29)$$


where the parameters of the above equation are defined as follows:

$$b_3 = \frac{1}{c}\big(1 - e^{-cT_s}\big), \qquad a_2 = b_3, \qquad b_2 = \frac{1}{c}(T_s - a_2), \qquad a_1 = b_2, \qquad b_1 = \frac{1}{c}\Big(\frac{T_s^2}{2} - a_1\Big) \qquad (30)$$

In the above equation, c is a constant value. The above model can be used to track the motion trajectory of a maneuvering object, such as a moving mobile robot. Moreover, combining the results of the proposed models can enhance the performance of tracking. This idea can be presented as follows:

$$X_t = \alpha X_t^a + (1 - \alpha) X_t^v \qquad (31)$$

where $X_t^a$ and $X_t^v$ are the estimates of the near constant acceleration and near constant velocity models, respectively, and $\alpha$ is a number between 0 and 1.
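The same can be done for the variable velocity model: the sketch below builds the matrices of equations (29)-(30) and forms the combined estimate of equation (31). The constant c, the sample time, the mixing weight α and the example state vectors are illustrative, and the state ordering follows the row/column layout of equation (29).

```python
import numpy as np

def nca_matrices(Ts, c):
    """Near constant acceleration model, equations (29)-(30)."""
    b3 = (1.0 - np.exp(-c * Ts)) / c
    a2 = b3
    b2 = (Ts - a2) / c
    a1 = b2
    b1 = (Ts**2 / 2.0 - a1) / c
    e = np.exp(-Ts)
    A = np.array([[1, 0, Ts, 0, a1, 0],
                  [0, 1, 0, Ts, 0, a1],
                  [0, 0, 1, 0,  a2, 0],
                  [0, 0, 0, 1,  0, a2],
                  [0, 0, 0, 0,  e,  0],
                  [0, 0, 0, 0,  0,  e]], dtype=float)
    B = np.array([[b1, 0], [0, b1], [b2, 0], [0, b2], [b3, 0], [0, b3]], dtype=float)
    return A, B

A, B = nca_matrices(Ts=1.0, c=1.0)

# combined estimate of eq. (31): blend the two model outputs
alpha = 0.5                                          # illustrative mixing weight in (0, 1)
X_a = np.array([5.1, 5.0, 0.1, -0.1, 0.0, 0.0])      # estimate from the NCA filter
X_v = np.array([5.0, 4.9, 0.1, -0.1, 0.0, 0.0])      # NCV estimate padded with zero accelerations
X_combined = alpha * X_a + (1 - alpha) * X_v
print(X_combined)
```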

Besides the above approaches, some other methods have been proposed to deal with the tracking of maneuvering objects, such as IMM filters [14]. But these methods are applied to track targets with sudden maneuvers, which is common for aircraft. However, in the mobile robot scenario, the robot's motion is not so erratic as to make the above method necessary. Therefore, the near constant velocity model, the near constant acceleration model or a combination of the proposed models can be considered the most suitable structures to represent the dynamical model of a mobile robot. Afterwards, the JPDAF algorithm can easily be applied to the multiple robot tracking problem.

4 Simulation Results

To evaluate the efficiency of the JPDAF algorithm, a 3-robot tracking scenario is considered. Fig. 1 shows the initial positions of the reference and target robots. To prepare the scenario and implement the simulations, the parameters of the mobile robot structure and the simulation environment are presented in Table 2. Now, the following experiments are conducted to evaluate the performance of the JPDAF algorithm in various situations.

Table 2. Simulation Parameters

Parameter   Description
vl          the robot's left wheel speed
vr          the robot's right wheel speed
b           the distance between the robot's wheels
ns          number of sensors placed on each robot
Rmax        the maximum range of the simulation environment
Qs          the covariance matrix of the sensors' additive noise
ts          sample time for the simulation run
tmax        maximum simulation running time


Fig. 1. Initial Position of Reference and Target Robots and Simulation Environment, non-maneuvering case

4.1 Local Tracking for Non-maneuvering Movement

For each mobile robot, the speed of each wheel is determined as shown in Table 3.

The initial position of the reference robot is set as [5, 5, π/2]. Also, the positions of the target robots are set as [8, 2, π], [1, 3, π/3] and [1, 9, π/4], respectively. To run the simulation, the sample time, ts, and the maximum simulation running time, tmax, are set to 1 s and 200 s, respectively. Fig. 2 shows the trajectories generated for each mobile robot. To probe the physical environment, 24 sensors are placed on the reference robot. The sensors provide the distance from the static objects, walls or obstacles, and the dynamic targets. Because the position of each sensor relative to the robot's head is fixed, the sensors' data can be represented as a pair [r, φ], where φ is the angle between the sensor and the reference coordinate frame. Also, the covariance of the sensors' noise is supposed to be:

$$\begin{pmatrix} 0.2 & 0 \\ 0 & 5\times 10^{-4} \end{pmatrix} \qquad (32)$$

Now, the JPDAF framework is used to track the trajectory of each robot. In this simulation, the clutters are assumed to be the data collected from the obstacles such as

Table 3. Simulation Parameters

Robot Number   Left Wheel Speed   Right Wheel Speed
1              0.2 (Rad/s)        0.2 (Rad/s)
2              0.3 (Rad/s)        0.3 (Rad/s)
3              0.1 (Rad/s)        0.1 (Rad/s)


walls. To apply the near constant velocity model as described in equation (27), the initial value of the velocity is chosen uniformly between -1 and 1 as follows:

$$\dot{x}_0 = U(-1, 1) \qquad (33)$$
$$\dot{y}_0 = U(-1, 1) \qquad (34)$$

where $U(\cdot)$ is the uniform distribution function. Fig. 3 presents the simulation results using the JPDAF algorithm with 500 particles for each robot's tracking. The simulations justify the ability of the JPDAF algorithm to correct the intrinsic error embedded in the data collected from the measurement sensors.

4.2 Local Tracking for Maneuvering Movement

Now, consider a maneuvering movement for each robot. Therefore, a variable velocity model is considered for each robot's movement. Fig. 4 shows the generated trajectory for each mobile robot. Now, we apply the JPDAF algorithm to track each robot's position. To do so, three approaches are implemented: the near constant velocity model, the near constant acceleration model and the combined model described in the last section are used to estimate the states of the mobile robots. Figs. 5-7 show the results for each robot separately. To compare the efficiency of each strategy, the obtained error is presented for each approach in Tables 4 and 5, where the following criterion is used for estimating the error:

$$e_z = \frac{\sum_{t=1}^{t_{max}} (z_t - \hat{z}_t)^2}{t_{max}}, \qquad z_t \in \{x_t, y_t\} \qquad (35)$$

The results show that the combined method outperforms the two other approaches.
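The error criterion of equation (35) is a per-coordinate mean squared error over the run; a direct implementation (with illustrative values):

```python
import numpy as np

def tracking_error(true_xy, est_xy):
    """e_z of eq. (35) for z in {x, y}: mean squared error over t = 1..tmax."""
    true_xy, est_xy = np.asarray(true_xy), np.asarray(est_xy)
    sq = (true_xy - est_xy) ** 2
    return sq[:, 0].mean(), sq[:, 1].mean()      # (e_x, e_y)

truth = np.array([[1.0, 2.0], [1.1, 2.1], [1.2, 2.2]])
estimate = truth + np.array([[0.05, -0.05], [0.0, 0.02], [-0.03, 0.01]])
print(tracking_error(truth, estimate))
```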

Fig. 2. The generated trajectory for target robots


Fig. 3. The estimated trajectory using the JPDAF algorithm

Fig. 4. The generated trajectory for maneuvering target robots

Table 4. The error for estimating xt

Robot Number   Constant Acceleration   Constant Velocity   Combined
1              .0045                   .0046               .0032
2              .0112                   .0082               .005
3              .0046                   .0067               .0032


Table 5. The error for estimating yt

Robot Number   Constant Acceleration   Constant Velocity   Combined
1              .0082                   .009                .0045
2              .0096                   .0052               .005
3              .0073                   .0077               .0042

Fig. 5. The estimated trajectory for robot 1

Fig. 6. The estimated trajectory for robot 2


Fig. 7. The estimated trajectory for robot 3

5 Conclusion

In this paper, the multi-robot tracking task using the JPDAF algorithm has been fully investigated. After introducing the mathematical basis of the JPDAF algorithm, we showed how to implement this algorithm for multiple robots. To do so, three different models were presented. The near constant velocity model and the near constant acceleration model were first described and, then, these two models were combined so that the accuracy of multi-robot tracking is enhanced. Afterwards, simulation studies were carried out in two different stages. First, tracking the non-maneuvering motion of the robots was explored. Later, the JPDAF algorithm was tested on tracking the motion of the robots with maneuvering movements. The simulation results showed that the combined approach provided better performance than the other presented strategies. Although the simulations were explored in the case in which the reference robot was assumed to be stationary, this algorithm can easily be extended to multi-robot tracking with mobile reference robots as well. Indeed, the robot's self-localization can be used to obtain the mobile reference robot's position. Afterwards, the JPDAF algorithm is easily applied to the target robots according to the current estimated position of the reference robot.

References

[1] Kalman, R.E., Bucy, R.S.: New Results in Linear Filtering and Prediction. Trans. American Society of Mechanical Engineers, Journal of Basic Engineering 83D, 95–108 (1961)

[2] Siegwart, R., Nourbakhsh, I.R.: Introduction to Autonomous Mobile Robots. MIT Press, Cambridge (2004)

[3] Howard, A.: Multi-robot Simultaneous Localization and Mapping using Particle Filters. Robotics: Science and Systems I, pp. 201–208 (2005)


[4] Gordon, N.J., Salmond, D.J., Smith, A.F.M.: A Novel Approach to Nonlinear/non-Gaussian Bayesian State Estimation. IEE Proceedings F 140(2), 107–113 (1993)

[5] Isard, M., Blake, A.: Contour Tracking by Stochastic Propagation of Conditional Density. In: Proc. of the European Conference on Computer Vision (1996)

[6] Fox, D., Thrun, S., Dellaert, F., Burgard, W.: Particle Filters for Mobile Robot Localization. In: Sequential Monte Carlo Methods in Practice. Springer, New York (2000)

[7] Hue, C., Cadre, J.P.L., Perez, P.: Sequential Monte Carlo Methods for Multiple Target Tracking and Data Fusion. IEEE Transactions on Signal Processing 50(2) (February 2002)

[8] Ristic, B., Arulampalam, S., Gordon, N.: Beyond the Kalman Filter. Artech House (2004)

[9] Kanazawa, K., Koller, D., Russell, S.J.: Stochastic Simulation Algorithms for Dynamic Probabilistic Networks. In: Proc. of the 11th Annual Conference on Uncertainty in AI (UAI), Montreal, Canada (1995)

[10] Gustafsson, F., Gunnarsson, F., Bergman, N., Forssell, U., Jansson, J., Karlsson, R., Nordlund, P.J.: Particle Filters for Positioning, Navigation, and Tracking. IEEE Transactions on Signal Processing 50(2) (February 2002)

[11] Schulz, D., Burgard, W., Fox, D., Cremers, A.B.: People Tracking with a Mobile Robot Using Sample-based Joint Probabilistic Data Association Filters. International Journal of Robotics Research (IJRR) 22(2) (2003)

[12] Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing 50(2) (February 2002)

[13] Ikoma, N., Ichimura, N., Higuchi, T., Maeda, H.: Particle Filter Based Method for Maneuvering Target Tracking. In: IEEE International Workshop on Intelligent Signal Processing, Budapest, Hungary, May 24-25, pp. 3–8 (2001)

[14] Pitre, R.R., Jilkov, V.P., Li, X.R.: A comparative study of multiple-model algorithms for maneuvering target tracking. In: Proc. 2005 SPIE Conf. Signal Processing, Sensor Fusion, and Target Recognition XIV, Orlando, FL (March 2005)

[15] Fortmann, T.E., Bar-Shalom, Y., Scheffe, M.: Sonar Tracking of Multiple Targets Using Joint Probabilistic Data Association. IEEE Journal of Oceanic Engineering 8, 173–184 (1983)

[16] Vermaak, J., Godsill, S.J., Perez, P.: Monte Carlo Filtering for Multi-Target Tracking and Data Association. IEEE Transactions on Aerospace and Electronic Systems 41(1), 309–332 (2005)

[17] Frank, O., Nieto, J., Guivant, J., Scheding, S.: Multiple Target Tracking Using Sequential Monte Carlo Methods and Statistical Data Association. In: Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada (October 2003)

[18] de Freitas, J.F.G.: Bayesian Methods for Neural Networks, PhD thesis, Trinity College, University of Cambridge (1999)

[19] Moral, P.D., Doucet, A., Jasra, A.: Sequential Monte Carlo Samplers. J. R. Statist. Soc. B 68, Part 3, 411–436 (2006)


An Artificial Market for Emission Permits

Ioannis Mourtos and Spyros Xanthopoulos

Department of Economics, University of Patras, 26504 Rio, Greece
[email protected]
Department of Management Science and Technology, Athens University of Economics and Business, 104 34 Athens, Greece
[email protected]

Abstract. This paper presents an artificial, i.e. computer-based, market that models various auctions for emission permits. A brief analysis of the existing auctions for emission permits provides the motivation for introducing a new type of auction that generalizes known mechanisms. This auction type and its implications are mainly analyzed through the artificial market presented here. The market uses a genetic algorithm that simulates the evolving behavior of bidders. Experimental analysis conforms to game-theoretical results and suggests that the proposed auction is efficient in terms of forcing players to reveal their true valuations.

1 Introduction

Although market-based mechanisms were proposed long ago as an environmental policy tool, they have recently attracted strong interest because of their inclusion in the flexible mechanisms of the Kyoto Protocol. In brief, these flexible mechanisms include the following (Caplan, 2003):

(i) Emission quotas for each Annex I country, calculated with respect to base-year emissions (where 1990 is the commonly used base-year), which should not be exceeded within the first commitment period of the protocol (i.e. 2008-2012).

(ii) Trading of emission permits by countries or by individual companies, where tradable permits are defined as a percentage of each country's emission quotas.

(iii) Further trading of emission credits, generated either by domestic sinks, or by the Joint Implementation (JI) and Clean Development (CDM) mechanisms allowed by the Kyoto protocol.

It is worthwhile to notice that even countries (like the US) that have not ratified the Protocol have long since initiated certain emissions markets, the most characteristic example being the auction for SO2 allowances operated by the Environmental Protection Agency (EPA) of the US (Cason & Plott 1996, Svendsen & Christensen 1999). An analogous scheme has been employed by the corresponding EPA of China (Changhua & Friedman 1998). There exist numerous other systems for trading and transferring GHG emissions (Hasselknippe, 2003) at various stages of preparation or operation. These markets allow the trading of either permits or credits or both and use standard auction types in order to allocate those permits among potential buyers. An auction that takes into account the particularities of those markets was introduced in Mourtos et al.


(2005). That auction is called the pay-convex auction and aims at revealing the true valuations of bidders, i.e. the marginal cost of reducing CO2 emissions, also called marginal abatement costs (MAC).

In this work, we present an artificial market, i.e. a computer-based market, that simulates the behavior of players in various auction types, assuming that each auction is performed for a single unit and in multiple rounds. Artificial markets have recently attracted strong attention as a tool for modeling risk attitudes and evolving behavior in capital markets (see for example Noe et al. 2003) and also for assessing the comparative performance of trading mechanisms. Hence, our work contributes in this direction, while also providing experimental evidence in support of the pay-convex auction.

The remainder of this paper is organized as follows. Standard auction types and existing auctions for emission allowances are discussed in Section 2, while a new type of auction is presented in Section 3. The artificial market for evaluating the various auction types is presented in Section 4, while experimental results are discussed in Section 5.

2 Types of Auctions

We examine sealed bid auctions, namely first and second price auctions, with independent and identically distributed private values (for related definitions, see Milgrom, 2004). In the first price sealed-bid auction, each bidder submits one bid in ignorance of all other bids. If a single unit is being auctioned, the highest bidder wins and pays the amount she has bid. When there is more than one unit for sale, sealed bids are sorted from high to low, and items are awarded at the highest bid price until the supply is exhausted. Hence, each bidder acquires her unit at a (possibly) different price; thus this generalization of the first price auction is called a discriminatory (or pay-the-bid) auction. In the second price sealed-bid auction, the item is awarded to the highest bidder at a price equal to the second-highest bid. It is well known that the single dominant strategy for each bidder is to reveal her true valuation. When auctioning multiple units, all winning bidders pay for the items at the same price, i.e. the highest losing price. Therefore, this generalization of the second-price auction to the case of multiple units is called a uniform auction.

There exist a number of known arguments (Krishna, 2002) which favor the selection of first-price rather than second-price auctions, at least in the case of carbon permits. First, it is known that revenue is the same for both auctions (by the Revenue Equivalence Theorem). Further, the distribution of prices in a second-price auction is a mean-preserving spread of those in a first-price one; therefore first-price auctions could provide more concise price signals. Finally, revenue is larger in the first price auction if players are risk-averse.

In fact, specifying the price to be paid by the winning bidder, i.e. the pricing rule, is an important issue in itself, since it evidently affects the bidders' behavior. In the case of pay-the-bid pricing, the bidder attempts to estimate the lowest winning bid bmin and then set her bid at a slightly higher value. Bids in excess of bmin are clearly lost


utility. With uniform pricing though, predicting the clearing price is less important, since every winner pays the clearing price regardless of her actual bid. Further, at a discriminatory auction, all transacting bids affect the different transaction prices, while uniform pricing implies that only the bids and asks near the clearing price affect the transaction price.

Two auctions for emission permits (SO2), already in operation, are the auction of the Environmental Protection Agency (EPA) in the US (Cason & Plott 1996, Svendsen & Christensen 1999) and the auction held by the corresponding Chinese governmental agency (Changhua & Friedman 1998). Both auctions define variants of the first-price auction, their difference lying in the matching rule employed to allocate bids to asks. In the US EPA auction, the bids are arranged in descending order, whereas the asks are arranged in ascending order. The lowest ask is matched with the highest bid and the trade occurs at the bid price. It follows that this auction leads to lower asking prices receiving higher trading priority and thus being matched to higher received prices. The seller strategies inherent in the EPA auction have been analyzed in Cason & Plott (1996), where it is shown that sellers' asks are biased downwards by the specific matching rule, i.e. that of the lowest minimum price being matched to the highest remaining bid.

In the auction employed by the Chinese EPA, the matching rule assigns the highest bid to the highest ask, the second highest bid to the second highest ask (as long as the bid value exceeds the asking price) and so on, until no remaining bid exceeds any of the remaining asks. Although this appears as a rather slight modification of the EPA auction in the US, its implications with respect to trading volume and players' attitudes are substantial. Without entering a technical discussion, we should note that this auction maximizes trading volume for any given set of bids and asks.

Hence, both auctions are discriminative towards the buyers and attempt to transfer the utility surplus to the sellers. Essentially, the buyers are challenged to reveal their marginal abatement costs and acquire emission allowances at zero profit. However, the US auction suffers from possible underestimation of sellers' values, the opposite being the disadvantage of the Chinese auction. Uniform-pricing could resolve these issues but without maintaining the same trading volume (Milgrom, 2004). Finally, none of these auctions takes into account sellers' prices concerning the price of the transaction. This is the point where this paper's contribution could be utilized.

3 A New Auction Type

Observe that, while buyers face a pay-the-bid pricing rule, sellers face an inverse rule, which could be called pay-the-ask, in the sense that sellers “bid” at a price but the transaction is cleared at the price of the buyers. Let us then assume that this pricing rule is, instead, applied to the buyers. In more detail, higher bids still win but the price of the transaction is the reservation price of the seller, as long as this is smaller than the winning bid. Evidently, buyers can offer high bids, thus challenged to reveal their abatement costs, without risking a high transaction price. In particular, such a rule could be useful for a government (or any non-profit authority) auctioning permits and aiming at minimizing the financial impact for the buyers. Further, if the government

Page 78: Tools And Applications With Artificial Intelligence Studies In Computational Intelligence

72 I. Mourtos and S. Xanthopoulos

applies the same reservation price to all units being auctioned, this reservation price could serve the clearing price of a uniform auction (or that of a carbon tax).

Such an approach suffers from an obvious weakness: buyers could bid at arbitrarily high values, since these values would not affect the actual price to be paid. This disadvantage could be counteracted if buyers' bids are also taken into account, and this is exactly what is achieved by the pricing rule adopted by the pay-convex auction. That rule sets the price of the transaction equal to a convex combination of the buyer's bid and the seller's ask. Hence, the pay-convex auction can be summarized in the following steps (a code sketch of these steps is given below).

Step I: Announce the number of buyers N and the value of a constant e, 0 < e < 1.
Step II: Receive the bids b_i and asks c_j.
Step III: Call a matching procedure to determine all allowed pairs (b_i, c_j) with b_i > c_j.
Step IV: For each pair (b_i, c_j), set the price of the transaction equal to e*b_i + (1-e)*c_j.
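As an illustration, the following is a minimal sketch of Steps I-IV in Python; the matching rule used here (pairing the highest remaining bid with the lowest remaining ask, as in the US EPA auction described above) is only one admissible choice, and the function name is ours, not part of the original implementation (which the authors wrote in C).

```python
def pay_convex_clearing(bids, asks, e):
    """Match bids to asks and price each trade at e*bid + (1-e)*ask.

    Matching rule (one admissible choice, cf. the US EPA auction above):
    the highest remaining bid is paired with the lowest remaining ask,
    and a pair trades only if the bid exceeds the ask (Step III).
    """
    assert 0.0 <= e <= 1.0
    trades = []
    for b, c in zip(sorted(bids, reverse=True), sorted(asks)):
        if b > c:                         # Step III: allowed pair
            price = e * b + (1 - e) * c   # Step IV: convex-combination price
            trades.append((b, c, price))
    return trades

# Example: e = 0.5 prices each trade halfway between bid and ask.
print(pay_convex_clearing([300, 220, 90], [40, 80, 150], e=0.5))
```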

Note that any matching rule could be employed in Step III. Notice also that the pricing rule of Step IV generalizes the pay-the-bid and pay-the-ask pricing rules, which appear as special cases for e=1 and e=0, respectively. The following has been established in Mourtos et al. (2005).

Proposition 1. For uniformly distributed bids and asks and risk-neutral players, the optimal bid for a buyer participating in a pay-convex auction is

b = 2Nu / ((N+1)(1+e)),   (1)

where N is the number of buyers and u the buyer's utility for the auctioned item.

Hence, for 0≤e≤1, the pay-convex auction forms a generalization of the other two auction types, since the pay-the-bid and pay-the-ask auctions are equivalent to the pay-convex auction for e=1 and e=0, respectively. Observe that the fraction b/u tends to the value 2/(1+e) as N increases.

Consider now an auction where the auctioneer knows the number of bidders and bidders know the value of parameter e in advance. For fixed N, bids lie below the obtained utility if (N+1)(1+e)/(2N) ≥ 1, or equivalently e ≥ (N-1)/(N+1). Hence, an auctioneer setting e=(N-1)/(N+1) forces risk-neutral players to reveal their true valuations. This also implies that the expected gain of the winning bid is zero. For example, for N=20, setting e=0.905 results in bidding exactly at one's reservation price. Notice that, if the number of bidders remains unknown but follows a certain probability distribution, the regulator can still estimate the optimum value of e. Further, increasing the value of e well above (N-1)/(N+1) implies that the auction aims at transferring additional utility to the bidder, the opposite being achieved by decreasing the value of e below (N-1)/(N+1).
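The algebra above can be checked in a few lines; this is a hedged illustration of Proposition 1 and of the choice e = (N-1)/(N+1), not part of the authors' original simulation code.

```python
def optimal_bid(u, N, e):
    """Optimal bid of a risk-neutral buyer in the pay-convex auction, eq. (1)."""
    return 2 * N * u / ((N + 1) * (1 + e))

def truth_revealing_e(N):
    """Value of e for which the optimal bid equals the buyer's utility."""
    return (N - 1) / (N + 1)

N = 20
e = truth_revealing_e(N)           # ~0.905, as in the text
print(round(e, 3))                 # 0.905
print(optimal_bid(100.0, N, e))    # 100.0: the buyer bids exactly her valuation
```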

Page 79: Tools And Applications With Artificial Intelligence Studies In Computational Intelligence

An Artificial Market for Emission Permits 73

4 An Artificial Market

The artificial market presented in this section aims at (i) evaluating the comparative performance of the auctions analyzed above and (ii) simulating the evolutionary behavior of risk-neutral buyers. The market includes N buyers and 10 items, each sold independently. All submitted bids and asks are for a single item, i.e. a single permit. Each buyer's utility is drawn from a uniform distribution in the interval [0,400], while the reservation price is drawn from a uniform distribution in the interval [0,100], i.e. there exists a 75% probability for a buyer's value to exceed the reservation price.

Buyers are risk-neutral and define their strategy according to the type of the auction (i.e. pay-the-bid, pay-the-ask or pay-convex) and the non-dominated strategy implied by Proposition 1. Each auction is repeated for 300 rounds, with each round including 10 auctions, one for each item. The buyers' and sellers' utilities are drawn once and remain identical throughout all rounds. To allow buyers to define an evolving strategy, the results of all previous rounds, i.e. whether a buyer won or lost, are available.

Clearly, if bids are simply the ones implied by Proposition 1, the experiment reproduces our theoretical predictions. To obtain a less-controlled experiment, we allow the buyers to increase or decrease their bids by taking into account the outcomes of all previous rounds. This roughly implies that losing bids tend to increase and winning ones tend to decrease. The simplest method to accomplish this is to create bids which are mean-preserving spreads of the bid implied by Proposition 1. Let S~N(0,5), i.e. S has a much smaller variance than the uniform distribution from which buyers' utilities are drawn. If, at a certain round, buyer k has won W times and lost L times, then she defines her next bid b_k' as

b_k' = b_k adjusted upwards towards u_k and perturbed by a draw of S,   with probability L/(W+L),
b_k' = b_k adjusted downwards and perturbed by a draw of S,             with probability W/(W+L),   (2)

where S is drawn independently at each calculation. This rule allows for reasonable fluctuations to occur among buyers’ bids and clearly encourages buyers to increase their bids as competition intensifies. The bids are always kept below a certain limit, in order for the buyer not to lose overall, i.e. pay a price above their utility u. It is not difficult to predict that bids will eventually converge towards the buyers' true valuations.
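Since equation (2) is only partially recoverable, the sketch below implements the behaviour described in the text rather than the exact formula: losing bids drift upwards (towards the utility) with probability L/(W+L), winning bids drift downwards with probability W/(W+L), both perturbed by S ~ N(0, 5) and capped by an assumed upper limit consistent with the b/u limit of 2/(1+e). The function name, the exact adjustment terms and the cap are our assumptions.

```python
import random

def simple_bid_update(bid, utility, wins, losses, e, sigma=5.0):
    """One illustrative update of a buyer's bid under the "Simple" rule.

    Assumptions: with probability L/(W+L) the bid moves upwards towards the
    utility, otherwise downwards; a zero-mean perturbation S ~ N(0, sigma) is
    added and the bid is capped so that b/u cannot exceed 2/(1+e).
    """
    s = random.gauss(0.0, sigma)            # the spread variable S
    total = wins + losses
    p_up = losses / total if total else 0.5
    if random.random() < p_up:              # mostly losing: bid up
        candidate = bid + (utility - bid) / 2 + s
    else:                                   # mostly winning: bid down
        candidate = bid - abs(s)
    cap = 2 * utility / (1 + e)             # assumed upper limit (see text)
    return max(0.0, min(candidate, cap))
```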

A more involved method for introducing an evolving behavior is via a genetic algorithm. For the purposes of this algorithm, each buyer maintains a pool of 100 chromosomes, every chromosome encoding a single strategy. In each round, a chromosome is drawn randomly from the pool and the strategy encoded is followed. To better clarify what a strategy stands for, recall that our experiments refer to buyers bidding repetitively at a single auction. Hence, chromosomes represent increases and decreases with respect to the bid value implied by Proposition 1. More specifically, at the beginning, half of the chromosomes represent an increase and half a decrease, where all values correspond to independent draws of variable S.

Page 80: Tools And Applications With Artificial Intelligence Studies In Computational Intelligence

74 I. Mourtos and S. Xanthopoulos

At each round, a player randomly draws a chromosome from her pool and submits the bid implied by that chromosome. Once the auction for all 10 items is completed, the pools are updated through the procedures of selection, crossover, mutation and election. Selection is performed first and replaces some chromosomes that encode the least profitable strategies with chromosomes that encode the most profitable ones. To accomplish that, each strategy’s profitability is estimated by assuming that all other players retain the bids submitted in the previous round. The outcome is a ranking of all strategies (chromosomes) in terms of expected profitability. The selection process is completed by replacing the 5 chromosomes encoding the least profitable strategies with the 5 chromosomes encoding the most profitable ones.

Following selection, new chromosomes are produced through a crossover process analogous to the one presented in Noe et al. (2003), i.e. chromosomes are represented by binary encodings that are “crossed” with a probability of 6%. After that, mutation is applied. Mutation amounts to randomly selecting a chromosome and replacing it with another one. The probability of mutation for each chromosome in each round is 5%, and it is equally likely that the mutant will represent an increase or a decrease, again by a draw of variable S. Such a large probability has been selected in order to allow for considerable diversity in the chromosome pool, thus enhancing the quality of the learning mechanism. All new chromosomes generated through crossover and mutation, together with the original ones (i.e. after selection), are then examined by an election step: only the 5 chromosomes encoding the most profitable strategies are added to the new chromosome pool, replacing the 5 chromosomes encoding the least profitable strategies.
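The selection-crossover-mutation-election cycle can be sketched as follows. The chromosome encoding (signed bid adjustments instead of binary strings), the profitability estimate and the blending crossover are simplifications of ours; only the pool size of 100, the 6% crossover and 5% mutation probabilities, and the replacement of the 5 worst strategies by the 5 best ones follow the text.

```python
import random

POOL_SIZE, CROSS_P, MUT_P, ELITE = 100, 0.06, 0.05, 5

def update_pool(pool, profitability, random_strategy):
    """One illustrative generation: selection, crossover, mutation, election."""
    # Selection: replace the 5 least profitable strategies with the 5 best.
    ranked = sorted(pool, key=profitability)
    pool = ranked[ELITE:] + ranked[-ELITE:]

    # Crossover: with probability 6%, blend a strategy with a random mate.
    offspring = [0.5 * (s + random.choice(pool)) if random.random() < CROSS_P else s
                 for s in pool]

    # Mutation: with probability 5%, replace a strategy by a fresh draw of S.
    mutated = [random_strategy() if random.random() < MUT_P else s
               for s in offspring]

    # Election: only the 5 most profitable new chromosomes enter the pool,
    # replacing the 5 least profitable existing ones.
    candidates = sorted(mutated, key=profitability, reverse=True)[:ELITE]
    survivors = sorted(pool, key=profitability)[ELITE:]
    return survivors + candidates

pool = [random.gauss(0.0, 5.0) for _ in range(POOL_SIZE)]
pool = update_pool(pool,
                   profitability=lambda s: -abs(s),           # placeholder estimate
                   random_strategy=lambda: random.gauss(0.0, 5.0))
print(len(pool))  # the pool size (100) is preserved
```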

5 Experimental Results

The code has been written in C in Microsoft Visual Studio Professional 2003, under Windows XP. Each experiment consists of 100 simulation runs, where each run simulates 10 independent auctions performed over 300 rounds. The number of players varies from N=20 to N=100. Results presented hereafter are the average results (over all runs) obtained for N=40. At each round, the same 10 items are sold at the same reservation prices, but the buyers have prior knowledge of the winning bids in all previous rounds. An experiment is conducted for each of six auction types, distinguished by the value of parameter e. Specifically, we examine the cases of e=0 (pay-the-ask), e=1 (pay-the-bid), e=(N-1)/(N+1)=opt (pay-convex, with bids being forced to become equal to private values) and, finally, e=0.25, 0.5, 0.75 (intermediate options for a pay-convex auction).

If buyers adopt the rule implied by equation (1), we denote the experiment as “Simple”, while if they adopt the more sophisticated attitude implied by the genetic algorithm we denote the corresponding experiment as “Genetic”. Results concern (i) the average fraction of a bid value over the bidder's utility (b/u) and (ii) the average revenue raised by each winning bid (i.e. true valuation minus bid value). The basic question addressed is which auction leads to faster convergence (i.e. convergence in the smallest number of rounds) towards the true valuations. It can be reasonably expected that, after a sufficiently large number of rounds, the average profit for a winner will tend to zero, since bids will gradually increase.



Fig. 1. Difference of b/u ratios from limit 2/(1+e) when N=40 agents use the “Genetic” rule

Figures 1 and 2 present the experimental results for N=40 and e∈{0, 0.25, 0.5, 0.75, 1, opt}, when agents (buyers) select their bids according to the “Genetic” and “Simple” rules, respectively. More specifically, these figures depict the difference between the average b/u ratio and the theoretical limit of 2/(1+e) after 50, 100, 150, 200, 250 and 300 rounds of the simulation. The outcome is quite intuitive: bidders start bidding as implied by Proposition 1 (on average) and gradually increase their bids in order to increase the probability of winning the single item auctioned. Hence, the average ratio b/u converges towards the limit of 2/(1+e), with the average revenue also decreasing. This convergence is much faster when buyers adopt the “Genetic” rule, thus implying that the genetic algorithm allows buyers to better “learn” and adjust to the terms of the auction. Setting e=opt results in more rapid convergence, almost from the first rounds, and also in buyers revealing their true valuations much faster. This provides evidence in support of the pay-convex auction and illustrates that taking the reservation price of the item into account increases market efficiency.


Fig. 2. Difference of b/u ratios from limit 2/(1+e) when N=40 agents use the “Simple” rule


6 Conclusion

This paper has presented an artificial market in order to compare different auction types within the context of emissions trading. In that context, an auction should encourage the buyers to reveal their marginal abatement costs, thus leading to an efficient allocation of the auctioned permits. Experimental analysis with an artificial market shows that the pay-convex auction leads to much faster convergence of bids towards true valuations, when a single item is auctioned. The results obtained are highly motivating for future research, which could examine multiple-item auctions, while also focusing on a more elaborate design of the genetic algorithm.

Acknowledgments. Research supported by the European Union (European Social Fund) and national resources.

References

[1] Caplan, A.J., Cornes, R.C., Silva, E.C.D.: An ideal Kyoto protocol: emissions trading, redistributive transfers and global participation. Oxford Economic Papers 55, 216–234 (2003)

[2] Cason, T.N., Plott, R.C.: EPA’s new emissions trading mechanism: a laboratory evaluation. Journal of Environmental Economics and Management 30, 133–160 (1996)

[3] Changhua, S.R., Friedman, D.: The Matching Market Institution: A Laboratory Investigation. The American Economic Review 9, 1311–1322 (1998)

[4] Cropper, M.L., Oates, W.E.: Environmental Economics: a survey. Journal of Economic Literature 30, 675–740 (1985)

[5] Hasselknippe, H.: Systems for carbon trading: an overview. Climate Policy 3, 43–57 (2003)

[6] Krishna, V.: Auction theory. Academic Press, London (2002)

[7] Milgrom, P.: Putting auction theory to work. Cambridge University Press, Cambridge (2004)

[8] Mourtos, I., Markellos, R.N., Xanthopoulos, S.: An auction mechanism for emission permits. Working Paper MSL-WP-2005-02, Athens University of Economics and Business (2005)

[9] Noe, T.H., Rebello, M.J., Wang, J.: Corporate Financing: An Artificial Agent-based Analysis. The Journal of Finance 58, 943–973 (2003)

[10] Svendsen, G.T., Christensen, J.L.: The US SO2 auction: analysis and generalization. Energy Economics 21, 403–416 (1999)


Application of Neural Networks for Investigating Day-of-the-Week Effect in Stock Market

Virgilijus Sakalauskas and Dalia Kriksciuniene

Department of Informatics, Vilnius University, Muitines 8, 44280 Kaunas, Lithuania {virgilijus.sakalauskas,dalia.kriksciuniene}@vukhf.lt

Abstract. In this article, the problem of the day-of-the-week effect is investigated for a small emerging stock market (the Vilnius Stock Exchange), where this effect is vaguely expressed and its presence is difficult to confirm. The research results confirmed the effectiveness of applying neural networks, as compared to traditional linear statistical methods, for identifying differences in Monday and Friday stock trading anomalies. The effectiveness of the method was confirmed by exploring the impact of different variables on the day-of-the-week effect, evaluating their influence through sensitivity analysis, and selecting the best performing neural network type.

Keywords: Artificial neural networks, day-of-the-week effect, MLP, mean return, RBF, stock market.

1 Introduction

The efficient market hypothesis states that it is not possible to consistently outperform the market by using any information that the market already knows. However, researchers have reported evidence of abnormal returns related to day-of-the-week effects, as well as week-of-the-month, month-of-the-year, turn-of-the-month, turn-of-the-year and holiday effects. In an extensive study of market anomalies, based on the analysis of Dow Jones daily data registered over a 100-year period and of about 25 years of S&P 500 daily data, a set of rules containing nearly 9,500 different calendar effects was tested [17]. The most significantly different stock returns occur on the first and the last trading days of a week. As indicated in [1, 5, 8], daily stock returns tend to be lower on Mondays and higher on Fridays for the United States and Canada. Therefore, methods for predicting any divergences from the efficient market hypothesis may allow profitable trading strategies to be developed and investment risk to be decreased.

The day-of-the-week effect is understood as significantly different stock returns on different days of a week. It means that different trade situations on different days of a week have some features of regularity. A number of scientific papers devoted to the analysis of this anomaly [3,15,18] have disclosed that from 1980 the day-of-the-week effect was clearly evident in the vast majority of developed markets, but it appears to have faded away in the 1990s [7,16]. This implication is based on long-run improvements in market efficiency, which may have lowered the effects of certain anomalies in recent periods [8]. The other reason for the diminished significance of the calendar effects is the inability of the prevalent statistical methods to recognize them. As noticed in [17], the anomalies are most outstanding in short periods and in comparatively small data subsets. The market inefficiency, as investigated in [13], could be observed by applying research of higher moments instead of traditional statistical analysis, by taking into account all available information about the stocks (the trading volumes, dividends, earnings-price ratios, prices of other assets, interest rates, exchange rates) and, subsequently, by using powerful nonparametric regression techniques, such as artificial neural networks.

The main variable used for day-of-the-week effect investigation in the research literature is the mean return. Other important indicators for the research are the daily closing prices and indexes [2]. However, the application of other variables for establishing this anomaly is quite rare. In this paper we apply artificial neural networks (ANN). The researched data includes the mean return and also the number of deals and shares, turnover, and H-L (high minus low price).

There are few research works applying ANN to day-of-the-week effect analysis or to related questions of the efficient market hypothesis. Yet the results obtained are quite encouraging for further research.

Neural networks are an artificial intelligence method for modelling complex target functions. Recently, artificial neural network methods have started to be widely used in the domain of financial time series prediction. However, this is a highly complicated task: financial time series often behave nearly like a random-walk process, and under the efficient market hypothesis the prediction of stock prices or index levels is impossible. Moreover, the statistical behaviour of financial time series differs at different points in time and is usually very noisy.

ANN are mostly used for predicting stock performance [9,20], classification of stocks, predicting price changes of stock indexes [10], and forecasting the performance of stock prices [6,12,21]. In most analyzed applications, the neural network (NN) results outperform statistical methods such as multiple linear regression analysis, ANOVA, discriminant analysis and others [11].

In the following section, the day-of-the-week research methodology and the organization of the research data set are defined. The day-of-the-week effect is then investigated by applying the classification potential of artificial neural networks. Two standard types of neural networks were applied using the ST Neural Network software: MLP (Multilayer Perceptrons) and RBF (Radial Basis Function Networks). The research outcomes and conclusions were evaluated and revealed the better effectiveness of the neural networks, which confirmed a significant influence of the day-of-the-week effect in more cases than was possible with traditional statistical analysis.

2 Data and Methodology

In this paper, data for the empirical research was taken from the information base of the Vilnius Stock Exchange (The Nordic Exchange, 2006 [19]). The Vilnius Stock Exchange belongs to the category of small emerging securities markets, as shown by its main financial indicators: a market value of 7 billion EUR, nearly 2 million EUR of share trading value per business day, approximately 600 trades per business day, and an equity list consisting of 44 shares.

To identify the presence of the day-of-the-week effect we use the data of daily return values of 24 shares (out of 44 listed), from the period 2003-01-01 till 2006-11-21. The selected shares represent the variety of the Vilnius Stock Exchange equity list by capitalization, daily turnover, trade volume, return and risk, and thus ensure the validity of the research results.

In this paper we use the logarithmic definition of return (1), where the return at time moment t, R_t, is evaluated as the logarithmic difference of the stock price over the time interval (t-1, t]:

R_t = ln(P_t / P_{t-1}),   (1)

where P_t indicates the price of the financial instrument at time moment t. The return values of the 24 equities were assigned to variables named correspondingly to their symbolic notation on the Vilnius Stock Exchange. The initial analysis by summary statistics of the data set revealed quite big differences in return for different days of a week (Table 1). In Table 1 the stocks are sorted according to average daily trading volume. The trading volume starts from 14 thousand LTL (1 EUR = 3.45 LTL) for LEL and rises to 700,000 LTL for TEO. It may be observed that only in the case of fairly intensive stock trade is the Monday/Friday effect perceptible: the Monday mean return is the lowest, and the Friday mean return the highest, among all days of a week. Table 1 shows that the day-of-the-week effect may be perceptible only for stocks with an average turnover bigger than 120,000 LTL (starting from UTR), but this cannot be firmly confirmed without exhaustive statistical research.

By analysing the standard deviation we could notice that there was no difference in trading volatility between days of the week, or among different variables. The lowest return and the biggest standard deviation values were in the group with the highest trading volume. These outcomes are consistent with the observation that the biggest speculative trading transactions in the Vilnius Stock Exchange are made with the stocks of highest turnover and best liquidity.

Application of traditional statistical methods, such as the t-test, one-way ANOVA, and the Levene and Brown-Forsythe tests of homogeneity of variances, confirmed the day-of-the-week effect for only a few stocks out of the total of 24 [14].

Using the Kolmogorov-Smirnov test for different days of the week, applied to all possible pairs of days, a significant difference for some days of the week could be confirmed for only 3 variables out of 24, namely LDJ, PZV and UTR [14].
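For reference, the traditional tests mentioned above are available in SciPy; a minimal sketch follows, assuming `monday` and `friday` are arrays of daily returns for the respective weekdays (the hypothetical random draws stand in for the real data, and the published results [14] were obtained with different software).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
monday = rng.normal(-0.001, 0.02, size=200)   # hypothetical Monday returns
friday = rng.normal(0.001, 0.02, size=200)    # hypothetical Friday returns

print(stats.ttest_ind(monday, friday, equal_var=False))  # two-sample t-test
print(stats.levene(monday, friday))                      # homogeneity of variances
print(stats.ks_2samp(monday, friday))                    # Kolmogorov-Smirnov test
```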

The traditional statistical analysis could give some indications of the day-of-the-week effect by finding differences in the mean return values of the securities, but did not allow us to state that these differences are significant. More sensitive research methods and more variables are needed to estimate the day-of-the-week effect.

In the following section, for the research of this anomaly, we use a days-of-the-week classification method with the help of artificial neural networks. To determine the day-of-the-week effect we invoked not only the mean return and volume, but also the number of deals and shares, turnover, and H-L (high minus low price).


Table 1. Day-of-the-week Summary Statistics

3 Application of Neural Networks Methods

To determine the day-of-the-week effect, traditional statistical methods are usually used and only the most important variable, the mean return, is invoked. However, in the case of a vaguely expressed day-of-the-week anomaly, such an approach is not sufficient. For the analysis we should invoke more powerful nonlinear statistical research and use more variables that may have an impact on the day-of-the-week effect. Here we used artificial neural networks for exploring the presence of the day-of-the-week effect for the same data set of 24 equities from the Vilnius Stock Exchange, as described in Part 2. The idea of the research was to attempt to distinguish Monday/Friday from the other days of the week using the classification potential of neural networks.

In our research we apply two standard types of neural networks, implemented in the STATISTICA Neural Networks standard module: MLP (Multilayer Perceptrons) and RBF (Radial Basis Function Networks), both evaluated for good performance in classification tasks [15].

These two types of networks have some different features. MLP units are defined by their weights and threshold values, which provide the equation of the defining line and the rate of falloff from that line [15]. A radial basis function network (RBF) has a hidden layer of radial units, each modeling a Gaussian response surface.

In this research both types of neural networks, MLP and RBF, were used to distinguish Monday and Friday from the other trading days of the week. We use the predicting variables: mean return, number of deals and shares, turnover, and H-L (high minus low price). During the research we attempt to select the best neural network for classification, to choose the variables most influential for the final result, and to perform their sensitivity analysis.

To determine the day-of-the-week effect we used from 700 to 1000 items of data for each investigated security. Non-trading days of the particular securities were eliminated from the data sets.

As Monday/Friday records are 4 times fewer than those of the other days of the week, classification could be made by presuming that all days are ‘others’ (i.e. different from Monday/Friday). In such a case we obtain a fairly high correct classification percentage (80%), although there would be no correctly classified Monday/Friday records at all. We do not proceed this way, as our goal is not to reach the highest possible general level of classification. We only wish to see if the predictors can distinguish Mondays/Fridays from the other days of the week with fair significance, when it is not known a priori which day of the week is analyzed. Thus, when applying ANN and classifying the cases, we presume that the prior probabilities for Monday/Friday and the other days of the week are equal. We also have no need to divide the data into training, test and verification sets, as we do not aim to obtain a classification instrument appropriate for classifying future data. We only want to prove the presence of the day-of-the-week anomaly using the available data.

Fig. 1. TEO securities data set segment used in calculations
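The imbalance argument can be made concrete with illustrative counts: with Mondays being roughly one fifth of the records, the trivial "always other" classifier already reaches about 80% accuracy while recognizing no Mondays at all, which is why equal prior probabilities are assumed.

```python
n_monday, n_other = 200, 800          # illustrative counts (1:4 ratio)

# Trivial classifier: always predict "other".
accuracy = n_other / (n_monday + n_other)   # 0.80
monday_recall = 0 / n_monday                # 0.0 -- no Mondays recognised

# Equal-prior (balanced) accuracy weights both classes equally.
balanced_accuracy = 0.5 * (1.0 + 0.0)       # 0.50 for the trivial classifier
print(accuracy, monday_recall, balanced_accuracy)
```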

For calculation we use the Statistica Neural Network Intelligent Problem Solver tool, which allows classification to be automated by indicating input and output variables, assigning distribution cases, accept/reject confidence limits and network complexity options, setting a limit on the duration of the design process, and selecting the results to be displayed. Below we explain the data processing procedure by applying ANN to one of the securities (TEO, Fig. 1), aiming to distinguish Monday from the other days of the week. The same procedure was applied to each of the 24 securities data sets.

Using the Statistica Neural Network Intelligent Problem Solver tool we select DAY as the output (dependent) variable and DEALS, NO_OF_SH, TURNOVER, RETURN, H_L as input (independent) variables. The performance of the best network found (correct classification rate, error, area under the ROC curve) is presented in the report generated by the Intelligent Problem Solver (Fig. 2).

Fig. 2. Intelligent Problem Solver report of TEO equities Monday effect

The report presented in Fig. 2 states that finding the best classification algorithm took 3:43 min. It resulted in an improved MLP (3-layer) network specified as 3:3-7-1:1 (3 input variables, 7 hidden units and 1 output variable), with an error level of 0.36817. The error was calculated as the root mean square (RMS) of the errors measured for each individual case. The ST Neural Networks training algorithms attempted to minimize the training error, while the final performance measure indicated the rate of correctly classified cases. This result depends on the Accept and Reject thresholds (confidence limits).

In Fig. 2, the correct classification rate is 0.623377; therefore 62.33% of the cases were correctly classified (both Mondays and other days). In the Statistica Neural Networks software the network performance can be evaluated according to the achieved correct classification rate as poor (rate less than 0.6), O.K. (0.6–0.8) and high (over 0.8). The outcome of the analysis for the security TEO states that it is influenced by the Monday effect, and that the constructed neural network distinguishes Mondays from the other trading days with sufficient reliability. In what follows we likewise consider classification performance significant if it is higher than 0.6. The classification statistics report (Fig. 3) presents the number of Monday records correctly distinguished from the other trading days.

Fig. 3. Statistics of Monday data recognition

As the number of correctly classified records greatly exceeds the number of wrongly classified ones, the Monday effect for the TEO security is confirmed.
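The STATISTICA networks themselves cannot be reproduced here, but an approximate sketch of the same classification task with scikit-learn is shown below; the 3-7-1 topology found by the Intelligent Problem Solver corresponds roughly to a single hidden layer of 7 units, and the feature matrix X (DEALS, NO_OF_SH, TURNOVER, RETURN, H_L) and label y (Monday vs. other) are random stand-ins for the real data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row per trading day with DEALS, NO_OF_SH, TURNOVER, RETURN, H_L;
# y: 1 for Monday, 0 for the other weekdays (hypothetical stand-in data).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(7,), max_iter=2000, random_state=1),
)
clf.fit(X, y)

# Correct classification rate on the same data, as in the text (no hold-out
# set is used, since the aim is only to detect the anomaly in the available data).
print(clf.score(X, y))
```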

In further analysis we determine which input variables were most important for classification and perform a sensitivity analysis of the neural network (Fig. 4).

The figure shows how the classification results would be altered by rejecting each variable; e.g., if we reject the variable DEALS, the error becomes 0.393 (Fig. 4).

Fig. 4. Sensitivity analysis of TEO securities (Monday)


The sensitivity analysis ranked the input variables according to their impact on network performance. The highest error is assigned to the highest-ranked variable; it shows how much network performance deteriorates if that variable is not present (Fig. 4). Obviously, for different stocks, different variables can have the biggest impact.
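A leave-one-variable-out sensitivity analysis of the kind reported in Fig. 4 can be approximated as follows (reusing the clf, X and y from the previous sketch); STATISTICA's exact error measure is not reproduced, and masking a variable by its mean is our simplification, so the ranking is illustrative only.

```python
def sensitivity_ranking(clf, X, y, names):
    """Rank input variables by the error obtained when each one is 'switched off'.

    A variable is removed by replacing its column with the column mean,
    which approximates treating that input as unavailable.
    """
    baseline_error = 1.0 - clf.score(X, y)
    errors = {}
    for j, name in enumerate(names):
        X_masked = X.copy()
        X_masked[:, j] = X[:, j].mean()
        errors[name] = 1.0 - clf.score(X_masked, y)
    ranking = sorted(errors, key=errors.get, reverse=True)
    return baseline_error, errors, ranking

names = ["DEALS", "NO_OF_SH", "TURNOVER", "RETURN", "H_L"]
baseline, errors, ranking = sensitivity_ranking(clf, X, y, names)
print(baseline, ranking)  # most important variable first
```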

Table 2 presents the sensitivity analysis for all 24 researched stocks. In this table the input variables are marked, for each stock, by numbers from 1 to 5 according to their importance (1 means the most important). If a position does not have a number, the respective variable is not important for classification.

From Table 2 it may be observed that the most important variables in most cases are DEALS, H_L and RETURN. The variables NO_OF_SH and TURNOVER are mostly less important in researching the day-of-the-week effect.

Table 2. Summary of sensitivity analysis


The neural classification data processing procedure described above was applied to all data of the 24 securities and is summarized in Table 3. The significant performance values (> 0.6), which indicate the presence of a Monday or Friday effect for the particular securities, are marked with an asterisk (*).

From Table 3 we can conclude that the results confirm the Monday effect for only 11 equities, while the Friday effect was present for 9 equities; both effects were present for 4 equities. In Table 3 the equities are sorted according to average daily trading turnover, which is the traditional parameter in research on the day-of-the-week effect. However, as may be seen from Table 3, an impact of turnover on the Monday/Friday anomaly cannot be confirmed. As mentioned before, the most important parameters affecting the day-of-the-week particularity are the number of deals, the difference between the highest and lowest equity price, and the mean return.

From the summary table, the performance of the neural networks (MLP versus RBF) showed that for indicating the Monday effect the best results were achieved by MLP (18 times), while RBF performed better 6 times.

Table 3. Application results of the neural networks for securities trading data (performance values > 0.6, indicating the presence of the Monday or Friday effect, are marked with *)

          Monday                                         Friday
          Performance  Error   Network  Inputs  Hidden   Performance  Error   Network  Inputs  Hidden
LEL       0.5623       0.3909  MLP      2       6        0.5770       0.3849  MLP      3       6
KBL       0.6501*      0.3590  MLP      5       10       0.5214       0.3994  RBF      4       9
LNS       0.6066*      0.3656  MLP      5       7        0.5436       0.3961  MLP      3       3
LEN       0.5576       0.3815  MLP      3       6        0.7291*      0.3590  RBF      5       180
VBL       0.5943       0.3640  MLP      5       6        0.5814       0.3870  MLP      4       6
LJL       0.5193       0.3918  RBF      3       10       0.6409*      0.3910  RBF      5       66
LLK       0.6281*      0.3683  MLP      4       6        0.6638*      0.3563  MLP      5       10
KJK       0.6738*      0.3526  MLP      4       7        0.5778       0.3829  RBF      4       13
ZMP       0.5776       0.3807  MLP      4       7        0.6007*      0.3566  MLP      4       7
LDJ       0.4850       0.3857  MLP      4       6        0.5795       0.3869  RBF      3       12
RST       0.5713       0.3927  RBF      3       7        0.5245       0.3953  RBF      4       12
UTR       0.5854       0.3736  MLP      3       10       0.6161*      0.3717  MLP      5       8
NDL       0.6656*      0.3419  MLP      5       8        0.6748*      0.3426  MLP      5       8
SAN       0.5671       0.3682  MLP      5       8        0.5855       0.3924  RBF      3       19
KNF       0.7058*      0.3554  RBF      3       119      0.6671*      0.3734  RBF      3       79
PTR       0.5707       0.3760  MLP      4       4        0.6820*      0.3475  MLP      5       10
PZV       0.6282*      0.3639  MLP      5       11       0.5734       0.3953  RBF      3       18
APG       0.6453*      0.3633  MLP      5       11       0.5668       0.3964  RBF      2       13
MNF       0.6346*      0.3622  MLP      4       8        0.5569       0.3916  RBF      3       13
SNG       0.5571       0.3936  RBF      2       7        0.5447       0.3951  RBF      2       9
UKB       0.5795       0.3818  RBF      3       17       0.5906       0.3767  MLP      4       11
RSU       0.5709       0.3828  MLP      3       5        0.5456       0.3918  MLP      4       5
LFO       0.6583*      0.3720  RBF      4       34       0.6761*      0.3480  MLP      6       10
TEO       0.6234*      0.3682  MLP      5       11       0.5563       0.3912  MLP      3       4


The Friday effect was indicated with the same precision by both types of neural networks (12 times each for MLP and RBF). The MLP and RBF networks used a similar number of input variables, but the number of elements in the hidden layer of the RBF networks was much bigger.

Comparing the performance of the neural network methods with general linear statistical analysis for investigating the day-of-the-week effect [14] revealed the advantages of the neural computations. By applying them, the presence of the day-of-the-week effect was confirmed in 20 cases, whereas the Kolmogorov-Smirnov test could show significant indications in only 3 such cases [14].

4 Summary and Conclusions

Despite quite extensive analysis of the day-of-the-week effect in the scientific literature, there is an evident lack of investigations applying artificial neural networks in this sphere.

In this article the day-of-the-week effect was researched for the case of the Lithuanian stock exchange, which can be assigned to the category of small emerging stock markets. Daily trade values of 24 shares of the Vilnius Stock Exchange over the time interval from 2003-01-01 to 2006-11-21 were arranged for analysis using artificial neural networks. The calculations were made by applying functional modules and standard procedures implemented in the software package Statistica Neural Network.

The investigation was based on the analysis of numerous variables: return, deals, number of shares, turnover, and H-L (high minus low price). Sensitivity analysis of the variables showed that the most important impact for defining the day-of-the-week effect was made by the variables DEALS, H_L and RETURN.

By applying the Statistica Neural Network software the best classifying neural networks were selected, though no preference could be given to either MLP or RBF neural networks due to their similar performance. Neural networks were more effective for revealing the Monday and Friday effects than the methods of traditional statistical analysis.

Application of the neural networks and the sensitivity analysis revealed that there are more variables which have a significant influence on the day-of-the-week effect. The results of the research presented in this article reveal that the analysis should be based not only on the values of daily mean return, but should use the data of the variables H-L and DEALS as well.

The research results confirmed the effectiveness of applying neural networks, as compared to traditional linear statistical methods, for this type of classification problem, where the effect is vaguely expressed and its presence is difficult to confirm. The effectiveness of the method has been confirmed by exploring the variables influencing the day-of-the-week effect, by evaluating their influence through sensitivity analysis, and by selecting the best performing type of neural network.

References

1. Balaban, E., Bayar, A., Kan, O.B.: Stock returns, seasonality and asymmetric conditional volatility in World Equity Markets. Applied Economics Letters 8, 263–268 (2001)

2. Basher, S.A., Sadorsky, P.: Day-of-the-week effects in emerging stock markets, vol. 13, pp. 621–628. Taylor and Francis, Abington (2006)

3. Brooks, C., Persand, G.: Seasonality in Southeast Asian stock markets: some new evidence on day-of-the-week effects. Applied Economics Letters 8, 155–158 (2001)

4. Fama, E.F.: The behaviour of stock market prices. J. Busin. 38, 34–105 (1965)

5. Flannery, M.J., Protopapadakis, A.A.: From T-bills to common stocks: investigating the generality of intra-week return seasonality. Journal of Finance 43, 431–450 (1988)

6. Gencay, R.: The predictability of security returns with simple technical trading. Journal of Empirical Finance 5, 347–359 (1998), http://dapissarenko.com/resources/fyp/Pissarenko2002.pdf

7. Kamath, R., Chusanachoti, J.: An investigation of the day-of-the-week effect in Korea: has the anomalous effect vanished in the 1990s? International Journal of Business 7, 47–62 (2002)

8. Kohers, G., Kohers, N., Pandey, V., Kohers, T.: The disappearing day-of-the-week effect in the world’s largest equity markets. Applied Economics Letters 11, 167–171 (2004)

9. Kumar, M., Thenmozhi, M.: Forecasting Nifty Index Futures Returns using Neural Network and ARIMA Models. Financial Engineering and Applications (2004)

10. Nekipelov, N.: An Experiment on Forecasting the Financial Markets, BaseGroup Labs (2007), http://www.basegroup.ru/tech/stockmarket.en.htm

11. Pissarenko, D.: Neural networks for financial time series prediction: Overview over recent research. BSc thesis (2002)

12. Qi, M.: Nonlinear predictability of stock returns using financial and economic variables. Journal of Business and Economic Statistics 17, 419–429 (1999)

13. Reschenhofer, E.: Unexpected Features of Financial Time Series: Higher-Order Anomalies and Predictability. Journal of Data Science 2, 1–15 (2004)

14. Sakalauskas, V., Kriksciuniene, D.: The impact of daily trade volume on the day-of-the-week effect in emerging stock markets. Information Technology and Control 36(1A), 152–158 (2007)

15. StatSoft Inc.: Electronic Statistics Textbook. StatSoft, Tulsa, OK (2006), http://www.statsoft.com/textbook/stathome.html

16. Steeley, J.M.: A note on information seasonality and the disappearance of the weekend effect in the UK stock market. Journal of Banking and Finance 25, 1941–1956 (2001)

17. Sullivan, R., Timmermann, A., White, H.: Dangers of Data-Driven Inference: The Case of Calendar Effects in Stock Returns. Working Paper, University of California, San Diego (1998), http://ucsd.edu/hwcv081a.pdf

18. Tang, G.Y.N.: Day-of-the-week effect on skewness and kurtosis: a direct test and portfolio effect. The European Journal of Finance 2, 333–351 (1998)

19. The Nordic Exchange (2006), http://www.baltic.omxgroup.com/

20. Virili, F., Reisleben, B.: Nonstationarity and data preprocessing for neural network predictions of an economic time series. In: Proc. Int. Joint Conference on Neural Networks 2000, Como, vol. 5, pp. 129–136 (2000)

21. Yao, J.T., Tan, C.L.: Guidelines for financial forecasting with neural networks. In: Proc. International Conference on Neural Information Processing, Shanghai, China, pp. 757–761 (2001)


WSRPS: A Weaning Success Rate Prediction System Based on Artificial Neural Network Algorithms

Austin H. Chen1,2 and Guan Ting Chen 2

1 Department of Medical Informatics, Tzu-Chi University, No. 701, Sec. 3, Jhongyang Rd. Hualien City, Hualien County 97004, Taiwan [email protected] 2 Graduate Institute of Medical Informatics, Tzu-chi University, No. 701, Sec. 3, Jhongyang Rd. Hualien City, Hualien County 97004, Taiwan [email protected]

Abstract. Weaning patients efficiently off mechanical ventilation continues to be a challenge for clinical professionals. Medical professionals usually make a decision to wean based mostly on their own experience. In this study, we present a weaning success rate prediction system (WSRPS); it is a computer-aided system developed from MatLab interfaces using Artificial Neural Network (ANN) algorithms. The goal of this system is to help doctors objectively and effectively predict whether weaning is appropriate for patients based on the patients’ clinical data. The system can also output Microsoft EXCEL (XLS) files (i.e. a database including the basic data of patients and doctors) to medical officers. To our knowledge, this is the first design approach of its kind to be used in the study of weaning success rate prediction.

1 Introduction

In this study, we present a Weaning Success Rate Prediction System (WSRPS) using Artificial Neural Network (ANN) algorithms. WSRPS is a user-friendly system designed to help doctors objectively and effectively predict whether weaning is appropriate for patients based on the patients’ clinical data.

As the average age of the population increases, the number of patients requiring mechanical ventilation is likewise increasing. Using a respirator has many disadvantages, such as the risk of respirator pneumonia, patient discomfort, and the economic burden to the family [12, 17]. There is no standard for doctors to decide when to begin a weaning process. Weaning too late not only increases the infection rate but also wastes limited medical resources.

Several weaning indices [19], protocols [8], and outcome managers [3] have been designed and tested in an effort to decrease costs and improve outcomes associated with prolonged ventilation. These efforts may have helped our understanding of the weaning process; however, knowing when and how to wean remains a challenge. Development of a reliable system using computer techniques provides a possible solution for these questions.

Among all the possible computer techniques, the Artificial Neural Network is one of the most promising candidates. ANN is a bionic analysis technique for Artificial Intelligence. It has been widely used in health care studies; examples include the diagnosis of myocardial infarction [2], the monitoring of lung embolism [11], the analysis of electrocardiogram wave patterns [18], the prediction of trauma death rate [6], operation decision support for brain surgery [13], the probability of child meningococcus [16], and the analysis of transplants [7].

The goal of this study was to develop a prediction system which can serve as an intelligent tool for doctors to predict the weaning success rate based on a patient’s direct test data. The Artificial Neural Network is a promising tool for this study.

2 Feature Selection

BWAP (Burns Wean Assessment Program) is a respirator weaning estimation form developed by Burns et al. [5, 4].

Table 1. Features and characteristics

Feature: Characteristics

1. Age: Age is one of the most dangerous factors causing weaning failure for people more than 70 years old [10].
2. Days of Intubation: Not only does the rate of hospital-acquired infection increase, but the strength and endurance of breathing decrease; the respirator weaning success rate decreases with more days of intubation [9].
3. Days of Estimating: During the days of intubation, medical officers should make a precise plan to estimate the days of weaning.
4. Heartbeat: It should be between 60 and 120 per minute. Medical officers should watch for angina pectoris and arrhythmia.
5. Systolic Pressure: It should be between 90 and 150 mmHg without using methoxamine or a low dose (dopamine 5 µg/kg/min or dobutamine 5 µg/kg/min).
6. Body Temperature: It should be between 35.5 and 37.5 °C with no infection.
7. Specific Volume of Blood: Observe whether the specific volume of blood is more than 30%.
8. Sodium (Na): It should be between 137 and 153 mEq/L.
9. Potassium (K): It should be between 3.5 and 5.3 mEq/L.
10. Max Pressure of Inhaling: Used as a significant reference for estimating whether the patient could breathe spontaneously.
11. Max Pressure of Expiration: It should be more than 30 cmH2O.
12. Index Number of Shallowly Quick Breath: It should be less than 105 per minute.
13. Air Exchanging Capacity per Minute: It should be smaller than 10 l/min.
14. Cough: Appraise the ability of patients to expel secretions from the breathing passage.
15. X-ray: It is important evidence of whether weaning will be successful.
16. Capacity of Urine: BUN is high with heart failure or destroyed kidney function; such patients may have dysuria, which would affect weaning.


It is composed of 26 factors, including 12 general classifications and 14 breathing-related classifications. Among these 26 features, weaning is considered successful if more than 17 features satisfy the requirements. In this study, we selected and screened 16 features after thorough discussion with experts and doctors. Our results demonstrate that the selection of these features produces very high accuracy in the prediction of weaning success. The features and their importance are summarized in Table 1.

3 Methods

3.1 Data Sources

This research collected 121 sets of data from a medical center in the middle region of Taiwan. The datasets range from June 21st, 2004 to December 31st, 2004.

Table 2. Data statistics of patients before weaning

Feature                                   Minimum   Maximum   Average   Standard deviation
Age                                       16        102       74.0      12.23
Days of intubation                        3         45        13.50     9.31
Days of estimating                        2         43        8.94      6.12
Heartbeat                                 63        160       89.82     15.92
Blood pressure                            47        187       122.31    20.40
Body temperature                          35        37.9      36.51     0.57
Specific volume of blood                  19.5      47.1      30.98     5.05
Sodium (Na)                               100       158       133.70    6.46
Potassium (K)                             2.5       6.6       4.05      0.74
Max pressure of inhaling                  -60       -6        -29.63    12.03
Max pressure of expiration                7         125       38.77     17.10
Index number of shallowly quick breath    8.6       214       90.00     43.32
Air exchanging capacity per minute        3.28      19.97     8.73      3.20


Table 3. Data statistics of patients before weaning

Feature              Range                  Frequency   Percent
Cough                No coughing ability    46          0.38
Cough                Has coughing ability   75          0.62
X-ray                No improvement         29          0.24
X-ray                Improvement            92          0.76
Capacity of urine    >1500 ml/day           106         0.88
Capacity of urine    <1500 ml/day           15          0.12

These records contain data on patients who are on a respirator and form the basis of our feature set. BWAP is composed of 26 factors; however, some of these features are not easily collected and are not specific. After discussion with doctors and experts, we screened and selected 16 features for this study. The data collected from the patients is summarized in Tables 2 and 3. After weaning, if the patient can breathe by himself/herself, we define it as a successful weaning.

3.2 Artificial Neural Network

We use two-thirds of the 121 datasets, or 81 sets of data, for training, and one-third, or 40 sets of data, for testing. ANN codes were written in MatLab using a Back Propagation Network (BPN). The network is a 3-layer feed-forward neural network: the input layer, the hidden layer, and the output layer. There are 16 neurons in the input layer, 7 neurons in the hidden layer, and 1 neuron in the output layer. Figure 1 displays the framework of the model.

Fig. 1. Framework of the ANN Model

Three kinds of BPN training algorithms were used to obtain optimal results: the conjugate gradient algorithm [15], the Levenberg-Marquardt algorithm [14], and the One-Step-Secant algorithm [1]. The training framework of the model is shown in Fig. 2.


Fig. 2. Training framework of the model
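The original WSRPS networks were trained in MATLAB; a rough, hedged equivalent of the 16-7-1 back-propagation network in Python (scikit-learn) is sketched below. The 16 input features, the 7 hidden units, the single output and the 81/40 train/test split follow the text, while the feature values and labels here are random stand-ins, and the lbfgs solver merely stands in for the MATLAB training algorithms named above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(121, 16))            # 121 patients x 16 BWAP-derived features
y = (rng.random(121) < 0.6).astype(int)   # 1 = successful weaning (hypothetical labels)

X_train, y_train = X[:81], y[:81]         # two thirds for training
X_test, y_test = X[81:], y[81:]           # one third for testing

# 16-7-1 feed-forward network trained by back-propagation.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(7,), solver="lbfgs",
                  max_iter=5000, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```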

4 System Implementation

In this study, we have developed a weaning success rate prediction system (WSRPS) that can help medical professionals predict the weaning success rate based on clinical test data. The system generates its predictions based on ANN techniques. WSRPS utilizes MATLAB codes and interfaces together to display the prediction results on a user-friendly interface. Our system can also output Microsoft EXCEL (XLS) files, producing a group of spreadsheets that include basic patient and doctor data for medical officers. To our knowledge, this is the first design approach of its kind to be used in the study of weaning success rate prediction.

4.1 System Architecture and Flowchart

WSRPS is a human-based system that provides two main functions to the users. The first function of the system generates two types of databases. One database records the basic information and clinical test data of the patients, and the other database records the codes of the doctor who is on duty and the conditions of the hospitalized patients. The second function of WSRPS is the core of the system. After users input the test data, the system will analyze the data and display the results on the interface. The system will also generate Microsoft EXCEL files which contain all patient weaning data.

The system was developed in a MatLab environment. MatLab is an interactive, matrix-oriented programming language that enables us to express our mathematical ideas very concisely and directly; it considerably reduces development time and keeps code short, readable and fully portable. WSRPS converts codes written in MATLAB and integrates the results onto the MATLAB GUIDE interface. The design flowchart of WSRPS is shown in Figure 3.


[Flowchart: Input Base Patient Data -> WeaningPatient.xls & Doctor.xls -> Input 16 BWAP factors of patients -> Choose analysis function -> BWAP Predictive / ANN Predictive -> Output score and weaning advice (Sale.xls) -> Exit]

Fig. 3. The design flowchart of WSRPS

Two system interfaces were designed in this study: Base Patient Data (BPD) and WSRPS interfaces.

4.2 The BPD System Interface (Figure 4)

There are three panels in the BPD interface: Patient, Doctor, and Note. Users can input Patient record ID, Patient name, Birthday, and Identification in the Patient panel. For known patients, they can load all previous data from XLS files by entering the Patient record ID. For new patients, the input data will be saved into XLS files which may relate to other data through the Patient record ID.

Medical officers must first enter their identification code in Doctor ID to retrieve and save data. The input fields in the Doctor panel include: (1) Department, the department to which the medical officer belongs; (2) Note date, the recording date; (3) Airway, the airway type (windpipe or windcut); (4) Source, selected from the popup menu to record where the patient comes from; (5) Out Hospital Condition, the condition of the patient when leaving the hospital; and (6) Out ICU Condition, the condition of the patient when leaving the ICU. The entered records are written into a Doctor.xls file.

The Note panel includes thirteen input fields. The users are required to enter the following clinical test data: (1) Airway ID; (2) Ad Hospital ID, the hospital where the patient stays; (3) Airway; (4) On_Endo_Date, the date of intubation; (5) endonumber, the number of the respirator; (6) weaning_endo_date, the date of weaning; (7) END, whether the weaning case is finished or not; (8) end_date, the finishing date; (9) success, successful or not; (10) endo_reason, the reason for intubation; (11) end_reason, the reason for terminating the case; (12) Chief complaint; (13) Pardiagnose. The entered records are written into WeaningPatient.xls files. The contents of the popup menus for the input fields are summarized in Table 4.

Fig. 4. The screen of the BPD system interface

Table 4. The Contents of Popup-menus

Popup-menu Name              Contents
Sex                          Male / Female
Department                   Chest Cavity Internal; Infections Branches; Stomach Branches; Kidneys Branches; Metabolism Branches; Immunities Branches; Nerves Internal; Heart Internal; Poisons Branches; Surgical
Airway (in Doctor Panel)     Airpipe / Aircut
Source                       Home; Other ICU; Other RCW; Other RCC; Other Normal Ward; Rest-home; Other Department
Out Hospital Condition       Leave with Success; Death; Leave on the Verge of Death; Leave with Health; Turn to Other Hospital; Other Reasons
Out ICU Condition            Leave with Success; Death; Leave on the Verge of Death; Leave with Health; Turn to Other Hospital; Other Reasons
Airway (in Note Panel)       Success; Success (Tracheostomy); Success (BIPAP); Failed; Turn to Other ICU; Leave with Health; Leave on the Verge of Death; Death
END                          Yes / No
Success                      Yes / No

Figure 5 shows an example of how to input data in the doctor’s department field. The users may select the department from the popup menu.

After the users input all data in the WSRPS interface (see Figure 6) and click the OK button, all data will be saved into XLS files.

4.3 The WSRPS System Interface (Figure 7)

There are four panels in the WSRPS interface: Patient Data, Input Data, ANN Prediction, and BWAP Prediction. Users enter basic patient data (Patient ID, Endo ID and Date) in the Patient Data panel. This data can serve as foreign keys in the database. The Input Data panel contains the 16 patient features used in this study.


Fig. 5. Popup menu for the Department field

Fig. 6. The screen for a complete data input


Fig. 7. The screen of the WSRPS system interface

Two functional panels were generated in this interface: ANN Prediction and BWAP Prediction. In the ANN Prediction panel, three Artificial Neural Network algorithms (Conjugate Gradient, Levenberg-Marquardt, and One-Step-Secant) are available.

5 Results and Discussion

5.1 Outputs from the System

Users may select either the BWAP or the ANN Prediction method for analysis. In the ANN Prediction panel, users first select the algorithm and then click the Training button to add the new data into the database. Users may also click the Predictive button to produce an ANN model and write the results into a WSRPS.xls file. The system then displays the results and advice generated from the Artificial Neural Network model (Figure 7). The Predictive button in the BWAP Prediction panel uses the single patient's data from the Input Data panel to calculate the score based on the BWAP estimate score list. It can help the users (i.e., doctors) to decide whether to wean the patient in question. Figures 8 and 9 show successful weaning examples using ANN prediction and BWAP prediction, respectively. Figures 10 and 11, however, show failed weaning examples using ANN prediction and BWAP prediction, respectively.
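To make the prediction workflow above concrete, the following is a minimal, illustrative sketch (not the authors' MatLab implementation) of training a neural network on saved weaning records and scoring one patient. The file name, the column names, and the use of scikit-learn's MLPClassifier with the L-BFGS solver (standing in for the chapter's Conjugate Gradient, Levenberg-Marquardt, and One-Step-Secant training algorithms) are all assumptions.

```python
# Hedged sketch: train an ANN on stored weaning records and predict for one patient.
# File name and column names are hypothetical; MLPClassifier/L-BFGS is only a stand-in
# for the training algorithms used by WSRPS.
import pandas as pd
from sklearn.neural_network import MLPClassifier

records = pd.read_excel("WSRPS.xls")          # hypothetical training database
X = records.drop(columns=["success"]).values  # the 16 patient features
y = records["success"].values                 # 1 = weaned successfully, 0 = failed

model = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=1000)
model.fit(X, y)

new_patient = X[-1:].copy()                   # stand-in for the Input Data panel values
print("predicted weaning success probability:",
      model.predict_proba(new_patient)[0, 1])
```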

WSRPS can also generate three XLS files (shown in Figures 12, 13, and 14) to share data among computers that may or may not have MatLab installed.


Fig. 8. A successful example using the ANN prediction system

Fig. 9. A successful example using BWAP prediction system


Fig. 10. A failed example of the ANN prediction system

Fig. 11. A failed example of the BWAP prediction system


Fig. 12. Save the input data of the BPD interface into the WeaningPatient.xls file

Fig. 13. Save the data of the BPD interface into the Doctor.xls file

Fig. 14. Save the data of the WSRPS interface into an XLS file


5.2 Performance Comparison from Three ANN Algorithms

The confusion matrix used to analyze the accuracy and sensitivity of the three algorithms is defined in Table 5.

Table 5. Confusion matrix

                                  Actually weaning successful (Y)   Actually weaning fail (N)
Predict weaning successful (Y)    A                                 B
Predict weaning fail (N)          C                                 D

Where Accuracy, Sensitivity, and Specificity are defined by

Accuracy = (A + D) / (A + B + C + D), Sensitivity = A / (A + C), Specificity = D / (B + D).
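A small sketch of these metrics as code, using the cell labels of Table 5; the function name and the example counts are made up.

```python
# Minimal sketch of the confusion-matrix metrics defined above, with the cell
# labels A (true positive), B (false positive), C (false negative), D (true negative).
def weaning_metrics(A: int, B: int, C: int, D: int) -> dict:
    total = A + B + C + D
    return {
        "accuracy": (A + D) / total,      # correct predictions over all cases
        "sensitivity": A / (A + C),       # detected successful weanings
        "specificity": D / (B + D),       # detected failed weanings
    }

# Example with made-up counts:
print(weaning_metrics(A=40, B=10, C=10, D=30))
```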

Table 6 summarizes the results from these three algorithms. The results from all these algorithms show a very high sensitivity (around 80%) and accuracy (around 70%).

Table 6. Analyzing results from 3 ANN algorithms

Algorithm                   Metric        Mean   Standard deviation   95% CI
Conjugate Gradient Method   Sensitivity   0.79   0.07                 0.03
                            Accuracy      0.73   0.07                 0.03
Levenberg-Marquardt         Sensitivity   0.80   0.10                 0.05
                            Accuracy      0.68   0.06                 0.03
One-Step-Secant             Sensitivity   0.83   0.11                 0.05
                            Accuracy      0.74   0.05                 0.02

6 Conclusions

In this study, we present a weaning success rate prediction system (WSRPS); it is a computer-aided system developed from MatLab interfaces using Artificial Neural Network (ANN) algorithms. The goal of this system is to help doctors objectively and effectively predict whether weaning is appropriate for patients based on the patients' clinical data. In addition to its prediction capability, WSRPS can also generate several Microsoft EXCEL (XLS) files for medical professionals. These files constitute a database that includes basic patient data for doctors. To our knowledge, this is the first design approach of its kind to be used in the prediction of weaning success rate.

Acknowledgements

The authors thank the National Science Council of the Republic of China for their financial support of project NSC 95-2221-E-320-001.




Dealing with Large Datasets Using an Artificial Intelligence Clustering Tool

Charalampos N. Moschopoulos¹, Panagiotis Tsiatsis¹, Grigorios N. Beligiannis², Dimitrios Fotakis³, and Spiridon D. Likothanassis⁴

¹ Department of Computer Engineering & Informatics, University of Patras, GR-26500, Rio, Patras, Greece; [email protected], [email protected]
² Department of Business Administration in Food and Agricultural Enterprises, University of Ioannina, G. Seferi 2, GR-30100, Agrinio, Greece; [email protected]
³ Department of Information and Communication Systems Engineering, University of the Aegean, 83200 Karlovasi, Samos, Greece; [email protected]
⁴ Department of Computer Engineering & Informatics, University of Patras, GR-26500, Rio, Patras, Greece; [email protected]

Abstract. Nowadays, clustering on very large datasets is a very common task. In many scientific and research areas, such as bioinformatics and economics, clustering on very big datasets has to be performed by people who are not familiar with computerized methods. In this contribution, an artificial intelligence clustering tool is presented that is user friendly and includes various powerful clustering algorithms able to cope with very large datasets that vary in nature. Moreover, the tool presented in this contribution allows the combination of various artificial intelligence algorithms in order to achieve better results. Experimental results show that the proposed artificial intelligence clustering tool is very flexible and has significant computational power, which makes it suitable for clustering applications on very large datasets.

1 Introduction

In recent years, applications in bioinformatics, economics, image processing, and document categorization generate large datasets that need to be clustered or classified so that valuable conclusions can be drawn. The main goal in these situations is to find the best set of representatives among the members of the datasets in order to classify the data correctly. Moreover, these representatives can be used for clustering or classifying oncoming unknown data belonging to the same problem. Classical clustering algorithms are not suitable for this kind of problem due to the very large size of the initial datasets. That is the main reason why hybrid intelligent clustering algorithms should be used and should be included in user-friendly computer applications, so that even untrained users will be able to use them in their applications.

In the area of bioinformatics, many clustering tools have been developed for detecting gene patterns (Yoshida et al. 2006), (Zheng et al. 2006). Clustering tools have also been developed for extracting information concerning protein structure (Wainreb et al. 2006) or for analyzing data derived from microarray experiments (Bolshakova et al. 2005). In large databases, clustering tools have to take into consideration the relationships between a database's members (Ryu and Eick 2005). Nevertheless, a clustering method should focus on the nature of the data hosted in a database. For instance, in document databases, fuzzy clustering tools have been successfully applied (Torra et al. 2005), while in image clustering, a tool called eID (Stan and Sethi 2003) is characterized by its hierarchical clustering approach. Finally, in financial forecasting, many hybrid clustering methods have been embedded in decision support tools (Li 2000), (Tsang et al. 2004).

In this contribution, an artificial intelligence clustering tool is presented that is user friendly and includes various powerful clustering algorithms able to cope with very large datasets that vary in nature. The presented clustering tool provides classification and visualization results to users. The hybrid method used for coping with very large datasets is to apply two different algorithms to them: the first algorithm reduces the initial dataset to a much smaller one, while the second algorithm performs quality clustering on the new, smaller dataset. In this way, we are able to obtain very good results in a relatively short period of time.

The algorithms included in the clustering tool are flexible and can be applied to datasets having very different properties. This is demonstrated in this paper, where exhaustive experiments are performed on many different datasets. As a result, the proposed clustering tool can be applied to many problems belonging to different research and scientific areas and still provide very satisfying results. However, it is advisable to follow a small preprocessing procedure, depending on the nature of each problem, in order to maximize the quality of the solutions derived from the presented clustering tool.

The rest of the contribution is organized as follows. Section 2 provides a brief review of the algorithms that are included in the presented clustering tool. In Section 3, we present the main characteristics and properties of the proposed clustering tool in detail, while in Section 4 we perform extensive experiments on synthetic datasets with totally different properties in order to demonstrate the flexibility of the proposed tool. A detailed example of the application of the tool to a specific dataset is also presented in order to show its operation details and its user-friendly interface. Finally, Section 5 summarizes the conclusions and refers to issues concerning future work.

2 Presentation of the Clustering Tool’s Algorithms

In this section the algorithms that are included in the clustering tool are presented. As mentioned before, these algorithms are divided into two categories: those belonging to the "first level" and those belonging to the "second level". The "first level" algorithms are the algorithm of Meyerson (Meyerson 2001) and the Fast Facility Location (FFL) algorithm (Fotakis 2006). The "second level" algorithms are the k-means algorithm, the LocalSearch algorithm (Guha et al. 2003), and the FL subroutine, which is part of the LocalSearch algorithm.


Usually, when clustering has to be performed on very large datasets, a combination of algorithms is used. In these situations, classical clustering algorithms such as LocalSearch and k-means cannot be applied efficiently because they are too slow. Also, algorithms such as FFL and Meyerson alone will not provide very satisfactory results. So, in order to obtain accurate results in a short time, a combination of algorithms belonging to the two levels mentioned above should be used. In the rest of this section, a review of all algorithms included in the presented clustering tool is given.

2.1 The Algorithm of Meyerson

The online facility location problem has become very popular during the last decade. Many applications belong to this category of problems, such as network construction and mobile computing. According to the definition of the problem, a set of "demand" points is given. Each point arrives one at a time, so the algorithm does not know the whole dataset a priori. The goal is to choose a set of facilities that represents, in a (near) optimal way, the large volumes of data belonging to the input data stream. Maintaining each facility has a specific cost, $w_f$; otherwise the solution of the problem would be obvious (every demand point could be a facility). The set of facilities that are maintained is denoted by F and the whole set of demand points by D. In this contribution, we consider the uniform case, that is, the cost of each facility is the same for every point.

The algorithm has to decide dynamically, for each demand point, whether to assign it to an existing facility or to make it a new facility. The algorithm's objective is to minimize the total cost, which is defined as the cost of maintaining a number of facilities plus the assignment cost. More formally, the total cost can be represented as:

$$ \sum_{w_f \in F} w_f \; + \; \sum_{u \in D} d(F, u) $$

where $w_f$ is the facility cost and $d(F, u)$ is the minimum distance between the demand point $u$ and the set of facilities.

The algorithm of Meyerson (Meyerson 2001) has been developed for the online facility location problem. It considers each demand point only once and it is randomized. It works as follows. First, the distance of a new demand point $u$ to the closest facility opened so far is measured. Then, with probability $d(F, u) / w_f$, a new facility is opened at this demand point, that is, this demand point is from now on a new facility. Otherwise, $u$ is assigned to its closest opened facility.

It is obvious that the value of $w_f$ plays an important role in the number of facilities that will finally be opened. For example, a high value of $w_f$ will cause fewer facilities to be opened. The algorithm of Meyerson holds in memory only the set of facilities. Moreover, according to the operation of the algorithm, a facility is very likely to be opened in a dense area of the metric space. However, the algorithm of Meyerson is prone to creating unnecessary facilities, as it does not review the usefulness of the already opened facilities.
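The following is a minimal Python sketch of Meyerson's randomized rule as described above, together with the total cost of the resulting solution. The uniform facility cost and the two-dimensional Gaussian demand points are illustrative assumptions.

```python
# Hedged sketch of Meyerson's online facility location rule: each arriving demand
# point becomes a new facility with probability d(F, u) / w_f, otherwise it is
# assigned to its nearest open facility. Points and facility cost are made up.
import math
import random

def meyerson_online(points, facility_cost):
    facilities = [points[0]]                     # the first point always opens a facility
    assignment_cost = 0.0
    for u in points[1:]:
        d = min(math.dist(u, f) for f in facilities)
        if random.random() < min(1.0, d / facility_cost):
            facilities.append(u)                 # open a new facility at u
        else:
            assignment_cost += d                 # assign u to its closest facility
    total_cost = facility_cost * len(facilities) + assignment_cost
    return facilities, total_cost

pts = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
       for cx, cy in [(0, 0), (10, 10)] for _ in range(500)]
random.shuffle(pts)
F, cost = meyerson_online(pts, facility_cost=25.0)
print(len(F), "facilities opened, total cost", round(cost, 1))
```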

2.2 The Fast Facility Location Algorithm

The Fast Facility Location (FFL) algorithm (Fotakis 2006) is based on Meyerson's Online Facility Location algorithm and is also used for the online facility location problem. FFL has the ability, in one pass, to maintain a set of facilities approximating the optimal configuration within a constant factor.

The property that distinguishes the FFL algorithm from Meyerson's algorithm is that FFL has a mechanism that closes some of the old facilities based on a memoryless version of the merge rule presented in (Fotakis 2004). According to that rule, each facility $w$ is the only one that exists in its merge ball $B_w(r_w)$, where $r_w$ is the replacement radius of the facility $w$. When a new facility $u$ is created, its replacement radius is initialized to a fraction of the distance between $u$ and the nearest existing facility. If $u$ is included in an existing facility's ball, then the old facility is closed and its demands are re-assigned to $u$. The algorithm keeps decreasing $r_w$ to ensure that no merge operations can dramatically increase the total assignment cost of the demands.

FFL keeps in memory only the set of facilities, denoted by $F$, and the replacement radius $r_w$ of each facility $w$. Because of the "closing facility" mechanism, it is a little slower than Meyerson's algorithm, but it does not create unnecessary facilities.

2.3 The k-Means Algorithm

The k-means algorithm is the most popular clustering algorithm (Kaufman and Rousseeuw 1990), due to its simplicity and its effectiveness. It belongs to the wide category of the partitioning methods, where the dataset is subdivided into k groups.

In the first step of the algorithm, k demand points are chosen randomly as the initial set of facilities. Every other point $p \in D$ is assigned to its closest facility; that is how the clusters are formed. For every cluster, its centroid is computed and the demand points are reassigned. This procedure is repeated until the assignment cost cannot be further reduced.
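A minimal NumPy sketch of this k-means loop (random initial facilities, assignment of every point to its closest facility, and centroid updates until nothing changes); the sample data are made up.

```python
# Hedged sketch of the k-means loop described above.
import numpy as np

def kmeans(points: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid; keep the old one if a cluster became empty
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

data = np.vstack([np.random.randn(200, 2) + c for c in ([0, 0], [8, 8], [0, 8])])
centers, labels = kmeans(data, k=3)
print(centers.round(2))
```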

2.4 The LocalSearch Algorithm

The LocalSearch algorithm (Guha et al. 2003) is an algorithm used for solving the k-median problem and is derived from a variation of the algorithm presented in (Charikar and Guha 1999).

Contrary to the k-means algorithm, LocalSearch does not produce a predefined number of facilities. It uses binary search in order to find a value of the facility cost f that leads the algorithm to produce a number of facilities close to the desired one. Furthermore, the LocalSearch algorithm uses the technique of sampling in order to reach a solution faster. However, this has only a small impact on the algorithm's approximation of the optimal solution (Charikar and Guha 1999). At the end of the execution of the LocalSearch algorithm, a set of feasible facilities is obtained, denoted by M.

In order to reach the desired number of final facilities, the LocalSearch algorithm has to proceed to appropriate closures of unneeded facilities and respective re-assignments of points. For this purpose, the LocalSearch algorithm uses a subroutine called FL. This subroutine takes as input a set of feasible facilities M and checks whether the opening of a facility $u \in M$ reduces the final cost by closing all facilities whose points can be re-assigned to u. The FL subroutine stops when there is no important reduction of the final cost. If no significant change to the final cost occurs, the value of f changes and the procedure is repeated. It has been proved that the solution derived from the combination of sampling and FL is close to the optimal one.

The proposed clustering tool gives the user the opportunity to carry out experiments not only using the whole LocalSearch algorithm but also using the FL routine alone.

3 The Clustering Tool

The proposed clustering tool gives users the opportunity to choose the algorithm or the combination of algorithms that they prefer to use and provides various statistical results and visualization options. Moreover, the clustering tool can work as a classifier. In Fig 1 the main window of the proposed clustering tool is shown.

Fig. 1. The main window of the clustering tool


The clustering tool gives the user the opportunity to define the values of many parameters due to the significant number of different clustering algorithms that it contains. Special care has been taken to make the clustering tool as friendly as possible and not to create any confusion for the users. The clustering tool contains a Help Menu which describes in detail the steps that have to be followed by the user in order to use each different algorithm. Depending on the algorithm that each user selects to execute, the respective parameters are set to the active state, and can be defined by the user, while all others are set to the inactive state, and cannot be defined by the user. Moreover, there are warning messages that inform the user about obligatory choices. For instance, in cases where the input data have high dimensionality, visualization cannot be performed. Finally, when the execution button is pressed, the clustering tool performs an error check on the parameter values inserted by the user and produces pop-up error messages that inform the user in case he/she has made a wrong choice.

In the following paragraphs the parts of the clustering tool are presented thoroughly, that is, the selection of the algorithms, the definition of the parameters, the definition of the input data, and, finally, the statistical and visualization options.

3.1 Selection of the Algorithms

In Part 1, as can be seen in Fig 1, the user can choose the algorithm(s) of his desire. He can run an algorithm independently or in combination with others. In case a combined algorithmic strategy is used, the algorithms of the clustering tool are separated into two levels. The first level contains the FFL and the algorithm of Meyerson, while the second one contains the LocalSearch and the k-means algorithms and the subroutine FL of the LocalSearch algorithm.

Usually, a combination of algorithms is used in cases where the clustering procedure has to be applied to large amounts of data. The algorithms of the first level are very fast and their goal is to reduce the initial dataset to a much smaller one, without losing much of the initial information. The second level's algorithms provide better cluster quality and work faster due to the small size of the dataset derived from the application of the first-level algorithms.
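The following sketch illustrates this two-level idea under simplifying assumptions: the first level here is a plain one-pass distance-threshold reduction standing in for FFL or Meyerson's algorithm, and scikit-learn's KMeans (run with sample weights) stands in for the second-level algorithm. It is not the tool's own implementation.

```python
# Hedged sketch of the two-level strategy: reduce the dataset to weighted
# representatives in one pass, then cluster only the representatives.
import numpy as np
from sklearn.cluster import KMeans

def first_level_reduce(points, radius):
    """One pass: a point becomes a representative unless an existing
    representative is closer than `radius`; otherwise it adds to that
    representative's weight. (Simplified stand-in for FFL/Meyerson.)"""
    reps, weights = [points[0]], [1]
    for p in points[1:]:
        d = np.linalg.norm(np.asarray(reps) - p, axis=1)
        j = int(d.argmin())
        if d[j] < radius:
            weights[j] += 1
        else:
            reps.append(p)
            weights.append(1)
    return np.asarray(reps), np.asarray(weights, dtype=float)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 1.0, size=(20_000, 2)) for c in ([0, 0], [10, 0], [5, 9])])

reps, w = first_level_reduce(data, radius=1.5)          # much smaller weighted set
model = KMeans(n_clusters=3, n_init=10).fit(reps, sample_weight=w)
print(len(reps), "representatives, final centers:\n", model.cluster_centers_.round(2))
```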

The next step for the user is to specify the values of the parameters belonging to the algorithm he has selected. The parameter definition is performed in Part 2, as can be seen in Fig 1. As mentioned before, only the parameters of the selected algorithms are active. The parameters concerning the dimensionality and the distance metric have to be defined in every case. The user can select between the Euclidean and Manhattan distance, depending on the input instance of each experiment. If the FFL algorithm or Meyerson's algorithm is selected, then the parameter A and the facility cost (FAC) must be defined. The FAC also has to be defined if the FL routine is selected. The sample proportion option is needed for the LocalSearch algorithm. Finally, the k-means algorithm demands the specification of the k center parameter. The clustering tool allows the user to save specific parameter values in files, in case he/she wants to use them in the future.


3.2 Input and Output File Options

Following the specification of each algorithm and its parameters, the user is able to define the input data file and also to give a name for the file which will contain the results of the performed clustering (Part 3 in Fig 1). At this point the user has two options: he/she can use the synthetic data generator or he/she can define the input data file on his/her own.

Usually, the synthetic data generator is chosen if the user wants to perform experiments in order to test the performance of the algorithms that are included in the clustering tool. For this reason, the synthetic data generator has the ability to create many different kinds of instances. More specifically, the user can define the number of points and the number of clusters he would like to have in the created instance. He can also define the spacing of the clusters' centers (in grid or random mode) and the percentage of noise in the input data. Moreover, the synthetic data generator allows the user to create small or big clusters and to define the proportion between a small, a big, and a normal cluster. Also, the deviation of the clusters can be defined by the user. Another important property of the synthetic data generator is the ability to construct a synthetic dataset that represents the worst-case scenario. This is achieved by ordering the demand points in the following way: placing the most distant ones first and those that are near a cluster's center last. The online algorithms will have significant difficulties dealing with those datasets, as they function in one pass and as a result they will create many unnecessary facilities. Finally, the synthetic data generator estimates an optimal cost for every generated dataset.
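A minimal sketch of such a generator, covering the options listed above (number of points and clusters, grid or random center placement, noise percentage, cluster deviation, and worst-case ordering). It is an illustrative stand-in, not the tool's own generator, and all parameter names are assumptions.

```python
# Hedged sketch of a 2-D synthetic data generator with grid/random centers,
# uniform noise, per-cluster deviation, and optional worst-case point ordering.
import numpy as np

def generate(n_points, n_clusters, grid=True, noise=0.05,
             deviation=1.0, worst_case=False, seed=0):
    rng = np.random.default_rng(seed)
    if grid:
        side = int(np.ceil(np.sqrt(n_clusters)))
        centers = np.array([(10.0 * i, 10.0 * j)
                            for i in range(side) for j in range(side)])[:n_clusters]
    else:
        centers = rng.uniform(0, 10 * n_clusters, size=(n_clusters, 2))
    owner = rng.integers(n_clusters, size=n_points)
    points = centers[owner] + rng.normal(scale=deviation, size=(n_points, 2))
    n_noise = int(noise * n_points)                    # replace a fraction with uniform noise
    points[:n_noise] = rng.uniform(points.min(), points.max(), size=(n_noise, 2))
    if worst_case:                                     # most distant points arrive first
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
        points = points[np.argsort(-d)]
    return points, centers

pts, ctrs = generate(100_000, 10, grid=True, noise=0.05, worst_case=True)
print(pts.shape, ctrs.shape)
```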

If the user chooses to use an already existing input file, then the only thing he has to do is to browse it through the Open Input File button. This option is usually used if the input file represents the demand points of a real case problem.

Whatever input data is used, the name of the output file must be defined through the Open Output File button. The user can create a new output file or use an already existing one.

3.3 Other Tool Options

Due to the stochastic nature of the algorithms, the results of each execution are quite different. For this purpose, the clustering tool offers an option for batch execution of the selected algorithms, so that the user is able to estimate "mean" results. Another helpful property that can be used is the time limit option. In addition, if the synthetic data generator is used, the clustering tool allows the user to choose between a facility cost value proposed by the system and a facility cost value he has chosen. The proposed facility cost will lead to results that contain approximately the same number of clusters as the generated input.

Moreover, the user can select automatic visualization of the results, if the dimensionality of the data is equal to 2. Three datasets are visualized: the initial dataset, the facilities generated by the execution of the first online algorithm (if the user has selected a combination of algorithms), and the set of final facilities (Fig 2).

Besides that, the clustering tool can create three output files: Console Output.txt, Error log.txt, and Clusters and Points.txt. The first one contains all the outputs of the program in detail, the second one contains error messages that were generated during the execution of the algorithms, while the last one contains the clusters and the points that belong to each cluster. The visualized sets and the presentation of the results are controlled by the choices made by the user in Part 5 in Fig 1.

Fig. 2. (a) Initial dataset consisting of 10 clusters. (b) Visualization of the results after the application of the first algorithm. The red dots are the generated facilities; due to the operation of the algorithm, the number of generated facilities is much bigger than the number of clusters. (c) Visualization of the results after the application of the second algorithm. The red dots are the final 10 facilities. Each cluster has been colored differently in order to help the user understand the final result.

Finally, at the bottom of the main window of the clustering tool warning messages are displayed, during the execution of the algorithms, in order to guide the user (Part 6 in Fig 1).

4 Application on Synthetic Datasets

The clustering tool has been developed in order to assist users to perform clustering on very large datasets. The user can apply an algorithmic strategy based on the properties of the dataset which will be clustered.

In this section, we describe in detail how the user can use various algorithmic strategies on synthetic datasets with different characteristics and properties. We show that the proposed clustering tool can reach very good results in all cases in order to demonstrate its usefulness and effectiveness. Table 1 shows a number of datasets that can be created by the embedded synthetic data generator. The diversity of the synthetic datasets demonstrates the ability of the synthetic data generator included in the clustering tool.

Table 1. Synthetic datasets used in experiments

Dataset   Dimension   K     Number of points   Centers' location   Properties
1         2           10    100,000            grid mode           random noise=5%, 1 big cluster, 2 sparse clusters
2         2           10    500,000            grid mode           random noise=5%, 1 big cluster, 2 sparse clusters
3         2           20    500,000            grid mode           random noise=5%, 1 big cluster, 2 sparse clusters
4         2           50    500,000            grid mode           random noise=5%, 1 big cluster, 2 sparse clusters
5         2           100   500,000            grid mode           random noise=5%, 1 big cluster, 2 sparse clusters
6         2           15    1,000,000          grid mode           random noise=5%, 1 big cluster, 3 sparse clusters
7         2           20    5,000,000          grid mode           random noise=5%, 1 big cluster, 4 sparse clusters
8         2           10    100,000            random mode         random noise=5%, 1 big cluster, 2 sparse clusters
9         2           10    500,000            random mode         random noise=5%, 1 big cluster, 2 sparse clusters
10        25          10    500,000            random mode         random noise=5%, 1 big cluster, 2 sparse clusters, ordered demand points
11        50          10    500,000            random mode         random noise=5%, 1 big cluster, 2 sparse clusters, ordered demand points
12        25          20    1,000,000          random mode         random noise=5%, 2 big clusters, 4 sparse clusters
13        2           10    500,000            grid mode           random noise=5%, 1 big cluster, 2 sparse faraway clusters
14        2           10    500,000            random mode         random noise=5%, 1 big cluster, 1 sparse faraway cluster
15        100         10    100,000            random mode         random noise=5%, 1 big cluster, 2 sparse clusters

Due to their size, a combination of algorithms should be used, as described in previous sections. However, a specific combination of algorithms will not have the same success on all datasets due to the diversity in their properties and characteristics.

For instance, for the first or the eighth dataset the user can apply a traditional clustering algorithm such as the k-means or the LocalSearch algorithm. A combined algorithmic strategy is not recommended, because the number of demand points is not large enough to justify such a strategy; the difference in execution time between a traditional clustering algorithm and a combination of algorithms is just a few seconds. So, the LocalSearch algorithm without the technique of sampling will be the best choice in this case, as it can obtain better results than the k-means algorithm (O'Callaghan et al. 2002).

The second, the third, and the ninth dataset are similar, with low dimensionality and few centers, as can be seen in Table 1. In this case the most appropriate strategy is the use of the FFL and LocalSearch algorithms. The number of mathematical operations generated by the execution of the FFL algorithm is not prohibitively big compared with the technique of sampling used in the LocalSearch algorithm. However, when the number of centers gets bigger, as in the fourth and the fifth dataset, the FFL-LocalSearch strategy becomes too demanding. For this reason, it is better if the user chooses the LocalSearch algorithm combined with the technique of sampling.

The major property of the sixth and the seventh dataset is the very large number of demand points, while the dimensionality of the data and the number of centers have small values. Thus, the FFL-LocalSearch strategy is the most appropriate one, for the same reasons as described above. Nevertheless, in the following datasets (tenth, eleventh, twelfth, and fifteenth) the dimensionality increases. Moreover, the demand points are ordered, representing the worst-case input scenario. The FFL algorithm will be severely affected by these properties and its performance will decrease. On the other hand, the LocalSearch algorithm, which uses the sampling technique, is able to overcome this problem.

The thirteenth and the fourteenth dataset contain two and one cluster, respectively, that are very small and far away from all the others. These clusters are very difficult to discover by sampling the input data. On the contrary, the FFL algorithm can detect the remote cluster. So, the user should use the FFL-LocalSearch strategy.

The assignment cost is used as a measure of the quality of the solutions provided by the algorithms. In order to compare the quality of the various solutions, we divide the assignment cost obtained in each case by the estimated optimal cost produced by our synthetic data generator. As can be seen in Fig 3, in most cases the proposed algorithmic strategies produce better solutions than the estimated optimal ones (the optimal for each situation equals 1). On the last six datasets, the algorithmic strategies cannot reach the estimated optimum. This is normal, as these datasets are very difficult to cluster. However, the proposed strategies achieve very good approximations on these "difficult" datasets. This is clear proof that the algorithms included in the proposed clustering tool are able to achieve very good results. So, the users of the clustering tool can easily apply very powerful algorithms in order to perform clustering on very large datasets.

Fig. 3. The approximations reached by the algorithmic strategies (pink line) and the estimated optimal (blue line)

Finally, we have to mention that Meyerson's and the k-means algorithms can also be used by the users of the clustering tool. Meyerson's algorithm achieves slightly worse results than the FFL algorithm but is quicker. Concerning the k-means algorithm, it is quicker than the LocalSearch algorithm but unable to achieve the quality of results that the LocalSearch algorithm can reach. However, depending on the demands of the user, these algorithms can also be an excellent choice for performing clustering on an unknown dataset.

4.1 An Example of Using the Clustering Tool on a Synthetic Dataset

In this section the use of the clustering tool on the third dataset is presented in detail. Also, all the output files that are generated by the application are shown.

As mentioned before, the most appropriate strategy for the third dataset is the combination of the FFL and LocalSearch algorithms. First, we fill in all the necessary fields in the main window of the clustering tool. In Fig 4 the main window of the clustering tool after the definition of all necessary parameters is shown. As can be seen, we select the Euclidean distance as the metric and we set the values of parameter A and the facility cost to such values that, after the execution of the FFL algorithm, around 3500 facilities will be generated. Furthermore, we choose to run this combination three times due to the stochastic nature of the chosen algorithmic strategy. Finally, we select to perform classification and to create all the extra files that the clustering tool can create.


Fig. 4. The main window of the clustering tool after the specification of the parameters’ values


Fig. 5. (a) The third dataset (b) The facilities generated by the first algorithm (FFL) (c) The final facilities generated by the second algorithm (LocalSearch)

When the execution has been completed, the user can see the visualizations of the initial dataset, the results of the first algorithm (FFL algorithm) and the results of the second algorithm (LocalSearch algorithm). These visualizations are shown in Fig 5.

It is obvious that the selected strategy achieves excellent results. In Fig 5(c), it is shown that two of the final facilities have been generated in the center of the small clusters, which is really very difficult to achieve.


Fig. 6. The "Results" pop-up window. The 1st part of the window contains all the properties of the input dataset and the values of the clustering tool's parameters. Moreover, it presents the results of each algorithm for each execution separately. The 2nd part of the window contains the average results of the algorithmic strategy used.

If the "Show Results" button is pressed, a new window pops up and shows the results of each execution and the average performance of the respective algorithms (Fig 6). The "Results" window is separated into two parts. The first part contains the properties of the clustered dataset, the values of the parameters, and the performance of each algorithm for each execution. The second part of the "Show Results" window presents the average performance of the algorithms. The results contain the number of facilities that were opened and the assignment cost of the proposed solution, which is the Euclidean distance of all the demand points from the center closest to them. Also, the total cost of the solution is recorded, which is the sum of the assignment cost plus the cost of maintaining a number of open facilities. Finally, the execution time (in seconds) needed for the algorithm to complete is reported.

These results are also saved in a file called "Final_Results.txt", as can be seen in Fig. 7. Besides that, a file called "Clusters and Points.txt" is generated, which contains all the clusters generated by the first or the second algorithm and the attributes of the points that constitute these clusters. In other words, this file contains the results of the classifier applied by the clustering tool (Fig. 8 and Fig. 9). As expected, the facilities which were opened by the FFL algorithm contain few demand points (Fig. 8), in contrast to the facilities opened by the LocalSearch algorithm (Fig. 9). Finally, two files are generated, called "Intermediate Output.txt" (Fig. 10) and "Final Output.txt" (Fig. 11), which contain the facilities generated by the first and the second algorithm, respectively. The first column of these files represents the number of demand points that are assigned to these facilities.

Fig. 7. Screenshot of the "Final_Results.txt" file

Fig. 8. Screenshot of the "Clusters and Points.txt" file (FFL algorithm)

Fig. 9. Screenshot of the "Clusters and Points.txt" file (LocalSearch algorithm)

Fig. 10. Screenshot of the "Intermediate Output.txt" file

Fig. 11. Screenshot of the “Final Output.txt” file

5 Conclusions and Future Work

In this contribution, a new artificial intelligence clustering tool, that is appropriate for application on large datasets, is presented. It includes various clustering algorithms and it is able to achieve very good results by applying them in combinations to unknown input datasets.

Due to the different nature of the algorithms included in the proposed clustering tool, by using the appropriate algorithmic strategy each time, the user is able to reach very good solutions and cope with very large datasets that vary in their characteristics and properties, as demonstrated by the exhaustive experimental results. Furthermore, the proposed clustering tool is user friendly and can be used as a classifier as well. Nowadays, very large datasets are generated in many real-world problems belonging to different scientific and research areas, such as bioinformatics and economics. These datasets need to be effectively processed in order to lead to useful conclusions. The main issue of our future work will be the application of the proposed clustering tool to very large datasets deriving from real-world problems.

References

[1] Bolshakova, N., Azuaje, F., Cunningham, P.: An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics 21(4), 451–455 (2005)

[2] Charikar, M., Guha, S.: Improved combinatorial algorithms for the facility location and k-median problems. In: FOCS, pp. 378–388 (1999)

[3] Fotakis, D.: Incremental algorithms for facility location and k-median. In: Albers, S., Radzik, T. (eds.) ESA 2004. LNCS, vol. 3221, pp. 321–347. Springer, Heidelberg (2004)

[4] Fotakis, D.: Memoryless facility location in one pass. In: Durand, B., Thomas, W. (eds.) STACS 2006. LNCS, vol. 3884, pp. 608–620. Springer, Heidelberg (2006)

[5] Guha, S., Meyerson, A., Mishra, N., Motwani, R., O'Callaghan, L.: Clustering data streams: theory and practice. IEEE TKDE 15(3), 515–528 (2003)

[6] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., New York (1990)

[7] Li, S.: The development of a hybrid intelligent system for developing marketing strategy. Decis. Support Syst. 27, 395–409 (2000)

[8] Meyerson, A.: Online facility location. In: FOCS, pp. 426–431 (2001)

[9] O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R.: Streaming-data algorithms for high-quality clustering. In: IEEE ICDE, p. 685 (2002)

[10] Ryu, T.W., Eick, C.F.: A database clustering methodology and tool. Inf. Sci. 171, 29–59 (2005)

[11] Stan, D., Sethi, I.K.: eID: a system for exploration of image databases. Inf. Process. Manag. 39, 335–361 (2003)

[12] Torra, V., Miyamoto, S., Lanau, S.: Exploration of textual document archives using a fuzzy hierarchical clustering algorithm in the GAMBAL system. Inf. Process. Manag. 41, 587–598 (2005)

[13] Tsang, E., Yung, P., Li, J.: EDDIE-Automation, a decision support tool for financial forecasting. Decis. Support Syst. 37, 559–565 (2004)

[14] Wainreb, G., Haspel, N., Wolfson, H.J., Nussinov, R.: A permissive secondary structure-guided superposition tool for clustering of protein fragments toward protein structure prediction via fragment assembly. Bioinformatics 22(11), 1343–1352 (2006)

[15] Yoshida, R., Higuchi, T., Imoto, S., Miyano, S.: ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles. Bioinformatics 22(12), 1538–1539 (2006)

[16] Zheng, J., Svensson, J.T., Madishetty, K., Close, T.J., Jiang, T., Lonardi, S.: OligoSpawn: a software tool for the design of overgo probes from large unigene datasets. BMC Bioinformatics 7, 7 (2006)


An Application of Fuzzy Measure and Integral for Diagnosing Faults in Rotating Machines

Masahiro Tsunoyama1, Hirokazu Jinno2, Masayuki Ogawa2, and Tatsuo Sato2

1 Niigata Institute of Technology, 1719 Fujihashi, Kashiwazaki, Niigata 945-1195, Japan; [email protected]
2 Niigata-Worthington Co., Ltd., 1-32 Shinbashi, Kashiwazaki, Niigata 945-0056, Japan; {hirokazu_jinno, masayuki_ogawa, tatsuo_sato}@niigata-wor.co.jp

Abstract. This paper shows an application of fuzzy measure and fuzzy integral for diagnosing faults in rotating machines. The fuzzy degrees for the fuzzy set of vibration spectra are determined by the membership functions for the spectra. The functions are composed using values in the syndrome matrices given by the knowledge of skilled engineers. The fuzzy measures for the set of spectra and for the groups are obtained from the weights of spectra and the weights of groups based on the definition shown in this paper. Both of the weights are shown in the syndrome matrices and given by skilled engineers.

The possibility of faults can be obtained by fuzzy integrals for faults using fuzzy degrees and fuzzy measures. The paper also evaluates the method using both test and field data.

1 Introduction

Diagnosis of faults in rotating machines is made by applying prior knowledge in conjunction with diagnostic analysis techniques and is regularly performed by skilled engineers [1]. The need for rotating machine diagnosis is rising due to the increased use of these machines in highly reliable systems such as nuclear power plants [2]. However, it is difficult to satisfy the current need because the requisite training is lengthy and very expensive. Several automatic diagnostic systems have been developed to satisfy this need [3], but these systems have several problems, such as the ability to diagnose only certain faults, or occasionally generating the wrong diagnosis. The main source of these problems is that the diagnoses are made by inferences based on Boolean algebra, and the knowledge of skilled engineers is difficult to implement in the system. Several fuzzy diagnostic systems for diagnosing rotating machines have been developed to improve on the ordinary diagnostic systems [4][5]; however, they still have several problems, such as difficulty in isolating faults generating similar vibration spectra.

In this paper we propose a method for rotating machine diagnosis based on fuzzy measures and fuzzy integrals. In this method, the possibility of faults is determined by the fuzzy integral using the fuzzy degrees of vibration spectra and the fuzzy measure of the set of spectra. The fuzzy degrees are obtained by the membership functions, and the fuzzy measures are determined by the weight of each spectrum shown in the syndrome matrix. The syndrome matrix is composed based on the knowledge of skilled engineers.


This paper is organized as follows. The vibration frequency spectra for faults and vibration syndrome matrix are described in Section 2. The fuzzy measure and the fuzzy integral are explained in Section 3. A sample diagnosis of a typical problem and evaluation of the proposed method is provided in Section 4. Our conclusions are presented in Section 5.

2 Faults in a Rotating Machine and Vibration Spectra

2.1 Faults in a Rotating Machine

Several kinds of faults occur in rotating machines, including abnormal vibration, oil or water leaks, and abnormal temperature. The proposed method diagnoses faults that produce abnormal vibration, since a large number of faults in rotating machines are accompanied by vibration. However, the presence of vibration is not necessarily indicative of a failure mode when the vibration power is low. The power level required for machine failure is specified by ISO 2372. The proposed method diagnoses faults generating vibrations larger than this level.

2.2 Vibration Spectra

The vibration spectra vary depending on the type of fault. The vibration spectra method employed analyzes vibration at several locations in a machine using the FFT (Fast Fourier Transform) and diagnoses faults using the spectra [6]. An example of the spectra for a machine with a fault is shown in Figure 1. In this example, an unbalance fault is diagnosed since the amplitudes of the spectra for the fundamental frequency, 50 Hz, and its second harmonic, 100 Hz, are very high compared with the other spectra.

Fig. 1. Vibration frequency spectra (x-axis: Frequency [Hz]; y-axis: Level)


Table 1. Three spectra regions

Group          Frequency range    Use
Velocity       0–1,000 Hz         Used to diagnose faults producing forced vibration, such as unbalance or misalignment.
Envelope       1,000–20,000 Hz    Used to examine periodicities for diagnosing faults producing impulsive vibrations, such as bearing damage.
Acceleration   1,000–20,000 Hz    Used to examine the level of vibration for diagnosing faults producing impulsive vibrations, such as cavitation.

In this method, the input signal is analyzed in three regions: the velocity region, the envelope region, and the acceleration region. These regions are explained in Table 1. These regions are referred to as groups and are denoted by "v", "e", and "a", respectively.

2.3 Positions and Directions for Spectra

Variations in frequency spectra depend not only on the faults within a machine but also on the location where the measurements are made. In this method, vibrations are measured at six positions, and diagnosis is performed using the variation depending on the location.

These positions are divided into two sides (the coupling side of the oiler and its counter side) and three directions (vertical, axial, and horizontal), as shown in Figure 2. The pump in the figure is of the overhung type. The sides and directions are standardized by ISO 2372, and the counter coupling side vertical direction, counter coupling side axial direction, counter coupling side horizontal direction, coupling side vertical direction, coupling side axial direction, and coupling side horizontal direction are denoted by CCV, CCA, CCH, CV, CA, and CH, respectively. The position and direction are specified for a given spectrum using a symbol code of the form $S^{v}_{1,CCV}$ for the spectrum of the fundamental frequency of velocity measured at the counter coupling side vertical position.

Fig. 2. Measurement positions and directions (CCV, CCA, CCH, CV, CA, CH on the pump–oiler–coupling–motor assembly)

The fault diagnosis also utilizes the differences in the amplitude of vibrations measured at the different positions. The difference in amplitude is represented by the ratio defined by the following Equation (1). In the equation, $S^{v}_{i,\xi}$ and $S^{v}_{j,\zeta}$ are the names of the $i$-th and the $j$-th ($1 \le i, j \le n_v$) spectra measured at the positions $\xi$ and $\zeta$, $n_v$ is the number of spectra in group v, and $I(S^{v}_{i,\xi})$ is the amplitude of spectrum $S^{v}_{i,\xi}$.

$$ R_{\xi,\zeta} = \frac{\sqrt{\sum_{i=1}^{n_v} I(S^{v}_{i,\xi})^2}}{\sqrt{\sum_{j=1}^{n_v} I(S^{v}_{j,\zeta})^2}}, \qquad \xi, \zeta \in \{CCV, CCA, CCH, CV, CA, CH\} \qquad (1) $$

The set of the above ratios is referred to as the direction group and is denoted by "d".

2.4 Vibration Syndrome Matrix

The typical values of spectra for each fault are shown in a matrix called the syndrome matrix. An example of the matrix is shown in Table 2.

In the matrix, $S^{\alpha}_{i,\beta}$ ($\alpha \in \{v, e, a, d\}$, $\beta \in \{CCV, CCA, CCH, CV, CA, CH\}$, $1 \le i \le n_{\alpha}$) is the name of the $i$-th spectrum in the group $\alpha$ measured at the position $\beta$. The values in the matrix show the ratio between the amplitude of spectrum $S^{\alpha}_{i,\beta}$ and the total root mean square of all spectra in the group $\alpha$, as defined by the following equation:

$$ r(S^{\alpha}_{i,\beta}) = \frac{I(S^{\alpha}_{i,\beta})}{\sqrt{\sum_{k=1}^{n_{\alpha}} I(S^{\alpha}_{k,\beta})^2}} \qquad (2) $$

where $I(S^{\alpha}_{i,\beta})$ is the amplitude of the spectrum $S^{\alpha}_{i,\beta}$ and $n_{\alpha}$ is the number of spectra in the group $\alpha$.
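As a small illustration, the two ratios of Equations (1) and (2) can be computed as follows; the amplitude values in the example are made up.

```python
# Hedged sketch of the direction ratio (Eq. 1) and the group ratio (Eq. 2).
import numpy as np

def group_ratio(amplitudes: np.ndarray) -> np.ndarray:
    """r(S_i) for every spectrum of one group at one position (Eq. 2)."""
    return amplitudes / np.sqrt(np.sum(amplitudes ** 2))

def direction_ratio(amps_xi: np.ndarray, amps_zeta: np.ndarray) -> float:
    """R between two measurement positions of the velocity group (Eq. 1)."""
    return np.sqrt(np.sum(amps_xi ** 2)) / np.sqrt(np.sum(amps_zeta ** 2))

I_CCH = np.array([0.9, 0.4, 0.3, 0.2, 0.1])   # made-up velocity amplitudes at CCH
I_CH = np.array([0.5, 0.3, 0.2, 0.1, 0.1])    # made-up velocity amplitudes at CH
print(group_ratio(I_CCH).round(3), round(direction_ratio(I_CCH, I_CH), 3))
```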



Table 2. Vibration syndrome matrix for Rotor Contact

Group          Spectrum        Min    Mid    Max    Weight of spectrum   Weight of group
Velocity       S^v_{1,CCH}     -      0.35   0.5    0.10                 0.6
               S^v_{2,CCH}     0.35   0.5    0.65   0.27
               S^v_{3,CCH}     0.18   0.33   0.48   0.27
               S^v_{4,CCH}     0.02   0.17   0.32   0.26
               S^v_{5,CCH}     0.25   0.4    -      0.10
Envelope       S^e_{1,CCH}     0.35   0.5    0.65   0.33                 0.15
               S^e_{2,CCH}     0.15   0.3    0.45   0.33
               S^e_{3,CCH}     0.05   0.2    0.35   0.34
Acceleration   S^a_{1,CCH}     0.7    0.85   -      1.00                 0.05
Direction      R_{CCH/CH}      1      1.4    -      1.00                 0.2

Three values, "Min", "Mid", and "Max", in the matrix show the minimum, middle, and maximum values of the ratio, respectively. These values are used to compose fuzzy membership functions, as explained in the following section. The "Weight of spectrum" and the "Weight of group" are used to determine the fuzzy measures. The dash, "-", represents "don't care" and shows that the value is undefined.

3 Fuzzy Measure and Integral

The amplitude of vibration may vary depending on the place where the machine is installed, the extent of the fault, and so on. Skilled engineers diagnose faults by comparing the amplitudes of spectra based on their knowledge and experience and the theory of rotating machines. In our method, the knowledge of skilled engineers is represented by the fuzzy membership functions and the fuzzy measures. The resultant possibility of faults is obtained by the fuzzy integral.

3.1 Fuzzy Set of Spectra and Their Membership Functions

A fuzzy set is represented by the elements of a support set and their fuzzy degrees [7]. In this paper, the fuzzy set of spectra in each group is represented by the following equation:

$$ \tilde{X}^{\alpha}_{\beta} = \sum_{i=1}^{n_{\alpha}} \mu(S^{\alpha}_{i,\beta}) \, / \, S^{\alpha}_{i,\beta} \qquad (3) $$


Fig. 3. Three types of membership function: (a) minimum rule type, (b) middle rule type, (c) maximum rule type

In the equation, $\mu(S^{\alpha}_{i,\beta})$ is the fuzzy degree of spectrum $S^{\alpha}_{i,\beta}$, and it is determined by the membership function and the value $r(S^{\alpha}_{i,\beta})$.

The membership function is composed of the three values Min, Mid, and Max in the syndrome matrix and is categorized into three types, as shown in Figure 3. If a spectrum has Min and Mid but Max is "don't care", the type of membership function is the minimum rule type, as shown in Figure 3(a). The other two types are composed in the same manner.
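A minimal sketch of the three membership-function types, assuming simple piecewise-linear shapes consistent with Figure 3; the chapter does not give the exact curve equations, so these forms are assumptions.

```python
# Hedged sketch: assumed piecewise-linear membership functions for the three
# rule types of Figure 3 (minimum, middle, maximum).
def minimum_rule(r, lo, mid):
    """Rises from 0 at Min (lo) to 1 at Mid and stays 1 (Max is "don't care")."""
    return 1.0 if r >= mid else max(0.0, (r - lo) / (mid - lo))

def maximum_rule(r, mid, hi):
    """Stays 1 up to Mid and falls to 0 at Max (hi); Min is "don't care"."""
    return 1.0 if r <= mid else max(0.0, (hi - r) / (hi - mid))

def middle_rule(r, lo, mid, hi):
    """Triangular shape over (Min, Mid, Max)."""
    if r <= lo or r >= hi:
        return 0.0
    return (r - lo) / (mid - lo) if r <= mid else (hi - r) / (hi - mid)

# Fuzzy degree of spectrum S^v_{2,CCH} of Table 2 (Min=0.35, Mid=0.5, Max=0.65)
print(middle_rule(0.55, 0.35, 0.5, 0.65))
```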

3.2 Fuzzy Measure for Spectra

A fuzzy measure, g, is a set function having the following property [8], where $X^{\alpha}_{\beta} = \{S^{\alpha}_{1,\beta}, S^{\alpha}_{2,\beta}, \ldots, S^{\alpha}_{n_{\alpha},\beta}\}$:

$$ g: 2^{X^{\alpha}_{\beta}} \rightarrow [0, 1] \qquad (4) $$

Property 1. Fuzzy measures satisfy the following equations:

$$ g(\emptyset) = 0 \qquad (5) $$

$$ g(X^{\alpha}_{\beta}) = 1 \qquad (6) $$

$$ A^{\alpha}_{i,\beta} \subset A^{\alpha}_{j,\beta} \;\Rightarrow\; g(A^{\alpha}_{i,\beta}) \le g(A^{\alpha}_{j,\beta}), \qquad A^{\alpha}_{i,\beta}, A^{\alpha}_{j,\beta} \subset X^{\alpha}_{\beta} \qquad (7) $$

$$ \lim_{i \to \infty} g(A^{\alpha}_{i,\beta}) = g\Big(\lim_{i \to \infty} A^{\alpha}_{i,\beta}\Big) \qquad (8) $$

The property shown by Equation (8) can always be satisfied for an infinite set.

Several fuzzy measures have been proposed, such as the λ-fuzzy measure [9], but they are difficult to calculate since they require considerable computation time. We propose a fuzzy measure defined by the following definition.


Definition 1. The fuzzy measure g(A^α_{i,β}) for a set of spectra A^α_{i,β} is given by the following equation.

\[ g(A^{\alpha}_{i,\beta}) = \Big( \sum_{S^{\alpha}_{j,\beta} \in A^{\alpha}_{i,\beta}} w(S^{\alpha}_{j,\beta}) \Big)^{\gamma}, \qquad A^{\alpha}_{i,\beta} \subset X^{\alpha}_{\beta}, \quad \alpha \in \{\upsilon, e, a, d\}, \quad 0 \le \gamma \le \infty \]  (9)

where w(S^α_{j,β}) is the weight of the spectrum S^α_{j,β}.

In this paper, the weights of spectra are determined by skilled engineers based on their knowledge and are shown in the syndrome matrix. They have the following property.

Property 2. The following equations hold for the weights of spectra.

\[ \sum_{j=1}^{n_{\alpha}} w(S^{\alpha}_{j,\beta}) = 1 \]  (10)

\[ 0 \le w(S^{\alpha}_{j,\beta}) \le 1.0, \qquad w(S^{\alpha}_{j,\beta}) \in R \]

From the above property, it can be shown that the fuzzy measure defined by Definition 1 satisfies Property 1.

The following theorem holds for the measure.

Theorem 1. The fuzzy measure defined by Definition 1 has the superadditive, additive, and subadditive property when γ < 1, γ = 1, and γ > 1, respectively.

Proof. When γ = 1 the measure is clearly additive.

If γ < 1, then

\[ \big( w(S^{\alpha}_{a,\beta}) + \cdots + w(S^{\alpha}_{b,\beta}) \big)^{\gamma - 1} - 1 > 0 \]  (11)

since w(S^α_{a,β}) + ... + w(S^α_{b,β}) ≤ 1 and γ − 1 < 0. Thus the measure is superadditive.


If γ > 1, the following equation holds:

\[ \big( w(S^{\alpha}_{a,\beta}) + \cdots + w(S^{\alpha}_{b,\beta}) \big)^{\gamma} - \big( w(S^{\alpha}_{a,\beta}) + \cdots + w(S^{\alpha}_{b,\beta}) \big) = \big( w(S^{\alpha}_{a,\beta}) + \cdots + w(S^{\alpha}_{b,\beta}) \big) \cdot \Big( \big( w(S^{\alpha}_{a,\beta}) + \cdots + w(S^{\alpha}_{b,\beta}) \big)^{\gamma - 1} - 1 \Big) < 0, \]  (12)

since w(S^α_{a,β}) + ... + w(S^α_{b,β}) ≤ 1 and γ > 1. Thus the measure is subadditive.

Q.E.D.

3.3 Fuzzy Integral and the Possibility of Faults

Several fuzzy integrals have been proposed, such as Sugeno's and Choquet's [9]. In our method, a fuzzy integral based on the Choquet integral is used. First, the fuzzy integral for every group is formed according to the following definition.

Definition 2. The fuzzy integral for a group α is given by the following equation, based on the Choquet integral.

\[ h(\alpha) = (c)\!\int \mu \, dg = \sum_{j=1}^{n_{\alpha}} \big( \mu(S^{\alpha}_{j,\beta}) - \mu(S^{\alpha}_{j-1,\beta}) \big) \cdot g(A^{\alpha}_{j,\beta}) \]  (13)

where
α ∈ {υ, e, a, d},
X^α_β = {S^α_{1,β}, S^α_{2,β}, ..., S^α_{n_α,β}},
μ : X^α_β → [0,1],
g : 2^{X^α_β} → [0,1],
A^α_{a,β} ⊃ ... ⊃ A^α_{j,β} ⊃ ... ⊃ A^α_{b,β},
A^α_{i,β} = {S^α_{a,β}, ..., S^α_{j,β}, ..., S^α_{b,β}} ⊂ X^α_β,
0 ≤ μ(S^α_{a,β}) ≤ ... ≤ μ(S^α_{j,β}) ≤ ... ≤ μ(S^α_{b,β}) ≤ 1.

The possibility of a fault is then determined by the fuzzy integral defined below, using the above results of the fuzzy integral for every group and the weights of the groups.

Definition 6. The possibility of a fault f_k is given by the following fuzzy integral, using the results of the fuzzy integral for every group and the fuzzy measures for the groups.


\[ P(f_k) = (c)\!\int h \, dg = \sum_{l \in X} \big( h(l) - h(l-1) \big) \cdot g(A_l) \]  (14)

where
l ∈ X, X = {υ, e, a, d}, A_l ⊂ X,
0 ≤ h(l−1) ≤ h(l) ≤ 1,
A_{l−1} ⊃ A_l,
g(A_l) = ( Σ_{α ∈ A_l} w(α) )^γ.

In the definition, w(α) is the weight of group α and satisfies the following property.

Property 3. The following equations hold for the weights of groups.

\[ \sum_{\alpha \in \{\upsilon, e, a, d\}} w(\alpha) = 1 \]  (15)

\[ 0 \le w(\alpha) \le 1.0, \qquad w(\alpha) \in R \]

The procedure for obtaining the possibility of a fault is as follows.

Procedure 1
(1) Calculate the ratios for the spectra.
(2) Determine the fuzzy degrees for the spectra in each group.
(3) Form the fuzzy integral for every group.
(4) Calculate the possibility of the fault by forming the fuzzy integral over the groups.
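A minimal sketch of steps (3) and (4) of Procedure 1, assuming the fuzzy degrees of step (2) are already available. The fuzzy measure follows Definition 1 and the integrals follow Definitions 2 and 6; the hard-coded degrees and weights are simply those of the worked example in Section 4.1, and the class and method names are illustrative, not part of the original method.

```java
import java.util.Arrays;

// Sketch of steps (3)-(4): the proposed fuzzy measure g(A) = (sum of weights)^gamma and the
// Choquet-type integral of Definitions 2 and 6. Degrees and weights below are taken from the
// worked example (Tables 2 and 4); in the method they come from the membership functions
// and the syndrome matrix.
public class FuzzyDiagnosis {

    // Fuzzy measure of Definition 1: g(A) = (sum_{S in A} w(S))^gamma.
    static double fuzzyMeasure(double[] weights, boolean[] inSet, double gamma) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) if (inSet[i]) sum += weights[i];
        return Math.pow(sum, gamma);
    }

    // Choquet integral of Definition 2: sort the degrees in ascending order and accumulate
    // (mu_j - mu_{j-1}) * g({elements whose degree is >= mu_j}).
    static double choquet(double[] degrees, double[] weights, double gamma) {
        Integer[] order = new Integer[degrees.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(degrees[a], degrees[b]));

        boolean[] inSet = new boolean[degrees.length];
        Arrays.fill(inSet, true);               // the first set contains every spectrum of the group
        double result = 0.0, previous = 0.0;
        for (int j = 0; j < order.length; j++) {
            int idx = order[j];
            result += (degrees[idx] - previous) * fuzzyMeasure(weights, inSet, gamma);
            previous = degrees[idx];
            inSet[idx] = false;                 // the next set drops the element just processed
        }
        return result;
    }

    public static void main(String[] args) {
        double gamma = 1.0;  // as assumed in the worked example of Section 4.1

        // Envelope group: degrees 0.556, 0.556, 1.0 and weights 0.33, 0.33, 0.34.
        double hE = choquet(new double[]{0.556, 0.556, 1.0}, new double[]{0.33, 0.33, 0.34}, gamma);

        // Velocity group: degrees and weights from Tables 2 and 4.
        double hV = choquet(new double[]{0.977, 0.704, 0.681, 0.385, 1.0},
                            new double[]{0.1, 0.27, 0.27, 0.26, 0.1}, gamma);

        double hA = 1.0, hD = 0.989;            // single-spectrum groups of the example

        // Step (4): integrate the group results with the group weights (Definition 6).
        double p = choquet(new double[]{hV, hE, hA, hD}, new double[]{0.6, 0.15, 0.05, 0.2}, gamma);

        // Roughly 0.707, 0.672 and 0.757 (the text reports 0.758 after rounding).
        System.out.printf("h(e)=%.3f h(v)=%.3f P=%.3f%n", hE, hV, p);
    }
}
```

Changing gamma away from 1.0 gives the super- or subadditive behaviour discussed in Theorem 1 while leaving the rest of the procedure unchanged.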

4 Example of Diagnosis and Evaluation of the Method

4.1 Example of Diagnosis

The following example shows the diagnosis process for a fault. In the example, the fault "Rotor Contact (static)" is diagnosed.

(1) Calculate the ratios for the spectra

First, calculate the ratios of the spectra from the amplitudes of the input spectra in each group. The ratios of the spectra in each group and their weights are shown in Table 3.

(2) Determine the fuzzy degrees for the spectra in each group

Then determine the fuzzy degrees for the spectra using the above ratios and the membership functions. The fuzzy degrees are shown in Table 4.


Table 3. Input spectra with their ratios and weights

Group         Spectrum       r(S^α_{i,β})   w(S^α_{i,β})
Velocity      S^v_{1,CCH}    0.09           0.1
              S^v_{2,CCH}    0.566          0.27
              S^v_{3,CCH}    0.222          0.27
              S^v_{4,CCH}    0.222          0.26
              S^v_{5,CCH}    0.163          0.1
Envelope      S^e_{1,CCH}    0.533          0.33
              S^e_{2,CCH}    0.267          0.33
              S^e_{3,CCH}    0.26           0.34
Acceleration  S^a_{1,CCH}    0.727          1
Direction     R_{CCH/CHC}    1              1

Table 4. Fuzzy degrees for spectra

Group         Spectrum       μ(S^α_{i,β})
Velocity      S^v_{1,CCH}    0.977
              S^v_{2,CCH}    0.704
              S^v_{3,CCH}    0.681
              S^v_{4,CCH}    0.385
              S^v_{5,CCH}    1.0
Envelope      S^e_{1,CCH}    0.556
              S^e_{2,CCH}    0.556
              S^e_{3,CCH}    1.0
Acceleration  S^a_{1,CCH}    1.0
Direction     R_{CCH/CHC}    0.989

(3) Form the fuzzy integral for every group

Next, form the fuzzy integral for every group. In this example γ is assumed to be 1.0 so that the procedure can be traced easily.


The fuzzy integrals for the groups are obtained as follows (writing S^e_i for S^e_{i,CCH} and S^v_i for S^v_{i,CCH}):

h(e) = μ(S^e_2)·g({S^e_1, S^e_2, S^e_3}) + (μ(S^e_1) − μ(S^e_2))·g({S^e_1, S^e_3}) + (μ(S^e_3) − μ(S^e_1))·g({S^e_3})
     = 0.556 + 0 + 0.151 = 0.707

h(υ) = μ(S^v_4)·g({S^v_1, S^v_2, S^v_3, S^v_4, S^v_5}) + (μ(S^v_3) − μ(S^v_4))·g({S^v_1, S^v_2, S^v_3, S^v_5}) + (μ(S^v_2) − μ(S^v_3))·g({S^v_1, S^v_2, S^v_5}) + (μ(S^v_1) − μ(S^v_2))·g({S^v_1, S^v_5}) + (μ(S^v_5) − μ(S^v_1))·g({S^v_5})
     = 0.385 + 0.219 + 0.011 + 0.055 + 0.002 = 0.672

h(a) = μ(S^a_{1,CCH})·g({S^a_{1,CCH}}) = 1.0

h(d) = μ(R^d_1)·g({R^d_1}) = 0.989 × 1.0 = 0.989

(4) Calculate the possibility of the fault by forming the fuzzy integral

Finally, determine the possibility of the fault. In this example, the weights of the groups are assumed to be w(υ) = 0.6, w(e) = 0.15, w(a) = 0.05, and w(d) = 0.2.

P(RotorContact(static)) = h(υ)·g({υ, e, a, d}) + (h(e) − h(υ))·g({e, a, d}) + (h(d) − h(e))·g({d, a}) + (h(a) − h(d))·g({a})
                        = 0.672 × 1.0 + (0.707 − 0.672) × 0.4 + (0.989 − 0.707) × 0.25 + (1.0 − 0.989) × 0.05 = 0.758

The possibility of the fault Rotor Contact (static) is thus obtained as 0.758 by this procedure.

4.2 Evaluation of the Method

In the evaluation, we used test data and field data. The test data were generated by skilled engineers to evaluate the method, and the field data are measured amplitudes of spectra from rotating machines in a plant. The evaluation was made by comparing the possibility of faults obtained by this method with the correct fault diagnosed by skilled engineers.

(1) Results of diagnosis of test data

The following table shows the result of the evaluation for test data.



Table 5. Result of evaluation for test data

Data         Target fault                  Result of diagnosis    Possibility [%]
Test data 1  Unbalance                     Unbalance              100.0
                                           Wear of basis          78.3
Test data 2  Misalignment                  Misalignment           99.0
                                           Wear of basis          88.3
Test data 3  Unbalance and Misalignment    Wear of basis          80.4
                                           Misalignment           78.8
                                           Unbalance              54.1

In the above table, the "Target fault" column shows the correct fault for each test data set, as determined by skilled engineers. The "Result of diagnosis" lists the top two or three faults with the highest possibilities obtained by this method.

The table indicates that the target faults for Test data 1 and 2 have the highest possibility, while the target faults for Test data 3 have only the second and third highest possibilities. Moreover, the fault Wear of basis always has a high possibility. From this result, we can see that the membership functions and the weights for spectra and groups must be adjusted to obtain proper fuzzy degrees and fuzzy measures. We can also see from Test data 3 that the exponent γ of the fuzzy measure should be improved for multiple faults occurring in a machine.

(2) Results of diagnosis of field data

The following table shows the result of the evaluation for field data.

Table 6. Result of evaluation for field data

Data          Correct fault    Result of diagnosis    Possibility [%]
Field data 1  Wear of basis    Wear of basis          71.8
Field data 2  Misalignment     Misalignment           71.5
Field data 3  Wear of basis    Wear of basis          92.4
Field data 4  Misalignment     Misalignment           91.6
                               Wear of basis          90.3

From the table, we can see that the correct faults for all data have the highest possibilities, but Wear of basis also has a high possibility in Field data 4.

5 Conclusion

This paper shows a method for diagnosing faults in rotating machines using fuzzy measures and integrals, and proposes a fuzzy measure that can be determined with little computation time. The knowledge of skilled engineers is represented by the


syndrome matrix. The matrix includes the ratio of the amplitude of spectra generated by faults and the weights of spectra which indicate how much the spectra affect the diagnosis by skilled engineers.

The membership functions for the fuzzy sets of spectra are composed using the three values Min, Mid, and Max in the matrix, and the fuzzy measure is determined using the weights of spectra and the weights of groups.

The results of the evaluation show that almost all single faults obtain the highest possibilities, but multiple faults do not. In future work, we are going to adjust the membership functions and the weights for spectra and groups. We are also going to improve the exponent γ of the fuzzy measure to obtain high possibilities for multiple faults.

References

[1] Bloch, H.P.: Practical Machinery Management for Process Plants: Improving Machinery Reliability, 3rd edn., vol. 1. Gulf Publishing Company (1998)
[2] Chen, P., Feng, F., Toyoda, T.: Sequential Method for Plant Machinery by Statistical Tests and Possibility Theory. REAJ 24(4), 322–331 (2002)
[3] Durkin, J.: Expert Systems Design and Development. Prentice Hall, Englewood Cliffs (1994)
[4] Marinai, L., Singh, R.: A Fuzzy Logic Approach to Gas Path Diagnostics in Aero-engines. In: Computational Intelligence in Fault Diagnosis. Springer, Heidelberg (2006)
[5] Okuno, T., Tsunoyama, M., Ogawa, M., Sato, T.: A Fuzzy Expert System for Facility Diagnosis of Rotating Machines. In: Proc. of the 2nd IASTED International Conference on Neural Networks and Computational Intelligence, pp. 297–301 (2004)
[6] Goldman, S.: Vibration Spectrum Analysis. Industrial Press Inc. (1999)
[7] Zadeh, L.A.: Fuzzy Sets and Applications. Wiley-Interscience, Chichester (1987)
[8] Wang, Z., Klir, G.J.: Fuzzy Measure Theory. Plenum Pub. (1992)
[9] Grabisch, M., Murofushi, T., Sugeno, M.: Fuzzy Measures and Integrals. Physica-Verlag (2000)
[10] Yager, R.R.: Some relationships between possibility, truth and certainty. Fuzzy Sets and Systems 11, 151–156 (1983)


A Multi-agent Architecture for Sensors and Actuators’ Fault Detection and Isolation in Case of Uncertain Parameter Systems

Salma Bouslama Bouabdallah1, Ramla Saddam2, and Moncef Tagina2

1 National Engineering School of Tunis Campus universitaire, BP 37, 1002 Tunis, Tunisia 2 National School of Computer Sciences Campus Universitaire, 2010 Manouba, Tunisia

Abstract. In this work, a multi-agent system for fault detection and isolation is proposed for faults affecting sensors and actuators of bond graph modelled uncertain parameter systems. The diagnostic system is composed of different agents cooperating and communicating through message exchange, each agent being specialized in a specific task of the diagnosis process. Detection and isolation routines are based on residual processing. Residuals are obtained from the bond graph model. To illustrate our approach, a hydraulic system with 5% parameter uncertainty is simulated under the Matlab/Simulink environment. The diagnostic system is implemented in the JADE environment and has shown good detection and isolation results.

1 Introduction

Due to the increasing size and complexity of modern processes, their safety and efficiency have become very important. The aim of our work is to keep the process at a good level of safety. Presently, fault detection and isolation is an increasingly active research domain. A process is in a defective state if the causal relations between its known variables have changed [3]. A widespread solution for fault detection and isolation is to use analytical model-based redundancy [9].

The choice of a modelling formalism is an important step for fault detection and isolation because the quality of redundancy relations depends on the quality of the model.

In the frame of our work we chose bond graph modelling. In fact, the bond graph is a powerful tool for modelling dynamical systems. Thanks to the information it provides, the bond graph can be used directly to generate analytical redundancy relations between known variables [15].

In [14], formal specifications of a hybrid Bond graph modelling paradigm and a methodology for monitoring, prediction and diagnosis of abrupt faults in complex dynamic systems are developed.

Furthermore, the progress of control theory has allowed increasing product quality and reducing production cost. This progress makes it possible to design more complex processes, which in turn need more advanced methods for monitoring and supervision.


Multi-agent systems are best suited for applications that are modular, decentralized, changeable, ill-structured and complex [17]. Besides, a multi-agent architecture is flexible; it can be adapted to many applications and can be distributed over many machines [11]. In a multi-agent system, there are several types of interaction between agents: cooperation (agents work together towards a common goal), coordination and negotiation (different parties search for a common agreement) [2].

A multi-agent architecture in which different agents cooperate and communicate improves the diagnosis of complex processes. In fact, the multi-agent approach reduces complexity, since complicated tasks are decomposed into sub-problems.

Besides, due to the fact that different types of systems as well as different types of faults require different types of algorithms to obtain the best results, it is reasonable to expect that a significant improvement of FDI system performance can only be achieved when FDI problems are solved in a framework of integrated use of different FDI technologies [16]. Multi-agent systems are appropriate for representing problems which can be solved with different methods, and this redundancy makes them more robust [2].

Recent research on agent-based approaches for diagnostic purposes can currently be seen in Europe, Japan and the United States [17].

Multi-agent architectures have been used in many diagnosis projects. The project MAGIC is based on a multi-level multi-agent architecture which integrates all levels necessary for a comprehensive diagnosis into a diagnostic system [16]. Besides, the MAGIC framework offers a number of diagnostic algorithms which can run in parallel to detect a certain component fault in a given sub-system, which helps to obtain the best results.

Another multi-agent system for monitoring and diagnosis was also developed within the framework of the EU Esprit Program: "DIAMOND: DIstributed Architecture for MONitoring and Diagnosis", in which the basic technologies applied were CORBA, FIPA-ACL, JAVA and XML. In DIAMOND, the interaction of the software agents forms a modular multi-agent architecture with a heterogeneous and hierarchical distribution [18].

In [8], artificial agents are implemented for supervision in the area of physical hydraulic systems, each modelling a specific cognitive style, based on qualitative reasoning, categorization and case-based reasoning.

For sharing design information in a distributed engineering team, [12] proposes a solution in which specialists, and the programs they use, are modelled as agents who share a common communication language and ontology. In this work, elements of the state, input and output vectors are described in terms of bond graphs.

We have proposed in [4] a conception of a multi-agent architecture for fault detection and isolation composed of a blackboard and a set of agents; communication between agents is established through the blackboard, which is divided into different abstraction levels, and via message exchange.

This paper proposes a multi-agent system for fault detection and isolation of bond graph modelled uncertain parameter systems in case of faults affecting sensors and actuators. In this work, the proposed multi-agent system is based on message exchange between agents.

In section 2, the different agents as well as their structure are presented. Communication between agents is described in section 3. A simulation example is presented in section 4. The multi-agent system implementation is described in section 5 and results are presented in section 6. Finally, conclusions and future work are given in section 7.

2 Agents Description

The proposed multi-agent system for fault detection and isolation is composed of a set of autonomous agents, which constitute software components responsible for specific tasks. The interaction of the software agents forms a modular multi-agent architecture with a heterogeneous and hierarchical distribution. The agents cooperate to ensure the diagnosis process and belong to one of the following types:

1°) Acquisition agents: their task consists in accessing data from the variables to be supervised (the known variables).
2°) Detection agents: their task consists in detecting faults by residual processing.
3°) Isolation agents: they can isolate a detected fault at the level of a known variable using different methods.
4°) User interface agent: its main task consists in offering the user a comprehensible interface.

2.1 Data Acquisition Agent

The interface between the diagnosis system and the variables to be supervised is performed by the data acquisition agents. An acquisition agent is associated with each known variable; the acquisition agent can thus access the measurements of the corresponding known variable.

2.2 Detection Agents

We have shown in [5] that binary logic allows good detection and isolation results to be obtained if the threshold is correctly chosen and if the values of the parametric uncertainty are known. However, when the uncertainty is poorly estimated, detection with binary logic is not suitable, whereas the proposed fuzzy approach still allows good fault detection and isolation results. So we propose to implement both binary logic detection agents and fuzzy logic detection agents in order to obtain the best detection results.

The detection agents take the measurements of the variables to be supervised from the data acquisition agents.

Each binary logic detection agent is associated with a residual; it contains a routine for calculating the residual and a detection procedure based on residual processing. The detection procedure is based on binary logic and consists of a test of the residual amplitude against a threshold to decide whether the residual is in a normal state or a faulty one. In contrast, the fuzzy logic detection agent calculates and processes all residuals with a fuzzy routine to decide whether the system is in normal functioning mode or in faulty mode. The detection agent result is called the fault-index; it is equal to 0 in a fault-free context and equal to 1 in the occurrence of a fault.


2.3 Isolation Agents

A set of different isolation agents is proposed. Each agent is specialized in an isolation method. The proposed isolation methods use, respectively, binary logic, fuzzy logic, a Probabilistic Neural Network (PNN) and a Multi-Layer Perceptron (MLP). The binary logic-based isolation agent uses signatures to isolate the fault: it compares the fault signatures to the coherence binary vector obtained by combining the binary logic detection agents' results (fault-indexes). The fuzzy logic-based agent establishes a fuzzy processing of all residuals in order to isolate the fault. The PNN-based and MLP-based isolation agents calculate and process the residuals, which constitute the inputs of the PNN and the MLP. The PNN has one output; it is trained to present 1 in normal functioning mode, 2 in case of a fault affecting the first known variable… The MLP is trained to present, at each output, 0 if the associated known variable is in a normal state and 1 if it is faulty. Isolation results are communicated to the user interface agent to be displayed for the user.

A detailed description of the detection and isolation methods is beyond the scope of this paper.

2.4 User Interface Agent

The user interface agent offers the user a convivial interface in which the diagnosis process can be initialized, the user can select one or more diagnosis methods, and the results are displayed.

2.5 Agents Structure

Agents' structure is inspired from MAGIC, in which diagnostic agents are composed of the following parts:

1°) The Brain or diagnostic specific layer: responsible for symptom creation using diagnostic algorithms based on system models.
2°) Communication layer: responsible for all basic communication and administration tasks.
3°) Generic diagnostic layer: responsible for basic diagnostic features, such as acquiring data or models from a database.

In the proposed detection agents, the brain is responsible for residual calculation and processing using a specified detection method such as binary logic or fuzzy logic. The generic diagnostic layer is responsible for acquiring the known variables' values (see figure 1).

This architecture is extended to the isolation agents, the user interface agent and the acquisition agents, including two main parts:

1°) The communication interface, including the communication layer and the generic diagnostic layer.
2°) The Brain, which contains all agent-specific tasks such as data acquisition for acquisition agents, isolation procedures for isolation agents…


Fig. 1. Detection agent structure (communication interface: communication layer and generic diagnosis layer; brain/diagnostic specific layer: residual calculation and residual processing routine)

3 Communication

Detection agents, isolation agents, acquisition agents and the user interface agent interact and co-operate with each other in order to perform the common task of diagnosing the system. They communicate via message exchange in the following way:

- The user interface agent is linked to the user. If the user initializes the diagnostic process, the user interface agent informs the acquisition agents to start data acquisition, and the detection or isolation agents concerned with the selected diagnosis methods to start their detection or isolation procedures. Besides, the diagnosis results obtained at the level of the detection or isolation agents are communicated to the user interface agent to be displayed for the user.
- The acquisition results obtained at the level of the acquisition agents are communicated to the detection agents or isolation agents to start the diagnostic procedure.
- In case of binary logic, the detection agents communicate among themselves and with the isolation agent. In fact, if the fault-index at the level of a detection agent is equal to 1, it informs the isolation agent and sends its fault-index to it. Besides, it informs the other detection agents, so that they communicate their fault-indexes to the isolation agent in order to isolate the fault, and the isolation procedure starts (see figure 2).

- In case of the fuzzy logic method, there is one detection agent, which communicates with the isolation agent. In fact, if its fault-index is equal to 1, it informs the isolation agent and sends the residual values to it, and the isolation procedure starts (see figure 3).
- In case of the PNN and MLP diagnostic methods, there are no detection agents; the isolation agent can decide whether the system is in normal or faulty mode and isolate the fault when the system is faulty (see section 2.3). The isolation agent takes the measurements of the known variables from the acquisition agents, calculates and processes the residuals using the PNN or the MLP, and finally the results are communicated to the user interface agent (see figure 4).

Fig. 2. Agents communication in case of binary logic method

Fig. 3. Agents communication in case of fuzzy logic method


Fig. 4. Agents communication in case of PNN and MLP methods

4 Simulation Example

Our approach is illustrated with a simulation example: a hydraulic system (see figure 5) composed of three tanks communicating through valves functioning in binary mode. It is assumed that the valves are in the 'on' mode.

Fig. 5. Hydraulic system model

Fig. 6. Bond graph model


A procedure described in [1]-[7] enables us to elaborate its bond graph model, shown in figure 6, with C1 = C2 = C3 = 10 (SI) and R1 = R2 = R3 = R12 = R23 = 1 (SI). The system has two inputs, MSf1 and MSf2, which correspond to the volume flows applied to the system. As shown in figure 6, 5 sensors are placed in the bond graph model:

- The effort sensors De1, De2 and De3 measure the pressure at C1, C2 and C3, respectively.
- The flow sensors Df1 and Df2 measure the volume flow at valves 1 and 2, respectively.

The variables to be supervised (the known variables) are De1, De2, De3, Df1, Df2, MSf1 and MSf2.

Analytical redundancy relations consist in finding relations between the known variables of the system. We have elaborated the following analytical redundancy relations (ARRs) from the bond graph model by following the causal paths. The ARRs are written in integral form; the procedure for generating the ARRs is described in [15]. ARR1 is the counterpart of ARR3 for the first tank and relates MSf1, Df1 and De1 (cf. Table 1).

\[ \mathrm{ARR1}: \ \frac{1}{C_1 s} MSf_1 - \frac{1}{C_1 s} Df_1 - \Big( \frac{1}{R_1 C_1 s} + 1 \Big) De_1 = 0 \]  (1)

\[ \mathrm{ARR2}: \ \frac{1}{C_2 s} Df_1 - \Big( \frac{1}{R_2 C_2 s} + 1 \Big) De_2 + \frac{1}{C_2 s} Df_2 = 0 \]  (2)

\[ \mathrm{ARR3}: \ \frac{1}{C_3 s} MSf_2 - \frac{1}{C_3 s} Df_2 - \Big( \frac{1}{R_3 C_3 s} + 1 \Big) De_3 = 0 \]  (3)

\[ \mathrm{ARR4}: \ \frac{De_3 - De_2}{R_{23}} - Df_2 = 0 \]  (4)

\[ \mathrm{ARR5}: \ \frac{De_1 - De_2}{R_{12}} - Df_1 = 0 \]  (5)

The signature table of the known variables (Df1, Df2, De1, De2, De3, MSf1 and MSf2) is given by Table 1, in which a term equal to 1 (respectively 0) indicates the presence (respectively the absence) of the considered variable in the associated residual.


Table 1. Signature Table

     Df1  Df2  De1  De2  De3  MSf1  MSf2
r1   1    0    1    0    0    1     0
r2   1    1    0    1    0    0     0
r3   0    1    0    0    1    0     1
r4   0    1    0    1    1    0     0
r5   1    0    1    1    0    0     0
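To make the binary-logic detection and isolation step of sections 2.2 and 2.3 concrete, the sketch below evaluates the two static relations (ARR4 and ARR5 as reconstructed above), thresholds them into fault-indexes, and matches the resulting coherence vector against the columns of Table 1. The threshold, the measurement values and the assumption that r3 also fires are illustrative; the dynamic relations ARR1 to ARR3 would additionally require their integral terms.

```java
import java.util.Arrays;

// Sketch of binary-logic detection/isolation for the static relations ARR4 and ARR5 and the
// signature table (Table 1). Threshold and measurements are invented for the illustration.
public class BinaryIsolation {

    static final double R12 = 1.0, R23 = 1.0;   // resistances of the example (SI)
    static final String[] VARS = {"Df1", "Df2", "De1", "De2", "De3", "MSf1", "MSf2"};

    // Columns of Table 1: one signature over (r1..r5) per known variable.
    static final int[][] SIGNATURE = {
        {1, 1, 0, 0, 1},   // Df1
        {0, 1, 1, 1, 0},   // Df2
        {1, 0, 0, 0, 1},   // De1
        {0, 1, 0, 1, 1},   // De2
        {0, 0, 1, 1, 0},   // De3
        {1, 0, 0, 0, 0},   // MSf1
        {0, 0, 1, 0, 0},   // MSf2
    };

    // Fault-index of the binary-logic detection: 1 if |residual| exceeds the threshold.
    static int faultIndex(double residual, double threshold) {
        return Math.abs(residual) > threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        // Illustrative scenario: the De3 sensor reads 2.0 while the true pressure is 1.5.
        double de1 = 2.0, de2 = 1.0, de3 = 2.0, df1 = 1.0, df2 = 0.5;
        double threshold = 0.1;                  // hypothetical detection threshold

        double r4 = (de3 - de2) / R23 - df2;     // ARR4 as reconstructed above -> 0.5
        double r5 = (de1 - de2) / R12 - df1;     // ARR5 as reconstructed above -> 0.0

        // Coherence vector over r1..r5. The dynamic relations r1-r3 need the integral terms of
        // ARR1-ARR3; here we simply assume that r3 also fires because De3 enters ARR3.
        int[] coherence = {0, 0, 1, faultIndex(r4, threshold), faultIndex(r5, threshold)};

        for (int v = 0; v < VARS.length; v++) {
            if (Arrays.equals(SIGNATURE[v], coherence)) {
                System.out.println("Fault isolated on " + VARS[v]);
            }
        }
        System.out.println("coherence = " + Arrays.toString(coherence));
    }
}
```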

The bond graph model is converted into a synoptic diagram which is simulated under the Simulink/Matlab environment.

In normal functioning mode, the residuals have to be close to zero. We have verified that all residuals are null in this case.

Application choices:

* Uncertainty: We consider uncertainty as white noise added to the values in the synoptic diagram (5% of each value). We have selected an uncertainty of 5%, on the one hand because many components are given with this uncertainty, and on the other hand because identification is possible in this case. For larger uncertainties, we cannot distinguish between the fault and the uncertainty. Uncertainty is added to C1, C2, C3, R12 and R23.

* Fault: We consider a fault as a step added to a known variable; the fault amplitude is equal to 15% of the variable's amplitude.

5 Multi-agent System Implementation

The proposed multi-agent system is implemented in the JADE (Java Agent DEvelopment framework) environment [13]. A JADE agent is a Java class extending the basic Agent class. The agent thus inherits a fundamental behaviour consisting of platform tasks, for example registration, configuration, remote management... A JADE agent also inherits a set of standardized interaction protocols used in implementing agent-specific tasks such as message sending…

The communication language between agents is ACL (Agent Communication Language), developed by FIPA (Foundation for Intelligent Physical Agents) [10]. Agents interact with each other in textual form, following the FIPA-ACL specifications.

Whenever an agent wants to send a message to another agent, an ACLMessage object is created, the ACLMessage fields are specified and the send() method is used. When an agent wants to receive a message, the receive() method is used.

The ACLMessage contains the following fields:

- Sender: the agent that sends the message.
- Receiver(s): the agent or agents who receive the message.
- Reply-to: the agent the recipient should reply to.
- Performative: an identifier indicating the purpose of the message.
- Ontology: an identifier indicating the ontology of the content.
- Encoding: indicates how the content of the message is represented.
- Protocol: the interaction protocol this message is part of.
- ConversationId: the unique identifier used to indicate which conversation this message is part of.
- Content: the body of the message.

The ACLMessage fields can be obtained or modified by the get() and set() methods, respectively.

In this work, the protocols used for agent conversation are the SEND, the RECEIVE and the BLOCKINGRECEIVE protocols.
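A minimal JADE sketch of the message exchange just described: a detection agent packages its fault-index in an ACLMessage and sends it to the isolation agent, then checks for incoming messages. The agent name "isolation-agent", the ontology label and the content string are placeholders of this sketch, not the identifiers used in the authors' implementation.

```java
import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

// Illustrative detection agent following the SEND/RECEIVE pattern described above.
public class DetectionAgentSketch extends Agent {

    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            public void action() {
                // Residual calculation and thresholding omitted; assume a fault was found.
                boolean faultDetected = true;
                if (faultDetected) {
                    ACLMessage msg = new ACLMessage(ACLMessage.INFORM);
                    msg.addReceiver(new AID("isolation-agent", AID.ISLOCALNAME));
                    msg.setOntology("diagnosis");            // placeholder ontology label
                    msg.setConversationId("fault-detection");
                    msg.setContent("fault-index=1");         // placeholder content format
                    myAgent.send(msg);
                }

                // Non-blocking receive; block() suspends the behaviour until a message arrives.
                ACLMessage reply = myAgent.receive();
                if (reply != null) {
                    System.out.println(myAgent.getLocalName() + " received: " + reply.getContent());
                } else {
                    block();
                }
            }
        });
    }
}
```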

In order to exchange information between the agents, a common ontology has to be used. The ontology includes the representation of residuals, detection agent results (fault-indexes) and isolation results, which are of matrix type.

Notice: in the frame of our work, the acquisition agents take the sensors' and actuators' measurements from the Matlab/Simulink environment in the following way: an acquisition agent is associated with a timed data stream where each data item represents one measurement from one sensor or actuator of the synoptic diagram.

6 Tests and Results

The diagnosis process starts when the user selects a diagnosis method: the user can choose a diagnosis method based on binary logic, fuzzy logic, PNN or MLP (see figure 7). The corresponding detection and isolation agents are used to detect and isolate the fault.

Fig. 7. Diagnosis system

For all detection and isolation methods, the system can detect and isolate faults affecting the known variables; however, in some cases correct isolation presents some delay. Figure 8 presents fault detection and isolation when a step fault affects Df1 at 50 s in the case of the binary logic-based method. Figure 9 presents fault detection and isolation when a step fault affects De3 at 30 s in the case of the PNN-based diagnosis method. We notice in the first case an isolation delay of 0.08 s and in the second case an isolation delay of 0.12 s. This delay is due to the residual behaviour in the case of systems with parameter uncertainties [6]. The detection and isolation delays depend on the type of fault and on the diagnostic method. So, we aim to improve the FDI system performance by integrating all the proposed methods. In that case, agents' negotiation helps to obtain the best detection and isolation results.

Fig. 8. Fault detection and isolation in case of a step fault affecting Df1 at 50 s

Fig. 9. Fault detection and isolation in case of a step fault affecting De3 at 30 s

7 Conclusions and Future Work

In this work, we have proposed a multi-agent system for fault detection and isolation. The multi-agent system is composed of different agents co-operating and communicating via message exchange to ensure the diagnosis process. The detection and isolation routines are based on residual processing, and the residuals are obtained from the bond graph model. The diagnosis system is implemented in the JADE environment and has shown good detection and isolation results. Our main extension of this work is the possibility of using all proposed diagnosis methods in parallel. In that case, negotiations between the different agents help to improve the detection and isolation results and to reduce the detection and isolation delays as well as the false alarm rate. Furthermore, we aim to integrate the multi-agent diagnostic system into a real hydraulic system where the acquisition agents take measurements directly from the sensors and actuators.

References

[1] Borne, P., Dauphin-Tanguy, G., Richard, J.P., Rotella, F., Zambettakis, I.: Modélisation et identification des processus. Technip (1992)
[2] Briot, J.-P., Demazeau, Y.: Principe et architecture des systèmes multi-agents. Lavoisier, Hermès Sciences (2003)
[3] Brunet, J.: Détection et diagnostic de pannes: approche par modélisation. Editions Hermès (1990)
[4] Bouabdallah, S.B., Tagina, M.: Conception d'un système multi-agent pour la surveillance des systèmes complexes. In: JTEA 2006, Journées Tunisiennes d'Electrotechnique et d'Automatique, Hammamet, Tunisia, pp. 3–5 (2006)
[5] Bouabdallah, S.B., Tagina, M.: A fuzzy approach for fault detection and isolation of uncertain parameter systems and comparison to binary logic. In: ICINCO 2006, Intelligent Control Systems and Optimization, Portugal, pp. 98–106 (2006)
[6] Bouabdallah, S.B., Tagina, M.: A multilayer perceptron for fault detection and isolation of systems with parameter uncertainties. In: SSD 2007, Tunisia, pp. SAC-1015 (2007)
[7] Dauphin-Tanguy, G.: Les bond graphs. Hermès Sciences Publications (2000)
[8] Dumont d'Ayot, G.: Coopération et évaluation cognitive d'agents artificiels pour la supervision. Thèse de doctorat, CNRS, LAAS (2005)
[9] Evsukoff, A., Gentil, S., Montmain, J.: Fuzzy reasoning in co-operative supervision systems. Control Engineering Practice 8, 389–407 (2000)
[10] FIPA Specification, Foundation for Intelligent Physical Agents (2005), http://www.fipa.org/
[11] Garcia-Beltran, C., Gentil, S., Stuecher, R.: Causal-based diagnosis of a hydraulic looper. In: IFAC Symposium on Automation in Mining, Mineral and Metal Processing MMM 2004, Nancy, France (2004)
[12] Howley, B., Cutkosky, M., Biswas, G.: Composing and sharing dynamic models in an agent-based concurrent engineering environment. In: American Control Conference, San Diego, CA, June 2-4, pp. 3147–3153 (1999)
[13] http://jade.cselt.it/
[14] Mosterman, P.J.: A Hybrid Bond Graph Modelling Paradigm and its Application in Diagnosis. Dissertation, Vanderbilt University (1997)
[15] Tagina, M.: Application de la modélisation bond graph à la surveillance des systèmes complexes. PhD thesis, University Lille I, France (1995)
[16] Köppen-Seliger, B., Marcu, T., Capobianco, M., Gentil, S., Albert, M., Latzel, S.: MAGIC: An integrated approach for diagnostic data management and operator support. In: IFAC SAFEPROCESS, Washington, USA (2003)
[17] Wörn, H., Längle, T., Albert, M.: Multi-Agent Architecture for Monitoring and Diagnosing Complex Systems. In: CSIT, Patras, Greece (2002)
[18] Wörn, H., Längle, T., Albert, M.: Development Tool for Distributed Monitoring and Diagnostic Systems. Institute for Control and Robotics, University of Karlsruhe, Germany


A User-Friendly Evolutionary Tool for High-School Timetabling

Charalampos N. Moschopoulos1, Christos E. Alexakos1, Christina Dosi1, Grigorios N. Beligiannis2, and Spiridon D. Likothanassis1

1 Department of Computer Engineering & Informatics, University of Patras, GR-26500, Rio, Patras, Greece {mosxopul, alexakos, dosich, likothan}@ceid.upatras.gr 2 Department of Business Administration in Food and Agricultural Enterprises, University of Ioannina, G. Seferi 2, GR-30100, Agrinio, Greece [email protected]

Abstract. In this contribution, a user-friendly timetabling tool is introduced that can create feasible and efficient school timetables in a few minutes. The tool is based on an adaptive algorithm which uses evolutionary computation techniques (Beligiannis et al. 2006). The algorithm has been tested exhaustively with real input data derived from many Greek high schools, producing exceptional results. Moreover, the user-friendly interface of the presented tool has been taken into serious consideration in order to be easy for anyone to use. Nevertheless, the main advantage of the presented tool lies in its adaptive behaviour. The users can guide the presented timetabling tool, producing timetables that best fit their needs.

1 Introduction

Timetabling is a very common problem that appears in many sectors of real life, such as course scheduling, the assignment of hospital nursing personnel, etc. These problems deal with the allocation of resources to specific time periods so that some specific constraints are satisfied and the created timetable is feasible and effective. According to each case, the constraints, the resources and the elements defining the effectiveness of the timetable are determined. These timetabling problems are NP-complete in their general form regarding their computational complexity, meaning that the difficulty of finding a solution rises exponentially with their size and a deterministic algorithm giving an acceptable solution in polynomial time cannot be found.

The school timetabling problem is a subcategory of the family of timetabling problems. Many computational methods have been applied in order to approximate a near-optimal solution (Bardadym 1996), (Bufé et al. 2001), (Burke and Erben 2001), (Burke and Petrovic). There is a number of constraints that have to be taken into account when school timetabling is approached. In most cases the goal is to minimize the teachers' idle hours while achieving a balanced distribution of the lessons taught on the weekly school schedule. Time constraints are divided into soft and hard constraints (Schaerf 1999), (Eikelder and Willemen 2000). The hard constraints must be satisfied, while the soft ones are used for the evaluation of the final solution.


Lately, many computational intelligence algorithms and techniques, including genetic algorithms (Fernandes et al. 1999), (Carrasco and Pato 2001), have proved their effectiveness for this problem category.

The school timetable of each academic year, in many countries, is constructed by the teaching personnel at the beginning of the academic year. The teachers of each school quickly make a temporary timetable, and usually only after quite a long time do they end up with the final timetable. As a consequence, for some time the school is not working with its full resources. So, due to the difficulty of the problem, the teachers usually use a timetabling tool in order to produce an effective school timetable in a short time.

One of the earliest approaches is Hores (Alvarez-Valdes et al. 1995), a timetabling tool that uses tabu search and is suited to the needs of Spanish secondary schools. Following that, W.-Y. Ng introduced TESS (Ng 1997), a semi-automatic timetabling system which is a little complicated for a naive user and quite time-expensive. It uses a heuristic intelligent technique in order to place any unscheduled lesson in its correct time slot and needs the extra help of the timetabler in order to avoid any conflicts and complete the timetabling procedure. As an improvement, a new timetabling system called @PT was proposed (Kwan et al. 2003). @PT can reach better solutions than TESS in much less time. It uses sophisticated local search operators and a tabu-list-like data structure. A transformation of PaLM (Jussien 2001) has been applied to school timetabling (Cambazard et al. 2004). This system is based on constraint programming and can reach a solution in a few minutes. Finally, THOR (Melício et al. 2006) was introduced as a timetabling system for Portuguese high schools.

In this contribution, a user-friendly timetabling tool is introduced that can create feasible and efficient school timetables in a few minutes. The tool is based on an adaptive algorithm which uses evolutionary computation techniques (Beligiannis et al. 2006). The algorithm has been tested exhaustively with real input data derived from many Greek high schools, producing exceptional results. Moreover, the user-friendly interface of the presented tool has been taken into serious consideration in order to be easy for anyone to use. Nevertheless, the main advantage of the presented tool lies in its adaptive behaviour. The users can guide the timetabling tool, producing timetables that best fit their needs.

The rest of the paper is organized as follows. In section 2 a definition of the school timetabling problem is given and the constraints that are taken into consideration in the presented approach are stated. In section 3 we give a full review of the algorithm that is hosted in the presented timetabling tool. Section 4 presents in detail the presented timetabling tool's interface, while in section 5 the experimental results and the performances obtained are presented. Conclusions are given in the last section.

2 School Timetabling Problem Definition

The weekly school timetable is affected by many parameters and must satisfy a large number of constraints (Caldeira and Rosa 1997), (Fernandes et al. 1999), (Tavares et al. 1999), (Santiago-Mozos et al. 2005). The entities that are involved in such a timetable are the teachers, the lessons, the classes and the time periods. In particular, the teachers teach some lessons to the classes and this process takes place in specific time periods. Therefore, constructing a timetable means assigning, for all the teacher-lesson-class triples, the time periods in which the teaching process will take place. This has to be done in a way that as many constraints as possible are satisfied, some of them partially and some of them totally. These constraints are divided into two groups: "hard" constraints and "soft" constraints. Both of them are also divided into the ones related to teachers and the ones related to classes. When all hard constraints are satisfied, a "feasible" timetable is constructed, meaning a timetable that can actually be used by the school it has been made for. However, the quality of a timetable is mainly influenced by the proportion of soft constraints satisfied. The basic aim is to construct a feasible timetable while maximizing its quality. The formal way in which "quality" is measured will be discussed next.

Due to the fact that the presented timetabling tool has been applied in order to construct quality timetables for Greek high schools, it should be noted at this point that a weekly school timetable in Greece is typically divided into five days and each day into seven hours. The term "time period" refers to each hour of teaching in a week, and consequently the number of each time period defines the day and the hour in which it takes place. The number of time periods for a typical Greek high school is 35, as shown in Table 1.

Table 1. Relation between days, hours and time slots

           Monday  Tuesday  Wednesday  Thursday  Friday
1st hour   1       8        15         22        29
2nd hour   2       9        16         23        30
3rd hour   3       10       17         24        31
4th hour   4       11       18         25        32
5th hour   5       12       19         26        33
6th hour   6       13       20         27        34
7th hour   7       14       21         28        35
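The numbering of Table 1 amounts to a simple mapping between (day, hour) pairs and time periods. The small helper below, assuming 5 days and 7 hours as in the Greek high-school setting, only illustrates that convention; the method names are not taken from the presented tool.

```java
// Helper illustrating the time-period numbering of Table 1 (5 days x 7 hours, periods 1..35).
public class TimePeriods {

    // day: 1 = Monday .. 5 = Friday, hour: 1..7
    static int period(int day, int hour) {
        return (day - 1) * 7 + hour;
    }

    static int dayOf(int period)  { return (period - 1) / 7 + 1; }
    static int hourOf(int period) { return (period - 1) % 7 + 1; }

    public static void main(String[] args) {
        System.out.println(period(1, 1));                  // 1  (Monday, 1st hour)
        System.out.println(period(3, 4));                  // 18 (Wednesday, 4th hour)
        System.out.println(period(5, 7));                  // 35 (Friday, 7th hour)
        System.out.println(dayOf(23) + "/" + hourOf(23));  // 4/2 (Thursday, 2nd hour)
    }
}
```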

Hard constraints that must be satisfied in order to keep the timetable feasible are the following:

1. Each teacher must not be assigned to more than one class in a specific time period.
2. Each teacher must teach only classes that are initially assigned to him/her and only lessons that he/she is supposed to teach to these classes.
3. Each teacher must teach a limited and defined number of hours and lesson(s) to each class.
4. Each teacher must only teach on days he/she is available at school.
5. Each class must not have more than one lesson in a specific time period.
6. The number of lessons and hours each class must be taught by specific teachers is determined by the input data.


7. The classes' empty (no teacher assigned) time periods, whose total acceptable number is determined implicitly by the input data, must be the last hour of each day.

Soft constraints that should be satisfied in order for the timetable to be considered of high quality are the following:

1. Teachers should have as few empty periods as possible.
2. The total empty periods should be uniformly distributed over all teachers, while the empty periods of each teacher should be uniformly distributed over all days he/she is available at school.
3. The number of hours a teacher has lessons each day should not differ dramatically among these days.
4. Classes should not have the same lesson in successive hours and, if possible, not even during the same day.

Apart from the above, there are some extra factors that affect the structure of a timetable. These factors are related to co-teaching and subclasses cases. During the last decade it has become common in many schools that some classes are divided, only for the teaching of some specific lessons, into two temporary ones. This is the case especially for lessons like "foreign language", where every class is divided into two temporary ones according to the level of the students' knowledge (an "advanced" class and a "beginners" class). In such cases, the teachers who teach the specific lesson to the temporarily divided class must be assigned to this class at the same time period. This case is referred to as co-teaching. The subclasses case refers to the circumstance where a class is divided into two temporary subclasses in which different lessons are taught. This is the case especially when lab lessons, like "informatics" and "technology", are concerned. In such cases, the teachers who teach these different lessons to the subclasses of the temporarily divided class must be assigned to this class at the same time period. In the proposed approach, both co-teaching and subclasses cases are considered as hard constraints.

All the above affect the construction process and the quality of a timetable. Having all these in mind, an algorithm for school timetabling generation has been designed, implemented and incorporated in the presented tool. This algorithm is presented in the next section.

3 Review of the Hosted Algorithm

As mentioned before, computational intelligence techniques have already been widely used for solving the school timetabling problem with very satisfactory results. The presented algorithm uses a smart chromosome encoding in order to occupy as little space as possible relative to the information it encodes. Also, its structure enables it to preserve, during the reproduction processes, as many characteristics of the timetable concerning hard constraints as possible. As in most evolutionary algorithms, the utilized algorithm can be divided into three main parts: the initialization procedure, the fitness function and the genetic operators.


The initialization procedure is an important issue in every evolutionary algorithm because it should create a random population with as much diversity as possible. This gives the algorithm the opportunity to search effectively the whole space of possible solutions and not to be trapped prematurely in local optima. In addition, the initialization procedure of the proposed algorithm tries to avoid the violation of some of the hard constraints of the school timetabling problem (numbers 2, 3, 5 and 6 as mentioned in section 2). In this way, we save time, as the execution process starts from a better starting point (Beligiannis et al. 2006).

The fitness function returns a value representing the fitness of each specific chromosome, that is, how well it solves the problem faced. In addition, this is the function which checks whether the hard constraints and most of the soft constraints are satisfied by each chromosome (school timetable). According to this constraint violation check, every constraint violation is assigned a subcost, and the sum of all subcosts related to a specific chromosome is the total cost of this chromosome (Beligiannis et al. 2006). The presented algorithm owes its flexibility to the weights that are attached to each soft or hard constraint. In this way, we can guide the algorithm by assigning a high value to the constraints that we are most interested in. Weights with very high values should be assigned to all hard constraints in order for the algorithm to satisfy them and produce feasible timetables.
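A minimal sketch of the cost aggregation just described: each constraint type contributes its number of violations times a user-supplied weight, with very large weights on the hard constraints so that feasible timetables dominate. The constraint names, weights and violation counts are invented for the illustration; the actual subcost definitions are those of (Beligiannis et al. 2006).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a weighted-subcost evaluation: total cost = sum over constraints of
// (violation count * constraint weight). Lower cost means a fitter timetable.
public class TimetableCost {

    static double totalCost(Map<String, Integer> violations, Map<String, Double> weights) {
        double cost = 0.0;
        for (Map.Entry<String, Integer> e : violations.entrySet()) {
            cost += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return cost;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("teacher-clash", 10000.0);       // hard: teacher in two classes at once
        weights.put("class-gap-not-last", 10000.0);  // hard: empty period not at the end of a day
        weights.put("teacher-idle-hours", 5.0);      // soft: empty periods of teachers
        weights.put("same-lesson-same-day", 2.0);    // soft: repeated lesson within a day

        Map<String, Integer> violations = new LinkedHashMap<>();
        violations.put("teacher-clash", 0);
        violations.put("class-gap-not-last", 0);
        violations.put("teacher-idle-hours", 3);
        violations.put("same-lesson-same-day", 1);

        System.out.println(totalCost(violations, weights));  // 17.0: feasible but not perfect
    }
}
```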

The genetic operators used by the proposed algorithm are one selection operator and two mutation operators (Beligiannis et al. 2006). We decided not to use the crossover operator because experimental results have shown that crossover with this specific problem and chromosome encoding does not contribute satisfactorily, while it adds too much complexity and time delay. As the selection operator we decided to use linear ranking selection (Blickle and Thiele 1997). The first mutation operator used, called "period mutation", is applied to the timetable of a class and swaps the teachers between two time periods (Table 2). The second one, called "bad period mutation", swaps the two most costly time periods regarding the timetables of the teachers assigned to these time periods (Beligiannis et al. 2006).

Table 2. (a) Before mutation, (b) After mutation

(a)
           time slot 1   time slot 2   ...   time slot n
Class 1
Class 2
...
Class i    teacher C     teacher A     ...   teacher B
...
Class m

(b)
           time slot 1   time slot 2   ...   time slot n
Class 1
Class 2
...
Class i    teacher C     teacher B     ...   teacher A
...
Class m
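A sketch of the "period mutation" operator as described above, viewing the chromosome as a classes by time-periods matrix of teacher assignments (as in Table 2). The matrix layout, the use of -1 for empty periods and the random-number source are assumptions of this sketch, not necessarily the encoding of (Beligiannis et al. 2006).

```java
import java.util.Arrays;
import java.util.Random;

// "Period mutation" sketch: pick one class and swap the teachers assigned to it
// in two randomly chosen time periods.
public class PeriodMutation {

    static void periodMutation(int[][] timetable, Random rnd) {
        int classIdx = rnd.nextInt(timetable.length);   // pick one class
        int periods = timetable[classIdx].length;
        int p = rnd.nextInt(periods);
        int q = rnd.nextInt(periods);                   // two time periods to swap
        int tmp = timetable[classIdx][p];               // swap the assigned teachers
        timetable[classIdx][p] = timetable[classIdx][q];
        timetable[classIdx][q] = tmp;
    }

    public static void main(String[] args) {
        // 2 classes x 5 periods, teacher ids 0..3, -1 = empty period.
        int[][] timetable = {
            {0, 1, 2, 1, -1},
            {3, 0, 1, 2, -1},
        };
        periodMutation(timetable, new Random(42));
        System.out.println(Arrays.deepToString(timetable));
    }
}
```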

Finally, in order to preserve the best chromosome in every generation, a simple elitism scheme (Goldberg 1989), (Michalewicz 1996), (Mitchell 1995) is used. The best chromosome of each generation is copied to the next generation, ensuring that it is preserved in the current population for as long as it is the best compared to the other chromosomes of the population.

4 The Timetabling Tool

The users of a timetabling tool are, in most cases, teachers who are unfamiliar with complicated computational tools. So, in the presented timetabling tool we took special care to make it as friendly as possible and not to cause any confusion to the users. Moreover, the parameters of the presented tool are organized in such a way that they are extremely easy for the users to fill out. Additionally, the tool offers a Help menu that gives full guidelines about the timetabling tool and its use.

Initially, there is a welcome page in which the user can select the language of his or her preference: Greek or English. After the selection of the language, two menus appear on screen: the File menu and the Help menu. The File menu contains the options concerning the creation of a new input file, the options concerning the import of an already existing one, and the view options.

Fig. 1. The initial input form


If the user selects to create a new input file, he/she enters a new page containing an input form (Fig. 1). All necessary data must be inserted in the form in order to continue to the next step of the process of creating a new input file. These are the name of the new file, the number of classes, the number of teachers, the working days, the maximum number of teaching hours per day for a class and the number of co-teaching hours. Helpful information is provided at the bottom of the page.

Fig. 2. The second input form

When the user presses the "submit" button, these data are stored and a new form appears (Fig. 2). The elements that need to be completed in this new form are the names of the classes and the names of the teachers. Also, for each teacher the user should mark the following:

• The days that the teacher is available at school.
• The teaching hours that the teacher has to teach each class during a week.
• The number of different lessons that the teacher has to teach each class during a week.
• The co-teaching lessons, if any.

More specifically:

• In the columns under "Teachers" the user has to write the names of the teachers.
• In the columns under "Days", immediately after the name of each teacher, the user has to mark the days on which the teacher is not available at school. This is done by inserting a zero ('0') in the line that corresponds to the teacher and in the column that corresponds to the particular day.
• In the line under "Classes", the names of the classes have to be entered.
• In the columns under the name of each class, and for as many lines as the number of teachers, the user has to fill in the next two text boxes. In the first box the user has to enter the number of teaching hours of each teacher in each class (in the box which lies in the line of the specific teacher and the column of the specific class). In the second box the number of different courses that each teacher teaches in each class has to be entered.
• Split classes refer to teachers who teach in two new classes which result from the combination of two existing ones. The user has to enter the initial class in the line of the specific teacher, in the new class or in the subdivision one.

Next, the user has to press the “Save” button in order to complete the insertion of

the input parameters concerning the timetable. The tool saves the timetable’s preferences in a CSV (Comma Separated Value) file in order to be accessed in the future. The final step for the user is to complete the parameters of the evolutionary algorithm (EA) used in the presented tool. As mentioned before, the hosted algorithm of the presented tool has the ability to produce different solutions, depending on the importance of each soft constraint. Furthermore, the user can define the probabilities of the two kinds of mutation that are contained in the algorithm (Beligiannis et al 2006), the size of the initial population and the number of generations that the algo-rithm is allowed to run. Moreover, it is up to the user if he/she wants to use the cross-over operator and the elitism schema (Beligiannis et al 2006). Nevertheless, there are

Fig. 3. The Evolutionary Algorithm’s parameters


Nevertheless, there are also some default values, which have been shown, most of the time, to produce very good solutions, in order to help an untrained user (Fig. 3).

At this point, we have to note that, in every step of this procedure, if some fields are not filled in, an error message is displayed informing the user about the missing fields or about any other error that might occur.

After the completion of the final step, the user can start the execution of the algorithm. When the execution has been completed, the tool informs the user whether it has managed to construct a feasible timetable or not. Moreover, if the tool has managed to construct a feasible timetable, it stores the last generation of the algorithm (all solutions contained in the last population) for further use (in case the user wants to further refine the final result).

The output is, of course, stored as a CSV file and comprises the weekly timetable, which shows for every teacher the class he/she is assigned to at every hour of the week. From this timetable one can easily derive all conclusions related to its quality, that is, how many and which constraints are violated (Fig. 4).

All created timetables and all input data are stored in files. The user can see them through the File menu and can edit or delete the files he/she desires (Fig.5).

Fig. 4. An example of a resulting (output) timetable


Fig. 5. The File menu of the presented tool

The user, by accessing the saved files, is able to execute one of the following functionalities:

• View the input CSV file with the timetable preferences of an already processed timetable program.
• View the output program as it is generated from the tool.
• Edit input timetable parameters and re-execute the program generation process.
• Delete all the files of a timetable program (input and output) from his/her local repository.

Regarding the “edit” option, the tool provides users with two major functionalities: a) changing the parameters and re-executing the algorithm, replacing the previous output, and b) changing the parameters and saving them as a new file in the system, generating a new timetable program which is independent from the previous one. The second option stems from the fact that in most schools the timetable preferences remain the same between two consecutive school years and, as a result, the changes in the input are relatively small.

5 An Example of Using the Timetabling Tool

In order to demonstrate the efficiency and performance of the algorithm hosted in the presented tool, several simulation experiments were carried out.


For all results presented in this section, the same set of EA operators and parameters was used in order to have a fair comparison of the algorithm’s efficiency and performance. The representation used for the genomes of the genetic population is the one described in (Beligiannis et al 2006). As far as the reproduction operator is concerned, linear ranking selection was used, as described in Section 3. As stated before, no crossover operator was used, while the mutation operator was the complex mutation operator presented in (Beligiannis et al 2006) (with mutation probability equal to 0.05). The size of the population was set to 25 and the algorithm was left to run for 10000 generations. Apart from that, the algorithm hosted in the presented tool used linear scaling (Michalewicz 1996) and elitism.

The algorithm was applied to real world input data coming from thirty different Greek high schools located in the city of Patras. These data are available through http://prlab.ceid.upatras.gr/timetabling/. The comparison of the timetables constructed by the proposed EA and the real timetables used at schools in the city of Patras is based on three criteria. The first one is the distribution of teachers, that is, how evenly each teacher’s hours are distributed among the days of the proposed timetable. For example, if a teacher has to teach ten hours per week and is available at school for five days, the ideal “distribution of teachers” for him/her would be two hours of teaching per day. The second one is the distribution of lessons: one lesson should not be taught twice or more on the same day. The third one is the teachers’ gaps: the teaching hours of each teacher in a day should be continuous, with no idle hours between teaching hours.

Regarding the distribution of teachers, two results (numbers) are presented in the respective column of the following table (Table 3). The first number is the number of teachers whose teaching hours do not have an even distribution among the days of the timetable, while the second number (in brackets) is the number of days on which this uneven distribution occurs. For example, if a teacher is available at school for five days, has ten teaching hours and their distribution among these five days is 1-4-2-2-1, then the number of days on which uneven distribution occurs for this teacher is three. Concerning the second quantity, that is, the distribution of lessons, the first number in the respective column is the distinct number of classes in which the same lesson is taught twice or more on the same day. The second number (in brackets) is the total number of such classes per week; a class is counted more than once in estimating this number if there is more than one day on which the same lesson is taught twice or more in this class.

Table 3. Comparing timetables constructed by the presented tool with real world timetables used at schools

Test Data set | Number Teachers | Number Classes | Teaching Hours | Distribution Teachers (schools) | Distribution Lessons (schools) | Teachers' gaps (schools) | Distribution Teachers (tool) | Distribution Lessons (tool) | Teachers' gaps (tool) | Time
1 | 34 | 11 | 35 | 21[48] | 1[1]   | 25[34] | 18[40] | 1[1]   | 25[29] | 30'
2 | 35 | 11 | 35 | 15[34] | 0[0]   | 30[52] | 15[34] | 0[0]   | 26[42] | 30'
3 | 19 | 6  | 35 | 9[23]  | 3[3]   | 8[24]  | 4[8]   | 3[3]   | 9[9]   | 24'
4 | 19 | 7  | 35 | 6[14]  | 0[0]   | 15[31] | 5[12]  | 0[0]   | 17[29] | 24'
5 | 18 | 6  | 35 | 6[17]  | 0[0]   | 15[39] | 1[2]   | 0[0]   | 8[8]   | 22'
6 | 34 | 10 | 35 | 24[47] | 10[25] | 9[9]   | 18[33] | 10[20] | 7[7]   | 30'
7 | 35 | 13 | 35 | 24[56] | 13[36] | 24[33] | 24[50] | 13[27] | 24[32] | 45'

(The “schools” columns refer to the timetables used at schools; the “tool” columns to the timetables created by the presented tool.)


Fig. 6. The evolution of the EA hosted in the presented timetabling tool for test data set 2 (Table 5)

Fig. 7. The evolution of the EA hosted in the presented timetabling tool for test data set 5 (Table 5)

Finally, regarding the teachers’ gaps, the first number in the respective column is the distinct number of teachers in whose timetable idle hours exist, while the second number (in brackets) is the total number of idle hours for all teachers.


In the previous table (Table 3) the performance and efficiency of the proposed EA are shown by comparing the timetables constructed by it with the real timetables used at schools in the city of Patras. These data were given to us by the Headship of Second Grade Education of Achaia county (http://www.dide.ach.sch.gr/). A more detailed description of these data can be found at http://prlab.ceid.upatras.gr/timetabling/. In the second, third and fourth columns of Table 3 the number of teachers, the number of classes and the number of teaching hours of each specific school are presented, respectively. These parameters define the size of each school and affect the size and the complexity of the respective optimal timetable.

In the following figures (Fig. 6 and Fig. 7), we present the evolution of the EA hosted in the proposed timetabling tool in order to demonstrate its convergence. Our aim is to show that, although crossover is not included in the algorithm, mutation is enough to generate new good solutions and evolve the population.

6 Conclusions

In this contribution, a user-friendly adaptive evolutionary timetabling tool has been designed, developed and applied to the timetabling problem of Greek high schools in order to create feasible and efficient timetables. The algorithm has been tested exhaustively with real world input data coming from many different Greek high schools in order to demonstrate its quality and efficiency. Simulation results showed that the algorithm is able to construct a feasible and very efficient timetable quickly and easily, thus preventing disagreements and arguments among teachers and assisting each school to operate with its full resources from the beginning of the academic year. Moreover, serious consideration has been given to the user-friendly interface of the presented tool, so that it is easy for anyone to use.

References

[1] Alvarez-Valdes, R., Martin, G., Tamarit, J.M.: Hores: A timetabling system for Spanish secondary schools. TOP 3(1), 137–144 (1995)

[2] Bardadym, V.A.: Computer Aided School and Timetabling: The New Wave. In: Burke, E.K., Ross, P. (eds.) PATAT 1995. LNCS, vol. 1153. Springer, Heidelberg (1996)

[3] Beligiannis, G.N., Moschopoulos, C.N., Kaperonis, G.P., Likothanassis, S.D.: Applying evolutionary computation to the school timetabling problem: The Greek case. Comput. Oper. Res. (in Press, 2006)

[4] Blickle, T., Thiele, L.: A comparison of selection schemes used in genetic algorithms. Evol. Comput. 4(4), 361–394 (1997)

[5] Bufé, M., et al.: Automated Solution of a Highly Constrained School Timetabling Problem - Preliminary Results. In: Boers, E.J.W., et al. (eds.) EvoIASP 2001, EvoWorkshops 2001, EvoFlight 2001, EvoSTIM 2001, EvoCOP 2001, and EvoLearn 2001. LNCS, vol. 2037. Springer, Heidelberg (2001)

[6] Burke, E.K., Erben, W.: PATAT 2000. LNCS, vol. 2079. Springer, Heidelberg (2001)

[7] Burke, E.K., Petrovic, S.: Recent Research Directions in Automated Timetabling. Eur. J. Oper. Res. 140(2), 266–280 (2002)


[8] Caldeira, J.P., Rosa, A.: School timetabling using genetic search. In: Burke, E.K., Carter, M. (eds.) PATAT 1997. LNCS, vol. 1408, pp. 115–122. Springer, Heidelberg (1998)

[9] Cambazard, H., Demazeau, F., Jussien, N., David, P.: Interactively solving school timetabling problems using extensions of constraint programming. In: Burke, E.K., Trick, M.A. (eds.) PATAT 2004. LNCS, vol. 3616, pp. 190–207. Springer, Heidelberg (2005)

[10] Carrasco, M.P., Pato, M.V.: A multiobjective genetic algorithm for the class/teacher timetabling problem. In: Burke, E., Erben, W. (eds.) PATAT 2000. LNCS, vol. 2079, pp. 3–17. Springer, Heidelberg (2001)

[11] Fernandes, C., Caldeira, J.P., Melicio, F., Rosa, A.C.: Evolutionary algorithm for school timetabling. In: GECCO 1999, pp. 1777–1783 (1999)

[12] Fernandes, C., Caldeira, J.P., Melicio, F., Rosa, A.C.: High school weekly timetabling by evolutionary algorithms. In: ACM SAC, pp. 344–350 (1999)

[13] Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading (1989)

[14] Jussien, N.: e-constraints: explanation-based constraint programming. CP 2001 Workshop on User-Interaction in Constraint Satisfaction (2001)

[15] Kwan, A.C.M., et al.: An Automated School Timetabling System Using Hybrid Intelligent Techniques. In: Zhong, N., et al. (eds.) ISMIS 2003. LNCS (LNAI), vol. 2871. Springer, Heidelberg (2003)

[16] Melício, F., Caldeira, J.P., Rosa, A.: THOR: A Tool for School Timetabling. In: PATAT 2006, pp. 532–535 (2006)

[17] Michalewicz, Z.: Genetic algorithms + data structures = evolution programs, 3rd edn. Springer, New York (1996)

[18] Mitchell, M.: An introduction to genetic algorithms. MIT Press, London (1995)

[19] Ng, W.Y.: TESS: an interactive support system for school timetabling. In: Fung, et al. (eds.) Proceedings of the IFIP TC3/WG 3.4 International Conference on Information Technology in Educational Management for the Schools of the Future. Chapman & Hall Ltd., London (1997)

[20] Santiago-Mozos, R., Salcedo-Sanz, S., DePrado-Cumplido, M., Bousoño-Calzón, C.: A two-phase heuristic evolutionary algorithm for personalizing course timetables: a case study in a Spanish University. Comput. Oper. Res. 32(7), 1761–1776 (2005)

[21] Schaerf, A.: Local search techniques for large high school timetabling problems. IEEE T Syst. Man. Cy. A 29(4), 368–377 (1999)

[22] Tavares, R., Teofilo, A., Silva, P., Rosa, A.: Infected genes evolutionary algorithm. SAC, 333–338 (1999)

[23] Ten Eikelder, H.M.M., Willemen, R.J.: Some complexity aspects of secondary school timetabling problems. In: Burke, E., Erben, W. (eds.) PATAT 2000. LNCS, vol. 2079. Springer, Heidelberg (2001)


HEPAR: An Intelligent System for Hepatitis Prognosis and Liver Transplantation Decision Support

Constantinos Koutsojannis, Andrew Koupparis, and Ioannis Hatzilygeroudis

Department of Computer Engineering & Informatics, School of Engineering, University of Patras, Rion, 265 00 Patras, Hellas (Greece)
[email protected], [email protected]

Abstract. In this paper, we present the clinical evaluation of HEPAR, an intelligent system for hepatitis prognosis and liver transplantation decision support, on a UCI medical database. The prognosis process, the linguistic variables and their values were modeled based on expert knowledge, the statistical analysis of the patient records in the existing medical database and the relevant literature. The system infers the elements relating to prognosis and liver transplantation from rules, by combining risk scores with weights which can vary dynamically through fuzzy calculations. First we introduce the medical problem, the design approach to the development of the fuzzy expert system and the computing environment. Then we present the HEPAR architecture and its reasoning techniques in comparison with the probabilistic characteristics of the medical database. Finally we give a few details of the implementation. The expert system has been implemented in FuzzyCLIPS. The fuzzy rules are organized in groups so as to simulate the diagnosis process. Experimental results showed that HEPAR did quite as well as the expert did.

1 Introduction

1.1 Problem Description

According to the rules of evidence-based medicine, every clinical decision should always be based on the best verified knowledge available. Additionally, novel scientific evidence applicable to an individual case arises every day from studies published all over the world. Checking day-by-day clinical decisions against the latest biomedical information available worldwide is a complex and time-consuming job, despite the advent of almost universal computer-based indexes. Most physicians lack the necessary time, skills or motivation to routinely search, select and appraise vast sets of online sources, looking for the most proper and up-to-date answers to a given clinical question concerning a single patient. The preferred option is still to ask more expert colleagues. When, as is often the case, no experts are available in time, the physician has to rely on personal experience, which includes memory of guidelines, policies and procedures, possibly refreshed by consulting handbooks or journals that may be at hand and do not necessarily contain information that is specific, relevant and up-to-date.


On the other hand, patients, especially those affected by chronic or progressive diseases, may have the time and motivation to systematically search new sources of information that could possibly be related to their case. Though their knowledge may generally be inaccurate as an effect of medical illiteracy, media report distortions or wishful thinking, more and more frequently they ask questions about new findings or scientific advances of which the physicians are often completely unaware. Even today, intelligent systems for diagnosis, prognosis and decision support are desirable. Among the numerous applications which have been found for intelligent systems in the medical field, many aim at solving problems in gastroenterology [1].

1.2 Liver Transplantation

In this paper the issue of evaluating prognosis in hepatitis is addressed. It is often difficult to identify patients at greater risk and foretell the outcome of the disease. The grading of the disease is an estimation of its severity, regardless of the problem being acute or chronic, active or not, mild or severe. Many tests, and grading systems that combine them, have been devised in order to increase the validity of these estimations [2]. Their importance is obvious, as they are used for liver transplantation decisions. A better discriminative value means better management of transplants, less time on the waiting list and, finally, fewer deaths. An example of such a scoring system is the Child-Turcotte-Pugh score, invented in 1964 and modified in 1973, designed to estimate the risk of certain surgical operations. It uses the values of blood bilirubin, prothrombin time, albumin, grade of ascites and severity of encephalopathy. The score ranges from 5 to 15 and discriminates the patients into 3 classes. Classification in class B (score > 7) has been used as a strong criterion for transplantation. A newer system is MELD (Model for End-stage Liver Disease), introduced by the Mayo Clinic in 2002, also intended for a surgical operation [3, 4, 5]. It uses creatinine values (for the estimation of kidney function), bilirubin, INR (a test relevant to prothrombin time) and adds a factor for non-alcoholic causes. A 15% reduction in the number of deaths on waiting lists for liver transplantation has been reported after its use. Development of such grading systems always focuses on specific conditions and a well defined population [6]. However, researchers try to evaluate their application in other clinical problems beyond the original scope. So, their validity is confirmed in a wider range of situations [7, 8, 9, 10].

In this paper, we present a fuzzy expert system for hepatitis prognosis (called HEPAR). The system primarily aims to help physicians (but not experts) with liver transplantation decisions. It can also be used by medical students for training purposes.

2 Knowledge Modeling

2.1 Database Description

Appropriate diagnosis requires doctors with long experience in gastroenterology. To evaluate HEPAR we finally used the data from the UCI Machine Learning Repository (hepatitis.data). These include the outcome of the disease (live – die), age, sex, history information (use of steroids, antivirals, biopsy request), symptoms (fatigue, malaise, anorexia), signs (big liver, firm liver, palpable spleen, ascites, spider-like spots, varices) and laboratory results (bilirubin, alkaline phosphatase, SGOT, albumin, prothrombin time) [11].


The description (file hepatitis.names) is incomplete, but it sums up to:

● 6 continuous variables,
● 12 boolean (1-2 = yes-no),
● class (1-2 = die-live) and
● gender (1-2 = male-female).

Missing data are noted with “?”. Every line comprises the 20 variables for each of the 155 cases.

2.2 Database Descriptive Statistics

For the boolean variables, the number of cases in each category is given in Table 1a. In parentheses, the percentage of those who die; underlined, the statistically significant differences.

Table 1a. Number of cases in each variable category in the UCI hepatitis database (class totals: 32 DIE, 123 LIVE)

Input           | Category 1 (% die) | Category 2 (% die)
Sex             | 139 Male (23%)     | 16 Female (0%)
Steroids        | 76 Yes (26%)       | 78 No (15%)
Antivirals      | 24 Yes (8%)        | 131 No (23%)
Fatigue         | 100 Yes (30%)      | 54 No (4%)
Malaise         | 61 Yes (38%)       | 93 No (10%)
Anorexia        | 32 Yes (31%)       | 122 No (18%)
Big Liver       | 25 Yes (12%)       | 120 No (20%)
Firm Liver      | 60 Yes (22%)       | 84 No (17%)
Spleen Palpable | 30 Yes (40%)       | 120 No (16%)
Spider          | 51 Yes (43%)       | 99 No (9%)
Ascites         | 20 Yes (70%)       | 130 No (13%)
Varices         | 18 Yes (61%)       | 132 No (15%)
Biopsy          | 70 Yes (10%)       | 85 No (29%)

Table 1b. Continuous variables: average (and range) per class in the UCI hepatitis database

Input                | Class DIE         | Class LIVE
Age                  | 46.59 (30 - 70)   | 39.8 (7 - 78)
Bilirubin            | 2.54 (0.4 - 8)    | 1.15 (0.3 - 4.6)
Alkaline Phosphatase | 122.38 (62 - 280) | 101.31 (26 - 295)
SGOT                 | 99.83 (16 - 528)  | 82.44 (14 - 648)
Albumin              | 3.15 (2.1 - 4.2)  | 3.98 (2.7 - 6.4)
Prothrombin time     | 43.5 (29 - 90)    | 66.57 (0 - 100)


In Fig. 1 we can also see the primary statistics for the input variables.

Fig. 1. The HEPAR variables' distributions

3 System Architecture and Design

3.1 Data Entry

The function readdatafacts is defined in file functions.clp to take care of fact entry. The function handles [12, 13]:

1. the conversion of commas (“,”) to spaces (“ ”) in the input line, so that the explode$ function can operate and give us easy access to every value,

2. the assertion of a fact (patient (value 1) (value 2) (…)) using default values: fuzzy values with equal representation 0.5 in their whole range,

3. the modification of the fact for each variable in the input line (if it has a numeric value and not a “?”) so that the new value is stored, ending up with a fact storing the input values, or the defaults for those missing,

4. initialization of the facts (class live) and (class die) with a certainty factor of 0.01.


3.2 Fuzzy Rules

Rules reside in file rules.clp. The last rule to be activated sets the result to the global variable out:

;This rule defines the outcome in the range 1 - 2
(defrule patientgoingtodie ; last rule to be activated
  (declare (salience -1000))
  ?f <- (class die)
  ?g <- (class live)
  =>
  (bind ?*out* (- 2 (- (get-cf ?f) (get-cf ?g))))
)

If the certainty factor of (class die) is higher than that of (class live), the result is a number lower than 2. The closer it gets to 1, the more certain it is that the patient belongs to class DIE. A result greater than 2 indicates that the case is more likely to belong to class LIVE.
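In other words, the value bound to out is simply a restatement of the bind expression above:

\[
\mathit{out} = 2 - \big(\mathrm{CF}(\texttt{class die}) - \mathrm{CF}(\texttt{class live})\big),
\]

so out drops below 2 exactly when the accumulated certainty of (class die) exceeds that of (class live).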

3.3 Fuzzy Templates for Variable Modelling

Age

Age is classified as “young”, “middle age” and “old”. Most studies use age either as a continuous variable or by discriminating between over 50 and below 50. King’s College Criteria include age over 40. We assume that young age is a protective factor and we define “young” with boundary values at 35±5, while old age is a risk factor, with “old” starting at 40±5. The boundaries for young age are not objective, but seem to discriminate the UCI data quite well [14, 15].

(deftemplate fz_age ; for liver diseases bad prognosis for >40
  0 100 years
  (
    (young (0 1) (30 1) (45 0))
    (middle_age (30 0) (50 1) (65 0))
    (old (50 0) (65 1))
  )
)

Fig. 2. Linguistic value and membership functions of ‘AGE’
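The breakpoint lists in the template are interpreted by FuzzyCLIPS as piecewise-linear membership functions; for instance, the points (0 1) (30 1) (45 0) of the term “young” correspond to

\[
\mu_{\text{young}}(x) =
\begin{cases}
1, & 0 \le x \le 30,\\
(45 - x)/15, & 30 < x < 45,\\
0, & x \ge 45.
\end{cases}
\]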


Gender

1 represents “male” and 2 represents “female”.

(deftemplate fz_sex ; not necessary - just to be sure [in our data no female dies]
  1 2
  (
    (male (1 1) (1.01 0))
    (female (1.99 0) (2 1))
  )
)

Yes/No

1 represents “yes” and 2 represents “no”. There are some problems in the calculation of certainty factors if ((1 1) (1 0)) is used, so we use ((1 1) (1.01 0)).

In theory, the two classes could overlap so that the probability of error could be taken into account. After all, many variables that use this template are highly subjective. However, this doesn’t seem to help in the case of our UCI data.

(deftemplate fz_yesno ; yes/no
  1 2
  (
    (yes (1 1) (1.01 0)) ; (1 1) (1 0) complicates things with certainty factors (complementary probability perhaps)
    (no (1.99 0) (2 1))
  )
)

Bilirubin

We define the classes based on the Child-Pugh scale, which discriminates bilirubin into <2, 2-3 and >3. We define values from 2 to 3 as “high” with a certainty increasing from 0 to 1, so that values <2 are definitely not high, while values >3 are definitely high.

(deftemplate fz_bilirubin ; normal values 0.3 - 1, we have range 0.3-8
  0 10 mg/dL
  (
    (normal (0.3 1) (1 1) (2 0)) ; we don't care for low values
    (high (2 0) (3 1))
  )
)

Alkaline Phosphatase

We consider normal values to be 30 – 120 U/dL, according to Harrison’s Principles of Internal Medicine, 15th ed., as we don’t have normal values from the lab.


We assume that values higher than double the Upper Normal Value (UNV) are definitely “high”, while an increase of up to 33% of the UNV is definitely not “high”. This assumption is compatible with the expected increase in hepatitis.

(deftemplate fz_alkpho ; normal values 30 - 120, we have range 26-295, shows obstruction, in other diseases >1000
  0 400 U/dL
  (
    (normal (30 1) (120 1) (180 0)) ; terminate normal at 1.5 x UpperNormalValue
    (high (160 0) (240 1))
  )
)

SGOT

We consider normal values 5 – 50 U/L. We define that a value up to three times the UNV is not “high”, while a value more than five times the UNV definitely is. This assumption seems compatible with the UCI data.

(deftemplate fz_sgot ; difficult to define normal values (not bell shaped 5 - 50), we have range 14-648, in other diseases >10000
  0 1000 U/L
  (
    (normal (5 1) (30 1) (50 0.8) (100 0)) ; terminate normal at 2 x UpperNormalValue
    (high (150 0) (250 1)) ; start at 3x UNV, established at 5x UNV
  )
)

Albumin

Based on the Child-Pugh score once more, we define values less than 2.8 as definitely “low”, while values over 3.5 are definitely normal (not low) [15].

(deftemplate fz_albumin ; we have range 2.1-6.4, Child score divides: >3.5, 2.8-3.5, <2.8
  0 10 g/L
  (
    (normal (3 0) (3.4 0) (3.6 1) (5.5 1)) ; normal 3.5-5.5, don't care for higher values
    (low (2.8 1) (3.5 0))
  )
)

Prothrombin Time

Normally, PT values should be compared with a control. The Child-Pugh scale divides are less than +4 seconds over the control time, +4 to +6, and more than +6. Here, assuming that normal values are near 12-15, a +6 would give a value of 21.


In our data this is useless, as all patients are reported to have a PT value greater than 21 (with the exception of a “0”, which is probably a mistake). We do define a “high” class based on the Child-Pugh scale, nevertheless. According to the King’s College Criteria for liver transplantation (KCH criteria), there is a discrimination for values >100 (useless for our data, as all patients have <=100) and one for values >50. We use the latter to define a class “extreme” for values >50.

(deftemplate fz_pt ; normally this is compared with a control... we take 12-15 as normal (range 21-100)
  0 100 sec
  (
    (normal (0 0) (12 1) (15 1) (19 0)) ; Child divides: <normal+4, +4 - +6, >normal+6
    (high (15 0) (21 1))
    (extreme (45 0) (50 1))
  )
)

3.4 Fuzzy Rule Description

3.4.1 Protective Rules

These rules assert the fact (class live) when one of the protective factors is present. In total, these conditions may be regarded as protective (Table 2):

1. young age
2. use of antivirals
3. female gender
4. use of steroids

Regarding gender, there is no clear evidence. Although there are reports that women have better prognosis, many believe that this difference is not statistically significant. In the UCI data none of the 16 women dies, but we don’t believe this can be used. In the rule femalegender the certainty factor is 0 (inactive).

There is no clear evidence for the use of steroids, either. In most cases of (viral) hepatitis, steroids are not recommended. Under very specific conditions, for example in autoimmune hepatitis, they may have some positive result. In our data a possible negative effect on prognosis is indicated. We don’t know the cause of the disease in every case, so a more specific rule cannot be formulated (for example a positive effect in acute or autoimmune hepatitis). The rule onsteroids doesn’t seem to affect the program results, so we set its certainty factor to 0 (inactive).

It is certain that the use of antivirals in viral hepatitis improves prognosis. However, in the UCI data we don’t see a statistically significant difference. We set a CF of 0.1 for the onantivirals rule.

Age is a very important factor. We assume that younger patients have better prognosis; 20% more deaths are reported in patients of older age. The youngage rule has a CF of 0.2.

We also assume that young age and use of antivirals are independent factors. As such, a patient with both factors has better prognosis. A rule antivir_young is formulated with a 0.3 CF.


Table 2. Protective fuzzy rules

Rule          | (age young) | (antivirals yes) | (steroids yes) | (sex female) | CF
youngage      | +           |                  |                |              | 0.2
onantivirals  |             | +                |                |              | 0.1
onsteroids    |             |                  | +              |              | 0.0
femalegender  |             |                  |                | +            | 0.0
antivir_young | +           | +                |                |              | 0.3
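As an illustration, a protective rule such as youngage could be written roughly as follows in FuzzyCLIPS. This is a minimal sketch based on the patient template of Section 4.2 and the CFs of Table 2, not the tool's exact source code:

;; Sketch of a protective rule: a young patient supports (class live) with CF 0.2.
(defrule youngage
  (declare (CF 0.2))     ; rule certainty factor, as in Table 2
  (patient (age young))  ; fuzzy match against the fz_age term "young"
  =>
  (assert (class live)))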

The certainty factor for the fact (class live) has a range of 0.01 to 0.3, as we see from these rules. With the use of only those 3 rules we get an estimation of 0.719 for the area under the ROC curve, which means that about 72% of the cases can be classified correctly. Indeed, if we assume that only those cases for which a rule has been activated survive, we have an accuracy of 67.7%, which shows mostly the effect of age.

Fig. 3. The general structure of HEPAR

3.5 Intermediate Rules

We use these rules for an estimation of the general condition of the patient, the existence of portal hypertension (and therefore cirrhosis) and a test for the factors used by the Child-Pugh scoring system. The general condition is a combination of three variables: fatigue, malaise and anorexia. We assume that these are at least partially independent and that the existence of more than one is a sign of worse general condition. Fatigue and malaise show a statistically significant difference between the two classes (always with respect to the UCI data). There are reports that fatigue doesn’t have any prognostic value, though its frequency of appearance in chronic hepatitis is recognized. A group of rules asserts the fact (generalcondition bad) when these symptoms are reported.

The other combination of clinical evidence is used to evaluate the existence of portal hypertension, which is a result of cirrhosis and therefore a bad prognostic feature.


Table 3. Intermediate fuzzy rules

Rule                | (fatigue yes) | (malaise yes) | (anorexia yes) | CF
generalcondition1   | +             |               |                | 0.5
generalcondition2   |               | +             |                | 0.5
generalcondition3   |               |               | +              | 0.4
generalcondition12  | +             | +             |                | 0.7
generalcondition23  |               | +             | +              | 0.7
generalcondition31  | +             |               | +              | 0.7
generalcondition123 | +             | +             | +              | 0.9

Of the available data, portal hypertension is connected with palpable spleen, varices, spider spots, ascites (which is not used in these rules) and maybe a firm liver. A group of rules estimates the existence of portal hypertension and asserts the fact (portal hypertension). Combinations of the variables are pointless in this case, as the existence of one is enough for the diagnosis.

Table 4. Diagnosis fuzzy rules

Rule         | (spleen yes) | (kirsoi yes) | (spider yes) | (firmliver yes) | CF
hyperportal1 | +            |              |              |                 | 0.9
hyperportal2 |              | +            |              |                 | 0.9
hyperportal3 |              |              | +            |                 | 0.9
hyperportal4 |              |              |              | +               | 0.2
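To show how the intermediate facts are produced and then consumed, a hyperportal rule and the corresponding risk rule could look roughly like this. This is a sketch assuming the template and fact names used in the text, not the exact source:

;; Sketch: a clinical sign of portal hypertension asserts an intermediate fact...
(defrule hyperportal1
  (declare (CF 0.9))
  (patient (spleen yes))   ; palpable spleen
  =>
  (assert (portal hypertension)))

;; ...which is later turned into a risk contribution for (class die).
(defrule Hasportalhyper
  (declare (CF 0.2))
  (portal hypertension)
  =>
  (assert (class die)))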

The Child-Pugh scoring system uses the values of bilirubin, albumin, prothrombin time, severity of ascites and severity of encephalopathy. These are graded from 1 to 3, and thus the result has a range of 5 to 15. For a score greater than 10, the patient is classified to class C, where we have the greatest risk. We don’t have the data to evaluate encephalopathy; however, the existence of three of the other four is enough to assume a greater danger. A group of rules asserts the fact (child high) when these conditions are met.
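In other words, the full Child-Pugh score referred to here is simply the sum of the five component grades:

\[
\mathrm{CP} = \sum_{i=1}^{5} g_i, \qquad g_i \in \{1, 2, 3\}, \qquad 5 \le \mathrm{CP} \le 15,
\]

with class C (the greatest risk) corresponding, as stated above, to a score greater than 10.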

Table 5. Child fuzzy rules (conditions considered: (albumin low), (bilirubin high), (protime high), (ascites yes))

Rule                              | Conditions required                                           | CF
highChildscore1 - highChildscore4 | each a different combination of three of the four conditions | 0.8
highChildscoreall                 | all four conditions                                           | 1.0


High risk rules

The factors used for the assertion of (class die) and the rules that handle them are:

1. child score: this is the most reliable, CF 0.9
2. general condition: this is not objective, CF 0.1
3. portal hypertension: this uses clinical observations, CF 0.2
4. ascites: the existence of ascites alone is a very important prognostic feature; 70% of patients with ascites die in the UCI data. Although the child score uses this as well, it is reported as an independent factor. CF 0.5
5. alkaline phosphatase: not considered a statistically important factor, it is acknowledged as a possible tool, CF 0.35
6. SGOT and bilirubin values: inspired by the Emory score, which uses among others bilirubin and SGPT, we use bilirubin and the available transaminase SGOT, CF 0.5
7. Age, PT, bilirubin: this is the only rule using the “extreme” PT, based on the King’s College criteria. CF 0.7

Rules with low certainty factor are useful when the data cannot support a rule of higher validity.

Table 6. Risk calculation fuzzy rules

Rule           | (child high) | (generalcondition bad) | (portal hypertension) | (ascites yes) | CF
Ischildhigh    | +            |                        |                       |               | 0.9
Badcondition   |              | +                      |                       |               | 0.1
Hasportalhyper |              |                        | +                     |               | 0.2
Hasacites      |              |                        |                       | +             | 0.5

For example, when we don’t have the results of laboratory tests we can use the less objective clinical evidence.

Table 7. Alternative fuzzy rules

Rule                   | (alkpho high) | (sgot high) | (bilirubin high) | (age old) | (protime extreme) | CF
highalcpho             | +             |             |                  |           |                   | 0.35
emory                  |               | +           | +                |           |                   | 0.50
KingsCollegeTransplant |               |             | +                | +         | +                 | 0.70
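For example, the three-condition rule of Table 7 could be sketched as follows, again assuming the patient template of Section 4.2; the real rule may differ in detail:

;; Sketch: high bilirubin, old age and an extreme prothrombin time together
;; support (class die) with CF 0.7 (inspired by the King's College criteria).
(defrule KingsCollegeTransplant
  (declare (CF 0.7))
  (patient (bilirubin high) (age old) (protime extreme))
  =>
  (assert (class die)))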

3.6 Prognosis Rules

The majority of the rules are based on an existing grading system or on the data of the UCI database. Many modern scales cannot be used, as the data necessary for their calculation are unavailable. For example, MELD uses creatinine values and INR, Child-Pugh uses information about encephalopathy, etc.


In addition, every scale has its use in hepatitis of a specific etiology, like the Lille score for alcoholic hepatitis, which is important information also absent from the UCI data. The following variables are not used in any rules: gender, steroid use, liver size and the decision to perform biopsy. Liver size, while useful for diagnosis, cannot help in the evaluation of prognosis. It is difficult to have an objective estimation of liver size. Moreover, only the change over time could be useful in the evaluation of the severity of the disease. In the UCI data, the difference between the two classes is not statistically significant. The decision to perform biopsy may be a sign of chronic disease, which could lead to a worse prognosis. However, the biopsy result may lead to better treatment. Statistical processing of the UCI data suggests a difference, with fewer deaths among those patients who were subjected to liver biopsy. We cannot support a reasonable rule using this information; on the other hand, random tests don’t seem to improve the prognostic value of the rules.

4 Implementation Issues

4.1 FuzzyCLIPS 6.10d

The system has been implemented in FuzzyCLIPS 6.10d. Two main facts are used, (class die) and (class live). Every rule representing risk factors asserts the first, while every protective rule asserts the second. The final result is the difference between the certainty factors of these facts. While forming the rules, we keep in mind that some of the data can be combined (as all grading systems do) to form values useful for prognosis. The representation of the data is achieved through fuzzy templates which define “normal” and “abnormal” for every variable. Default fuzzy values with equal representation over their whole range are used when a value is missing. The program is segmented into 4 files.

Input-output processes

We begin with the batch file start.bat:

(defglobal ?*out* = "2")
(defglobal ?*in* = "")
(load "definitions.clp")
(load "functions.clp")
(load "rules.clp")
(open "../Hepatitis-uci/hepatitis.data" infile "r")
(open "results.csv" outfile "w")
(while (stringp (bind ?*in* (readline infile)))
  (readdatafacts ?*in*)
  (run)
  (printout outfile ?*in* "," ?*out* t)
  (reset)
)
(close infile)
(close outfile)


The variables out and in are defined: in will store a line of input from file hepatitis.data, and out will store a value in [1, 2], where 1 classifies the patient to class DIE and 2 to class LIVE. If no rule is activated, out retains its default value of 2.

Fuzzy templates are defined in file definitions.clp, the helper function readdatafacts is defined in file functions.clp and the rules are stored in file rules.clp. The input and output files are opened. The input file contains lines of this format:

2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1

A while loop begins, where

1. the next line of the input file is stored to variable in
2. function readdatafacts converts this line to facts
3. rules are executed with (run) and thus the value of out is set
4. the input line and the result are stored to the output file
5. facts are cleared with (reset)

The loop terminates when in doesn’t contain a string, in other words when the end of the file has been reached and the EOF symbol is returned.

Fig. 4. Receiver Operating Characteristic curves

Finally, the files are closed. The output file is a comma separated values (CSV) file and can be readily processed by spreadsheet programs (like Excel).



4.2 Template Definition

The following fuzzy templates are defined in file definitions.clp

1. fz_age
2. fz_sex
3. fz_yesno
4. fz_bilirubin
5. fz_alkpho
6. fz_sgot
7. fz_albumin
8. fz_pt

A template is also defined for the patient. Each given variable for the case is stored in a slot:

(deftemplate patient
  (slot prognosis)
  (slot age (type FUZZY-VALUE fz_age))
  (slot sex (type FUZZY-VALUE fz_sex))
  (slot steroid (type FUZZY-VALUE fz_yesno))
  (slot antivirals (type FUZZY-VALUE fz_yesno))
  (slot fatigue (type FUZZY-VALUE fz_yesno))
  (slot malaise (type FUZZY-VALUE fz_yesno))
  (slot anorexia (type FUZZY-VALUE fz_yesno))
  (slot bigliver (type FUZZY-VALUE fz_yesno))
  (slot firmliver (type FUZZY-VALUE fz_yesno))
  (slot spleen (type FUZZY-VALUE fz_yesno))
  (slot spider (type FUZZY-VALUE fz_yesno))
  (slot ascites (type FUZZY-VALUE fz_yesno))
  (slot kirsoi (type FUZZY-VALUE fz_yesno))
  (slot bilirubin (type FUZZY-VALUE fz_bilirubin))
  (slot alkpho (type FUZZY-VALUE fz_alkpho))
  (slot sgot (type FUZZY-VALUE fz_sgot))
  (slot albumin (type FUZZY-VALUE fz_albumin))
  (slot protime (type FUZZY-VALUE fz_pt))
  (slot biopsy (type FUZZY-VALUE fz_yesno))
)

5 Experimental Results

The program returns a continuous value, with a minimum of 1 for the cases definitely belonging to class “die”, whereas if only protective rules are activated we get values larger than 2, with a maximum of 2.3. We need to set a threshold over which cases are classified to class “live”. Afterwards, we can perform the classical analysis for false and true positive and negative results, and calculate the sensitivity and the specificity.


For example, for a threshold of 1.87 we have 96.9% sensitivity (only one false negative) and 69.9% specificity. Of those classified to “die” only 45% do indeed die, whereas 98.9% of those classified to “live” actually survive. If we set a threshold of 1.68, specificity rises to 91.9%, but sensitivity goes down to 62.5%.
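For reference, the two measures quoted here are the standard ones, computed with “die” as the positive class (correctly identifying 31 of the 32 DIE cases gives the quoted 96.9%):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}.
\]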

This relationship between sensitivity and specificity is readily observed in Receiver Operating Characteristic (ROC) curves¹, where sensitivity vs. (1 - specificity) is plotted for every threshold point [18, 19]. The area under the curve is a measure of the discriminative value of the test, with 1 showing perfect discrimination, while 0.5 indicates a useless test (equal to flipping a coin). For the estimation of the curve we use the Johns Hopkins University program on the website http://www.jrocfit.org/ (some numeric conversions are necessary for compatibility). The calculated area under the ROC curve is 0.911 (standard error 0.0245), which is considered a very good result.

It is possible that changes in the certainty factors or in the boundaries of the definitions of abnormal could give slightly better results.

The calculated ROC curve and 95% confidence intervals are shown in Fig. 4. The estimation for the area under the curve is:

AREA UNDER ROC CURVE:
Area under fitted curve (Az) = 0.9110
Estimated std. error = 0.0245
Trapezoidal (Wilcoxon) area = 0.9069
Estimated std. error = 0.0364

6 Related Work

Various systems have been created for hepatitis diagnosis, but none for liver transplantation. One of the most successful is HERMES [20].

This system’s focus is the evaluation of prognosis in chronic liver diseases. Our focus is prognosis in hepatitis, regardless of whether it is acute or chronic. So, the systems overlap in the case of chronic hepatitis. Our system covers cases of acute hepatitis; HERMES covers cases of chronic liver diseases other than hepatitis. Note that UCI doesn’t specify the patient group in hepatitis; it is not certain whether our data cover both acute and chronic disease. HERMES evaluates prognosis as a result of liver damage, therapy, hemorrhagic risk and neoplastic degeneration effects; it calculates intermediate scores which are linearly combined (with relevant weights) to estimate prognosis. Our data can cover estimations of liver damage and therapy (antiviral) effects on prognosis, and partially (through the existence of varices) hemorrhagic risk; every rule has a certainty factor, similar to the weights of HERMES. In general, HERMES makes use of more functional tests; the authors report 49 laboratory tests. Some rules make use of temporal data as well. Our variables are fuzzy.

¹ The name originates from the ability of the radar operator to discriminate whether a dot on the radar is an enemy target or a friendly ship. ROC curves are part of Signal Detection Theory, developed during the Second World War.


For example, HERMES would assert a glycid score (a functional test we don’t have in our data) if it is > 200, or if we have two tests > 140. A similar variable in our data would be defined as definitely abnormal if it was > 200, definitely normal if it was, for example, < 120, and 50% abnormal if it was 140. This gives our program power over marginal scores. The rule would be activated in our program if we had 199, but not in HERMES. Note that the HERMES description does not cover all their variables and rules. It is not specified whether the program gives some result when many variables are missing. The authors don’t give any results, so we cannot evaluate the HERMES approach, at least not from the article where the prototype is discussed.

Another system, for the diagnosis, treatment and follow-up of hepatitis B, is described in [21]. The focus of this system is the diagnosis, through serological tests (we don’t have any in the UCI data), of the exact type of the virus and an estimation (using biochemical data and the patient’s history) of a possible response to therapy (interferon therapy). The system uses rules, and is thus judged a very powerful learning tool. However, “in real-life cases, consultation of an expert system cannot serve as a substitute of the consultation of a human expert”. This is one of the many examples of expert systems focused on diagnosis. In viral hepatitis, a multitude of serological data and their changes over time must be evaluated in order to establish a diagnosis with certainty, a task rather complicated for a non-expert doctor or a student, which is why most systems aim at this problem. The rules are well-known and simple, which makes the use of expert systems ideal. Neural networks have been used as well, for example for data differentiation and parameter analysis of a chronic hepatitis B database with an artificial neuromolecular system [22]. For hepatitis B, one can define different classes: healthy subjects, healthy carriers and chronic patients. Thus, neural networks can be used to discriminate them.

The last one is described in [22], where the authors use similar data (plus a serological one), but their aim is diagnosis and, finally, they don’t use rules.

7 Conclusions and Future Work

In this paper, we present the design, implementation and evaluation of HEPAR, a fuzzy expert system that deals with hepatitis prognosis and liver transplantation. The diagnosis process was modeled based on expert knowledge and the existing literature. Linguistic variables were specified based, again, on expert knowledge and the statistical analysis of the records of 155 patients from the UCI hepatitis database. Linguistic values were determined with the help of the expert, the statistical analysis and bibliographical sources. It cannot be denied that an expert system able to provide correct prognosis in the field of chronic liver disease is to be considered an intrinsically ambitious project, for which there is no guarantee of final success. The large number of variables involved, the fact that knowledge of various aspects of the problem is still incomplete and, above all, the presence of a variable which is impossible to quantify, i.e. the peculiarity of each patient, may represent a barrier to achieving the intended results, especially with reference to individual patients. Finally, setting up a prototype of this kind may provide new information about the mechanisms which regulate the development of disease severity and the risk for liver transplantation, providing a more rational foundation for those assumptions which have so far been based only on the intuition and experience of the non-expert physician. Experimental results showed that HEPAR did quite well.


A possible improvement could be the re-determination of the values (fuzzy sets) of the linguistic variables and their membership functions. Better choices may give better results. On the other hand, use of more advanced representation methods, like hybrid ones [23], including genetic algorithms or neural networks, may also give better results.

References

[1] Disorders of the Gastrointestinal System: Liver and Biliary Tract disease. In: Harrison’s Principles of Internal Medicine, 15th edn (2001)

[2] Johnston: Special Considerations in Interpreting Liver Function Tests. American Family Physician 59(8) (1999)

[3] Shakil, et al.: Acute Liver Failure: Clinical Features, Outcome Analysis, and Applicability of Prognostic Criteria. Liver Transplantation 6(2), 163–169 (2000)

[4] Dhiman, et al.: Early Indicators of Prognosis in Fulminant Hepatic Failure: An Assessment of the Model for End-Stage Liver Disease (MELD) and King’s College Hospital Criteria. Liver Transplantation 13, 814–821 (2007)

[5] Soultati, A., et al.: Predicting utility of a model for end stage liver disease in alcoholic liver disease. World J. Gastroenterol 12(25), 4020–4025 (2006)

[6] Kamath, Kim: The Model for End-Stage Liver Disease (MELD). HEPATOLOGY (March 2007)

[7] Schepke, et al.: Comparison of MELD, Child-Pugh, and Emory Model for the Prediction of Survival in Patients Undergoing Transjugular Intrahepatic Portosystemic Shunting. AJG 98(5) (2003)

[8] Cholongitas, E., et al.: Review article: scoring systems for assessing prognosis in critically ill adult cirrhotics. Aliment Pharmacol Ther. 24, 453–464 (2006)

[9] Louvet, et al.: The Lille Model: A New Tool for Therapeutic Strategy in Patients with Severe Alcoholic Hepatitis Treated with Steroids. HEPATOLOGY 45(6) (2007)

[10] Botta, et al.: MELD scoring system is useful for predicting prognosis in patients with liver cirrhosis and is correlated with residual liver function: a European study. Gut. 52, 134–139 (2003)

[11] Hepatitis UCI database. G. Gong (Carnegie-Mellon University) via Bojan Cestnik, Jozef Stefan Institute, Ljubljana Yugoslavia (1988)

[12] Chan, et al.: Evaluation of Model for End-Stage Liver Disease for Prediction of Mortality in Decompensated Chronic Hepatitis B. American Journal of Gastroenterology (2006)

[13] Katoonizadeh, et al.: MELD score to predict outcome in adult patients with non-acetaminophen-induced acute liver failure. Liver International (2007)

[14] Jones: Fatigue Complicating Chronic Liver Disease. Metabolic Brain Disease 19(3/4) (December 2004)

[15] Forrest, Evans, Stewart, et al.: Analysis of factors predictive of mortality in alcoholic hepatitis and derivation and validation of the Glasgow alcoholic hepatitis score. Gut. 54, 1174–1179 (2005)

[16] Wadhawan, et al.: Hepatic Venous Pressure Gradient in Cirrhosis: Correlation with the Size of Varices, Bleeding, Ascites, and Child’s Status. Dig. Dis. Sci. 51, 2264–2269 (2006)

[17] Vizzutti, et al.: Liver Stiffness Measurement Predicts Severe Portal Hypertension in Patients with HCV-Related Cirrhosis. HEPATOLOGY 45(5) (2007)

[18] Fawcett, T.: ROC Graphs: Notes and Practical Considerations for Researchers (2004)


[19] Lasko, T.A., et al.: The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics 38, 404–415 (2005)

[20] Bonfà, C., Maioli, F., Sarti, G.L., Milandri, P.R., Dal Monte: HERMES: an Expert System for the Prognosis of Hepatic Diseases. Technical Report UBLCS-93-19 (September 1993)

[21] Neirotti, R., Oliveri, F., Brunetto, M.R., Bonino, F.: Software and expert system for the management of chronic hepatitis B. J. Clin. Virol 34(suppl. 1), 29–33 (2005)

[22] Chen, J.: Data differentiation and parameter analysis of a chronic hepatitis B database with an artificial neuromolecular system. Biosystems 57(1), 23–36 (2000)

[23] Medsker, L.R.: Hybrid Intelligent Systems. Kluwer Academic Publishers, Boston (1995)


Improving Web Content Delivery in eGovernment Applications

Kostas Markellos1, Marios Katsis2, and Spiros Sirmakessis3

1 University of Patras, Computer Engineering and Informatics Department, University Campus, 26500, Rio, Greece, [email protected]
2 University of Patras, Computer Engineering and Informatics Department, University Campus, 26500, Rio, Greece, [email protected]
3 Technological Educational Institution of Messolongi, Department of Applied Informatics in Administration and Economy, 30200, Messolongi, Greece, [email protected]

Abstract. The daily production of statistical information increases exponentially worldwide, making the processing of this information a very difficult and time-consuming task. Our work focuses on creating a web platform for accessing, processing and delivering statistical data in an effective way. The platform consists of a set of integrated tools specialized in the organization, validation, maintenance and safe delivery over the web of statistical information resources. The idea is to enhance the current system, providing direct dissemination functionalities to authorized international users (mainly Eurostat). In this case, the datasets will be tested and validated at national level and then stored in the Transmission System’s Database, from which they can be downloaded by Eurostat whenever necessary (in this way, the transmission via STADIUM will be skipped). Using the XTNET Rules Support and Transmission System designed for the National Statistical Service of Greece (NSSG), the whole statistics production and dissemination process is improved, because the validation checks take place at an earlier stage, as they are performed by the NSI itself, whereas the transmission is performed by Eurostat itself. Furthermore, an overall improvement of data quality is expected.

1 Introduction

One of the major problems in the management and dissemination of current Web content is the need for efficient search, filtering and browsing techniques for the available resources. This is most urgent and critical in the field of statistics, which is a conceptually rich domain where information is complex, huge in volume, and highly valuable and critical for governmental institutes and public or private enterprises. The daily production of statistical information increases exponentially worldwide, making the processing of this information a very difficult and time-consuming task. Our work focuses on creating a web platform for accessing, processing and delivering statistical data in an effective way.


The platform consists of a set of integrated tools specialized in the organization, validation, maintenance and safe delivery over the web of statistical information resources.

The main objective of our work is to present the development and implementation of a prototype system which results in the XTNET web-based system for validating and transmitting trade data to Eurostat. The intention is to extend the implemented web-oriented services not only to national users (individuals or institutes) but also to international collaborators, like Eurostat.

The expected results from the development and functioning of the application will be:

• A web-based XTNET rules platform with validation, processing and transmission tools.
• Web access to the database with the proper security measures, allowing all the involved parties to see, manipulate and use the data.
• Better validation and correction of records through a usable web interface.
• Automatic creation of validation history files that will provide a useful insight into the quality of the data.
• Export of datasets in an appropriate format for instant transmission to Eurostat.
• Enhancement of the current dissemination system, providing flexibility to users.
• Extension of web-oriented services not only to national users but also to international users.
• Overall improvement of data quality.

The rest of the paper is organized as follows. Section 2 presents the motivation and the previous work that led to the design and implementation of the extended modules. The next section describes the necessary technology background, while in Section 4 we present the system’s process and architecture. In Section 5 we present a case study assessment focusing on the functionality of the implemented tools. Finally, we conclude in Section 6 with a discussion of our work and our future plans.

2 Motivation

The NSSG is the General Secretariat of the Ministry of Economy and Finance and is responsible for the collection, processing, production and dissemination of the official foreign trade statistics. The published statistics include both Intrastat and Extrastat surveys.

Until now, data were collected at national level, processed by the NSI and finally transmitted to Eurostat via STADIUM. Eurostat validated the received data and, in case of errors, communicated with the Member States in order to correct the errors and improve the quality of the dataset. This procedure resulted in great delays in the production of the final datasets, while in some cases error correction was difficult.

The overall process is depicted in Fig. 1.

Fig. 1. Controls for the dataflows from NCAs to Eurostat. The flow comprises: (1) data provided by the PSIs; (2) data receipt by the NCAs; (3) data consolidation at the NCAs' final database; (4) data dispatch to Eurostat; (5) data receipt by Eurostat; (6) data consolidation at the Eurostat final database. Controls in the NCA environment: codes checking on INTRA and EXTRA files, checks on invalid combinations between several variables, checks for missing or duplicated records, unit value checking, mirror statistics. Controls in the Eurostat environment: code checks on the INTRA and EXTRA files received, checks on invalid combinations between several variables, checks for missing or duplicated records, checks on value and net mass for outlier detection.

In order to enhance the overall experience of the XTNET users (viewers, editors, administrators) we have created a set of tools to provide added-value services in the data exchange process between users. We have used as a basis the XTNET Rules Support System for the appropriate validation and preparation of data for Eurostat according to the XTNET rules. The main goals of the XTNET Rules Support System provided by the National Statistical Service of Greece (NSSG) are to modernize and improve the Trans-European network of trade statistics. It is a new information system for better trade data collection, processing, validation and aggregation, and for the production of useful reports and publications. The XTNET Rules Support System also provides an audit trail in order to keep a history of controls and corrections to the trade data. The main objective of the newly offered features is to enhance the current system, providing direct dissemination functionalities to authorized international users (mainly Eurostat). The XTNET Rules Transmission System will contain the final datasets on external trade statistics. In this case, datasets will be tested and validated at national level and then stored in the Transmission System's database, from which they can be downloaded by Eurostat whenever necessary (in this way, the transmission via STADIUM is skipped).

Using the XTNET Rules Support and Transmission System, the whole statistics production and dissemination process is improved, because the validation checks take place at an earlier stage, being performed by the NSI itself, whereas the transmission is now initiated by Eurostat itself.

The main features of the proposed module are the following:

• The system's database (Oracle Ver. 8i) designed and implemented according to Eurostat's rules and specifications for storing "ready-to-transmit" datasets.
• Production and usable presentation of error-free, validated datasets and their transmission to Eurostat.
• New usable interface secured with appropriate user authorization and authentication techniques.
• Secure web access to the database featuring state-of-the-art technologies (XML, web services, Internet security).
• Full-scale application of the XTNET rules at national level.
• Development of common, open modules for system interoperability and integration.
• Different views depending on the privileges of the user logged into the system.
• Users administration page, accessible only by the administrator of the system, with add user, delete user and modify users' rights functionalities.
• Enhanced paging of large datasets for more instant responses from the server.
• Client-side validation during the correction of records by the editors: the validation controls provide immediate feedback (without a round trip to the server), enhancing the user experience with the page.
• Creation of an Errors Database for returning more detailed descriptions of errors to users.
• Enhancement of the application with added-value services, providing assistance to editors for record retrieval using a search mechanism for filtering results (datasets of errors or records).
• Overall improvement of data preparation, validation and presentation by integrating optimum techniques.

The following sections describe the process in detail, the controls that the XTNET Rules Support System applies to the codes of the INTRA and EXTRA files, the format of the INTRA and EXTRA files and the details of the validation history file. Finally, the paper concludes with the technical part of the XTNET Rules Support System, which deals with the technologies and programs that were used during the development process, and with the presentation of the basic features of the system.

3 Process and System Architecture

Extrastat declarations are collected by the Integrated System of Customs Offices, while Intrastat declarations are collected automatically from the Online Intrastat Declarations System (web) and from the National Revenue Services (data entry). After an initial data validation and correction, the declarations are transmitted to the NSSG for statistical processing, analysis and dissemination to the end users. At the time the declarations are registered into the system, corrections are done manually, after having contacted the PSIs.

NOTE: The electronic system for Intrastat statement declarations has operated successfully since January 2003; it serves 45,682 registered enterprises and 4,655 accountants/accounting offices, which have declared in total 793,704 statements.

End users of the system (editors at national level) are able to run further validation controls and correction procedures (automated or manual) according to the rules documented in DocMeth400 - Rev.15, in order to produce validated INTRA, EXTRA and Aggregated Statistics files.

During the data validation process, a record of the various steps and results of the process is kept; this constitutes the validation history file for a specific dataset. If the history file contains errors, editors should correct them and rerun the data validation process (as many times as necessary) in order to arrive at a final error-free dataset ready for transmission to Eurostat. In other words, when all errors of the INTRA or EXTRA datasets are corrected, the dataset is automatically copied into the final tables, ready for transmission to Eurostat. The whole process is depicted analytically in the following figure.

Fig. 2. The XTNET Rules Transmission System Process

In the following part of this section we describe in detail the steps of controls/checks that datasets undergo. We also describe the formats of the INTRA, EXTRA and Aggregated final files, the basic features of the application and all the technical specifications.

There are five main steps of controls/checks:

1. Validity checks,
2. Credibility checks,
3. Checks for missing information or duplicate values,
4. Controls on the unit prices (value/net mass, value/supplementary units),
5. Mirror statistics exercises (optional).


Validity checks

Validity checks are carried out in order to identify invalid codes, as well as omissions. The exact controls that the XTNET Rules Support System applies to the codes of the INTRA or EXTRA files are described in the following paragraphs.

As an example, we select a section of an EXTRA file to validate. Table 1 in the Appendix describes the format and the sections that make up a data row of an EXTRA or INTRA file. The first section describes the reference period of the product.

The reference period is a six-digit code (yyyymm), in which yyyy is the reference year and mm is the reference month. The field should be checked to be a six-digit code whose first four digits are the reference year and whose last two digits are the reference month.

There are three possible errors:

a. This section is not a six-digit code
b. The code contains characters
c. The first four digits are not the reference year or the last two digits are not the reference month

In case of errors, editors get a fatal error and have to re-examine section 1 of the data row.
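A minimal sketch of how this control might be expressed is given below. It is an illustrative Python fragment, not the ASP.NET/C# code of the actual XTNET Rules Support System; the function name and the error wordings are hypothetical.

```python
def check_reference_period(section: str, expected_year: int, expected_month: int) -> list:
    """Validate section 1 (reference period, format yyyymm) of a data row.

    Returns a list of fatal-error messages; an empty list means the section
    is valid. Message wording is illustrative only.
    """
    errors = []
    if len(section) != 6:
        errors.append("fatal: the section is not a six-digit code")
    if not section.isdigit():
        errors.append("fatal: the code contains characters")
    if errors:
        return errors
    year, month = int(section[:4]), int(section[4:])
    if year != expected_year or month != expected_month:
        errors.append("fatal: the first four digits are not the reference year "
                      "or the last two digits are not the reference month")
    return errors

# Example: a dataset for January 2006
print(check_reference_period("200601", 2006, 1))  # [] -> section is valid
print(check_reference_period("20A601", 2006, 1))  # one fatal error
```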

Credibility checks

The output INTRA and EXTRA files could be checked for implausible combinations of data, in order to detect inconsistencies, not identified in previous control stages. The controls on implausible combinations will allow the transmission of data that are consistent and credible, thus improving the quality of the trade data figures.

Some common control rules that could be performed include checking the commodity code against the mode of transport (Table 1, Appendix). Certain commodity codes can only be combined with specific modes of transport. For goods whose value per unit mass (unit price) is known, it is worth checking whether the unit price justifies air transport; in other words, for an item whose unit price is below a certain threshold, air transport is not expected to be economical. An example is mineral ores.

Invalid combinations are flagged as possible errors. In case of such errors, editors should correct the mode of transport according to the commodity code, or investigate the section.

Another common control rule that could be performed is checking the commodity code against the partner country (Table 1, Appendix). Certain commodity codes can only be traded with specific partner countries, while other commodity codes cannot be traded with specific partner countries.

Invalid combinations are flagged as possible errors. In case of such errors, editors should investigate the section.
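The implausible-combination rules can be thought of as a small rule table consulted for every record. The following Python fragment is only a sketch of that idea; the code lists, field names and the air-transport threshold are invented for illustration and do not reproduce the actual XTNET rules.

```python
# Illustrative credibility rules; the actual code lists and thresholds used by
# the NCAs are not reproduced here.
ALLOWED_TRANSPORT = {           # commodity code prefix -> plausible transport modes (hypothetical)
    "2601": {"1", "2", "3"},    # e.g. iron ores: sea, rail, road
}
MIN_UNIT_PRICE_FOR_AIR = 50.0   # hypothetical threshold (EUR per kg)
AIR_TRANSPORT = "4"             # hypothetical mode-of-transport code for air

def credibility_warnings(record: dict) -> list:
    """Return warnings for implausible commodity/transport/price combinations."""
    warnings = []
    cn_prefix = record["cn_code"][:4]
    mode = record["transport_mode"]
    allowed = ALLOWED_TRANSPORT.get(cn_prefix)
    if allowed is not None and mode not in allowed:
        warnings.append("invalid combination: commodity code vs mode of transport")
    unit_price = record["value"] / record["net_mass"] if record["net_mass"] else None
    if mode == AIR_TRANSPORT and unit_price is not None and unit_price < MIN_UNIT_PRICE_FOR_AIR:
        warnings.append("suspicious: low unit price shipped by air transport")
    return warnings
```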

Checks for missing information or duplicates values

Checks for missing information or duplicates should be carried out in the INTRA and EXTRA output files before sending them to Eurostat.


Sometimes, output files contain non-accurate or partially complete records while the data in the NCA consolidation database are accurate; in such cases the errors in the output file are due to the extraction process. For instance, a flawed extraction process may yield duplicated records or a partial set of records (i.e. some records are left out).

The controls to be applied concern:

• Comparison of the total number of output records against the corresponding number of past output records sent each month to Eurostat. If large deviations are noticed, investigations should be carried out to identify and remedy the causes.

• Comparison of global figures in import/export, or at the aggregated CN2 level per partner country, with past reported data. If large deviations are noticed, investigations should be carried out to detect the cause of the discrepancies.

There are two possible errors:

d. Incomplete output file
e. Existence of duplicate records

In case of errors, editors either have to investigate all the data rows again, or they get a warning indicating that duplicate data rows have to be deleted.
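A sketch of these two controls is shown below, assuming records are available as tuples of their identifying sections; the 20% deviation threshold and the message wording are illustrative assumptions, not values prescribed by the system.

```python
from collections import Counter

def completeness_errors(records, past_record_counts, max_deviation=0.20):
    """Check an output file for duplicates and for an implausible record count.

    `records` is a list of data rows (tuples of their identifying sections);
    `past_record_counts` holds the totals of previous monthly output files.
    The 20% deviation threshold is illustrative only.
    """
    errors = []
    duplicates = [row for row, n in Counter(records).items() if n > 1]
    if duplicates:
        errors.append(f"warning: {len(duplicates)} duplicated record(s) found")
    if past_record_counts:
        expected = sum(past_record_counts) / len(past_record_counts)
        if abs(len(records) - expected) > max_deviation * expected:
            errors.append("error: total number of records deviates strongly "
                          "from past output files (possibly incomplete file)")
    return errors
```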

Control on the unit prices (value/net mass, value/supplementary units)

Controls on the unit prices (value/net mass, value/supplementary units; Table 1, Appendix) are carried out in order to detect outliers and identify suspicious transactions.

A unit value check can also be carried out on the output file before its transmission to Eurostat. For each transmitted observation, which mainly involves data aggregated by flow, commodity code and partner country, unit values are calculated: value per kilo and value per supplementary unit. These unit values are then compared with unit value intervals based on historical data.

If a unit value falls within the interval, then it will be considered valid, otherwise, it would be suspicious and further analysis will be required.

There are two possible errors:

f. The unit value is above the upper limit.
g. The unit value is below the lower limit.

In case of errors, editors get a warning that an outlier has been detected and have to investigate all the data rows again; alternatively, conditional auto-correction might be applied and adopted by the NCAs.
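The following fragment sketches the unit-value control under the assumption that the historical intervals are available as a lookup keyed by flow, commodity code and partner country; the field names are hypothetical and not taken from the actual system.

```python
def unit_value_warnings(record: dict, intervals: dict) -> list:
    """Compare unit values of an aggregated observation with historical intervals.

    `intervals` maps (flow, cn_code, partner) to (lower, upper) bounds of the
    value-per-kilo unit price derived from historical data (illustrative keys).
    """
    warnings = []
    key = (record["flow"], record["cn_code"], record["partner"])
    bounds = intervals.get(key)
    if bounds is None or not record["net_mass"]:
        return warnings
    unit_value = record["value"] / record["net_mass"]
    lower, upper = bounds
    if unit_value > upper:
        warnings.append("warning: unit value above the upper limit (possible outlier)")
    elif unit_value < lower:
        warnings.append("warning: unit value below the lower limit (possible outlier)")
    return warnings
```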

Mirror Statistics

Mirror statistics exercises can be adopted by the NCAs as a very effective tool for identifying recurring problem areas and ultimately improving the quality of external trade statistics. Mirror statistics exercises involve the comparison of "mirror" values (i.e. statistical value, net mass or supplementary units; Table 1, Appendix) of trade flows between two countries, as measured by each of the partner countries.


Eurostat has undertaken work on asymmetries and relevant statistical methodology for mirror statistics. From 2006, the NCAs will be able to check their trade data against the reported data by other NCAs due to Eurostat’s Asymmetries reporting.

The exercises can be carried out with major trading partners, starting at a more aggregated level and drilling down to a detailed level. The discrepancies can be identified by first comparing the total trade as well as the statistical information on exports/imports at the CN2 level. The CN2 groups with the most noticeable discrepancies are then selected to be studied more closely. In these groups, the statistical information can be compared at the CN6 level, which corresponds to the most detailed level of the HS (Harmonized System), and the reasons for the most significant discrepancies can be studied in detail.

A possible error is that significant discrepancies in value have been identified. In case of such errors, editors should undertake an investigation seeking to identify the causes of the discrepancies; they might also revise the trade statistics.
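Conceptually, a mirror statistics exercise at the CN2 level reduces to comparing two aggregated series, as in the illustrative sketch below; the 10% relative-difference threshold is an assumption made for the example, not a value used by the NCAs.

```python
def mirror_discrepancies(reported: dict, mirror: dict, threshold: float = 0.10) -> dict:
    """Compare reported trade values with partner-reported ("mirror") values.

    `reported` and `mirror` map a CN2 chapter code to the statistical value of
    the same flow as measured by each of the two partner countries. A relative
    difference above `threshold` is flagged for closer, CN6-level inspection.
    """
    flagged = {}
    for cn2, own_value in reported.items():
        partner_value = mirror.get(cn2)
        if not partner_value:
            continue
        rel_diff = abs(own_value - partner_value) / max(own_value, partner_value)
        if rel_diff > threshold:
            flagged[cn2] = rel_diff
    return flagged
```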

NOTE: The NSSG is planning to perform checks for implausible combinations of variables, for missing or duplicate records, for unit values and for mirror statistics in future enhancement actions of the XTNET Rules Support System.

Validation History File

During the data validation process, a record of the various steps and results of the process is kept; this constitutes the validation history file for a specific dataset.

This validation history file contains information on the validation process (i.e. editor, date, records checked, errors detected per record, automatic changes per record, etc.) in order to provide a useful insight into the quality of the data. That information will highlight the critical errors and the variables that present the highest proportion of control failures. This can be provided as feedback to the NCAs to find methodological ways to reduce their errors.

Moreover, another aim of the validation history file could be to study and analyze the process events to find the weaknesses and to make changes where necessary to minimize, perhaps even eliminate those weaknesses. For example, information on the number of cases failing each control, and on the number of these for which changes in the data resulted, indicates whether the checks are working efficiently in detecting errors.

Therefore, the validation history file should contain information for each record of the checked dataset as well as summary information for the validation process. Both types of information help the NCAs to assess the quality of their data as well as the performance of the validation process.

The validation history file will contain the following information:

Per dataset file:

• Reference period;
• Revision number;
• Type of trade (E, I);
• Time and date of processing (start time);
• ID of the person (editor) who executes the validation process;
• Total number of records checked;
• Total number of records with fatal errors;
• Total number of records with errors;
• Total number of records with warnings;
• Total trade value of records with fatal errors;
• Total trade value of records with errors;
• Total trade value of records with warnings.

Per each line of the dataset where errors are found:

• Line number (and, if necessary, pertinent data items to uniquely identify erroneous records);
• Control type: what type of control is executed, i.e. code checking, implausible combinations, outlier detection;
• Type of the error: i.e. fatal error, error, warning;
• Erroneous section(s) (e.g. product code);
• Type of change, i.e. automatic correction, further investigation by the editor, no change;
• Original value / corrected value.

Eurostat Users are able to view and export the history files directly from the XTNET Rules Transmission System.
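For illustration only, the information listed above could be represented by a record structure along the following lines; this is a hypothetical Python data model, not the Oracle schema actually used by the Transmission System.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class LineEntry:
    line_number: int
    control_type: str        # e.g. "code checking", "implausible combination", "outlier detection"
    error_type: str          # "fatal error", "error" or "warning"
    erroneous_sections: List[str]
    change_type: str         # "automatic correction", "further investigation", "no change"
    original_value: str
    corrected_value: str

@dataclass
class ValidationHistoryFile:
    reference_period: str
    revision_number: int
    trade_type: str          # "E" or "I"
    start_time: datetime
    editor_id: str
    records_checked: int
    records_with_fatal_errors: int = 0
    records_with_errors: int = 0
    records_with_warnings: int = 0
    value_with_fatal_errors: float = 0.0
    value_with_errors: float = 0.0
    value_with_warnings: float = 0.0
    lines: List[LineEntry] = field(default_factory=list)
```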

4 Technology Background

The current section deals with the technical part of the XTNET Rules Transmission System. Initially, the technologies and the programs that were used during the development process are described; a presentation of the basic features of the system follows.

The development of the application was based on the ASP.NET and .NET Framework technologies, while the products that have been used were Microsoft Windows 2000 Advanced Server, Oracle Database Version 8i, Visual Studio .NET Developer Tools, Microsoft Internet Information Services and C#.

The development platform of the XTNET Rules Transmission System is Microsoft .NET. The .NET platform has been designed so that it can unify all the applications used by an enterprise, independently of the software or hardware they might use. The key in this architecture is the XML Web Services: pieces of reusable code based on widespread standards, which an application can use without further knowledge of the operating system in which they have been created or the platform on which they are based. The .NET platform is supplemented by other products and services, like Visual Studio .NET and the .NET Framework for the development of applications, and the .NET Passport and .NET My Services, which considerably improve the service offered to users.

The .NET Framework is the programming model of the .NET platform for the development and operation of XML Web Services and applications. It provides a productive environment that is based on well-known standards for unifying an enterprise's existing investments with the next generation of applications and services. It is constituted by two main parts: the Common Language Runtime and an integrated set of class libraries, specifically ASP.NET for Web applications and XML Web Services, Windows Forms for client applications and ADO.NET for accessing data.

Visual Studio .NET is based on the previous editions of Visual Studio, so that it can offer flexibility and facilitate the development of integrated services. Moreover, Visual Studio .NET provides developers with the ability to create applications and XML Web Services that run in any browser and on any device. In addition, any programming language can be used for the development of the application, increasing the productivity of the development team, since developers can take advantage of already existing knowledge.

Finally, ASP.NET is a rich collection of objects and classes which allows the development of an application with the following advantages. Application development is facilitated, since all the settings are kept in XML files, while the components and their changes are automatically updated by the system; important facilities for debugging the application are also provided. It is also possible to design a particularly elegant application, with the presentation level dissociated from the logic level, by using code modules that are separate from the page and event-driven object-oriented programming. Moreover, any programming language can be used and the source code is actually compiled, allowing even higher performance. Another important advantage of ASP.NET is that it allows the creation of Web Services using the SOAP technology; it provides facilities for deploying the Web application in a Web Farm or Web Garden and offers extensive possibilities for caching all or part of the pages, in order to increase the application's performance.

For storing and retrieving the data files the Oracle Database Version 8i is selected to be the relational database management system that will cooperate efficiently with the above technologies as clearly stated in the specifications given by Eurostat.

5 Case Study Implementation

In order to show the general functionality of the implemented system and the added-value services it offers to the users, we describe some screenshots taken from the system, available at http://www.statistics.gr/xtnetToolWeb.

The users of the XTNET Rules Support System are provided with a username/password in order to gain access to the system. After the authorization and authentication process, users are able to navigate through the site by selecting from the menu bar at the top of the site. The selections differ depending on the privileges of the role assigned to the user (Editor/Administrator/Eurostat). The available options are:

• General (a set of pages describing the system, the formats of the files, the controls that are performed, etc.).
• Files (the datasets of INTRA, EXTRA and Aggregated files).
• Revisions (the history files of INTRA and EXTRA file validation processes).
• Final Datasets (the final validated datasets of INTRA, EXTRA and Aggregated files).
• Final Revisions (the final history files of INTRA and EXTRA file validation processes).
• Users (users administration; available only to administrators).
• Logout (logout and exit from the XTNET Rules Support System).

By selecting INTRA Files or EXTRA Files from the Files menu, the authenticated user (editor or administrator) is able to view and modify the INTRA/EXTRA files. More specifically, the user can:

• View the INTRA/EXTRA dataset simply by selecting the reference period from the list.
• View the details of each record by clicking its record number.
• Delete records (Delete button).
• Export the dataset of the selected reference period (.txt file formatted as described in DocMeth400 Rev.15) (Export button).
• Validate the dataset (run controls on the INTRA/EXTRA Community Trade codes) (Run Validations button).

Fig. 3. EXTRA Files Page (Run Validations button is clicked)

Fig. 4. Search Functionality by Code


Fig. 5. EXTRA Records Page

Users have the ability to search the dataset for a specific row by typing the code in the search textbox and clicking Search to narrow down the records of the dataset.

Additionally, users have the ability to export a selected dataset in the format specified by DocMeth400 Rev.15.

With the XTNET Rules Support System, editors are able to review every record of each dataset and modify it in case of error. This is accomplished by clicking on each record (from the INTRA/EXTRA Files pages or from the INTRA/EXTRA Revisions page). This action loads the INTRA/EXTRA Records page, where editors are able to correct the erroneous sections (a detailed tip with the type of error and an explanation appears beside each section), as presented in Figure 5.

One of the added-value services of the XTNET Rules Support System is the activation of client-side validation during the correction of records by the editors. If the user types a wrong value, the validation controls provide immediate feedback (without a round trip to the server).

In the Revisions section users are able to view the revisions (error reports) of the INTRA or EXTRA datasets. Specifically, they can:

• View the details of the INTRA/EXTRA revisions for a specified reference period from the list (the information that every revision contains has been described in the section Validation History File).
• View errors for each section of all records (grouped by line).
• Correct records by clicking the link for each record.
• Export the revision (.txt file) (Export button).
• Filter the selected revision by error type (Group by Error Type list).


Fig. 6. EXTRA Revisions Page

Fig. 7. Aggregated Statistics Page


Finally, the Aggregated Statistics page deals with the Aggregated Statistics files. Specifically, a user can:

• View the INTRA Aggregated Statistics dataset simply by selecting the reference period from the list.
• Export the dataset of the selected reference period (.txt file formatted as described in DocMeth400 Rev.15) (Export button).

6 Conclusions and Future Work

In this work we presented the development and implementation of the web-based XTNET Rules Support and Transmission System for validating and transmitting trade data to Eurostat. Our intention was to extend the implemented web-oriented services not only to national users (individuals or institutes) but also to international collaborators, such as Eurostat. In order to enhance the overall experience of the XTNET users (viewers, editors, administrators) we have created a set of tools to provide added-value services in the data exchange process between users. The main goals of the implemented system, provided by the National Statistical Service of Greece (NSSG), are to modernize and improve the Trans-European network of trade statistics, providing tools for better trade data collection, processing, validation and aggregation, and for the production of useful reports and publications. Additionally, an audit trail is offered in order to keep a history of controls and corrections to the trade data.

The idea is to enhance the current system, providing direct dissemination functionalities to authorized international users (mainly Eurostat). In this case, the datasets will be tested and validated at national level and then stored in the Transmission System's database, from which they can be downloaded by Eurostat whenever necessary (in this way, the transmission via STADIUM will be skipped). Using the XTNET Rules Support and Transmission System, the whole statistics production and dissemination process is improved, because the validation checks take place at an earlier stage, being performed by the NSI itself, whereas the transmission is now initiated by Eurostat itself.

Overall, the implemented solution provides:

• Web-based XTNET rules platform with validation, processing and transmission tools.
• Web access to the database with the proper security measures, allowing all the involved parties to see, manipulate and use the data.
• Better validation and correction process of records through a usable web interface.
• Automatic creation of validation history files that provide a useful insight into the quality of data.
• Export of datasets in the appropriate format for instant transmission to Eurostat.
• Enhancement of the current dissemination system, providing flexibility to users.
• Extension of web-oriented services not only to national users but also to international users.
• Overall improvement of data quality.

Future steps include the integration of appropriate modules and extensions to perform checks for implausible combinations of variables, for missing or duplicate records, for unit values and for mirror statistics. These checks require datasets of previous years to be prepared and imported into the data repositories (Oracle database). Furthermore, different formats (e.g. XML) of the provided dataset could be supported in order to achieve better data availability.

Appendix

Format for the INTRA and EXTRA files

Eurostat expects Member States to transmit their intra-EU trade and extra-EU trade results in accordance with the format and technical specifications of the INTRA and EXTRA files, as described in the following table.

Table 1. Format for the INTRA and EXTRA file

Section Nr. | From byte to byte | Nr. of bytes | Format | Contents | File(1)
1 | 1-6 | 6 | Character | Reference period | I + E
2 | 7-7 | 1 | Character | Threshold indicator | I + E
3 | 8-10 | 3 | Character | Reporting country | I + E
4 | 11-11 | 1 | Character | Flow | I + E
5 | 12-21 | 10 | Character | CN product or TARIC (real code *) | I + E
6 | 22-24 | 3 | Character | Partner country (real code *) | I + E
7 | 25-27 | 3 | Character | Other partner country | I + E
8 | 28-28 | 1 | Character | Statistical procedure | E
9 | 29-31 | 3 | Character | Preference | E
10 | 32-32 | 1 | Character | Mode of transport at the frontier | I + E
11 | 33-33 | 1 | Character | Container | E
12 | 34-36 | 3 | Character | Nationality of the means of transport | E
13 | 37-37 | 1 | Character | Internal mode of transport | E
14 | 38-39 | 2 | Character | Nature of transaction | I + E
15 | 40-42 | 3 | blank | Not relevant (ex Terms of delivery) |
16 | 43-43 | 1 | Character | Confidentiality flag | I + E
17 | 44-53 | 10 | Character | CN product (confidential code) | I + E
18 | 54-58 | 5 | Character | SITC product code (confidential code) | I + E
19 | 59-61 | 3 | Character | Partner country (confidential code) | I + E
20 | 62-64 | 3 | blank | Not relevant (ex Other p. country conf. code) |
21 | 65-65 | 1 | blank | Not relevant (ex Stat. proc. confidential code) |
22 | 66-66 | 1 | Character | Value flag | I + E
23 | 67-67 | 1 | blank | Not relevant (ex Invoiced amount flag) |
24 | 68-68 | 1 | Character | Quantity flag | I + E
25 | 69-69 | 1 | Character | Supplementary units flag | I + E
26 | 70-83 | 14 | Numerical | Statistical value | I + E
27 | 84-97 | 14 | blank | Not relevant (ex Invoiced amount) |
28 | 98-111 | 14 | Numerical | Quantity expressed in net mass | I + E
29 | 112-125 | 14 | Numerical | Quantity expressed in supplementary units | I + E

(1) I: Intra file, E: Extra file.
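Since the file is a fixed-width text format, each data row can be sliced according to the byte positions of Table 1. The following Python fragment is an illustrative sketch (only a subset of the 29 sections is listed, with hypothetical field names), not the loader used by the system.

```python
# Byte positions are taken from Table 1 (1-based, inclusive); only some of the
# 29 sections are shown here for brevity, and the field names are hypothetical.
LAYOUT = [
    ("reference_period",      1,   6),
    ("threshold_indicator",   7,   7),
    ("reporting_country",     8,  10),
    ("flow",                 11,  11),
    ("cn_code",              12,  21),
    ("partner_country",      22,  24),
    ("transport_mode",       32,  32),
    ("statistical_value",    70,  83),
    ("net_mass",             98, 111),
    ("supplementary_units", 112, 125),
]

def parse_record(line: str) -> dict:
    """Slice one fixed-width INTRA/EXTRA data row into named sections."""
    record = {name: line[start - 1:end].strip() for name, start, end in LAYOUT}
    # The last three sections listed above are numerical according to Table 1.
    for numeric in ("statistical_value", "net_mass", "supplementary_units"):
        record[numeric] = int(record[numeric] or 0)
    return record
```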

Sections and format of the aggregated data file

The transmission of the Aggregated Statistics data file should follow the format described in the following table.

Table 2. Format for the Aggregated Statistics file

Section Nr. | Format | Contents
1 | 6 characters | Reference period
2 | 2 characters | Reporting country
3 | 1 character | Flow
4 | 1 character | Product code
5 | 4 characters | Partner area
6 | Numerical | Statistical value




Improving Text-Dependent Speaker Recognition Performance

Donato Impedovo and Mario Refice

Dipartimento di Elettrotecnica ed Elettronica - Politecnico di Bari, via Orabona 4, 70123 Bari, Italy [email protected], [email protected]

Abstract. In this paper we investigate the role of the frame length in the computation of MFCC acoustic parameters in a text-dependent speaker recognition system. Since the vocal characteristics of subjects may vary over time, the related information conveyed by the MFCCs usually causes a significant degradation of recognition performance. In our experiment we tested the use of different frame lengths for feature extraction in the training and recognition phases for a set of speakers whose speech productions spanned 3 months. Results show that a suitable choice of the frame length combination for the training and testing phases can improve the recognition performance, reducing the false rejection rate. An expert system designed to look for the best combination of frame lengths, in order to obtain the maximum performance level of the HMM engine, may help in decreasing the amount of false rejections.

1 Introduction

In everyday life, it is a common experience for people to be able to identify speakers by their voices, and even to "distinguish between voices they have heard only one or two times" [23]. In speech technology, many attempts have been made at modelling such human ability for a number of applications, such as security access control systems, or specific investigation fields like computational forensics. Such a task is particularly challenging because, differently from fingerprints or a DNA sequence, a person's voice can change strongly, depending on several factors like state of health, emotional state, familiarity with interlocutors [23], and also over time.

In voice identification, several linguistic-phonetic parameters have been investigated and proposed as representatives of the individual vocal tract characteristics. Average fundamental frequency, formant frequency values at some specific points and intonation patterns are, among others, the most popular features for which a certain degree of reliability has been experimentally demonstrated. In speech technology applications, the Mel Frequency Cepstrum Coefficients (MFCC) are widely recognised to be a good set of acoustic parameters as they encode vocal tract and some source information, even though a reduced set of phonetic features has also been demonstrated to be effective for text-independent speaker identification performance [24].


In this paper we report on the influence of the frame length on the computation of MFCCs in a text-dependent speaker recognition system. The findings of this study constitute an additional component of an expert system we are developing, aimed at identifying a person in a security access control application. Other components of such an expert system will deal with the computation of several different parameters, like the ones mentioned above, and the final decision will depend on the hypotheses formulated by all the system components.

2 Speaker Verification Process

In our work we deal specifically with speaker verification, here defined as the process of deciding whether a speaker is who s/he claims to be. Nowadays, text-dependent applications are the ones with the highest performance and can be applied successfully in real situations. In this kind of system, the speaker is recognized through an uttered phrase known by the system, for example a password. This procedure implies at least a double security level: the first consists in the secrecy of the chosen password, while the second is represented by the vocal characteristics of the speaker.

From a general point of view, the process of speaker verification consists of a decision derived from a comparison between the features extracted by a recognition engine and those stored in a database, as schematised in Figure 1.

The state of the art for the recognition engine is based on statistical classifiers such as Hidden Markov Models (HMM) [5] or Gaussian Mixture Models (GMM) [4].

Fig. 1. Overall model of a speaker verification system


These systems work in two phases: enrolment and recognition. During the training (enrolment) phase, the system learns the generative models for each user, while during the recognition phase the unknown input is classified according to the known models, or possibly rejected.

As can be observed in Figure 1, the first macro-step for both the training and the recognition phases is feature extraction from the input speech files. Feature extraction is the process through which a reduced set of parameters is derived from the input speech in order to characterize and represent it. As already mentioned, spectrum-based features, like Mel Frequency Cepstral Coefficients (MFCCs) and their derivatives, are widely used and accepted as the basic features able to give a suitable representation of the vocal tract [1], [2], [3].

The process for extracting MFCCs from the input speech emulates the way human ears capture and process sounds with different accuracy levels over different frequency bandwidths. Unfortunately, since the vocal tract characteristics tend to vary over time, whereas the identification process is based on models trained on features extracted during the enrolment phase, the performance of the recognition process degrades significantly. A reason for this degradation may be what is called "pitch mismatch" [6], i.e. the change of the pitch cycle over time, even though both the speaker and the spoken utterances remain the same.

In this paper we refer to a system we are developing, where the password used by the speaker for verification simply consists of his/her name + surname. In order to cope with the above-mentioned problem of performance degradation, we carried out a set of tests by varying the frame length of the speech analysis window. Tests have been carried out by observing verification performance on single speakers who have been asked to perform different authentication sessions over a period of three months.

In order to cope with the huge amount of data, and with the needed flexibility of the system in dealing with the experimental adjustment of the varying analysis windows, a fast prototyping system has been used (see section 2.2 for details).

As for the testing process, an HMM recognizer has been used which has been developed in our lab [7], [8].

3 The Speaker Recognition System

As mentioned before, our system consists of two components, the first checking the validity of the uttered password, and the second verifying the identity of the speaker.

3.1 The Speech Recognition Engine

The utterance recognition component uses a set of speaker-independent models in order to recognize the pronounced password (name + surname). The model is based on a very simple grammar with a few syntactic rules for the dynamic concatenation of the basic units. At the end of the process, the recognition hypothesis is parsed using the knowledge of the known and allowed identities. The output of the system is a variable that contains the transcription of an identity present in the database or, if the utterance has been judged as not belonging to the defined set, a rejection message. The complete description of the utterance recognition system is beyond the aims of this paper and will not be illustrated in more detail here.

3.2 The Speaker Verification Engine

The speaker verification engine is described here according to its main components: the feature extraction from the speech signal and the recognition engine.

3.2.1 Features Extraction
MFCCs are obtained from the power spectrum of the speech signal. Since speech is a non-stationary signal, a short-time analysis must be performed in order to compute the Discrete Fourier Transform (DFT). Figure 2 shows the framing process. In speaker verification applications, this step is usually performed using 20-30 ms frames and a fixed shift of 10-15 ms, under the assumption that the signal is quasi-stationary within a frame interval.

Fig. 2. Framing process

The DFT is computed for each time window of the discrete-time signal x(n) of length N:

X(k) = \sum_{n=0}^{N-1} w(n)\, x(n)\, \exp(-j 2\pi k n / N)    (1)

for k = 0, 1, ..., N-1, where k corresponds to the frequency f(k) = k f_s / N, f_s is the sampling frequency in Hertz and w(n) is the Hamming time window given by w(n) = 0.54 - 0.46 \cos(2\pi n / N).

The magnitude spectrum |X(k)| is then scaled in frequency and magnitude. The frequency is scaled using the Mel filter bank H(k, m) and then the logarithm is taken:

X'(m) = \ln\left( \sum_{k=0}^{N-1} |X(k)| \cdot H(k, m) \right)    (2)

for m = 1, 2, ..., M, with M the number of filters in the bank and M << N. The Mel filter bank is a collection of triangular filters defined by the center frequencies f_c(m) as follows:


H(k, m) = \begin{cases}
0 & \text{for } f(k) < f_c(m-1) \\
\dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)} & \text{for } f_c(m-1) \le f(k) \le f_c(m) \\
\dfrac{f_c(m+1) - f(k)}{f_c(m+1) - f_c(m)} & \text{for } f_c(m) < f(k) \le f_c(m+1) \\
0 & \text{for } f(k) > f_c(m+1)
\end{cases}    (3)

and the center frequencies of the filter banks are spaced logarithmically on the frequency axis.

Finally the MFCCs are obtained by computing the Discrete Cosine Transform (DCT) of X'(m):

c(i) = \sum_{m=1}^{M} X'(m)\, \cos\left( \frac{\pi i}{M} \left( m - \frac{1}{2} \right) \right)    (4)

for i = 1, 2, ..., M, where c(i) is the i-th MFCC.

The mel warping process transforms the frequency scale in order to place less emphasis on high frequencies, as it is based on the nonlinear human perception of sounds.

Unfortunately, the alignment between the position of the frame under analysis and the stationary part of the signal is not guaranteed by any algorithm. As a consequence, artefacts are introduced in the power spectrum and in the information related to the fundamental frequency (pitch) conveyed by the MFCCs. It has been observed that for high-pitched speakers, and for those characterised by average pitch variations between the enrolment and testing phases ("pitch mismatch"), the fine spectral structure related to the pitch causes degradation of speaker recognition performance [9], [10]. In the last years, different approaches have been proposed, largely based on pitch-synchronous methods that window the cepstrum coefficients after their evaluation [11], on calculating features less sensitive to pitch changes yet capable of retaining good discriminative properties [12], or on aligning each individual frame to its natural cycle [13], [6].

In this work, an approach that uses different frame lengths is investigated. Recognition performance has been computed by using different frame lengths to extract features from the speech signal in the training and in the recognition phases. The use of frames having different sizes in the two phases has already been demonstrated to cope with the pitch mismatch, reducing false rejections for the subset of high-pitched speakers [14]. The frame lengths considered here are 22, 25, 28 and 31 ms: these values equally divide the range of the most commonly used lengths, while a fixed size of 10 ms has been adopted for the shift in the framing process.
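For illustration, the feature extraction chain of Eqs. (1)-(4) with a configurable frame length can be sketched as follows. This is a simplified NumPy implementation written for this description, not the code used in the experiments: the mel-scale formula is a common approximation, and the energy and delta terms of the actual feature vectors are omitted.

```python
import numpy as np

def mel(f):            # Hz -> mel (common approximation)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):        # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular filters H(k, m) with logarithmically spaced centre frequencies (Eq. 3)."""
    edges = mel_inv(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            H[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            H[m - 1, k] = (right - k) / max(right - centre, 1)
    return H

def mfcc(signal, fs, frame_ms=25, shift_ms=10, n_filters=24, n_ceps=19):
    """MFCCs of a speech signal for a given frame length (Eqs. 1-4, simplified)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)                      # w(n) of Eq. (1)
    H = mel_filter_bank(n_filters, frame_len, fs)
    m_idx = np.arange(1, n_filters + 1)
    i_idx = np.arange(1, n_ceps + 1)
    dct_basis = np.cos(np.pi * np.outer(i_idx, m_idx - 0.5) / n_filters)  # Eq. (4)
    features = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))           # |X(k)| from Eq. (1)
        log_mel = np.log(H @ spectrum + 1e-10)          # X'(m), Eqs. (2)-(3)
        features.append(dct_basis @ log_mel)            # c(i), Eq. (4)
    return np.array(features)

# Training and testing features can then be extracted with different frame
# lengths, e.g. mfcc(x, fs, frame_ms=28) for enrolment and mfcc(x, fs, frame_ms=22) for testing.
```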

3.2.2 The Recognition Engine
The system is based on HMMs with continuous observation densities. In text-independent applications, GMMs are generally adopted [4], even though discriminative approaches have recently been proposed [20]. A GMM can be considered as a special case of a continuous observation density HMM where the number of states is one. A continuous HMM is able to keep information about and to model not only the sounds, but also the articulation and the temporal sequencing of the speech. However, in text-independent speaker recognition tasks, the sequencing of sounds in the training data does not necessarily represent the sound sequences of the testing data, so the state transition probabilities have little influence, while in text-dependent applications they play a fundamental role.

The Hidden Markov Models considered in all our experiments adopt a left-to-right, no-skip topology: a transition may only occur from one state to the immediately following state or to the same state. For each state, a Gaussian observation probability density function (pdf) is used to statistically characterize the observed speech feature vectors.

An HMM \lambda can be characterized by a triple of state transition probabilities A, observation densities B, and initial state probabilities \Pi through the following notation:

\lambda = \{A, B, \Pi\}, \quad A = \{a_{i,j}\}, \; B = \{b_i\}, \qquad i, j = 1, \dots, N    (5)

where N is the total number of states in the model and a_{i,j} is the transition probability from state i to state j. Given an observation sequence (feature vectors) O = \{o_t\} with t = 1, \dots, T, the continuous observation probability density for state j is characterized as a mixture of Gaussian probabilities:

b_j(o_t) = \Pr(o_t \mid j) = \sum_{m=1}^{M} c_{jm}\, P(o_t;\, \mu_{jm}, R_{jm})    (6)

with

P(o_t;\, \mu_{jm}, R_{jm}) = (2\pi)^{-d/2}\, |R_{jm}|^{-1/2} \exp\left\{ -\frac{1}{2} (o_t - \mu_{jm})^T R_{jm}^{-1} (o_t - \mu_{jm}) \right\}    (7)

where M is the total number of Gaussian components in the mixture, \mu_{jm} and R_{jm} are the d-dimensional mean vector and covariance matrix of the m-th component at state j, and c_{jm} are the mixture weights, which satisfy the constraint \sum_{m=1}^{M} c_{jm} = 1.

The model parameters have been estimated by the Baum-Welch iterative method (also known as expectation-maximization, EM) in order to maximize \Pr(O \mid \lambda) [21], [22].
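As an illustration of Eqs. (6)-(7), the state observation density can be computed as in the following sketch; diagonal covariance matrices are assumed here for simplicity, which is an assumption of the example rather than a statement about the system described above.

```python
import numpy as np

def gaussian_pdf(o, mean, var):
    """P(o; mu, R) of Eq. (7) for a diagonal covariance matrix (illustrative)."""
    d = len(o)
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.prod(var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((o - mean) ** 2 / var))

def observation_density(o, weights, means, variances):
    """b_j(o_t) of Eq. (6): a weighted mixture of M Gaussian components.

    weights:   shape (M,), summing to 1 (the c_jm)
    means:     shape (M, d) (the mu_jm)
    variances: shape (M, d) (diagonal of R_jm)
    """
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))
```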

The verification phase is based on the Viterbi algorithm: given an unknown input, it evaluates the highest probability for that input to be generated by a \lambda model. In this phase, the output of the speech recognition system is used for a first validation of the result given by the Viterbi algorithm. If the recognized utterance corresponds to the identity verified by the speaker verifier, then the output is accepted; otherwise it is rejected. This allows the rejection of all unknown (i.e. not included in the database) utterances from in-set and non-in-set speakers.


4 Experimental Results

4.1 Experiment Setup

The experiments have been carried out on a specifically collected corpus, which currently includes 34 speakers (16 female and 19 male) aged between 20 and 50 (we are still in the process of increasing the size of this corpus). Each of them repeated his/her name and surname in 4 or 5 recording sessions over a time span of 3 months. The time between sessions and the number of recordings in each session vary across speakers, making the speech data much closer to real situations. In all recording sessions, a typical PC sound card has been used in an office environment, with a 22,050 Hz sampling rate, 16-bit quantization and a single channel.

A filter bank of 24 Mel filters has been used on each frame to extract 19 MFCC coefficients, their time derivatives (known as Δ cepstra and used to model trajectory information) and two energy values [15], [16], [17], [18], [19]; this makes a total of 30 parameters per vector. These parameters remain the same, while the extracted feature vectors differ by varying the frame length within the values 22, 25, 28 and 31 ms. No score normalisation has been applied.

Table 1 summarizes the parameters adopted to extract features from the speech signal.

Table 1. Features parameterization details

Parameters | MFCCs + Energy + ΔMFCCs + ΔEnergy
Frame lengths (ms) | 22, 25, 28, 31
Frame shift (ms) | 10
No. of frames for Δ calculation | 5
No. of vector elements per frame | (19+1)*2 = 30
No. of Mel filters | 24

For each speaker in the corpus, four HMMs have been trained (one per frame length) using the recordings of the first session (between 25 and 40 seconds of training data per speaker), and tested on the remaining recordings of the following sessions. Each model has 8 states and 3 Gaussian mixtures per state (this has been determined experimentally).

The relatively small amount of training data simulates real applications, where an exhaustive training session cannot be imposed on the subjects.

4.2 Recognition Performances

The results presented here refer to 5 male and 3 female speakers who have been selected as the most representative of the corpus. The first test refers to speech productions recognized with the same frame length as that used during the training phase; the corresponding models (baseline models) are referred to as M-22, M-25, M-28 and M-31 in Table 2, where the error rate is reported for speakers grouped by gender.

Table 2. Error rate for males and females using the four baseline models

Model | Males ERR (%) | Females ERR (%)
M-22 | 6.40 | 13.52
M-25 | 3.94 | 8.50
M-28 | 1.51 | 7.05
M-31 | 3.03 | 7.96

As expected, females show consistently greater error rates due to their high-pitch characteristics. It can also be observed that in this case the 28 ms model appears to be the best baseline model for both groups, even though a substantial gap exists between the two related performances.

Table 3. Performance using the same frame length for training and recognition over the T set

Spkr. | 22 ms ER (%) | 25 ms ER (%) | 28 ms ER (%) | 31 ms ER (%)
TSP-1 | 3.34 | 3.34 | 3.34 | 3.34
TSP-2 | 5.00 | 5.00 | 5.00 | 5.00
TSP-3 | 66.67 | 60.00 | 33.34 | 48.34
TSP-4 | 20.00 | 15.00 | 10.00 | 8.34

Table 4. Performance using the same frame length for training and recognition over the N set

Spkr. | 22 ms ER (%) | 25 ms ER (%) | 28 ms ER (%) | 31 ms ER (%)
NSP-1 | 1.67 | 1.67 | 1.67 | 1.67
NSP-2 | 1.67 | 1.67 | 1.67 | 1.67
NSP-3 | 3.34 | 3.34 | 1.67 | 1.67
NSP-4 | 1.67 | 1.67 | 1.67 | 3.34


A breakdown of the recognition results for each speaker is shown in Table 3 and Table 4, where speakers have been divided into two groups, named T and N respectively, according to their recognition performance with the same models.

A closer inspection reveals that the N set consists only of males, while the T set includes 3 females and 1 male. A further analysis shows that the male in the T set is characterized by a high pitch variation between the enrolment and the testing phases, confirming the already mentioned expectations.

Table 5. Performance using 22 ms input on the 4 models for the T set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
TSP-1 | 3.34 | 3.34 | 3.34 | 3.34
TSP-2 | 5.00 | 6.67 | 16.67 | 31.67
TSP-3 | 66.67 | 63.34 | 1.67 | 6.67
TSP-4 | 20.00 | 16.67 | 10.00 | 43.34

Table 6. Performance using 22 ms input on the 4 models for the N set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
NSP-1 | 1.67 | 3.34 | 3.34 | 1.67
NSP-2 | 1.67 | 1.67 | 1.67 | 1.67
NSP-3 | 3.34 | 3.34 | 1.67 | 1.67
NSP-4 | 1.67 | 3.34 | 3.34 | 1.67

Table 7. Performance using 25 ms input on the 4 models for the T set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
TSP-1 | 25.00 | 3.34 | 1.67 | 3.34
TSP-2 | 48.34 | 5.00 | 1.67 | 45.00
TSP-3 | 71.67 | 60.00 | 18.34 | 3.34
TSP-4 | 58.34 | 15.00 | 10.00 | 1.67


Table 8. Performance using 25 ms input on the 4 models for the N set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
NSP-1 | 6.67 | 1.67 | 1.67 | 3.34
NSP-2 | 5.00 | 1.67 | 1.67 | 3.34
NSP-3 | 10.00 | 3.34 | 1.67 | 1.67
NSP-4 | 15.00 | 1.67 | 1.67 | 3.34

Table 9. Performance using 28 ms input on the 4 models for the T set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
TSP-1 | 26.67 | 3.34 | 3.34 | 3.34
TSP-2 | 18.34 | 13.34 | 5.00 | 1.67
TSP-3 | 26.67 | 21.67 | 33.34 | 33.34
TSP-4 | 25.00 | 15.00 | 10.00 | 1.67

Table 10. Performance using 28 ms input on the 4 models for the N set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
NSP-1 | 28.34 | 1.67 | 1.67 | 1.67
NSP-2 | 13.34 | 1.67 | 1.67 | 1.67
NSP-3 | 20.00 | 1.67 | 1.67 | 1.67
NSP-4 | 13.34 | 1.67 | 1.67 | 3.34

As can also be noted, the frame length of 28 ms no longer satisfies the previous finding of a single best frame length for all the speakers (shown in Table 2): while speakers TSP-1 and TSP-2 show the same performance under the different configurations, TSP-3 achieves its best performance using 28 ms frames, while TSP-4 does so using 31 ms frames.

Based on these results, we further analysed the corpus by varying the frame length in the enrolment and recognition phases, for all the combinations of the selected values and for all speakers, with the aim of determining pairs of values which might be optimal for each individual speaker. The recognition performance for all the possible combinations of enrolment and testing frame lengths is shown in Tables 5-12.


Table 11. Performance using 31 ms input on the 4 models for the T set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
TSP-1 | 25.00 | 3.34 | 3.34 | 3.34
TSP-2 | 80.00 | 65.00 | 8.34 | 5.00
TSP-3 | 65.00 | 5.00 | 16.67 | 48.34
TSP-4 | 48.34 | 13.34 | 8.34 | 8.34

Table 12. Performance using 31 ms input on the 4 models for the N set

Spkr. | M-22 ER (%) | M-25 ER (%) | M-28 ER (%) | M-31 ER (%)
NSP-1 | 58.34 | 28.34 | 1.67 | 1.67
NSP-2 | 30.83 | 1.67 | 1.67 | 1.67
NSP-3 | 21.67 | 1.67 | 1.67 | 1.67
NSP-4 | 30.00 | 1.67 | 3.34 | 3.34

As it can be noted, for the N set values obtained with 28ms frames in the classic ap-proach (i.e. same frame length for enrolment and recognition), recognition rate is never outperformed by any other combination, even though the same ER values have been ob- tained using different frames lengths.

A different behaviour is observed for the T set, where the optimal pair of training and recognition frame lengths is essentially speaker-dependent, as summarized in Table 13.

Table 13. Best frame length combinations for the T set of speakers

Spkr    Training (ms)   Recognition (ms)
TSP-1   28              25
TSP-2   28              25
        31              28
TSP-3   28              22
TSP-4   31              25
        31              28
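As an illustration of how such a summary can be derived from Tables 5-12, the short sketch below selects, for one speaker, every (training, recognition) frame-length pair that reaches the minimum ER. The function name and tolerance are ours; the example values are taken from the TSP-2 rows of Tables 5, 7, 9 and 11.

def best_pairs(er_by_pair, tol=1e-9):
    """Return every (training, recognition) frame-length pair whose error
    rate equals the minimum for this speaker; ties are kept, as in Table 13."""
    best = min(er_by_pair.values())
    return sorted(pair for pair, er in er_by_pair.items() if abs(er - best) <= tol)

# TSP-2 values from Tables 5, 7, 9 and 11 (ER in %):
tsp2 = {(22, 22): 5.00, (28, 25): 1.67, (31, 28): 1.67, (28, 31): 8.34}
print(best_pairs(tsp2))  # [(28, 25), (31, 28)], i.e. the Table 13 entries for TSP-2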


5 Conclusion

In this paper we investigated the influence of frame length variation in the computation of MFCC acoustic parameters in a text-dependent speaker verification system, with the aim of ascertaining whether such information can be used to cope with the degradation of speaker recognition performance over time. The results suggest that, at least for a subset of speakers, frame length variation indeed affects the recognition performance of an HMM engine. In particular, our preliminary results show that the best combination of frame lengths for the training and recognition phases is speaker-dependent, even in a text-dependent application.

Degradation of recognition performance over time may lead to false rejections because of the threshold mechanism on which the accept/reject decision is based. An expert system that searches for the best combination of frame lengths, so as to obtain the maximum performance from the HMM engine, may help to reduce the number of false rejections.
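The following is a minimal sketch of the kind of threshold-based decision this paragraph refers to, under the assumption of per-configuration HMM log-likelihood scores and an illustrative speaker-specific threshold; the function and the numeric values are ours, not the authors' implementation.

def verify(scores_by_pair, threshold):
    """Accept the identity claim if the best-scoring (training, recognition)
    frame-length configuration clears the speaker-specific threshold.
    Selecting the best configuration first is one way an 'expert system'
    could reduce false rejections caused by score degradation over time."""
    return max(scores_by_pair.values()) >= threshold

# Illustrative numbers only: the 28/25 ms pair rescues a genuine trial that
# the matched 22/22 ms configuration alone would have falsely rejected.
print(verify({(22, 22): -105.0, (28, 25): -92.0}, threshold=-95.0))  # True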

In order to confirm these preliminary findings, further analyses are needed, notably concerning false rejections. Future work will therefore include the analysis of impostors' recordings already collected in our database.

References

[1] Doddington, G.R.: Speaker Recognition-Identifying People by their Voices. Proceedings of IEEE 73(11), 1651–1664 (1985)

[2] Mammone, R.J., Zhang, X., Ramachandran, R.P.: Robust Speaker Recognition, A Feature-based Approach. IEEE Signal Processing Magazine, 58–71 (1996)

[3] Furui, S.: Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker, New York (1989)

[4] Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10, 19–41 (2000)

[5] Rabiner, L.R., Juang, B.H.: An Introduction to Hidden Markov Models. IEEE ASSP Magazine 3(1), 4–16 (1986)

[6] Zilca, R.D., Kingsbury, B., Ramaswamy, G.N.: Pseudo Pitch Synchronous Analysis of Speech With Applications to Speaker Recognition. IEEE Transactions on Audio, Speech, and Language Processing 14(2) (2006)

[7] Impedovo, D., Refice, M.: Modular Engineering Prototyping Plan for Speech Recognition in a Visual Object Oriented Environment. Information Science and Applications 2(12), 2228–2234 (2005)

[8] Impedovo, D., Refice, M.: A Fast Prototyping System for Speech Recognition based on a Visual Object Oriented Environment. In: Proceedings of 5th ISCGAV (2005)

[9] Quatieri, T.F., Dunn, R.B., Reynolds, D.A.: On the influence of Rate, Pitch, and Spectrum on Automatic Speaker Recognition Performance. In: Proceedings of ICSLP 2000 (2000)

[10] Kim, S., Eriksson, T., Kang, H.G., Youn, D.H.: A pitch synchronous feature extraction method for speaker recognition. In: Proceedings of ICASSP 2004, pp. II-405 – II-408 (2004)

[11] Sae-Tang, S., Tanprasert, C.: Feature Windowing-Based for Thai Text-Dependent Speaker Identification Using MLP with Backpropagation Algorithm. In: Proceedings of ISCAS 2000 (2000)


[12] Liu, J., Zheng, T.F., Wu, W.: Pitch Mean Based Frequency Warping. In: Proceedings of ISCSLP 2006, pp. 87–94 (2006)

[13] Zilca, R.D., Navratil, J., Ramaswamy, G.N.: Depitch and the role of fundamental frequency in speaker recognition. In: Proceedings of ICASSP 2003, pp. II-81 – II-84 (2003)

[14] Impedovo, D., Refice, M.: The Influence of Frame Length on Speaker Identification Performance. In: Proceedings of IAS 2007, Manchester (2007)

[15] Young, S.J.: HTK, Hidden Markov model toolkit V1.4, Technical report. Cambridge University, Speech Group

[16] Rabiner, L.R., Schafer, R.: Digital Processing of Speech Signals. ISBN: 0132136031

[17] Parsons, T.: Voice and Speech Processing. McGraw-Hill, New York (1987)

[18] Oppenheim, A.V., Schafer, R.W.: Homomorphic Analysis of Speech. IEEE Transactions on Audio and Electroacoustics AU-16(2), 221–226 (1968)

[19] Deller, J., Hansen, J., Proakis, J.: Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue (1999). ISBN: 0780353862

[20] Wan, V., Renals, S.: Speaker Verification Using Sequence Discriminant Support Vector Machines. IEEE Transactions on Speech and Audio Processing 13(2) (March 2005)

[21] Baum, L., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41(1), 164–171 (1970)

[22] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society 39(1), 1–38 (1977)

[23] Nolan, F.: Dynamic Variability in Speech (DyViS). A forensic phonetic study on British English, http://www.ling.cam.ac.uk/dyvis/

[24] Espy-Wilson, C.Y., Manocha, S., Vishnubhotla, S.: A new set of features for text-independent speaker identification. In: Proceedings of ICSLP 2006, pp. 1475–1478 (2006)


Author Index

Alexakos, Christos E. 149
Arshad, Mohd Rizal 1
Becker, Eric 11
Beligiannis, Grigorios N. 105, 149
Bouabdallah, Salma Bouslama 135
Chen, Austin H. 89
Chen, Guan Ting 89
Dosi, Christina 149
Ford, James 11
Fotakis, Dimitrios 105
Gorji, Aliakbar 51
Hatzilygeroudis, Ioannis 163
Henz, Martin 25
Impedovo, Donato 199
Jinno, Hirokazu 121
Karatzas, Kostas D. 37
Katsis, Marios 181
Koupparis, Andrew 163
Koutsojannis, Constantinos 163
Kriksciuniene, Dalia 77
Kyriakidis, Ioannis 37
Likothanassis, Spiridon D. 105, 149
Makedon, Fillia S. 11
Markellos, Kostas 181
Mat Sakim, Harsa Amylia 1
Menhaj, Mohammad Bagher 51
Moschopoulos, Charalampos N. 105, 149
Mourtos, Ioannis 69
Ogawa, Masayuki 121
Othman, Nor Hayati 1
Papadourakis, George 37
Refice, Mario 199
Saddam, Ramla 135
Sakalauskas, Virgilijus 77
Salleh, Nuryanti Mohd 1
Sato, Tatsuo 121
Shiry, Saeed 51
Sirmakessis, Spiros 181
Tagina, Moncef 135
Truong, Hoang-Minh 25
Tsiatsis, Panagiotis 105
Tsunoyama, Masahiro 121
Xanthopoulos, Spyros 69
Xu, Yurong 11