Transcript
Page 1: ADCOM 2009 Conference Proceedings

Proceedings of the

17th International Conference on Advanced Computing and

Communications

December 14 – 17, 2009, Bangalore, India

ADCOM 2009

Page 2: ADCOM 2009 Conference Proceedings

Copyright © 2009 by the Advanced Computing and Communications Society. All rights reserved.

Advanced Computing & Communications Society

Gate #2, CV Raman Avenue (Adj. to Tata Book House), Indian Institute of Science,

Bangalore, India. PIN: 560012

Page 3: ADCOM 2009 Conference Proceedings

CONTENTS

Message from the Organizing Committee    i
Steering Committee    ii
Reviewers    iii

GRID ARCHITECTURE

Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan Palaniappan

2

Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance Prahlada Rao BB, Mangala N and Amit Chauhan

10

SERVER VIRTUALIZATION

Is I/O Virtualization ready for End-to-End Application Performance? J. Lakshmi, S. K. Nandy

19

Eco-friendly Features of a Data Centre OS S. Prakki

26

HUMAN COMPUTER INTERFACE -1

Low Power Biometric Capacitive CMOS Fingerprint Sensor System Shankkar B, Roy Paily and Tarun Kumar

34

Particle Swarm Optimization for Feature Selection: An Application to Fusion of Palmprint and Face Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar

38

GRID SERVICES

OpenPEX: An Open Provisioning and EXecution System for Virtual Machines Srikumar Venugopal, James Broberg and Rajkumar Buyya

45

Exploiting Grid Heterogeneity for Energy Gain Saurabh Kumar Garg

53

Intelligent Data Analytics Console Snehal Gaikwad, Aashish Jog and Mihir Kedia

60

COMPUTATIONAL BIOLOGY

Digital Processing of Biomedical Signals with Applications to Medicine D. Narayan Dutt

69

Supervised Gene Clustering for Extraction of Discriminative Features from Microarray Data C. Das, P. Maji, S. Chattopadhyay

75

Modified Greedy Search Algorithm for Biclustering Gene Expression Data S.Das, S.M. Idicula

83

Page 4: ADCOM 2009 Conference Proceedings

AD-HOC NETWORKS

Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved Heuristics Rajiv Saxena and Alok Singh

90

Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents Santosh Kulkarni and Prathima Agrawal

96

A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks Natarajan Meghanathan and Ayomide Odunsi

83

NETWORK OPTIMIZATION

Optimal Network Partitioning for Distributed Computing Using Discrete Optimization Angeline Ezhilarasi G and Shanti Swarup K

113

An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an Asynchronous Distributed Systems Suman Kundu and Uttam Kumar Roy

118

A SAL Based Algorithm for Convex Optimization Problems Amit Kumar Mishra

125

WIRELESS SENSOR NETWORKS

Energy Efficient Cluster Formation using Minimum Separation Distance and Relay CH’s in Wireless Sensor Networks V. V. S. Suresh Kalepu and Raja Datta

130

An Energy Efficient Base Station to Node Communication Protocol for Wireless Sensor Networks Pankaj Gupta, Tarun Bansal and Manoj Misra

136

A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain

144

GRID SCHEDULING

Energy-efficient Scheduling of Grid Computing Clusters Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri

153

Energy Efficient High Available System: An Intelligent Agent Based Approach Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S

160

A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments Amit Agarwal and Padam Kumar

168

Page 5: ADCOM 2009 Conference Proceedings

HUMAN COMPUTER INTERFACE -2

Towards Geometrical Password for Mobile Phones Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar

175

Improving Performance of Speaker Identification System Using Complementary Information Fusion Md Sahidullah, Sandipan Chakroborty and Goutam Saha

182

Right Brain Testing-Applying Gestalt psychology in Software Testing Narayanan Palani

188

MOBILE AD-HOC NETWORKS

Intelligent Agent based QoS Enabled Node Disjoint Multipath Routing Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath

193

Close to Regular Covering by Mobile Sensors with Adjustable Ranges Adil Erzin, Soumyendu Raha and V. N. Muralidhara

200

Virtual Backbone Based Reliable Multicasting for MANET Dipankaj Medhi

204

DISTRIBUTED SYSTEMS

Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika

215

Trust in Mobile Ad Hoc Service GRID Sundar Raman S and Varalakshmi P

223

Scheduling Light-trails on WDM Rings Soumitra Pal and Abhiram Ranade

227

FOCUSSED SESSION ON RECONFIGURABLE COMPUTING

AES and ECC Cryptography Processor with Runtime Reconfiguration Samuel Anato, Ricardo Chaves, Leonel Sousa

236

The Delft Reconfigurable VLIW Processor Stephen Wong, Fakhar Anjam

244

Runtime Reconfiguration of Polyhedral Process Network Implementation Hristo Nikolov, Todor Stefanov, Ed Deprettere

252

REDEFINE: Optimizations for Achieving High Throughput Keshavan Varadarajan, Ganesh Garga, Mythri Alle, S K Nandy, Ranjani Narayan

259

Page 6: ADCOM 2009 Conference Proceedings

Poster Papers

A Comparative Study of Different Packet Scheduling Algorithms with Varied Network Service Load in IEEE 802.16 Broadband Wireless Access Systems Prasun Chowdhury, Iti Saha Misra

267

A Simulation Based Comparison of Gateway Load Balancing Strategies in Integrated Internet-MANET Rafi-U-Zaman, Khaleel-Ur-Rahman Khan, M. A. Razzaq, A. Venugopal Reddy

270

ECAR: An Efficient Channel Assignment and Routing in Wireless Mesh Network S. V. Rao, Chaitanya P. Umbare

273

Rotational Invariant Texture Classification of Color Images using Local Texture Patterns A. Suruliandi, E. M. Srinivasan, K. Ramar

276

Time Synchronization for an Efficient Sensor Network System Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K., P. Deepa Shenoy, and Venugopal K R

280

Parallel Hybrid Germ Swarm Computing for Video Compression K. M. Bakwad, S. S. Pattnaik, B. S. Sohi, S. Devi, B. K. Panigrahi, M. R. Lohokare

283

Texture Classification using Local Texture Patterns: A Fuzzy Logic Approach E. M. Srinivasan, A. Suruliandi, K. Ramar

286

Integer Sequence based Discrete Gaussian and Reconfigurable Random Number Generator Arulalan Rajan, H S Jamadagni, Ashok Rao

290

Parallelization of PageRank and HITS Algorithm on CUDA Architecture Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal

294

Designing Application Specific Irregular Topology for Network-on-Chip Virendra Singh, Naveen Choudhary, M. S. Gaur, V. Laxmi

297

QoS Aware Minimally Adaptive XY Routing for NoC Navaneeth Rameshan, Mushtaq Ahmed, M. S. Gaur, Vijay Laxmi and Anurag Biyani

300

Page 7: ADCOM 2009 Conference Proceedings

Message from the Organizers

Welcome to the 17th International Conference on Advanced Computing and Communications (ADCOM 2009) being held at the Indian Institute of Science, Bangalore, India during December 14-18, 2009.

ADCOM, the flagship event of the Advanced Computing and Communication Society (ACCS), is a major international conference that attracts professionals from industry and academia across the world to share and disseminate their innovative and pioneering views on recent trends and developments in the computational sciences. ACCS is a registered scientific society founded to provide a forum for individuals, institutions and industry to promote advanced computing and communication technologies.

Building upon the success of last year’s conference, the 2009 Conference will focus on "Green Computing" to promote higher standards for energy-efficient data centers, central processing units, servers and peripherals as well as reduced resource consumption towards a sustainable 'green' ecosystem. ADCOM will also explore computing for the rural masses in improving delivery of public services like education and primary health care.

Prof. Patrick Dewilde from the Technical University of Munich and Prof. N. Balakrishnan from the Indian Institute of Science are the General Chairs for ADCOM 2009. The organizers thank Padma Bhushan Professor Thomas Kailath, Professor Emeritus at Stanford University, for being the Chief Guest for the inaugural event of the conference and for honoring the “DAC-ACCS Foundation Awards 2009” awardees, Prof. Raj Jain from Washington University, USA and Prof. Anurag Kumar from the Indian Institute of Science, India, for their exceptional contributions to the advancement of networking technologies.

The conference features 8 plenary and 8 invited talks from internationally acclaimed leaders in industry and academia. The Programme Committee had the arduous task of selecting the 30 papers to be presented in 12 sessions and the 11 poster presentations from a total of 326 submissions. ADCOM 2009 will have a special focused session on “Emerging Reconfigurable Systems” with 4 invited presentations, to be followed by an open forum for discussions. In tune with the theme of the conference, an Industry Session on Green Datacenters is organized to disseminate awareness of energy-efficient solutions. A total of 8 tutorials on current topics in various aspects of computing are arranged following the main conference.

The organizers sincerely thank all authors, reviewers, programme committee members, volunteers and participants for their continued support for the success of ADCOM 2009. We welcome you all to enjoy the green and serene ambience of the Indian Institute of Science, the venue of the conference, in the IT capital of India.

i

Page 8: ADCOM 2009 Conference Proceedings

ADCOM 2009 STEERING COMMITTEE

Patron

P. Balaram, IISc, India

General Chairs

N. Balakrishnan, IISc, India Patrick Dewilde, TU Munich

Technical Programme Chairs

S. K. Nandy, IISc, India S Uma Mahesh, Indrion, India

Organising Chairs

B.S. Bindhumadhava, CDAC, India S. K. Sinha, CEDT, India

Industry Chairs

H. S. Jamadagni, IISc, India Krithiwas Neelakantan, SUN, India Lavanya Rastogi, Value-One, India

Saragur M. Srinidhi, Prometheus Consulting, India

Publicity & Media Chairs

G. L. Ganga Prasad, CDAC, India

P. V. G. Menon, VANN Consulting, India

Publications Chair

K. Rajan, IISc, India

Finance Chairs

G. N. Rathna, IISc, India

Advisory Committee

Harish Mysore, India

K. Subramanian, IGNOU, India
Ramanath Padmanabhan, INTEL, India
Sunil Sherlekar, CRL, India
Vittal Kini, Intel, India
N. Rama Murthy, CAIR, India
Ashok Das, Sun Moksha, India
Sridhar Mitta, India

ii

Page 9: ADCOM 2009 Conference Proceedings

Reviewers for ADCOM 2009

The following reviewers participated in the review process of ADCOM 2009. We gratefully acknowledge their contributions.

Benjamin Premkumar Dhamodaran Sampath Ilia Polian Jordi Torres Lipika Ray Hari Gupta Sudha Balodia Madhu Gopinathan Kapil Vaswani K V Raghavan Chakraborty Joy Aditya Kanade Arnab De Aditya Kanade Aditya Nori Karthik Raghavan Rajugopal Gubbi A Sriniwas P V Ananda Mohan Anirban Ghosh Asim Yarkhan Bhakthavathsala Chandra Sekhar Seelamantula Chiranjib Bhattacharyya Debnath Pal Debojyoti Dutta Deepak D' Souza Haresh Dagale Joy Kuri K R Ramakrishnan R. Krishna Kumar K S Venkataraghavan Krishna Kumar R K M J Shankar Raman Manikandan Karuppasamy Mrs J Lakshmi Nagasuma Chandra Nagi Naganathan Narahari Yadati Natarajan Kannan Natarajan Meghanathan Neelesh B Mehta Nirmal Kumar Sancheti

P S Sastry Parag C Prasad R Govindarajan Rahul Banerjee Santanu Mahapatra Sathish S Vadhiyar Shipra Agarwal Soumyendu Raha Srinivasan Murali Sundararajan V T V Prabhakar Thara Angksun V Kamakoti Vadlamani Lalitha Veni Madhavan C E Venkataraghavan k Vinayak Naik Virendra Singh Vishwanath G V C V Rao Gopinath K Vivekananda Vedula Y N Srikant Zhizhong Chen

iii

Page 10: ADCOM 2009 Conference Proceedings

ADCOM 2009

GRID ARCHITECTURE

Session Papers:

1. Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan Palaniappan, “Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA”.

2. Prahlada Rao BB, Mangala N and Amit Chauhan, “Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods for Sustained Performance”.

1

Page 11: ADCOM 2009 Conference Proceedings

Parallel Implementation of Video Surveillance Algorithms on GPU Architecture using NVIDIA CUDA

Sanyam Mehta‡, Arindam Misra‡, Ayush Singhal‡, Praveen Kumar†, Ankush Mittal‡, Kannappan Palaniappan†

‡Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, INDIA

†Department of Computer Science, University of Missouri-Columbia, USA

E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract: At present, high-end workstations and clusters are the commonly used hardware for the problem of real-time video surveillance. Through this paper we propose a real-time framework for a 640×480 frame size at 30 frames per second (fps) on an NVIDIA graphics processing unit (GPU) (GeForce 8400 GS) costing only Rs. 4000, which comes with many laptops and PCs. The processing of surveillance video is computationally intensive and involves algorithms like the Gaussian Mixture Model (GMM), morphological image operations and Connected Component Labeling (CCL). The challenges faced in parallelizing Automated Video Surveillance (AVS) were: (i) previous work had shown difficulty in parallelizing CCL on CUDA due to the dependencies between sub-blocks while merging; (ii) the overhead due to a large number of memory transfers reduces the speedup obtained by parallelization. We present an innovative parallel implementation of the CCL algorithm, overcoming the problem of merging. The algorithms scale well for small as well as large image sizes. We have optimized the implementations of the above mentioned algorithms and achieved speedups of 10X, 260X and 11X for GMM, morphological image operations and CCL respectively, as compared to the serial implementation, on the GeForce GTX 280.

Keywords: GPU, thread hierarchy, erosion, dilation, real time object detection, video surveillance.

1. Introduction

Automated Video Surveillance is a sector that is witnessing a surge in demand owing to the wide range of applications like traffic monitoring, security of public places and critical infrastructure like dams and bridges, preventing cross-border infiltration, identification of military targets and providing crucial evidence in the trials of unlawful activities [11][13]. Obtaining the desired frame processing rates of 24-30 fps in real time for such algorithms is the major challenge faced by developers. Furthermore, with the recent advancements in video and network technology, there is a proliferation of inexpensive network-based cameras and sensors for widespread deployment at any location. With the deployment of progressively larger systems, often consisting of hundreds or even thousands of cameras distributed over a wide area, video data from several cameras need to be captured, processed at a local processing server and transmitted to the control station for storage. Since there is an enormous amount of media stream data to be processed in real time, there is a great need for a High Performance Computing (HPC) solution to obtain an acceptable frame processing throughput.

The recent introduction of many parallel architectures has ushered in a new era of parallel computing for obtaining real-time implementations of video surveillance algorithms. Various strategies for parallel implementation of video surveillance on multi-cores have been adopted in earlier works [1][2], including our work on the Cell Broadband Engine [15]. Grid-based solutions have a high communication overhead, and cluster implementations are very costly.

The recent developments in GPU architecture have provided an effective tool to handle the workload. The GeForce GTX 280 GPU is a massively parallel, unified shader design consisting of 240 individual

2

Page 12: ADCOM 2009 Conference Proceedings

stream processors having a single precision floating point capability of 933 GFlops. CUDA enables new applications with a standard platform for extracting valuable information from vast quantities of raw data. It enables HPC on normal enterprise workstations and server environments for data-intensive applications, e.g. [12]. CUDA combines well with multi-core CPU systems to provide a flexible computing platform.

In this paper the parallel implementation of various video surveillance algorithms on the GPU architecture is presented. This work focuses on algorithms like (i) the Gaussian mixture model for background modeling, (ii) morphological image operations for image noise removal, and (iii) connected component labeling for identifying the foreground objects. In each of these algorithms, the different memory types and thread configurations provided by the CUDA architecture have been adequately exploited. One of the key contributions of this work is a novel algorithmic modification for parallelization of the divide and conquer strategy for CCL. While the speed-ups obtained with the GTX 280 (30 multiprocessors or 240 cores) were very significant, the corresponding speed-ups on the 8400 GS (2 multiprocessors or 16 cores) were sufficient to process the 640×480 sized surveillance video in real time. The scalability was tested by executing different frame sizes on both the GPUs.

2. GPU Architecture and CUDA

NVIDIA’s CUDA [14] is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems. The programmable GPU is a highly parallel, multi-threaded, many-core co-processor specialized for compute-intensive, highly parallel computation.

Fig. 1 Thread hierarchy in CUDA

The three key abstractions of CUDA are the thread hierarchy, shared memories and barrier synchronization, which render it a modest extension of C. All the GPU threads run the same code, are very lightweight and have a low creation overhead. A kernel can be executed by a one-dimensional or two-dimensional grid of multiple equally-shaped thread blocks. A thread block is a 1-, 2- or 3-dimensional group of threads, as shown in Fig. 1. Threads within a block can cooperate among themselves by sharing data through shared memory and synchronizing their execution to coordinate memory accesses. Threads in different blocks cannot cooperate, and each block can execute in any order relative to other blocks. The number of threads per block is therefore restricted by the limited memory resources of a processor core. On current GPUs, a thread block may contain up to 512 threads. The multiprocessor SIMT (Single Instruction Multiple Threads) unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
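As a concrete illustration of this hierarchy, the following minimal CUDA sketch (the kernel name, block size and the trivial per-pixel operation are our own choices, not from the paper) launches a two-dimensional grid of 16×16 thread blocks over an image and recovers each thread's global pixel coordinates:

    __global__ void perPixelKernel(unsigned char *img, int width, int height)
    {
        /* Global pixel coordinates from the block and thread indices. */
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 255 - img[y * width + x];  /* trivial per-pixel work */
    }

    /* Host side: one thread per pixel, 16x16 = 256 threads per block. */
    void launchPerPixel(unsigned char *d_img, int width, int height)
    {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        perPixelKernel<<<grid, block>>>(d_img, width, height);
    }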

The constant memory is useful primarily when all threads of a warp read the same memory location. The shared memory is on chip, and its accesses are 100x-150x faster than accesses to local and global memory. The shared memory, for high bandwidth, is divided into equal-sized memory modules called banks, which can be accessed simultaneously. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The banks are organized such that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per two clock cycles. For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16. The texture memory space is cached, so a texture fetch costs one memory read from device memory only on a cache miss; otherwise it just costs one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve the best performance. The local and global memories are not cached and their access latencies are high. However, coalescing in global memory significantly reduces the access time and is an important consideration (for compute capability 1.3, global memory accesses are more easily coalesced than in earlier versions). The CUDA 2.2 release also provides page-locked host memory, which helps in increasing the overall bandwidth when the memory is read or written exactly once. It can also be mapped into the device address space, so no explicit memory transfer is required.
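As a sketch of the page-locked, device-mapped host memory mentioned above (the buffer name and size are illustrative; the calls are standard CUDA 2.2-era runtime API):

    unsigned char *h_frame, *d_frame;
    size_t bytes = 640 * 480;

    /* Must be set before the CUDA context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked host memory, also mapped into the device address space,
       so data read or written exactly once needs no explicit cudaMemcpy. */
    cudaHostAlloc((void **)&h_frame, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_frame, h_frame, 0);

    /* ... kernels may now read/write d_frame, i.e. host memory, directly ... */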

3

Page 13: ADCOM 2009 Conference Proceedings

Fig. 2 The device memory space in CUDA

3. Our approach for the Video Surveillance Workload

A typical Automated Video Surveillance (AVS) workload consists of various stages like background modelling, foreground/background detection, noise removal by morphological image operations and object identification. Once the objects have been identified other applications can be developed as per the security requirements. Fig. 3 shows the multistage algorithm for a typical AVS system. The different stages and our approach to each of them are described as follows:

Fig. 3 A typical video surveillance system

3.1 Gaussian Mixture Model

Many approaches for background modelling like [4][5] have been proposed. Here, the Gaussian Mixture Model proposed by Stauffer and Grimson [3] is taken up, which assumes that the time series of observations at a given image pixel is independent of the observations at other image pixels. It is also assumed that these observations of the pixel can be modelled by a mixture of K Gaussians (currently, from 3 to 5 are used). Let x_t be the pixel value at time t. The probability that the pixel value x_t is observed at time t is then given by:

P(x_t) = \sum_{k=1}^{K} w_{k,t} \, \eta(x_t, \mu_{k,t}, \sigma_{k,t})    (1)

where w_{k,t}, \mu_{k,t} and \sigma_{k,t} are the weight, the mean and the standard deviation, respectively, of the k-th Gaussian of the mixture associated with the signal at time t. At each time instant t the K Gaussians are ranked in descending order of the w/\sigma value (the highest-ranked components represent the "expected" signal, or the background) and only the first B distributions are used to model the background, where

B = \arg\min_b \Big( \sum_{k=1}^{b} w_k > T \Big)    (2)

T is a threshold representing the minimum fraction of data used to model the background. As the parameters of each pixel change, determining which Gaussians are produced by the background process depends on the components with the most supporting evidence and the least variance; the variance for a new moving object that occludes the image is high, which can be easily checked from the value of \sigma.

GMM offers pixel-level data parallelism which can be easily exploited on the CUDA architecture, since the GPU consists of multi-cores which allow independent thread scheduling and execution, perfectly suited for independent per-pixel computation. So an image of size m × n requires m × n threads, implemented using appropriately sized blocks running on multiple cores. Besides this, the GPU architecture also provides shared memory, which is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts [14] between the threads. In order to avoid too many global memory accesses, the shared memory was utilised to store the arrays of the various Gaussian parameters. Each block has its own shared memory (up to 16 KB) which is accessible (read/write) to all its threads simultaneously; this greatly speeds up the computation on each thread since memory access time is significantly reduced. The value of K (number of Gaussians) is selected as 4, which not only results in effective coalescing [14] but also reduces bank conflicts. As shown in Table 1, the efficacy of coalescing is quite prominent.

The approach for GMM involves streaming (Fig. 4), i.e. processing the input frame using two streams, which

4

Page 14: ADCOM 2009 Conference Proceedings

allows for the memory copies of one stream to overlap with the kernel execution of the other stream.

Fig. 4 Algorithm depicting streaming in CUDA

Streaming resulted in a significant speed-up in the case of the 8400 GS, where the time for memory copies closely matched the time for kernel execution, while in the case of the GTX 280 the speed-up was not as significant, as the kernel execution took little time, being spread over 30 multiprocessors.
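In runtime-API terms, the two-stream overlap of Fig. 4 might be coded as in the following sketch (the function and variable names are ours; h_frame must be page-locked for cudaMemcpyAsync to be truly asynchronous):

    void processFrameStreamed(unsigned char *h_frame, unsigned char *d_frame,
                              size_t frameBytes, dim3 grid, dim3 block)
    {
        cudaStream_t stream[2];
        size_t half = frameBytes / 2;

        for (int i = 0; i < 2; ++i)
            cudaStreamCreate(&stream[i]);

        /* The copy issued in one stream overlaps the kernel running in the other. */
        for (int i = 0; i < 2; ++i) {
            cudaMemcpyAsync(d_frame + i * half, h_frame + i * half, half,
                            cudaMemcpyHostToDevice, stream[i]);
            /* launch the GMM kernel on this half in stream[i], e.g.
               gmmKernel<<<grid, block, 0, stream[i]>>>(...); */
        }
        cudaThreadSynchronize();  /* wait for both streams (CUDA 2.2-era API) */

        for (int i = 0; i < 2; ++i)
            cudaStreamDestroy(stream[i]);
    }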

3.2 Morphological Image Operations

After the identification of the foreground pixels from the image, there are some noise elements (like salt and pepper noise) that creep into the foreground image. They need to be removed in order to find the relevant objects by the connected component labelling method. This is achieved by the morphological image operations of erosion followed by dilation [6]. Each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbours, depending on the structuring element (Fig. 5) used. In case of dilation (denoted by ⊕) the value of the output pixel is the maximum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structuring element is set to the value 1, the output pixel is set to 1. With binary images, dilation connects areas that are separated by spaces smaller than the structuring element and adds pixels to the perimeter of each image object.

In erosion (denoted by ⊖), the value of the output pixel is the minimum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structuring element is set to the value 0, the output pixel is set to 0. With binary images, erosion completely removes objects smaller than the structuring element and removes perimeter pixels from larger image objects. This is described mathematically as:

A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\}    (3)

and

A \ominus B = \{\, z \mid (B)_z \subseteq A \,\}    (4)

where \hat{B} is the reflection of set B and (B)_z is the translation of set B by point z, as per the set-theoretic definition.

0 0 1 0 0
0 1 0 1 0
1 0 0 0 1
0 1 0 1 0
0 0 1 0 0

Fig. 5 A 5×5 structuring element

As the texture cache is optimized for 2-dimensional spatial locality, the 2-dimensional texture memory is used to hold the input image; this has an advantage over reading pixels from the global memory when coalescing is not possible. Also, the problem of out-of-bound memory references at the edge pixels is avoided by the cudaAddressModeClamp addressing mode of the texture memory, in which out-of-range texture coordinates are clamped to a valid range. Thus the need to check out-of-bound memory references by conditional statements never arose, preventing the warps from becoming divergent and adding a significant overhead.

Fig. 6 Approach for erosion and dilation

As shown in Fig. 6, a single thread is used to process two pixels. A half warp (16 threads) has a bandwidth of 32 bytes/cycle and hence 16 threads, each processing 2 pixels (2 bytes), use the full bandwidth

[Algorithm of Fig. 4:
for i = 1 to 2 do
    create stream i                                          // cudaStreamCreate
for each stream i do
    copy half the image from host to device in stream i      // cudaMemcpyAsync
for each stream i do
    kernel execution for stream i (half image processed)     // gmm<<<...>>>
cudaThreadSynchronize();]

5

Page 15: ADCOM 2009 Conference Proceedings

while writing back the noise-free image. This halves the total number of threads, thus reducing the execution time significantly. A structuring element of size 7×7 was used both in dilation and erosion. A straightforward convolution was done with one thread running on two neighbouring pixels. The execution times for the morphological image operations for the GTX 280 and the 8400 GS are shown in Table 2.
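A hedged sketch of a texture-based erosion kernel follows. For brevity it processes one pixel per thread (the paper processes two) and treats the structuring element as a full 7×7 square rather than the sparse element of Fig. 5; all names are ours:

    /* File-scope 2-D texture reference bound to the binary input image. */
    texture<unsigned char, 2, cudaReadModeElementType> texImg;

    __global__ void erodeKernel(unsigned char *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        unsigned char v = 1;
        /* Output pixel = minimum over the neighbourhood. Clamped addressing
           makes out-of-range reads safe, so no boundary branches are needed. */
        for (int dy = -3; dy <= 3; ++dy)
            for (int dx = -3; dx <= 3; ++dx) {
                unsigned char t = tex2D(texImg, (float)(x + dx), (float)(y + dy));
                if (t < v) v = t;
            }
        out[y * width + x] = v;
    }

    /* Host-side setup: clamp out-of-range coordinates and bind the image. */
    void bindInput(const unsigned char *d_in, int width, int height, size_t pitch)
    {
        texImg.addressMode[0] = cudaAddressModeClamp;
        texImg.addressMode[1] = cudaAddressModeClamp;
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
        cudaBindTexture2D(NULL, texImg, d_in, desc, width, height, pitch);
    }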

3.3 Connected Component Labelling

The connected component labelling algorithm works on a black and white (binary) image input to identify the various objects in the frame by checking pixel connectivity [8]. The image is scanned pixel by pixel (from top to bottom and left to right) in order to identify connected pixel regions, i.e. regions of adjacent pixels which share the same set of intensity values, and temporary labels are assigned. The connectivity can be either 4- or 8-neighbour connectivity (8-connectivity in our case). Then the labels belonging to the same object are placed in one equivalence class. After constructing the equivalence classes, the labels for the connected pixels are resolved by assigning the label of the equivalence class to all the pixels of that object.

Here the approach for parallelizing CCL on the GPU belongs to the class of divide and conquer algorithms [7]. The proposed implementation divides the image into small parts and labels the objects in those small parts. Then in the conquer phase the image parts are stitched back to see if the two adjoining parts have the same object or not.

For initial labelling the image was divided into N×N small regions and the sub-images were scanned pixel by pixel from left to right and top to bottom. These small regions were executed in parallel on different blocks (32×32 in case of 1024×1024 images).

Each pixel was labelled according to its connectivity with its neighbours. In case of more than one neighbour, one of the neighbours' labels was used and the rest were marked under one equivalence class. This was done similarly for all blocks running in parallel. The equivalence class array was stored in shared memory for each block, which saved a lot of memory access time. The whole image frame was stored in texture memory to reduce memory access time, as global memory coalescing was not possible due to the random but spatially local accesses.

Fig. 7 The connected components are assigned the maximum label after resolution

In earlier works on CCL like [9][10], the major limitation was that the sub-blocks into which the problem was broken had to be merged serially, the reason being that each sub-block had blobs with serial labels, and while merging any two connected sub-blocks the labels in all the other sub-blocks had to be modified, so clearly no parallelization was possible. A new approach to enable parallelization of CCL is presented in this paper. The code (as indicated in Fig. 7) labels the blobs (objects) independent of the other sub-blocks, but according to the CUDA thread ids (i.e. the 1st sub-block can label the blobs from 0 to 7, the 2nd sub-block can label the blobs from 8 to 15 and so on). So in this case no sub-block can detect more than 8 blobs (which is generally the case, but one may easily choose to have a higher limit). In order to avoid conflicts between sub-blocks, connected parts of the image lying in different regions were given the highest label from amongst the different labels in the different regions, as shown in Fig. 7; a code sketch of the label-range assignment follows below.

So, as a result of making the entire code 'portable on GPU', the speed-up obtained was enormous: the entire processing was split and made parallel to be executed on the GTX 280, resulting in the entire CCL (i.e. including merge) code being executed in just 2.4 milliseconds (Table 3) for a 1024×768 image, a speed-up of 11x as compared to the sequential code.

(Fig. 7 regions: Region 1, labels limited from 0 to 7; Region 2, labels limited from 8 to 15; Region 3, labels limited from 16 to 23; Region 4, labels limited from 24 to 31.)
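The disjoint per-region label ranges described above can be sketched as follows (the scan and merge logic is elided; the constant and names are ours):

    #define MAX_LABELS_PER_BLOCK 8   /* at most 8 blobs per sub-block, as above */

    __global__ void initialLabelKernel(int *labels, int width, int height)
    {
        /* Each CUDA block labels one sub-image independently, drawing labels
           from its own disjoint range so no two blocks can ever collide:
           block 0 uses 0..7, block 1 uses 8..15, and so on. */
        int blockId   = blockIdx.y * gridDim.x + blockIdx.x;
        int labelBase = blockId * MAX_LABELS_PER_BLOCK;

        /* Per-block equivalence classes, kept in fast shared memory. */
        __shared__ int equiv[MAX_LABELS_PER_BLOCK];

        /* ... scan the sub-image assigning labelBase + localLabel to each blob,
           recording equivalences in equiv[]; a later merge pass gives connected
           blobs that span regions the highest of their labels ... */
    }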

6

Page 16: ADCOM 2009 Conference Proceedings

4. EXPERIMENTAL RESULTS

The parallel implementation of the above-mentioned AVS workload was executed on two NVIDIA GPUs: the first GPU used is the GeForce GTX 280 on board a 3.2 GHz Intel Xeon machine with 1 GB of RAM; the second one was the GeForce 8400 GS on board a 2 GHz Intel Centrino Duo machine. The GTX 280 has a single precision floating point capability of 933 GFlops and a memory bandwidth of 141.7 GB/sec; it also has 1 GB of dedicated DDR3 video memory and consists of 30 multiprocessors with 8 cores each, hence a total of 240 stream processors. It belongs to compute capability 1.3, which supports many advanced features like page-locked host memory and features which take care of the alignment and synchronization issues. The 8400 GS has a memory bandwidth of 6.4 GB/sec and has two multiprocessors with 8 cores each, i.e. 16 stream processors, a single precision floating point capability of 28.8 GFlops and 128 MB of dedicated memory. It belongs to compute capability 1.2. The development environment used was Visual Studio 2005, and the CUDA profiler version 2.2 was used for profiling the CUDA implementation. The image sizes that have been used are 1600×1200, 1024×768, 640×480, 320×240 and 160×120. In the subsequent discussion we mention the results obtained for the image size of 1024×768 on the GTX 280 (30 multiprocessors).

Fig. 8 Execution times for GMM (speed-up = 10 for image size 320×240, as compared to sequential code)

The Gaussian Mixture Model, used for background modelling, has the kind of parallelism that is required for implementation on a GPU. As is evident from Fig. 8, the time of execution increases with the increase in image size and the amount of speedup achieved also increases almost proportionately; this is due to the execution of a large number of threads that keeps the GPU busy. Hence, a significant speedup of 10x has been achieved for the 320×240 image.

Shared memory was used to reduce the global memory accesses, keeping in view the shared memory size (16 KB). As can be seen from Table 1, a total of 4 blocks (192×4 threads out of a maximum of 1024 threads) could be executed in parallel on a multiprocessor, giving an occupancy of 0.75.

Table 1 CUDA Profiler output for 1024×768 image size

Function | Occupancy | Coalesced global memory loads | Coalesced global memory stores | Total branches | Divergent branches | Static memory per block | Total global memory loads | Total global memory stores
GMM    | 0.75 | 5682 | 826  | 829  | 27 | 3112 | 5682 | 347
GMM    | 0.75 | 5094 | 186  | 500  | 5  | 3112 | 5094 | 93
Erode  | 1    | 0    | 102  | 408  | 2  | 20   | 0    | 37
Dilate | 1    | 0    | 154  | 4975 | 31 | 20   | 0    | 77
CCL    | 0.25 | 2657 | 1902 | 4128 | 0  | 536  | 2657 | 823
Merge  | 0.25 | 9898 | 36   | 1936 | 2  | 1064 | 9898 | 18

Table 2 Execution times for erosion and dilation for a 3×3 structuring element

IMAGE SIZE | GTX 280 (ms) | 8400 GS (ms)
160×120    | 0.0445 | 0.120
320×240    | 0.0586 | 0.465
640×480    | 0.1254 | 1.75
1024×768   | 0.2429 | 3.61
1600×1200  | 0.5625 | 11.7

7

Page 17: ADCOM 2009 Conference Proceedings

As a result of using K = 4, all the global memory loads were coalesced, as can be seen from Table 1; there were also fewer bank conflicts. The use of streaming reduced the memory copy overhead, but not to the extent anticipated, due to the efficient memory copying in the GTX 280 (compute capability 1.3). This approach, however, was of great help in the 8400 GS (compute capability 1.2).

The morphological image operations contribute a major portion of the computational expense of the AVS workload. In our approach we are able to drastically reduce their execution time. The speedup scales with the image size both on the GTX 280 and the 8400 GS; the comparison of sequential code with the parallel implementation for the 1024×768 image size shows a significant speedup of 260X with a structuring element of 7×7. The time taken by the sequential implementation was 89.806 ms as compared to the 0.352 ms taken by the parallel implementation. For this image size we were able to unleash the full computational power of the GPU with an occupancy of 1 (i.e. neither shared memory nor the registers per multiprocessor were the limiting factors) on the GTX 280, as indicated in Table 1. Moreover, the use of texture memory and address clamp modes has reduced the percentage of divergent threads to <1%. On the 8400 GS also a significant speedup has been achieved.

(a) Input Image (b) Foreground Image (c) Image after noise removal (d) Output

Fig. 9 Image output of various stages of AVS

In CCL (i.e. CCL & Merge), 32×32 sized independent sub-blocks were assigned to each thread and 32 threads were run on one block (which was experimentally observed to be optimal). Since the maximum number of active blocks on a multiprocessor can be 8, the total number of active threads per multiprocessor was 256, and hence an occupancy of 0.25. The optimal parallelization of the CCL algorithm was significant in itself, as the parallelization of CCL on CUDA had not been reported and was deemed very difficult. Apart from the code being parallelized, the use of shared memory and then texture memory to store appropriate data led to significant increases in speedup. The use of texture memory not only prevented any warps from diverging by avoiding conditional statements (due to clamped accesses in texture memory) but also led to speedup due to the spatial locality of references in CCL. However, the implementation of CCL is block size dependent, which still remains a bottleneck.

Table 3 Execution times for CCL

IMAGE SIZE | GTX 280 (ms) | 8400 GS (ms)
160×120    | 0.106 | 0.522
320×240    | 0.220 | 1.34
640×480    | 1.256 | 4.5
1024×768   | 2.494 | 14.1
1600×1200  | 2.649 | 46.2

In each of the above kernels, page-locked host memory (a feature of CUDA 2.2) has been used whenever only one memory read and write were involved, which increased the memory throughput.

Architectures dedicated to video surveillance cost as much as lakhs of rupees, while the GeForce GTX 280 costs Rs. 17000 and the 8400 GS costs merely Rs. 4000. Even for an image size of 640×480, 30 frames per second could be processed on the 8400 GS; for an image size of 1024×768, close to 15 frames per second could be processed; and for images of smaller size 30 frames could easily be processed, as shown in Fig. 10.

Fig. 10 Comparison of total time for images of different sizes

8

Page 18: ADCOM 2009 Conference Proceedings

5. CONCLUSION AND FUTURE WORK

Through this paper, we describe the implementation of a typical AVS workload on the parallel architecture of NVIDIA GPUs to perform real-time AVS. The various algorithms, as described in the previous sections, are GMM for background modelling, morphological image operations for noise removal, and CCL for object identification. In our previous work [15] a detailed comparison has been done between the Cell BE and CUDA for these algorithms. During the implementation on the GPU architecture, major emphasis was given to selecting the thread configurations and the memory types for each kind of data, out of the numerous options available on the GPU architecture, so that the memory latency can be reduced and hidden. A lot of emphasis was given to memory coalescing and avoiding bank conflicts.

Efficient usage of the different kinds of memories offered by the CUDA architecture, and subsequent experimental verification, resulted in highly optimized implementations. As a result, a significant overall speed-up was achieved. Further testing and validation are ongoing. We have examined the performance only on the 8400 GS (2 multiprocessors) and the GTX 280 (30 multiprocessors) in this paper; hence a range of intermediate devices is yet to be explored. Our future work will include the implementation of the AVS workload on other GPU devices to examine the scalability, as well as comparison with other parallel architectures to get an idea of their viability as compared to the GPU implementation.

6. REFERENCES

[1] S. Momcilovic and L. Sousa, "A parallel algorithm for advanced video motion estimation on multicore architectures," Int. Conf. on Complex, Intelligent and Software Intensive Systems, pp. 831-836, 2008.
[2] M. D. McCool, "Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform," GSPx Multicore Applications Conference, 9 pages, 2006.
[3] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Proceedings CVPR, pp. 246-252, 1999.
[4] Z. Zivkovic, "Improved Adaptive Gaussian Mixture Model for Background Subtraction," in Proc. ICPR, vol. 2, pp. 28-31, 2004.
[5] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, "Wallflower: principles and practice of background maintenance," Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 255-261, 20-25 September 1999, Kerkyra, Corfu, Greece.
[6] H. Sugano and R. Miyamoto, "Parallel implementation of morphological processing on Cell/BE with OpenCV interface," Communications, Control and Signal Processing (ISCCSP 2008), pp. 578-583, 2008.
[7] J. M. Park, G. C. Looney and H. C. Chen, "A Fast Connected Component Labeling Algorithm Using Divide and Conquer," CATA 2000 Conference on Computers and Their Applications, pp. 373-376, Dec. 2000.
[8] R. Fisher, S. Perkins, A. Walker and E. Wolfart, Connected Component Labeling, 2003.
[9] K. P. Belkhale and P. Banerjee, "Parallel Algorithms for Geometric Connected Component Labeling on a Hypercube Multiprocessor," IEEE Transactions on Computers, vol. 41, no. 6, pp. 799-709, 1992.
[10] M. Manohar and H. K. Ramapriyan, "Connected component labeling of binary images on a mesh connected massively parallel processor," Computer Vision, Graphics, and Image Processing, 45(2):133-149, 1989.
[11] K. Dawson-Howe, "Active surveillance using dynamic background subtraction," Tech. Rep. TCD-CS-96-06, Trinity College, 1996.
[12] M. Boyer, D. Tarjan, S. T. Acton and K. Skadron, "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors," 2009.
[13] A. C. Sankaranarayanan, A. Veeraraghavan and R. Chellappa, "Object detection, tracking and recognition for multiple smart cameras," Proceedings of the IEEE, 96(10):1606-1624, 2008.
[14] NVIDIA CUDA Programming Guide, Version 2.2, pp. 10, 27-35, 75-97, 2009.
[15] P. Kumar, K. Palaniappan, A. Mittal and G. Seetharaman, "Parallel Blob Extraction using Multicore Cell Processor," Advanced Concepts for Intelligent Vision Systems (ACIVS) 2009, LNCS 5807, pp. 320-332, 2009.

9

Page 19: ADCOM 2009 Conference Proceedings

Adapting Traditional Compilers onto Higher Architectures incorporating Energy Optimization Methods

for Sustained Performance

Prahlada Rao B B, Mangala N, Amit K S Chauhan

Centre for Development of Advanced Computing (CDAC),
#1, Old Madras Road, Byappanahalli, Bangalore-560038, India
email: prahladab, [email protected]

ABSTRACT - Improvements in processor technology are offering benefits such as a large virtual address space, faster computations, non-segmented memory, higher precision etc., but require upgrading of system software to be able to exploit the benefits offered. The authors present the various tasks and constraints experienced in enhancing compilers from a 32-bit to a 64-bit platform. The paper describes various aspects, ranging from design changes to porting and testing issues, that have been dealt with while enhancing C-DAC's Fortran90 (CDF90) compiler to work with 64-bit architecture and I:8 support. Features supported by CDF90 for energy efficiency of code are presented. The regression testing carried out to test the I:8 support added in the compiler is discussed.

KEYWORDS: Compilers, Testing, Porting, LP64, CDF90, Optimizations

I. INTRODUCTION

Based on Moore's law, processors will continue to show an increase in speed and processing capability with time, and chip-making companies like Intel, AMD, IBM and Sun put in efforts to stay on the Moore curve of progress. Processor architectures are evolving with different techniques to gain in performance: dividing each instruction into a large number of pipeline stages, scheduling instructions in an order different from the order in which the processor received them, providing 64-bit architecture, providing multiple cores, etc. However, the benefits of all these components can be derived only if the system software is tailored to match the new processor. To keep pace with the rapid changes in processor technology, the existing system software codes are usually tweaked to exploit the offerings of the new architectures.

Domain experts in areas such as weather forecasting, climate modeling and atomic physics still continue to maintain large programs written in the Fortran language. Fortran, being popular in the scientific community, thus needs to be supported on new architectures to take advantage of advancements in the hardware. However, redesigning a compiler from scratch is a major task. Instead, modifying an existing compiler so that it generates code for newer targets is a common way to make compilers compliant with enhanced processors. This paper describes the important aspects of

adapting the Fortran90 compiler (CDF90) developed by C-DAC to higher architectures.

A. Overview of CDAC's Fortran90 Compiler

The highlights of the CDF90 compiler are that it supports both the F77 and F90 standards; it supports the Message Passing Interface (MPI) and mixed-language programming with C [6]. It also has an in-built Fortran77 to Fortran90 converter. It is available on AIX, Linux and Solaris. The CDF90 source code is written in the C language and comprises about 557 source code files with 190 kilo lines of code (KLOC). As with other traditional compilers, CDF90 includes the key phases of lexical analysis, syntax and semantic analysis, optimization and code generation. Yacc [9] is used for developing the syntax analysis modules in CDF90. Internally the context-free grammar is represented by a tree using AST (Abstract Syntax Tree) [7] notation. The tree can be traversed to generate intermediate code and also to carry out optimization transformations.

B. Traditional and Retargetable Methodologies

Advanced versions of the GNU Compiler Collection (GCC) offer many advantages over traditional compilers. Traditional compilers nevertheless remain important, for they provide the basic building block for adding sophisticated features according to requirements as they arise, or for research work carried out on top of existing compiler projects; this gives the flexibility of not writing everything from scratch.

In an attempt in the early 90's to make the compiler highly portable, the code generator module was replaced with a 'translator to C' in order to generate intermediate C code, since stable C compilers were available on the majority of platforms. Hence CDF90 acts as a converter, translating an input Fortran77/90 program into an equivalent C program which is then passed to some standard C compiler like gcc. CDF90 thus incorporates the traditional compiler development approach.

CDF90 offers various optimization techniques, such as Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function in-lining, for efficient storage

10

Page 20: ADCOM 2009 Conference Proceedings

and execution of the code, while gcc comes with more sophisticated and complex optimization procedures, such as inter-procedural analysis, for better performance. CDF90 takes the benefits of both approaches by converting the code into intermediate C code through the traditional approach and later passing the intermediate C code to the gcc compiler to exploit the benefits offered by advanced techniques. The various compilation approaches are depicted in Figure 1.

Figure 1. Different Compiler Approaches

II. CDF90 ARCHITECTURE

CDF90 is conventionally composed of the following main modules:

B. Lexer

This module converts a sequence of characters into a sequence of tokens, which is given as input to the Parser for construction of the Abstract Syntax Tree. This Abstract Syntax Tree is used in later modules of the compiler for further processing.

C. Parser

This module receives input in the form of sequential source program instructions, interactive commands, markup tags, or some other defined interface, and breaks them up into parts that can then be managed by other compiler components. The Parser gets the tokens from the Lexer and constructs a data structure, usually an Abstract Syntax Tree.

D. Optimizer

This module provides a suite of traditional optimization methods to minimize the energy cost of a program by minimizing memory access instructions and execution time. Optimization techniques like Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function in-lining are applied on the parse tree structure.

E. Translator

This module translates FORTRAN source code to correct, compilable and clean C source code. The Translator makes an in-order traversal of the full parse tree and replaces each FORTRAN construct by its corresponding C construct. The output of the translator module is a '.c' file. This .c file is passed to some standard C compiler to produce the final executable. The I/O libraries of CDF90 are linked using the '-l' option and passed to the linker 'ld'.
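As an illustration of the kind of output such a translation pass produces (this exact code is our construction, not CDF90's actual emission), a Fortran loop 'DO I = 1, N ... A(I) = A(I) + 1 ... END DO' could be emitted as:

    /* Hypothetical C emitted for:
           DO I = 1, N
              A(I) = A(I) + 1
           END DO                                          */
    void translated_loop(double *a, int n)
    {
        int i;
        for (i = 1; i <= n; i++)
            a[i - 1] = a[i - 1] + 1;   /* Fortran arrays are 1-based */
    }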

F. I/O Library

This module contains methods that are invoked by the intermediate C code (generated as the output of the translator module) with the help of the linker at link time.

Figure 2. Control Flow Graph for CDF90 Compiler

III. CONSIDERATIONS FOR MIGRATING CDF90 TO 64-BIT

A. Need for Migration of CDF90

64-bit architectures have been gaining popularity for over a decade, with a promise of higher accuracy and speed through the use of 64-bit registers and 64-bit addressing. The advantages offered by 64-bit processors are:

- Large virtual address space
- Non-segmented memories
- 64-bit arithmetic
- Faster computations
- Removal of certain system limitations

A study was taken up to understand the feasibility, impact and effort of migrating the existing Fortran90 compiler. Considering the extensive features offered by CDF90 and the advantages of enhancing it for higher-bit processors, it was decided to enhance the existing compiler to support the LP64 model; this would require reasonable changes mainly in the parser, translator and library modules.

[Diagram text for Figures 1 and 2. Figure 1 contrasts three compiler approaches (Traditional, Portable, Retargetable): each has machine-independent stages (lexical analyzer, syntax + semantic analyzer, optimizer; the Portable approach adds a translator to a commonly used language (C/C++) for portability, and the Retargetable approach an intermediate code generator) followed by machine-dependent stages (a code generator and assembler for the target architecture, compilation using a popular compiler, or code generation using .md and .rtl files for the target architecture) that yield executables for the targets. Figure 2 shows the CDF90 control flow: a Fortran77/90 application (.f or .f90) passes through the Lexer (Tokens), Parser (AST), Optimizer (Optimized AST) and Translator (Intermediate C Code) to the C compiler (gcc), which links in the I/O Library to produce the executable file (XCOFF/XCOFF-64bit).]

11

Page 21: ADCOM 2009 Conference Proceedings

B. Data Model Standards Followed

For higher precision and better performance, newer architectures have been designed to support various data models. The three basic models that are supported by most of the major vendors on 64-bit platforms are LP64, ILP64 and LLP64 [5].

LP64 (also known as 4/8/8) denotes int as 4 bytes, long and pointer as 8 bytes each.

ILP64 (also known as 8/8/8) means int, long and pointers are 8 bytes each.

LLP64 (also known as 4/4/8) keeps int and long at 4 bytes, adds a new type (long long) and makes pointers 64-bit types.

Many 64-bit compilers support the LP64 [5] data model, including gcc and xlc on the AIX 5.3 platform.

CDF90 acts as a front-end and depends upon gcc/xlc for backend compilation. Since gcc/xlc follow the LP64 data model for 64-bit compilation on the AIX 5.3 platform, 64BitCDF90 also needs to follow the same data model. Hence the LP64 data model has been adopted for 64-bit compilation on the 64-bit AIX5.3/POWER5 platform.
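A small C check of the kind used to confirm which data model is active (illustrative, not from the paper):

    #include <stdio.h>

    int main(void)
    {
        /* Under LP64 (e.g. gcc -maix64 on AIX 5.3) this prints 4/8/8;
           under the default 32-bit ILP32 mode it prints 4/4/4. */
        printf("int=%zu long=%zu ptr=%zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        return 0;
    }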

C. Approach Followed for Migration

Adding 8-byte integer (I:8) support to the CDF90 compiler was required to enjoy the benefit of faster computing with larger integer values. In order to implement this in CDF90, various implicit FORTRAN library functions needed to be modified and new functions written to support 8-byte integer computations. Various changes were also required in different compiler modules, which are described in a later part of the paper.

FORTRAN applications passed to CDF90 are translated to C code; hence it was required to consider the data model of the underlying C compiler as well. gcc 4.2 follows the LP64 data model for 64-bit compilation, and for 32-bit compilation it uses the ILP32 data model even on the 64-bit AIX platform. Hence the LP64 data model is suitable in this situation.

Most 64-bit processors support both 32-bit and 64-bit execution modes [10]. Hence 64BitCDF90 also needs to provide both 32-bit and 64-bit compilation support through 32-bit and 64-bit compilation libraries, though the compiler executable itself may be 32-bit only. Hence it was identified that two different libraries need to be prepared for the 32-bit and 64-bit compilation environments.

The Fortran77/90 executable file format needs to be changed to XCOFF64 when compiled by 64BitCDF90 on a 64-bit platform using 64-bit libraries. 64-bit CDF90 APIs need to be generated for 64-bit compilation of any application, though the CDF90 executable may be 32-bit only. Note that the same approach is followed by gcc, which uses a 32-bit executable for 64-bit compilation of any application file passed to it, through the use of 64-bit library files.

64BitCDF90 needs to be validated for 64-bit architecture compliance against the existing test cases, along with newly added test cases specific to 8-byte integer support.

IV. EFFECT OF LP64 DATA MODEL ON CDF90

Some fundamental changes occur when moving from the ILP32 data model to the LP64 data model; they are listed as follows:

o long and pointers are no longer 32 bits in size. Direct or indirect assignment of an int to a long or pointer value is no longer valid.

o For 64-bit compilation, CDF90 needs to use 64-bit library archives. It also needs to supply 64-bit specific flags to the backend compiler so that it operates in 64-bit mode.

o System derived types such as size_t, time_t and ptrdiff_t are 64 bits wide in 64-bit compilation environments. Hence these values must not be held in or assigned to 32-bit variables.
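As a minimal C illustration of the kind of code these rules flag (our sketch, not taken from the CDF90 sources):

/* Illustrative only: typical ILP32 assumptions that break under LP64
   (e.g. on 64-bit AIX). */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    long big = 1L << 40;      /* fits in an 8-byte LP64 long              */
    int truncated = (int)big; /* int is still 4 bytes: upper bits lost    */

    int *p = &truncated;
    /* Under ILP32 the next line was common practice; under LP64 it would
       truncate an 8-byte pointer to a 4-byte int, so it must be removed:
       int addr = (int)p;                                                 */

    size_t n = sizeof(p);     /* size_t is 64-bit: never store it in int  */
    printf("sizeof(long)=%zu sizeof(int)=%zu sizeof(void*)=%zu\n",
           sizeof(big), sizeof(truncated), n);
    return 0;
}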

V. DESIGN/IMPLEMENTATION CHANGES FOR 64-BIT ENHANCEMENTS

64BitCDF90 is able to perform 64-bit compilation correctly on the 64-bit platform with I:8 support after carrying out the following tasks, described in two phases below.

A. Migration from 32-bit to 64-bit

Major porting concerns can be summarized as below:

o Pointer size changes to 64-bit. All direct or implied assignments or comparisons between "integer" and "pointer" values have been examined and removed.

o Long size changes to 64-bit. All casts that allow the compiler to accept assignment and comparison between "long" and "integer" have been examined to ensure validity.

o Code has been updated to use the new 64-bit APIs, and hence the executable generated after 64-bit compilation is 64-bit compliant.

o Macros depending on the 32-bit layout have been adjusted for the 64-bit environment.

o A variety of other issues, like data truncation, that can arise from sign extension, memory allocation sizes, shift counts, array offsets and other factors, have to be handled with extreme care.

The user has the option to select between the 32-bit and 64-bit APIs. If a 64-bit compiler flag is used while compiling (e.g. the '-maix64' flag for backend compilation with gcc 4.2), then the 64-bit API is linked and the 64-bit object file format is generated. If the user does not use any 64-bit flag, then by default the 32-bit API is linked to the application and the 32-bit object file format is generated. Code has been added in the compiler to select either the 32-bit or the 64-bit API depending upon whether the user has supplied 64-bit specific flags or not.
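A minimal sketch of this selection logic (the function and archive names here are our illustration, not the actual CDF90 code):

/* Sketch: choose the 32-bit or 64-bit library archive depending on
   whether the user passed a 64-bit flag such as -maix64. */
#include <string.h>

const char *pick_library(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], "-maix64") == 0)
            return "libcdf90_64.a";   /* 64-bit API: 64-bit XCOFF output */
    return "libcdf90_32.a";           /* default: 32-bit API and XCOFF   */
}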

Page 22: ADCOM 2009 Conference Proceedings

64-bit library files are generated by compiling the source code using 'gcc 4.2' in 64-bit mode with the '-maix64' option on the AIX5.3/Power5 platform to produce 64-bit XCOFF file formats. These 64-bit object files are passed to the 'ar' tool with the '-X64' flag to produce 64-bit library archives.

The compiler code, which is written in C, is compiled using gcc 4.2 on the 64-bit architecture (AIX5.3/Power5) without any 64-bit specific flags, and by default this generates a 32-bit executable for the CDF90 compiler. The same CDF90 executable can be linked to the 64-bit library API to compile applications passed to it in 64-bit mode and to generate object files in the 64-bit XCOFF file format. The same 32-bit CDF90 executable can also be linked to the 32-bit API to compile applications in 32-bit mode and to generate the 32-bit XCOFF object file format on the same 64-bit platform (AIX5.3/POWER5).

The 32-bit library archives are generated by compiling with 'gcc 4.2' and then generating the archive files with the 'ar' tool, without any 64-bit specific flags, on the AIX5.3 platform with the PowerPC_Power5 architecture.

B. Adding I:8 Support in 64BitCDF90

The 32-bit CDF90 compiler supports the following KIND values for Integer data types in a Fortran77/90 program:

TABLE 1
KIND VALUES FOR INTEGERS

KIND Value    Size in bytes
1             1
2             2
3             4

The compiler has been enhanced to allow KIND value 4, for which the size of the Integer is 8 bytes. Modifications were performed in the following modules to achieve the desired results.

Lexer: Code has been added to identify INTEGER*8 as a valid token.

Parser: The parser code has been modified so that it correctly adds the symbol (I: 8) to the AST generated after the parser phase. All other programming constructs, like functions, macros, data types etc., dealing with the 8-byte integer size are also updated in the parser module to transform into the correct AST structure.

Translator: If an integer variable has KIND value 4 in an input Fortran77/90 program, the translator should be able to declare and translate the symbol into the corresponding C code symbol. Corresponding code has been added in the translator module. Functionality has also been added there to correctly convert implicit FORTRAN library functions dealing with KIND value 4 integer data types to their corresponding C library function names, based on conditional type checking for 8-byte integer data types. After successful conversion, the function call is dumped into an intermediate .c file. The C functions dealing with I:8 data types are called from the CDF90 libraries.

E.g. the translator is now able to internally translate the implicit matmul(a, b) function, where a and b are matrices with 8-byte integer elements, into _matmul_i8i8_22(long long **a, long long **b), which is an intermediate C library function that carries out the actual multiplication. This translation is carried out on the basis of integer size conditional checking. Most of these changes have been carefully debugged with the help of gdb6.5.
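For illustration, the generated C around such a call might look roughly as follows. This is a sketch under our assumptions; only the _matmul_i8i8_22 name and its argument types come from the paper, the return type and surrounding code are hypothetical:

/* Fortran:  INTEGER*8 :: a(n,n), b(n,n)
             c = matmul(a, b)
   Generated C (sketch): KIND=4 integers map to C's long long (8 bytes). */
extern long long **_matmul_i8i8_22(long long **a, long long **b);

void translated_call(long long **a, long long **b, long long ***c)
{
    *c = _matmul_i8i8_22(a, b);  /* the CDF90 library routine does the work */
}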

I/O Library: There are two libraries supported by 64BitCDF90. One library is used for 32-bit mode and the other is linked for 64-bit mode compilation. Most of the library functions in the 32-bit CDF90 library have been modified or added to handle 64-bit integer data types, thereby creating the 64-bit CDF90 library. Hence the 64-bit CDF90 library can handle 8-byte integers for most of the functions it contains. The translator module generates corresponding C code that invokes these library functions, which handle the 8-byte integer data size. E.g. the _matmul_i8i8_(long long **a, long long **b) library function has been added in the CDF90 library to compute matrix-matrix multiplications whose elements are 8-byte integers.

Building Libraries: The 32-bit library archives have been created by compiling the source code in 32-bit mode, while the 64-bit library archives have been prepared by compiling the source code in 64-bit mode on the 64-bit platform. More specifically, the 64-bit libraries have been prepared by compiling the source code using gcc4.2 with the '-maix64' flag and passing the output to the archive tool 'ar' with the '-X64' flag to create the 64-bit library archives for 64BitCDF90.

VI. CODE OPTIMIZATIONS: ENERGY EFFICIENT METHODS OFFERED BY CDF90

The energy cost of a program depends upon the number of memory accesses [15]. A large number of memory accesses results in high energy dissipation. Energy efficient compilers reduce memory accesses and thereby reduce the energy consumed by these accesses. One approach is to reduce the LOAD and STORE instructions and store data in registers or cache memory, providing faster and more efficient computation and reducing the CPU cycles significantly.
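For instance, keeping a running value in a scalar that the backend can allocate to a register, instead of re-reading memory on every iteration, removes LOADs and STOREs. A small C illustration (ours, not from the paper):

/* Before: sum[0] is loaded and stored on every iteration. */
void accumulate_slow(long long *a, long long *sum, int n)
{
    for (int i = 0; i < n; i++)
        sum[0] = sum[0] + a[i];
}

/* After: the accumulator lives in a register; one store at the end. */
void accumulate_fast(long long *a, long long *sum, int n)
{
    long long t = sum[0];
    for (int i = 0; i < n; i++)
        t += a[i];
    sum[0] = t;
}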

According to Kremer [22], "Traditional compiler optimizations such as common subexpression elimination, partial redundancy elimination, strength reduction, or dead code elimination increase the performance of a program by reducing the work to be done during program execution". There are also several other compiler strategies for energy reduction, such as compiler assisted Dynamic Voltage Scaling [20], instruction scheduling to reduce functional unit utilization, memory bank allocation, and inter-program optimizations [19], which are being tried by researchers.

Reducing the power dissipation of processors is also being attempted at the hardware level and in the operating system. Examples include power aware techniques for on-chip buses, transactional memory, memory banks with low power modes [16,17] and workload adaptation [18,23]. However, energy reduction through the compiler promises some advantages: no overhead at execution time, the ability to assess 'future' program behavior through aggressive whole program analysis, and the ability to identify optimizations and make code transformations for reduced energy usage [24].

Page 23: ADCOM 2009 Conference Proceedings

A. Optimizations in the CDF90 Compiler

Programs compiled without any optimization generally run very slowly, consume more CPU cycles and result in higher energy consumption. A medium level of optimization (-O2 on many machines) typically leads to a speed-up by a factor of 2-3 without a significant increase in compilation time. The different types of optimizations performed by CDF90 to improve program performance are listed below.

i. Common Sub-expression Elimination

The compiler takes a common sub-expression out of several expressions and calculates it only once instead of several times:

t1=a+b-c
t2=a+b+c

An optimizing compiler reduces the above two statements to:

t=a+b
t1=t-c
t2=t+c

Though this approach may not exhibit any significant performance enhancement for smaller expressions, for bigger expressions it shows certain performance enhancements.

ii. Strength Reduction

Strength reduction replaces an existing arithmetic expression by an equivalent expression that can be evaluated faster. A simple example is replacing 2*i by i+i, since integer addition is faster than integer multiplication.
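In C terms (our illustration, not from the paper), the same idea:

/* Strength reduction: replace a multiply with cheaper operations. */
int times_two_mul(int i)             { return 2 * i; }  /* original        */
int times_two_add(int i)             { return i + i; }  /* reduced: add    */
unsigned times_two_shift(unsigned i) { return i << 1; } /* reduced: shift  */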

iii. Loop Invariant Code

Code that does not depend on the loop iteration is removed from the loop and calculated only once instead of on every iteration. For example, the FORTRAN code

do i=1,n
  a(i)=m*n*b(i)
end do

will be replaced by an optimizing compiler with

t=m*n
do i=1,n
  a(i)=t*b(i)
end do

iv. Constant Value Propagation

An expression involving several constant values will be calculated and replaced by a new constant value. For example,

x=2*y
z=3*x*s

will be transformed by an optimizing compiler into

z=6*y*s

v. Register Allocation and Instruction Scheduling

This particular optimization is the most difficult and also the most important. Since CDF90 depends upon gcc for backend compilation, it is left entirely to the backend compiler.

B. Code Transformation Criteria and Applied Techniques

CDF90 offers the following loop optimization techniques.

i. Loop Interchange

Loop interchange is the process of exchanging the order of two iteration variables. It can often be used to enhance the performance of code on parallel or vector machines. Determining when loops may be safely and profitably interchanged requires a study of the data dependences in the program. A loop interchange mechanism has been implemented in CDF90. It contains a function which checks whether the loops can be interchanged: the loops can be interchanged if they are perfectly nested; if one of the loop bounds depends on the index variable of some other loop, the loops cannot be interchanged; and the loops must be totally independent, so that no invalid computation results after interchange. Based on the output of this check, loop interchange is performed by CDF90 supported functions.

For example, consider the FORTRAN matrix addition loop below:

DO I = 1, N
  DO J = 1, M
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO

The loop accesses the arrays A, B and C row by row, which, in FORTRAN, is very inefficient. Interchanging the I and J loops, as shown below, facilitates column-by-column access:

DO J = 1, M
  DO I = 1, N
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO


Page 24: ADCOM 2009 Conference Proceedings

ii. Loop Vectorization

Loop vectorization is the conversion of loops from a non-vectored form to a vectored form. A vectored form is one in which the same operation happens on all members of a range (SIMD) without dependencies. In practice the ranges should be in contiguous memory areas, because this makes it easy for the processor to know where to apply the computation next; conceptually, however, it could be any set of data. The only built-in data type in C and FORTRAN with contiguous memory is the array.
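A short C illustration (ours): the first loop has no cross-iteration dependency and can be vectorized; the second cannot, because each iteration reads the previous one's result:

void vectorizable(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];    /* same op on a contiguous range: SIMD-friendly */
}

void not_vectorizable(float *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + 1.0f;  /* loop-carried dependency blocks vectorization */
}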

iii. Loop Merging

Loop merging is another technique implemented by CDF90 to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.
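For example (our C sketch), two adjacent loops over the same range whose bodies do not reference each other's data:

/* Before merging: loop overhead is paid twice over the same range. */
void separate(float *a, float *b, int n)
{
    for (int i = 0; i < n; i++) a[i] = a[i] * 2.0f;
    for (int i = 0; i < n; i++) b[i] = b[i] + 1.0f;
}

/* After merging: one loop, bodies combined. */
void merged(float *a, float *b, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * 2.0f;
        b[i] = b[i] + 1.0f;
    }
}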

iv. Loop Unrolling

Loop unrolling duplicates the body of the loop multiple times in order to decrease the number of times the loop condition is tested and the number of jumps, which degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time.

Loop unrolling is designed to unroll loops for parallelizing and optimizing compilers. To illustrate, consider the following loop:

for (i = 1; i <= 60; i++)
    a[i] = a[i] * b + c;

This loop can be transformed into the following equivalent loop consisting of multiple copies of the original loop body (note the braces, which keep all three statements inside the loop):

for (i = 1; i <= 60; i += 3) {
    a[i]   = a[i]   * b + c;
    a[i+1] = a[i+1] * b + c;
    a[i+2] = a[i+2] * b + c;
}

The loop is said to have been unrolled twice (two additional copies of the body were added), and the unrolled loop runs faster because of the reduction in loop overhead.

Loop unrolling was initially developed for reducing loop overhead and for exposing instruction level parallelism for machines with multiple functional units.

Loop unrolling is limited by the number of available registers. If too few registers are available, the number of LOAD/STORE instructions, and hence the number of memory accesses, increases.

When loop unrolling leads to performance loss because of frequent LOADs/STOREs, the loop is split into two, a procedure called loop fission. This can be done in a straightforward manner if two independent code segments are present in the loop.

v. Loop Distribution

Loop distribution is used for transforming a sequential program into a parallel one. It attempts to break a loop into multiple loops over the same index range, each taking only a part of the original loop's body. This can improve locality of reference, both of the data being accessed in the loop and of the code in the loop's body.
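A C sketch of the transformation (ours, not from the paper):

/* Before distribution: one loop with two independent statement groups. */
void fused(float *a, float *b, const float *x, const float *y, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = x[i] * 2.0f;    /* group 1: touches only a and x */
        b[i] = y[i] + 3.0f;    /* group 2: touches only b and y */
    }
}

/* After: two loops over the same index range, each with part of the body.
   Each loop has a smaller working set (better locality), and the two loops
   can be run in parallel since they share no data. */
void distributed(float *a, float *b, const float *x, const float *y, int n)
{
    for (int i = 0; i < n; i++) a[i] = x[i] * 2.0f;
    for (int i = 0; i < n; i++) b[i] = y[i] + 3.0f;
}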

vi. Function Inlining

Function inlining is a powerful high-level optimization which eliminates call cost and increases the chances of other optimizations taking effect, due to the breaking down of call boundaries.

By declaring a function inline, you can direct CDF90 to integrate that function's code into the code for its callers. This makes execution faster by eliminating the function-call overhead. The effect on code size is less predictable; object code may be larger or smaller with function inlining, depending on the particular case.
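A small C illustration of the effect (ours):

static inline float scale(float v, float f) { return v * f; }

float sum_scaled(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += scale(a[i], 2.0f);  /* inlined: no per-element call overhead,
                                    and the body is exposed to loop
                                    optimizations such as vectorization */
    return s;
}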

VII. TESTING 64BitCDF90 ON 64-BIT ARCHITECTURE

Compilers are used to generate software for systems where correctness is important [4], and testing [2] is needed for quality control and error detection in compilers.

A. Challenges

One of the challenges encountered in the CDF90 project was the non-availability of a free (open source) standard test suite to test the 64-bit compiler (especially the I:8 check). Hence a test suite to test all the functionalities of Fortran 77/90 was developed (DPTESTCASES). This was not an easy task, as the test suite developer needed to be aware of all the features of the language.

B. Test Suites Used

Two suites of test programs, FCVS [8] and DPTESTCASES, are used, and whenever the compiler is modified, the test programs are compiled using both the new and old versions of the compiler. Any differences in the target programs' output are reported back to the development team.

i. FORTRAN Compiler Validation System (FCVS)

FCVS, developed by NIST, is a standard test suite for FORTRAN compiler validation and is used to validate 64BitCDF90 against FORTRAN77/90 applications. Script files are written to compile and execute each and every test case present in the test suite. After running the script files, one can see the result/status for FCVS.

ii. DPTESTCASES

Unfortunately, FCVS does not contain test cases to test 8-byte integer support in various FORTRAN language constructs. Hence a large set of small and specific test cases, called DPTESTCASES, was developed, covering the complete set of language constructs with 8-byte integer support. These tests verify each compiler construct with 8-byte integer support: data types and declarations, specification statements, control statements, IO statements, intrinsic functions, subroutine functions, and boundary checks for integer constants.

Page 25: ADCOM 2009 Conference Proceedings

Shell scripts are written to compile all test cases, execute them and check the compiler warnings and error messages against expected results of DPTESTCASES.

The test package has been modified in several ways, like supplying large values in possible test cases for boundary value analysis and checking for correct results. This approach to testing proved helpful in verifying compiler capabilities.

C. Testing Approach

The approach followed for testing 64BitCDF90 can be described by the following main points:

i. Testing of 64BitCDF90 in 32-bit Mode on a 64-bit Machine

Execute each test case with the 64BitCDF90 compiler in 32-bit mode and check for correct results. This check is required to see whether the modifications performed in the compiler code for 64-bit compatibility have caused any adverse side effects. In 32-bit mode the compiler internally uses 32-bit library archives and generates 32-bit XCOFF on the AIX5.3 platform. FCVS is used for testing in 32-bit mode.

ii. Testing of 64BitCDF90 in 64-bit Mode on a 64-bit Machine

Execute each test case with the enhanced CDF90 compiler in 64-bit mode and check for correct results. This check is required to see whether the enhanced compiler successfully compiles each test case in the 64-bit environment. Both FCVS and DPTESTCASES are used to test 64-bit compatibility and 8-byte integer support in various FORTRAN language constructs. In 64-bit mode the compiler internally uses 64-bit library archives and generates 64-bit XCOFF on the AIX5.3/Power5 platform.

iii. White Box Testing

White-box testing is the execution of the maximum number of accessible code branches with the help of a debugger or other means; the more code coverage achieved, the fuller the testing provided [13]. For the large code base of 64BitCDF90, it is not easy to traverse the full code. Hence white-box testing methods are used when an error is found with test cases and we need to find the reason that caused it.

iv. Black Box Testing

Unit testing may be treated as black-box testing. The main idea of the method consists in writing a set of tests for separate modules and functions of Fortran 90, which test all the main constructs of Fortran.

v. Functional Testing

64BitCDF90 was tested along with the original CDF90 on the same test cases with the intent of finding defects, demonstrating that defects are not present, verifying that each module performs its intended functions with integers of up to 8 bytes (Fortran supports integers of 1, 2, 4 and 8 bytes), and establishing confidence that the program does what it is supposed to do.

vi. Regression Testing

Similar in scope to a functional test, a regression test allows a consistent, repeatable validation of each new release of the compiler against new requirements. Such testing ensures that reported compiler defects have been corrected for each new release and that no new quality problems were introduced in the maintenance process. Regression testing can be performed manually.

vii. Static Analysis of the CDF90 Code

The code size being very large, we needed a suitable static analyser which supports the LP64 data model for 64-bit compilation on a 64-bit platform. We found lint [12] most suitable; it offered the following advantages:

o Warns about incorrect, error-prone or nonstandard code that the compiler does not necessarily flag.

o Points out potential bugs and portability problems.

o Assists in improving the source code's effectiveness, including reducing its size and required memory.

Of the various options provided by lint, -errchk=longptr64 in particular proved very helpful for migrating CDF90 to the 64-bit platform.

D. Testing Statistics

i. 32-bit Compilation

FCVS is compiled by 64BitCDF90 in 32-bit mode on the 64-bit machine (AIX5.3/Power5). 198 out of the total 200 test cases are successful; the two failing test cases are being worked on. DPTESTCASES, consisting of 210 test cases, are compiled successfully with 64BitCDF90 and produce correct results.

ii. 64-bit Compilation

FCVS is compiled in 64-bit mode by setting the 64-bit flag (the '-maix64' flag when using gcc4.2 for backend compilation) to obtain 64-bit executable files. The same two test cases mentioned above are unsuccessful in 64-bit mode as well. All the test cases present in DPTESTCASES compile successfully and produce correct results.

VIII. CONCLUSIONS

Major hardware vendors have recently shifted to 64-bit processors because of the performance, precision, and scalability that 64-bit platforms can provide. The constraints of 32-bit systems, especially the 4GB virtual memory ceiling, have spurred companies to consider migrating to 64-bit platforms.

Page 26: ADCOM 2009 Conference Proceedings

This paper presents some important issues faced while porting a compiler from 32-bit to 64-bit, such as adding a new language feature (the I:8 data type) to an existing compiler and ensuring that this feature is supported by the range of already existing library functions. The testing methodology is elaborated. Performance optimizations contributing to the energy efficiency of the code are explained in the paper. The authors present the traditional optimizations implemented in CDF90; other energy-saving optimizations are being explored and shall be presented in future reports.

REFERENCES

[1] Alfred V. Aho and Jeffrey D. Ullman, "Principles of Compiler Design", Tenth Indian Reprint, Pearson Education, 2003.
[2] Jiantao Pan, "Software Testing", 18-849b Dependable Embedded Systems, Carnegie Mellon University, Spring 1999. http://www.ece.cmu.edu/~koopman/des_s99/sw_testing/
[3] R. A. DeMillo, E. W. Krauser, A. P. Mathur, "An Overview of Compiler-integrated Testing", Australian Software Engineering Conference 1991: Engineering Safe Software; Proceedings, 1991.
[4] A. S. Kossatchev and M. A. Posypkin, "Survey of Compiler Testing Methods", Programming and Computer Software, Vol. 31, No. 1, 2005, pp. 10-19.
[5] Harsha S. Adiga, "Porting Linux applications to 64-bit systems", 12 Apr 2006. http://www.ibm.com/developerworks/library/l-port64.html
[6] http://cdac.in/html/ssdgblr/f90ide.asp
[7] http://www.cocolab.com/en/cocktail.html
[8] http://www.fortran2000.com/ArnaudRecipes/cvs21_f95.html
[9] Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler", July 31, 1978. http://www.cs.man.ac.uk/~pjj/cs211/yacc/yacc.html
[10] Cathleen Shamieh, "Understanding 64-bit PowerPC architecture", 19 Oct 2004. http://www.ibm.com/developerworks/library/pamicrodesign/
[11] Steven Nakamoto and Michael Wolfe, "Porting Compilers & Tools to 64 Bits", Dr. Dobb's Portal, August 01, 2005.
[12] "lint Source Code Checker", C User's Guide, Sun Studio 11, 819-3688-10.
[13] Andrey Karpov, Evgeniy Ryzhkov, "Traps detection during migration of C and C++ code to 64-bit Windows". http://www.viva64.com/content/articles/64-bit-development/?f=TrapsDetection.html=en&content=64-bit-development
[14] http://www.itl.nist.gov/div897/ctg/fortran_form.htm
[15] Stefan Goedecker, Adolfy Hoisie, "Performance Optimization of Numerically Intensive Codes", Society for Industrial and Applied Mathematics, 2001.
[16] Tali Moreshet, R. Iris Bahar, Maurice Herlihy, "Energy Reduction in Multiprocessor Systems Using Transactional Memory", ISLPED'05, USA.
[17] Cao Y., Okuma, Yasuura, "Low-Energy Memory Allocation and Assignment Based on Variable Analysis for Application-Specific Systems", IEIC Technical Report, pp. 31-38, Japan, 2002.
[18] Changjiu Xian, Yung-Hsiang Lu, "Energy reduction by workload adaptation in a multi-process environment", Proceedings of the conference on Design, Automation and Test in Europe, 2006.
[19] J. Hom and U. Kremer, "Inter-Program Optimizations for Conserving Disk Energy", International Symposium on Low Power Electronics and Design (ISLPED'05), San Diego, California, August 2005.
[20] C-H. Hsu and U. Kremer, "Compiler-Directed Dynamic Voltage Scaling Based on Program Regions", Rutgers University Technical Report DCS-TR461, November 2001.
[21] Wei Zhang, "Compiler-Directed Data Cache Leakage Reduction", IEEE Computer Society Annual Symposium on VLSI, ISVLSI'04.
[22] U. Kremer, "Low Power/Energy Compiler Optimizations", Low-Power Electronics Design (Editor: Christian Piguet), CRC Press, 2005.
[23] Majid Sarrafzadeh, Prithviraj Banerjee, Alok Choudhary, Andreas Moshovos, "PACT: Power Aware Compilation and Architectural Techniques", University of California, Los Angeles, Dept. of Computer Science.
[24] U. Kremer, "Compilers for Power and Energy Management", Tutorial, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03), San Diego, CA, June 2003.


Page 27: ADCOM 2009 Conference Proceedings

ADCOM 2009
SERVER VIRTUALIZATION

Session Papers:

1. J. Lakshmi, S. K. Nandy, "Is I/O Virtualization ready for End-to-End Application Performance?" – INVITED PAPER

2. S. Prakki, "Eco-friendly Features of a Data Centre OS" – INVITED PAPER


Page 28: ADCOM 2009 Conference Proceedings


Is I/O virtualization ready for End-to-End Application Performance?

J. Lakshmi, S. K. Nandy
Indian Institute of Science, Bangalore, India
jlakshmi,[email protected]

Abstract: Workload consolidation using the system virtualization feature is the key to many successful green initiatives in data centres. In order to exploit the available compute power, such systems warrant sharing of other hardware resources like memory, caches, I/O devices and their associated access paths among multiple threads of independent workloads. This mandates the need for ensuring end-to-end application performance. In this paper we explore the current practices for I/O virtualization, using sharing of the network interface card as the example, with the aim to study the support for end-to-end application performance guarantees. To ensure end-to-end application performance and limit interference caused due to the sharing of devices, we present an evaluation of a previously proposed end-to-end I/O virtualization architecture. The architecture is an extension to the PCI-SIG IOV specification of I/O hardware to support reconfigurable device partitions and uses a VMM-bypass technique for device access by the virtual machines. Simulation results of the architecture for application quality of service guarantees demonstrate the flexibility and scalability of the architecture.

Keywords – Multicore server virtualization, I/O virtualization architectures, QoS, Performance.

I. Introduction

Multi-core servers have brought tremendous computing capacity to commodity systems. These multi-core servers have not only prompted applications to use fine-grained parallelism to take advantage of the abundance of CPU cycles, they have also initiated the coalescing of multiple independent workloads onto a single server. Multicore servers combined with system virtualization have led to many successful green initiatives of data centre workload consolidation. This consolidation, however, needs to satisfy end-to-end application performance guarantees. Current virtualization technologies have evolved from the prevalent single-hardware single-OS model, which presumes the availability of all other hardware resources to the currently scheduled process. This causes performance interference among multiple independent workloads sharing an I/O device, based on the individual workloads. Major efforts towards consolidation have focussed on aggregating the CPU cycle requirements of the target workloads. But I/O handling of these workloads on the consolidated servers results in sharing of the physical resources and their associated access paths. This sharing causes interference that is dependent on the consolidated workloads and makes the application performance non-deterministic [1][2][3]. In such scenarios, it is essential to have appropriate mechanisms to define, monitor and ensure resource sharing policies across the various contending workloads. Many applications, like real-time hybrid voice and data communication systems onboard aircraft and naval vessels, streaming and on-demand video delivery, and database and web-based services, when consolidated onto virtualized servers, need to support soft real-time application deadlines to ensure performance.

Standard I/O devices are not virtualization aware, and hence their virtualization is achieved using a software layer that multiplexes device access to independent VMs. In such cases I/O device virtualization commonly follows two basic modes, namely para-virtualization and emulation [4]. In para-virtualization the physical device is accessed and controlled through a protected domain, which could be the virtual machine monitor (VMM) itself or an independent virtual machine (VM), also called the independent driver domain (IDD) as in Xen. The VMM or IDD actually does the data transfer to and from the device into its I/O address space using the device's native driver. From there the copy or transfer of the data to the VM's address space is done using what is commonly called the para-virtualized device driver. The para-virtualized driver is specifically written to support a specific mechanism of data transfer between the VMM/IDD and the VM, and needs a change in the OS of the VM (also called the GuestOS).

In emulation, the GuestOS of the VM installs a device driver for the emulated virtual device. All calls of this emulated device driver are trapped by the VMM and translated to the native device driver's calls. The advantage of emulation is that it allows the GuestOS to be unmodified and is hence easier to adopt. However, para-virtualization has been found to be much better in performance when compared to emulation. This is because emulation results in per-instruction translation, whereas para-virtualization involves only page-address translation. But both these modes of device virtualization impose resource overheads when compared to non-virtualized servers. These overheads translate into application performance loss.

Page 29: ADCOM 2009 Conference Proceedings

The second drawback of the existing architectures is their lack of sufficient Quality of Service (QoS) controls to manage device usage on a per-VM basis. A desirable feature of these controls is that they should guarantee application performance with specified QoS on the shared device, and this performance should be unaffected by the workloads sharing the device. The other desirable feature is that the unused device capacity should be available for use by the other VMs. Prevalent virtualization technologies like Xen and Vmware, and even standard Linux distributions, use a software layer within the network stack to implement NIC usage policies. Since these systems were built with the assumption of the single-hardware single-OS model, these features provide the required control on the outgoing traffic from the NIC of the server. The issue is with the incoming traffic. Since the policy management is done above the physical layer, ingress traffic accepted by the device is later dropped based on input stream policies. This results in the respective application not receiving the data, which perhaps satisfies the application QoS, but causes wasted use of the device bandwidth, which affects the delivered performance of all the applications sharing the device. Also, it leads to non-deterministic performance that varies with the type of applications using the device. This model is insufficient for the virtualized server supporting sharing of the NIC across multiple VMs. In this paper we describe and evaluate an end-to-end I/O virtualization architecture that addresses these drawbacks.

The rest of the paper is organized as follows. Section II presents experimental results on existing virtualization technologies, namely Xen and Vmware, that motivate this work; Section III then describes an end-to-end I/O virtualization architecture to overcome the issues raised in Section II; Section IV details the evaluation of the architecture and presents the results; Section V highlights the contributions of this work with respect to existing literature; and Section VI details the conclusions.

II. Motivation

Existing I/O virtualization architectures use extra CPU cycles to fulfill an equivalent I/O workload. These overheads reflect in the achievable application performance, as depicted by the graph of Figure 1. The data in this graph represents the achievable throughput of the httperf [5] benchmark hosted on a non-virtualized server and on virtualized Xen [6] and Vmware-ESXi [7] servers. In each case the http server was hosted on a Linux (FC6) OS, and for the virtualized servers the hypervisor, the IDD (Xen) [8] and the virtual machine were pinned to use the same physical CPU. The server used was a dual core Intel Core2Duo system with 2GB RAM and a 10/100/1000Mbps NIC. In the Xen hypervisor the virtual NIC used by the VM was configured to use a para-virtualized device driver implemented using the event channel mechanism and a software bridge for creating virtual NICs. In the case of the Vmware hypervisor the virtual NIC used inside the VM was configured using a software switch with access to the device through emulation.

Figure 1: httperf benchmark throughput graph for non-virtualized, Xen and Vmware-ESXi virtual machines hosting an http server on Linux (FC6). 1

As can be observed from the graphs of Figure 1, the sustainable throughput of the benchmark drops considerably when the http server is moved to a virtualized server, compared to the non-virtualized server. The reason for this drop is answered by the CPU utilization graph depicted in Figure 2. From the graphs we notice that, on moving the http server from the non-virtualized to the virtualized server, the %CPU utilization to support the same httperf workload increases significantly, and this increase is substantial for the emulated mode of device virtualization. The reason for this increased CPU utilization is the I/O device virtualization overheads.

Further, when the same server is consolidated with two VMs sharing the same NIC, each supporting one stream of an independent httperf benchmark, there is a further drop of achievable throughput per VM. This is explicable since each VM now contends for the same NIC. The virtualization mechanisms share not only the device but also the device access paths. This sharing causes serialization, which leads to latencies and application performance loss that depends on the nature of the consolidated workloads. Also, the increased latencies in supporting the same I/O workload on the virtualized platform cause loss of usable device bandwidth, which further reduces the scalability of device sharing by multiple VMs.

1 Some data of the graphs has been reused from [9][10].

Page 30: ADCOM 2009 Conference Proceedings

Figure 2: CPU resource utilized by the http server to support the httperf benchmark throughput. 1

This scalability can be improved to some extent by pinning different VMs to independent cores and using a high speed, high bandwidth NIC. Still, the high virtualization overheads coupled with serialization due to shared access paths restrict device sharing scalability.

The next study is on evaluating the NIC specific QoS controls existing in Xen and Vmware. Since current NICs do not support QoS controls, these are provided by the OS managing the device. In either case, these controls are implemented in the network stack above the physical device. Because of this they are insufficient, as is displayed by the graphs in Figure 3 and Figure 4.

Figure 3: httperf achievable throughput on a Vmware-ESXi consolidated server with NIC sharing and QoS guarantees.

Figure 4: httperf achievable throughput on a Xen consolidated server with NIC sharing and QoS guarantees.

Each of the graphs shows the maximum achievable throughput of VM1 when VM2 is constrained by a specified QoS guarantee. This guarantee is the maximum throughput that VM2 should deliver and is implemented using the network QoS controls available in Xen and Vmware servers. In Xen these QoS controls are implemented using the tc utilities of the netfilter module of the Linux OS of the Xen IDD. In Vmware, Veam enabled network controls are used. VM1 and VM2 are two different VMs hosted on the same server sharing the same NIC. We observe that for the unconstrained VM, in this case VM1, the maximum achievable throughput does not exceed 200 for Vmware and 475 for Xen. This is considerably low when compared to the maximum achievable throughput for a single VM using the NIC. The reason is that the constrained VM, namely VM2, is receiving all requests. VM2 is also processing these requests and generating appropriate replies, which results in CPU resource consumption. Only some replies to the received requests are dropped, based on the currently applicable QoS on the usable bandwidth. This is because both Vmware and Xen support QoS controls on the egress traffic at the NIC. This approach of QoS control on resource usage is wasteful and coarse-grained. As can be observed, as the constraint on VM2 is relaxed, the behaviour of NIC sharing approaches best effort, and the resulting throughput achievable by either VM is obviously less than what can be achieved when a single VM is hosted on the server. These graphs clearly demonstrate the insufficiency of the existing QoS controls.

From the above experiments we conclude the following drawbacks in the existing I/O virtualization architectures:

o Building hypervisors or VMMs using the single-hardware single-OS model leads to cohesive architectures with high virtualization overheads. The high virtualization overheads cause loss of usable device bandwidth. This often results in under-utilized resources and limited consolidation ratios, particularly for I/O workloads. The remedy is to build I/O devices that are virtualization aware and to decouple device management from device access, i.e., provide native access to the I/O device from within the VM and allow the VMM to manage concurrency issues rather than ownership issues.

o Lack of fine-grained QoS controls on device sharing causes performance loss that is dependent on the workloads of the VMs sharing the device. This leads to scalability issues in sharing the I/O device. To address this, the I/O device should support QoS controls for both the incoming and outgoing traffic.

Page 31: ADCOM 2009 Conference Proceedings

To overcome the above drawbacks, we propose an I/O virtualization architecture. This architecture proposes an extension to the PCI-SIG IOV specification [11] for virtualization enabled hardware I/O devices, with a VMM-bypass [12] mechanism for virtual device access.

III. End-to-End I/O Virtualization Architecture

We propose an end-to-end I/O virtualization architecture that enables direct or native access to the I/O device from within the VM, rather than accessing it through the layer of the VMM or IDD. The PCI-SIG IOV specification proposes virtualized I/O devices that can support native device access by the VM, provided the hypervisor is built to support such architectures. IOV specified hardware can support multiple virtual devices at the hardware level. The VMM needs to be built such that it can recognize and export each virtual device to an independent VM, as if the virtual device were an independent physical device. This allows native device access to the VM. When a packet hits the hardware virtualized NIC, the VMM recognizes the destination VM of the incoming packet by the interrupt raised by the device and forwards it to the appropriate VM. The VM processes the packet as it would in a non-virtualized environment. Here, device access and the scheduling of device communication are managed by the VM that is using the device. This eliminates the intermediary VMM/IDD on the device access path and reduces I/O service time, which improves the usable device bandwidth and application throughput.

To support the idea of QoS based on device usage, we extend the IOV architecture specification by enabling reconfigurable memory on the I/O device. For each virtual device defined on the physical device, the device memory associated with the virtual device is derived from the QoS requirement of the VM to which the virtual device is allocated. This, along with features like TCP offload, virtual device priority and bandwidth specification support at the device level, provides fine-grained QoS controls at the device while sharing it with other VMs, as is elaborated upon in the evaluation section.

Figure 5 gives a block schematic of the proposed I/O virtualization architecture.2 The picture depicts a NIC card that can be housed within a virtualized server. The card has a controller that manages the DMA transfers to and from the device memory. The standard device memory is replaced by a re-partitionable memory supported with n sets of device registers. A set of m memory partitions, where m ≤ n, with device registers forms the virtual NICs (vNICs). Ideally the device memory should be reconfigurable, i.e. dynamically partitionable, and the VM's QoS requirements would drive the sizing of the memory partition. The advantage of having a dynamically partitionable device memory is that any unused memory can be easily extended into or reduced from a vNIC in order to match adaptive QoS specifications. The NIC identifies a vNIC request by generating message signaled interrupts (MSI). The number of interrupts supported by the controller restricts the number of vNICs that can be exported. Based on the QoS guarantees a VM needs to honour, judicious use of native and para-virtualized access to the vNICs can overcome this limitation. A VM that has to support stringent QoS guarantees can choose to use native access to the vNIC, whereas VMs that are looking for best-effort NIC access can be allowed para-virtualized access to the vNIC.

Figure 5: NIC architecture supporting MSI interrupts with partitionable device memory, multiple device register sets and DMA channels enabling independent virtual NICs.

2 This section is being reproduced from [10] to maintain continuity in the text. The complete architecture description, with performance statistics on achievable application throughput, can be found in [10].


Page 32: ADCOM 2009 Conference Proceedings


The VMM can aid in setting up the appropriate hosting connections based on the requested QoS requirements.

The proposed architecture can be realized by the following modifications:

Virtual-NIC: In order to define a vNIC, the physical device should support time-sharing in hardware. For a NIC this can be achieved by using MSI and dynamically partitionable device memory. These form the basic constructs to define a virtual device on a physical device, as depicted in Figure 5. Each virtual device has a specific logical device address, like the MAC address in the case of NICs, based on which the MSI is routed. Dedicated DMA channels, a specific set of device registers and a partition of the device memory are part of the virtual device interface which is exported to a VM when it is started. We call this virtual interface the virtual-NIC or vNIC; it forms a restricted address space on the device for the VM to use and remains in the possession of the VM as long as the VM is active or until it relinquishes the device.

Accessing the virtual-NIC: For accessing the virtual-NIC, a native device driver is hosted inside the VM and is initialized with the help of the VMM when the VM is initialized. This device driver can only manipulate the restricted device address space which was exported through the vNIC interface by the VMM. With the vNIC, the VMM only identifies and forwards the device interrupts to the destination VM. The OS of the VM now handles the I/O access and thus can be held accountable for the resource usage it incurs. This eliminates the performance interference due to the VMM/IDD handling multiple VMs' requests to/from a shared device. Also, because the I/O access is now directly done by the VM, the service time on the I/O access reduces, thereby resulting in better bandwidth utilization. With the vNIC interface, data transfer is handled by the VM. While initializing the device driver for the virtual NIC, the VM sets up the Rx/Tx descriptor rings within its address space and makes a request to the VMM to initialize the I/O page translation table. The device driver uses this table and performs DMA transfers directly into the VM's address space.

QoS and the virtual-NIC: The device memory partition acts as a dedicated device buffer for each of the VMs, and with appropriate logic on the NIC card one can easily implement QoS based SLAs on the device that translate to bandwidth restrictions and VM specific priority. The key is being able to map the incoming packet to the corresponding VM, which the NIC is now expected to do. While communicating, the NIC controller decides whether to accept or reject the incoming packet based on the bandwidth specification or the virtual device's available memory. This gives fine-grained control on the incoming traffic and helps reduce the interference effects. The outbound traffic can be controlled by the VM itself using any of the mechanisms of the existing architectures. (A schematic sketch of these per-vNIC constructs follows this list.)
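As a hypothetical C sketch (ours, not from the paper) of the per-vNIC state the text describes, with an ingress admission check at the device; all names and field choices are our assumptions:

/* Sketch of a vNIC: a device-memory partition, a register set identity,
   an MSI vector, and QoS attributes consulted on packet reception. */
#include <stdbool.h>
#include <stdint.h>

struct vnic {
    uint8_t  mac[6];              /* logical device address for MSI routing */
    uint32_t msi_vector;          /* interrupt identifying this vNIC        */
    uint32_t mem_size, mem_used;  /* device-memory partition and its usage  */
    uint32_t bw_limit_kbps;       /* QoS: ingress bandwidth specification   */
    uint32_t bw_used_kbps;        /* usage in the current accounting window */
    uint8_t  priority;            /* VM specific priority                   */
};

/* Ingress admission control at the device: accept a packet only if the
   destination vNIC has both buffer space and bandwidth budget left, so a
   constrained VM's excess traffic is dropped before wasting NIC or CPU
   cycles, and unused capacity stays available to the other vNICs. */
bool vnic_accept(struct vnic *v, uint32_t pkt_bytes, uint32_t pkt_kbps)
{
    if (v->mem_used + pkt_bytes > v->mem_size)         return false;
    if (v->bw_used_kbps + pkt_kbps > v->bw_limit_kbps) return false;
    v->mem_used     += pkt_bytes;
    v->bw_used_kbps += pkt_kbps;
    return true;
}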

IV. Architecture Evaluation for QoS Controls

The proposed architecture was modeled using a Layered Queuing Network (LQN) model, and service times for the various entries of the model were obtained by using runtime profilers on an actual Xen based virtualized server. Complete model building and validation details are available in [9][10]. Here we present the results of the QoS evaluation carried out using the LQN model [13] of the proposed architecture. The QoS experiments were conducted along the same lines as described in the motivation section. The difference now is that the QoS control is applied on the ingress traffic of the constrained VM, namely VM2. The results obtained are depicted in Figure 6. The proposed architecture allows for achieving higher application throughput on the shared NIC, firstly because of the VMM-bypass [12]. Also, as can be observed from the graphs, the control of ingress traffic in the case of the httperf benchmark shows a highly improved performance benefit to the unconstrained VM, namely VM1.

Figure 6: httperf throughput sharing on a QoS controlled, shared NIC between two VMs using the proposed architecture, with throughput constraints applied on the ingress traffic of VM2 at the NIC.

Page 33: ADCOM 2009 Conference Proceedings

For request-response kinds of benchmarks like httperf, controlling the ingress bandwidth is beneficial because once a request is dropped due to saturation of the allocated bandwidth, there is no downstream activity associated with it, and wasteful resource utilization of the NIC and CPU is avoided. The QoS control at the device on the input stream of VM2, and the native access to the vNICs by the VMs, give the desired flexibility of making the unused bandwidth available to the unconstrained VM.

V. Related Work

In early implementations, I/O virtualization adopted dedicated I/O device assignment to a VM. This later evolved to device sharing across multiple VMs through virtualized software interfaces [14][4]. A dedicated software entity, called the I/O domain, is used to perform physical device management. The I/O domain is either part of the VMM or is by itself an independent domain, like the IDD of Xen [8][15]. With this intermediary software layer between the device and the VM, any application in a VM seeking access to the device has to route the request through it. This architecture still builds over the single-hardware single-OS model [16]-[21]. The consequence of such virtualization techniques is visible in the loss of application throughput and usable device bandwidth on virtualized servers, as discussed earlier. Because of the poor performance of these I/O virtualization architectures, a need to build concurrent access to shared I/O devices was felt, and recent publications on concurrent direct network access (CDNA) [22][19] and on a scalable self-virtualizing network interface describe such efforts. However, the scalable self-virtualizing interface [23] describes assigning a specific core for network I/O processing on the virtual interface and exploits multiple cores on embedded network processors for this. The paper does not detail how the address translation issues are handled, particularly in the case of virtualized environments. The CDNA architecture is similar to the proposal in this paper in terms of allowing multiple VM specific Rx and Tx device queues. But CDNA still builds over the VMM/IDD handling the data transfer to and from the device. Although the results of this work are exciting, the architecture still lacks the flexibility required to support fine-grained QoS. And the paper neither discusses the performance interference due to uncontrolled data reception by the device, nor highlights the need for addressing the QoS constraints at the device level. The architecture proposed in this paper addresses these issues, and also pushes the basic constructs to assign QoS attributes, like required bandwidth and priority, into the device to get finer control on resource usage and on restricting performance interference.

The proposed architecture has its basis in the exokernel's [24] philosophy of separating device management from protection. In exokernel, the idea was to extend native device access to applications, with the exokernel providing the protection. In our approach, the extension of native device access is to the VM, the protection being managed by the VMM. A VM is assumed to be running a traditional OS. Further, the PCI-SIG community has realized the need for I/O device virtualization and has come out with the IOV specification to deal with it. The IOV specification, however, details device features to allow native access to virtual device interfaces, through the use of I/O page tables, virtual device identifiers and virtual device specific interrupts. The specification presumes that QoS is a software feature and does not address it. Many implementations adhering to the IOV specification are now being introduced in the market by Intel [25], Neterion [26], NetXen [27], Solarflare [28], etc. The CrossBow [29] suite from Sun Microsystems talks about this kind of resource provisioning, but it is a software stack over standard IOV compliant hardware. The results published using any of these products are exciting in terms of the performance achieved, but almost all of them have ignored the control of reception at the device level. We believe that the lack of such a control on highly utilized devices will cause non-deterministic application performance loss and under-utilization of the device bandwidth.

VI. Conclusion

In this paper we described how the lack of virtualization awareness in I/O devices leads to latency overheads on the I/O path. Added to this, the intermixing of device management and data protection issues further increases the latency, thereby reducing the effective usable bandwidth of the device. Also, the lack of appropriate device sharing control mechanisms at the device level leads to loss of bandwidth and performance interference among the VMs sharing the device. To address these issues we proposed an I/O device virtualization architecture, as an extension to the PCI-SIG IOV specification, and demonstrated its benefits through simulation techniques. The results demonstrate that by moving the QoS controls to the shared device, the unused bandwidth is made available to the unconstrained VM, unlike the case in prevalent technologies. The proposed architecture also improves the scalability of VMs sharing the NIC because it eliminates the common software entity which regulates I/O device sharing. The other advantage is that, with this architecture, the maximum resource utilization is now accounted for by the VM. Also, this architecture reduces the workload interference on sharing a device and simplifies the consolidation process.


Page 34: ADCOM 2009 Conference Proceedings


References

[1] M. Welsh and D. Culler, "Virtualization considered harmful: OS design directions for well-conditioned services", Hot Topics in OS, 8th Workshop, 2001.
[2] Kyle J. Nesbit, James E. Smith, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, "Multicore Resource Management", IEEE Micro, Vol. 28, Issue 3, pp. 6-16, 2008.
[3] Kyle J. Nesbit, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and James E. Smith, "Virtual Private Machines: Hardware/Software Interactions in the Multicore Era", IEEE Micro special issue on Interaction of Computer Architecture and Operating System in the Manycore Era, May/June 2008.
[4] Scott Rixner, "Breaking the Performance Barrier: Shared I/O in virtualization platforms has come a long way, but performance concerns remain", ACM Queue – Virtualization, Jan/Feb 2008.
[5] D. Mosberger and T. Jin, "httperf: A Tool for Measuring Web Server Performance", ACM Workshop on Internet Server Performance, pp. 59-67, June 1998.
[6] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield, "Xen and the art of virtualization", 19th ACM SIGOPS, Oct. 2003.
[7] "VMware ESX Server 2 - Architecture and Performance Implications", 2005. http://www.vmware.com/pdf/esx2_performance_implications.pdf
[8] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson, "Safe hardware access with the Xen virtual machine monitor", 1st Workshop on OASIS, Oct 2004.
[9] J. Lakshmi, S. K. Nandy, "Modeling Architecture-OS Interactions using Layered Queuing Network Models", International Conference Proceedings of HPC Asia, March 2009, Taiwan.
[10] J. Lakshmi, S. K. Nandy, "I/O Device Virtualization in the Multi-core Era, a QoS Perspective", Workshop on Grids, Clouds and Virtualization, International Conference on Grids and Pervasive Computing, Geneva, May 2009.
[11] PCI-SIG IOV Specification. http://www.pcisig.com/specifications/iov
[12] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High performance VMM-bypass I/O in virtual machines", Proceedings of the USENIX Annual Technical Conference, June 2006.
[13] Layered Queueing Network Solver software package. http://www.sce.carleton.ca/rads/lqns/
[14] T. von Eicken and W. Vogels, "Evolution of the virtual interface architecture", Computer, 31(11), 1998.
[15] J. Sugerman, G. Venkatachalam, and B. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor", Proceedings of the USENIX Annual Technical Conference, June 2001.
[16] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat, "Enforcing performance isolation across virtual machines in Xen", in M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pp. 342-362, Springer, 2006.
[17] Weng, C., Wang, Z., Li, M., and Lu, X., "The hybrid scheduling framework for virtual machine systems", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (Washington, DC, USA, March 11-13, 2009).
[18] Kim, H., Lim, H., Jeong, J., Jo, H., and Lee, J., "Task-aware virtual machine scheduling for I/O performance", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (Washington, DC, USA, March 11-13, 2009).
[19] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment", Proceedings of the ACM/USENIX Conference on Virtual Execution Environments, June 2005.
[20] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen", Proceedings of the USENIX Annual Technical Conference, June 2006.
[21] Santos, J. R., Janakiraman, G., Turner, Y., Pratt, I., "Netchannel 2: Optimizing network performance", Xen Summit Talk, November 2007.
[22] Willmann, P., Shafer, J., Carr, D., Menon, A., Rixner, S., Cox, A. L., Zwaenepoel, W., "Concurrent direct network access for virtual machine monitors", Proceedings of the International Symposium on High-Performance Computer Architecture, February 2007.
[23] H. Raj and K. Schwan, "Implementing a scalable self-virtualizing network interface on a multicore platform", Workshop on the Interaction between Operating Systems and Computer Architecture, Oct. 2005.
[24] M. Frans Kaashoek, et al., "Application Performance and Flexibility on Exokernel Systems", 16th ACM SOSP, Oct. 1997.
[25] Intel Virtualization Technology for Directed I/O. www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm
[26] Neterion. http://www.neterion.com/
[27] NetXen. http://www.netxen.com/
[28] Solarflare Communications. http://www.solarflare.com/
[29] CrossBow: Network Virtualization and Resource Control. http://www.opensolaris.org/os/community/networking/crossbow_sunlabs_ext.pdf


Page 35: ADCOM 2009 Conference Proceedings

Eco-Friendly Features of a Data Center OS

Surya Prakki, Sun Microsystems, Bangalore, India

[email protected]

Abstract—This paper presents the technologies a modern operating system like OpenSolaris offers to help data centers become more eco-friendly. It starts with the virtualization technologies in OpenSolaris that drive consolidation of systems, thus reducing the system footprint in a data center. It then introduces the power-aware dispatcher (PAD) and how it plays well with the various processor-supported power states, and moves on to observability tools that show how well the system is using the power management features.

Keywords-virtualization; consolidation; green computing;

I. INTRODUCTION

Data centers are becoming the backbone of the success of any enterprise. Enterprises offer a lot of services over the web, whether it be bank transactions, booking of tickets, or maintaining social relationships. This is leading to automation of more and more services, resulting in ever more computers being deployed in a data center: server sprawl. And as these computers get more and more powerful, their energy requirements have gone up; with increasing energy costs, data centers are impacting both the ecology and the economics of running them.

This brings a very interesting challenge to modern OS developers:

How to move away from the traditional approach of hosting one service on one computer to running multiple services on a single computer, and address the following challenges in doing so:

o Isolation

o Minimizing the overheads

o Meeting peak load requirements (QoS)

o Secure execution

o Meeting different patch requirements of different applications

o Heterogeneous work loads

o Testing and Development

o Enforcing resource controls

o Observability

o Fail over through replication

o Supporting legacy applications

o Simple administration

How to reduce the energy requirements of an idling computer?

The first problem, workload consolidation, can be addressed using virtualization.

There is no silver bullet virtualization technology that can address all of the above challenges.

II. OPERATING SYSTEM LEVEL VIRTUALIZATION

If the requirement is to consolidate homogeneous workloads (applications compiled for the same platform, i.e., the same operating system), it is preferable to opt for OS level virtualization. For the rest of the discussion let us look at the Zones technology in OpenSolaris, which provides this feature, and see how it solves the above challenges.

A. Zones

Zones provide a very low overhead mechanism to virtualize operating system services, allowing one or more processes to run in isolation from other activity on the system. The kernel exports a number of distinct objects that can be associated with a particular zone, like processes, file system mounts, network interfaces (I/F) etc. No zone can access objects belonging to another zone. This isolation prevents processes running within a given zone from monitoring or affecting processes running in other zones. Thus a zone is a 'sandbox' within which one or more applications can run without affecting or interacting with the rest of the system.

As the underlying kernel is the same, physical resources are multiplexed across zones, improving the effective utilization of the system.

The first zone that comes up on installing an OpenSolaris system is referred to as the 'global zone', and all non-global zones need to be configured and installed from this global zone.
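As a rough illustration of that workflow, the sketch below drives the zonecfg(1M) and zoneadm(1M) utilities from Python; the zone name and zonepath are hypothetical, and the exact subcommand syntax may vary across OpenSolaris releases.

    import subprocess

    def run(cmd: str) -> None:
        """Run one administrative command, aborting on failure."""
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Configure a zone named 'web01' (name and path are illustrative).
    run("zonecfg -z web01 'create; set zonepath=/zones/web01; commit'")

    # Install and boot it from the global zone.
    run("zoneadm -z web01 install")
    run("zoneadm -z web01 boot")

    # The new zone should now be listed as running.
    run("zoneadm list -cv")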

The Zones framework provides an abstraction layer that separates applications from physical attributes of the machine on which they are deployed, such as physical device paths and network I/F names. This enables, for example, multiple web servers running in different zones to connect to the same port using the distinct IP addresses associated with each zone.

OpenSolaris has broken down the privileges associated with a root-owned process into finer-grained ones, and zones inherit a subset of these privileges(5), thus making sure that even if a zone is compromised, the intruder can't do any


Page 36: ADCOM 2009 Conference Proceedings

damage to the rest of the system. Another outcome of this is, even privileged processes in non-global zones are prevented from performing operations that can have system-wide impact.

Administration of individual zones can be delegated to others, knowing that any actions taken by them would not affect the rest of the system.

Zones do not present a new API or ABI to which applications need to be 'ported', i.e., existing OpenSolaris applications run inside zones without any changes or recompilation. A process running inside a zone runs natively on the CPU and hence doesn't incur any performance penalty.

A zone, or multiple zones, can be tied to a resource pool, which groups CPUs and a scheduler. The resources associated with a pool can be changed dynamically. This enables the global administrator to give more resources to a zone when it is nearing its peak demand. The Fair Share Scheduler (FSS) can be associated with a pool. Using FSS, a physical CPU assigned to multiple zones via a resource pool can be shared as per their entitlements, even in the face of the most demanding workloads, thus guaranteeing quality of service (QoS). The physical memory and swap memory are also configured per zone and can be changed dynamically to meet the varying demands of a zone.
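A minimal sketch of adjusting such an entitlement on the fly, assuming the zone.cpu-shares resource control exposed through prctl(1M); the zone names and share values are purely illustrative.

    import subprocess

    def set_cpu_shares(zone: str, shares: int) -> None:
        """Change a running zone's FSS entitlement via prctl(1M).
        '-r' replaces the current value of the zone.cpu-shares control."""
        subprocess.run(
            ["prctl", "-n", "zone.cpu-shares", "-r", "-v", str(shares),
             "-i", "zone", zone],
            check=True)

    # E.g. ahead of a known peak window (perhaps from a cron(1M) job),
    # shift shares towards the busy zone.
    set_cpu_shares("patches", 60)
    set_cpu_shares("web01", 20)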

Zones can be booted and halted independently of the underlying kernel. As zone boot involves only setting up the virtual platform and mounting the configured file systems, it is a much faster operation than booting even the smallest physical system.

The software packages that need to be available in a zone are a function of the service it is hosting, and a zone administrator is free to pick and choose. This makes a zone look like a slick, 'Just Enough OS' environment, tailor-made for the services it hosts. Likewise, a zone administrator is also free to maintain the patch levels of the packages she installed.

To save file system space, two types of zones are supported: sparse and whole-root. In a sparse zone, some file systems like /usr are loopback-mounted from the global zone so that multiple copies of the binaries are not present. In a whole-root zone, all components that make up the platform are installed, which takes more disk space.

The zones technology has been extended to run applications compiled for earlier versions of Solaris. This is referred to as branding a zone and is used to run Solaris 8 and Solaris 9 applications. In such a branded zone, the runtime environment for a Solaris 8 application is no different from what it was on a box running the Solaris 8 kernel. This branding feature helps customers replace old power-hungry systems with newer eco-friendly computers. The operating environment should also provide tools which help make this transition smoother.

The OpenSolaris system has a rich set of observability tools like DTrace(1M), kstat(1M), truss(1), proc(4) and debugging tools like mdb(1) which can be used to study the behavior of, and debug, applications in a production environment.

These tools can be run either from the global zone or from inside a non-global zone itself.

The extensibility of the zones framework can be gauged by the fact that there is an 'lx' brand using which Linux 2.6, 32-bit applications can be run unmodified on an OpenSolaris x86 system inside an lx-branded zone. Thus even Linux applications can be observed from the global zone using the earlier mentioned tools.

A zone can be configured with an exclusive IP stack, such that it can have its own routing table, ARP table, IPSec policies and associations, IP filter rules and TCP/IP ndd variables. Each zone can continue to run services such as NFS on top of TCP/UDP/IP. This way a physical NIC can be set aside for exclusive use of the zone. Zones can also be configured to have shared IP stacks on top of a single NIC.

Disks can be provided to a zone for its exclusive use, or a portion of file system space can be set aside for the zone, or, using the ZFS file system, space can be grown dynamically as demand grows.

The following block diagram captures the above discussion :

Figure 1.

Thus, to summarize, zones virtualize the following facilities:

Processes

File Systems

Networking

Identity

Devices

Packaging


Page 37: ADCOM 2009 Conference Proceedings

1) Field Use Case: www.blastwave.com offers open source packages for different versions of Solaris. As a result they needed to continue to have physical systems running Solaris 8 and Solaris 9. When they made use of the branded zones feature, they were able to consolidate these legacy systems onto systems running Solaris 10, and they quantified the gain as follows:

o 65 percent reduction in rack space, saving tens of thousands of dollars in power, cooling, and hardware-maintenance costs

o Reduced setup time from hours to minutes.

Sun IT heavily uses zones to host quite a few services, to quote a few:

o Request to make source changes is made via a portal which runs in a zone.

o Service to manage lab infrastructure.

o Namefinder service which reports basic information about employees.

o Service to host software patches.

One web server runs in each of the above zones and they all listen on port 80 without stamping on each other. All the above services are critical to the running of the business. The first 3 cases do not see much change in workload and hence are easily virtualized using zones. In the case of the 4th, patches are released periodically and there will be a lot of hits in the first 48 hours of patches being made available; during this time, additional CPUs and physical memory can be set aside for this zone using dynamic resource pools via a cron(1M) job. In consolidating these services, we replaced 4 physical systems with a single one.

Price et al. [11] report less than 4% performance degradation for time-sharing workloads, which is attributed to the loopback mounts in the case of sparse zones.

B. Crossbow

Crossbow provides the building blocks for network virtualization and resource control by virtualizing the network stack and NIC around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol or virtual machine. Crossbow does to the networking stack what Zones do to OS services.

One of the main components of Crossbow is the ability to virtualize a physical NIC into multiple virtual NICs (VNICs). These VNICs can be assigned to zones or to any virtual machines sharing the physical NIC. Virtualization is implemented by the MAC layer and the VNIC pseudo driver of the OpenSolaris network stack. It allows physical NIC resources such as hardware rings and interrupts to be allocated to specific VNICs. This allows each VNIC to be scheduled independently, as per the load on the VNIC, and also allows classification of packets between VNICs to be off-loaded to hardware.

Each VNIC is assigned its own MAC address and optional VLAN id (VID). The resulting MAC+VID tuple is used to identify a VNIC on the network, physical or virtual.

Crossbow allows a bandwidth limit to be set on a VNIC. The bandwidth limit is enforced by the MAC layer transparently to the user of the VNIC. This mechanism allows the administrator to configure the link speed of VNICs that are assigned to zones or VMs; this way they can't use more bandwidth than their assigned share.
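A minimal sketch of both steps using the dladm(1M) CLI from Python; the physical NIC name, VNIC name and the maxbw value are illustrative, and property details may differ between Crossbow builds.

    import subprocess

    def sh(cmd: str) -> None:
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Carve a VNIC out of a physical NIC (device names are illustrative).
    sh("dladm create-vnic -l e1000g0 vnic0")

    # Cap the VNIC's bandwidth via the maxbw link property, so the zone
    # or VM behind it cannot exceed its assigned share.
    sh("dladm set-linkprop -p maxbw=100 vnic0")

    # Inspect the resulting virtual link and its limit.
    sh("dladm show-vnic")
    sh("dladm show-linkprop -p maxbw vnic0")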

Crossbow provides virtual switching semantics between VNICs created on top of the same physical NIC. Virtual switching done by Crossbow is consistent with the behavior of a typical physical switch found on a physical network.

Crossbow VNICs and virtual switches can be combined to build a Crossbow Virtual Wire (vWire). A vWire can be a fully virtual network in a box, and can be used to instantiate a layer-2 network on which a distributed application spanning multiple virtual hosts can run.

Virtual Network Machines (VNMs) are pre-canned OpenSolaris zones which encapsulate a network function such as routing, load balancing, etc. VNMs are assigned their own VNIC(s), and can be deployed on a vWire to provide network functions needed by the network applications. VNMs can come in a pre-configured fashion, which helps in deploying new instances quickly.

III. PLATFORM LEVEL VIRTUALIZATION

There are some instances where OS level virtualization falls short:

Any decent-sized data center will have heterogeneous workloads (applications compiled for different platforms), and consolidation of such workloads can't be fully achieved by OS level virtualization.

A service or application needs a specific kernel module or driver to operate.

Different applications need different kernel patch levels.

Consolidating legacy applications which expect end-of-life (EOL) operating systems.

In such scenarios, platform virtualization solution can be used to consolidate the workloads. These solutions carve multiple Virtual Machines (VMs) out of a physical machine and are referred to as either Hypervisors or Virtual Machine Monitors (VMMs).

For the rest of the discussion let us look at the hypervisors LDOMs and Xen, which OpenSolaris supports on Sparc and x86 architectures respectively, and see how they address some of the challenges mentioned in the Introduction.


Page 38: ADCOM 2009 Conference Proceedings

A. Logical DOMains

LDOMs can be viewed as a hypervisor implemented in firmware. The LDOMs hypervisor is shipped along with sun4v-based Sparc systems.

The hypervisor divides the physical Sparc machine into multiple virtual machines, called domains. Each domain can be dynamically configured with a subset of machine resources and its own independent OS. Isolation and protection are provided by the LDOMs hypervisor by restricting access to registers and address space identifiers (ASIs).

The hypervisor takes care of partitioning the resources. Hardware resources are assigned to logical domains using 'Machine Descriptions (MDs)'. An MD is a graph describing the available devices, which includes CPU threads/strands, memory blocks and the PCI bus. Each LDOM has its own MD, which is used to build the Open Boot PROM (OBP) and consequently the OS device tree upon booting the guest. A fallout of this model is that a guest OS doesn't even see any HW resources not present in its MD.

The first domain that comes up on a sun4v-based system acts as the control domain. The LDOMs manager software runs in the control domain and helps us interact with the hypervisor to allocate and deallocate physical resources to guest domains. The management software consists of a daemon (ldmd(1M)), which interacts with the hypervisor, and the ldm(1M) CLI to interact with the daemon. ldmd(1M) can run only in the control domain.

The hypervisor provides a fast point-to-point communication channel called the 'logical domain channel' (LDC) to enable communication between LDOMs, and between an LDOM and the hypervisor. LDC end points are defined in the MD. ldmd(1M) also uses an LDC to interact with the HV in managing domains.

Initially all the resources are given to the control domain. Using ldm(1M), the administrator detaches resources from the primary domain and passes them on to the newly created guest domains.
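A rough sketch of that sequence using the ldm(1M) CLI from Python; the domain names, CPU counts and memory sizes are illustrative, and subcommand spellings vary somewhat across LDoms manager releases.

    import subprocess

    def ldm(args: str) -> None:
        """Invoke the ldm(1M) CLI, which talks to ldmd(1M) in the
        control domain."""
        subprocess.run(["ldm"] + args.split(), check=True)

    # Shrink the control ('primary') domain to free resources
    # (counts and sizes below are illustrative).
    ldm("set-vcpu 8 primary")
    ldm("set-memory 4G primary")

    # Create a guest domain and hand it CPU strands and memory.
    ldm("add-domain ldg1")
    ldm("add-vcpu 16 ldg1")
    ldm("add-memory 8G ldg1")

    # Bind the domain to concrete resources and start it.
    ldm("bind-domain ldg1")
    ldm("start-domain ldg1")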

I/O for guest domains can be configured in two ways:

Direct I/O: A guest domain can be given direct access to a PCI device, and the guest domain then manages all the I/O devices connected to it using its own device drivers.

Virtualized I/O: One of the domains, called a 'service domain', controls the HW present on the system and presents 'virtual devices' to the other guest domains. The guest then uses a virtual block device driver which forwards the I/O to the service domain through an LDC. The virtual device presented to the guest can be backed by a physical disk, a slice of it, or even a regular file.

In the case of network I/O, the hypervisor presents a layer-2 'virtual network switch' which enables domain-to-domain traffic. Any number of virtual LANs can be created by creating additional v-switches in the service domain. The switch talks to the device driver in the service domain to connect to the physical NIC for external network connectivity, and can tag along the layer-3 features of the service domain kernel to do routing, IP filtering, NAT and firewalling.

The hypervisor automatically powers off CPU cores that are not in use, i.e., not assigned to any domain.

The following block diagram captures the above discussion:

Figure 2.

B. X86 Platform virtualization

In the x86 space, over the years, many virtualization technologies have come in, and they can be classified under two types:

Type 1 – Where virtualization solution does not need a host OS to operate.

Type 2 – Where virtualization solution needs host OS to operate.

For the rest of the discussion, let us look at Xen (a Type 1 hypervisor) and VirtualBox (a Type 2 hypervisor).

1) Xen

Xen is an open source hypervisor technology developed at the University of Cambridge. In OpenSolaris, this solution is specific to x86-based systems. Each VM created can run a complete OS. Xen sits directly on top of the HW and below the guest operating systems. It takes care of partitioning the available CPU(s) and physical memory resources across the VMs.

Unlike other contemporary virtualization solutions for x86, Xen started off with a different design approach to minimize the overheads a guest OS incurs: the guest is ported to the Xen architecture, which very closely resembles the underlying x86 architecture. This approach is referred to as 'paravirtualization (PV)'.

In this approach, a PV guest kernel is made to run in ring-1 of the x86 architecture, while the hypervisor runs in ring-0. This way the hypervisor protects itself against any malicious guest


Page 39: ADCOM 2009 Conference Proceedings

kernel. Just as an application makes a system call to get into the kernel, a PV kernel needs to make a hypercall to get into the hypervisor; hypercalls are needed to request any services from the hypervisor.

The first VM that comes up on boot is referred to as Dom0, and the rest of the VMs are referred to as DomUs. To keep the hypervisor thin, Xen runs the device drivers that control the peripherals on the system in Dom0. This approach makes a lot of code execute in ring-1 rather than ring-0 and thus improves the security of the system.

In PV mode, I/O between guests and their virtual devices is directed through a combination of front-end (FE) drivers that run in a guest and back-end (BE) drivers that run in the domain hosting the device. The communication between the front end and back end happens via the hypervisor and hence is subject to privilege checks. The FE and BE drivers are class-specific, i.e., one set for all block devices and one set for network devices; thus the finer details associated with each specific device are completely avoided. This model of implementing I/O performs far better than the emulated-devices approach.

Depending on the guest OS's needs, more physical memory can be passed on dynamically; likewise, if there is a memory shortage in the system, the hypervisor can take back memory from a guest.

Likewise, the number of virtual CPUs (vCPUs) associated with a guest can be dynamically increased if the guest needs more compute resources. Xen schedules these vCPUs onto the physical CPUs.

The management tools needed to configure and install guests also run in dom0, thus making the hypervisor even thinner. This also improves the debuggability of the tools, as they run in user space.

Given the recent advances in CPUs, like Intel's VT-x and AMD's SVM, Xen allows unmodified guests to be run in a VM; these are referred to as HVM guests. There are a couple of major differences between how a PV guest is handled vis-a-vis an HVM guest:

Unlike a PV guest, an HVM guest expects to handle the devices directly by installing its own drivers; for this, the hypervisor has to emulate different physical devices, and this emulation is done in the userland of dom0. As can be inferred, I/O this way will be slower than in the PV approach.

An HVM guest tries to install its own page tables. Xen uses shadow page tables which track the guest's page table modifications. This can slow down the handling of page faults in the guest.

In both the PV and HVM cases, dom0 needs to be ported to the Xen platform. The applications run without any modifications inside the guest operating systems.

Of late, x86 CPUs are seeing a steady stream of new features that make VMMs easier to implement:

Extended Page Tables (EPT): With EPT, the VMM doesn't have to maintain shadow page tables. The way page tables convert virtual addresses to physical addresses, EPT converts guest-physical to host-physical addresses. This virtualizes CR3, which the guest continues to manipulate.

VT-d: This enables guests to access devices directly, so that the performance impact of emulated devices is cut out.

To get the best of both worlds (i.e., the PV approach while using these HW advances), hybrid virtualization is picking up momentum: we start off with a guest in HVM mode (so that there is no porting exercise) and then incrementally add PV drivers, which bypass device emulation and thus reduce virtualization overheads. In the case of OpenSolaris, these PV drivers are already implemented for block and network devices.

The Xen tracing facility provides a way to record hypervisor events, like a vCPU getting on and off a CPU, a VM blocking, etc., and this data can help nail down performance issues with virtual machines.

Xen allows live migration of guests to similar physical machines, which effectively brings in load balancing features.

Xen also allows suspend and resume of guests, which can be used to start a service on demand. This feature, along with ZFS snapshots, can be used to configure a guest, take a snapshot of it, and move it to a different physical machine; such nodes could provide a simple failover capability.

The following block diagram captures the above discussion:

Figure 3.

To summarize, Xen helps in consolidating the heterogeneous workloads that are commonly seen in a data center.

2) VirtualBox (VB)

VB is an open source virtualization solution for x86-based systems. VB runs as a process in the host operating system and supports various recent as well as legacy operating systems as guests. The power of VB lies in the fact that the guest is completely virtualized and is run as just a process on the


Page 40: ADCOM 2009 Conference Proceedings

host OS. VB follows the usual x86 virtualization technique of running guest ring-0 code in ring-1 of the host and running guest ring-3 code natively.

Given its ease of use, VB can be used to consolidate desktop workloads; a new desktop can be configured and deployed in a few seconds' time.

It supports cold migration of the guests, where one could copy over the Virtual Disk Image (VDI) to a different system along with XML files (config details) and start the VM on the other system.

Creating duplicate VMs is as easy as copying over the VDI, removing the UUID from the image, and registering it with VB. VB emulates PCnet and Intel PRO network adapters, and supports different networking modes: NAT, HIF and internal networking.

C. Field Use Case:

The following are real-world cases where we used platform virtualization inside Sun:

Once a supported release of Solaris is made, a separate build machine is spawned off, which is used to host sources and allows gatekeepers to create patches. Earlier, one physical system used to be set aside for each release. Now, with Xen and LDOMs, a new VM is created for each release and multiple such guests can be hosted on a single system. Patch windows for each of the releases are staggered such that multiple guests won't hit their peak load together. A complete build of Solaris takes at least 10% longer inside a guest, but this is still acceptable as it is not a time-critical job. The performance impact on the interactive workload which an engineer might see while pushing her code change is too small to be noticed.

Engineers heavily use VB to test their changes in older releases of Solaris. So even though performance degradation could be in the range of 5-20% depending on workload, it is still acceptable as it is only functionality and sanity testing [after my kernel change will the system boot?].

So though there is an increase in the number of supported Solaris releases, there is not a corresponding increase in the number of physical systems in the data center, significantly saving on capital expenditure and carbon footprint.

IV. CPU MANAGEMENT

To address the second problem mentioned in the Introduction, the operating system should support the various features provided by the hardware to reduce the idle power consumption of the system. For the rest of the discussion let us look at how the OpenSolaris kernel supports the various power-saving features of both the x86 and Sparc platforms.

A. Advanced Configuration and Power Interface (ACPI) Processor C-states:

The ACPI C0 state refers to the normal functioning of the CPU, when it draws the rated voltage. The CPU enters the C1 state while running the idle thread, by issuing the halt instruction; in this state, other than the APIC and bus, the rest of the units do not consume power. The CPU runs the idle thread when there are no threads in the runnable state.

The ACPI processor C3 state and beyond are referred to as deep C-states, as the wake-up latency of these states is higher than that of the earlier states. In the C3 state, even the APIC and bus units are stopped, and the caches lose state. OS support is needed for the C3 state because of this state loss, and OpenSolaris incorporates this support.

B. Power Aware Dispatcher (PAD)

This feature extends the existing topology-aware scheduling facility to bring 'power domain' awareness to the dispatcher. With this awareness in place, the dispatcher can implement a coalescence dispatching policy to consolidate utilization onto a smaller subset of CPU domains, freeing up other domains to be power managed. In addition to being domain aware, the dispatcher will also tend to prefer domains already running at lower C-states; this will increase the duration and extent to which domains can remain quiescent, improving the kernel's power-saving ability.

Because the dispatcher will track power domain utilization along the way, it can drive active domain state changes in an event driven fashion, eliminating the need for the power management subsystem to poll.

Even these currently conservative policies yield a 3.5% improvement in SPECpower on Nehalem and, more importantly, a 22.2% idle power saving.

C. CPU Frequency Scaling :

It is possible to reduce power consumption of a system by running at a reduced clock frequency, when it is observed that CPU utilization is low.

D. Memory Power Management:

Like CPUs, memory too can be put in a power-saving state when the system is idle. OpenSolaris enables this on chipsets which support the feature.

E. Suspend to RAM:

It is common for desktops and laptops to have extended periods of no activity, as the user could be away at lunch. To save power in such cases, OpenSolaris supports what is referred to as ACPI S3, whereby the whole system is suspended to RAM and power is cut off to the CPU.

F. CPU hotplug:

This is supported on quite a few Sparc platforms and is achieved by 'Dynamic Reconfiguration (DR)' support in the kernel. A DRed-out board containing CPUs is effectively powered off and can be pulled out; but it can also be left in place, so


Page 41: ADCOM 2009 Conference Proceedings

that depending on workload changes, the board can be DRed back in.

V. OBSERVABILITY TOOLS

There should be a mechanism by which the administrator can see how well a system is taking advantage of the power management features discussed above. For this, OpenSolaris provides a couple of tools:

A. Power TOP:

powertop(1M) reports the activity that is making the CPU move to lower C-states and thus increasing power consumption. Addressing the causes of that activity will make the CPU stay longer in the higher C-states.

B. Kstat:

kstat(1M) reports the clock frequency at which the CPU is currently operating and the frequencies it supports. The lower the frequency, the better from a power consumption point of view.
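For instance, a small sketch reading those values through the kstat(1M) CLI; the cpu_info statistic names used here (current_clock_Hz, supported_frequencies_Hz) are assumptions based on common OpenSolaris builds.

    import subprocess

    def kstat_value(module: str, instance: int, stat: str) -> str:
        """Read a single statistic via 'kstat -p' (parsable output:
        'module:instance:name:stat<TAB>value')."""
        out = subprocess.run(
            ["kstat", "-p", "-m", module, "-i", str(instance), "-s", stat],
            check=True, capture_output=True, text=True).stdout
        return out.strip().split("\t")[-1]

    # Current vs. supported clock frequencies of CPU 0.
    print("current  :", kstat_value("cpu_info", 0, "current_clock_Hz"))
    print("supported:", kstat_value("cpu_info", 0, "supported_frequencies_Hz"))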

REFERENCES

[1] Paul Barham et al., "Xen and the art of virtualization," in Proceedings of the 19th Symposium on Operating Systems Principles, 2003.

[2] Bryan Cantrill, Mike Shapiro, and Adam Leventhal, "Dynamic instrumentation of production systems," in USENIX Annual Technical Conference, 2004.

[3] I. Pratt, et al., "The Ongoing Evolution of Xen," in Proceedings of the Ottawa Linux Symposium (OLS), Canada, 2006.

[4] C. Clark, et al., "Live Migration of Virtual Machines," in Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI), Boston, 2005. USENIX.

[5] http://www.usenix.org/event/lisa04/tech/price.html

[6] http://hub.opensolaris.org/bin/view/Project+tesla/

[7] http://hub.opensolaris.org/bin/view/Community+Group+xen/

[8] http://hub.opensolaris.org/bin/view/Community+Group+ldoms/

[9] http://hub.opensolaris.org/bin/view/Project+ldoms-mgr/

[10] http://hub.opensolaris.org/bin/view/Community+Group+zones/

[11] http://www.sun.com/customers/software/blastwave.xml


Page 42: ADCOM 2009 Conference Proceedings

ADCOM 2009

HUMAN COMPUTER INTERFACE - 1

Session Papers:

1. Shankkar B, Roy Paily and Tarun Kumar , “Low Power Biometric Capacitive CMOS Fingerprint Sensor System”

2. Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar, “Particle Swarm Optimization for Feature Selection: An Application to Fusion of Palmprint and Face”


Page 43: ADCOM 2009 Conference Proceedings

Low Power Biometric Capacitive CMOS Fingerprint Sensor System

B. Shankkar, Tarun Kumar and Roy Paily

Department of Electronics and Communication Engineering, IIT Guwahati, Assam, India

Abstract—A charge-sharing based sensor for obtaining fingerprints has been designed. The design uses sub-threshold operation of MOSFETs to achieve a low power sensor device working at 0.5 V. The interfacing circuitry and a fingerprint matching algorithm were also designed to support the sensor and complete a fingerprint verification system.

Index Terms—Fingerprint, sensor, charge sharing, sub-threshold, low power

I. INTRODUCTION

Biometrics is the automated method of recognizing a person based on a physiological or behavioral characteristic. Biometric recognition can be used in identification mode, where the biometric system identifies a person from the entire enrolled population by searching a database for a match based solely on the biometric. A biometric system can also be used in verification mode, where the biometric system authenticates a person's claimed identity from their previously enrolled pattern. Various biometrics used for such purposes are signature verification, face recognition, fingerprints and iris recognition. Fingerprinting is one of the oldest ways of technically establishing a person's identity. Fingerprints have a ridge and valley pattern on the tip of a finger that is distinct for every individual; every fingerprint can thus be divided into two structures, ridges and valleys. As the world becomes more accustomed to devices reducing in size with each passing day, and dependence on mobile devices increases at an enormous rate, a fingerprint authentication and identification system that can be mounted on such a device is imperative. Any fingerprint authentication system requires a sensor and corresponding circuitry, an interface, and a fingerprint matching algorithm. This paper presents the results achieved in simulation for all these modules.

II. SENSOR AND CORRESPONDING CIRCUITRY

Figure 1 shows the basic principle of the charge-sharing scheme [1]. The finger is modeled as the upper metal plate of a capacitor, with its lower metal plate in the cell. These electrodes are separated by the passivation layer of the silicon chip, formed by the metal oxide layer on the chip. The series combination of the two capacitances is called Cs. The basic principle behind the working of capacitive fingerprint sensors is that Cs changes according to whether the part of the finger on that pixel is a ridge or a valley. If the part of the finger is a ridge, then Cs is higher than when it encounters a valley, in which case the series combination of the two capacitances falls low, because the capacitor between the metal plate and the finger is modeled as a capacitor with an air medium in between. Cp1 and Cp2 are the internal parasitic capacitances of the nodes N1 and N2. In the pre-charge phase, the switches S1 and S3 are on and S2 is off. The capacitors Cp1 and Cp2 get charged up. During the evaluation phase, S2 is turned on. The voltage stored during the pre-charge phase is now divided between Cs, Cp1 and Cp2. The output voltage at N1 is easily seen to be the following expression:

$V_0 = V_{N1} = V_{N2} = \dfrac{C_{p1}V_1 + C_{p2}V_2 + C_s V_1}{C_{p1} + C_{p2} + C_s}$  (1)
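A minimal numeric sketch of Equation (1); the capacitance and pre-charge voltage values are purely illustrative, with only the relative size of Cs for ridge versus valley mattering.

    def output_voltage(cs, cp1, cp2, v1, v2):
        """Charge-sharing output of Equation (1): charge stored on Cp1,
        Cp2 and Cs during pre-charge is redistributed once S2 closes."""
        return (cp1 * v1 + cp2 * v2 + cs * v1) / (cp1 + cp2 + cs)

    # Illustrative parasitics (fF) and pre-charge voltages (V); a ridge
    # sits closer to the sensor plate, so its series capacitance Cs is
    # larger than a valley's.
    CP1, CP2 = 50.0, 50.0
    V1, V2 = 3.3, 0.0
    for label, cs in (("ridge ", 40.0), ("valley", 10.0)):
        print(label, round(output_voltage(cs, CP1, CP2, V1, V2), 2), "V")
    # The ridge yields the higher output voltage, which the comparator
    # turns into a binary pixel value.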

Fig. 1. Basic Charge Sharing Scheme [1]

As given in Figure 2, Cs differs for a ridge and a valley, and thus the output voltage also differs according to the above expression. This difference in voltage, when passed to a comparator with an appropriate reference voltage, gives a binary output. The binary values from all the pixels in the chip then constitute the required fingerprint image. In the pre-charge phase (pch = 0), it can be seen that N1 and N3 are kept at Vdd by the PMOS transistors. During this phase, the capacitors Cp2 and Cp3 are shorted, with the voltages at both ends being ground and Vdd respectively. The capacitors Cp1 and Cs begin to charge up. They store charges of Cp1·Vdd and Cs·(Vdd − Vfin) respectively. This is the charge accumulation phase.

At the beginning of the evaluation phase (pch = 1), both the input and output voltages are equal to Vdd. Even when the voltage at N1 starts decreasing due to charge sharing between the capacitors, the unity-gain buffer ensures that the voltage at N3 is equal to the voltage at N1, thus effectively shorting the capacitor Cp3 and removing its effect. Meanwhile,


Page 44: ADCOM 2009 Conference Proceedings

Fig. 2. The Sensor Circuitry

the comparator is also enabled, which produces the required binary output. Thus the fingerprint pattern is captured. The circuits were simulated and the implementation results are shown in Figure 3. The presence of a ridge and a valley is characterized by the voltage levels 1.6 V and 1.2 V respectively.

Fig. 3. Resolution obtained for basic Sense Amplifier

III. INTERFACE

A circuit was designed for interfacing the sensor with the digital processing system. Circuitry was required to select the particular pixels to be activated and then to selectively transfer the pixel values to the FPGA board for fingerprint matching. The module involved the use of a decoder and an AND-OR gate array. A basic circuit to detect long sequences of a single value was also designed to implement auto-correction. This ensured that a long stream of a single bit value is not consecutively sensed, as such a stream is highly unlikely to occur in practical conditions. The complete sensor module, along with the circuitry and the auto error detection module, is shown in Figure 4. The in0 and in1 signals come from a counter which increments on the positive edge of the clock. The ctrl signal goes to the reset of the counter. The out signal is the sensed value of the fingerprint and goes to the fingerprint matching algorithm. The four 'fps' blocks represent the sensor array. The decode unit helps activate a single pixel sensor at a time, the values of which are passed to the output through the AND-OR gate array.

Fig. 4. Building blocks for integrating sensor module with FPGA kit

IV. RESOLUTION IMPROVEMENT

We proposed a modification to the basic circuit to improve the resolution of the output signal. We introduced an inverter at the output stage, which magnifies the difference between the ridge and valley outputs obtained at the output port. The inverter was designed to have a characteristic such that the point where Vin = Vout occurs at a value equal to the middle of the voltage swing for ridge and valley. This difference is reasonably easily discernible, and it eases the limited range of Vref: it gives a wider usable range for Vref, thus making the circuit more reliable. A voltage swing of around 2 V was achieved using the designed inverter. The result of the proposed improvement is shown in Figure 5.

Fig. 5. Results obtained with Resolution Improvement Circuit

V. POWER IMPROVEMENT

Sub-threshold [3] operation of MOSFETs was used to achieve the required reduction in supply voltage and power. The various steps in implementing the circuitry using the sub-threshold approach are as follows:


Page 45: ADCOM 2009 Conference Proceedings

• The supply voltage was first fixed at 0.5V.

• Since the current when the transistor is in weak inversion is very small, the node capacitances take a long time to charge and discharge. Thus the clock frequency was reduced to 1 KHz.

• The sense amplifier has to work as a unity-gain amplifier when the output and V- terminals are shorted. To design the sense amplifier to work at 0.5 V, 0.5 V was applied at V+ and the widths of the transistors were changed to get an output voltage of 0.5 V. It is well known that the NMOS transistor acts as a pull-down device and the PMOS transistor acts as a pull-up device. After many iterations, the widths were decided, with the corresponding characteristics. The circuit developed is shown in Figure 6. The output characteristics are shown in Figure 7.

Fig. 6. Low Power Sense Amplifier

Fig. 7. Low Power Sense Amplifier input output response

• The inverter was designed to have a change from logic 0 to logic 1 at 0.25 V. The widths of the transistors were changed to get the desired characteristics. The circuit is shown in Figure 8 and the output characteristics are given in Figure 9.

Fig. 8. Low Power Inverter

Fig. 9. DC transfer characteristics of low Power Inverter

• After the modules were designed and characterized individually, they were combined to obtain the overall circuit. Since the voltages on the node capacitances are very small, care has to be taken to allow more current flow for faster charging. This decided the widths of the transistors in the main circuit.

• Thus the overall circuit was designed at a 0.5 V supply voltage, with the transistors in weak inversion to reduce power.

The improvement resulted in a power requirement of 736 nW per pixel position. The changes introduced in the MOSFET designs also improved resolution further, and we obtained a resolution of around 350 mV for a power supply of 0.5 V. The results are presented in Figure 10.


Page 46: ADCOM 2009 Conference Proceedings

Fig. 10. Results obtained with low power Resolution Improvement Circuit

VI. FINGERPRINT MATCHING ALGORITHMS

Minutiae points [4] are the local ridge characteristics that occur either at a ridge ending or at a ridge bifurcation. A ridge ending is defined as the point where the ridge ends abruptly, and a ridge bifurcation is the point where the ridge splits into two or more branches. Once the fingerprint image has been enhanced and a thinning algorithm applied, the minutiae are extracted. The algorithms were implemented and simulated on the Xilinx 10.1 ISE simulator. The simulations were performed targeting a Virtex II Pro FPGA board. The following are the basic steps used in implementing the fingerprint matching algorithm.

• Removing local variations: Local variations in the fingerprint were removed by scanning the whole image pixel by pixel and, at each location, determining the values of all neighboring pixels. If a particular pixel value is different from all the neighboring pixel values, it is changed to the value of the neighboring pixels. The code was implemented using VHDL.

• Thinning algorithm: Thinning algorithms are used to narrow down the ridge structures within a fingerprint image. Thinning [5] is done by removing pixels from the borders of the ridges without disturbing the continuity of the structures. For this purpose the image was scanned from left to right. Whenever the pixel values on the left did not match the values on the right, the particular pixel values were changed to '1' (white). This way only the single pixel values at the end were retained as '0', to represent a single-pixel-wide ridge pattern.

• Minutiae extraction: The minutiae considered are ridge endings and ridge bifurcations. At every pixel location, sharp changes were detected by checking opposite points. If any two opposite pixel positions with respect to the current pixel position had different values, it was considered a special point. Certain subsequent pixels were then analyzed to determine whether a particular pixel was a minutia or not. For every minutia, its location with respect to the first minutia was saved for later comparison.

• Minutiae matching: In this step the minutiae of the fingerprints on record were loaded and compared to the minutiae of the recently processed fingerprint. The minutiae of the fingerprint records were saved as separate templates. On every clock cycle one of the templates was accessed. The minutiae set of the current candidate fingerprint was compared element by element to that of the accessed template. In the case of 15 or more matches in minutiae locations, a match is announced and the loop exited (see the sketch below).
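A simplified software restatement of that matching rule (not the VHDL), assuming minutiae are stored as (dx, dy) offsets relative to the first minutia; the pixel tolerance is an assumption.

    def minutiae_match(candidate, template, tol=2, needed=15):
        """Count minutiae whose relative locations coincide within 'tol'
        pixels; announce a match at 'needed' (here 15) or more hits."""
        hits = 0
        for cx, cy in candidate:
            if any(abs(cx - tx) <= tol and abs(cy - ty) <= tol
                   for tx, ty in template):
                hits += 1
                if hits >= needed:      # early exit, like the HW loop
                    return True
        return False

    # candidate and template are lists of offsets such as
    # [(0, 0), (5, 12), (9, -3), ...], one list per fingerprint.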

One sample of the actual 100 pixel × 50 pixel image that was input to the matching algorithm, and the corresponding thinned image with minutiae positions, is shown in Figure 11.

Fig. 11. Fingerprint image and extracted minutiae

Another approach, based on correlation of the test fingerprint image with the fingerprint images in the database, was also implemented on the FPGA board. In this approach every single pixel value fed to the FPGA board was compared to the corresponding pixel value in all the other fingerprint images stored in the database. Parameters were updated on the arrival of each pixel value to calculate the correlation of the complete images. A correlation value higher than 0.6 was used to announce a match.
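In software terms, the decision amounts to the sketch below: a normalized (Pearson) correlation over pixels with the paper's 0.6 threshold. This is an illustrative restatement, not the streaming FPGA implementation.

    import numpy as np

    def correlation_match(test_img, database, threshold=0.6):
        """Return the index of the first stored image whose normalized
        pixel correlation with the test image exceeds the threshold."""
        t = (test_img - test_img.mean()) / test_img.std()
        for idx, ref in enumerate(database):
            r = (ref - ref.mean()) / ref.std()
            if (t * r).mean() > threshold:   # Pearson correlation
                return idx
        return None                           # no match announced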

VII. CONCLUSION

This paper presents a complete basic model for obtaining a fingerprint and identifying or authenticating it. The whole project was implemented on suitable simulators for each individual module. The sensor and related circuitry were simulated using Mentor tools and the fingerprint algorithm was implemented on Xilinx 10.1. The important achievements were in terms of resolution improvement and reduction in power requirements.

REFERENCES

[1] J.W. Lee, D.J. Min, J.Y. Kim and W.C. Kim, "A 600-dpi capacitive fingerprint sensor chip and image-synthesis technique," IEEE J. Solid-State Circuits, 1999.

[2] Jin-Moon Nam, Seung-Min Jung and Moon-Key Lee, "Design and implementation of a capacitive fingerprint sensor circuit in CMOS technology," Sensors and Actuators, 2007.

[3] Hendrawan Soeleman and Kaushik Roy, "Ultra-low power digital sub-threshold logic circuits," Proceedings of the International Symposium on Low Power Electronics and Design, 1999.

[4] Nimitha Chama, "Fingerprint image enhancement and minutiae extraction," http://www.ces.clemson.edu/ stb/ece847/fall2004/projects/proj10.doc.

[5] Pu Hongbin, Chen Junali, Zhang Yashe, "Fingerprint Thinning Algorithm Based on Mathematical Morphology," The Eighth International Conference on Electronic Measurement and Instruments, 2007.


Page 47: ADCOM 2009 Conference Proceedings

PARTICLE SWARM OPTIMIZATION FOR FEATURE SELECTION: AN APPLICATION TO FUSION OF PALMPRINT AND FACE

Raghavendra R.1, Bernadette Dorizzi2, Ashok Rao3, Hemantha Kumar G.1

1 Dept of Studies in Computer Science, University of Mysore, Mysore-570 006, India. 2 Institut TELECOM, TELECOM and Management SudParis, France. 3 Professor, Dept of E & C, CIT, Gubbi, India.

ABSTRACT

This paper relates to multimodal biometric analysis. Here we present an efficient feature level fusion and selection scheme that we apply to face and palmprint images. The features for each modality are obtained using the Log Gabor transform and concatenated to form a fused feature vector. We then use a Particle Swarm Optimization (PSO) scheme, a random optimization method, to select the dominant features from this fused feature vector. Final classification is performed in the projection space of the selected features using Kernel Direct Discriminant Analysis. Extensive experiments are carried out on a virtual multimodal biometric database of 250 users built from the face FRGC and the palmprint PolyU databases. We compare the proposed selection method with well known feature selection methods such as Sequential Floating Forward Selection (SFFS), Genetic Algorithms (GA) and Adaptive Boosting (AdaBoost), in terms of both the number of features selected and performance. Experimental results show that the proposed method of feature fusion and selection using PSO outperforms all other schemes in terms of reducing the number of features, and corresponds to a system that is easier to implement while showing the same or even better performance.

Keywords: Multimodal Biometrics, Feature level fusion, Feature selection, Particle Swarm Optimization

1. INTRODUCTION

Recently, multimodal biometric fusion techniques have attracted much attention, as the complementary information between different modalities can improve recognition performance. In practice, fusion of several biometric systems can be performed at 4 different levels: sensor level, feature level, match score level and decision level. As reported in [1], a biometric system that integrates information at an earlier stage of processing is expected to provide better performance than systems that integrate information at a later stage, because of the availability of more and richer information. In this paper, we explore the interest of feature fusion and experiment with it on two widely studied modalities, namely face and palmprint. Very few papers address the feature fusion of palmprint and face [2][3][4][5]. From these articles, it is clear that performing feature level fusion leads to the curse of dimensionality due to the large size of the fused feature vector, and hence linear or non-linear projection schemes are used by the above-mentioned authors to overcome the dimensionality problem. In this work, we address this problem by reducing the dimension of the feature space through an appropriate feature selection procedure. To this aim, we experimented with the binary Particle Swarm Optimization (PSO) algorithm proposed in [6] to perform feature selection. Indeed, PSO-based feature selection has been shown to be very efficient on some large-scale application problems, with performance better than Genetic Algorithms [7][8]. We therefore implemented it for this biometric feature fusion problem of high dimension, and this is the main novelty of this paper. Extensive experiments conducted on a virtual multimodal biometric database of 250 users show the efficacy of the proposed scheme.

The rest of this paper is organized as follows: Section 2 describes the proposed method of feature fusion using PSO and also discusses the selection of parameters for PSO. Section 3 presents the experimental setup, Section 4 describes the results and discussion, and finally Section 5 draws the conclusion.

2. PROPOSED METHOD

Fig. 1. Proposed scheme of feature fusion and selection in Log Gabor space including PSO feature selection (face and palmprint Log Gabor features are concatenated, PSO selects features, and KDDA-based classification yields accept/reject).


Page 48: ADCOM 2009 Conference Proceedings

Figure 1 shows the proposed block diagram of feature level fusion of palmprint and face in Log Gabor space. As observed from Figure 1, we first extract the texture features of face and palmprint separately using the Log Gabor transform. We use the Log Gabor transform as it is suitable for analyzing gradually changing data such as face, iris and palmprint [3], and it is also mentioned in [9] that the Log Gabor transform can reflect the frequency response of images more realistically than the usual Gabor transform. On a linear frequency scale, the transfer function of the Log Gabor transform has the form [9]:

$G(\omega) = \exp\left(-\dfrac{\log(\omega/\omega_o)^2}{2\,\log(k/\omega_o)^2}\right)$  (1)

where $\omega_o$ is the filter center frequency. To obtain a constant-shape filter, the ratio $k/\omega_o$ must be held constant for varying $\omega_o$.
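A small sketch of Equation (1), assuming an illustrative k/ωo ratio of 0.55; it only shows that the radial response keeps its shape and peaks at ωo for any center frequency.

    import numpy as np

    def log_gabor(omega, omega_o, k_ratio=0.55):
        """Radial Log Gabor response of Equation (1); k/omega_o is held
        constant (k_ratio) so the filter shape is scale-invariant."""
        omega = np.asarray(omega, dtype=float)
        return np.exp(-np.log(omega / omega_o) ** 2
                      / (2.0 * np.log(k_ratio) ** 2))

    # The response peaks (value 1.0) exactly at omega = omega_o.
    w = np.linspace(0.01, 0.5, 500)
    g = log_gabor(w, omega_o=0.1)
    print(round(w[g.argmax()], 2), round(g.max(), 3))   # ~0.1, ~1.0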

The Log Gabor transform used in our experiments has 4 different scales and 8 orientations. We fixed these values based on the results of different trials and also in conformity with the literature [4][3]. Thus, each image (of palmprint and face) is analyzed using 8×4 different Log Gabor filters, resulting in 32 different filtered images of resolution 60 × 60. To reduce the computation cost, we downsample each image by a ratio equal to 6. Thus, the final size is reduced to 40×80. A similar analysis is also carried out for the palmprint modality. By concatenating the column vectors associated with each image we obtain a fused feature vector of size 6400×1. As the imaging conditions of face and palmprint are different, a feature vector normalization is carried out as mentioned in [3]. In order to reduce the size of each vector, we propose to perform feature selection through PSO, as explained in Section 2.1 and illustrated in Figure 2, where 'K' indicates the dimension of the feature space after concatenation and 'S' indicates the reduced dimension obtained by PSO. Then, we use KDDA to project the selected features onto the kernel discriminant space. We employ KDDA here because of its good performance as well as its high dimension-reduction ability. Finally, the accept/reject decision is made using the NNC.

2.1. Particle Swarm Optimization (PSO)

PSO is a stochastic, population-based optimization technique aiming at finding a solution to an optimization problem in a search space. The PSO algorithm was first described by J. Kennedy and R.C. Eberhart in 1995 [8]. The main idea of PSO is to simulate the social behavior of birds flocking to describe an evolving system. Each candidate solution is therefore modeled by an individual bird, that is, a particle in a search space. Each particle adjusts its flight by making use of its individual memory and of the knowledge gained from its neighbors to find the best solution.

2.2. Principle of PSO

The main objective of PSO is to optimize a given function called the fitness function. PSO is initialized with a population of particles distributed randomly over the search space, which are evaluated together to compute the fitness function. Each particle is treated as a point in N-dimensional space. The i-th particle is represented as $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$. At every iteration, each particle is updated using two best values called pbest and gbest. pbest is the position associated with the best fitness value of particle i obtained so far and is represented as $pbest_i = (pbest_{i1}, pbest_{i2}, \ldots, pbest_{iN})$, with fitness function $f(pbest_i)$. gbest is the best position among all the particles in the swarm. The rate of position change (velocity) for particle i is represented as $V_i = (v_{i1}, v_{i2}, \ldots, v_{iN})$. The particle velocities are updated according to the following equations [8]:

$V^{new}_{id} = w \cdot V^{old}_{id} + C_1 \cdot rand_1() \cdot (pbest_{id} - x_{id}) + C_2 \cdot rand_2() \cdot (gbest_d - x_{id})$  (2)

$x_{id} = x_{id} + V^{new}_{id}$  (3)

where $d = 1, 2, \ldots, N$ and w is the inertia weight. A suitable selection of the inertia weight provides a balance between global and local exploration, and results in fewer iterations on average to find near-optimal results. C1 and C2 are the acceleration constants used to pull each particle towards pbest and gbest. Low values of C1 and C2 allow a particle to roam far from the target regions, while high values result in abrupt movements towards or past the target regions. rand1() and rand2() are random numbers between (0,1).
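A compact NumPy restatement of Equations (2) and (3) for a whole swarm at once; the default w, C1 and C2 simply echo the values the authors settle on later in Sections 2.3.4.

    import numpy as np

    rng = np.random.default_rng(0)

    def pso_step(x, v, pbest, gbest, w=1.2, c1=0.7, c2=1.2):
        """One velocity/position update per Equations (2) and (3).
        x, v, pbest: (particles, N) arrays; gbest: (N,) array."""
        r1 = rng.random(x.shape)          # rand1() per dimension
        r2 = rng.random(x.shape)          # rand2() per dimension
        v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        return x + v_new, v_new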

2.3. Binary PSO

The original PSO was introduced for continuous populations but was later extended by J. Kennedy and R.C. Eberhart [6] to discrete-valued populations. In binary PSO, the particles are represented by binary values (0 or 1). Each particle's position is updated from its velocity according to the following equations:

$S(V^{new}_{id}) = \dfrac{1}{1 + e^{-V^{new}_{id}}}$  (4)

$\text{if } rand < S(V^{new}_{id}) \text{ then } x_{id} = 1; \text{ else } x_{id} = 0$  (5)

where $V^{new}_{id}$ denotes the particle velocity obtained from Equation 2, the function $S(V^{new}_{id})$ is a sigmoid transformation, and rand is a random number drawn from the uniform distribution (0,1). If $S(V^{new}_{id})$ is larger than the random number, the position value is set to 1; otherwise it is set to 0. Binary PSO is well adapted to the feature selection context [10][7]. In order to apply the idea of binary PSO to the feature selection of face and palmprint features, we need to adapt the general binary PSO concept to this precise application. This is the objective of the following subsections.
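A minimal sketch of Equations (4) and (5), with the Vmax clamp of Section 2.3.3 folded in; the velocity values used in the example are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    def binary_positions(v_new, v_max=6.0):
        """Map velocities to binary feature-selection bits: clamp to
        [-Vmax, Vmax], squash through the sigmoid of Equation (4), and
        sample each bit per Equation (5) (1 = feature selected)."""
        v = np.clip(v_new, -v_max, v_max)
        s = 1.0 / (1.0 + np.exp(-v))                   # Equation (4)
        return (rng.random(v.shape) < s).astype(int)   # Equation (5)

    # Example: a 10-dimensional velocity vector -> a selection mask.
    print(binary_positions(rng.normal(size=10)))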


Page 49: ADCOM 2009 Conference Proceedings

Fig. 2. PSO feature selection scheme in Log Gabor space: face and palmprint Log Gabor features are concatenated into a K-dimensional fused vector, from which PSO selects an S-dimensional subset.

2.3.1. Representation of Position

The initial swarm is created such that the population of particles is distributed randomly over the search space. Since we are using binary PSO, each particle position is represented as a binary bit string of length N, where N is the total dimension of the feature set. Every bit in the string represents a feature: the value '1' means that the corresponding feature is selected, while '0' means that it is not. Each particle's velocity and position are updated according to Equations 2, 4 and 5.

2.3.2. Fitness Function

Feature selection relies on an appropriate formulation of the fitness function. In biometric verification, it is difficult to identify a single function that would characterize the matching performance across a range of False Acceptance Rate (FAR) and False Reject Rate (FRR) values, i.e., across all matching thresholds [1]. Thus, in our experiments, we first compute the distance between reference and testing samples to get the match scores, and then we compute the FAR and the genuine acceptance rate (GAR) by setting thresholds at different points. In order to optimize the performance gain across a wide range of thresholds, we define the objective function to be the average of 12 GAR values corresponding to 12 different FAR values (90%, 70%, 50%, 30%, 10%, 0.8%, 0.6%, 0.4%, 0.2%, 0.09%, 0.05%, 0.01%). Thus, the main objective of the verification fitness function is to maximize this average GAR value.
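A sketch of that fitness computation, assuming distance-based match scores where smaller means more similar; picking each threshold as the corresponding impostor-score quantile is an implementation assumption.

    import numpy as np

    # FAR operating points from the paper, in percent.
    FAR_POINTS = (90, 70, 50, 30, 10, 0.8, 0.6, 0.4, 0.2, 0.09, 0.05, 0.01)

    def fitness(genuine, impostor):
        """Average GAR over the 12 FAR points. For each target FAR, pick
        the distance threshold realizing it on the impostor scores, then
        measure the fraction of genuine scores accepted below it."""
        genuine = np.asarray(genuine)
        gars = []
        for far in FAR_POINTS:
            thr = np.quantile(impostor, far / 100.0)
            gars.append(np.mean(genuine <= thr))
        return float(np.mean(gars))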

2.3.3. Velocity Limitation Vmax

In the binary version of PSO, the value of Vmax limits the probability that a bit $x_{id}$ takes the value 0 or 1, and therefore the use of a high Vmax value in binary PSO will decrease the range explored by a particle [6]. In our experiments, we tried different values of Vmax and finally selected Vmax = 6, as it allows the particles to reach near-optimal solutions.

2.3.4. Inertia Weight and Acceleration Constant

The inertia weight is an important parameter, as it provides the particles with a degree of memory capability. It has been found experimentally that an inertia weight w in the range [0.8, 1.2] yields better performance [6]. Hence, in the present work, we initially set w to 1.2 and then decrease it to zero over subsequent iterations (we use 50 iterations). This scheme of decreasing the inertia weight is found to be better than a fixed one [6], as it allows reaching an optimal solution. Even though the acceleration constants C1 and C2 are not so critical to the convergence of PSO, suitably chosen values may lead to faster convergence of the algorithm. In our experiments, we varied the values of C1 and C2 from 0 to 2 and finally chose C1 = 0.7 and C2 = 1.2, as these yield better convergence.


Page 50: ADCOM 2009 Conference Proceedings

2.3.5. Population Size

In our present work, we experimentally varied the size of the population from 10 to 30 in steps of 5 and finally fixed the population size at 20, which was found to be an optimal value.

3. EXPERIMENTAL SETUP

This section describes the experimental setup that we have built in order to evaluate the proposed feature level fusion schemes. Because of the lack of a real multimodal database of face and palmprint data, experiments are carried out on a database of virtual persons, using face and palmprint data coming from two different databases. This procedure is valid since, for one person, face and palmprint can be considered as two independent modalities [1].

For the face modality we choose the FRGC face database [11], as it is a large database widely used for benchmarking. From this database, we choose 250 different users from 2 different sessions. The first session consists of 6 samples per user taken from data collected during Fall 2003, and the second session consists of 6 samples per user taken from data collected during Spring 2004. Out of these 6 samples, the first 4 are taken under controlled conditions and the remaining 2 under uncontrolled conditions. For the palmprint modality, we select a subset of 250 different palmprints from the PolyU database [12]; each of these users possesses 12 samples, such that 6 samples are taken from the first session and the next 6 from the second session. The average time interval between the first and second sessions is two months. In building our multimodal biometric database of face and palmprint, each virtual person is associated with 12 samples of face and palmprint, produced randomly by pairing the face and palmprint samples of 2 persons from the respective databases. Thus, the resulting virtual multimodal biometric database consists of 250 users, each with 12 samples.

3.1. Experimental Protocol

This section describes in detail the experimental protocol employed in our work. For learning the projection spaces, we use a subset of 100 users called LDB, such that each user has 6 samples (selected randomly out of 12). To validate the performance of all the algorithms, we divide the whole database of 250 users into two independent sets called Set-I and Set-II. Set-I consists of 200 users and Set-II consists of 50 users. Set-II is used as the validation set to fix the parameters of PSO (such as Vmax, C1, C2 and population size), of match score fusion and also those of AdaBoost. Set-I is divided into two equal partitions providing 6 reference samples and 6 testing samples for each of the 200 persons. The reference and testing partitioning was repeated m times (where m = 10) using holdout cross-validation, and there is no overlap between these two subsets. Thus, in each of the 10 trials we have 1200 (= 200 × 6) reference samples and 1200 (= 200 × 6) testing samples, and hence 1200 genuine matching scores and 238800 (= 200 × 199 × 6) impostor matching scores, as for each user all other users are considered as impostors. In closed identification, we calculate the recognition rate using the 1200 reference samples and the 1200 testing samples, which gives 1200 × 1200 matching scores. Note that the persons are exactly the same in the reference and test sets, which is why we speak of closed identification. Finally, results are presented by taking the mean over all 10 trials, and we also report the statistical variation of the results with a 90% parametric confidence interval [13], which gives a better estimate of the deviation than the one obtainable from cross-validation alone.

4. RESULTS AND DISCUSSION

Fig. 3. ROC curves of the different verification systems

This section discusses the results of the proposed feature fusion and selection scheme in terms of performance and number of features selected. The proposed method is compared with three different feature selection schemes, namely AdaBoost (Feature Fusion-AdaBoost) [14], Genetic Algorithm (Feature Fusion-GA) [15] and Sequential Floating Forward Selection (Feature Fusion-SFFS) [16], in terms of the number of features selected. Further, we present a comparative analysis of the feature level fusion and selection schemes against the feature fusion scheme using the complete set of features (Feature Fusion-LG). We also present a comparative analysis of feature level fusion against match score level fusion in terms of performance.

Figure 3 shows the ROC curves of the individual biometrics, of feature fusion using the complete set of features (Feature Fusion-LG), of the feature fusion and selection schemes, and of match score level fusion. To perform the match score level fusion, we first obtain the match scores of face and palmprint independently using the combination of the Log Gabor transform and KDDA. Note that this architecture corresponds to state-of-the-art systems in both face and palmprint. We therefore perform a


Page 51: ADCOM 2009 Conference Proceedings

Table 1. Comparative performance of the different feature selection schemes (Mean GAR at FAR = 0.01%) (Verification)

Methods                   GAR at 0.01% FAR (%) with 90% confidence interval
Face Alone                65.32 [63.06; 67.37]
Palmprint Alone           74.62 [72.55; 76.08]
Match Score Fusion        86.50 [84.89; 88.11]
Feature Fusion-LG         92.51 [91.65; 93.75]
Feature Fusion-SFFS       92.55 [91.32; 93.77]
Feature Fusion-AdaBoost   92.88 [91.76; 94.00]
Feature Fusion-GA         92.75 [91.52; 93.97]
Feature Fusion-PSO        94.72 [93.85; 95.59]

Table 2. Comparison of feature selection schemes

Methods                   DOri   DFS    DKDDA
Feature Fusion-LG         6400   6400   224
Feature Fusion-SFFS       6400   5286   207
Feature Fusion-AdaBoost   6400   4090   184
Feature Fusion-GA         6400   3855   170
Feature Fusion-PSO        6400   3520   139

weighted SUM rule, computing the weights empirically. Here, we vary the weights W1 and W2 from 0 to 1 such that W1 + W2 = 1 and finally fix the weights for which we get the best performance on Set-II. It has been shown in [1] that this leads to a good and simple scheme for match score fusion. In order to have a fair comparison, we employed the same fitness function as that of the proposed PSO fusion scheme (see Section 2.3.2) with the other feature selection schemes used for comparative analysis in our present work. Table 1 shows the relative performance of these algorithms in terms of the mean GAR at FAR = 0.01%. From Figure 3 and Table 1 it can be observed that palmprint outperforms face, with GAR = 74.62% at FAR = 0.01%, against GAR = 65.32% at FAR = 0.01% for face. Further, the feature level fusion (Feature Fusion-LG) of these two modalities shows a large improvement in performance compared with match score level fusion and the individual biometrics, with GAR = 92.51% at FAR = 0.01%. It is also observed from Figure 3 (and Table 1) that the use of a selection scheme keeps the same level of performance as Feature Fusion-LG but with fewer features. Table 2 gives the number of features selected by the proposed PSO-based feature fusion and selection scheme and by the three other feature selection schemes, for the same level of performance as the complete feature set (Feature Fusion-LG). Here, DOri indicates the initial feature dimension, DFS the dimension after feature selection, and DKDDA the final dimension of the KDDA projection space. From Table 2 we observe that the PSO-based feature selection scheme uses fewer features than the three other feature selection schemes. Indeed, the proposed Feature Fusion-PSO scheme reduces the fused feature space by roughly 45% (1 − 3520/6400 ≈ 0.45), while SFFS, AdaBoost and GA reduce the fused feature space dimension by around 17%, 36% and 39% respectively. These figures clearly indicate the efficacy of the proposed Feature Fusion-PSO. It is also observed from our experiments that fusion at feature level allows an improvement of about 5% over match score level fusion.

5. CONCLUSION

In this paper, we investigated the dimensionality reduction of a high dimensional feature fusion space and proposed a novel feature fusion scheme based on PSO. The proposed method is compared with three different state-of-the-art feature selection methods, namely SFFS, AdaBoost and GA. Extensive experiments carried out on a virtual multimodal biometric database of 250 users composed of palmprint and face indicate that the proposed Feature Fusion-PSO approach reduces the fused feature space dimension by roughly 45% while keeping the same level of performance as the complete system, Feature Fusion-LG. Moreover, the PSO implementation makes the recognition process faster and less complex by reducing the number of features while preserving their discriminative ability. Thus, from the above analysis, we can conclude that the proposed Feature Fusion-PSO method can contribute to developing faster and more accurate multimodal biometric systems based on feature level fusion.

Our future work will focus on further investigating the features selected by the PSO based scheme, and in particular on searching for a correlation between certain selected features and the quality of the data.

6. REFERENCES

[1] A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics, Springer-Verlag, 2006.

[2] G. Feng, K. Dong, D. Hu, and D. Zhang, “When faces are combined with palmprints: a novel biometric fusion strategy,” in First International Conference on Biometric Authentication (ICBA), 2004, pp. 701–707.

[3] Y. Yao, X. Jing, and H. Wong, “Face and palmprint feature level fusion for single sample biometric recognition,” Neurocomputing, vol. 70, no. 7-9, pp. 1582–1586, 2007.

[4] X. Y. Jing, Y. F. Yao, J. Y. Yang, M. Li, and D. Zhang, “Face and palmprint pixel level fusion and kernel DCV-RBF classifier for small sample biometric recognition,”


Page 52: ADCOM 2009 Conference Proceedings

Pattern Recognition, vol. 40, no. 3, pp. 3209–3224, 2007.

[5] Y. Yan and Y. J. Zhang, “Multimodal biometrics fusion using Correlation Filter Bank,” in Proceedings of the International Conference on Pattern Recognition (ICPR 2008), 2008, pp. 1–4.

[6] J. Kennedy and R. C. Eberhart, “A discrete binary version of the particle swarm algorithm,” in IEEE International Conference on Systems, Man and Cybernetics, 1997, pp. 4104–4108.

[7] X. Wang, J. Yang, X. Teng, W. Xia, and B. Jensen, “Feature selection based on rough sets and particle swarm optimization,” Pattern Recognition Letters, vol. 28, pp. 459–471, 2007.

[8] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in IEEE International Conference on Neural Networks, 1995, pp. 1942–1948.

[9] X. Zhitao, G. Chengming, Y. Ming, and L. Qiang, “Research on log Gabor wavelet and its application in image edge detection,” in Proceedings of the 6th International Conference on Signal Processing, 2002, pp. 592–595.

[10] M. Najjarzadeh and A. Ayatollahi, “A comparison between Genetic Algorithm and PSO for linear phase FIR digital filter design,” in IEEE International Conference on Signal Processing (ICSP), 2008, pp. 2134–2137.

[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in Proceedings of CVPR 2005, 2005, pp. 947–954.

[12] “PolyU Palmprint Database,” www.comp.polyu.edu.hk/~biometrics/.

[13] R. M. Bolle, N. K. Ratha, and S. Pankanti, “An evaluation of error confidence interval estimation methods,” in Proceedings of ICPR 2004, 2004, pp. 103–106.

[14] S. Shan, P. Yang, X. Chen, and W. Gao, “AdaBoost Gabor Fisher classifier for face recognition,” in Proceedings of AFGR, 2005, pp. 278–291.

[15] G. Bebis, S. Uthiram, and M. Georgiopoulos, “Face detection and verification using genetic search,” International Journal on Artificial Intelligence Tools, vol. 6, no. 2, pp. 225–246, 2000.

[16] F. Ferri, P. Pudil, M. Hatef, and J. Kittler, “Comparative study of techniques for large scale feature selection,” Pattern Recognition in Practice IV, Elsevier Science, pp. 403–413, 1994.


Page 53: ADCOM 2009 Conference Proceedings

ADCOM 2009
GRID SERVICES

Session Papers:

1. Srikumar Venugopal, James Broberg and Rajkumar Buyya, “OpenPEX: An Open Provisioning and EXecution System for Virtual Machines”

2. Saurabh Kumar Garg, “Exploiting Grid Heterogeneity for Energy Gain”

3. Snehal Gaikwad, Aashish Jog and Mihir Kedia, “Intelligent Data Analytics Console”


Page 54: ADCOM 2009 Conference Proceedings

OpenPEX: An Open Provisioning and EXecution System for Virtual Machines

Srikumar Venugopal
School of Computer Science and Engineering,
University of New South Wales, Australia
Email: [email protected]

James Broberg and Rajkumar Buyya
Department of Computer Science and Software Engineering,
The University of Melbourne, Australia
Email: {brobergj, raj}@csse.unimelb.edu.au

Abstract

Virtual machines (VMs) have become capable enough to emulate full-featured physical machines in all aspects. Therefore, they have become the foundation not only for flexible data center infrastructure but also for commercial Infrastructure-as-a-Service (IaaS) solutions. However, current providers of virtual infrastructure offer simple mechanisms through which users can ask for immediate allocation of VMs. More sophisticated economic and allocation mechanisms are required so that users can plan ahead and IaaS providers can improve their revenue. This paper introduces OpenPEX, a system that allows users to provision resources ahead of time through advance reservations. OpenPEX also incorporates a bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers. These functions are made available to users through a web portal and a REST-based Web service interface.

1 Introduction

In the past, many networked services (such as web sites, databases and computational services) were hosted on dedicated physical hardware, which was configured exclusively to suit application-dependent requirements. However, recent hardware and software advances have made it possible to host these services within Virtual Machines (VMs) on commodity x86 hardware with minimal overhead. Virtualisation solutions such as Xen [2] and VMWare [1] create a virtual representation of a complete physical machine, enabling operating systems (and any running applications) to be de-coupled from the physical hardware. This allows improved utilisation and consolidation of computing infrastructure by multiplexing many VMs onto one physical host.

These developments provide an interesting framework where a lightweight VM image can be a unit of execution (i.e. instead of a task or process on a shared system) and migration. Migration of VMs on the same subnet can be achieved in a totally transparent, work-conserving fashion, without the running applications, any dependent clients or external resources being aware that it has occurred [7, 5]. Also, VMs enable workload isolation and can be shut down without adverse effects on other concurrent VMs.

The ability to fashion VMs into expendable resource units has led to their use for ad-hoc deployment of computational infrastructure in order to meet sudden spikes in resource demand. VMs also enable users to create computing environments customised to suit the requirements of their specific e-Science or e-Business applications. These capabilities have led to the advent of infrastructure provisioning services, both private (within an enterprise or organisation) and commercial. Providers and consumers of such services negotiate Service Level Agreements (SLAs) that encapsulate the user requirements in terms of Service Level Objectives (SLOs), and the rewards and penalties for meeting and violating them respectively. Therefore, the provider's aim is to maximise its own return on investment by maximising resource utilisation while avoiding or minimising penalties caused by SLA violations.

VMs have also become the ‘enabling technology’ behind the recent emergence of Cloud Computing [4]. Infrastructure as a Service (IaaS) providers such as Amazon EC2, GoGrid or Mosso Cloud Servers have emerged that offer virtualised machines (or resource slices) that can be obtained under a pay-per-use arrangement (i.e. no commitment required, utility style pricing). Users can take advantage of elastic capacity, where they can scale up and scale down on demand using a self-service interface, such as a


Page 55: ADCOM 2009 Conference Proceedings

Web Service or Web Portal. The resources themselves (such as compute and storage) are highly abstracted or virtualised.

However, IaaS service models are evolving and currently most providers operate on a lease model – the users pay for the time the VM was active. Also, the choices available to the user in terms of specifying requirements are limited. Such models do not allow for more flexible strategies where the user can reduce both his costs and risk by booking his resources in advance. An advance reservation provides a guaranteed allocation of the resources at the needed time to the consumer and helps the provider plan capacity requirements better. However, advance reservations induce new challenges in resource management and require new architectures for realisation. In this paper, we introduce OpenPEX, a utility-based virtual infrastructure manager that enables users to reserve VM instances in advance. OpenPEX also offers a bilateral negotiation protocol that allows users and providers to exchange offers and counter-offers, and come to an agreement that is mutually beneficial. In the next section, we distinguish the contributions of OpenPEX from the state-of-the-art. Section 3 discusses the design and implementation of OpenPEX at length. Section 4 discusses the Web Service interface to OpenPEX and finally, we conclude the paper with details of our future plans for the system.

2 Related Work

Virtual Machine technology has become an essential enabling technology of Cloud Computing environments. Cloud Computing is a style of computing where resources can be obtained in a pay-per-use manner (no commitment, utility pricing). Such resources have elastic capacity, where they can be scaled up and down on demand. Resources are highly abstracted and virtualised, and can be obtained via a self-service interface.

VMs are highly attractive for managing resources in such environments as they improve utilisation by multiplexing many VMs on one physical host (consolidation), allow agile deployment and management of services, and provide on-demand cloning, (live) migration and checkpointing, which improves reliability. Furthermore, a VM can be a self-contained unit of execution and migration. As such, effective management of VMs and Virtual Machine infrastructure is critical for any Cloud Computing Infrastructure as a Service (IaaS) provider.

2.1 Public IaaS Cloud Services

Amazon Elastic Compute Cloud (EC2) is an IaaS service that provides resizable compute capacity in the cloud. These services can be leveraged via Web Services (SOAP or REST), a web-based AWS Management Console or the EC2 Command Line Tools. The Amazon service provides hundreds of pre-made AMIs (Amazon Machine Images), giving users a wide choice of operating systems (i.e. Windows or Linux) and pre-loaded software. Instances come in different sizes, from Standard Instances (S, L, XL), which have proportionally more RAM than CPU, to High CPU Instances (M, XL), which have proportionally more CPU than RAM. A user can deploy these instances in two different regions, US-East and EU-West, with EU instances costing more per hour than their US counterparts.

Amazon EC2 provides an alternative to its on-demand instances, known as a reserved instance. This facility offers a number of benefits over simply requesting instances on demand, as it provides a lower per-hour rate and assurances that any reserved instance you launch is guaranteed to succeed (provided you have booked them in advance). That is, users of such instances should not be affected by any transient limitations in EC2 capacity.

2.2 Private IaaS Cloud Platforms

Many different platforms exist to assist with the deployment and management of virtual machines on a virtualised cluster (i.e. a cluster running Virtual Machine software). Such platforms are often referred to as ‘Private Clouds’, as they can bring the benefits of Cloud Computing (such as elasticity, dynamic provisioning and multiplexing workloads onto fewer machines) into local clusters.

Eucalyptus [9, 8] is an open-source (BSD-licensed) software infrastructure for implementing an Infrastructure as a Service (IaaS) Compute Cloud on commodity hardware. Eucalyptus is notable for offering a Web Service interface that is fully Amazon Web Services (AWS) API compliant. Specifically, it emulates Amazon's Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Elastic Block Store (EBS) services at the API level. However, as the implementation details of Amazon's services are not published, Eucalyptus' internal implementation would differ.

OpenNebula [11] is an open-source Virtual Infrastructure management software package that supports dynamic resizing, partitioning and scaling of computing resources. OpenNebula can be deployed in private, public or hybrid Cloud models. The OpenNebula software turns an existing cluster into a private cloud, which can be used privately or can expose services to the public via XML-RPC Web Services. The integration of Cloud plugins (EC2, GoGrid) enables a hybrid model, where private and public resources can be mixed and matched. Haizea [11] has extended OpenNebula further, allowing resource providers to lease their resources using sophisticated leasing arrangements, instead of only providing on-demand VMs like most other IaaS services.


Page 56: ADCOM 2009 Conference Proceedings

2.3 Issues with existing solutions

With the exception of OpenNebula (when used in conjunction with the Haizea extension) and Amazon EC2 (when used with Reserved Instances), none of the above public or private platforms offer the ability to perform an Advanced Reservation of computing resources; rather, they only supply on-demand capacity on a best-effort basis (i.e. if adequate resources are available). Furthermore, none of these platforms provide an alternate offer (that is, a modified offering that can satisfy a user's request but may differ from their initial request) in the event that the system cannot satisfy a user's specific request for resources.

Whilst Haizea supports a form of Advanced Reservation (which it denotes as advanced reservation leases), if the request cannot be satisfied there is no recourse: the request will be rejected. Under the same circumstances, the OpenPEX system enacts a bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, so a user's advanced reservation request can still be satisfied.

Amazon EC2 offers its own variation on the notion of Advanced Reservation with its Reserved Instances product. However, you need to purchase a Reserved Instance for every instance you wish to guarantee to be available at some point in the future. This essentially requires the end user to forecast exactly how many instances they will require in advance. Acquisition of a Reserved Instance is not instantaneous either; in the authors' experience, a request for a Reserved Instance has taken more than an hour on previous occasions.

2.4 Utility Computing Platforms

With increasing popularity and usage, large Grid installations are facing new problems, such as excessive spikes in demand for resources coupled with strategic and adversarial behaviour by users. Traditional Grid resource management techniques did not ensure fair and equitable access to resources in many systems. Traditional metrics (throughput, waiting time, slowdown) failed to capture the more subtle requirements of users. There were no incentives for users to be flexible about resource requirements or job deadlines, nor provisions to accommodate users with urgent work.

In such systems, users assign a "utility" value to their jobs, where utility is a fixed or time-varying valuation that captures various QoS constraints (deadline, importance, satisfaction). The valuation is the amount they are willing to pay a service provider to satisfy their demands. Service providers attempt to maximise their own utility, where utility may directly correlate with their profit. Providers can prioritise high-yield (i.e. profit per unit of resource) user jobs, and shared Grid systems are then viewed as a marketplace where users compete for resources based on the perceived utility or value of their jobs. Further information on, and a comparison of, these utility computing environments are available in an extensive survey of these platforms [3].

Figure 1. The OpenPEX Resource Manager (components: PEX Portal and Web Service, Reservation Manager, Allocator, Node Monitor, VM Monitor, Event Queue, PEX Resource Manager and Xen Dispatcher, sitting above the Xen Pool Manager and a Xen cluster of physical nodes).

3 OpenPEX

OpenPEX was constructed around the notion of using advance reservations as the primary method for allocating VM instances. The use case followed here is of a user who, either through a web portal or through the web service, makes a reservation for any number of instances of a Virtual Machine that have to be started at a specific time and have to last for a specific duration. The VM is described by a template that is already registered in the system. If the request can be satisfied at the price asked by the user, then OpenPEX creates the reservation; otherwise it creates a counter-offer with an alternate time interval in which the request can be accommodated. The counter-offer may instead specify a different price for the original time interval. Once the reservations have been finalised, the user can choose to activate the reservation or have OpenPEX automatically start the instances when required.


Page 57: ADCOM 2009 Conference Proceedings

These requirements motivate a resource management system that is able to manage physical nodes in such a manner that it maintains the capacity to satisfy as many advance reservation requests as possible. Such adaptive management may be enabled by a variety of techniques including load forecasting, migrating existing VMs to other resources, and/or suspending some of the VMs in order to increase the available resource share.

Figure 1 shows the architecture of the OpenPEX Resource Manager. The Resource Manager has the following components:

Reservation Manager - This component interacts with the users through the portal or the web service, and receives incoming requests. It examines them to check whether they are feasible according to the reservation policy employed, and creates counter-offers when required.

Allocator - Manages the allocation of VMs to physical nodes. The allocator's roles are: to create capacity for new VMs by triggering migration of existing VMs to other nodes, or by suspending long-running VMs; to identify physical nodes on which reservations can be activated; and to react to events such as the loss of a physical node.

Node Monitor - Monitors the health and load of the physical nodes.

VM Monitor - Monitors the health of the VMs that have been started in OpenPEX. It detects events such as a VM shut down by a user from the inside, a VM suddenly crashing, or a VM becoming unresponsive.

Dispatcher - Interacts with the virtual machine manager or the hypervisor on the physical nodes. It relays commands such as create, start or shutdown a VM to the virtual machine manager. The dispatcher is the only component that is specific to the underlying virtual machine manager; the rest of OpenPEX is designed to be independent of the underlying infrastructure.

All these components are connected to an Event Queue that acts as a simple message bus for the system. The Event Queue also enables scheduling of future tasks by allowing delayed events. The entire system is backed by a Persistence layer that saves the current state to a database.

3.1 Negotiating Advance Reservations

As described previously, OpenPEX allows advance reservations to be negotiated bilaterally between the producers and consumers. For this,

Figure 2. Alternate Offers Negotiation for Advance Reservations (sequence diagram between the User/Portal/WS, the PEX Resource Manager, a Cluster Node and the Xen Pool Manager, showing initiateReservation/reservationID, requestReservation with ACCEPT/REJECT/COUNTER replies, the CONFIRM exchange, and the subsequent activation and start of VM instances).

we have employed a protocol based on the Alternate Offers mechanism [10], which was previously used for negotiation of SLAs in an enterprise Grid framework [12]. The implementation of this protocol in OpenPEX is shown in Figure 2. The user opens the interaction by sending an initiateReservation request, in reply to which OpenPEX returns a unique reservationID identifier. This identifier acts as a handle for the session and, if the reservation goes through, for the rest of its life-cycle. The user then submits a proposal through a requestReservation call. The proposal contains a description of the VM being requested (e.g. instance size), the number of instances required, the start time for activating the reservation and the duration for which the reservation is required. The instance sizes are detailed in the next section and examples of these descriptions are given in Section 4.

In return, OpenPEX can respond with: ACCEPT, if theproposal is acceptable; REJECT, if the proposal cannot besatisfied in any manner; and COUNTER, if the reservationrequired cannot be fulfilled with the parameters given in the


Page 58: ADCOM 2009 Conference Proceedings

Figure 3. OpenPEX Entity Relationship Diagram (an OpenPEXUser has Reservations; a Reservation activates Instances; Reservations and Instances map to ReservationNodes and OpenPEXNodes respectively).

Figure 4. OpenPEX Welcome Screen.

proposal, but an alternative can be generated instead. With the last option, OpenPEX returns an alternative (or counter) proposal generated by replacing terms of the user's original proposal with those acceptable to it. For example, a user could ask for 5 instances of a Virtual Machine of small instance size (refer Section 3.2) with the Red Hat Enterprise Linux operating system. These instances have to be started at 10:00 a.m. on August 21, 2009 for six days, after which they can be shut down. However, OpenPEX may not be able to provision these instances on August 21, but may have free nodes for six days starting August 23. In this case, it will generate a counter proposal that replaces only the start time in the user's original proposal with the new start time. If OpenPEX is not able to provision the VMs in any case (e.g. the number of instances requested exceeds its capacity), then the proposal is rejected. When the user receives an ACCEPT, he can then reply with CONFIRM to confirm the reservation. In reply to a COUNTER, the user has the same three reply options. In case the user accepts the counter-proposal (through the reply ACCEPT), OpenPEX sends back a CONFIRM-REQUEST so that the user can reply back with a CONFIRM to confirm the reservation. This extra step is necessary as, even though the protocol is bilateral, only OpenPEX can confirm a reservation.
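The user-side handling of this exchange can be summarised by the following Python sketch (a reading of Figure 2 under our own naming; openpex is a hypothetical stub exposing the protocol calls, not part of OpenPEX itself):

    def negotiate(openpex, proposal, is_acceptable):
        # Drive the Alternate Offers exchange from the user's side.
        rid = openpex.initiate_reservation()          # returns the reservationID handle
        reply = openpex.request_reservation(rid, proposal)
        if reply.kind == "ACCEPT":
            openpex.confirm(rid)                      # user confirms directly
            return rid
        if reply.kind == "COUNTER" and is_acceptable(reply.proposal):
            # accepting a counter triggers a CONFIRM-REQUEST from OpenPEX,
            # after which the user must still send the final CONFIRM,
            # because only OpenPEX can confirm a reservation
            openpex.accept(rid)
            openpex.confirm(rid)
            return rid
        return None                                   # REJECT, or counter-offer declined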

Once the reservation is confirmed, the user can activate it by instantiating the VMs in the reservation after the agreed-upon start time. The user can start or shut down VMs at any time during the course of the reservation. Once the duration is over, the reservation expires, and all active VMs on that reservation are shut down.

Table 1. List of available VM configurations in OpenPEX.

Size     Configuration
SMALL    1 CPU, 768 MB RAM, 10 GB HDD
MEDIUM   2 CPU, 1.5 GB RAM, 20 GB HDD
LARGE    3 CPU, 2.5 GB RAM, 40 GB HDD
XLARGE   4 CPU, 3.5 GB RAM, 60 GB HDD

3.2 Resource Management in OpenPEX

Users are only able to ask for standardised configurations of virtual machines from the OpenPEX Resource Manager. These configurations depict the "sizes" of the virtual machines, as given in Table 1. The virtual machines can be paired with an operating environment chosen from the templates available in the OpenPEX database.

When a user request arrives at the Reservation Manager, the individual nodes are polled to determine which of them are free at the requested time. If more than one node is free, then the Manager chooses the most loaded of them to host the request. In case there are no free nodes available, an alternate time slot is requested from each of the nodes. The node that provides a starting time closest to the original request is temporarily locked, and the new starting time is sent as an alternate offer to the user via a COUNTER reply in the negotiation protocol.
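A sketch of this allocation policy in Python (the node interface - is_free, load, alternate_slot, lock_tentatively - is our own naming for illustration, not OpenPEX's API):

    def place_request(nodes, start, end, n_cpus):
        # Prefer the most loaded node that is free, which keeps the other
        # nodes open for future requests; otherwise build a counter-offer.
        free = [n for n in nodes if n.is_free(start, end, n_cpus)]
        if free:
            host = max(free, key=lambda n: n.load())      # most loaded free node
            return ("ACCEPT", host, start)
        # no free node: ask every node for its closest feasible start time
        offers = [(n.alternate_slot(start, end, n_cpus), n) for n in nodes]
        alt_start, host = min(offers, key=lambda o: abs(o[0] - start))
        host.lock_tentatively(alt_start, end, n_cpus)     # hold the slot during negotiation
        return ("COUNTER", host, alt_start)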


Page 59: ADCOM 2009 Conference Proceedings

Figure 5. OpenPEX Reservations Screen.

Figure 6. OpenPEX Instances Screen.

3.3 Web Portal Interface

OpenPEX is developed completely in Java and is deployed as a service in an application container on the cluster head node (or the storage node). It communicates with the pool manager using the Xen API and uses the Java Persistence API (JPA) for the persistence back-end. The database structure for PEX is depicted in Figure 3 and follows Object-Relational Mapping (ORM) for easy development and extensibility.

The OpenPEX system provides an easy-to-use Web Portal interface, enabling the user to access all the functionality of OpenPEX. Users can access the system via a web browser, register for an account and log in to the system. Upon logging in they are greeted by a simple Welcome Screen, depicted in Figure 4, which shows what functions are available to the end user.

The user can choose to make a new reservation, where they can choose the size of the reservation they wish to make (from the choices listed in Table 1), the start and end time, the template (i.e. Operating System) they wish to use and the number of instances they require. Their request can be accepted, or they can enter into a negotiation until they come to an agreement with the OpenPEX cluster.

Once this process has occurred, they can view their existing reservations and activate any unclaimed reservations via the Reservations screen. Figure 5 shows the Reservations screen with three reservations. If a reservation was not yet activated, a user can choose to delete it (if they no longer require it) or activate it, so the associated instances can start at the appropriate start time.

Virtual Machine instances can be viewed and manipulated via the Instances screen depicted in Figure 6. Here the user can view salient information regarding their VM instance, such as its machine name, status (e.g. HALTED, RUNNING, PAUSED, SUSPENDED), start time, end time, and IP address. An instance can also be stopped early (i.e. before its designated end time) if desired.

4 RESTful Web Service Interface

It is essential to provide programmatic access to the functions and capabilities of an OpenPEX cluster, in order for users to be able to dynamically request reservations from the system (i.e. scaling out during periods of peak load), or even to integrate an OpenPEX system into a wider pool of computing resources. As such, the full functionality of the OpenPEX system is exposed via Web Services, which are implemented in a RESTful (REpresentational State Transfer) style [6].

The REST-style architecture provides a clear and clean delineation between the functions of the client and the


Page 60: ADCOM 2009 Conference Proceedings

Table 2. OpenPEX RESTful Endpoints.

OpenPEX Operation            HTTP    Endpoint                                   Parameters      Return type
Create reservation           POST    /OpenPEX/reservations                      JSON (Fig. 7)   JSON (Fig. 8, 9)
Update reservation           PUT     /OpenPEX/reservations/requestId            JSON            JSON
Delete reservation           DELETE  /OpenPEX/reservations/requestId            None            HTTP 200 (OK)
Activate reservation         PUT     /OpenPEX/reservations/requestId/activate   None            HTTP 200 (OK)
Get reservation information  GET     /OpenPEX/reservations/requestId            None            JSON
List reservations            GET     /OpenPEX/reservations                      None            JSON
Get instance information     GET     /OpenPEX/instances/vm_id                   None            JSON
List instances               GET     /OpenPEX/instances                         None            JSON
Stop instance                PUT     /OpenPEX/instances/vm_id/stop              None            HTTP 200 (OK)
Reboot instance              PUT     /OpenPEX/instances/vm_id/reboot            None            HTTP 200 (OK)
Delete instance              DELETE  /OpenPEX/instances/vm_id                   None            HTTP 200 (OK)

server. A client performs operations on resources (such as reservations and instances), which are identified through standard URIs. The server returns a JSON¹ representation of the resource back to the client to indicate the current state of that resource. Clients can modify and delete these resources by altering and returning their representations as required. A client could be a Java or Python program, or a Web Portal management interface for the OpenPEX system.

"duration": 3600000,"numInstancesFixed": 1,"numInstancesOption": 0,"startTime": "Mon, 17 Aug 2009 04:49:03 GMT","template": "PEX Debian Etch 4.0 Template","type": "XLARGE"

Figure 7. Create reservation JSON body

Table 2 lists the functions exposed by the Web Service interface, along with their corresponding HTTP methods and endpoints. Some calls require a JSON body whilst others trigger their functionality by simply being accessed. The calls typically return a JSON object or array, or simply an HTTP code denoting whether an operation was a success or failure. All the methods listed require HTTP basic authentication. From this table we can see that the full OpenPEX life-cycle is exposed via the Web Services interface. Customers can create a new reservation, engage in the bilateral negotiation (via the Alternate Offers protocol described earlier in this paper) via the /OpenPEX/reservations endpoint, and finally activate their reservation. Once a reservation has been activated, the corresponding Virtual Machine instances are started at their designated start time. A user has control over these instances via the /OpenPEX/instances endpoint,

¹The application/json Media Type for JavaScript Object Notation (JSON) - http://tools.ietf.org/html/rfc4627

where they can stop, reboot or delete these instances.

"proposal":

"duration": 3600000,"id": "5D0FA0EB-90A8-F4E6-1DFF-61B2CEC6AD91","numInstancesFixed": 1,"numInstancesOption": 0,"startTime": "Mon, 17 Aug 2009 04:49:03 GMT","template": "PEX Debian Etch 4.0 Template","type": "XLARGE","userid": 1

,"reply": "ACCEPT"

Figure 8. Reply to reservation request

Figure 7 depicts the JSON body for a new reservation call. A user specifies the duration of the reservation, the number of instances required, the start time, the desired template (e.g. Operating System) required and the type (Table 1) of instance required. These preferences are expressed in the JSON body of the call. OpenPEX will then respond with a JSON reply that indicates the outcome of the request, which could be an acceptance of the proposed reservation (shown in Figure 8), a counter offer indicating an alternate reservation that could satisfy the user (shown in Figure 9), or an outright rejection of the proposed reservation.

Upon successfully obtaining a reservation in the system, a user can get the reservation record and activate the reservation. Once the reservation has been activated, the user can then operate on the instances themselves, obtaining the instance record and controlling the state of the Virtual Machine instance itself by stopping, rebooting or deleting it.
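As an illustration of this life-cycle, a hypothetical Python client built on the endpoints of Table 2 (the host, credentials, and the exact shape of the counter-offer acceptance via the update call are our assumptions; the JSON bodies follow Figures 7-9):

    import requests

    BASE = "http://openpex.example.org/OpenPEX"   # placeholder host
    AUTH = ("alice", "secret")                    # HTTP basic authentication

    def reserve_and_activate(body):
        # propose a reservation (Figure 7 body)
        reply = requests.post(BASE + "/reservations", json=body, auth=AUTH).json()
        if reply["reply"] == "REJECT":
            raise RuntimeError("reservation rejected")
        rid = reply["proposal"]["id"]
        if reply["reply"] == "COUNTER":
            # accept the alternate proposal by updating the reservation
            requests.put(f"{BASE}/reservations/{rid}",
                         json={"reply": "ACCEPT"}, auth=AUTH)
        # activate so the instances start at the agreed start time
        requests.put(f"{BASE}/reservations/{rid}/activate", auth=AUTH)
        return rid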

5 Conclusion and Future Work

In this paper we introduced OpenPEX, a system that allows users to provision resources ahead of time through advance


Page 61: ADCOM 2009 Conference Proceedings

"proposal":

"duration": 3600000,"id": "F07640D4-32BC-DDB6-457E-32B5595BA066","numInstancesFixed": 1,"numInstancesOption": 0,"startTime": "Mon, 17 Aug 2009 05:52:31 GMT","template": "PEX Debian Etch 4.0 Template","type": "XLARGE","userid": 1

,"reply": "COUNTER"

Figure 9. Counter Reply to reservation re-quest

reservations, instead of being limited to on-demand, best-effort resource acquisition. OpenPEX also incorporates a novel bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, in the event that a user's original request cannot be precisely satisfied.

The fundamental aim of OpenPEX is to harness virtual machines for adaptive provisioning of services on shared computing resources. Adaptive provisioning may involve a combination of: 1) creating new VMs to meet increases in demand; 2) migrating existing VMs to other available resources; and/or 3) suspending the execution of some VMs in order to increase the resource share available to others. These techniques are, however, governed by negotiated agreements between the users and the resource providers, and between providers, that encapsulate costs and guarantees for deployment and maintenance of VMs for services. Demand and supply for services in such an environment is, therefore, mediated by market-driven resource management mechanisms, thereby leading to a so-called utility computing environment.

As such, we are endeavouring to implement provisional market-based resource management techniques in the OpenPEX system to collect pricing and utilisation data, and to introduce policies and strategies for managing virtual machines in a market-driven utility computing environment. We intend to achieve this by:

1. Soliciting a wide range of users (from other faculties and other collaborating universities) to run VM-encapsulated workloads on the test-bed.

2. Measuring crucial pricing (using a simulated currency mechanism) and usage data from users of the system, which is difficult to obtain from commercial computing centres and largely absent from the literature.

3. Formulating market-driven policies for scheduling and migration in VM platforms based on the data collected.

4. Integrating market-driven scheduling and migration policies into OpenPEX.

5. Evaluating strategies for negotiating among multiple VM providers and users based on market conditions, in conjunction with the proposed market-driven policies.

References

[1] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2–13, San Jose, California, USA, 2006. ACM.

[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. ACM SIGOPS Operating Systems Review, 37(5):164–177, 2003.

[3] J. Broberg, S. Venugopal, and R. Buyya. Market-oriented Grids and Utility Computing: The state-of-the-art and future directions. Journal of Grid Computing, 6(3):255–276, 2008.

[4] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616, June 2009.

[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, pages 273–286. USENIX Association, 2005.

[6] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, 2000.

[7] M. Nelson, B. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 25–25, Anaheim, CA, 2005. USENIX Association.

[8] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: A Technical Report on an Elastic Utility Computing Architecture Linking Your Programs to Useful Systems. UCSB Computer Science Technical Report Number 2008-10.

[9] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus Open-source Cloud-computing System. Proceedings of Cloud Computing and Its Applications, 2008.

[10] A. Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, January 1982.

[11] B. Sotomayor, R. Montero, I. Llorente, I. Foster, and F. de Informatica. Capacity Leasing in Cloud Systems using the OpenNebula Engine. Cloud Computing and Applications 2008, 2008.

[12] S. Venugopal, X. Chu, and R. Buyya. A negotiation mechanism for advance resource reservations using the alternate offers protocol. In Proceedings of the 16th International Workshop on Quality of Service (IWQoS 2008), pages 40–49. IEEE Computer Society Press, Los Alamitos, CA, USA, June 2008.


Page 62: ADCOM 2009 Conference Proceedings

Exploiting Heterogeneity in Grid Computing for Energy-Efficient Resource Allocation

Saurabh Kumar Garg and Rajkumar Buyya
The Cloud Computing and Distributed Systems (CLOUDS) Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
Email: {sgarg, raj}@csse.unimelb.edu.au

Abstract—The growing computing demand from industry and academia has led to excessive power consumption, which impacts the long-term sustainability of Grid-like infrastructures not only in terms of energy cost but also from an environmental perspective. The problem can be addressed by switching to more energy-efficient infrastructure, but that process is both costly and time consuming. The fact that a Grid consists of several HPC centers under different administrative domains makes the problem more difficult. Thus, to reduce energy consumption, we address the challenge by effectively distributing compute-intensive parallel applications across the grid. We present a meta-scheduling algorithm that exploits the heterogeneous nature of the Grid to achieve a reduction in energy consumption. Simulation results show that our algorithm, HAMA, can significantly improve the energy efficiency of global grids, typically by 23% and by as much as 50% in some cases, while meeting users' QoS requirements.

I. INTRODUCTION

For many years, the global grid has served as a mainstream High Performance Computing (HPC) platform providing massive computational power to execute large-scale and compute-intensive scientific and technological applications. Enlarging the existing global grid infrastructure to meet the increasing demand from grid users can progressively speed up the advancement of science and technology. But the growing environmental and economic impact of the high energy consumption of HPC platforms has become a major bottleneck in the expansion of grid-like platforms.

In April 2007, Gartner estimated that the ICT industry is liable for 2% of global CO2 emissions annually, which is equal to the aviation industry [1][2]. In addition, high power consumption has not only led to a rapid increase in utility bills but also affects the reliability of servers due to highly concentrated heat loads. The power efficiency of an HPC center depends on a number of factors such as the processors' power efficiency, the cooling and air conditioning system, the infrastructure design and the lighting/physical system. A recent study [3] by Lawrence Berkeley National Laboratory shows that the cooling efficiency (the ratio of computer power to cooling power) of data centers varies drastically, from a low of 0.6 to a high of 3.5. Thus, sustainable and environmentally friendly solutions must be employed by the current HPC community to increase the energy efficiency of HPC systems, so that they make more effective use of electricity.

While a lot of research has been performed to increase the efficiency of individual clusters at various levels, such as the processor level (CPU) [4][5], in virtualization-based resource managers [6], and in cluster resource managers [7][8], research on improving the energy efficiency of global systems such as the grid is still in its infancy. Most of the existing grid meta-schedulers, such as the Maui/Moab scheduling suite [9], Condor-G [10], and GridWay [11], focus on improving system-centric performance metrics such as utilization, average load and application turnaround time. Others, such as the Gridbus Broker [12], focus on deadline and budget constrained scheduling. Thus, this paper examines how a grid meta-scheduler can exploit the heterogeneity of the global grid infrastructure to achieve a reduction in the energy consumption of the overall grid. In particular, we focus on designing a meta-scheduling policy that can be easily adopted by existing grid meta-schedulers without many changes to the current grid infrastructure. This work is also relevant to the emerging cloud computing paradigm when scaling of applications across multiple clouds is considered [13]. The key contributions of this paper are:

1) It defines a novel Heterogeneity Aware Meta-scheduling Algorithm (HAMA) that considers various factors contributing to the high energy consumption of grids, including cooling system efficiency and CPU power efficiency.

2) It demonstrates through extensive simulations using real workload traces that the energy efficiency of global grids can be improved by as much as 23% with HAMA.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 defines the grid meta-scheduling model. Section 4 describes HAMA. Section 5 explains the evaluation methodology and simulation setup for comparing HAMA with existing meta-scheduling policies. In Section 6, the performance results of HAMA are analyzed. Section 7 concludes the paper and presents future work.

II. RELATED WORK

This section presents related work on energy-efficient/power-aware scheduling on grids. To the best of our knowledge, no previous work has proposed a meta-scheduler that explicitly addresses the energy efficiency of grids from a global perspective.

Currently, meta-schedulers in operation for global grids, such as GridWay [11], use heuristics such as First Come First


Page 63: ADCOM 2009 Conference Proceedings

Serve (FCFS). Moab also has an FCFS batch scheduler with an EASY backfilling policy [9], [14]. Condor-G [10] uses either FCFS or matchmaking with priority sort [15] as scheduling policies. These schedulers mostly schedule jobs with goals such as minimizing job completion time and achieving load balancing. The issue of the energy consumption of grids still needs to be addressed.

There are several research efforts on power-aware resource allocation to optimize energy consumption at a single resource site, typically within a single cluster or data center. The power usage reduction within the resource site is achieved through two methods: by switching off parts of the cluster that are not utilized [16], [17], [18], [7]; or by Dynamic Voltage Scaling (DVS) to slow down the speed of CPU processing [19], [20], [21], [22], [8], [23], [24], [7]. Hence, these efforts help reduce the energy consumption of one resource site, such as a cluster or server farm, but not across multiple resource sites distributed geographically.

Orgerie et al. [16] propose a prediction algorithm to reduce the power consumption in a large-scale computational grid such as Grid'5000 by aggregating the workload and switching off unused CPUs. They focus on reducing CPU power consumption to minimize the total energy consumption. As the power efficiency of grid sites can vary across the grid, reducing CPU power consumption by itself may not necessarily lead to a global reduction in the energy consumption of the entire grid. We focus on conserving the energy of grids from a global perspective.

Meisner et al. [19] show that in the case of a high and unpredictable workload, it is difficult to exploit the power on/off facility, even though it is ideal to simply switch off idle systems. Thus, DVS-enabled CPUs are much better at saving energy in this case. Therefore, in this work we use DVS to reduce the energy consumption of CPUs, since our main focus is on large-scale computational grid resource sites, which generally have unpredictable workloads.

III. GRID META-SCHEDULING MODEL

A. System Model

A grid meta-scheduler acts as an interface to grid resource sites and schedules jobs on behalf of users, as shown in Figure 1. It interprets and analyzes the service requirements of a submitted job and decides whether to accept or reject the job based on the availability of CPUs. Its objective is to schedule jobs so that the energy consumption of the grid can be reduced while the Quality of Service (QoS) requirements of the jobs are met. As grid resource sites are located in different geographical regions, they have different power efficiencies of CPUs and cooling systems. Each resource site is responsible for updating this information at the meta-scheduler for energy-efficient scheduling. The two participating parties, grid users and grid resource sites, are discussed below along with their objectives and constraints:

1) Grid Users: Grid users submit parallel jobs with QoS requirements to the grid meta-scheduler. Each job must be executed

Fig. 1. Meta-scheduling protocol (actors: users, meta-scheduler, local scheduler, resource provider; steps: 1. job request from users with deadline; 2. resource site energy-efficiency related parameters; 3. find most energy-efficient resource site; 4. meta-scheduler sends jobs to local scheduler for execution; 5. scheduling of job for matched time slot; 6. acknowledge user about resource match)

on an individual grid resource site and does not have preemptive priority. The reason for this requirement is that the synchronization among the various tasks of parallel jobs can be affected by communication delays when jobs are executed across multiple resource sites. The user's objective is to have his job completed by a deadline. Deadlines are hard, i.e., the user will benefit only if the job completes before its deadline [25]. To facilitate the comparison between the algorithms described in this work, the estimated execution time of a job provided by the user is considered to be accurate [26]. Several models, such as those proposed by Sanjay and Vadhiyar [27], can be applied to estimate the runtime of parallel jobs. In this work, a job's execution time is inversely proportional to the CPU operating frequency.

2) Grid Resource Sites: Grid resource sites consist of clusters at different locations, such as the sites of the Distributed European Infrastructure for Supercomputing Applications (DEISA) [28], with resource sites located in various European countries, and the LHC Grid across the world [29]. Each resource site has a local scheduler that manages the execution of incoming jobs. Each local scheduler periodically supplies information about available time slots (ts, te, N) to the meta-scheduler, where ts and te are the start time and end time of the slot respectively, and N is the number of CPUs available for the slot. To facilitate energy-efficient computing, each local scheduler also supplies information about the cooling system efficiency, the CPU power-frequency relationship, and the CPU operating frequencies of the grid resource site. All CPUs within a single resource site are homogeneous, but CPUs can be heterogeneous across resource sites.

B. Grid Resource Site Energy Model

The major contributors to the total energy usage of a grid resource site are the computing devices (CPUs) and the cooling system,


Page 64: ADCOM 2009 Conference Proceedings

which together constitute about 80% of total energy consumption. Other systems, such as lighting, are not considered due to their negligible contribution to the total energy cost.

The power consumption P of a CPU at a grid resource site is composed of dynamic and static power [21][7]. The static power includes the base power consumption of the CPU and the power consumption of all other components. Thus, the CPU power P is approximated by the following function (similar to previous work [21][7]): P = β + αf³, where β is the static power consumed by the CPU, α is the proportionality constant, and f is the frequency at which the CPU is operating. We consider that CPUs support the DVS facility and thus their frequency can be varied discretely from a minimum of fmin to a maximum of fmax. Let Ni be the number of CPUs at a resource site i. Then, if CPU j runs at frequency fj for time tj, the total energy consumption due to computation is given by:

Ec,i = Σ_{j=1..Ni} (βi + αi fj³) tj   (1)

The energy cost of an cooling system depends on the Coeffi-cient Of Performance (COP) factor of the cooling system [30].COP is indication of efficiency of cooling system which isdefined as the ratio of the amount of energy consumed byCPUs to the energy consumed by the cooling system. TheCOP is however not constant and varies with cooling airtemperature. We assume that COP will remain constant duringscheduling cycle and resource sites will update meta-schedulerwhenever COP changes. Thus, the total energy consumed bycooling system is given by:

E_{h,i} = \frac{E_{c,i}}{COP_i} \qquad (2)

Thus, the resultant total energy consumption by a grid resource site is given by:

E_i = \left(1 + \frac{1}{COP_i}\right) E_{c,i} \qquad (3)
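As a concrete reading of Equations (1)-(3), the following minimal Java sketch computes a site's total energy from per-CPU frequencies and busy times. It is illustrative only: the class and variable names (SiteEnergy, beta, alpha, cop) are ours, the sample beta/alpha values are borrowed from the RAL entry of Table II, and the COP value is assumed.

/** Minimal sketch of the site energy model of Eqs. (1)-(3); names are illustrative. */
public class SiteEnergy {

    /** Computation energy E_c,i: sum over CPUs of (beta + alpha * f^3) * t, Eq. (1). */
    static double computeEnergy(double beta, double alpha, double[] freq, double[] time) {
        double ec = 0.0;
        for (int j = 0; j < freq.length; j++) {
            ec += (beta + alpha * Math.pow(freq[j], 3)) * time[j];
        }
        return ec;
    }

    /** Total energy E_i = (1 + 1/COP_i) * E_c,i, Eq. (3); the cooling share is E_c,i / COP_i, Eq. (2). */
    static double totalEnergy(double ec, double cop) {
        return (1.0 + 1.0 / cop) * ec;
    }

    public static void main(String[] args) {
        double[] f = {1.8, 1.8};      // per-CPU operating frequencies (GHz)
        double[] t = {100.0, 250.0};  // seconds each CPU is busy
        double ec = computeEnergy(65, 7.5, f, t);  // beta, alpha as in Table II's RAL site
        System.out.printf("E_c = %.1f, E_total = %.1f%n", ec, totalEnergy(ec, 2.0)); // COP assumed
    }
}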

IV. HETEROGENEITY AWARE META-SCHEDULING ALGORITHM (HAMA)

This section gives the details of our Heterogeneity Aware Meta-scheduling Algorithm (HAMA), which enables the grid meta-scheduler to select the most energy-efficient grid resource site. The grid meta-scheduler runs HAMA periodically to assign jobs to grid resource sites. HAMA achieves this by first selecting the most energy-efficient grid resource site and then by using DVS for a further reduction in the energy consumption. Algorithm 1, described next, shows the pseudo-code for HAMA. At each scheduling interval, the meta-scheduler collects information from both grid resource sites and users (Algorithm 1: Lines 2-3). Considering that a grid consists of n resource sites (supercomputer centers), all parameters associated with each resource site i are given in Table I. A user submits his QoS requirements for a job j in the form of a tuple (d_j, n_j, e_j, f_{m,j}), where d_j is the deadline to complete

job j, n_j is the number of CPUs required for job execution, and e_j is the job execution time when operating at the CPU frequency f_{m,j}. In addition, let f_{ij} be the initial frequency at which the CPUs of grid resource site i operate while executing job j. HAMA then sorts the incoming jobs based on Earliest Deadline First (EDF) (Algorithm 1: Line 4). The grid resource sites are sorted in order of their power efficiency (Algorithm 1: Line 5), which is calculated as Cooling system efficiency × CPU efficiency, i.e., (1 + \frac{1}{COP_i}) \times (\frac{\beta_i}{f_i^{max}} + \alpha_i (f_i^{max})^2). Then, the meta-scheduler assigns jobs to resource sites according to this ordering (Algorithm 1: Lines 7-11).

Algorithm 1: HAMA

1   while current time < next schedule time do
2       RecvResourcePublish(P)    // P contains information about grid resource sites
3       RecvJobQoS(Q)             // Q contains information about grid users
4       Sort jobs in ascending order of deadline
5       Sort resource sites in ascending order of (1 + 1/COP_i) × (β_i/f_i^max + α_i (f_i^max)^2)
6       foreach job j ∈ RecvJobQoS do
7           foreach resource site i ∈ RecvResourcePublish do
                // find time slot for scheduling job j at resource site i
8               if FindTimeSlot(i, j) then
9                   Schedule job j on resource site i using DVS;
                    Update available time slots at resource site i
10                  break
11  end
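The site ordering of Algorithm 1 (Line 5) amounts to sorting by the metric (1 + 1/COP_i)(β_i/f_i^max + α_i (f_i^max)^2). A hedged Java sketch of that ranking step follows; β, α and f^max are taken from Table II, while the COP values and all names are our assumptions.

import java.util.Arrays;
import java.util.Comparator;

/** Illustrative sketch of Algorithm 1, Line 5: rank sites by power efficiency. */
class Site {
    final String name;
    final double beta, alpha, fMax, cop;
    Site(String name, double beta, double alpha, double fMax, double cop) {
        this.name = name; this.beta = beta; this.alpha = alpha; this.fMax = fMax; this.cop = cop;
    }
    /** (1 + 1/COP) * (beta/fMax + alpha*fMax^2): lower means more energy-efficient. */
    double powerEfficiencyMetric() {
        return (1.0 + 1.0 / cop) * (beta / fMax + alpha * fMax * fMax);
    }
}

public class SiteRanking {
    public static void main(String[] args) {
        Site[] sites = {                          // beta, alpha, fmax from Table II; COPs assumed
            new Site("RAL, UK", 65, 7.5, 1.8, 2.0),
            new Site("LYON (France)", 90, 4.5, 3.0, 3.0),
            new Site("Torina (Italy)", 90, 4.0, 3.2, 1.0)
        };
        // Ascending order: the most energy-efficient site is tried first
        Arrays.sort(sites, Comparator.comparingDouble(Site::powerEfficiencyMetric));
        for (Site s : sites)
            System.out.printf("%-16s %.1f%n", s.name, s.powerEfficiencyMetric());
    }
}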

TABLE I. PARAMETERS OF A GRID RESOURCE SITE i

Parameter                                              Notation
Average cooling system efficiency                      COP_i
CPU power                                              P_i = β_i + α_i f^3
CPU frequency range                                    [f_i^min, f_i^max]
Time slots (start time, end time, number of CPUs)      (t_s, t_e, n)

The energy consumption is further reduced by scheduling jobs using DVS at the CPU level, which can save energy by scaling down the CPU frequency. Thus, when the grid meta-scheduler assigns a job to a grid resource site, it also decides the time slot in which the job should be executed at the minimum frequency level, to decrease the energy consumed by the CPU (Algorithm 1: Line 8). If the job deadline is violated, the meta-scheduler scales the CPU frequency up to the next level and then again tries to find a free slot to execute the job. The execution time of an application is considered to increase linearly as the CPU frequency decreases. Thus, at the next CPU frequency level, since the CPU will be executing the application at a higher frequency, the required time slot will be shorter.
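The scale-up loop can be sketched as follows, using the Section III model that execution time is inversely proportional to CPU frequency. FindTimeSlot is reduced here to a plain deadline check, standing in for the slot query the paper only names; all values in main are assumed.

/** Illustrative sketch of the DVS scale-up loop: try the lowest frequency first,
 *  and raise it one level whenever the resulting runtime would miss the deadline. */
public class DvsScheduler {

    /** Placeholder for the slot search the paper names FindTimeSlot(i, j).
     *  Assumption: a slot exists iff the job can finish before its deadline. */
    static boolean findTimeSlot(double runtime, double deadline) {
        return runtime <= deadline;
    }

    /** Returns the chosen frequency, or -1 if no level meets the deadline. */
    static double schedule(double baseTime, double baseFreq, double[] levels, double deadline) {
        for (double f : levels) {                      // levels sorted ascending, fmin..fmax
            double runtime = baseTime * baseFreq / f;  // runtime inversely proportional to f
            if (findTimeSlot(runtime, deadline)) return f;
        }
        return -1;                                     // forward job to the next site instead
    }

    public static void main(String[] args) {
        double[] levels = {0.9, 1.2, 1.5, 1.8, 2.4};   // 5 discrete DVS levels (assumed values)
        // Picks 1.8 GHz: 100s * 2.4 / 1.8 = 133s, the first level that meets the 150s deadline.
        System.out.println(schedule(100.0, 2.4, levels, 150.0));
    }
}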

Since the CPUs at a resource site may or may not have the DVS facility, the scheduling at the local scheduler level can be of two types: CPUs run at the maximum frequency (i.e., without DVS), or CPUs run at various frequencies using DVS


Page 65: ADCOM 2009 Conference Proceedings

(i.e., with DVS). If the meta-scheduler fails to schedule the job on the resource site because no free slot is available, then the job is forwarded to the next most energy-efficient resource site for scheduling.

V. PERFORMANCE EVALUATION

We use workload traces from Feitelson's Parallel Workload Archive (PWA) [31] to model the global grid workload. Since this paper focuses on studying the application requirements of grid users, the PWA meets our objective by providing job traces that reflect the characteristics of real parallel applications. The experiments utilize the jobs in the first week of the LLNL Thunder trace (January 2007 to June 2007). The LLNL Thunder trace from the Lawrence Livermore National Laboratory (LLNL) in the USA is chosen due to its highest resource utilization (87.6%) among the available traces, to ideally model a heavy workload scenario. From this trace, we obtain the submit time, requested number of CPUs, and actual runtime of jobs. However, the trace does not contain the service requirement of jobs (i.e., the deadline). Hence, we use a methodology proposed by Irwin et al. [32] to synthetically assign deadlines through two classes, namely Low Urgency (LU) and High Urgency (HU).

A job i in the LU class has a high ratio of deadline_i/runtime_i, so that its deadline is definitely longer than its required runtime. Conversely, a job i in the HU class has a low deadline ratio. Values are normally distributed within each of the high and low deadline parameters. The ratio of the deadline parameter's high-value mean to its low-value mean is thus known as the high:low ratio. In our experiments, the deadline high:low ratio is 3, while the low-value deadline mean and variance are 4 and 2 respectively. In other words, LU jobs have a high-value deadline mean of 12, which is 3 times longer than that of HU jobs with a low-value deadline mean of 4. The arrival sequence of jobs from the HU and LU classes is randomly distributed.
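A hedged sketch of this deadline synthesis follows. The exact mapping of the normally distributed deadline parameter onto a deadline is our assumption (deadline = parameter × runtime, floored at the runtime itself); the low mean 4, variance 2, and high:low ratio 3 are taken from the text.

import java.util.Random;

/** Illustrative sketch of the synthetic deadline assignment of Irwin et al. [32],
 *  as configured in this paper: low-urgency mean 12, high-urgency mean 4, variance 2. */
public class DeadlineSynth {
    static final Random RNG = new Random(1);

    static double deadline(double runtime, boolean highUrgency) {
        double mean = highUrgency ? 4.0 : 12.0;                     // high:low ratio = 3
        double ratio = mean + RNG.nextGaussian() * Math.sqrt(2.0);  // variance 2
        ratio = Math.max(ratio, 1.0);   // assumption: a hard deadline is never below the runtime
        return ratio * runtime;         // assumption: the parameter scales the runtime
    }

    public static void main(String[] args) {
        System.out.printf("HU job, 100s runtime -> deadline %.0fs%n", deadline(100, true));
        System.out.printf("LU job, 100s runtime -> deadline %.0fs%n", deadline(100, false));
    }
}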

Provider Configuration: The grid modelled in our simulation contains 8 resource sites spread across five countries, derived from the European Data Grid (EGEE) testbed [29]. The configurations assigned to the resources in the testbed for the simulation are listed in Table II. The configuration of each resource site is chosen so that the modelled testbed reflects the heterogeneity of platforms and capabilities that is normally characteristic of such installations. The power parameters (i.e., CPU power factors and frequency levels) of the CPUs at the different sites are derived from Wang and Lu's work [7]. Current commercial CPUs only support discrete frequency levels; for example, the Intel Pentium M 1.6 GHz CPU supports 6 voltage levels. We consider discrete CPU frequencies with 5 levels in the range [f_i^min, f_i^max]. For the lowest frequency f_i^min, we use the same value used by Wang and Lu [7], i.e., f_i^min is 37.5% of f_i^max. Each local scheduler at a grid site uses Conservative Backfilling with advance reservation support, as used by Mu'alem and Feitelson [33]. The grid meta-scheduler schedules the jobs periodically at a scheduling interval of 50 seconds, which is

chosen to ensure that the meta-scheduler receives at least one job in every scheduling interval. The cooling system efficiency (COP) value of the resource sites is randomly generated using a uniform distribution between [0.5, 3.6], as indicated in the study conducted by Greenberg et al. [3].

Grid Meta-scheduling Algorithms: We examine the performance of HAMA in terms of the job selection and resource allocation of the grid meta-scheduler. We compare our job selection algorithm with EDF-FQ, which prioritizes jobs based on deadline and submits jobs to the resource site with the earliest start time (FQ), i.e., the least waiting time. We also compare HAMA with another version of HAMA, i.e., HAMA-withoutDVS, to analyze the effect of the DVS facility on energy consumption.

Performance Metrics: We consider two metrics: average energy consumption and workload (i.e., the amount of workload executed). Average energy consumption shows the amount of energy saved by using HAMA in comparison to other grid meta-scheduling algorithms, whereas workload shows HAMA's effect on the workload executed successfully by the grid.

Experimental Scenarios: We run the experiments under two scenarios: 1) urgency class and 2) arrival rate of jobs. For the urgency class, we use various percentages (0%, 20%, 40%, 60%, 80%, and 100%) of HU jobs. For instance, if the percentage of HU jobs is 20%, then the percentage of LU jobs is the remaining 80%. For the arrival rate, we apply various factors (10, 100, and 1000) to the submit times from the trace. For example, a factor of 10 means a job with a submit time of 10s from the trace now has a simulated submit time of 1s. Hence, a higher factor represents a higher workload by shortening the submit times of jobs.

From Equation 3, we know that the performance of HAMA is highly dependent on the CPU efficiency and the cooling system efficiency of the grid resource sites. We compare the performance of our algorithm in the worst-case scenario (HL), i.e., when the resource site with the highest CPU power efficiency has the lowest COP, and in the best-case scenario (HH), i.e., when the resource site with the highest CPU power efficiency also has the highest COP.

VI. PERFORMANCE RESULTS

A. Effect on Energy Consumption

This section compares the energy consumption of HAMA with that of other meta-scheduling algorithms for grid resource sites with HH and HL configurations. Figure 2 shows how the energy consumption varies with the deadline urgency and arrival rate of jobs. HAMA clearly outperforms its competitor EDF-FQ, saving about 17%-23% energy in the worst case and about 52% in the best case.

The effect of job urgency on energy consumption can be clearly seen from Figures 2(a) and 2(b). As the percentage of HU jobs with more urgent (shorter) deadlines increases, the energy consumption (Figures 2(a) and 2(b)) also increases, because more urgent jobs run on resource sites with lower power efficiency and at the highest CPU frequency to avoid deadline violations. On the other hand, the effect of the job arrival rate on


Page 66: ADCOM 2009 Conference Proceedings

TABLE II. CHARACTERISTICS OF GRID RESOURCE SITES
(CPU power factors: β, α, and f_i^max)

Location of Grid Site     β     α     f_i^max   No. of CPUs   MIPS Rating
RAL, UK                   65    7.5   1.8       2050          1140
Imperial College (UK)     75    5     1.8       2600          1200
NorduGrid (Norway)        60    60    2.4       650           1330
NIKHEF (Netherlands)      75    5.2   2.4       540           1176
LYON (France)             90    4.5   3.0       600           1166
Milano (Italy)            105   6.5   3.0       350           1320
Torina (Italy)            90    4.0   3.2       200           1000
Padova (Italy)            105   4.4   3.2       250           1330

the energy consumption (Figures 2(c) and 2(d)) is minimal, with a slight increase when more jobs arrive.

For grid resource sites without DVS, HAMA-withoutDVS can reduce the energy consumption by up to 15-21% (Figure 2(a)) in the HL configuration and by 28-50% (Figure 2(b)) in the HH configuration compared to EDF-FQ, which likewise does not consider the DVS facility while scheduling across the entire grid. This highlights the importance of the power efficiency factor in achieving energy-efficient meta-scheduling. In particular, HAMA can reduce energy consumption (Figures 2(a) and 2(b)) even more when there are more LU jobs with less urgent (longer) deadlines and the arrival rate is low.

When we compare HAMA and HAMA-withoutDVS, we observe that using DVS increases the energy saving by about 11% when the percentage of jobs with urgent deadlines and the job arrival rate are high. This is because, when the DVS facility is available, jobs can run at lower CPU frequencies to save energy.

B. Effect on Workload Executed

Figure 3 shows the total amount of workload successfully executed according to the users' QoS. The workload of a job refers to the product of its execution time and the number of CPUs required. The effect of job urgency and arrival rate on the workload executed can be clearly seen from Figures 3(a) and 3(d). All meta-scheduling algorithms show a consistent decrease in workload execution, particularly in the job urgency scenario. The reason is the rejection of more jobs due to deadline misses when all jobs are of high urgency. The amount of workload executed by EDF-FQ is less than that of HAMA because, while scheduling using EDF-FQ, the local scheduler executes the jobs using conservative backfilling without any consideration of job deadlines, whereas in the case of HAMA the meta-scheduler sends a job to a resource site only if a time slot is available to execute the job before its deadline.

VII. CONCLUSION

With the increasing demand for global grids, the energy consumption of grid infrastructure has escalated to the degree that grids are becoming a threat to society rather than an asset. The carbon footprint of grids may continue to increase unless the problem is addressed at every level, i.e., from local (within a single grid site) to global (across multiple grid sites).

Moreover, an immediate and significant reduction in CO2 emissions is required for the future sustainability of global grids.

In this paper, we have addressed the energy efficiency of grids at the meta-scheduling level. We proposed the Heterogeneity Aware Meta-scheduling Algorithm (HAMA) to address the problem by scheduling more workload with urgent deadlines on resource sites that are more power-efficient. Thus, HAMA considers crucial information about global grid resource sites, such as the cooling system efficiency (COP) and the CPU power efficiency. HAMA addresses the problem in two steps: 1) allocating jobs to more energy-efficient resource sites and 2) scheduling using a DVS policy at the local resource site to further reduce energy consumption.

Results show that HAMA can reduce energy consumption by up to 23% in the worst case and by up to 50% in the best case compared to other algorithms (EDF-FQ). Moreover, even if the DVS facility is not available, HAMA-withoutDVS can still yield considerable power savings of up to 21%. In particular, our HAMA algorithm works very well when the deadlines of jobs are less urgent and the arrival rate of jobs is not high. Thus, HAMA can also complement the efficiency of existing power-aware scheduling policies for clusters.

In the future, we will investigate how HAMA can address the energy consumption problem in virtualized environments such as clouds, which are the emerging platform for hosting business applications. We will also integrate HAMA with existing grid meta-schedulers and conduct experiments on real grid and cloud resources. We will further extend our current meta-scheduling model to resources such as storage disks and switching devices.

ACKNOWLEDGEMENTS

We would like to thank Chee Shin Yeo for his constructive comments on this paper. This work is partially supported by research grants from the Australian Research Council (ARC) and the Australian Department of Innovation, Industry, Science and Research (DIISR).

REFERENCES

[1] Gartner, "Gartner Estimates ICT Industry Accounts for 2 Percent of Global CO2 Emissions," http://www.gartner.com/it/page.jsp?id=503867.
[2] J. G. Koomey, "Estimating total power consumption by servers in the US and the world," http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf.


Page 67: ADCOM 2009 Conference Proceedings

[Figure 2 plots are not recoverable from the text. Panels: (a) HL: Energy Consumption VS Job Urgency; (b) HH: Energy Consumption VS Job Urgency; (c) HL: Energy Consumption VS Job Arrival Rate; (d) HH: Energy Consumption VS Job Arrival Rate. Axes: Average Energy Consumption vs. % of High Urgency (HU) Jobs or Increase in Job Arrival Rate. Series: HAMA, HAMA-withoutDVS, EDF-FQ.]

Fig. 2. Comparison of HAMA with other meta-scheduling algorithms

[3] S. Greenberg, E. Mills, B. Tschudi, P. Rumsey, and B. Myatt, "Best practices for data centers: Results from benchmarking 22 data centers," in Proc. of the 2006 ACEEE Summer Study on Energy Efficiency in Buildings, Pacific Grove, USA, 2006, http://eetd.lbl.gov/emills/PUBS/PDF/ACEEE-datacenters.pdf.
[4] V. Salapura et al., "Power and performance optimization at the system level," in Proc. of the 2nd Conference on Computing Frontiers, Ischia, Italy, 2005.
[5] A. Elyada, R. Ginosar, and U. Weiser, "Low-complexity policies for energy-performance tradeoff in chip-multi-processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 9, pp. 1243-1248, 2008.
[6] A. Verma, P. Ahuja, and A. Neogi, "pMapper: Power and migration cost aware application placement in virtualized systems," in Proc. of the 9th ACM/IFIP/USENIX International Conference on Middleware, Leuven, Belgium, 2008.
[7] L. Wang and Y. Lu, "Efficient power management of heterogeneous soft real-time clusters," in Proc. of the 2008 Real-Time Systems Symposium, Barcelona, Spain, 2008.
[8] K. Kim, R. Buyya, and J. Kim, "Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters," in Proc. of the 7th IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 2007.
[9] B. Bode et al., "The Portable Batch Scheduler and the Maui Scheduler on Linux clusters," in Proc. of the 4th Annual Linux Showcase and Conference, Atlanta, USA, 2000.
[10] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, "Condor-G: A computation management agent for multi-institutional grids," Cluster Computing, vol. 5, no. 3, pp. 237-246, 2002.
[11] E. Huedo, R. Montero, and I. Llorente, "A framework for adaptive execution in grids," Software: Practice and Experience, vol. 34, no. 7, pp. 631-651, 2004.
[12] S. Venugopal, K. Nadiminti, H. Gibbins, and R. Buyya, "Designing a resource broker for heterogeneous grids," Software: Practice and Experience, vol. 38, no. 8, pp. 793-825, 2008.
[13] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. 25, no. 6, pp. 599-616, 2009.
[14] Y. Etsion and D. Tsafrir, "A short survey of commercial cluster batch schedulers," Technical Report 2005-13, Hebrew University, May 2005.
[15] R. Raman, M. Livny, and M. Solomon, "Resource management through multilateral matchmaking," in Proc. of the 9th IEEE Symposium on High Performance Distributed Computing, Pittsburgh, USA, 2000.
[16] A. Orgerie, L. Lefevre, and J. Gelas, "Save watts in your grid: Green strategies for energy-aware frameworks in large scale distributed systems," in Proc. of the 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, 2008.
[17] D. Bradley, R. Harper, and S. Hunter, "Workload-based power management for parallel computer systems," IBM Journal of Research and Development, vol. 47, no. 5, pp. 703-718, 2003.


Page 68: ADCOM 2009 Conference Proceedings

[Figure 3 plots are not recoverable from the text. Panels: (a) HL: Workload Execution VS Job Urgency; (b) HH: Workload Execution VS Job Urgency; (c) HL: Workload Execution VS Job Arrival Rate; (d) HH: Workload Execution VS Job Arrival Rate. Axes: Workload Executed (Millions) vs. % of High Urgency (HU) Jobs or Increase in Job Arrival Rate. Series: HAMA, HAMA-withoutDVS, EDF-FQ.]

Fig. 3. Comparison of HAMA with other meta-scheduling algorithms

[18] B. Lawson and E. Smirni, "Power-aware resource allocation in high-end systems via online simulation," in Proc. of the 19th Annual International Conference on Supercomputing, Cambridge, USA, 2005, pp. 229-238.
[19] D. Meisner, B. Gold, and T. Wenisch, "PowerNap: eliminating server idle power," in Proc. of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, Washington, USA, 2009.
[20] G. Tesauro et al., "Managing power consumption and performance of computing systems using reinforcement learning," in Proc. of the 21st Annual Conference on Neural Information Processing Systems, Vancouver, Canada, 2007.
[21] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam, "Managing server energy and operational costs in hosting centers," ACM SIGMETRICS Performance Evaluation Review, vol. 33, no. 1, pp. 303-314, 2005.
[22] A. Verma, P. Ahuja, and A. Neogi, "Power-aware dynamic placement of HPC applications," in Proc. of the 22nd Annual International Conference on Supercomputing, Athens, Greece, 2008, pp. 175-184.
[23] N. Kappiah, V. Freeh, and D. Lowenthal, "Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[24] C. Hsu and W. Feng, "A power-aware run-time system for high-performance computing," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[25] R. Porter, "Mechanism design for online real-time scheduling," in Proc. of the 5th ACM Conference on Electronic Commerce, New York, USA, 2004, pp. 61-70.
[26] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, "Theory and practice in parallel job scheduling," in Job Scheduling Strategies for Parallel Processing, London, UK, 1997, pp. 1-34.
[27] H. A. Sanjay and S. Vadhiyar, "Performance modeling of parallel applications for grid scheduling," J. Parallel Distrib. Comput., vol. 68, no. 8, pp. 1135-1145, 2008.
[28] "Distributed European Infrastructure for Supercomputing Applications (DEISA)," http://www.deisa.eu.
[29] Enabling Grids for E-sciencE, "EGEE project," http://www.eu-egee.org/, 2005.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma, "Making scheduling 'cool': temperature-aware workload placement in data centers," in Proc. of the 2005 USENIX Annual Technical Conference, Anaheim, CA, 2005.
[31] D. Feitelson, "Parallel workloads archive," http://www.cs.huji.ac.il/labs/parallel/workload.
[32] D. Irwin, L. Grit, and J. Chase, "Balancing risk and reward in a market-based task service," in Proc. of the 13th IEEE International Symposium on High Performance Distributed Computing, Honolulu, USA, 2004.
[33] A. W. Mu'alem and D. G. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6, pp. 529-543, Jun. 2001.


Page 69: ADCOM 2009 Conference Proceedings

Intelligent Data Analytics Console

Snehal Gaikwad

School of Computer Science Carnegie Mellon University

Pittsburgh, 15217 USA [email protected]

Mihir Kedia IBM Global Services

IBM India Pvt Ltd Delhi, 110020 INDIA

[email protected]

Aashish Jog School of Management

IIT Bombay, Powai, 400076 INDIA

[email protected]

Bhavana Tiple Dept of Computer Science

MIT Pune Pune, 411038 INDIA [email protected]

Abstract—The problem of integrating data analysis, visualization and learning is arguably at the very core of the problem of data intelligence. In this paper, we review our research work in the area of data analytics and visualization, focusing on four interlinked directions of research - (1) data collection, (2) data analytics and visual evidence, (3) data visualization, and (4) intelligent user interfaces - that contribute to and complement each other. Our research resulted in the Intelligent Data Analytics Console (IDAC), an integration of the above four disciplines. The main objectives of IDAC are to: (1) provide a rapid development platform for analyzing data and generating components that can be used readily in software applications, (2) provide visual evidence after each data analytics operation to help the user learn the behavior of the data, and (3) provide a user-centric platform for skilled data analytics experts as well as naïve users. The paper presents the development process of user-centric intelligent software equipped with effective visualization. This approach should help business organizations develop better data analytics software using open source technologies.

Keywords- Human Factors, Intelligent User Interfaces, Machine Learning

I. INTRODUCTION

A huge amount of data is generated by various familiar processes, systems, and the computer networks built around them. Voluminous data from electronic business, computer networks, financial trading systems, share markets, and weather forecasting arrives in fast streams. In our day-to-day lives, manually analyzing and classifying such large amounts of data is not feasible.

Data analytics is the art and science of getting to know more about the behavior of complex real-time data by applying mathematical and statistical principles [30, 43]. Rapid data analytics operations performed on given data result in large changes to the original data values. If we consider a dataset having about a thousand data points, tracking the changes after each data analytics operation is impossible. Therefore, to learn the patterns and trends of a given data set, it is necessary to represent the data values in an understandable format. Many data analytics tools have been proposed to overcome these problems; Weka [5, 36, 43], Yet Another Learning Environment (Yale, now known as RapidMiner) [15, 29], and Sumatra TT [4] are major examples. However, current software systems fail to provide a rapid analytics platform for naïve users. Sumatra TT, Weka, and Yale do not support many different data formats. Sumatra TT fails to provide a wide variety of analytics algorithms and operators. Inefficient drag-and-drop interfaces and a lack of graphical representations eventually result in major usability challenges for naïve users. As a result, the prediction and decision-making process becomes time-consuming and frustrating for users.

Today, data analytics and visualization research has undergone fundamental changes in several approaches [6, 9, 25, 26]. A new research and development focus has emerged within visualization to address some of the fundamental problems associated with new classes of data and their related analysis tasks [28]. This research and development focus is known as information visualization. Information visualization combines aspects of scientific visualization, human-computer interfaces, data mining, imaging, and graphics [28]. Visualization is perceived as a gateway to understanding voluminous datasets. It provides a productive environment for data analytics experts and business executives to understand and forecast huge amounts of data in a short span of time.

In this paper, we present the development and implementation of the Intelligent Data Analytics Console (IDAC), focusing on the integration of data collection, data analytics, and visualization. We demonstrate an approach to develop a rapid platform for analyzing datasets and generating agents that can be used readily in software applications. This process involves a library of analytics blocks and allows the user to drag and drop all the functional blocks required to form an 'analytics execution chain'. Our development is based on the open source libraries from Weka [5, 36, 43], Yale [15, 29], and JFreeChart [11]. The IDAC architecture demonstrates how intelligent user interfaces can increase the usability of complex systems. Our research provides detailed guidelines for the research and business community to effectively set up a platform for data analysis problems. The rapid learning platform of IDAC captures the knowledge generated by data analytics experts; by finding typical analysis patterns, it provides standard recipes as well as online recommendations on what to do next. An advisory system is an application of data analytics which analyzes the nature of the data as well as the results obtained at each step to provide recommendations. The paper demonstrates how this feature is invaluable to users who do not have a deep understanding of analytics algorithms. Capturing the knowledge involves recording earlier successful executions of analytics chains used by users for particular types of analysis and 'training' the advisory system. Further, we illustrate how the visual evidence technique can act as an effective debugging technique to assist naïve users.


Page 70: ADCOM 2009 Conference Proceedings

The rest of the paper is organized as follows. In Section 2, we introduce the detailed system architecture of IDAC; Section 3 presents the Intelligent User Interface; usability evaluation and results are described in Section 4. Section 5 discusses future research; Section 6 provides concluding remarks.

II. SYSTEM ARCHITECTURE

In this section, we provide a detailed implementation of IDAC. The IDAC architecture is based on the iterative process model [34, 35]. Traditional software development cycles include the widely used Waterfall Model, Prototyping Model, and Spiral Model [10]. Most of them have barriers of specification, communication, and optimization [10, 34]. Considering our research requirements, we decided to follow the Incremental Process Model. The main objective of applying the incremental development approach was to enhance system usability and constantly tune the computational performance of the software [22, 33, 34, 35].

The data collection module, the data analytics and visual evidence module, and the data visualization module are the three major components of the system. Fig. 1 shows the detailed architecture of IDAC. The data collection module is responsible for data preprocessing and cleaning [38]. The data analytics and visual evidence module focuses on data mining [43] and machine learning [30] operations and generates textual results. The visual evidence operators are created after each data analysis operator. The data visualization module is responsible for converting textual results into suitable graphical formats. In the following sections of the paper, we describe the detailed architecture of IDAC.

A. Data Collection Module

The data collection module is responsible for rapid data preprocessing.

Data preprocessing is also called data cleaning or data cleansing. The data cleaning process deals with detecting and removing errors and inconsistencies in order to improve the quality of the available data [2, 20, 38]. We have used three categories of filters for data cleaning: supervised filters, unsupervised filters, and streamable filters. The IDAC Java packages for analytics operators are based on the Yale [15] and Weka [5] open source coding hierarchies. The IDAC filter package is responsible for filtering operations.

Real-time data sets contain a huge number of missing values, which affects the accuracy of prediction. For example, if an electronic business executive wants to launch a new product but the available datasets do not have enough labels for an attribute, then it is difficult for him to make accurate predictions. To overcome this problem and estimate unknown parameters, we have implemented the Expectation-Maximization (EM) algorithm. The EM algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models where the model depends on unobserved latent variables [12]. The E-step of the algorithm finds the conditional expectation of the missing data given the observed data and the current parameter estimates, and substitutes these expectations for the 'missing data'. The M-step updates the parameter estimates by maximizing the expected complete-data log likelihood [12]. We used Frank Dellaert's formulation to maximize the posterior probability of the unknown parameters [12]. If U is the given measurement data and J is a hidden variable, then the unknown parameter is estimated by the following formula [12]:

\Theta^* = \arg\max_{\Theta} \sum_{J \in \mathcal{J}^n} P(\Theta, J \mid U) \qquad (1)

EM computes a distribution over the space \mathcal{J} rather than finding the best J ∈ \mathcal{J} [12]. We found that existing data analytics tools fail to support more than one data format [4]; to overcome this drawback, the system supports the two major data file formats, ARFF and CSV. A data conversion operator helps the user convert files into other formats.
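As a minimal concrete instance of this E-step/M-step alternation, the sketch below runs EM for a two-component one-dimensional Gaussian mixture, with the component label playing the role of the hidden variable J. It is illustrative only and is not IDAC's implementation.

/** Minimal EM sketch for a 1-D two-component Gaussian mixture (illustrative). */
public class EmSketch {
    static double gaussian(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
        double[] u = {1.0, 1.2, 0.8, 5.1, 4.9, 5.3, 1.1, 5.0};  // observed data U (made up)
        double mu1 = 0, mu2 = 1, var1 = 1, var2 = 1, w = 0.5;   // initial parameters Θ
        for (int iter = 0; iter < 50; iter++) {
            // E-step: responsibility of component 1 for each point (distribution over hidden J)
            double[] r = new double[u.length];
            for (int i = 0; i < u.length; i++) {
                double p1 = w * gaussian(u[i], mu1, var1);
                double p2 = (1 - w) * gaussian(u[i], mu2, var2);
                r[i] = p1 / (p1 + p2);
            }
            // M-step: re-estimate Θ by maximizing the expected complete-data log likelihood
            double n1 = 0, s1 = 0, s2 = 0, q1 = 0, q2 = 0;
            for (int i = 0; i < u.length; i++) { n1 += r[i]; s1 += r[i] * u[i]; s2 += (1 - r[i]) * u[i]; }
            double n2 = u.length - n1;
            mu1 = s1 / n1; mu2 = s2 / n2;
            for (int i = 0; i < u.length; i++) {
                q1 += r[i] * (u[i] - mu1) * (u[i] - mu1);
                q2 += (1 - r[i]) * (u[i] - mu2) * (u[i] - mu2);
            }
            var1 = q1 / n1; var2 = q2 / n2; w = n1 / u.length;
        }
        System.out.printf("mu1=%.2f mu2=%.2f w=%.2f%n", mu1, mu2, w);
    }
}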

Figure 1. IDAC System Architecture


Page 71: ADCOM 2009 Conference Proceedings

Figure 2. WEKA Data Analytics chain without Visual Operator

1) ARFF: ARFF stands for the Attribute-Relation File Format. The structure of an ARFF file is divided into two major parts: the header section and the data section. The header section contains the name of the relation and the list of attributes and their types. The data section consists of the data declaration line and the data values. Table I shows an ARFF file for customer banking records.

TABLE I. ARFF FOR CUSTOMER DATA

@relation bank

@attribute Service_type {Fund, Loan, CD, Bank_Account, Mortgage}
@attribute Customer {Student, Business, Other, Doctor, Professional}
@attribute Size {Small, Large, Medium}
@attribute Interest_rate real

@data
Fund,Business,Small,1
Loan,Other,Small,1
Mortgage,Business,Small,4
...

2) Comma Separated Value (CSV): CSV is also known as the comma-separated list. CSV files consist of data values separated by commas. Table II shows the CSV file for the weather forecasting data used by IDAC.

TABLE II. CSV FILE FOR WEATHER DATA

Sunny, 89, false, no

Overcast, 78, true, yes

rainy, 75, true, yes
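Both formats can be read through Weka's converter classes, on which the IDAC packages are based. A minimal sketch follows; the file names are hypothetical, and note that Weka's CSVLoader expects a header row, unlike the raw rows shown in Table II.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.core.converters.CSVLoader;

/** Minimal sketch: reading ARFF and CSV data sets via Weka's converters. */
public class LoadData {
    public static void main(String[] args) throws Exception {
        ArffLoader arff = new ArffLoader();
        arff.setSource(new File("bank.arff"));     // hypothetical file, cf. Table I
        Instances bank = arff.getDataSet();

        CSVLoader csv = new CSVLoader();
        csv.setSource(new File("weather.csv"));    // hypothetical file, cf. Table II
        Instances weather = csv.getDataSet();

        System.out.println(bank.numInstances() + " bank rows, "
                + weather.numInstances() + " weather rows");
    }
}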

The data collection module sends the cleaned data to the data analytics and visual evidence module to perform advanced data analysis operations. In this module, IDAC allows the data analytics expert to perform several analysis operations in a certain order. Data probes send the cleaned data to the analytics and visual evidence module. This module involves a library of analytics blocks which allows the user to drag and drop all the required blocks and connect them to form an 'analytics execution chain' in the visual application. The chain is based on the end goal of the analysis as well as the nature of the dataset. Fig. 2 shows a chain built from the Iris data set, cross-validation, and attribute selection operators. Existing data analytics software provides visualization only at the end of the chain; as a result, it becomes difficult for the user to understand the changes after each data analytics operator [5, 4, 15]. From an analytics perspective, tracking these changes is very important for learning the behavior of the data set. We propose the new concept of the Visualization Debugger (see Fig. 3).

Figure 3. Visual Debugger

Figure 4. WEKA Data Analytics chain with Visual Operator

[Figure 3/4 block labels: IRIS data source, Cross Validation Fold Maker, Attribute Selection, ClassifierPerformanceEvaluator, Text Viewer, with visual operators (ScatterPlotMatrix, Data Visualizer, AttributeSummarizer) capturing the data "Before Processing" and "After Processing".]


Page 72: ADCOM 2009 Conference Proceedings

The Visualization Debugger helps the user understand the changing behavior of the dataset after each analytics operation. Fig. 4 shows the basic visual debugger used to generate visual evidence from data analytics results. The visualization operator displays the dataset values twice: a) before entering the data analytics operator and b) after resulting from the data analytics operator. By comparing the results from both visual operators, the user can easily keep track of the changes that take place in the operational chain. This application presents an interface amenable to users who would like to rapidly deploy an analytics solution into a suitable data processing architecture. The IDAC advisory system analyzes the results obtained at each step and provides recommendations to assist users in making decisions based on visual facts. Data analytics operations help the user find patterns in huge datasets. We have used the open source Weka platform to develop a library of different data analysis operators. The following sections of the paper focus on the major data analytics operators implemented in IDAC.

B. IDAC Classification

We have analyzed the performance of existing classifiers [13, 21, 41] as well as the classification techniques used in Weka [43] and Sumatra TT [4]. We found that decision trees are the most popular and stable method among the available data mining classifiers. The main reason is that decision trees provide excellent scalability and highly understandable results. Decision trees also support sparse data. We also used the KNN model because it is applicable to most cases, from sparse to dense, independent of the training data requirements. KNN is an instance-based classifier that provides an excellent quality of prediction and explanatory results. However, we found that KNN is not optimal for problems with a non-zero Bayes error rate, that is, for problems where even the best possible classifier has a non-zero classification error. We overcome this problem by incorporating discriminative and generative classifiers.
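Since IDAC builds on the Weka libraries, the decision-tree and KNN classifiers described above correspond to Weka's J48 and IBk classes. A hedged usage sketch follows; the data file is hypothetical, and DataSource is Weka's converter utility.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: evaluating a decision tree (J48) and a KNN classifier (IBk) via Weka. */
public class ClassifySketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");    // hypothetical data set
        data.setClassIndex(data.numAttributes() - 1);     // last attribute is the class

        for (weka.classifiers.Classifier c
                : new weka.classifiers.Classifier[]{new J48(), new IBk(3)}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));  // 10-fold cross-validation
            System.out.printf("%s: %.1f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}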

C. IDAC Clustering

Clustering deals with discovering classes of instances that belong together [23, 43]. Clusters are also known as self-organized maps. The IDAC clustering operators are developed using high-performance clustering algorithms [32]. The IDAC operator converts an input file into several clusters; it also provides the probability distribution for all attributes. Sometimes a data set exhibits a trend different from the regular one; such data are called anomalies. Detecting these anomalies is essential for studying the abnormal behavior of the dataset. The visual representation of clustering enables the user to find abnormal behaviors in the datasets and helps in decision-making. Table III lists the clustering algorithms implemented in IDAC.

TABLE III. IDAC JAVA CLASSES AND ALGORITHMS FOR CLUSTERING

Java Class      Algorithm Description
SimpleXMeans    Clusters data using the X-Means algorithm.
SimpleKMeans    Clusters data using the K-Means algorithm.
FarthestFast    Clusters data using the FarthestFirst algorithm.
DBScan          Density-based algorithm for discovering clusters in large spatial databases with noise.

An Example: We illustrate IDAC clustering with a simple example of a real-world company. The main objective of the company is to develop a new financial product for its customers. Table IV shows the vacation, eCredit, salary, and property values of each customer. The goal of the operation is to find patterns in the customer behavior and develop the new financial product. The main step of the process is finding out which data belong to which cluster. Using the data analytics library, we apply the SimpleKMeans data analytics operator to the available data. Table V shows which customer fits into which cluster. For example, customer number nine, with a salary of 14.4, belongs to the first cluster. Table VI shows the mean value for each cluster. The visualization in Fig. 5 helps us understand the behavior of the data. The user can easily observe that groups 3 and 5 differentiate themselves from the rest. Customers who belong to clusters 3 and 5 have the most vacations, high salaries, and low property.

TABLE IV. CUSTOMER DATA

Customer  Vacation  eCredit  Salary  Property
1         6         40       13.62   3.2804
2         11        21       15.32   2.0232
3         7         64       16.55   3.1202
4         3         47       15.71   3.4022
5         15        10       16.96   2.2825
6         6         80       14.66   3.7338
7         10        49       13.86   5.8639
8         10        84       15.64   3.187
9         9         74       14.4    2.3823
...
186       51        13       20.71   1.4585
187       49        17       19.18   2.4251

TABLE V. RESULTS AFTER CLUSTERING

Customer  Vacation  eCredit  Salary  Property  Fit Cluster
1         6         40       13.62   3.2804    2
2         11        21       15.32   2.0232    2
3         7         64       16.55   3.1202    1
4         3         47       15.71   3.4022    2
5         15        10       16.96   2.2825    2
6         6         80       14.66   3.7338    1
7         10        49       13.86   5.8639    3
8         10        84       15.64   3.187     1
9         9         74       14.4    2.3823    1
...
186       51        13       20.71   1.4585    5
187       49        17       19.18   2.4251    5

TABLE VI. MEAN VALUE FOR EACH CLUSTER

Group Number  Vacation   eCredit    Salary     Property   Total
1             14.51515   80.45455   21.90152   5.465142   33
2             8.75       14.80556   16.20771   1.423411   36
3             39.02128   55.08511   19.92681   3.72284    47
4             10.21277   224.0435   24.9087    11.61411   23
5             48.21277   15.42553   21.87149   2.057874   47
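A sketch of how this clustering step could be invoked through Weka's SimpleKMeans, which IDAC wraps; the file name is hypothetical, and the cluster numbering is shifted to start at 1 as in Table V.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: the clustering step behind Tables V and VI via Weka's SimpleKMeans. */
public class ClusterSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff");  // hypothetical file, cf. Table IV
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(5);                                // five groups, as in Table VI
        km.buildClusterer(data);
        for (int i = 0; i < data.numInstances(); i++)        // "Fit Cluster" column of Table V
            System.out.println("customer " + (i + 1) + " -> cluster "
                    + (km.clusterInstance(data.instance(i)) + 1));
        System.out.println(km);                              // prints centroids, cf. Table VI
    }
}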


Page 73: ADCOM 2009 Conference Proceedings

Figure 5. Clustering Visualization

Based on the above results, the company can decide to develop a travel-related financial service with travel card offers, insurance coverage, travel accident insurance, baggage insurance, theft insurance, and full primary collision insurance on car rentals. Rapid clustering computations and detailed visual results make the real-time decision-making process easier.

D. IDAC Association & Regression

We have used association operators to learn relationships between data attributes. Unlike Yale, IDAC supports the Apriori algorithm for handling nominal attributes from given datasets. The algorithm helps determine the number of rules, the minimal support, and the minimum confidence value [1]. The regression model is applicable to numeric classification and prediction, given the relationships between the input attributes [17, 26]. If a data analyst knows the value of a particular quantity, regression helps him estimate the value of another. The other analytics operators in IDAC are responsible for performing prediction, anomaly detection, filtering, and sampling operations.
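A hedged sketch of the corresponding Weka Apriori call follows; the parameter values are illustrative, and the input must be all-nominal (a numeric attribute such as Interest_rate in Table I would first need discretization).

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: mining association rules over nominal attributes with Weka's Apriori. */
public class AssocSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-nominal.arff");  // hypothetical all-nominal file
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);                  // number of rules to report
        apriori.setMinMetric(0.9);                // minimum confidence
        apriori.setLowerBoundMinSupport(0.1);     // minimal support
        apriori.buildAssociations(data);
        System.out.println(apriori);              // prints the discovered rules
    }
}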

E. Data Visualization

We have implemented the IDAC visual operators using the open source JFreeChart library [11, 39]. The purpose of a visual operator is to discover hidden patterns and provide assistance in decision-making. The visualization library not only helps users debug analytics results but also provides a data forecasting platform.

TABLE VII. USER-CENTRIC SCIENTIFIC AND INFORMATION VISUALIZATION

Visualization              User            Objective
Scientific Visualization   Data Analyst    Deep understanding of scientific phenomena
Information Visualization  Less Technical  Searching, discovering relationships

Table VII shows the user-centric visualization methods. While representing voluminous data, we have considered the following principles [9, 14, 25, 27]:

• Clarity: Data Visualization is different from verbal information; visual information is analyzed in parallel by the human brain. Therefore, data values and changes are represented in interpretable formats.

• Simplicity: The graphical representation should be as simple as possible; it should not confuse the user.

• Brevity: While representing datasets, economy of expression is very important; the representation should be self-explanatory, because we are much better at remembering visual information.

According to the nature of a dataset, IDAC selects the visual operator for the specific data analysis task. Another key theme for data visualization is ease of use. The demand for good, effective visualization of data is very high, especially among those who do not have any data analytics background. The naïve user community is highly diverse, with different levels of education, backgrounds, capabilities, and needs. The visualization module enables this diverse group to solve the analytics problem at hand. IDAC provides a visual library ranging from the Scatter Plot Matrix to Andrews Curves.
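A minimal JFreeChart (1.0-era API) sketch of such a scatter visualization follows, plotting the per-cluster vacation and salary means from Table VI; the names and file paths are illustrative.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

/** Sketch: a scatter plot of cluster means (cf. Fig. 5) using JFreeChart. */
public class VisualSketch {
    public static void main(String[] args) throws Exception {
        XYSeries centroids = new XYSeries("Cluster means");
        double[][] vacSal = {{14.5, 21.9}, {8.8, 16.2}, {39.0, 19.9}, {10.2, 24.9}, {48.2, 21.9}};
        for (double[] p : vacSal) centroids.add(p[0], p[1]);  // (vacation, salary) from Table VI

        JFreeChart chart = ChartFactory.createScatterPlot(
                "Clustering Visualization", "Vacation", "Salary",
                new XYSeriesCollection(centroids), PlotOrientation.VERTICAL,
                true, false, false);
        ChartUtilities.saveChartAsPNG(new File("clusters.png"), chart, 640, 480);
    }
}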

III. INTELLIGENT USER INTERFACE

In this section, we explore the features of the IDAC Intelligence Wizard. We have implemented intelligence and learning algorithms for the automatic generation of data analytics chains [37, 42]. The Intelligence Wizard provides a separate user interface for the naïve user and guides him in performing further analysis. During initial development, use case scenarios helped us organize the information and determine the interaction sequence. We conducted contextual inquiries to understand users' thinking and develop a user-centric architecture. Our user-centric design is based on a three-tier model [45]. Fig. 6 shows the Channel Layer, Interaction Layer, and Semantic Layer of the three-tier architecture.

Figure 6. Three Tier Design Model

Figure 7. Interface to select Data Analysis Operation

[Stray figure labels: Fig. 5 marks clusters 1-5 and highlights "Clusters with Many Vacations, High Salary and Low Property"; Fig. 7 lists the selectable operations Association, Missing Values, Learner Create Model, Learner Apply Model, and Cross Validation under the prompt "SELECT THE KIND OF OPERATION TO BE DONE ON THE DATA".]


Page 74: ADCOM 2009 Conference Proceedings

Figure 8. Automatic Chain Generation: Data Analytics and Visualization Operators

We have used the Semantic Layer to define the contents of the operator library. The Interaction Layer is responsible for defining the interaction sequences and determining the user experience. The Channel Layer is responsible for the actual presentation, which provides the user with a consistent experience independent of the access method. For example, all interactions offered through a screen follow the same logical steps and offer the same sequence of options. The structure of the data analytics library supports this consistent experience. When the user selects operations, the system compares them with standard rules; the algorithm returns a Boolean result for chain validation. If the chain is valid, then IDAC continues the data analysis operation. For invalid chains, IDAC finds the wrong operator and presents multiple suggestions to the user. Initially, the IDAC Intelligence Wizard screen allows the user to select the file format. Depending on the selected dataset, the system presents the next interface to the user. Fig. 7 shows the screen for selecting among several data analysis operations. The Intelligence Wizard asks the user what kind of data analysis he wants to perform. After considering the user's choices, the system automatically generates the chain of data analytics operators (see Fig. 8). From the user's inputs, IDAC generates and validates the operating chain.

This approach responds to human input and significantly reduces the usability barriers faced by naïve users.
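The validation step itself is not specified in detail; the following hypothetical sketch shows one way such rule-based chain checking could work, with each operator category declaring its legal successors. All names and rules here are ours, not IDAC's.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of IDAC-style chain validation. */
public class ChainValidator {
    private final Map<String, Set<String>> allowedNext = new HashMap<>();

    void allow(String from, String... to) {
        allowedNext.put(from, new HashSet<>(Arrays.asList(to)));
    }

    /** Returns the index of the first invalid operator, or -1 if the chain is valid. */
    int firstInvalid(List<String> chain) {
        for (int i = 0; i + 1 < chain.size(); i++) {
            Set<String> next = allowedNext.get(chain.get(i));
            if (next == null || !next.contains(chain.get(i + 1))) return i + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        ChainValidator v = new ChainValidator();     // rules are illustrative
        v.allow("source", "filter", "crossValidation");
        v.allow("filter", "crossValidation");
        v.allow("crossValidation", "learner");
        int bad = v.firstInvalid(Arrays.asList("source", "learner"));
        // An invalid link is where the wizard would offer alternative operators.
        System.out.println(bad == -1 ? "valid chain" : "invalid at position " + bad);
    }
}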

IV. USABILITY EVALUATION AND RESULTS

In this section, we describe the software testing approaches used to measure the performance and usability of IDAC. During the usability evaluation phase, we focused on both automated and manual software testing approaches. IDAC automated testing was performed using the WinRunner 7.0i software. WinRunner automates the testing process by storing a script of the actions that take place. We used TestDirector to perform manual testing. The incremental model helped us achieve the desired results and performance. Initially, we conducted usability research to understand the proportion of usability problems in Weka, Sumatra TT and Yale. Fig. 9 shows the results of Nielsen's heuristic analysis. We analyzed usability heuristics including the match between the system and real-world standards, user control and freedom, help and documentation, consistency and standards, flexibility and efficiency of use, and aesthetic and minimalistic design. The results show that most of the systems fail to achieve consistency and standards as well as user control and freedom. Sumatra TT fails to provide help and documentation.

Figure 9. Usability evaluation results of Weka, Sumatra TT and Yale using heuristic analysis

[Stray figure labels: Fig. 8 shows an example source (Sonar data file), the automatically generated operators, and a random number generator that changes the values of the given data set. Fig. 9 plots errors (y-axis, 0-100) against Nielsen's usability heuristics (x-axis: match between system and real-world standards; user control and freedom; visibility of system status; help and documentation; consistency and standards; flexibility and efficiency of use; aesthetic and minimalistic design) for WEKA, Sumatra TT, and Yale.]


Page 75: ADCOM 2009 Conference Proceedings

Figure 10. Usability evaluation results of IDAC, Weka, Sumatra TT and Yale using heuristic analysis

We overcame the drawbacks of the traditional software by focusing on the following criteria:

• Effectiveness: Helps users achieve their goals with the desired accuracy and completeness.

• Efficiency: Reduces the computation time and resources spent in achieving desired goals.

• Satisfaction: Reduces user discomfort and increases pleasant interaction.

We adopted standard approaches in the testing phase of the iterative design. Unit testing, regression testing, integration testing, alpha testing, and beta testing methods were used effectively to test the IDAC components. As each new operator was added to the library, we performed regression testing to see whether it had an adverse effect on previously created operators. New visual operators were integrated without any side effects [18]. The alpha testing was conducted in the presence of end users. The initial contextual enquiry helped analyze the software usability. After getting feedback from the beta testing, further enhancements to the help and documentation standards were made [18]. The performance of IDAC was compared to that of other existing software. The results show that the Intelligence Wizard creates accurate analytics chains in about 93% of the cases. Fig. 10 shows the improved results and efficiency of IDAC compared to current software systems.

V. FUTURE RESEARCH

In this paper, we have demonstrated a solution to integrate data analytics, visualization and intelligence. We emphasized an open, cooperative and multidisciplinary approach to increase the usability of the system. Developing a personalized system for a wide variety of users and incorporating the effective use of information visualization are among the major challenges for future research. Solutions to these challenges are also rooted in an understanding of visual perception. Understanding the user's needs and selecting a suitable visual display is the biggest challenge. An immediate enhancement to data visualization would be the separation and representation of string attributes by taking frequency counts of their occurrences or by using other effective measures. Our current research focus is to solve multidisciplinary business problems using a similar approach. A major work item is the development of a Privacy Protection Architecture. Today, organizations are trying to incorporate customer-driven innovations; the aim is to provide better products and personalized services. This involves a learning process in which organizations continuously capture data and learn from user behavior. Unfortunately, neither organizations nor users are equipped to act in an efficient, optimal way to decide what information they should protect and what they should reveal. This situation creates challenging problems regarding customer data security, privacy and trust. We are researching how independent IDAC agents can be effectively integrated with machine learning methods to classify and protect sensitive data. The use of the visualization module to define uncertainty, ambiguity, and behavioral biases in the privacy protection mechanism will be a major milestone of our research.

VI. CONCLUSION

In this paper, we presented the Intelligent Data Analytics Console, a user-centric platform that effectively integrates data analytics, visualization and intelligence. The proposed architecture demonstrates an open, cooperative and multidisciplinary approach to developing data analytics software that acts as an assistant to users rather than as a tool. Using the visual evidence technique, we illustrate an approach to providing recommendations based on the data and the results obtained at a particular instant. As shown through a series of best practices and usability evaluations, IDAC substantially reduces the usability barriers. With our approach, researchers and enterprises can generate data analytics components that can be used readily in software applications.

[Figure 10 residue: the same axes as Fig. 9 (errors vs. Nielsen's usability heuristics), with IDAC plotted alongside WEKA, Sumatra TT, and Yale.]


Page 76: ADCOM 2009 Conference Proceedings

ACKNOWLEDGMENT

We would like to thank the TRDDC scientists Prof. Harrick Vin, Mr. Subrojyoti Roy Chaudhury, Dr. Savita Angadi and Mr. Niranjan Pedanekar for their constant support and guidance. We appreciate the contribution of the WEKA developers to the open source community; your pioneering research motivated and guided us. This research is a part of the Systems Research Lab, TATA Research Development and Design Center (TRDDC), TATA Consultancy Services.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, "On certain integrals of Lipschitz-Hankel type involving products of Bessel functions," Phil. Trans. Roy. Soc. London, vol. A247, pp. 529-551, April 1955.
[2] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases," in Proc. of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, Washington, D.C., 1993.
[3] D. W. Aha, "Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms," Int. J. Man-Mach. Stud., 36(2):267-287, 1992.
[4] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artificial Intelligence Review, 11(1-5):11-73, 1997.
[5] P. Aubrecht, F. Zelezny, P. Miksovsky, and O. Stepankova, "Sumatra TT: Towards a universal data preprocessor."
[6] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, and D. Scuse, "Weka manual for version 3-4," 2007.
[7] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2:121-167, 1998.
[8] T. Chau and A. K. C. Wong, "Pattern discovery by residual analysis and recursive partitioning," IEEE Transactions on Knowledge and Data Engineering, 11(6):833-852, 1999.
[9] K. Chellapilla and P. Y. Simard, "Using machine learning to break visual human interaction proofs (HIPs)," in NIPS, 2004.
[10] W. S. Cleveland, Visualizing Data. Hobart Press, Summit, New Jersey, USA, 1993.
[11] B. Curtis, H. Krasner, and N. Iscoe, "A field study of the software design process for large systems," Commun. ACM, 31(11):1268-1287, 1988.
[12] D. Gilbert, JFreeChart, http://www.jfree.org/jfreechart/, 2006.
[13] F. Dellaert, "The expectation maximization algorithm," Technical Report GIT-GVU-02-20, February 2002.
[14] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. of the 13th IJCAI, pp. 1022-1027, Chambery, France, 1993.
[15] S. Few, Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press, September 2004.
[16] S. Fischer, I. Mierswa, R. Klinkenberg, and O. Ritthoff, "Developer tutorial: Yale - yet another learning environment," 2006.
[17] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. of the 13th International Conference on Machine Learning, pp. 148-156, Morgan Kaufmann, 1996.
[18] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, 28, 2000.
[19] S. Gaikwad, M. Kedia, and A. Jog, "Data analytics and visualization," Technical report, TATA Research Development and Design Center - TCS, 2007.
[20] M. A. Hall and L. A. Smith, "Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper," 1999.
[21] M. A. Hernandez and S. Stolfo, "Real-world data is dirty: Data cleansing and the merge/purge problem," Data Mining and Knowledge Discovery, 2:9-37, 1998.
[22] R. C. Holte, "Very simple classification rules perform well on most commonly used datasets," Machine Learning, pp. 63-91, 1993.
[23] C.-M. Karat, "Cost-benefit analysis of iterative usability testing," in INTERACT '90: Proc. of the IFIP TC13 Third International Conference on Human-Computer Interaction, pp. 351-356, Amsterdam, The Netherlands, 1990.
[24] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[25] K. Kira and L. A. Rendell, "A practical approach to feature selection," in ML92: Proc. of the Ninth International Workshop on Machine Learning, pp. 249-256, San Francisco, CA, USA, 1992.
[26] M. Liquiere and J. Sallantin, "Structural machine learning with Galois lattice and graphs," in Proc. of the 1998 International Conference on Machine Learning (ICML'98), pp. 305-313, Morgan Kaufmann, 1998.
[27] L. Torgo and J. Gama, "Regression using classification algorithms," Intelligent Data Analysis, 1:275-292, 1997.
[28] J. I. Maletic, A. Marcus, and M. L. Collard, "A task oriented view of software visualization," pp. 32-40, 2002.
[29] A. I. McLeod and S. B. Provost, "Multivariate data visualization," January 2001.
[30] I. Mierswa, R. Klinkenberg, S. Fischer, and O. Ritthoff, "A flexible platform for knowledge discovery experiments: Yale - yet another learning environment," Univ. of Dortmund, 2003.
[31] T. M. Mitchell, Machine Learning. McGraw-Hill, New York, 1997.
[32] D. J. Murdoch and E. D. Chow, "A graphical display of large correlation matrices," The American Statistician, 50(2):173-178, 1996.
[33] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," 1994.
[34] D. L. Parnas and P. C. Clements, "A rational design process: How and why to fake it," IEEE Trans. Softw. Eng., 12(2):251-257, February 1986.
[35] R. S. Pressman, Software Engineering: A Practitioner's Approach. McGraw-Hill Higher Education, 2000.
[36] M. Rauterberg, "An iterative-cyclic software process model," 1992.
[37] R. Roberts, "AI32 - Guide to Weka," March 2005, http://www.comp.leeds.ac.uk/andyr
[38] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, January 1993.
[39] E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Engineering Bulletin, 23, 2000.
[40] K. Walrath, M. Campione, A. Huml, and S. Zakhour, The JFC Swing Tutorial: A Guide to Constructing GUIs, Second Edition. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.
[41] E. J. Wegman, "Hyperdimensional data analysis using parallel coordinates," Journal of the American Statistical Association, 85(411):664-675, 1990.
[42] S. M. Weiss and C. A. Kulikowski, Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
[43] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.
[44] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, June 2005.
[45] N. Sadeh, class notes, Mobile and Pervasive Computing.

67

Page 77: ADCOM 2009 Conference Proceedings

ADCOM 2009

COMPUTATIONAL BIOLOGY

Session Papers:

1. D. Narayan Dutt, “Digital Processing of Biomedical Signals with Applications to Medicine” – INVITED PAPER

2. C. Das, P. Maji, S. Chattopadhyay, “Supervised Gene Clustering for Extraction of Discriminative Features from Microarray Data”.

3. S. Das, S.M. Idicula, “Modified Greedy Search Algorithm for Biclustering Gene Expression Data”.


Page 78: ADCOM 2009 Conference Proceedings

Digital Processing of Biomedical Signals with Applications to Medicine

D. Narayana Dutt

Department of Electrical Communication Engineering

Indian Institute of Science, Bangalore-560012, India

[email protected]

Abstract— We consider here digital processing of Electroencephalogram (EEG) and Heart Rate Variability (HRV) signals with applications to psychiatry and cardiac care. First we consider the application to EEG. Conventional analysis of EEG signals cannot distinguish between schizophrenic patients and normal healthy individuals. In this work, a graph-theoretic approach is used to derive various parameters from the EEG. It is found that schizophrenia patients can not only be differentiated from healthy subjects but can also be grouped into various subgroups of schizophrenia. SVM-based automation gives a clear way to classify an individual into a particular group or subgroup with an accuracy of more than 90%. Next we present a neural network based approach to the classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability (HRV) data. We have chosen ten features for the network inputs. Four classification algorithms have been compared.

Index terms— EEG, Graph theory, Schizophrenia, Connectivity, Support vector machine (SVM), HRV, Neural network

I. INTRODUCTION

The brain consists of billions of neurons, which are the basic data processing units, closely interconnected via axons and dendrites to form a large network. A complex system like the brain can be described as a complex network of decision-making nodes. In the brain, the network is composed of these neuronal units which are linked by synaptic connections. The activity of these neuronal units gives rise to dynamic states that are characterized by specific patterns of neuronal activation and co-activation. The characteristic patterns of temporal correlations emerge as a result of functional interactions within a structured neuronal network.

The brain is inherently a dynamic system in which the correlation between regions, during behavior or even at rest, is created and reshaped continuously. Many experimental studies suggest that perceptual and cognitive states are associated with specific patterns of functional connectivity. These connections are generated within and between large populations of neurons in the cerebral cortex. An important goal in computational neuroscience is to understand these spatiotemporal patterns of complex functional networks of correlated brain activity measured in terms of EEG recordings.

Similarities between EEG time-series are commonly quantified using linear techniques, in particular estimates of temporal correlation in the time domain or coherence in the frequency domain. Temporal correlations are most often used to represent and quantify patterns in neuronal networks and are represented as a correlation matrix. Functional connectivity refers to the patterns of temporal correlations that exist between distinct neuronal units; such temporal correlations are often the result of neuronal interactions along anatomical or structural connections. Thus, the correlation matrix may be viewed as a representation of the functional connectivity of the brain.

Schizophrenia is a worldwide prevalent disorder with a multi-factorial but highly genetic aetiology; it is a multiple gene disorder. Schizophrenia is a complex and widespread disorder, giving rise to a great burden of suffering and impairment to both patients and their families. By looking at the EEG it is very difficult to ascertain and diagnose whether an individual is suffering from the disease. It is all the more difficult to find to which group a subject belongs without proper clinical analysis by a physician [1]. In this work the focus is on the use of graph-theoretic methods for analysis, together with statistical techniques, to identify individuals suffering from schizophrenia and to classify them into the various subgroups of schizophrenia.

The Electrocardiogram (ECG) describes the electrical activity of the heart. The rhythmic behavior of the heart is studied by analyzing heart rate time series. Heart Rate Variability (HRV) has been accepted as a quantitative marker for studying the regulatory mechanisms of the autonomic nervous system. Understanding the physiology of heart rate dynamics is very important in treating cardiac diseases that are chronic as well as life threatening, and allows clinicians to devise improved treatment methodologies. The importance of HRV was first discovered when unborn infants were attached to cardiac sensors in utero. The use of HRV in the study of cardiac disorders is very popular. HRV analysis has gained importance since many studies [2] have shown low HRV to be a strong indicator for predicting cardiac mortality and sudden cardiac death. Several studies suggest that patients with anxiety and depression are at a higher risk of significant cardiovascular mortality and sudden death [3]. In view of these works, we have considered the application of neural networks to the classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability data. Four classification algorithms have been compared, viz. k-nearest neighbours, Radial Basis Function (RBF) networks, Support Vector Machines (SVM) and back propagation networks, for different sets of features.

II. MATERIALS AND METHODS

A. Data acquisition

The EEG signals were recorded from 53 subjects (25 control and 28 schizophrenic subjects; 21 were female and 32 male) using 32 scalp electrodes. The Neuroscan machine was used for the recording. The signals were analog band-pass filtered with cutoff frequencies 0.5 to 70 Hz and then sampled at the rate of 256 Hz with a resolution of 12 bits. All electrodes were referenced to A2. The subjects were instructed to close their eyes and rest for some time before the collection of data was carried out. Subjects chosen for this analysis were old established cases of schizophrenia. It was ensured that the subjects were not under the influence of any medication during data collection. The 28 channels were placed according to the 10-20 international standard of electrode placement. The data were band-pass filtered between 0.5 and 38 Hz, which included the lower gamma frequency range and eliminated 50 Hz line noise.


Page 79: ADCOM 2009 Conference Proceedings


B. Formation of network graphs

The network connectivity can be studied using graph theory methods. A graph is represented by nodes (vertices) and connections (edges) between the nodes. In a graph, vertices are denoted by black dots and a line is drawn between two dots if the two vertices are connected. The complexity of a graph can be characterized by many measures, like the cluster coefficient and the characteristic path length. The cluster coefficient is a measure of the local interconnectedness of the graph and the characteristic path length is an indicator of its overall connectedness. More literature about graph theory may be found elsewhere [1, 2], which deal with graph theory in relation to brain connectivity.

The temporal correlation between any two EEG time series x1(t) and x2(t) at two scalp electrodes ranges between 0 and 1. To allow mathematical analysis, we represent the neuronal network activity patterns as graphs as follows. In this study the vertices of the graph represent the positions of the electrode placements on the scalp. Let the number of electrodes used for EEG recording be N; hence there are N vertices.

The correlations between all pair-wise combinations of EEG channels are computed to form a square matrix R of size N, where N is the total number of EEG channels. Each entry r_ij in the matrix R, 1 <= i, j <= N, is the correlation value for channels i and j, with 0 <= |r_ij| <= 1 and |r_ij| = 1 for i = j. The correlation matrix R is converted into a binary matrix by applying a threshold. The resulting binary matrix is called the adjacency matrix. A network graph is constructed from this matrix with N vertices and an edge between two nodes if the correlation between them exceeds the threshold. Thus, if the correlation r_ij between a pair of channels i and j exceeds the threshold value, an undirected edge is said to exist between vertices i and j. All edges are given the same cost, equal to unity.
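A minimal sketch of this construction, assuming one epoch of multichannel EEG as an N x T NumPy array; the threshold value is illustrative, not taken from the paper:

    # Correlation matrix R -> binary adjacency matrix A by thresholding.
    import numpy as np

    def adjacency_from_eeg(eeg, threshold=0.6):
        R = np.corrcoef(eeg)                     # pairwise channel correlations, N x N
        A = (np.abs(R) >= threshold).astype(int)
        np.fill_diagonal(A, 0)                   # no self-loops; edges are undirected
        return A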

C. Measures of a graph

Graph theory is one of the latest tools used for the analysis of brain connectivity. The following measures of a graph are used in this analysis; a brief description of each is given below.

1) Average connection density: The average connection density k_den of an adjacency matrix A is the number of all its nonzero entries divided by the maximal possible number of connections. Thus, 0 <= k_den <= 1. The sparser the graph, the lower its connection density [4,5].
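A small sketch of this measure for an undirected adjacency matrix without self-loops:

    # Connection density: existing edges divided by the maximum N(N-1)/2.
    import numpy as np

    def connection_density(A):
        N = A.shape[0]
        return A.sum() / (N * (N - 1))   # A.sum() counts each undirected edge twice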

2) Complexity: Complexity captures the extent to which a system is both functionally segregated and functionally integrated (large subsets tend to behave coherently). The statistical measure of neural complexity C_N(X) takes into account the full spectrum of subsets. C_N(X) can be derived from the ensemble average of the mutual information between subsets of a given size (ranging from 1 to n/2) and their complements [6,7,8].

3) Characteristic path length: Within a digraph, a path is defined as any ordered sequence of distinct vertices and edges that links a source vertex j to a target vertex i. The distance matrix D_ij describes the distance from vertex j to vertex i, that is, the length of the shortest valid path linking them. The average of all entries of D_ij has been called the "characteristic path length" [8,9,10], denoted l_path.
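Since all edges here have unit cost, the shortest-path distances can be found by breadth-first search; a minimal sketch (disconnected pairs are simply skipped, which is one common convention):

    # Characteristic path length: mean shortest-path distance over all
    # connected vertex pairs, unit edge costs, via BFS from each vertex.
    import numpy as np
    from collections import deque

    def characteristic_path_length(A):
        N = A.shape[0]
        dists = []
        for s in range(N):
            d = [-1] * N
            d[s] = 0
            q = deque([s])
            while q:
                u = q.popleft()
                for v in np.flatnonzero(A[u]):
                    if d[v] < 0:
                        d[v] = d[u] + 1
                        q.append(v)
            dists += [x for x in d if x > 0]
        return sum(dists) / len(dists) if dists else float('inf')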

4) Cluster Index: The clustering coefficient is a measure of the local interconnectedness of the graph, whereas the path length is an indicator of its overall connectedness. The clustering coefficient C_i for a vertex v_i is the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph the edge e_ij between two nodes i, j is considered identical to e_ji. Therefore, if a vertex v_i has k_i neighbors, k_i(k_i - 1)/2 edges could exist among the vertices within the neighborhood. Using this, the clustering coefficient for undirected graphs can be calculated. The clustering coefficient for the whole graph is given by Watts and Strogatz as the average of the clustering coefficients of all vertices.
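A sketch of the Watts-Strogatz definition quoted above; the convention C_i = 0 for vertices with fewer than two neighbours is an assumption:

    # Mean clustering coefficient of an undirected adjacency matrix A.
    import numpy as np

    def clustering_coefficient(A):
        coeffs = []
        for i in range(A.shape[0]):
            nbrs = np.flatnonzero(A[i])
            k = len(nbrs)
            if k < 2:
                coeffs.append(0.0)                        # convention for k_i < 2
                continue
            links = A[np.ix_(nbrs, nbrs)].sum() / 2       # edges among neighbours
            coeffs.append(links / (k * (k - 1) / 2))
        return float(np.mean(coeffs))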

D. Neural network methods

The neural network methods implemented in our work are explained below.

1) Multilayer Perceptrons: A back propagation network or multilayer perceptron (MLP) consists of at least three layers of units: an input layer, at least one intermediate hidden layer, and an output layer. The units are connected in a feed-forward fashion. With back propagation networks, learning occurs during a training phase. After a back propagation network has learned the correct classification for a set of inputs, it can be tested on a second set of inputs to see how well it classifies untrained patterns [11].

2) Radial-Basis Function Networks: Radial basis function networks are also feed-forward, but have only one hidden layer. RBF hidden layer units have a receptive field, which has a centre: a particular input value at which they have a maximal output. Their output tails off as the input moves away from this point. Generally, the hidden unit function is a Gaussian [11].

3) Support Vector Machines: The support vector machine is a popular technique for classification. Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {1, -1}, the SVM requires the solution of an optimization problem. The training vectors x_i are mapped into a higher dimensional space by a function φ. The SVM then finds a linear separating hyperplane with the maximal margin in this higher dimensional space. K(x_i, x_j) = φ(x_i)^T φ(x_j) is called the kernel function [11].

4) K-Nearest Neighbour Classifier: Among statistical approaches, a k-nearest neighbour classifier (KNNC) was selected because it does not assume any underlying distribution of the data. In the k-nearest neighbour rule, a test sample is assigned the class most frequently represented among the k nearest training samples [12].
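For illustration, a hedged sketch of how three of these four classifiers might be compared with scikit-learn (RBF networks have no direct scikit-learn counterpart and are omitted; the 9-hidden-unit MLP follows Section III, while the data variables and other settings are assumptions):

    # Compare KNNC, MLP and SVM on assumed HRV feature/label arrays.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    def compare_classifiers(X_train, y_train, X_test, y_test):
        k = max(1, int(np.sqrt(len(X_train))))   # K = sqrt(training-set size)
        models = {
            "KNNC": KNeighborsClassifier(n_neighbors=k),
            "MLP": MLPClassifier(hidden_layer_sizes=(9,), max_iter=2000),
            "SVM": SVC(kernel="rbf"),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            print(name, model.score(X_test, y_test))   # test-set accuracy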

E. Features Considered

This section describes the features considered for classifying HRV data.


Page 80: ADCOM 2009 Conference Proceedings

1) Fractal Dimension (FD): Katz's approach is used to calculate the FD of a waveform [13].
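As an illustration, a sketch of Katz's FD under the standard formulation FD = log10(n) / (log10(n) + log10(d/L)), where L is the total curve length, d the maximum distance from the first point, and n the number of steps; unit spacing on the time axis is an assumption:

    import numpy as np

    def katz_fd(x):
        x = np.asarray(x, dtype=float)
        dists = np.sqrt(1.0 + np.diff(x) ** 2)   # step lengths, unit time spacing
        L = dists.sum()                           # total length of the curve
        d = np.max(np.sqrt(np.arange(1, len(x)) ** 2 + (x[1:] - x[0]) ** 2))
        n = len(x) - 1                            # number of steps
        return np.log10(n) / (np.log10(n) + np.log10(d / L))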

2) Complexity measure: The Lempel-Ziv complexity measure C(n) [14] is used in our work, since it is extremely well suited for characterizing the development of spatio-temporal activity patterns in high-dimensional nonlinear systems.
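A hedged sketch of one common way to compute such a complexity count, binarizing the signal about its median and counting phrases in an LZ78-style parsing; the exact C(n) variant used in [14] may differ in details such as normalization:

    import numpy as np

    def lempel_ziv_complexity(x):
        bits = ''.join('1' if v > np.median(x) else '0' for v in x)
        phrases, phrase, count = set(), '', 0
        for ch in bits:
            phrase += ch
            if phrase not in phrases:    # shortest new phrase ends here
                phrases.add(phrase)
                phrase = ''
                count += 1
        return count + (1 if phrase else 0)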

3) Time-domain features of HRV: We assume that only a finite number of intervals are available. For 4-10 min segments, wide-sense stationarity may be assumed. The standard deviation of the RR intervals (SDRR) is defined as the square root of the variance of the RR intervals:

    SDRR = sqrt( E[RR_n^2] - RR_mean^2 )                        (1)

The standard deviation of the successive differences of the RR intervals (SDSD) is defined as the square root of the variance of the sequence ΔRR_n = RR_n - RR_{n+1} (the ΔRR intervals):

    SDSD = sqrt( E[ΔRR_n^2] - ΔRR_mean^2 )                      (2)

4) Non-linear features: The two non-linear features, which are obtained from the Poincare plot [15], are given below.

SD1: The SD1 measure of Poincare width is equivalent to the standard deviation of the successive differences of the intervals, except that it is scaled by 1/sqrt(2).

SD2: The SD2 measure of the length of the Poincare cloud is related to the autocovariance function.

    SD1^2 + SD2^2 = 2 SDRR^2                                    (3)

    SD2^2 = 2 SDRR^2 - (1/2) SDSD^2                             (4)

It can be argued that SD2 reflects the long-term HRV, while the width of the Poincare plot correlates extremely highly with other measures of short-term HRV.
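The four measures defined by equations (1)-(4) can be computed directly from an RR-interval series; a minimal sketch, assuming rr holds the intervals in seconds:

    import numpy as np

    def hrv_features(rr):
        rr = np.asarray(rr, dtype=float)
        drr = rr[:-1] - rr[1:]                  # successive differences (Delta RR)
        sdrr = rr.std()                         # equation (1)
        sdsd = drr.std()                        # equation (2)
        sd1 = sdsd / np.sqrt(2)                 # Poincare width, scaled SDSD
        sd2 = np.sqrt(max(2 * sdrr**2 - 0.5 * sdsd**2, 0.0))   # equation (4)
        return {"SDRR": sdrr, "SDSD": sdsd, "SD1": sd1, "SD2": sd2}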

5) Frequency-Domain features: The frequency-domain features considered are the powers in the ultra low frequency (ULF), very low frequency (VLF), low frequency (LF) and high frequency (HF) ranges.
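A hedged sketch of these band powers, assuming the HRV signal resampled at 4 Hz as described in Section III-B; the band edges follow conventional HRV definitions and are not stated in the paper:

    import numpy as np
    from scipy.signal import welch

    def band_powers(hrv, fs=4.0):
        f, psd = welch(hrv, fs=fs, nperseg=256)
        bands = {"ULF": (0.0, 0.003), "VLF": (0.003, 0.04),
                 "LF": (0.04, 0.15), "HF": (0.15, 0.4)}
        return {name: np.trapz(psd[(f >= lo) & (f < hi)],
                               f[(f >= lo) & (f < hi)])
                for name, (lo, hi) in bands.items()}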

III. RESULTS AND DISCUSSION

A. Results for EEG Data

The collection of data was done for a period of three minutes on average. For every subject, the connectivity matrices were found for every 2-second epoch over the entire three-minute duration. Parameters were calculated for each such epoch for all the graphs generated. Once calculation of the parameters for all subjects over the three-minute duration was completed, the mean of each parameter was taken for each subject. In this study, statistical means of the parameters were used to come to the conclusions. For finding the parameters in the different bands of the EEG signal, the EEG was filtered as per the frequency band being investigated. An initial glance at the parameters does not reveal much information for classification, but on detailed analysis and study of various plots we can see that the subjects occupy different Euclidean spaces. This property can thus be used for classification of subjects.

The work involved detailed study of the entire band of EEG signals (i.e. 0.5 to 38 Hz) and also of the different bands of EEG comprising Delta, Alpha-1, Alpha-2, Theta and Beta. In all six cases the parameters were extracted, tabulated, their statistical means generated and listed for each subject. The results were plotted in 2D and 3D Euclidean space. These plots were examined for classification.

The study was successful not only in identifying subjects suffering from schizophrenia but also in classifying them into various groups within schizophrenia. To further develop the work into a working model, a machine learning approach was used to see if the parameters could be used for automatic classification. In the present work a Support Vector Machine was used successfully to identify schizophrenia patients and also to classify them with an accuracy of more than 90%.

The algorithm proposed for identification of schizophrenia first isolates the normal healthy subjects, i.e., the subjects are tested for schizophrenia. Once a person tests positive, it is required to find to which subgroup the person belongs. The proposed algorithm covers these steps and provides a way to detect the subgroups as well.

1) Detection of non-schizophrenic subjects: During analysis of the full band and various sub-bands of EEG, it was found that the 3D plots of complexity, cluster index and characteristic path length can easily be used to distinguish normal healthy subjects from schizophrenic patients, as shown in figure 1(a). The alpha 1 and alpha 2 band plots are shown in figures 1(b) and 1(c) respectively. It can be seen that normal healthy subjects can easily be separated from schizophrenic subjects.

2) Identification of subgroup in schizophrenic subjects: After identification of a schizophrenic patient, classifying the patient into a particular subgroup is a very important and crucial task.

Identification of mixed schizophrenic subjects: It was observed that by just plotting connection density against complexity, connection density against characteristic path length, and characteristic path length against cluster index, we could identify the mixed schizophrenic subjects without much difficulty. This can be seen from the plots shown in figure 2. The plots are of three kinds, complexity vs connection density, connection density vs characteristic path length, and characteristic path length vs cluster index, for the full EEG band and for the alpha1, alpha2 and theta bands respectively. Similar observations can also be made in the Delta band.


Page 81: ADCOM 2009 Conference Proceedings

The plots clearly show that the mixed schizophrenic subjects can be isolated from the other two subgroups. Though the figures are shown for four bands, even in the delta band the mixed schizophrenia subjects show distinct characteristics which can easily be distinguished from the other two subgroups.

Identification of the positive and negative schizophrenic subjects: After segregation of the mixed schizophrenic subjects, we have two important subgroups left to be detected. For this we plot the 2D and 3D plots shown in figure 3. In these plots we can see that the positive and negative schizophrenic subjects form distinct clusters which can easily be identified. Thus the classification of schizophrenic patients is complete.

3) Proposed SVM based classification using multiband results: From the above plots it becomes amply clear that schizophrenia subjects can be classified using some type of classifier in higher dimensions. Thus, an SVM based classifier can be chosen so as to give a classification accuracy of more than 90%. By using multiband classifiers with multiple kernels in tandem we can further reduce the errors in detection and classification. By testing various kernels it can be seen that, when multiple kernels are used, the accuracy approaches almost 100%. Thus the algorithm shown in Figure 4 is proposed for identification and classification of schizophrenia subjects. For the algorithm to function effectively, a database of established schizophrenic patients is required. The EEG of subjects from the three subgroups of schizophrenia could be taken and analyzed as explained in the previous sections. The SVM modules then run based on the parameters. We have calculated both the errors committed while predicting, and the location of the errors in the data for the various bands. We have observed that the errors can be further reduced using multiple kernels, thus increasing the efficiency.

Once a patient is classified, the result can be updated into the initial database used in the SVM blocks. Prediction accuracy of more than 95% could be achieved using different kernels.

The work presented here applies the graph theory approach to identify schizophrenia patients among normal healthy individuals and to classify them into their subgroups. It is well known that schizophrenia patients are difficult to identify from the EEG itself. At present there is no way to identify the various subgroups of schizophrenia except through a physician's counseling, which consumes the physician's valuable time. Schizophrenia is not a problem of any single portion of the human brain but of the entire brain. Hence a technique that uses the entire multichannel EEG data at a time for analysis, like the graph theory approach, should be used. Thus, the approach followed in this work is much better suited for analyzing the EEG and possibly identifying the patients suffering from schizophrenia.

During the study it became clear that there is a distinct difference in the connectivity of patients suffering from schizophrenia compared with normal healthy control individuals. The formation of clusters by the various groups and subgroups in 2D and 3D Euclidean space has been exploited by using SVM based classifiers. The study encourages us to see whether graph theory parameters could be used to identify other brain disorders which are otherwise difficult to diagnose using existing forms of diagnosis. This study gives us a relatively cheap noninvasive tool to preliminarily classify individuals who can later be clinically analyzed by a physician in detail. This is one of the first approaches for successful identification and classification of patients suffering from schizophrenia.


Fig. 1: Three-dimensional plots of Complexity, Cluster Index and Characteristic path length for (a) Complete band, (b) Alpha1 band and (c) Alpha2 band.



Page 82: ADCOM 2009 Conference Proceedings

Figure 2: Plots for the bands (a) Complete band, (b) Alpha1 band, (c) Alpha2 band and (d) Theta band. The plots are of (i) Complexity vs Connection density, (ii) Connection density vs Characteristic path length and (iii) Characteristic path length vs Cluster Index.


Figure 3: The plots are for (a) Complete Band and (b) Alpha1 Band. The parameters plotted are (i) Cluster Index vs Connection density, (ii) Characteristic path length vs Cluster index, (iii) Characteristic path length, Cluster Index and Complexity, and (iv) Characteristic path length, Cluster Index and Connection density.

Figure 4: Proposed algorithm

B. Results for ECG Data

The ECG was recorded in lead II configuration from an HP 78173A ECG monitor. The signal was recorded onto a PC using a 12-bit ADC at a sampling frequency of 500 Hz. The HRV signal is extracted from these ECG recordings using Berger's algorithm (which comprises a peak detection algorithm to detect R peaks followed by interpolation of the interbeat intervals). The supine data were obtained after the subjects rested for 10 min, and the standing data were obtained 2 min after the subjects stood up. Controlled breathing corresponds to breathing at a specified rate (typically 12 breaths per minute), whereas in spontaneous breathing the subject is asked to breathe normally. The short HRV records are 256 sec long, sampled at a frequency of 4 Hz. Four classification algorithms, viz. KNNC, MLP, RBF networks and SVMs, were tested for the classification. Tables 1 and 2 show the classification accuracy on the test sets for classification of supine vs standing posture and normal vs abnormal cases. The classification accuracies are presented for different sets of features given to the algorithms.

1) Classification of supine and standing postures: Classification of the data into supine and standing postures is considered. The obtained accuracies for the back propagation network with 9 hidden units, the RBF network with 110 hidden units and the support vector machine with RBF kernel are shown in Table 1. The highest classification accuracy (90.57%) is obtained for the SVM with RBF kernel.

2) Classification of normal and abnormal cases: Classification of the data into normal and abnormal cases is considered. The obtained accuracies for the back propagation network with 11 hidden units, the RBF network with 110 hidden units and the support vector machine with RBF kernel are shown in Table 2. The highest classification accuracy (91.2%) is obtained for the SVM with RBF kernel.

Features Considered                                   KNNC    MLP     RBF     SVM
Mean, Variance                                        78.63   90.43   89.74   90.57
Fractal Dimension (FD), Complexity Measure (CM)       75.00   78.95   79.31   89.57
Mean, Variance, FD, CM                                78.45   90.43   89.44   90.57
Frequency Domain Features                             78.95   89.57   89.57   91.20
Mean, Variance, FD, CM & Frequency domain features    74.06   90.43   89.44   90.57

Table 1. Testing accuracy (%) for HRV data (supine and standing postures)


Page 83: ADCOM 2009 Conference Proceedings

Features Considered                                   KNNC    MLP     RBF     SVM
Mean, Variance                                        79.82   87.83   89.44   90.57
Fractal Dimension (FD), Complexity Measure (CM)       70.69   83.66   84.27   88.19
Mean, Variance, FD, CM                                79.82   87.83   89.44   90.57
Frequency Domain Features                             77.59   89.74   89.44   91.20
Mean, Variance, FD, CM & Frequency domain features    67.83   89.74   89.74   91.20

Table 2. Testing accuracy (%) for HRV data (normal and abnormal cases)

From Tables 1 and 2, we can observe the following:

(i) The obtained accuracies for the back propagation network, RBF networks and support vector machine with RBF kernel are higher than for k-nearest neighbours. This is because, for the data used in this work, different inputs carry different amounts of information, whereas the k-nearest neighbours algorithm computes Euclidean distances between the classified vector and other vectors. Therefore, this algorithm cannot take into consideration the fact that different inputs bring different amounts of information, which is a natural feature of neural networks.

(ii) The features SDRR, SDSD, SD1 and SD2 play an important role in the classification of normal vs abnormal cases.

(iii) The features fractal dimension and complexity measure do not seem to play a significant role in the classification problems.

(iv) Heart rate variability is reduced in individuals with anxiety and depressive disorders, and hence we are able to obtain accuracy of around 90% by using either the features SDSD, SDRR, SD1 and SD2 or the frequency features alone.

(v) In the case of classification of supine and standing postures, frequency features are significant. Spectral analysis of HR data shows a relative increase in low-frequency power and a decrease in high-frequency power from supine to standing posture. These changes are attributed to a predominance of sympathetic activity and vagal withdrawal in the standing posture.

The second part of this paper, on HRV, has presented a neural-network-based approach to classifying HRV data into normal and abnormal cases and supine and standing postures. We are able to correctly classify 106 out of 116 cases corresponding to normal and abnormal subjects, and in the case of supine and standing subjects we are able to classify 105 out of 116 correctly. We compared the conventional KNNC method with three kinds of neural networks (MLP, RBF and SVM). The obtained accuracies for the back propagation network, RBF network and support vector machine with RBF kernel are higher than for k-nearest neighbours. Among the neural network methods, SVM gives better performance than MLP and RBF networks. The RBF network, in general, gives better performance than the MLP network. In the case of classification of supine and standing postures, frequency features are significant, and heart rate variability is reduced in individuals with anxiety and depressive disorders. Improved classification can be achieved by taking a larger training set.

ACKNOWLEDGEMENTS

The author would like to thank Dr. John P. John, National Institute of Mental Health and Neuro Sciences (NIMHANS), Bangalore, for providing the necessary EEG data for this study. The author would also like to thank Maj. Kiran Kumar and Mr. Mutyalaraju for their help in the development of the programs.

REFERENCES

[1] K. Sim, T. H. Chua, Y. H. Chan, R. Mahendran and S. A. Chong, “Psychiatric comorbidity in first episode schizophrenia: a 2 year, longitudinal outcome study”, J. Psychiatr. Res., vol. 40(7), pp. 656-663, 2006.

[2] M. Galinier, S. Boveda, A. Pathak, J. Fourcade and B. Dongay, “Intra-individual analysis of instantaneous heart rate variability”, Crit. Care Med., vol. 28(12), pp. 3939-3940, 2000.

[3] D. L. Musselman, D. L. Evans and C. B. Nemeroff, “The relationship of depression to cardiovascular disease”, Arch. Gen. Psychiatry, vol. 55, pp. 580-592, 1998.

[4] A. R. McIntosh, M. N. Rajah and N. J. Lobaugh, “Interactions of prefrontal cortex in relation to awareness in sensory learning”, Science, vol. 284, pp. 1531-1533, 1999.

[5] G. Tononi, O. Sporns and G. M. Edelman, “A measure for brain complexity: Relating functional segregation and integration in the nervous system”, Proc. Natl. Acad. Sci. USA, vol. 91, pp. 5033-5037, 1994.

[6] O. Sporns, G. Tononi and G. M. Edelman, “Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices”, Cerebral Cortex, vol. 10, pp. 127-141, 2000.

[7] O. Sporns, G. Tononi and G. M. Edelman, “Connectivity and complexity: The relationship between neuroanatomy and brain dynamics”, Neural Networks, vol. 13, pp. 909-922, 2000.

[8] O. Sporns and G. Tononi, “Classes of network connectivity and dynamics”, Complexity, vol. 7, pp. 28-38, 2002.

[9] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks”, Nature, vol. 393, pp. 440-442, 1998.

[10] D. J. Watts, Small Worlds, Princeton University Press, Princeton, NJ, 1999.

[11] S. Haykin, Neural Networks: A Comprehensive Foundation, second edition, New York: Pearson Education, 2002.

[12] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, second edition, New York: John Wiley and Sons, 2001.

[13] M. Katz, “Fractals and the analysis of waveforms”, Comput. Biol. Med., vol. 18, pp. 145-156, 1988.

[14] X.-S. Zhang, R. J. Roy and E. W. Jensen, “EEG complexity as a measure of depth of anaesthesia”, IEEE Trans. Biomed. Eng., vol. 48, pp. 1424-1433, 2001.

[15] M. Brennan, M. Palaniswami and P. Kamen, “Do existing measures of Poincare plot geometry reflect nonlinear features of heart rate variability?”, IEEE Trans. Biomed. Eng., vol. 48, no. 11, pp. 1342-1347, 2001.


Page 84: ADCOM 2009 Conference Proceedings

Supervised Gene Clustering for Extraction of Discriminative Features from Microarray Data

Chandra Das#1, Pradipta Maji∗2, Samiran Chattopadhyay$3

# Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, 700 152, India
1 [email protected]

∗ Machine Intelligence Unit, Indian Statistical Institute, Kolkata, 700 108, India
2 [email protected]

$ Department of Information Technology, Jadavpur University, Kolkata, 700 092, India
3 [email protected]

Abstract— Among the large number of genes present in microarray data, only a small fraction of them are effective for performing a certain diagnostic test. However, it is very difficult to identify these genes for disease diagnosis. Clustering is able to group a set of genes based on their interdependence and allows a small number of genes within or across the groups to be selected for analysis that may contain useful information for sample classification. In this regard, a new supervised gene clustering algorithm is proposed to cluster genes from microarray data. The proposed method directly incorporates the information of the response variables in the grouping process for finding such groups of genes, yielding a supervised clustering algorithm for genes. Significant genes are then selected from each cluster depending on the response variables. The average expression of the selected genes from each cluster acts as a representative of that particular cluster. Some significant representatives are then taken to form the reduced feature set that can be used to build classifiers with very high classification accuracy. To compute the interdependence among the genes as well as the gene-class relevance, mutual information is used, which is shown to be successful. The performance of the proposed method is evaluated based on the predictive accuracy of the naive bayes classifier, the K-nearest neighbor rule, and the support vector machine. The proposed method attains 100% predictive accuracy for all data sets. The effectiveness of the proposed method, along with a comparison with existing methods, is demonstrated on three microarray data sets.

I. INTRODUCTION

Recent advancement of microarray technologies has made the experimental study of gene expression data faster and more efficient. Microarray techniques, such as the DNA chip and the high-density oligonucleotide chip, are powerful biotechnologies because they are able to record the expression levels of thousands of genes simultaneously. The vast amount of gene expression data leads to statistical and analytical challenges, including the classification of the dataset into correct classes. So, an important application of gene expression data in functional genomics is to classify samples according to their gene expression profiles, such as to classify cancer versus normal samples or to classify different types of cancer [1], [2].

Both supervised and unsupervised classifiers have been used to build classification models from microarray data. This study addresses the supervised classification task, where data samples belong to a known class. So, the outcomes are sample classes and the input features are measurements of genes for gene array-based sample classification. However, the major problem of microarray gene expression data-based sample classification is the huge number of genes compared to the limited number of samples. Most classification algorithms suffer from such a high dimensional input space. Furthermore, most of the genes in arrays are irrelevant to sample distinction. These genes may also introduce noise and decrease prediction accuracy. In addition, a biomedical concern for researchers is to identify the key "marker genes" which discriminate samples for class diagnosis. Therefore, gene selection is crucial for sample classification in medical diagnostics, as well as for understanding how the genome as a whole works.

One way to identify these marker genes is to use clustering methods [3], which partition the given genes into distinct subgroups, so that the genes (features) within a cluster are similar while those in different clusters are dissimilar. A small number of top-ranked genes from each cluster are then selected or extracted based on some evaluation criterion to constitute the resulting reduced subset. As a first approach, unsupervised clustering techniques have been widely applied to find groups of co-regulated genes on microarray data. The most prevalent approaches of unsupervised clustering include: (i) hierarchical clustering [4], (ii) K-means clustering [5], and (iii) clustering through Self-Organizing Maps [6]. However, these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification. This is because genes are clustered by similarity only, without using any information about the experiment's response variable.

This problem is solved by supervised clustering. Supervised clustering is defined as grouping of variables (genes), controlled by information about the class variables, for example, the tumor types of the tissues. Previous work in this field encompasses tree harvesting [7], a two-step method which consists first of generating numerous candidate groups by unsupervised hierarchical clustering. Then, the average expression profile of each cluster is considered as a potential


Page 85: ADCOM 2009 Conference Proceedings

input variable for a response model, and the few gene groups that contain the most powerful information for sample discrimination are identified. In this work, only the second step makes the clustering supervised, as the selection process relies on external information about the class variable types. But it also fails to reveal functional groups of genes of special interest in class prediction because genes are clustered by unsupervised information only.

A supervised clustering approach that directly incorporates the response variables or class variables in the grouping process is the partial least squares (PLS) procedure [8], [9]. The PLS has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret. Another new promising supervised method [10], like PLS, is a one-step approach that directly incorporates the response variables into the grouping process. This supervised clustering algorithm is a combination of gene selection for cluster membership and formation of a new predictor by possible sign flipping and averaging of the gene expressions within a cluster. The cluster membership is determined with a forward and backward searching technique that optimizes the Wilcoxon test based predictive score and margin criteria defined in [10], which both involve the supervised response variables from the data. However, as both the predictive score and the margin criteria depend on the actual gene expression values, they are very sensitive to noise or outliers in the data set.

In this paper, a new supervised gene clustering algorithm is proposed. It finds co-regulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. Mutual information is used here to measure the similarity between genes and the gene-class relevance. The proposed method uses this measure to reduce the redundancy among genes. It involves partitioning the original gene set into some distinct subsets or clusters so that the genes within a cluster are highly co-regulated, with strong association to the response variables or sample categories, while those in different clusters are as dissimilar as possible. A single gene from each cluster having the highest gene-class relevance value is first selected as the initial representative of that cluster. The representative of each cluster is then modified by averaging the initial representative with other genes of that cluster whose collective expression is strongly associated with the sample categories. In effect, the proposed algorithm yields clusters typically made up of a few genes whose coherent average expression levels allow perfect discrimination of sample categories. After generating all clusters and their representatives, a few representatives are selected according to their class discrimination power and are passed through classification algorithms to classify samples.

To evaluate the performance of the proposed method, different cancer gene expression datasets are used. The performance of the proposed method is studied using the predictive accuracy of the support vector machine, the K-nearest neighbor rule, and the naive bayes method. The classification accuracy of the proposed method is compared with those yielded by other gene-selection methods. The experimental results demonstrate that the proposed method is more effective.

The rest of this paper is organized as follows: In Section II, a novel feature extraction algorithm is presented. Section III briefly describes the concept of mutual information, which is used to measure the similarity between two genes and the relevance of a gene with respect to the class variables. In Section IV, extensive experimental results are discussed, along with a comparison with other related methods. The paper is concluded with a summary in Section V.

II. PROPOSED FEATURE EXTRACTION METHOD

This section presents an algorithm for supervised learning of similarities and interactions among predictor variables for classification in very high dimensional spaces, and hence it is predestined for searching functional groups of genes in microarray expression data.

A. Proposed Supervised Clustering Algorithm

The proposed basic stochastic model for microarray data equipped with a categorical response is given by a random pair (ξ, Y) with values in R^m × Y, where ξ ∈ R^m denotes a log-transformed gene-expression profile of a tissue sample, standardized to mean zero and unit variance. Y is the associated response variable, taking numeric values in Y = {0, 1, ..., K-1}. Here K represents the number of classes. Suppose X represents the gene set, where X = {X_1, X_2, ..., X_m}.

To account for the fact that not all m genes on the chip, but rather a few functional gene subsets, determine nearly all of the outcome variation and thus the type of a tissue, the whole gene set is partitioned into z functional groups or clusters (C_1, ..., C_z with z ≪ m). They form a disjoint and usually incomplete partition of the gene set: ∪_{i=1}^{z} C_i ⊂ {1, ..., m} and C_i ∩ C_j = ∅ for i ≠ j. Finally, a representative of every cluster is generated, and a few of them form the reduced feature set. Let ξ_{C_i} ∈ R denote a representative expression value of gene cluster C_i. There are many possibilities to determine such group values ξ_{C_i}, but as we would like to shape clusters that contain similar genes, a simple linear combination is an accurate choice:

    ξ_{C_i} = (1/|C_i|) Σ_{g ∈ C_i} δ_g ξ_g,   δ_g ∈ {-1, 1}      (1)

Here ξ_g represents the expression value of gene X_g.

Because of the use of log-transformed, mean-centered and standardized expression data, we, as a novel extension, allow the contribution of a particular gene g to the group value ξ_{C_i} also to be given by its 'sign-flipped' expression value -ξ_g. This means that we treat under- and over-expression symmetrically, and it prevents the differential expression of genes with different polarity from cancelling out when they are averaged. Now, we describe the partitioning process of the gene set X into subsets or clusters C = {C_1, ..., C_z}.

The proposed clustering method has two phases: (1) a cluster generation phase, and (2) a cluster refinement phase. In the cluster generation phase, first the class relevance value of every gene


Page 86: ADCOM 2009 Conference Proceedings

is calculated. Then the gene with the highest class relevance value, suppose X_v, is selected as the first member of cluster C_1. Next, among the remaining genes, the genes whose similarities with X_v are greater than or equal to a user-defined threshold α are selected as members of cluster C_1. In this way cluster C_1 is formed.

The next phase is the cluster refinement phase. In this phase, from cluster C_1, at first the gene X_v with the highest class relevance value is chosen. Then any gene present in cluster C_1, suppose X_t, is taken and the average expression profile of X_v and X_t is calculated. If the class relevance value of the average expression profile is greater than the class relevance value of X_v, then X_t is selected. The whole process is then repeated: any gene present in cluster C_1 except genes X_v and X_t, suppose X_r, is taken and the average expression profile of X_v, X_t and X_r is computed. If the class relevance value of this average expression profile (of genes X_v, X_t, X_r) is greater than the class relevance value of the previous average expression profile (of genes X_v and X_t), then X_r is selected. The whole process is repeated in cluster C_1 until there is no unchecked gene. In this way some genes are selected in cluster C_1 which increase the class relevance value, and the average expression profile of these selected genes acts as the representative of cluster C_1. After that, the gene with the highest class relevance value, i.e., X_v, in this cluster and all other genes present in this cluster whose similarity with X_v is greater than or equal to β are discarded. After the completion of cluster C_1, the gene with the next highest class relevance value among the genes not selected in any previously created cluster is taken, and the whole clustering process is repeated z times. Here, z is a user-defined parameter.

After generating all z clusters and their representatives, the best h cluster representatives are selected according to their class relevance value and are passed through classifier algorithms to measure the classification accuracy.

• Proposed Feature (Gene) Extraction Algorithm:

Input: Given an m × n gene expression matrix T = {w_ij | i = 1, ..., m, j = 1, ..., n}, where w_ij is the measured expression level of gene X_i in the jth sample, and m and n represent the total number of genes and samples respectively. Let X represent the set of genes; then |X| = m and X = {X_1, ..., X_i, ..., X_m}. Let CR(X_i) represent the class relevance value of gene X_i, computed using mutual information. α and β are user-defined thresholds. p is a variable which holds the number of the current cluster. z is a user-defined input variable which holds the maximum number of clusters generated by the proposed algorithm. Let C represent the set of clusters, C = {C_1, ..., C_z}. Here S is a set which holds the genes used to generate the average gene expression profile of the current cluster.

Output: A set containing h cluster representatives.

1) Initialize: S ← ∅ and p ← 1.

2) At first, using mutual information, the class relevance value CR(X_i) of every gene X_i, i = 1, ..., m, is calculated.

3) Now the gene with the highest class relevance value, let it be X_v, is selected. This is the first member of the cluster C_p.

4) Repeat steps 5 to 15 while p ≤ z and not all m genes have been checked:

5) Among the remaining genes in X, the genes whose similarities with X_v are ≥ α are selected for cluster C_p, and the cluster C_p is formed.

6) Now, in cluster C_p, the gene X_v is taken and the initial cluster mean ξ_{C_p} is set to the expression vector ξ̃_{X_v} of the chosen gene; in effect X_v ∈ S.

7) Repeat the following two steps until all genes of cluster C_p are checked:

8) Average the current average cluster expression profile ξ_{C_p} with each individual gene X_i present in cluster C_p:

    ξ_{C_p+X_i} = (1/(|S|+1)) ( ξ̃_{X_i} + Σ_{X_t ∈ S} ξ̃_{X_t} )

9) If the class relevance of ξ_{C_p+X_i} is greater than or equal to that of ξ_{C_p}, then X_i ∈ S and ξ_{C_p} = ξ_{C_p+X_i}.

10) The final average expression profile ξ_{C_p} of cluster C_p acts as the representative of C_p.

11) After generating the cluster representative, X_v and all genes present in cluster C_p that have similarities with X_v greater than or equal to β are discarded from the gene set X.

12) Now p = p + 1 and S ← ∅.

13) Now the gene with the next highest class relevance value among the genes not selected in any previously created cluster, let it be X_u, is taken. This is the first member of the next cluster C_p; go to step 4.

14) If no gene is found, then p = p - 1 and go to step 15.

15) After generating all cluster representatives, the best h cluster representatives are selected according to their class relevance values.

16) End.
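The steps above can be condensed into a short sketch. This is a minimal illustration, assuming a genes-x-samples matrix with mutual information computed on discretized values via scikit-learn's mutual_info_score; the sign-flipping of equation (1) and several bookkeeping details are omitted, and all helper names are mine, not the authors':

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def discretize(x):
        # Three-state discretization around mean +/- sigma/2 (Section III-B).
        mu, sigma = x.mean(), x.std()
        return np.where(x > mu + sigma / 2, 1, np.where(x < mu - sigma / 2, -1, 0))

    def supervised_gene_clusters(T, y, alpha, beta, z):
        # T: m x n expression matrix (genes x samples); y: sample class labels.
        m = T.shape[0]
        disc = np.array([discretize(g) for g in T])
        relevance = np.array([mutual_info_score(y, d) for d in disc])
        remaining = set(range(m))
        reps = []
        for _ in range(z):
            if not remaining:
                break
            v = max(remaining, key=lambda i: relevance[i])              # steps 3/13
            sim = {i: mutual_info_score(disc[v], disc[i]) for i in remaining}
            cluster = [i for i in remaining if sim[i] >= alpha]          # step 5
            S, profile, best_rel = [v], T[v].copy(), relevance[v]        # step 6
            for i in cluster:                                            # steps 7-9
                if i == v:
                    continue
                candidate = (T[i] + profile * len(S)) / (len(S) + 1)     # step 8
                cand_rel = mutual_info_score(y, discretize(candidate))
                if cand_rel >= best_rel:
                    S.append(i)
                    profile, best_rel = candidate, cand_rel
            reps.append((best_rel, profile))                             # step 10
            remaining -= {i for i in cluster if sim[i] >= beta} | {v}    # step 11
        # caller keeps the first h profiles, sorted by relevance (step 15)
        return [p for _, p in sorted(reps, key=lambda t: t[0], reverse=True)]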

B. Time Complexity

In this algorithm the original number of features (genes) is m. From them the proposed method selects h cluster representatives. These cluster representatives form the reduced feature set. In the proposed method, at first the class relevance value of every gene is calculated. The time needed to calculate the class relevance value of one gene is n, as there are n samples. So, the time complexity of this phase is O(mn). In the next phase, the gene with the highest class relevance value is chosen as the member of the first cluster. Based on a user-defined threshold, some of the other genes are selected for that cluster and the cluster is formed. Then, the gene with the next highest class relevance value among the remaining genes not already selected in any previously created cluster is chosen, and the whole process is repeated until m genes are checked or z clusters are obtained. In every cluster, the similarities of all genes are calculated with respect to the


Page 87: ADCOM 2009 Conference Proceedings

gene which has the maximum class relevance value in that cluster. The similarity calculation time between two genes is n, as there are n samples. The time complexity of selecting the genes in a cluster is thus O(mn). As z clusters are created, the time complexity of this phase is O(zmn). At last, h cluster representatives are selected according to their class relevance value. The time complexity of this phase is O(h). As h ≪ m, the overall time complexity of the proposed method is O(zmn).

C. Choice of α and β

In this paper, we measure the similarity between any two genes using mutual information. If two genes are highly similar then the mutual information between them is very high; otherwise the mutual information between them is low. When two genes are statistically independent, the mutual information between them is zero. Mutual information is maximum when we measure the similarity of a gene with itself. The parameter α is a threshold that measures the degree of similarity of two genes. For every dataset, we first measure the similarity of each gene with itself. Among these similarity values, the maximum is the maximum mutual information value for that particular data set. So, we take values of α between 0 and the maximum mutual information value for every dataset. On the other hand, the threshold β is used to decide whether a gene of the current cluster will be considered in the next cluster generation step or not. For every data set and for every cluster, β is set to 90% of the maximum similarity with the initial cluster representative.

III. EVALUATION CRITERIA FOR GENE SELECTION

The F-test value [11], [12], information gain, mutual information [11], [13], normalized mutual information [14], etc., are typically used to measure the gene-class relevance, and the same or a different metric such as mutual information, the L1 distance, Euclidean distance, Pearson's correlation coefficient, etc. [11], [13], [15], is employed to calculate the gene-gene similarity or redundancy. However, as the F-test value, Euclidean distance, Pearson's correlation, etc., depend on the actual gene expression values of the microarray data, they are very sensitive to noise or outliers in the data set. On the other hand, as mutual information depends only on the probability distribution of a random variable rather than on its actual values, it is more effective for evaluating the gene-class relevance as well as the gene-gene similarity [11], [13]. So, in this paper, mutual information is used to measure the gene-class relevance and gene-gene similarity.

A. Mutual Information

In principle, mutual information is used to quantify the information shared by two objects. If two independent objects do not share much information, the mutual information value between them is small, while two highly correlated objects will demonstrate a high mutual information value [16]. The objects can be the class label and the genes. The suitability of a gene as an independent and informative feature can, therefore, be determined by the shared information between the gene and the rest of the genes, as well as the shared information between the gene and the class label [11], [13]. If a gene has expression values randomly or uniformly distributed in different classes, its mutual information with these classes is zero. If a gene is strongly differentially expressed for different classes, it should have large mutual information. Thus, mutual information can be used as a measure of the relevance of genes. Similarly, mutual information may be used to measure the level of similarity between genes.

The entropy is a measure of the uncertainty of random variables. If a discrete random variable X has alphabet Ξ and the probability density function is p(x) = Pr{X = x}, x ∈ Ξ, the entropy of X is defined as

    H(X) = - Σ_{x ∈ Ξ} p(x) log p(x)                            (2)

Similarly, the joint entropy of two random variables X with alphabet Ξ and Y with alphabet Υ is given by

    H(X, Y) = - Σ_{x ∈ Ξ} Σ_{y ∈ Υ} p(x, y) log p(x, y)         (3)

where p(x, y) is the joint probability density function. The mutual information between X and Y is therefore given by

    I(X, Y) = H(X) + H(Y) - H(X, Y)                             (4)

B. Discretization

In microarray gene expression data sets, the class labels of samples are represented by discrete symbols, while the expression values of genes are continuous. Hence, to measure both the gene-class relevance of a gene with respect to class labels and the gene-gene redundancy between two genes using mutual information [11], [13], [17], the continuous expression values of a gene are usually divided into several discrete partitions. The a priori (marginal) probabilities and their joint probabilities are then calculated to compute both gene-class relevance and gene-gene redundancy using the definitions for the discrete cases. In this paper, the discretization method reported in [11], [13], [17] is employed to discretize the continuous gene expression values. The expression values of a gene are discretized using the mean μ and standard deviation σ computed over the n expression values of that gene: any value larger than (μ + σ/2) is transformed to state 1; any value between (μ - σ/2) and (μ + σ/2) is transformed to state 0; any value smaller than (μ - σ/2) is transformed to state -1. These three states correspond to the over-expression, baseline, and under-expression of genes.
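The discretization rule above and equations (2)-(4) together give the relevance and similarity measures used throughout the algorithm; a small self-contained sketch (function names are illustrative, not the authors'):

    import numpy as np

    def discretize(gene):
        # States 1 / 0 / -1 around mean +/- sigma/2, as described above.
        mu, sigma = gene.mean(), gene.std()
        return np.where(gene > mu + sigma / 2, 1,
                        np.where(gene < mu - sigma / 2, -1, 0))

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))            # equation (2)

    def joint_entropy(x, y):
        pairs = np.stack([np.asarray(x).astype(str),
                          np.asarray(y).astype(str)], axis=1)
        _, counts = np.unique(pairs, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))            # equation (3)

    def mutual_information(x, y):
        return entropy(x) + entropy(y) - joint_entropy(x, y)   # equation (4)

    # gene-class relevance: mutual_information(discretize(gene), labels)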

IV. EXPERIMENTAL RESULTS AND DISCUSSION

The experimental results are organized as follows: first, the characteristics of the microarray data sets used are discussed briefly. Then, the different classifiers (naive Bayes classifier, K-nearest neighbor rule, and support vector machine) used here to measure the performance of the proposed algorithm are described and, finally, the performance


Page 88: ADCOM 2009 Conference Proceedings

of the proposed method is extensively compared with the other existing methods reported in [3].

In this paper, mutual information is applied to calculate both gene-class relevance and gene-gene redundancy. All methods are implemented in the C language and run in a Linux environment on a machine with a Pentium IV 3.2 GHz processor, 1 MB cache, and 1 GB RAM. To analyze the performance of the proposed method, experiments are carried out on different microarray gene expression data sets.

A. Gene Expression Data Sets

In this paper, different public cancer microarray data sets are used. Since binary classification is a typical and fundamental issue in the diagnostic and prognostic prediction of cancer, the proposed method is compared with other existing methods using the following binary-class data sets.

1) Breast Cancer Data Set [18]: The breast cancer data set contains expression levels of 7129 genes in 49 breast tumor samples from [18]. The samples are classified according to their estrogen receptor (ER) status: 25 samples are ER positive while the other 24 samples are ER negative.

2) Leukemia Data Set [1]: It is an Affymetrix high-density oligonucleotide array that contains 7070 genes and 72 samples from two classes of leukemia: 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia.

3) Colon Cancer Data Set [19]: The colon cancer data set contains expression levels of 40 tumor and 22 normal colon tissue samples. Only the 2000 genes with the highest minimal intensity were selected by [19].

B. Class Prediction Methods

The following three classifiers are used to evaluate the performance of the proposed clustering algorithm.

1) Naive Bayes Classifier [20]: The naive Bayes (NB) classifier is one of the oldest classifiers. It is obtained by using the Bayes rule and assuming that the features (variables) are independent of each other given the class. For the jth sample s_j with m gene expression levels w_1j, ..., w_ij, ..., w_mj for the m genes, the posterior probability that s_j belongs to class c is

p(c | s_j) ∝ ∏_{i=1}^{m} p(w_ij | c)    (5)

where the p(w_ij | c) are conditional tables (or conditional densities) estimated from training examples. Despite the independence assumption, the NB classifier has been shown to have good classification performance on many real data sets [20].
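A minimal sketch of the prediction step of Eq. (5) in C, working in log space to avoid underflow. The conditional tables and priors are assumed to be estimated from the training samples (e.g. with smoothing so that no entry is zero); all names and dimensions here are illustrative, not from the paper.

#include <math.h>

#define NCLASS 2   /* binary cancer data sets */
#define MAXG   50  /* at most 50 cluster representatives (z = 50) */

/* Return the class maximizing log prior[c] + sum_i log p(w_i | c),
   the log of Eq. (5).  cond[c][i][s] holds the estimated probability
   of state s of gene i in class c; w holds the sample's discretized
   states in {-1, 0, +1}. */
int nb_predict(int m, const int *w, double cond[NCLASS][MAXG][3],
               const double prior[NCLASS])
{
    int best = 0;
    double best_score = -1e300;
    for (int c = 0; c < NCLASS; c++) {
        double score = log(prior[c]);
        for (int i = 0; i < m; i++)
            score += log(cond[c][i][w[i] + 1]);  /* state -1/0/+1 -> 0/1/2 */
        if (score > best_score) { best_score = score; best = c; }
    }
    return best;
}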

2) Support Vector Machine [21]: The support vector machine (SVM) is a relatively new and promising classification method. It is a margin classifier that draws an optimal hyperplane in the feature vector space; this defines a boundary that maximizes the margin between data samples in different classes, therefore leading to good generalization properties. A key factor in the SVM is the use of kernels to construct a nonlinear decision boundary. In the present work, linear kernels are used. The source code of the SVM is downloaded from http://www.csie.ntu.edu.tw/∼cjlin/libsvm.

3) K-Nearest Neighbor Rule [22]: The K-nearest neighbor (K-NN) rule is used for evaluating the effectiveness of the reduced gene set for classification. It is one of the simplest machine learning algorithms, classifying samples based on the closest training samples in the feature space. A sample is classified by a majority vote of its neighbors, the sample being assigned to the class most common amongst its K nearest neighbors. The value of K chosen for the K-NN is the square root of the number of samples in the training set.

C. Performance Analysis

In this section, the results obtained by applying the supervised clustering algorithm to the above mentioned data sets are briefly described. The experimental results on different microarray data sets are presented in Tables I-IX. Subsequent discussions analyze the results with respect to the prediction accuracy of the NB, SVM, and K-NN classifiers. Tables I, IV, VII and Tables II, V, VIII provide the performance of the proposed method using the NB and SVM respectively, while Tables III, VI, IX show the results using the K-NN. Extensive experiments have been carried out for different values of α, namely 0.0001, 0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.19, 0.20, 0.30, 0.40, and 0.50.

To compute the prediction accuracy of the NB, SVM, and K-NN, leave-one-out cross-validation is performed on each gene expression data set. In all experiments, the value of z is taken as 50, so at most 50 clusters, each with its own representative, can be generated. All results for all the data sets using the three classifiers are reported with respect to the best 5 representatives. Each data set is pre-processed by standardizing each sample to zero mean and unit variance.
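Leave-one-out cross-validation itself is a simple loop; the sketch below (ours) counts correct predictions over n held-out samples, with train_and_predict standing in as a hypothetical callback for any of the three classifiers.

/* Leave-one-out cross-validation skeleton: each of the n samples is
   held out once, the classifier is trained on the remaining n-1
   samples, and the held-out sample is predicted. */
double loocv_accuracy(int n,
                      int (*train_and_predict)(int held_out),
                      const int *label)
{
    int correct = 0;
    for (int i = 0; i < n; i++)
        if (train_and_predict(i) == label[i])
            correct++;
    return 100.0 * correct / n;   /* accuracy in percent, as in Tables I-IX */
}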

TABLE I

PERFORMANCE ON BREAST CANCER DATA SET USING NB CLASSIFIER

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    100      *        *        *        *
0.001     100      100      *        *        *
0.005     100      100      *        *        *
0.01      100      100      100      *        *
0.02      100      100      100      97.96    *
0.04      100      100      100      100      100
0.06      100      100      100      100      100
0.08      100      100      100      100      100
0.10      100      100      100      100      100
0.11      100      100      100      100      100
0.12      100      100      100      100      100
0.13      100      100      100      100      100
0.14      100      100      100      100      100
0.15      100      100      100      100      100
0.20      100      100      100      97.96    97.96
0.30      97.96    97.96    97.96    97.96    97.96
0.40      89.80    97.96    97.96    100      97.96
0.50      91.84    93.88    93.88    93.88    91.84

Tables I, IV, and VII show the classification accuracy on the different cancer microarray data sets using the NB classifier. Using


Page 89: ADCOM 2009 Conference Proceedings

TABLE II

PERFORMANCE ON BREAST CANCER DATA SET USING SVM

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    93.87    *        *        *        *
0.001     85.72    93.88    *        *        *
0.005     89.79    91.84    *        *        *
0.01      87.76    89.80    91.84    *        *
0.02      89.80    93.88    93.88    100      *
0.04      87.76    100      100      100      100
0.06      91.84    100      100      100      100
0.08      91.84    95.92    100      100      100
0.10      100      97.96    100      100      100
0.11      93.88    97.96    100      100      100
0.12      95.92    100      100      100      100
0.13      95.92    100      100      100      100
0.14      95.92    97.96    100      100      100
0.15      100      100      100      100      100
0.20      97.96    100      100      100      100
0.30      89.80    91.84    89.80    93.88    95.92
0.40      89.80    95.92    95.92    93.88    95.92
0.50      87.76    91.84    91.84    91.84    91.84

TABLE III

PERFORMANCE ON BREAST CANCER DATA SET USING K-NN RULE

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    93.88    *        *        *        *
0.001     95.92    95.92    *        *        *
0.005     100      95.92    *        *        *
0.01      97.96    100      100      *        *
0.02      93.88    93.88    93.88    100      *
0.04      97.96    97.96    100      100      100
0.06      100      100      100      100      100
0.08      95.92    95.92    100      100      100
0.10      100      100      97.96    97.96    100
0.11      95.92    97.96    100      100      100
0.12      97.96    100      100      100      100
0.13      95.92    100      100      100      100
0.14      97.96    97.96    97.96    100      100
0.15      100      100      100      100      100
0.20      100      100      100      100      97.96
0.30      95.92    91.84    89.80    97.96    95.92
0.40      89.80    95.92    95.92    95.92    95.92
0.50      91.84    93.88    93.88    91.84    91.84

the NB, the proposed method gives 100% accuracy on all the above mentioned data sets considering 1 or more gene cluster representatives. With the NB, a classification accuracy of 100% is obtained on the breast cancer data for α values ranging from 0.0001 to 0.20, and on the leukemia cancer data 100% accuracy is obtained for α values ranging from 0.0001 to 0.50. On the colon cancer data, 100% accuracy is obtained for a set of α values, namely 0.01, 0.02, 0.04 and 0.13.

The results reported in Tables II, V, and VIII are based on the predictive accuracy of the SVM. The results show that for the breast cancer data set, 100% accuracy is obtained for α values ranging from 0.02 to 0.20 using 1 or more gene cluster representatives. Using the SVM, 100% accuracy is obtained for the leukemia data for α values ranging from 0.001 to 0.40 using 1 or more cluster representatives. For the colon cancer data, 100% accuracy is obtained only for the α value 0.02, considering 1 gene cluster representative.

For the breast cancer data set using the K-NN, 100% accuracy

TABLE IV

PERFORMANCE ON LEUKEMIA DATA SET USING NB CLASSIFIER

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    100      *        *        *        *
0.001     100      100      *        *        *
0.005     100      100      100      *        *
0.01      100      100      100      100      *
0.02      100      100      100      100      100
0.04      100      100      100      100      100
0.06      100      100      100      100      100
0.08      100      100      100      100      100
0.10      100      100      100      100      100
0.11      100      100      100      100      100
0.12      100      100      100      100      100
0.13      100      100      100      100      100
0.14      100      100      100      100      100
0.15      100      100      100      100      100
0.20      100      100      100      100      100
0.30      100      100      100      100      100
0.40      100      100      100      100      98.61
0.50      93.06    97.22    98.61    100      100

TABLE V

PERFORMANCE ON LEUKEMIA DATA SET USING SVM

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    90.27    *        *        *        *
0.001     91.66    100      *        *        *
0.005     84.72    100      100      *        *
0.01      83.33    100      100      100      *
0.02      90.27    94.44    100      100      100
0.04      90.27    98.61    98.61    100      100
0.06      90.27    100      100      100      100
0.08      94.44    100      100      100      100
0.10      98.61    100      98.61    100      100
0.11      95.83    100      100      100      100
0.12      98.61    98.61    98.61    98.61    100
0.13      100      100      100      100      100
0.14      98.61    100      100      100      100
0.15      98.61    98.61    100      100      100
0.20      94.44    100      98.61    100      100
0.30      100      100      100      100      98.61
0.40      97.22    100      98.61    95.83    98.61
0.50      93.05    93.05    95.83    94.44    95.83

is obtained for α values ranging from 0.005 to 0.20 using 1 or more gene cluster representatives. The K-NN also gives 100% accuracy for α values from 0.001 to 0.30 using 1 or more cluster representatives on the leukemia data set. For the colon cancer data set it also gives 100% accuracy for the α values 0.02 and 0.13 using 1 or more gene cluster representatives.

Analyzing all these results, we can say that for the NB classifier, in the case of the breast cancer data set, 100% accuracy is obtained for α values from 0.0001 to 0.20, while for the SVM and K-NN the best results are obtained for the α values 0.10 and 0.15. So, the proposed method gives its best results on the breast cancer data for α values 0.10 and 0.15. For the colon cancer data using the NB classifier, 100% accuracy is obtained for a set of α values, namely 0.01, 0.02, 0.04 and 0.13. For the SVM, the best result is obtained for the α value 0.02, and using the K-NN the best results are obtained for the α values 0.02 and 0.13 on the colon cancer data. So when the α value is set to 0.02, the proposed method gives 100% accuracy for the three specified classifiers on the colon cancer data using


Page 90: ADCOM 2009 Conference Proceedings

TABLE VI

PERFORMANCE ON LEUKEMIA DATA SET USING K-NN RULE

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    98.61    *        *        *        *
0.001     97.22    100      *        *        *
0.005     95.83    97.22    97.22    *        *
0.01      98.61    100      100      100      *
0.02      98.61    98.61    100      100      100
0.04      97.22    98.61    98.61    100      100
0.06      100      100      100      100      100
0.08      98.61    100      100      100      100
0.10      100      100      98.61    100      100
0.11      95.83    100      100      100      100
0.12      98.61    98.61    98.61    98.61    100
0.13      100      100      100      100      100
0.14      98.61    100      100      100      100
0.15      100      100      100      100      100
0.20      98.61    98.61    98.61    100      100
0.30      100      100      100      100      100
0.40      97.22    100      95.83    94.44    94.44
0.50      94.44    91.67    91.67    93.06    94.44

TABLE VII

PERFORMANCE ON COLON CANCER DATA SET USING NB CLASSIFIER

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    98.39    *        *        *        *
0.001     98.39    96.77    *        *        *
0.005     98.39    96.77    *        *        *
0.01      100      100      *        *        *
0.02      100      98.39    *        *        *
0.04      100      100      100      *        *
0.06      98.39    98.39    96.77    95.16    *
0.08      98.39    98.39    96.77    98.39    98.39
0.10      96.77    98.39    98.39    98.39    98.39
0.11      98.39    98.39    98.39    98.39    98.39
0.12      96.77    96.77    96.77    95.16    96.77
0.13      100      98.39    98.39    96.77    96.77
0.14      96.77    98.39    95.16    95.16    93.55
0.15      98.39    95.16    95.16    96.77    95.16
0.20      98.39    98.39    96.77    95.16    95.16
0.30      95.16    95.16    93.55    93.55    93.55
0.40      91.94    96.77    96.77    95.16    95.16
0.50      83.87    82.26    85.48    93.55    90.32

1 or more cluster representatives. So, the proposed method gives its best result on the colon cancer data set for the α value 0.02. For the leukemia cancer data, 100% accuracy is obtained for α values from 0.0001 onwards using the NB classifier. Using the SVM, the proposed method gives its best results for α values ranging from 0.001 to 0.40, and for the K-NN it gives its best results for α values from 0.001 to 0.30. So, we can say that for the leukemia cancer data the best results are obtained when the α value is set to 0.13 or 0.30.

D. Comparative Performance Analysis

For comparison, the proposed method is compared with the results of the attribute clustering algorithm (ACA), the t-value, the k-means algorithm, the minimum redundancy-maximum relevance (mRMR) algorithm, the self organizing map (SOM), the biclustering algorithm, and the radial basis function (RBF) network on the colon cancer and leukemia cancer data sets as given in [3], using the NB and K-NN methods.

The experimental results in Table X show that the proposed

TABLE VIII

PERFORMANCE ON COLON CANCER DATA SET USING SVM

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    95.16    *        *        *        *
0.001     95.16    95.16    *        *        *
0.005     95.16    98.39    *        *        *
0.01      98.39    96.77    *        *        *
0.02      100      93.55    *        *        *
0.04      90.32    90.32    95.16    *        *
0.06      95.16    96.77    95.16    93.55    *
0.08      88.71    98.39    98.39    98.39    98.39
0.10      93.55    95.16    93.55    93.55    93.55
0.11      95.16    96.77    95.16    91.94    91.94
0.12      95.16    91.94    93.55    91.94    93.55
0.13      96.77    98.39    96.77    96.77    95.16
0.14      95.16    93.55    93.55    95.16    95.16
0.15      98.39    95.16    95.16    95.16    95.16
0.20      93.55    91.94    91.94    93.55    93.55
0.30      93.55    91.94    91.94    87.09    91.94
0.40      91.94    93.55    95.16    95.16    95.16
0.50      80.65    83.88    80.65    85.49    82.26

TABLE IX

PERFORMANCE ON COLON CANCER DATA SET USING K-NN RULE

Value     Number of Selected Genes
of α      1        2        3        4        5
0.0001    95.16    *        *        *        *
0.001     95.16    95.16    *        *        *
0.005     98.39    96.77    *        *        *
0.01      98.39    96.77    *        *        *
0.02      100      93.55    *        *        *
0.04      98.39    93.55    93.55    *        *
0.06      95.16    98.39    95.16    95.16    *
0.08      95.16    96.77    96.77    98.39    98.39
0.10      93.55    95.16    93.55    93.55    93.55
0.11      96.77    96.77    95.16    95.16    95.16
0.12      95.16    91.94    93.55    91.94    93.55
0.13      100      98.39    96.77    96.77    95.16
0.14      93.55    93.55    96.77    95.16    93.55
0.15      98.39    93.55    96.77    95.16    95.16
0.20      98.39    93.55    93.55    95.16    95.16
0.30      93.55    91.94    91.94    91.94    91.94
0.40      91.94    95.16    95.16    95.16    95.16
0.50      83.87    80.65    80.65    82.26    82.26

method is superior to the other gene selection methods: it selects a smaller set of discriminative genes for the colon cancer and leukemia cancer data sets than the others, as reflected by the classification results. The proposed method outperforms the others in all cases. Although the ACA and t-value algorithms can find good discriminative genes for the K-NN method, the t-value is unable to do so for the naive Bayes method. Using the naive Bayes classifier, the ACA gives good results for leukemia cancer but is unable to do so for colon cancer. The k-means algorithm, SOM, biclustering algorithm, mRMR and RBF fail to find good discriminative genes for these two data sets, as shown in the results.

V. CONCLUSION

This paper presents a new algorithm for supervised clustering of genes from microarray experiments. The proposed algorithm is potentially useful in the context of medical diagnostics, as it identifies groups of interacting genes that have high explanatory power for given tissue types, and which


Page 91: ADCOM 2009 Conference Proceedings

TABLE X

COMPARATIVE PERFORMANCE ANALYSIS OF DIFFERENT METHODS

Classifier   Data Set          Method/Algorithm   Accuracy (%)   Number of Genes
K-NN         Colon Cancer      Proposed           100            1
                               ACA                83.9           7
                               t-value            80.6           7
                               k-means            69.4           14
                               SOM                59.7           14
                               Biclustering       69.4           7
                               mRMR               64.5           7
                               RBF                67.7           3
NB           Colon Cancer      Proposed           100            1
                               ACA                67.7           14
                               t-value            56.5           7
                               k-means            62.9           7
                               SOM                64.5           7
                               Biclustering       67.7           7
                               mRMR               64.5           7
                               RBF                64.5           3
K-NN         Leukemia Cancer   Proposed           100            1
                               ACA                91.2           7
                               t-value            88.2           14
                               k-means            50.0           7
                               SOM                50.0           7
                               Biclustering       58.8           21
                               mRMR               70.6           14
                               RBF                47.1           3
NB           Leukemia Cancer   Proposed           100            1
                               ACA                82.4           7
                               t-value            55.9           7
                               k-means            58.8           7
                               SOM                58.8           7
                               Biclustering       58.8           7
                               mRMR               67.6           7
                               RBF                58.8           3

in turn can accurately predict the class labels of new samples. At the same time, such gene clusters may reveal insights into biological processes and may be valuable for functional genomics.

In summary, the proposed algorithm tries to cluster genes such that the discrimination of different tissue types is as simple as possible. The performance of the proposed method is evaluated by the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. For all data sets, 100% classification accuracy is achieved by the proposed method. The results obtained on real data sets demonstrate that the proposed method can bring a remarkable improvement to the gene selection problem. The proposed method is thus capable of identifying discriminative genes that may contribute to revealing underlying class structures, providing a useful tool for the exploratory analysis of biological data.

REFERENCES

[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, 1999.

[2] E. Domany, "Cluster Analysis of Gene Expression Data," Journal of Statistical Physics, vol. 110, pp. 1117-1139, 2003.

[3] W. Au, K. C. C. Chan, A. K. C. Wong, and Y. Wang, "Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 83-101, 2005.

[4] M. B. Eisen, P. T. Spellman, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc. Natl Acad. Sci. USA, vol. 95, pp. 14863-14868, 1998.

[5] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien, "Large-Scale Clustering of cDNA-Fingerprinting Data," Genome Res., vol. 9, pp. 1093-1105, 1999.

[6] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation," Proc. Natl Acad. Sci. USA, vol. 96, pp. 2907-2912, 1999.

[7] T. Hastie, R. Tibshirani, D. Botstein, and P. Brown, "Supervised Harvesting of Expression Trees," Genome Biology, 2001.

[8] D. Nguyen and D. Rocke, "Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data," Bioinformatics, pp. 39-50, 2002.

[9] P. Geladi and B. Kowalski, "Partial Least Squares Regression: A Tutorial," Analytica Chimica Acta, 1986.

[10] M. Dettling and P. Buhlmann, "Supervised Clustering of Genes," Genome Biology, pp. 0069.1-0069.15, 2002.

[11] C. Ding and H. Peng, "Minimum Redundancy Feature Selection from Microarray Gene Expression Data," in Proceedings of the Computational Systems Bioinformatics, 2003, pp. 523-528.

[12] J. Li, H. Su, H. Chen, and B. W. Futscher, "Optimal Search-Based Gene Subset Selection for Gene Array Cancer Classification," IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 4, pp. 398-405, 2007.

[13] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[14] X. Liu, A. Krishnan, and A. Mondry, "An Entropy Based Gene Selection Method for Cancer Classification Using Microarray Data," BMC Bioinformatics, vol. 6, no. 76, pp. 1-14, 2005.

[15] D. Jiang, C. Tang, and A. Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, 2004.

[16] C. Shannon and W. Weaver, The Mathematical Theory of Communication. Champaign, IL: Univ. Illinois Press, 1964.

[17] P. Maji, "f-Information Measures for Efficient Selection of Discriminative Genes from Microarray Data," IEEE Transactions on Biomedical Engineering, vol. 56, no. 4, pp. 1063-1069, 2009.

[18] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the Clinical Status of Human Breast Cancer by Using Gene Expression Profiles," Proceedings of the National Academy of Science, USA, vol. 98, no. 20, pp. 11462-11467, 2001.

[19] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays," Proceedings of the National Academy of Science, USA, vol. 96, no. 12, pp. 6745-6750, 1999.

[20] T. Mitchell, Machine Learning. McGraw-Hill, 1997.

[21] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1999.


Page 92: ADCOM 2009 Conference Proceedings

Modified Greedy Search Algorithm for Biclustering Gene Expression Data

Shyama Das
Department of Computer Science
Cochin University of Science and Technology
Cochin, Kerala, India
[email protected]

Sumam Mary Idicula
Department of Computer Science
Cochin University of Science and Technology
Cochin, Kerala, India
[email protected]

Abstract— Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix. Biclustering is a highly useful data mining technique in the analysis of gene expression data. The problem of identifying the most significant biclusters in gene expression data has been shown to be NP-complete. In this paper a greedy search algorithm is developed for biclustering gene expression data. This algorithm has two steps. In the first step, high quality bicluster seeds are generated using the K-Means clustering algorithm. These seeds are then enlarged using the greedy search method: the node that results in the minimum Hscore value when combined with the bicluster is selected and added to the bicluster. This selection and addition continues till the Hscore value of the bicluster reaches the given threshold. Even though it is a greedy method, the results obtained are far better than those of many of the metaheuristic methods which are generally considered superior to the greedy approach.

Keywords - Biclustering; gene expression data; greedy search; data mining; K-Means clustering

I. INTRODUCTION

DNA microarray technology is capable of measuring the expression levels of thousands of genes in a single experiment. Measuring the gene expression levels across different stages in different tissues or cells or under different conditions is useful for understanding and interpreting biological processes. The relative abundance of the mRNA of a gene under a specific experimental condition or sample is called the expression level of the gene. Gene expression patterns can offer massive information about cell functions, and microarray technology has revolutionized gene expression analysis. Microarray data are widely used in genomic research because of their enormous potential in gene expression profiling, facilitating the prognosis and the discovery of subtypes of diseases. Microarrays are widely used in the medical domain to construct molecular profiles of diseased and normal tissues of patients. Such profiles are extremely useful for understanding various diseases and facilitate more accurate diagnosis, prognosis, treatment planning and drug discovery.

Microarray gene expression data is organized in the form of a matrix where rows represent genes and columns represent experimental conditions or samples. The experimental conditions can be patients, tissue types, etc. The samples can correspond to different time points or different environmental conditions; they can be from different organs, from cancerous or healthy tissues, or even from different individuals. The gene expression data contains thousands of genes and hundreds of conditions. An element in the matrix refers to the expression level of a particular gene under a specific condition. The genes in a set are co-regulated if they display similar fluctuation under all conditions. By discovering the co-regulation, it is possible to infer the gene regulative network, which will lead to a better understanding of how organisms develop and evolve.

One of the objectives of gene expression data analysis is to group genes according to their expression under multiple conditions. Clustering is the most widely used data mining technique for analyzing gene expression data to group similar genes or conditions. Clustering of co-expressed genes into biologically meaningful groups assists in inferring the biological role of an unknown gene that is co-expressed with a known gene. However, clustering has its own limitations. Clustering is based on the assumption that all the related genes behave similarly across all the measured conditions. It may reveal the genes which are very closely co-regulated along the entire column. However, genes are not relevant for all the experimental conditions; groups of genes are co-expressed and co-regulated only under specific conditions and behave almost independently under other conditions. Moreover, clustering partitions the genes into disjoint sets, i.e. each gene is associated with a single biological function, which is in contradiction to the biological system [1]. This observation resulted in the development of clustering methods that try to simultaneously group genes and conditions. This approach is called biclustering or co-clustering.

Biclustering is clustering applied in two dimensions, i.e. along the rows and columns simultaneously. This approach identifies the genes which show similar expression levels under a specific subset of experimental conditions. The objective is to discover maximal subgroups of genes and subgroups of conditions; such genes express highly correlated activities over a range of conditions. Biclustering was first defined by Hartigan, who called it direct clustering [2]. Cheng and Church were the first to apply biclustering to


Page 93: ADCOM 2009 Conference Proceedings

gene expression data [3]. Biclustering is a powerful analytical tool when some genes have multiple functions and the experimental conditions are diverse. In this work a novel algorithm is developed for biclustering gene expression data using a greedy strategy. In the first step, high quality bicluster seeds are generated using the K-Means clustering algorithm. Then the seeds are enlarged by adding the node that results in the minimum incremental increase in Hscore. The node addition continues till the Hscore value of the bicluster reaches the given threshold.

II. METHODS AND MATERIALS

A. Model of bicluster

A gene expression dataset is a matrix in which rows represent genes and columns represent experimental conditions. An element a_ij of the expression matrix A represents the logarithm of the relative abundance of the mRNA of the ith gene under the jth condition. Let X = {G1, G2, ..., GN} be the set of genes and Y = {C1, ..., CM} be the set of conditions in the gene expression dataset. The dataset can be viewed as an N×M matrix A of real numbers. A bicluster is a submatrix B of A; if the size of B is I×J, then I is a subset of the rows X of A, and J is a subset of the columns Y of A. The rows and columns of the bicluster B need not be contiguous as in the expression matrix A.

Biclusters are generally classified into four major types: biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values, and biclusters with coherent evolutions. In the gene expression data matrix, constant biclusters disclose subsets of genes with similar expression values within a subset of conditions. A bicluster with constant values in the rows identifies a subset of genes with similar expression values across a subset of conditions, permitting the expression levels to vary from gene to gene. Similarly, a bicluster with constant columns identifies a subset of conditions within which a subset of genes manifest similar expression values, assuming that the expression values might vary from condition to condition. A bicluster with coherent values identifies a subset of genes and a subset of conditions with coherent values on both the rows and columns; in this case the similarity among the genes is measured by the mean squared residue score, and a matrix whose mean squared residue score is within a certain threshold is a bicluster. In a bicluster with coherent evolutions, a subset of genes is up-regulated or down-regulated across a subset of conditions without considering their actual expression values [1]. Biclusters with coherent values are biologically more relevant than biclusters with constant values; hence in this work biclusters with coherent values are identified.

The problem of biclustering can thus be formulated in the following manner: given a data matrix A, find a set of submatrices B1, B2, ..., Bn which satisfy some homogeneous characteristics or coherence. For measuring the degree of coherence, a measure called the mean squared residue score or Hscore was introduced by Cheng and Church: the mean of the squared residue scores over all elements of the submatrix. The residue score of an element b_ij in a submatrix B is defined as

RS(b_ij) = b_ij − b_Ij − b_iJ + b_IJ

where b_iJ = (1/|J|) Σ_{j∈J} b_ij is the ith row mean, b_Ij = (1/|I|) Σ_{i∈I} b_ij is the jth column mean, and b_IJ = (1/(|I||J|)) Σ_{i∈I} Σ_{j∈J} b_ij is the mean of the whole bicluster. Here I denotes the row set and J denotes the column set of matrix B. The residue score of an element b_ij gives the difference between its actual value and the value predicted from its row mean, column mean and bicluster mean; the residue of an element is thus a measure of how well the entry fits into the bicluster. Hence, from the residue values, the quality of the bicluster can be evaluated by computing the mean squared residue, that is, the Hscore or mean squared residue score of bicluster B:

MSR(B) = (1/(|I||J|)) Σ_{i∈I} Σ_{j∈J} RS(b_ij)²

Cheng and Church defined a bicluster to be a matrix with a low mean squared residue score. The maximum value of MSR that a matrix can have to be called a bicluster is called the MSR threshold, denoted δ. A submatrix B is called a δ-bicluster if MSR(B) < δ for some δ > 0. A high MSR value signifies that the data is uncorrelated; a low MSR value means that there is correlation in the matrix. The value of δ depends on the dataset: for the Yeast dataset δ is 300 and for the Lymphoma dataset δ is 1200. The volume of a bicluster, or bicluster size, is the product of the number of rows and the number of columns in the bicluster.

This bicluster model is much more flexible than row clusters: the identified submatrices need neither be disjoint nor cover the entire matrix. But the computation of biclusters is costly because one would have to consider all combinations of columns and rows in order to find all the biclusters. The search space for the biclustering problem is 2^(m+n), where m and n are the numbers of genes and conditions respectively; usually m+n is more than 2000. The biclustering problem is NP-hard.
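The residue and Hscore definitions above map directly onto code. The following C sketch (ours; C99 for the variable-length arrays) computes MSR(B) for a bicluster given by row and column index sets over a row-major expression matrix.

/* Mean squared residue (Hscore) of the bicluster B defined by the
   row index set rows[0..nr-1] and column index set cols[0..nc-1]
   of the n x m expression matrix a (row-major, m = total columns).
   Implements RS(b_ij) = b_ij - b_Ij - b_iJ + b_IJ. */
double msr(const double *a, int m,
           const int *rows, int nr, const int *cols, int nc)
{
    double rmean[nr], cmean[nc], total = 0.0;   /* C99 VLAs */
    for (int i = 0; i < nr; i++) rmean[i] = 0.0;
    for (int j = 0; j < nc; j++) cmean[j] = 0.0;

    /* accumulate row means, column means and bicluster mean */
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++) {
            double v = a[rows[i] * m + cols[j]];
            rmean[i] += v; cmean[j] += v; total += v;
        }
    for (int i = 0; i < nr; i++) rmean[i] /= nc;
    for (int j = 0; j < nc; j++) cmean[j] /= nr;
    double bIJ = total / (nr * nc);

    double h = 0.0;
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++) {
            double rs = a[rows[i] * m + cols[j]] - rmean[i] - cmean[j] + bIJ;
            h += rs * rs;
        }
    return h / (nr * nc);   /* B is a delta-bicluster if this is below delta */
}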

B. Encoding of bicluster

Each bicluster is encoded as a binary string of fixed length [4]. The length of the string is the sum of the number of rows and the number of columns of the gene expression data matrix. The first N bits represent genes and the next M bits represent conditions. A bit is set to one when the corresponding gene or condition is included in the bicluster; otherwise it is set to zero. This representation is advantageous for node addition and deletion.
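A minimal illustration (ours) of this encoding in C: one flag per gene plus one flag per condition, so node addition and deletion are single bit flips.

/* Fixed-length binary encoding of a bicluster over a data matrix
   with N genes and M conditions: positions 0..N-1 select genes,
   positions N..N+M-1 select conditions. */
typedef struct {
    int N, M;
    unsigned char *bit;   /* bit[i] is 1 iff row/column i is in the bicluster */
} Bicluster;

static void add_gene(Bicluster *b, int g)         { b->bit[g] = 1; }
static void add_condition(Bicluster *b, int c)    { b->bit[b->N + c] = 1; }
static void remove_condition(Bicluster *b, int c) { b->bit[b->N + c] = 0; }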


Page 94: ADCOM 2009 Conference Proceedings

C. Algorithm Description

Different algorithm design techniques are used to address the biclustering problem, including iterative row and column clustering combination, divide and conquer, greedy iterative search, and evolutionary or metaheuristic algorithms. Greedy iterative search methods are based on the idea of creating biclusters by adding or removing rows/columns from them, using a criterion that maximizes a local gain. In this work a greedy search method is used for finding δ-biclusters, and it is very fast compared to metaheuristic methods. The algorithm has two major phases. In the first phase, an initial set of seed biclusters is generated using the K-Means one-way clustering algorithm. The second phase enlarges the seeds by adding more rows and columns using a greedy search algorithm.

D. Seed Finding

A good seed of a bicluster is a small bicluster with a very low Hscore value; hence in the seed there exists the possibility of accommodating more genes and conditions within the given Hscore threshold. In this algorithm a simple seed finding technique is used [5]. For finding seeds, the K-Means clustering algorithm is used. K-Means is a partitional clustering algorithm; the generated clusters are disjoint, flat or non-hierarchical, and the number of clusters to generate must be specified as input. In the K-Means clustering algorithm the distance measure is a parameter that specifies how the distance between data points is measured; here the cosine angle distance is selected. First, gene and condition clusters are obtained from the K-Means one-way clustering algorithm. That is, the genes in the dataset are partitioned into n gene clusters. Those clusters having more than 10 genes are further divided into groups, based on cosine angle distance from the cluster centre, so that each group contains at most 10 genes. Similarly, the conditions in the dataset are partitioned into m clusters, and each cluster containing more than 5 conditions is further divided, based on cosine angle distance from the cluster centre, so that each group contains at most 5 conditions. This yields p gene clusters and q condition clusters. All combinations of these p gene clusters and q condition clusters are formed; the Hscore value of every combination is calculated, and those with Hscore values below a certain threshold are selected as seeds. Thus the gene expression data matrix is partitioned into fixed size, tightly co-regulated submatrices. The Yeast dataset is partitioned into 140 gene clusters and 3 condition clusters [4].

E. Seed growing phase

In the seed growing phase, a separate list is maintained for the conditions and genes not included in the bicluster. Each seed is enlarged separately by adding more genes and conditions; conditions are added first, followed by genes. In the modified greedy search algorithm the best element is selected from the gene list or condition list and added to the bicluster. The quality of an element is determined by the Hscore or MSR value of the bicluster after including the element: the element which results in the minimum Hscore value when added to the bicluster is considered the best element. It cannot be described as the element with the smallest incremental cost in Hscore, because adding some elements actually reduces the Hscore value. Seed growing starts from the condition list, followed by the gene list, until the Hscore value reaches the given threshold. This is a greedy method since the aim at each step is to select the next element that produces the bicluster with the minimum Hscore value. The algorithm is deterministic. A pseudo-code description of the modified greedy search algorithm is given below.

F. Modified Greedy Search Algorithm

Algorithm modifiedgreedy(seed, δ)
    bicluster := seed
    Calculate Column_List, the list of conditions not included in the bicluster
    While (MSR(bicluster) <= δ)
        No_elem_Col = size(Column_List)
        for i := 1 to No_elem_Col
            bicluster = bicluster + Column_List[i]
            Column_List_msr[i] = MSR(bicluster)
            Remove Column_List[i] from bicluster
        end(for)
        find the minimum value in Column_List_msr and the corresponding index K
        bicluster = bicluster + Column_List[K]
        delete Column_List[K] from Column_List
    end(while)
    Calculate Row_List, the list of genes not included in the bicluster
    While (MSR(bicluster) <= δ)
        No_elem_Row = size(Row_List)
        for i := 1 to No_elem_Row
            bicluster = bicluster + Row_List[i]
            Row_List_msr[i] = MSR(bicluster)
            Remove Row_List[i] from bicluster
        end(for)
        find the minimum value in Row_List_msr and the corresponding index J
        bicluster = bicluster + Row_List[J]


Page 95: ADCOM 2009 Conference Proceedings

        delete Row_List[J] from Row_List
    end(while)
end(modifiedgreedy)

G. Difference between Novel Greedy Search and Modified Greedy Search algorithms

In the Novel Greedy Search algorithm [6], node (condition or gene) addition is followed by node deletion if necessary: the added node is deleted if the Hscore value of the bicluster exceeds a certain threshold. The nodes are searched sequentially, so the node added may not be optimal in terms of Hscore value. In the Modified Greedy Search algorithm, by contrast, the node which results in the minimum Hscore value when joined with the bicluster is selected and added. Hence superior results can be obtained with the Modified Greedy Search algorithm. In Modified Greedy, before adding a node, the Hscore value of the bicluster combined with each single gene or condition not yet included in the bicluster must be calculated. Even though the Modified Greedy Search algorithm is computationally more expensive in terms of time, it is capable of obtaining larger biclusters with low Hscore values from gene expression data compared to the Novel Greedy Search algorithm.

III. EXPERIMENTAL RESULTS

A. Dataset used

The proposed algorithm is implemented in Matlab, and experiments are conducted on the Yeast Saccharomyces cerevisiae cell cycle expression dataset to assess the quality of the proposed method. The dataset is based on Tavazoie et al. [7]. It consists of 2884 genes and 17 conditions. The values in the expression dataset are integers in the range 0 to 600; there are 34 missing values, represented by -1. The dataset is obtained from http://arep.med.harvard.edu/biclustering.

B. Bicluster Plots

Figure 1 shows eight biclusters identified by the modified greedy search algorithm on the Yeast dataset. From the bicluster plots it can be noticed that the genes present a similar behavior under a set of conditions. Many of the biclusters found on the Yeast dataset contain all 17 conditions: out of the eight biclusters shown in Figure 1, seven contain all 17 conditions, and they differ in appearance. In short, the modified greedy search algorithm is well suited to identifying various biclusters with coherent values. Information about these biclusters is given in Table 1; all the biclusters have a mean squared residue less than 300. Details about 6 more biclusters obtained using the modified greedy algorithm, whose bicluster plots are not included in Figure 1, are given in the last six rows of Table 1. These biclusters are also taken into account while calculating the averages of mean squared residue, gene number, condition number, volume, etc. for the performance comparison of modified greedy with other biclustering algorithms.

[Figure 1 appears here: eight bicluster plots, each showing expression value (Y axis) against condition (X axis).]

Figure 1. Eight biclusters obtained from the Yeast expression data. The bicluster labels are (a), (b), (c), (d), (e), (f), (g) and (h) respectively. In the bicluster plots the X axis contains conditions and the Y axis contains expression values. Details about the biclusters can be obtained from Table 1 using the bicluster label. Only biclusters with different shapes are selected here.


Page 96: ADCOM 2009 Conference Proceedings

TABLE 1

INFORMATION ABOUT BICLUSTERS OF YEAST DATASET

Label   Rows   Columns   Bicl. Vol.   MSR
(a)     10     17        170          66.4403
(b)     17     17        289          99.3497
(c)     108    17        1836         194.5204
(d)     14     17        238          97.8389
(e)     107    17        1819         199.1857
(f)     33     17        561          99.9639
(g)     31     17        527          97.9121
(h)     1405   9         12645        299.8968
(p)     147    17        2499         200.2474
(q)     710    8         5680         199.9880
(r)     913    9         8217         256.1985
(s)     1163   8         9304         246.0037
(t)     1200   8         9600         249.9022
(u)     1355   9         12195        294.9206

In the above table the first column contains the label of each bicluster. The second and third columns report the number of rows (genes) and columns (conditions) of the bicluster respectively. The fourth column reports the volume of the bicluster and the last column contains the mean squared residue of the bicluster. The table also contains details of the biclusters not included in Figure 1, with labels (p), (q), (r), (s), (t) and (u).

IV. COMPARISON

A comparative summary of results on the Yeast data involving the performance of related algorithms is given in Table 2. The performance of the modified greedy algorithm is compared with that of Novel Greedy [6], SEBI [8], Cheng and Church's algorithm (CC) [3], the algorithm FLOC by Yang et al. [9], and DBF [10] on the Yeast dataset. SEBI (Sequential Evolutionary Biclustering) is based on evolutionary algorithms. The Cheng and Church algorithm deletes rows/columns from the gene expression data matrix to find a bicluster; it is based on a greedy strategy which removes rows and columns starting from the entire gene expression matrix. The bicluster model proposed by Cheng and Church was generalized by Yang et al. (2003) to incorporate null values and to remove random interference; they developed a probabilistic algorithm, FLOC, that can discover a set of possibly overlapping biclusters simultaneously. Zhang et al. presented DBF (Deterministic Biclustering with Frequent pattern mining), in which a set of good quality bicluster seeds is generated in the first phase based on frequent pattern mining, and in the second phase these biclusters are enlarged by adding more genes or conditions. For the modified greedy search algorithm presented here, the average number of conditions is better than that of CC, FLOC and DBF, while the average gene number, average volume and largest bicluster size are greater than those of all the other algorithms.

The average mean squared residue score is better than that of all the other algorithms listed in the table except DBF. As is clear from Table 2, the performance of modified greedy is better than novel greedy in terms of average mean squared residue, average gene number, average volume and largest bicluster size. In multi-objective evolutionary computation [11], the maximum number of conditions obtained is only 11 for the Yeast dataset, whereas this method finds biclusters with all 17 conditions. For the Yeast dataset, the maximum number of genes obtained by this algorithm over all 17 conditions is 147, with Hscore value 200.2474. The maximum reported in the literature so far is for multi-objective PSO [12], which obtained 141 genes for 17 conditions with Hscore value 203.25.

TABLE 2

PERFORMANCE COMPARISON BETWEEN MODIFIED GREEDY AND OTHER ALGORITHMS FOR YEAST DATASET

Algorithm          Avg. Residue   Avg. Gene Num   Avg. Cond. Num   Avg. Vol.   Largest Bicluster
Modified Greedy    185.88         515.21          13.36            4684.29     12645
Novel Greedy       199.78         94.75           14.75            1422.87     2112
CC                 204.29         166.71          12.09            1576.98     4485
SEBI               205.18         13.61           15.25            209.92      1394
FLOC               187.54         195.00          12.80            1825.78     2000
DBF                114.70         188.00          11.00            1627.20     4000

The above table compares the average mean squared residue, the average number of genes and conditions, the average volume and the largest bicluster size for the various algorithms. For the average mean squared residue field, lower values are better, whereas higher values are better for all the other fields.

V. CONCLUSION

As a powerful analytical tool, biclustering finds application in the gene expression analysis of cancerous data for the identification of co-regulated genes, gene functional annotation and sample classification. In this paper a new algorithm based on the greedy search method is introduced for finding biclusters in gene expression data. In the first step, the K-Means algorithm is used to group the rows and columns of the data matrix separately; these groups are then combined to produce submatrices, and those with Hscore values below a certain threshold are selected as seeds, which are small, tightly co-regulated submatrices. Then more genes and conditions are added to these seeds using a greedy search method in which the gene or condition giving the minimum Hscore value is added in each iteration, until the Hscore value of the bicluster reaches the given threshold. Based on the implementation on the Yeast dataset, a comparative assessment of the results is provided to demonstrate the effectiveness of the proposed method. In terms of the average mean squared residue score, average gene number, average volume and largest bicluster size, the biclusters obtained by this method are far better than


Page 97: ADCOM 2009 Conference Proceedings

those of many of the biclustering algorithms, and especially the Novel Greedy Search algorithm. Moreover, this method finds high quality biclusters that show strikingly similar up-regulations and down-regulations under a set of experimental conditions, which can be inspected visually using plots.

REFERENCES

[1] S. C. Madeira and A. L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey," IEEE Transactions on Computational Biology and Bioinformatics, 2004, pp. 24-45.

[2] J. A. Hartigan, "Direct Clustering of a Data Matrix," Journal of the American Statistical Association, vol. 67, no. 337, 1972, pp. 123-129.

[3] Yizong Cheng and George M. Church, "Biclustering of Expression Data," Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000, pp. 93-103.

[4] Anupam Chakraborty and Hitashyam Maka, "Biclustering of Gene Expression Data Using Genetic Algorithm," Proceedings of Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2005, pp. 1-8.

[5] A. Chakraborty and H. Maka, "Biclustering of Gene Expression Data by Simulated Annealing," HPCASIA '05, 2005, pp. 627-632.

[6] Shyama Das and Sumam Mary Idicula, "A Novel Approach in Greedy Search Algorithm for Biclustering Gene Expression Data," accepted for presentation at the International Conference on Bioinformatics, Computational and Systems Biology (ICBCSB), Singapore, Aug 27-29, 2009.

[7] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho and G. M. Church, "Systematic Determination of Genetic Network Architecture," Nat. Genet., vol. 22, no. 3, 1999, pp. 281-285.

[8] Federico Divina and Jesus S. Aguilar-Ruiz, "Biclustering of Expression Data with Evolutionary Computation," IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006, pp. 590-602.

[9] J. Yang, H. Wang, W. Wang and P. Yu, "Enhanced Biclustering on Expression Data," Proc. Third IEEE Symp. BioInformatics and BioEng. (BIBE'03), 2003, pp. 321-327.

[10] Z. Zhang, A. Teo, B. C. Ooi and K. L. Tan, "Mining Deterministic Biclusters in Gene Expression Data," Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), 2004, pp. 283-292.

[11] H. Banka and S. Mitra, "Multi-objective Evolutionary Biclustering of Gene Expression Data," Journal of Pattern Recognition, vol. 39, 2006, pp. 2464-2477.

[12] Junwan Liu, Zhoujun Li and Feifei Liu, "Multi-objective Particle Swarm Optimization Biclustering of Microarray Data," IEEE International Conference on Bioinformatics and Biomedicine, 2008, pp. 363-366.


Page 98: ADCOM 2009 Conference Proceedings

ADCOM 2009
AD-HOC NETWORKS

Session Papers:

1. Rajiv Saxena and Alok Singh, “Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved Heuristics”

2. Santosh Kulkarni and Prathima Agrawal, “Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents”

3. Natarajan Meghanathan and Ayomide Odunsi, “A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks”


Page 99: ADCOM 2009 Conference Proceedings

Solving Bounded-Diameter Minimum Spanning Tree Problem Using Improved Heuristics

Rajiv Saxena and Alok Singh
Department of Computer and Information Sciences
University of Hyderabad
Hyderabad 500046, Andhra Pradesh, India

[email protected], [email protected]

Abstract—The bounded-diameter minimum spanning tree (BDMST) problem is to find a minimum spanning tree of a given connected, undirected, edge-weighted graph G in which no path between any two vertices contains more than k edges. The problem is known to be NP-Hard for 4 ≤ k < n − 1, where n is the number of vertices in G. Therefore, we look for heuristics to find good approximate solutions. This work is an improvement over two existing greedy heuristics - Improved Randomized Greedy Heuristic (RGH-I) and Improved Centre Based Tree Construction (CBTC-I), themselves improved versions of the heuristics RGH and CBTC. The improvement is such that, given a bounded-diameter minimum spanning tree T as constructed by RGH or CBTC, the heuristic tries to further improve the cost of T by disconnecting a subtree of height h rooted at vertex v in T and attaching it to a vertex where the cost of attaching it is minimum, without violating the diameter constraint. On 25 Euclidean instances and 20 non-Euclidean instances of up to 1000 vertices, our approach shows substantial improvement over the solutions found by RGH-I and CBTC-I.

I. INTRODUCTION

The bounded-diameter minimum spanning tree problem is useful in many practical applications where a minimum spanning tree (MST) with a small diameter (length of the longest path in the tree) is required - such as in distributed mutual exclusion algorithms [8], in data compression for information retrieval [3] and in linear lightwave networks (LLNs) [2].

Let G = (V, E) be a connected undirected graph, where V denotes the set of vertices and E denotes the set of edges. Each edge e ∈ E has a non-negative weight w(e) associated with it. The BDMST problem seeks a minimum spanning tree T on G whose diameter does not exceed a given positive integer k ≥ 2. That is,

Minimize W(T) = Σ_{e∈T} w(e)

such that

diameter(T) ≤ k

It is to be noted that for diameter k = n − 1 the problem is simply to find an MST of G, for which polynomial time exact algorithms already exist (Prim's or Kruskal's algorithm). When k = 2, the BDMST takes the form of a star, which can be found by computing the cost of every possible star in O(n²) total time and then selecting the smallest-weight star as the solution. When k = 3, the BDMST takes the form of a dipolar star where every node must be of degree 1 except at most two nodes. To compute the BDMST in this case we consider each edge e of the graph one by one and make its endpoints the vertices whose degree can exceed 1. For each of the remaining n − 2 nodes, whose degree is 1, a comparison is required to determine to which of the two nodes of degree ≥ 2 it is to be connected. This has to be repeated for every edge in G, and then a spanning tree with the smallest cost is selected. In a complete graph with m edges, the total number of comparisons thus required is (n − 2)m, which is O(n³). Finally, when all the edge weights are the same, a minimum diameter spanning tree can be constructed using breadth first search in O(mn) time. In the remaining general cases the BDMST problem is NP-Hard [4].
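For example, the k = 2 case can be solved exactly by trying every vertex as the star centre; the C sketch below (ours, names illustrative) does this in O(n²) for a complete graph given as a weight matrix.

#include <float.h>

/* BDMST for diameter bound k = 2: the tree must be a star, so try
   every vertex as the centre and keep the cheapest star. */
int best_star_centre(int n, const double *w /* n x n weight matrix */,
                     double *cost_out)
{
    int best = -1;
    double best_cost = DBL_MAX;
    for (int c = 0; c < n; c++) {
        double cost = 0.0;
        for (int v = 0; v < n; v++)
            if (v != c) cost += w[c * n + v];   /* edge (c, v) */
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    *cost_out = best_cost;
    return best;   /* centre of the smallest-weight star */
}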

The diameter of a tree is the maximum eccentricity of its vertices, where the eccentricity of a vertex v is the length of the longest path from v to any other vertex. The vertex with minimum eccentricity defines the centre of the tree. Every tree has either one or two centres: if the diameter is even there is only one centre vertex, and if the diameter is odd then two connected vertices form the centre of the tree.

Abdalla et al. [1] presented a greedy heuristic called One-Time Tree Construction (OTTC) for solving the BDMST problem. OTTC is a modification of Prim's algorithm that starts with a vertex and grows the spanning tree by connecting the nearest unconnected vertex to the partially built spanning tree without violating the diameter constraint. It keeps track of the eccentricity of each vertex so that no vertex's eccentricity exceeds the diameter bound k.

Raidl and Julstrom [7] proposed a randomized greedy heuristic (RGH). Their algorithm starts by fixing the tree's centre. If k is even, a vertex v0 is chosen at random from the set of vertices V as the centre vertex; if the diameter is odd, another vertex v1 is also chosen at random, and v0, v1 form the centre of the tree. RGH maintains the diameter constraint by maintaining the depth of each vertex in the tree, i.e., the number of edges on the path from the tree's centre to the vertex. No vertex in T can have depth > ⌊k/2⌋. This is based on an important observation by Handler [5] that in a tree of diameter k, no vertex is more than ⌊k/2⌋ edges from the tree's centre. Thus, by fixing the tree's centre and using Handler's observation, RGH grows the spanning tree such that no vertex has depth greater than ⌊k/2⌋. On the test instances considered, RGH outperforms OTTC substantially.


Page 100: ADCOM 2009 Conference Proceedings

Julstrom [6] later proposed a fully greedy heuristic, a modified version of RGH called Centre Based Tree Construction (CBTC), for constructing a BDMST. Instead of selecting each next vertex at random, CBTC selects an unconnected vertex v ∉ T and connects it to a vertex already in T via an edge with the smallest weight. On 20 non-Euclidean instances whose edge weights were chosen at random, CBTC outperforms RGH, but on Euclidean instances RGH outperforms CBTC.

Singh and Gupta [9] presented improved versions of RGH and CBTC that further reduce the cost of the BDMST obtained by these heuristics. The improved version of RGH is RGH-I, and, correspondingly, that of CBTC is CBTC-I. Two improvements were proposed for each of these heuristics - one concerned with the efficiency of the algorithm and the other with the solution quality of RGH/CBTC. On Euclidean as well as non-Euclidean instances, RGH-I (CBTC-I) outperforms RGH (CBTC) substantially.

The rest of this paper is organised as follows. The next section (Section 2) describes our heuristic, an improvement on RGH-I and CBTC-I, for solving the BDMST problem. Section 3 presents details of the experiments and the comparative computational results. The last section (Section 4) outlines some conclusions.

II. IMPROVED GREEDY HEURISTIC (RGH+HT)

Our improved greedy heuristic is based on RGH [7] and its improved version RGH-I [9]. We call our heuristic RGH+HT. To better understand the improvement made by RGH+HT, let us first consider how RGH-I improves the cost of the BDMST constructed by RGH. RGH-I includes two improvements over RGH, as follows:

1) It checks, for each vertex v other than the centre vertex/vertices and the vertices connected immediately to the centre vertex, whether v can be connected to a better vertex whose depth is less than the depth of vertex v. In this case the subtree rooted at vertex v is disconnected from its current parent and connected to the newly selected vertex.

2) It uses a sorted cost matrix for two purposes: first, to select a better vertex for v as in the previous case, and second, where the greedy approach is applied in RGH, i.e., instead of searching the candidate set of vertices C for the lowest-weight edge when |C| > n/10, the first n/10 elements of the row of the sorted cost matrix corresponding to vertex v are searched.

RGH+HT retains the second improvement of RGH-I, as it improves the speed with which the search process and improvements are carried out. We have modified the first improvement of RGH-I. With RGH-I, a subtree rooted at vertex v can be connected only to a vertex whose depth is less than the depth of vertex v, and only if it offers a lower cost than its current parent vertex. However, such an operation may not always lead to the best possible improvement. A better improvement may be possible if we can connect the subtree to any vertex, as long as the cost is reduced and the feasibility of the resulting solution is maintained. This requires computing the height of the subtree rooted at vertex v before making this decision. This improvement is shown in Figure 1. As shown in the figure, RGH+HT allows a subtree rooted at vertex v to be disconnected from its current parent and attached to any vertex marked with a rectangle or circle (assuming all these vertices offer reduced cost). The vertices shown in rectangles are also candidate vertices in RGH-I, whereas RGH+HT allows the additional candidate vertices shown in circles. Thus RGH+HT presents a larger set of candidate vertices for improvement in comparison to RGH-I. It is to be noted that, because RGH+HT allows a vertex to be connected to any vertex whose depth is greater than its current depth, we must not connect it to any of its descendant vertices (shown in Figure 1 by an upside-down triangle), otherwise the tree will be disconnected. Pseudocode for RGH+HT based on these observations is given in Pseudocode 1.

[Figure 1 appears here. Legend: candidate vertices in RGH-I and RGH+HT (rectangles); candidate vertices only in RGH+HT (circles); vertices that violate the diameter constraint in RGH+HT; a descendant vertex, which cannot be a candidate in RGH+HT (upside-down triangle); the centre vertex; parent(v); v.]

Fig. 1. Possible candidate set of vertices to which the subtree rooted at vertex v can be attached.

RGH+HT makes multiple passes over the set of vertices until no further improvement is possible, as shown in line 2 of the pseudocode. The height of a subtree at step 11 can be computed by performing a breadth-first search starting at the root vertex of that subtree. While performing this breadth-first search we also keep track of all the descendant vertices of the root vertex, so that we do not attach v0 to any of its descendant vertices, as specified in step 15. The vertex that offers the maximum reduction in cost for v0 is selected from the row of the sorted cost matrix corresponding to v0 in step 13. Step 16 is the condition for maintaining the diameter of the tree T while connecting the subtree to its newly found better vertex. Once a better vertex has been found, we connect the subtree to it (step 18). After connecting the subtree to the new vertex, the depth of each vertex belonging to the subtree, including the root vertex, is updated (step 20).

The basic idea behind RGH+HT was also mentioned in [9]. However, it was not implemented and tested there, as the main purpose of [9] was to use the improved RGH repeatedly within a genetic algorithm, where it could have slowed down the overall algorithm significantly.


Page 101: ADCOM 2009 Conference Proceedings

Pseudocode 1 RGH+HT
Require: BDMST T as computed by RGH
Ensure: diam(T) ≤ k
 1: moreimp ← true                        // while further improvements are possible
 2: while (moreimp) do
 3:    moreimp ← false
 4:    U ← V − c0                         // vertices in U without centres c0 or c1
 5:    if odd(k) then
 6:       U ← U − c1                      // for odd k there is a second centre c1
 7:    end if
 8:    while (U ≠ ∅) do
 9:       v0 ← random(U)
10:       U ← U − v0
11:       ht ← height_of_subtree(v0)
12:       pv0 ← parent(v0)
13:       minvtx ← next_min_cost(v0)
14:       while (minvtx ≠ pv0) do
15:          if (minvtx ∉ desc(v0)) then
16:             if ((ht + 1 + depth[minvtx]) ≤ ⌊k/2⌋) then
17:                moreimp ← true
18:                T ← T − (pv0, v0) + (minvtx, v0)
19:                parent[v0] ← minvtx
20:                for each vertex w in subtree rooted at vertex v0 do
21:                   depth[w] ← depth[parent(w)] + 1
22:                end for
23:                break
24:             end if
25:          end if
26:          minvtx ← next_min_cost(v0)
27:       end while
28:    end while
29: end while
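As an illustration of steps 11 and 15, the height of the subtree rooted at v0 and the set of its descendants can be gathered in one breadth-first search. The following C sketch is ours, not the authors' code, and assumes a child adjacency-list representation of the tree (children[v], nchildren[v]); the caller zeroes is_desc beforehand.

#include <stdlib.h>

/* Compute the height of the subtree rooted at v0 by breadth-first
   search, marking every descendant in is_desc[] on the way so that
   step 15 can reject descendants as candidate parents. */
int height_of_subtree(int v0, int **children, const int *nchildren,
                      int n, char *is_desc)
{
    int *queue = malloc(n * sizeof(int));
    int *depth = malloc(n * sizeof(int));
    int head = 0, tail = 0, height = 0;

    queue[tail++] = v0;
    depth[v0] = 0;
    while (head < tail) {
        int v = queue[head++];
        if (depth[v] > height)
            height = depth[v];
        for (int i = 0; i < nchildren[v]; i++) {
            int w = children[v][i];
            is_desc[w] = 1;            /* w may never become v0's new parent */
            depth[w] = depth[v] + 1;
            queue[tail++] = w;
        }
    }
    free(queue);
    free(depth);
    return height;
}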

The improvement for CBTC [6] is the same as specified in the pseudocode for RGH+HT. Once the tree constructed by CBTC is known, we perform the same steps as in RGH+HT. CBTC+HT is CBTC with these improvements.

III. COMPUTATIONAL RESULTS

A. Experimental Setup

All our heuristics are coded in C and executed on an Intel Core 2 Duo 3.00 GHz CPU with 2 GB of RAM in a Linux environment (Open SuSE 10.3). CBTC, CBTC-I and CBTC+HT were executed n times on each instance of size n, starting from each vertex in turn. RGH, RGH-I and RGH+HT were executed n times, starting from a randomly chosen vertex each time.

B. Test Instances Description

We have compared the performance of the various heuristics on Euclidean as well as non-Euclidean instances. The problem instances used in our experiments are the same standard BDMST benchmark instances as used in [6] and [9]. There are 45 instances in all. Twenty-five of these instances are Euclidean, with five instances for each value of n ∈ {50, 100, 250, 500, 1000}. These instances can be downloaded from Beasley's OR-library (www.people.brunel.ac.uk/∼mastjjb/jeb/info.html), where they are listed as instances of the Euclidean Steiner tree problem. Euclidean instances consist of n points randomly chosen in the unit square. These points are treated as the vertices of a complete graph whose edge weights are the Euclidean distances between the points. The library contains 15 instances for each n, and the first 5 of them are used for the BDMST problem. The diameter bound k is taken to be 5, 10, 15, 20 and 25 for n = 50, 100, 250, 500 and 1000 respectively.
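For illustration, such an instance can be formed as follows: n points drawn uniformly in the unit square, with the complete graph's edge weights set to the pairwise Euclidean distances. The C sketch below is our own; the benchmark instances themselves come from the OR-library.

#include <math.h>
#include <stdlib.h>

/* Fill the cost matrix of a Euclidean instance: the caller supplies
   arrays x[n], y[n] and an n-by-n matrix cost. */
void build_euclidean_instance(int n, double *x, double *y, double **cost)
{
    for (int i = 0; i < n; i++) {
        x[i] = (double)rand() / RAND_MAX;   /* random point in the unit square */
        y[i] = (double)rand() / RAND_MAX;
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost[i][j] = hypot(x[i] - x[j], y[i] - y[j]);
}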

Twenty more instances, five each for n = 100, 250, 500 and 1000 vertices, were created by Julstrom [6]. These are non-Euclidean or random instances, which are complete graphs with edge weights chosen at random from the interval [0.01, 0.99]. The diameter bound is taken to be 10, 15, 20 and 25 for n = 100, 250, 500 and 1000 vertices respectively.

C. Results of Experiments

Tables I and II report the results of the various heuristics on Euclidean instances, whereas Tables III and IV do the same on non-Euclidean instances. For each instance these tables list n, the diameter bound k, the best and average solutions, and the standard deviation (SD) of the solutions after running the heuristics n times on each instance. The best results over the three heuristics are printed in bold. It is clear from these tables that:

1) On Euclidean instances, RGH+HT obtained the best solutions (Table I). RGH+HT outperforms RGH-I, which so far has given the most promising results on Euclidean instances, both in terms of best cost and average cost for all the instances (except on the instances of size 50, where the best costs of RGH-I and RGH+HT are the same; however, RGH+HT gives better average values).

2) On non-Euclidean instances, it is CBTC+HT (Table IV) that outperforms all other heuristics. On all the instances it gives better results not only in terms of best cost but also in terms of average cost.

3) On Euclidean instances, CBTC+HT (Table II) shows substantial improvements in comparison to CBTC-I in terms of best values for all the instances with n = 1000.


Page 102: ADCOM 2009 Conference Proceedings

TABLE I
RESULTS OF RGH, RGH-I AND RGH+HT ON 25 EUCLIDEAN INSTANCES HAVING 50, 100, 250, 500 AND 1000 VERTICES

   n   k  No |  RGH: Best  Avg.   SD | RGH-I: Best  Avg.   SD | RGH+HT: Best  Avg.   SD
  50   5   1 |  9.34 12.82 2.48 |  8.53 12.57 2.13 |  8.53 12.56 2.14
           2 |  8.98 11.56 1.56 |  8.74 11.39 1.48 |  8.74 11.39 1.48
           3 |  8.76 11.54 1.90 |  8.28 10.66 1.21 |  8.28 10.66 1.21
           4 |  7.47 10.57 1.66 |  7.54  9.83 1.51 |  7.54  9.80 1.52
           5 |  8.79 10.91 1.61 |  8.59 10.52 1.48 |  8.59 10.49 1.48
 100  10   1 |  9.35 10.77 0.81 |  9.16 10.21 0.75 |  8.88  9.96 0.72
           2 |  9.41 10.80 0.81 |  9.09 10.45 1.00 |  8.68 10.16 1.00
           3 |  9.75 11.25 0.90 |  9.39 10.73 0.70 |  9.25 10.46 0.71
           4 |  9.55 11.03 0.89 |  9.14 10.57 0.88 |  8.95 10.35 0.85
           5 |  9.78 11.36 1.06 |  9.61 10.95 0.87 |  9.09 10.65 0.93
 250  15   1 | 15.14 16.51 0.69 | 14.61 15.89 0.45 | 14.04 15.08 0.49
           2 | 15.20 16.33 0.67 | 14.82 15.73 0.42 | 14.11 14.99 0.48
           3 | 15.08 16.19 0.56 | 14.75 15.68 0.44 | 13.80 14.86 0.47
           4 | 15.49 16.77 0.62 | 15.14 16.15 0.43 | 14.24 15.38 0.48
           5 | 15.42 16.53 0.58 | 14.99 15.91 0.45 | 14.11 15.10 0.48
 500  20   1 | 21.72 22.86 0.51 | 21.10 22.07 0.39 | 19.39 20.40 0.43
           2 | 21.46 22.52 0.46 | 20.81 21.78 0.38 | 19.09 20.17 0.42
           3 | 21.51 22.78 0.50 | 20.89 22.03 0.37 | 19.42 20.41 0.41
           4 | 21.82 22.85 0.47 | 21.15 22.10 0.39 | 19.41 20.46 0.46
           5 | 21.37 22.52 0.51 | 20.84 21.75 0.39 | 18.86 20.05 0.44
1000  25   1 | 30.97 32.19 0.41 | 29.93 31.17 0.40 | 27.22 28.26 0.43
           2 | 30.90 32.05 0.42 | 29.85 31.04 0.39 | 27.08 28.12 0.41
           3 | 30.69 31.77 0.42 | 29.36 30.77 0.38 | 26.80 27.83 0.40
           4 | 30.93 32.18 0.43 | 29.99 31.13 0.38 | 27.05 28.21 0.40
           5 | 30.85 31.93 0.42 | 29.81 30.89 0.39 | 26.50 27.91 0.42

TABLE II
RESULTS OF CBTC, CBTC-I AND CBTC+HT ON 25 EUCLIDEAN INSTANCES HAVING 50, 100, 250, 500 AND 1000 VERTICES

   n   k  No | CBTC: Best   Avg.    SD | CBTC-I: Best   Avg.    SD | CBTC+HT: Best   Avg.    SD
  50   5   1 |  13.84  21.86  5.27 |  13.28  21.80  5.33 |  13.28  21.80  5.33
           2 |  13.32  19.29  3.68 |  13.19  19.23  3.73 |  13.19  19.23  3.73
           3 |  11.62  19.10  3.79 |  11.59  19.06  3.82 |  11.59  19.06  3.82
           4 |  11.04  16.86  3.64 |  10.78  16.79  3.65 |  10.78  16.79  3.65
           5 |  12.31  18.36  3.25 |  12.31  18.30  3.28 |  12.31  18.30  3.28
 100  10   1 |  17.50  28.80  7.02 |  17.35  28.66  7.06 |  17.34  28.60  7.09
           2 |  15.02  26.95  6.16 |  14.17  26.77  6.24 |  14.17  26.56  6.33
           3 |  18.37  29.66  7.62 |  17.70  29.48  7.71 |  15.75  29.28  7.86
           4 |  15.11  28.77  7.81 |  14.92  28.65  7.87 |  14.90  28.48  7.93
           5 |  15.73  29.46  7.72 |  14.78  29.30  7.83 |  12.82  29.18  7.88
 250  15   1 |  41.61  72.35 19.86 |  39.70  72.07 19.92 |  37.64  71.63 20.16
           2 |  32.43  75.52 19.44 |  31.59  75.35 19.49 |  28.90  74.73 19.46
           3 |  32.65  70.60 18.09 |  32.01  70.32 18.22 |  27.31  69.67 18.66
           4 |  32.29  76.23 20.07 |  31.78  76.09 20.15 |  29.42  75.44 19.86
           5 |  35.90  71.56 17.90 |  35.79  71.40 17.97 |  35.66  70.66 17.85
 500  20   1 |  80.76 150.68 39.02 |  72.07 150.46 41.40 |  48.18 148.07 40.65
           2 |  70.44 148.75 39.89 |  70.17 148.54 39.96 |  60.15 146.37 40.38
           3 |  69.37 153.17 39.02 |  68.83 152.96 39.11 |  45.49 149.61 40.86
           4 |  63.88 150.98 39.18 |  63.17 150.79 39.24 |  63.00 148.34 40.22
           5 |  72.36 150.68 41.33 |  72.07 150.46 41.40 |  41.77 146.80 42.73
1000  25   1 | 173.23 327.50 82.96 | 172.62 327.30 83.02 |  90.01 321.07 84.90
           2 | 173.85 323.72 81.34 | 173.06 323.50 81.41 |  95.83 318.59 83.48
           3 | 175.80 321.25 83.04 | 175.47 321.04 83.10 |  94.02 312.70 85.72
           4 | 163.89 323.45 80.13 | 163.43 323.23 80.22 |  81.39 317.02 83.17
           5 | 149.36 325.96 78.34 | 148.37 325.76 78.41 |  70.55 318.52 81.37


Page 103: ADCOM 2009 Conference Proceedings

TABLE III
RESULTS OF RGH, RGH-I AND RGH+HT ON 20 NON-EUCLIDEAN INSTANCES HAVING 100, 250, 500 AND 1000 VERTICES

   n   k  No |  RGH: Best  Avg.   SD | RGH-I: Best  Avg.   SD | RGH+HT: Best  Avg.   SD
 100  10   1 |  3.96  5.47 0.60 |  3.30  4.39 0.49 |  3.02  3.89 0.47
           2 |  4.01  5.41 0.59 |  3.40  4.30 0.51 |  2.72  3.85 0.54
           3 |  4.50  5.68 0.57 |  3.43  4.59 0.57 |  2.78  4.01 0.51
           4 |  4.16  5.20 0.53 |  2.95  4.24 0.47 |  2.57  3.80 0.48
           5 |  4.21  5.50 0.58 |  3.56  4.53 0.44 |  3.01  4.01 0.48
 250  15   1 |  6.17  7.73 0.59 |  5.12  6.43 0.52 |  4.26  5.44 0.48
           2 |  6.27  7.64 0.56 |  4.73  6.31 0.49 |  4.31  5.45 0.48
           3 |  6.35  7.62 0.55 |  5.07  6.37 0.49 |  4.30  5.45 0.47
           4 |  6.21  7.63 0.60 |  5.15  6.43 0.55 |  4.51  5.49 0.45
           5 |  6.51  7.81 0.52 |  5.28  6.60 0.53 |  4.66  5.69 0.45
 500  20   1 |  9.36 10.72 0.55 |  7.50  8.86 0.55 |  6.54  7.44 0.42
           2 |  9.27 10.79 0.56 |  7.71  8.94 0.53 |  6.70  7.53 0.42
           3 |  9.16 10.70 0.59 |  7.36  8.89 0.54 |  6.55  7.46 0.42
           4 |  9.13 10.69 0.60 |  7.66  8.91 0.52 |  6.68  7.50 0.42
           5 |  9.18 10.66 0.54 |  7.46  8.89 0.57 |  6.67  7.49 0.42
1000  25   1 | 14.83 16.36 0.58 | 12.80 14.33 0.57 | 11.69 12.57 0.40
           2 | 14.93 16.36 0.57 | 12.81 14.34 0.57 | 11.64 12.61 0.41
           3 | 14.90 16.40 0.58 | 12.89 14.37 0.58 | 11.70 12.61 0.39
           4 | 14.52 16.29 0.57 | 12.83 14.26 0.55 | 11.58 12.53 0.41
           5 | 14.80 16.43 0.59 | 12.93 14.43 0.53 | 11.78 12.68 0.41

TABLE IV
RESULTS OF CBTC, CBTC-I AND CBTC+HT ON 20 NON-EUCLIDEAN INSTANCES HAVING 100, 250, 500 AND 1000 VERTICES

   n   k  No | CBTC: Best  Avg.   SD | CBTC-I: Best  Avg.   SD | CBTC+HT: Best  Avg.   SD
 100  10   1 |  2.58  3.23 0.35 |  2.53  3.06 0.29 |  2.53  2.91 0.25
           2 |  2.55  3.09 0.31 |  2.43  2.93 0.28 |  2.36  2.78 0.26
           3 |  2.66  3.48 0.44 |  2.61  3.32 0.39 |  2.49  3.16 0.36
           4 |  2.45  3.03 0.29 |  2.38  2.87 0.25 |  2.37  2.74 0.22
           5 |  2.71  3.34 0.39 |  2.63  3.17 0.33 |  2.58  3.00 0.27
 250  15   1 |  3.96  4.40 0.22 |  3.93  4.30 0.19 |  3.88  4.15 0.14
           2 |  4.09  4.45 0.24 |  4.02  4.31 0.20 |  3.97  4.16 0.13
           3 |  3.87  4.33 0.21 |  3.83  4.21 0.17 |  3.82  4.08 0.13
           4 |  3.92  4.40 0.24 |  3.88  4.29 0.20 |  3.85  4.15 0.15
           5 |  4.16  4.63 0.26 |  4.11  4.48 0.21 |  4.05  4.31 0.15
 500  20   1 |  6.34  6.70 0.16 |  6.31  6.61 0.14 |  6.29  6.48 0.96
           2 |  6.47  6.82 0.17 |  6.43  6.72 0.14 |  6.38  6.58 0.97
           3 |  6.34  6.66 0.16 |  6.30  6.57 0.13 |  6.24  6.44 0.10
           4 |  6.39  6.77 0.15 |  6.36  6.68 0.13 |  6.31  6.54 0.09
           5 |  6.41  6.75 0.17 |  6.37  6.65 0.14 |  6.30  6.52 0.10
1000  25   1 | 11.37 11.66 0.15 | 11.33 11.57 0.12 | 11.29 11.45 0.08
           2 | 11.40 11.68 0.14 | 11.38 11.60 0.12 | 11.32 11.48 0.08
           3 | 11.42 11.69 0.15 | 11.38 11.61 0.12 | 11.35 11.49 0.08
           4 | 11.30 11.58 0.14 | 11.26 11.50 0.11 | 11.22 11.39 0.08
           5 | 11.47 11.76 0.13 | 11.43 11.68 0.11 | 11.39 11.56 0.08

IV. CONCLUSIONS

We have improved the results of the RGH-I and CBTC-I heuristics on both Euclidean and non-Euclidean instances of the BDMST problem. The improved heuristics RGH+HT and CBTC+HT take into consideration the height of a subtree before connecting it to some other vertex of the tree, thus allowing more candidate vertices for improvement than RGH-I and CBTC-I. After attaching a subtree to a new, better vertex, they update the depth of all the vertices in the subtree. This, along with multiple passes over the list of vertices, results in better solution values for the BDMST problem.

As future work, we plan to develop hybrid approaches for the BDMST problem combining RGH+HT with metaheuristics such as the genetic algorithm of [9].


Page 104: ADCOM 2009 Conference Proceedings

REFERENCES

[1] A. Abdalla, N. Deo, and P. Gupta, "Random-tree diameter and the diameter constrained MST," Congressus Numerantium, vol. 144, 2000, pp. 161-182.
[2] K. Bala, K. Petropoulos and T.E. Stern, "Multicasting in a linear lightwave network," Proceedings of IEEE INFOCOM '93, pp. 1350-1358.
[3] A. Bookstein and S.T. Klein, "Compression of correlated bit-vectors," Information Systems, vol. 16, 1990, pp. 387-400.
[4] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
[5] G.Y. Handler, "Minimax location of a facility in an undirected graph," Transportation Science, vol. 7, 1978, pp. 287-293.
[6] B.A. Julstrom, "Greedy heuristics for the bounded-diameter minimum spanning tree problem," ACM Journal of Experimental Algorithmics, vol. 14, 2009, pp. 1-14.
[7] G.R. Raidl and B.A. Julstrom, "Greedy heuristics and an evolutionary algorithm for the bounded-diameter minimum spanning tree problem," Proceedings of the ACM Symposium on Applied Computing, 2003, pp. 747-752.
[8] K. Raymond, "A tree-based algorithm for distributed mutual exclusion," ACM Transactions on Computer Systems, vol. 7, 1989, pp. 61-77.
[9] A. Singh and A.K. Gupta, "Improved heuristics for the bounded-diameter minimum spanning tree problem," Soft Computing, vol. 11, 2007, pp. 911-921.


Page 105: ADCOM 2009 Conference Proceedings

Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents

Santosh Kulkarni
Auburn University
Computer Science & Software Engineering
Auburn, AL 36849-5347, USA
[email protected]

Prathima Agrawal
Auburn University
Electrical & Computer Engineering
Auburn, AL 36849-5347, USA
[email protected]

Abstract

Mobile applications continue to soar in popularity as they provide their users the convenience of accessing services from anywhere, at any time. The underlying computing devices for such applications, however, are often limited in their battery and processing powers, primarily due to size and weight restrictions. Running complex applications on such resource-limited devices has always been a challenge. In the work presented here, we address this challenge by proposing a cooperative paradigm for ad-hoc computation in wireless networks. In this model, a set of heterogeneous computing devices cooperate to dynamically form a distributed computation system. Whenever a resource-limited computing device in such a system has a resource-consuming application to be run, it uses the resources of other devices to overcome its own limitations. The proposed paradigm is based on the concept of execution migration and relies on migratory execution units called Ant Agents to seek spare resources available in the network. Simulation results for the proposed model demonstrate that cooperative ad-hoc computation is indeed beneficial for resource-constrained wireless devices.

1. Introduction

The shrinking size and increasing density of wireless devices have profound implications for the future of wireless communications. Today's laptops and wireless phones may soon be outnumbered by ubiquitous computing devices such as smart dust [24], micro sensors and micro robots [17]. In fact, there are already some organizations that propose embedding communication systems into cars, allowing cars to interact with other cars or infrastructures over a Wireless Local Area Network [1]. Therefore, future generations of wireless networks are expected to have a huge number of heterogeneous mobile computing devices that are dynamically interconnected over wireless links.

Because of their size limitations, however, a large number of these devices are likely to have severe restrictions on their processing power, storage space, available memory as well as battery capacity. Unfortunately, such rigorous resource limitations in mobile computing devices preclude the full utilization of mobile applications in real-life scenarios [2]. To overcome this problem, we propose a new distributed computing model for wireless ad-hoc networks called the Ad-hoc Cooperative Computation (ACC) model. ACC is a computing model in which a set of heterogeneous computing systems dynamically form a cooperative system. Whenever a resource-limited computing device in such a system has a resource-consuming application to be run, it uses the resources of other devices to surmount its own limitations. The following scenario, presented in [2], makes a strong case for our proposed computing model.

Triage is a process that is executed in hospital emergency rooms to sort injured people into groups based on their need for immediate treatment. The same process is actually needed in disaster areas, which usually have a large number of casualties. In such scenarios, quickly identifying the severely injured has proved to be an effective technique in saving lives and controlling acute injuries. But unfortunately, both the technical and human resources that are readily available in emergency rooms are scarce in disaster areas. Mobile computing is now being proposed as a solution to complement, automate and expedite the triage process in disaster fields. First, low-power vital sign sensors are attached to each patient in the field. These sensors send medical data about the patient to nearby first responders who are provided with mobile computing devices and medical applications. These medical applications then process and analyze the received data in order to make a decision on who needs the most immediate treatment [14]. Running such applications on resource-limited mobile computing devices is a real challenge. Such devices may not have enough energy to run the application and/or may not have enough processing power to make a timely decision. Alleviating the effect of these limitations, which is the main objective


Page 106: ADCOM 2009 Conference Proceedings

of our proposed computing model, would undoubtedly save more lives in future emergencies.

1.1. Challenges for Cooperation

Wireless ad-hoc networks pose a unique set of challenges which make traditional distributed computing models difficult, if not impossible, to employ in alleviating the effects of resource limitations. Some of the identified challenges are:

• Network size - The number of devices working together to achieve a common goal will be orders of magnitude greater than those seen in traditional distributed systems.

• Heterogeneous architecture - The devices are all likely to have different hardware architectures, as they are typically tailored to perform a specific task within the network.

• Unreliable links - The links in the network are inherently fragile, with device and connection failures being a norm rather than an exception.

• Limited processing power - As mobile devices are likely to have size and weight restrictions, they are limited in their processing power and battery capacity.

• Dynamic topology - The availability of the devices may vary greatly with time, with devices becoming unreachable due to mobility or due to depletion of energy.

• Limited reach - Because of the nature of wireless communication, devices can communicate directly only with those devices that are within their transmission range.

1.2. Cooperative Setup

Applications designed to suit the ACC model will typically target specific properties within the network and not individual devices. Such targeted properties could include specific data and/or resources that the application is interested in. From the application's point of view, devices with the same properties are interchangeable. Thus, fixed naming schemes such as IP addressing are inappropriate in most situations. As discussed in [11], a naming scheme based on the content or property of a device is more appropriate for wireless ad-hoc networks.

Due to network volatility and the dynamic binding of names to devices, distributed computing based on execution migration is more suitable for wireless ad-hoc networks than distributed computing based on data migration (message passing) [4]. Hence, the system architecture for ACC is based

on the concept of execution migration. Applications that are in compliance with the ACC model consist of migratory execution units called Ant Agents which work together to accomplish a common goal. Ant Agents (AAs), similar to Mobile Agents [13], are collections of code and data blocks. They migrate through the network, executing on each device in their path, foraging for devices of interest or devices with the desired properties.

The agents are also self-routing, namely, they are responsible for determining their own paths through the network. In the proposed ACC model, AAs forage the network for devices of interest using ant-like routing algorithms [3], [5], [8], [19]. Such routing algorithms, based on the behavior of social insects in nature, are known to result in optimal routes between the source and the destination [20]. For their part, devices in the network support AAs by providing:

• A name-based memory system, and

• An architecturally independent environment for the receipt and execution of Ant Agents

To validate the proposed computing model, we have developed a simulator that executes AAs, allowing us to evaluate both the execution and communication time of a distributed application. In this simulator we execute applications that are modeled as per the Bag-of-Tasks paradigm of distributed computing. Simulation results show that our proposed computation model is able to significantly improve the execution times of mobile applications on resource-constrained devices.

The rest of this paper is organized as follows. The next section describes our proposed Ad-hoc Cooperative Computation model. Section 3 presents the system architecture that supports the proposed model. In Section 4 we discuss the details of Ant Agents, while in Section 5 we discuss the application paradigms implemented using our proposed model. Section 6 discusses related work and Section 7 concludes the paper.

2. Ad-hoc Cooperative Computation Model

To exploit the raw computing power of large-scale, heterogeneous, wireless ad-hoc networks, we propose a distributed computing model called the Ad-hoc Cooperative Computation (ACC) model. This model is based on the social behavior of ants, which work together in groups to execute tasks that are beyond the abilities of a single member. The ACC model consists of distributed applications that are defined as a dynamic collection of Ant Agents (AAs) which cooperate amongst themselves to collectively achieve a common objective. The execution of an AA can be described in two phases: a forage-and-migrate phase followed by a computation phase. The AA execution performed at


Page 107: ADCOM 2009 Conference Proceedings

each step may differ based on the properties of its hosting device. On devices that meet the application's targeted properties (devices of interest), an AA may advance its execution state, while on other devices it only executes its routing algorithm. Like any mobile agent, an AA carries along with it its mobile data, mobile code, as well as a lightweight execution state.

Devices in the network support the reception and execution of AAs by providing an architecturally independent programming environment (e.g., a Virtual Machine [9]) as well as a name-based memory system (e.g., Tag Space [22]). The AAs, along with the system support provided by the devices in the network, form the ACC infrastructure which allows execution of distributed applications over ad-hoc wireless networks.

Our proposed computational model allows the user to execute distributed tasks in ad-hoc networks by simply injecting the corresponding AAs into the network. To do this, the user need not have any prior knowledge about either the scale or the topography of the network, nor the specific functionality of the devices involved. Additionally, making the AAs intelligent eliminates the issue of implementing new protocols on all the devices in the network; a task which is difficult or even impossible to do using current approaches [11].

Because of their intelligence, AAs are reasonably resilient to network volatility. When certain devices become unavailable due to their mobility or energy depletion, AAs are able to adapt well by either finding a new path to their destination or by foraging for other devices in the network that meet the properties targeted by the application.

Figure 1. Generating prime numbers

Let us consider two example applications that demonstrate the computation and communication aspects of the proposed ACC model. Figure 1 depicts an application where the joint task is to generate all prime numbers less

than some limit MAX, say 40,000. Since prime number generation is a processor-intensive task, the resource-limited source device, depicted as a black circle, injects AAs into the network seeking cooperation from other devices in the network. Because the originating device is processor limited, the injected AAs are initialized to forage for computing cycles within the network. Once initialized, each AA forages the network for available computing resources. When an AA finds a device with enough spare CPU cycles, depicted as gray circles, it proceeds to calculate the set of primes within its given range. Upon completion, each AA reports its results back to the originating device by tracing back its migratory route.
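The per-agent computation in this example amounts to testing one subrange of [2, MAX). The following minimal C sketch is our illustration, not the paper's code, of the task an AA might run on a host with spare cycles; in a real run the primes themselves would travel back in the AA's data section.

#include <stdio.h>

/* Trial-division primality test; adequate for the MAX = 40,000 example. */
static int is_prime(long n)
{
    if (n < 2) return 0;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(void)
{
    long lo = 10000, hi = 20000;   /* this agent's assumed share of [2, MAX) */
    int count = 0;
    for (long n = lo; n < hi; n++)
        if (is_prime(n)) count++;
    printf("primes in [%ld, %ld): %d\n", lo, hi, count);
    return 0;
}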

Figure 2. 3-D modeling using Computer-aided design

Next, let us consider a Computer-aided design application which is required to generate a 3-D model, given its top, front and profile views. Since the originating device, depicted as a black circle in Figure 2, is missing the required views, it injects an AA into the network seeking cooperation from devices which have the required data. Because the originating device is data limited, the injected AA is initialized to forage for specific data within the network. Once initialized, the AA forages the network for the three views of the object in question. When an AA finds a device with the relevant data, depicted as gray circles, it proceeds to process the available view. Finally, having processed all three views, the agent reports the results back to the originating device by tracing back its migratory path.

For applications that deal with large amounts of data, moving the execution to the source of the data whenever possible will improve the overall performance of the distributed system. For example, when using an AA for object recognition, performing the image analysis on the device that acquired the image whenever possible, rather than transferring the image (or sequence of images) over the network, would result in improved response time and bandwidth usage while reducing the overall energy consumed. Similarly, caching frequently used code blocks on devices


Page 108: ADCOM 2009 Conference Proceedings

that regularly host AAs can also limit the impact of the code transfer occurring with every injected AA.

Security is an important issue in any cooperative computing model. Addressing it in our proposed model would mean protecting AAs against malicious devices as well as protecting devices against malicious AAs. Although realizing this requires a comprehensive security framework to be in place, we limit the current architecture to simple admission control using authentication mechanisms based on digital signatures.

3. System Architecture

Considering the heterogeneous nature of the network, the system architecture aims to place as much intelligence as possible in Ant Agents and to keep the support required from devices in the network to a minimum.

Figure 3. ACC System Architecture

Figure 3 shows the system architecture support needed for the proposed ACC model. As depicted, the Security Manager first verifies the credentials of all incoming Ant Agents. Next, the Resource Manager inspects the AA's Resource Table header field to check if the listed resource estimates can be satisfied. AAs whose resource estimates can be satisfied are then queued up for execution at the Virtual Machine. The Tag Space represents the name-based memory region that stores data objects persistent across AA executions. The Virtual Machine acts as a hardware abstraction layer for loading, scheduling and executing tasks generated by incoming AAs. Post execution, the AAs are injected back into the network to allow them to migrate to their next destination.

3.1. Security Manager

To prevent excessive use of its resources, a device needs to perform some form of admission control. In the proposed architecture, the Security Manager component performs this role. It is primarily responsible for receiving incoming AAs and passing them on to the Resource Manager, subject to their approval by various admission restrictions.

3.2. Resource Manager

Each AA lists its estimated resource requirements in a Resource Table located in the AA header. The Resource Manager is responsible for receiving authenticated AAs from the Security Manager and storing them in the AA Queue, subject to the requested resource constraints being satisfied. A Resource Manager also checks to see whether the code section of the incoming AA is already cached locally.

3.3. Virtual Machine

The Virtual Machine is a hardware abstraction layer for the execution of AAs across all heterogeneous hardware platforms present in the ad-hoc wireless network. Examples include the Java Virtual Machine, K Virtual Machine, etc.

3.4. Tag Space

A Tag Space consists of a limited number of tags that are persistent across AA executions. Figure 4 from [4] illustrates the structure of a tag. It consists of an identifier, a digital signature, lifetime information and data. The identifier represents the name of the tag. The access of AAs to tags is restricted based on the digital signature. The tag lifetime specifies the time at which the tag will be reclaimed by the device from the Tag Space.

Figure 4. Tag Structure
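For concreteness, the tag fields just described map naturally onto a record type. The C sketch below is an assumption of ours; field names and sizes are illustrative, not taken from the paper.

#include <stddef.h>
#include <time.h>

/* Hypothetical layout of one Tag Space entry, following Figure 4. */
struct tag {
    char          id[32];          /* identifier: the name of the tag        */
    unsigned char signature[64];   /* restricts which AAs may access it      */
    time_t        expires_at;      /* lifetime: when the device reclaims it  */
    size_t        data_len;
    unsigned char data[256];       /* payload, e.g. pheromone trail entries  */
};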

Tags can be used for:

• Naming: AAs name the devices of interest using tag identifiers.

• Data Storage: An AA can store data in the network by creating its own tags.

• Routing: AAs use tags to create a pheromone trail of visited devices in the network, by caching the relevant IDs in the data portion of such tags.

• Synchronization: An AA can block on a specific tag pending a write of the tag by another AA. Once this tag is written, all AAs blocked on it will be woken up and made ready for execution. This way AAs can synchronize among themselves.

• Interaction with the host device: An AA can interact with the host OS and I/O system using I/O tags.


Page 109: ADCOM 2009 Conference Proceedings

4. Ant Agents

AAs, like mobile agents, are migratory execution units consisting of code, data and an execution state. In ACC, the behavior of AAs is modeled on the behavior of ants in nature. Just as ants in nature cooperate with each other to forage for food, AAs in ad-hoc networks cooperate with each other to forage for devices that satisfy the application-targeted properties (devices of interest). In the context of the proposed computing model, user applications can be viewed as a collection of AAs cooperating with each other to achieve a common goal. Such AAs are intelligent and are capable of routing themselves without needing any external support. When admitted for execution on the hosting device, the computation code within the AA is embodied into a task. During its execution this task may modify the data sections of the AA, modify the local tags to which it has access, migrate to another device or block on other tags of interest.

4.1. Format

In addition to its identity and authentication information, an AA comprises code and data sections, a lightweight execution state and a resource estimate table. A digital signature together with the AA and task IDs identifies an AA. The digital signature is used by the host devices to protect the access to an AA's tags. The code and data sections contain the mobile code and data that an AA carries from one device to another. The state field contains the execution context necessary for task resumption after a successful migration. The resource table consists of resource estimates such as execution time, memory requirements, etc. The resource estimates set a bound on the expected needs of the AA at the host device. Figure 5 depicts the skeletal structure of the Ant Agent.

Figure 5. Format of an Ant Agent
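Similarly, the skeletal AA structure of Figure 5 might be rendered in C as follows; all names and sizes here are our illustrative assumptions, not the paper's definitions.

#include <stddef.h>

/* Hypothetical Ant Agent header and sections, following Section 4.1. */
struct ant_agent {
    unsigned int  aa_id, task_id;    /* identity                         */
    unsigned char signature[64];     /* authentication and tag access    */
    struct {
        unsigned long exec_time_ms;  /* declared resource estimates,     */
        unsigned long mem_bytes;     /* checked by the Resource Manager  */
    } resources;
    size_t        code_len, data_len, state_len;
    unsigned char *code;             /* mobile code section              */
    unsigned char *data;             /* mobile data section              */
    unsigned char *state;            /* lightweight execution state      */
};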

4.2. Life Cycle

Once initialized at the originating device, each AA follows the life cycle defined below:

1. It is subject to admission control at the next-hop destination.

2. Upon admission, a task is generated out of the AA's code and data sections and scheduled for execution.

3. After completing its execution, the AA may migrate to other devices of interest or may return to the originating device with the results of its execution.

Ant Agent Admission

To avoid unnecessary resource consumption, the Security Manager executes a three-way handshake protocol for transferring AAs between neighboring devices. First, only the identification information, digital signature and resource table information is sent to the destination for admission control. If the AA admission fails, either due to security or resource constraints, the transferring task is notified so that it can decide upon subsequent action.

If the AA is accepted, the Resource Manager at the destination checks to see if the code section is already cached locally. It then informs the source device to transfer only the missing sections. Thus, if code caching is enabled, the subsequent transfer cost of the code is amortized over time.

Ant Agent Execution

Upon admission, an AA becomes a task which is scheduled for execution by the Virtual Machine (VM). The execution of an AA is non-preemptive, but new AAs can be admitted during execution. An executing AA can yield the VM by blocking on a tag. The VM makes sure that a task conforms to its declared resource estimates. Otherwise, the task can be forcefully removed from the system.

Ant Agent Migration

If the current computation of the AA does not complete on the hosting device, the task may continue its execution on another device. The current execution state is captured and migrated along with the code and data sections. In case the current computation of the AA does complete successfully, the execution state as well as the results are captured and migrated back to the originating device.

4.3. Routing

AAs are self-routing, i.e., they are responsible for determining their own paths through the network. Except for providing the Tag Space, there is no other system support required by the AAs for routing. An AA identifies its destination based on its application's targeted properties. However, the AA executes its routing algorithm on each device in its path. Because the AA is inspired by the behavior of ants found in nature, it deposits an artificial pheromone in the Tag Space of every device that is on the way to its destination. Like its natural counterpart, the artificial pheromone too has a lifetime and is used by other AAs to find their way through the network. Such stigmergic communication between ants in nature is known to yield near-optimal paths. The following subsection explains how.


Page 110: ADCOM 2009 Conference Proceedings

Ant Cooperation in Nature

In nature, ants have the ability to find the shortest path from their colonies to food sources [6]. As an ant moves, it deposits a substance called pheromone on the ground. This deposited pheromone is unique to each colony and is used by its members to establish a route to the food source. Initially, when ants start out with no prior information, they start searching for food by walking in random directions. When an ant finds food, it follows its pheromone trail back to the colony. In doing so, the ant lays down more pheromone along its successful path. When other ants run into a trail of pheromone, they give up their own search and start following the existing trail. Since ants follow the trail with the strongest pheromone concentration, the pheromone on the branches of the shortest path to the food will grow faster when compared to its concentration on other branches. As pheromone evaporates over time, the colony forgets older, sub-optimal paths.

Since AAs are modeled on the behavior described above, over time they too are expected to avoid sub-optimal paths between the originating device and the device of interest. In case the pheromone tags are missing, an AA can forage for the device of interest by spawning another AA for route discovery and blocking on its pheromone tag. A write on this tag unblocks the waiting AA, which will then resume its migration. Since the tags are persistent for their lifetime, pheromone information once acquired can be used by subsequent AAs that belong to the same application.
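The routing details are deferred to ant-based algorithms such as [3], [5] and [19]; a rule common to that family, sketched below in C as our own illustration rather than the paper's algorithm, picks the next hop with probability proportional to the pheromone sensed in the neighbors' tags.

#include <stdlib.h>

/* Choose among n neighbors with probability proportional to the
   pheromone each one carries; a generic ant-routing rule assumed
   here for illustration. */
int choose_next_hop(const double *pheromone, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += pheromone[i];
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < n; i++) {
        r -= pheromone[i];
        if (r <= 0.0)
            return i;
    }
    return n - 1;                  /* guard against rounding error */
}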

5. Simulations

To prove that many distributed applications can be written using Ant Agents, we have implemented a simple application belonging to the Bag-of-Tasks paradigm of distributed applications. There are three reasons for choosing an application conforming to this paradigm. First, many applications that fit this paradigm are highly computationally intensive and thus can benefit from the cooperation of other devices in wireless ad-hoc networks. Second, an application following this paradigm can easily be divided into a large number of coarse-grain tasks. Third, each of these tasks is highly asynchronous and self-contained, and there is limited communication amongst the tasks. These three properties make the chosen paradigm suitable for execution in a networked environment.

5.1. Bag-of-tasks Paradigm

The bag-of-tasks paradigm applies to the situation where the same function is to be executed a large number of times for a range of different parameters. If applying the function to a set of parameters constitutes a task, then the collection of all tasks that need to be solved is called the bag of tasks. Such a collection of tasks need not be solved in any particular order. Workers are entities capable of executing and solving tasks from the bag. At each iteration a worker grabs one task from the bag and computes the result.

Bag-of-tasks applications share a general structure. The first step is to initialize the problem data. Then the bag of tasks is created, where the termination condition either represents a fixed number of iterations or is given implicitly by reading input values from a file until the end of the file. The actual computation is represented by a loop, which is repeated until the bag is empty. Multiple workers may execute the loop independently. All workers have shared access to the task bag and the output data. Each worker repeatedly removes a task, solves it by applying the main compute function to it and writes the results into a file.

5.2. ACC Implementation

Because the number of tasks in the bag-of-tasks paradigm may be large, it is useful to allow the tasks to be generated on the fly. An AA is created specifically for this purpose. This task-generating AA, called the Generator AA, typically stays at the originating device. When the number of tasks in the task pool falls below a certain threshold, the task-generating AA generates additional new tasks. The Generator AA terminates when the bag of tasks becomes empty.

After generating the initial number of tasks, the Generator AA injects them as AAs into the network. These AAs independently forage the network for adequate computing resources. When an AA discovers a device of interest, it migrates there and starts executing its task. Post completion, the results are returned to the Generator AA on the originating device. The Generator AA injects a new AA into the network for every execution result received. This new AA can swiftly migrate to a device of interest by following the pheromone trail of the previous successful AAs. The number of AAs in the network is not fixed but can be dynamically changed to adapt to the changes in the network.

Figure 6 depicts a snapshot of the ACC implementation, with the Generator AA located at the originating device (black circle) while four application AAs execute at four different devices of interest (gray circles). The arrows indicate the back-and-forth migration of the AAs.

5.3. Performance

The bag-of-tasks paradigm is widely used in many scientific computations. Our experiments with this paradigm were based on a Monte Carlo simulation of a model of light transport in organic tissue [21]. The simulation runs as follows. Once launched, a photon is moved a distance where


Page 111: ADCOM 2009 Conference Proceedings

Figure 6. Illustration of the B-o-T implementation

it may be scattered, absorbed, propagated undisturbed, internally reflected or transmitted out of the tissue. The photon is repeatedly moved until it either escapes from, or is absorbed by, the tissue. This process is repeated until the desired number of photons has been propagated.
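A heavily simplified C sketch of this photon loop, with invented absorption and escape probabilities (the actual model [21] derives these from the tissue's optical properties):

#include <stdio.h>
#include <stdlib.h>

enum fate { ABSORBED, ESCAPED };

/* Move one photon until it is absorbed by, or escapes from, the tissue. */
static enum fate trace_photon(void)
{
    for (;;) {
        double u = (double)rand() / RAND_MAX;
        if (u < 0.02) return ABSORBED;   /* assumed absorption probability */
        if (u < 0.03) return ESCAPED;    /* assumed escape probability     */
        /* otherwise scattered/propagated: move again */
    }
}

int main(void)
{
    int absorbed = 0, escaped = 0;
    for (int i = 0; i < 100000; i++) {   /* desired number of photons */
        if (trace_photon() == ABSORBED) absorbed++; else escaped++;
    }
    printf("absorbed %d, escaped %d\n", absorbed, escaped);
    return 0;
}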

Because the model assumes that the movement of each photon in the tissue is independent of all other photons, this simulation fits well in the bag-of-tasks paradigm. The experiment results are shown in Figure 7. The graph presents a near-linear speedup for each additional AA injected into the network.

Figure 7. Speedup for B-o-T experiments

6. Related Work

Ant Agents bear some similarity to Active Messages [7], Active Networks [18], [23], [25], Mobile Agents [10], [16] and Smart Messages [12]. Although Ant Agents borrow implementation solutions from all of them, the concept is markedly different.

Similar to Active Messages [7], the receipt of an Ant Agent at any device in the network leads to the execution of some code block on the receiving device. However, while Active Messages point to the handler at the receiving device, Ant Agents carry their own code with them. Moreover, Ant Agents and Active Messages address completely different problems. While Active Messages target fast communication in system-area networks, Ant Agents are meant to address large, heterogeneous, ad-hoc wireless networks.

The Smart Packets [23] architecture provides a flexible means of network management through the use of mobile code. Smart Packets are implemented over IP, using the IP option header. They are routed just like other data traffic in the network and only execute on arrival at a specific location [4]. Unlike Smart Packets, Ant Agents are executed at each hop in the network, not only to deposit their artificial pheromone but also to determine their next hop towards the destination. Additionally, Ant Agents carry their execution context along with them.

The ANTS [25] capsule model of programmability allows forwarding code to be carried and safely executed inside the network by a Java VM [4]. When compared to Ant Agents, we find that ANTS does not migrate the execution state from device to device. Also, ANTS targets IP networks while Ant Agents target large, heterogeneous, wireless ad-hoc networks.

A Mobile Agent [16] may be viewed as a task that explicitly migrates from node to node, assuming that the underlying network assures its transport between them [4]. Unlike mobile agents, however, Ant Agents are responsible for their own routing in the network. The ACC architecture further defines the infrastructure that devices in the network must implement in order to support Ant Agents.

Ant Agents are similar to Smart Messages [12], which also use migration of code in wireless networks. Apart from being responsible for their own routing, Smart Messages also carry with them their execution state as well as their code and data blocks during every migration. However, unlike Smart Messages, Ant Agents are modeled on ants in nature. Hence, Ant Agents achieve stigmergic communication by recording their pheromone in the Tag Space of every device visited, which can be sensed by other Ant Agents belonging to the same distributed application. Such ant-like behavior, when employed in large numbers, is known to yield the emergence of optimal paths [15].

7. Conclusions

This paper has described a computing paradigm for large-scale, heterogeneous, wireless ad-hoc networks. In the proposed model, distributed applications are implemented as a collection of Ant Agents. The model overcomes the scale, heterogeneity and connectivity issues by placing the intelligence in migratory execution units. The devices in the


Page 112: ADCOM 2009 Conference Proceedings

network cooperate by providing common minimal system support for the receipt and execution of Ant Agents. Simulations for the Bag-of-Tasks family of distributed applications demonstrated that Ad-hoc Cooperative Computation represents a flexible and simple solution for surmounting resource constraints on mobile computing devices.

References

[1] Car 2 Car Communication Consortium. http://www.car-to-car.org/.
[2] W. Alsalih, S. Akl, and H. Hassanein. Cooperative ad hoc computing: towards enabling cooperative processing in wireless environments. International Journal of Parallel, Emergent and Distributed Systems, 23(1):58-79, February 2008.
[3] J. S. Baras and H. Mehta. A probabilistic emergent routing algorithm for mobile ad hoc networks, 2003.
[4] C. Borcea, D. Iyer, P. Kang, A. Saxena, and L. Iftode. Cooperative computing for distributed embedded systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS 2002), pages 227-236, 2002.
[5] G. D. Caro, F. Ducatelle, and L. M. Gambardella. AntHocNet: An adaptive nature-inspired algorithm for routing in mobile ad hoc networks. European Transactions on Telecommunications, 16:443-455, 2005.
[6] G. D. Caro and M. Dorigo. AntNet: A mobile agents approach to adaptive routing. Technical report, 1997.
[7] T. V. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation, 1992.
[8] M. Güneş and O. Spaniol. Routing algorithms for mobile multi-hop ad-hoc networks. In Proceedings of the International Workshop on Next Generation Network Technologies, European Commission Central Laboratory for Parallel Processing, Bulgarian Academy of Sciences, 2002.
[9] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pages 34-45, June 1974.
[10] R. Gray, D. Kotz, G. Cybenko, and D. Rus. Mobile agents: Motivations and state-of-the-art systems. Technical report, Handbook of Agent Technology, 2000.
[11] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient wireless sensor networks with low-level naming. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 146-159, 2001.
[12] P. Kang, C. Borcea, G. Xu, A. Saxena, U. Kremer, and L. Iftode. Smart messages: A distributed computing platform for networks of embedded systems. The Computer Journal, Special Focus on Mobile and Pervasive Computing, 47:475-494, 2004.
[13] D. B. Lange and M. Oshima. Seven good reasons for mobile agents. Communications of the ACM, 42(3):88-89, 1999.
[14] K. Lorincz, D. J. Malan, T. R. F. Fulford-Jones, A. Nawoj, A. Clavel, V. Shnayder, G. Mainland, M. Welsh, and S. Moulton. Sensor networks for emergency response: Challenges and opportunities. IEEE Pervasive Computing, 3(4):16-23, 2004.
[15] V. Maniezzo and A. Carbonaro. Ant colony optimization: An overview. In Essays and Surveys in Metaheuristics, pages 21-44. Kluwer Academic Publishers, 1999.
[16] D. S. Milojicic, W. LaForge, and D. Chauhan. Mobile objects and agents (MOA), 1998.
[17] R. Min and A. Chandrakasan. A framework for energy-scalable communication in high-density wireless networks. In International Symposium on Low Power Electronics and Design, pages 36-41, 2002.
[18] J. T. Moore, M. Hicks, and S. Nettles. Practical programmable packets. In Proceedings of the 20th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2001), pages 41-50, 2001.
[19] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable ant-based routing algorithm for ad-hoc networks. In 3rd IASTED International Conference on Communications, Internet, and Information Technology, 2004.
[20] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable and efficient ant-based routing algorithm for ad-hoc networks. IEICE Transactions on Communications, E89-B(4):1231-1238, January 2006.
[21] S. A. Prahl, M. Keijzer, S. L. Jacques, and A. J. Welch. A Monte Carlo model of light propagation in tissue. SPIE Proceedings of Dosimetry of Laser Radiation in Medicine and Biology, IS(5):102-111, 1989.
[22] O. Riva, T. Nadeem, C. Borcea, and L. Iftode. Context-aware migratory services in ad hoc networks. IEEE Transactions on Mobile Computing, 6(12):1313-1328, 2007.
[23] B. Schwartz, A. W. Jackson, W. T. Strayer, W. Z., R. D. Rockwell, and C. Partridge. Smart packets for active networks, 1998.
[24] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister. Smart dust: Communicating with a cubic-millimeter computer. Computer Magazine, 34(1):44-51, January 2001.
[25] D. Wetherall. Active network vision and reality: Lessons from a capsule-based system, 1999.


Page 113: ADCOM 2009 Conference Proceedings

A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks

Natarajan Meghanathan, Jackson State University, USA, E-mail: [email protected]
Ayomide Odunsi, Goldman Sachs, USA, E-mail: [email protected]

Abstract

The overall goal of this paper is to investigate the scalability of the Fish-eye State Routing (FSR) protocol and the Dynamic Source Routing (DSR) protocol under different network scenarios in mobile ad hoc networks (MANETs). This performance based study simulates FSR and DSR under practical network scenarios typical of MANETs, and measures selected metrics that give an introspective look into the performance of FSR and DSR. The implementations of both protocols are simulated for varying conditions of network density, node mobility and traffic load. The following performance metrics are evaluated: packet delivery ratio, average hop count per path, control message overhead and energy consumed per node. Simulation results indicate FSR scales relatively better compared to DSR and consumes less energy when operated with moderate to longer link-state broadcast update time intervals in high density networks with moderate to high node mobility and offered traffic load. FSR successfully delivers packets for a majority of the time with relatively lower energy cost in comparison to DSR.

Keywords: Routing protocols, Mobile ad hoc networks, Energy consumption, Simulations, Performance Studies

1. Introduction

A mobile ad hoc network (MANET) is a dynamic distributed system of wireless nodes wherein the nodes move independently of each other. MANETs have several operating constraints such as: limited battery charge per node, limited transmission range per node and limited bandwidth. Routes in MANETs are often multi-hop in nature. Packet transmission or reception consumes the battery charge at a node. Nodes forward packets for their peers in addition to their own. In other words, nodes are forced to expend their battery charge for receiving and transmitting packets that are not intended for them. Given the limited energy budget for MANETs, inadvertent over usage of the energy resources of a small set of nodes at the cost of others can have an adverse impact on the node lifetime.

There exist two classes of MANET routing protocols [1]: proactive and reactive. The proactive routing protocols can be of two sub-categories: distance-vector and link-state based routing. In the distance-vector based routing approach, each node periodically exchanges its routing table for the whole network with all of its neighbors. For each destination, the neighbor node that informs of the best path to a destination is chosen as the next hop. In the link-state based routing approach, each node periodically floods link-state updates, containing the list of its neighbors, to the whole network. Using these link-state updates, the global topology is locally constructed at each node and the Dijkstra algorithm [2] is run on this topology to find the best path to any other node. The Destination-Sequenced Distance Vector (DSDV) routing [3] and Optimized Link State Routing (OLSR) [4] protocols are classical examples of the distance-vector and link-state based strategies respectively. Proactive routing protocols are characterized by low route discovery latency as routes between any two nodes are known at any time instant. But there is a high control overhead involved in periodically propagating the routing tables or the link-state updates to determine and maintain routes.

The reactive or on-demand routing protocols discover routes only when required. When a source node has data to send to a destination node and does not have a route to use, the source node broadcasts a Route-Request (RREQ) message in its neighborhood and, through further broadcasts by the intermediate nodes, the RREQ message is propagated towards the destination. The destination node receives the RREQ message along several paths and chooses the path that best satisfies the route selection principles of the routing protocol. The destination sends a Route-Reply (RREP) message to the source on the best path selected. The Dynamic Source Routing (DSR) [5] protocol and the Ad hoc On-demand Distance Vector (AODV) [6] routing protocol are classical examples of the reactive routing protocols. The reactive routing protocols are often characterized by low route discovery overhead as routes are discovered only when needed; but, the tradeoff is higher route discovery latency.

The Fish-eye State Routing (FSR) protocol [7] is a type of link-state based proactive routing protocol


Page 114: ADCOM 2009 Conference Proceedings

proposed to lower the traditionally observed higher control overhead with this class of protocols. In FSR, a node exchanges its link-state updates more frequently with nearby nodes, and less frequently with nodes that are farther away. The number of nodes with which the link-state information is exchanged more frequently is controlled by the "Scope" parameter (basically the number of hops), while the frequency of updating the neighbors outside the scope is controlled by the "Time Period of Update" (TPU) parameter. The operation of FSR is basically controlled by these two parameters. As a result, a node maintains accurate distance and path information to its nearby nodes, with progressively less accurate detail about the path to nodes that are farther away. A scope value of 1 and a larger TPU value typically results in a lower control overhead at the cost of a higher hop count path (a sub-optimal path) between any two nodes. On the other hand, a scope value equal to the diameter of the network and a smaller TPU value basically transform FSR to OLSR, resulting in higher control overhead with the advantage of being able to use the minimum hop path between any two nodes.

Given that the scope parameter is normally set to 1-hop, the critical performance metrics for FSR, such as the control overhead (number of link-state messages exchanged), path hop count and energy consumption, are heavily dependent on the TPU parameter. To date, only a handful of performance studies ([8][9][10]) are available for FSR in the literature. To the best of our knowledge, we could not find a simulation study on the performance of FSR as a function of this TPU parameter. In addition, we conjecture that as the node mobility and network density increase, FSR, based on a proactive routing strategy, may be preferable to the reactive DSR. DSR and FSR have not been categorically studied for different levels of node mobility, network density and offered traffic load. The above observations are the motivation for this paper.

In this paper, we present a simulation based performance analysis of FSR with respect to the TPU parameter under scenarios generated by different combinations of node mobility, network density and offered traffic load. For each of these scenarios, the performance of FSR is also compared with that obtained for DSR. We categorically state which of these two protocols can be preferred for each of the different scenarios. The rest of the paper is organized as follows: Section 2 describes the simulation environment and the scenarios considered. Section 3 defines the performance metrics evaluated. Section 4 illustrates the simulation results obtained for different scenarios, interprets the performance of FSR with respect to the TPU parameter and compares the performance of FSR vis-à-vis DSR. Section 5 concludes the paper.

2. Simulation Environment

The simulations of FSR and DSR were conducted in

ns-2 [10]. The network dimensions are 1000m x 1000m.

The transmission range of each node is 250m. We vary

the network density by conducting simulations with 50

nodes (low density network with an average of 10

neighbors per node) and 75 nodes (high density network

with an average of 15 neighbors per node). The

simulation time is 1000 seconds. The scope value is 1-

hop. If all the nodes flood their link-state updates at the

same time instant, there would be collisions in the

network. Hence, the TPU value for each node in the

network is uniformly and randomly chosen from the

interval [0…TPUmax]. The different values of TPUmax

studied in the simulations are: 5, 20, 50, 100, 200 and

300 seconds. For simplicity, we refer to TPUmax simply as TPU for the rest of this paper; the two terms are interchangeable.

The node mobility model used in all of our

simulations is the commonly used Random Waypoint

model [11]. Each node starts moving from an arbitrary

location to a randomly selected destination location at a

speed uniformly distributed in the range [0,…,vmax].

Once the destination is reached, the node may stop there

for a certain time called the pause time (0 seconds in our

simulation) and then continue to move by choosing a

different target location and a different velocity. The

vmax values used are 5 m/s, 50 m/s and 100 m/s; the

corresponding average node velocity values are: 2.5

m/s, 25 m/s and 50 m/s representing mobility levels of

low (school environment), moderate (downtown) and

high (interstate highway) respectively.
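For illustration, a minimal Python sketch of the Random Waypoint behavior described above, assuming a zero pause time as in the paper (the guard against a zero speed draw is ours; the model itself leaves this unspecified):

    import math
    import random

    def random_waypoint(area=(1000.0, 1000.0), v_max=5.0,
                        pause=0.0, sim_time=1000.0):
        """Yield (arrival_time, x, y) waypoints for one node."""
        t = 0.0
        x, y = random.uniform(0, area[0]), random.uniform(0, area[1])
        while t < sim_time:
            # Pick a random destination; speed is uniform in [0, v_max],
            # with a tiny floor only to avoid division by zero.
            dest_x = random.uniform(0, area[0])
            dest_y = random.uniform(0, area[1])
            speed = random.uniform(0.0, v_max) or 1e-9
            t += math.hypot(dest_x - x, dest_y - y) / speed + pause
            x, y = dest_x, dest_y
            yield t, x, y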

Traffic sources are constant bit rate (CBR). The number of source-destination (s-d) sessions used is 15 (low traffic load) and 40 (high traffic load). The starting times of the s-d sessions are uniformly distributed between 1 and 20 seconds. Data packets are 512 bytes in

size; the packet sending rate is 4 data packets per

second. While distributing the source-destination roles, we ensured that no node serves as the source of more than two sessions or as the destination of more than two sessions.

Each node is initially provided with 1000 Joules of energy to ensure that no node fails due to an inadequate energy supply. The transmission power loss per hop is fixed at 1.4 W, and the reception power

loss is 1 W [12]. The Medium Access Control (MAC)

layer model used is the standard IEEE 802.11 model

[13] wherein access to the channel per hop is

accomplished using a Request-to-send (RTS) and Clear-

to-send (CTS) control message exchange between the

sender and the receiver constituting the hop in a path.

The different combinations of simulation scenarios used

in this paper are summarized in Table 1.


Page 115: ADCOM 2009 Conference Proceedings

Table 1: Scenarios Studied in the Simulation

Scenario # | Network Density | Offered Traffic Load | Node Mobility
1 | Low (50 nodes) | Low (15 s-d Pairs) | Low (vmax = 5 m/s)
2 | Low (50 nodes) | Low (15 s-d Pairs) | Moderate (vmax = 50 m/s)
3 | Low (50 nodes) | Low (15 s-d Pairs) | High (vmax = 100 m/s)
4 | Low (50 nodes) | High (40 s-d Pairs) | Low (vmax = 5 m/s)
5 | Low (50 nodes) | High (40 s-d Pairs) | Moderate (vmax = 50 m/s)
6 | Low (50 nodes) | High (40 s-d Pairs) | High (vmax = 100 m/s)
7 | High (75 nodes) | Low (15 s-d Pairs) | Low (vmax = 5 m/s)
8 | High (75 nodes) | Low (15 s-d Pairs) | Moderate (vmax = 50 m/s)
9 | High (75 nodes) | Low (15 s-d Pairs) | High (vmax = 100 m/s)
10 | High (75 nodes) | High (40 s-d Pairs) | Low (vmax = 5 m/s)
11 | High (75 nodes) | High (40 s-d Pairs) | Moderate (vmax = 50 m/s)
12 | High (75 nodes) | High (40 s-d Pairs) | High (vmax = 100 m/s)

3. Performance Metrics

The following performance metrics are evaluated for

each of the 12 scenarios (listed in Table 1) and each of

the six TPU values considered.

(i) Packet Delivery Ratio – the ratio of the number of data packets successfully delivered to their destinations to the number of data packets originated at the sources.

(ii) Average Hop Count per Path – the average number

of hops in the route of an s-d session, time averaged

considering the duration of the s-d paths for all the

sessions over the entire simulation time.

(iii) Control Message Overhead – the ratio of the total

number of control messages (route discovery

broadcast messages for DSR or the link-state update

broadcast messages for FSR) received at the nodes

to that of the actual number of data packets

delivered to the destinations across all s-d sessions.

(iv) Energy Consumption per Node – the average

energy consumed across all the nodes in the

network. The energy consumed due to transmission

and reception of data packets, periodic broadcasts

and receptions (in the case of FSR), and route

discoveries (in the case of DSR) all contribute to

the energy consumed at a node.

Note that we take into consideration the number of

control messages received rather than transmitted

because a typical broadcast involves a node transmitting

the control message and all of its neighbors receiving

the control message. The energy expended to receive the

control message, summed over all the nodes, is far less

than the energy expended to transmit the message.
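As a rough illustration (this is not the authors' ns-2 post-processing; all counter names are hypothetical), the four metrics above could be computed from raw simulation counters as follows:

    def compute_metrics(pkts_originated, pkts_delivered,
                        hop_count_time_integral, path_lifetime_total,
                        ctrl_msgs_received, node_energies):
        """Return the four metrics of Section 3 from raw counters."""
        pdr = pkts_delivered / pkts_originated                     # metric (i)
        avg_hops = hop_count_time_integral / path_lifetime_total   # metric (ii), time-averaged
        ctrl_overhead = ctrl_msgs_received / pkts_delivered        # metric (iii)
        energy_per_node = sum(node_energies) / len(node_energies)  # metric (iv)
        return pdr, avg_hops, ctrl_overhead, energy_per_node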

4. Simulation Results

Each data point in Figures 1 through 4 and Tables 2

and 3 is an average of data collected using 5 mobility

trace files for each value of vmax and network density,

and 5 sets of randomly selected 15 and 40 s-d sessions.

To present the results of FSR (with larger TPU values)

and DSR in a comparable scale in the figures, we

present the control message overhead and energy

consumption per node incurred by FSR for maximum

TPU value of 5 seconds in Tables 2 and 3 respectively.

4.1 Low Network Density and Low Traffic

Load (Scenarios 1 through 3)

The packet delivery ratio (refer Figure 1.1) of FSR

decreases with increase in the TPU value. This can be

attributed to the inaccuracy in the routing information

stored at the intermediate nodes for certain destination

nodes. However, it should be noted that FSR still

consistently maintains a packet delivery ratio of above

90% even for TPU values exceeding 200 seconds. For

both FSR and DSR, as the node mobility is increased

from 5m/s to 50m/s, there is an increase in the packet

delivery ratio. In low density networks, spatial

distribution of nodes plays a critical role in the

effectiveness of a routing protocol. Nodes are sparsely

distributed in a low density network and if nodes are

also characterized with low mobility, they tend to

experience higher rates of network disconnection.

Consequently, since the nodes do not change their

positions frequently, the disconnected state persists, and

packet delivery is adversely impacted. In contrast, as

node mobility increases, nodes are redistributed and

move to new locations, thus increasing the probability

that they move within the transmission range of each

other. As a result, the probability of network

connectivity increases, thus increasing the likelihood of

a node successfully routing a packet to its destination.

In the low node mobility scenario, FSR was observed to yield paths closer to the minimum hop path than DSR for a time period of update (TPU) value of 5


Page 116: ADCOM 2009 Conference Proceedings

Table 2: Control Message Overhead (Control messages received per data packet delivered) for

Maximum TPU Value of 5 Seconds

Maximum Node Velocity (vmax) | Low Density, Low Traffic Load | Low Density, High Traffic Load | High Density, Low Traffic Load | High Density, High Traffic Load
5 m/s   | 178 | 64 | 585 | 220
50 m/s  | 182 | 69 | 640 | 235
100 m/s | 180 | 67 | 660 | 250

Table 3: Energy Consumption per Node at Maximum TPU Value of 5 Seconds

Maximum Node Velocity (vmax) | Low Density, Low Traffic Load | Low Density, High Traffic Load | High Density, Low Traffic Load | High Density, High Traffic Load
5 m/s   | 104 Joules | 126 Joules | 212 Joules | 230 Joules
50 m/s  | 110 Joules | 129 Joules | 235 Joules | 250 Joules
100 m/s | 109 Joules | 129 Joules | 240 Joules | 255 Joules

Figure 1.1: Packet Delivery Ratio
Figure 1.2: Average Hop Count per Path
Figure 1.3: Control Message Overhead
Figure 1.4: Average Energy Consumption per Node
Figure 1: Performance of FSR and DSR in Low Density Network and Low Traffic Load Scenarios

seconds (see Figure 1.2). In low node density networks,

nodes are sparsely distributed, and availability of routes

between s-d pairs is not always guaranteed. At low

mobility, nodes are less likely to change their location,

which hinders them from discovering shorter

routes to destinations. In addition, DSR tends to

maintain its current minimum hop path route until a link

failure is detected, predisposing it to retain sub-optimal

routing information in low node density scenarios.

Consequently, since FSR proactively maintains more

accurate topology information at lower TPU values, it

outperforms DSR by determining shorter

minimum hop paths. In contrast, at higher TPU values,

FSR propagates routing information infrequently, thus

DSR outperforms FSR at these TPU values. The

degradation in the performance of FSR can be attributed

to routing inaccuracy as a result of longer link-state

update time intervals utilized to exchange broadcast

messages about the network topology.

FSR incurs a significantly higher control overhead

over DSR at a lower TPU value of 5 seconds (see Table

2 and Figure 1.3). FSR periodically generates network-wide broadcasts once every TPU interval in order to establish routes for every node in the network.


Page 117: ADCOM 2009 Conference Proceedings

Figure 2.1: Packet Delivery Ratio
Figure 2.2: Average Hop Count per Path
Figure 2.3: Control Message Overhead
Figure 2.4: Average Energy Consumption per Node
Figure 2: Performance of FSR and DSR in Low Density Network and High Traffic Load Scenarios

This process of periodic broadcasts generates high

control overhead especially if it is done rather

frequently as in the case of a TPU value of 5 seconds. In

contrast, DSR incurs less overhead than FSR because it

generates fewer control packets in a low network density

scenario. DSR performs network wide flooding only

when a route is needed for a data transmission session,

and thus its control overhead depends on the offered

traffic load (number of s-d pair sessions).

In comparison to DSR, FSR generates less control

message overhead (refer Figure 1.3) for TPU values

ranging from 50 to 300 seconds. With respect to node

mobility, the amount of overhead generated appears to

grow with increasing mobility in DSR. However, FSR

remains unaffected by variations in node mobility.

At lower TPU values for FSR, it can be observed

that DSR consumes less energy per node (refer Table 3

and Figure 1.4), relative to FSR. This is expected

because DSR, a reactive protocol, should incur less

energy consumption as a result of less control overhead

generation, when compared to a proactive routing

protocol like FSR. However, we do notice that operating

FSR under higher TPU values helps to minimize energy

consumption per node. FSR consumes less energy per node than DSR in the high mobility scenario of 100 m/s, at TPU values of 100 seconds and 200 seconds. Figure 1.1 shows the

packet delivery ratios of FSR corresponding to TPU

values of 100 seconds and 200 seconds in a

characteristic high mobility scenario of 100m/s to be at

least 94%. Thus, FSR can be utilized in high node velocity scenarios for applications that require optimized energy consumption and can tolerate a packet delivery ratio of approximately 94%.

4.2 Low Network Density and High Traffic

Load (Scenarios 4 through 6)

Both DSR and FSR exhibit an appreciable increase

in their respective packet delivery ratios (refer Figure

2.1) at low node mobility of 5m/s. However, as node

mobility is increased to 50m/s and 100m/s respectively,

both protocols experience a slight decrease in their

packet delivery ratios. This observation is justified for

the following reason: In networks of low density and

high traffic load, the number of neighbors per node is

significantly smaller compared to the number of active

s-d pairs. As a result, there is more demand placed on a

few nodes to successfully route packets to their

destinations. This obviously results in more packets

getting dropped at each node and hinders the ability of

both protocols to successfully route packets to their

destinations at a higher rate. As node velocity is

increased from low to high, FSR incurs a higher hop

count compared to DSR, except for the TPU value of 5

seconds (see Figure 2.2). The hop count of DSR is not

much affected by the node velocities.

As illustrated in Table 2, FSR is observed to incur a

higher control overhead than DSR at a lower TPU value

of 5 seconds due to frequent network-wide broadcasts.


Page 118: ADCOM 2009 Conference Proceedings

Figure 3.1: Packet Delivery Ratio
Figure 3.2: Average Hop Count per Path
Figure 3.3: Control Message Overhead
Figure 3.4: Average Energy Consumption per Node
Figure 3: Performance of FSR and DSR in High Density Network and Low Traffic Load Scenarios

However, DSR incurs significantly more overhead than

FSR as traffic load is increased to 40 s-d pairs (refer

Figure 2.3) for TPU values of 20 seconds and beyond.

This observation can be attributed to the reactive nature

of DSR and the low node density of the network. DSR

determines routes as needed. With an increasing need to

determine routes for a growing number of s-d pairs,

DSR invokes its route discovery mechanism frequently,

leading to frequent flooding of the network with

broadcast messages. The number of route discoveries

increases with increasing mobility, to determine routes

for all the s-d pairs, and thus DSR incurs a higher

control overhead compared to FSR. FSR remains

largely unaffected by increasing rates of node mobility.

The amount of energy consumed by both protocols

(refer Figure 2.4) is observed to be appreciably larger

than that observed in low-density networks with low

traffic load (refer Figure 1.4). The spike noticed in

energy consumption can be attributed to factors such as

the number of data and control packets flowing through

the network. An increase in the offered traffic load at

low network density is analogous to an increase in the

number of active s-d pairs wishing to establish sessions.

Consequently, this corresponds to an increase in the

number of data packets flowing through each node in

the network, which contributes to the observed increase

in the energy consumption at each node. In addition, in

a low density network, the probability of route failures

is rather high. This is attributed to the fact that nodes

could be sparsely distributed, and as a result, will be

unable to find paths to route data packets successfully to

their designated destination. Thus, there will be an

observed increase in the amount of control overhead

generated to maintain and establish routes for the

voluminous amount of data traffic. The energy

consumption of FSR is significantly less when

compared to that of DSR in moderate to high mobility

scenarios for a TPU value of 100 seconds and above.

4.3 High Network Density and Low Traffic

Load (Scenarios 7 through 9)

In high-density networks, the packet delivery ratios

incurred by both FSR and DSR are relatively larger than

those incurred in low-density networks (compare

Figures 1.1 and 3.1). For low mobility scenarios of

5m/s, both FSR and DSR deliver packets at

approximately 100%. FSR maintains this perfect packet

delivery rate for low node mobility as TPU values are

increased from 5 seconds up to 200 seconds. The better

performance of both protocols can be attributed to the

fact that each node has more neighbors within its

transmission range to route messages along a given s-d

route. This distribution almost always guarantees that a

packet will be successfully routed to its destination.

In high-density networks, the average hop count per path values for both FSR and DSR, shown in Figure 3.2, are appreciably lower than those in the low network density scenarios in Figures 1.2 and 2.2. Nodes in a high

density network tend to have more neighbors, and as a


Page 119: ADCOM 2009 Conference Proceedings

Figure 4.1: Packet Delivery Ratio
Figure 4.2: Average Hop Count per Path
Figure 4.3: Control Message Overhead
Figure 4.4: Average Energy Consumption per Node
Figure 4: Performance of FSR and DSR in High Density Network and High Traffic Load Scenarios

result have better path alternatives (shorter paths) to

choose from among the optimal routes to any given

destination. On the other hand, FSR and DSR are

observed to incur significantly higher control overhead

in high-density networks. This is because more

broadcast messages are received at each node due to an

increase in the number of neighbors. As illustrated in

Table 2 and Figure 3.3, for TPU value of 100 seconds

or above, FSR incurs less control overhead than DSR.

The energy consumed per node by both protocols is

lower in magnitude for high density networks, compared

to that consumed in lower density networks (see Figures

1.4, 2.4, 3.4 and 4.4). As each node has more neighbors,

data gets efficiently routed along optimal paths in high

density networks. In low node mobility scenarios, the

energy consumption of FSR is significantly higher than

that of DSR. This is because FSR incurs a fixed energy

cost due to periodic network broadcasts. However, at

higher node mobility scenarios, energy consumption of

FSR converges to that of DSR, and actually outperforms

DSR at higher TPU values of 100 seconds and above.

Thus, FSR can be employed as a suitable routing

alternative in networks characterized with high node

density and moderate to high node mobility.

4.4 High Network Density and High Traffic

Load (Scenarios 10 through 12)

At low node mobility, FSR and DSR maintained a near-perfect packet delivery ratio of approximately 100%, as illustrated in Figure 4.1. For moderate to high node mobility, DSR yielded a higher packet delivery ratio.

FSR and DSR in terms of packet delivery ratio

increased, as the TPU parameter values were increased

from 50 seconds to 300 seconds. It should be noted that

FSR is still able to maintain a packet delivery ratio

above 97% even at a high TPU value of 300 seconds.

With respect to hop count, FSR outperforms DSR in the low node mobility scenario at a TPU of 5 seconds. Beyond 5 seconds, DSR discovers shorter minimum hop paths than FSR, owing to the inaccurate routes present in FSR. One major difference

observed is a slight increase in the magnitude of the hop

count discovered by both protocols as compared to the

high network density and low traffic load scenario.

As illustrated in Table 2 and Figure 4.3, with respect

to the control message overhead, FSR scales

considerably better than DSR at TPU values greater

than 50 seconds. FSR proactively maintains routing

information and is not affected by increasing network

density. On the other hand, DSR incurs more overhead

with increasing demand of route discoveries for the s-d

sessions. Thus, variations in mobility have a significant

effect on the amount of control messages generated by

DSR in high node density and high traffic scenarios.

FSR is not much affected by changes in node mobility.

It is observed from Table 3 and Figure 4.4 that the

energy consumption of both protocols exceeded that of

the high network density and low traffic load scenario

(refer Figure 3.4). This is justified by the increase


Page 120: ADCOM 2009 Conference Proceedings

observed in the number of communicating s-d pairs.

More packets are routed in the network due to data and

control overhead, and as a result nodes expend more

energy associated with routing a larger amount of

packets. Energy consumption per node also increases

with increase in the mobility levels of nodes. When

compared to DSR, FSR consumes less energy in

moderate to high mobility scenarios at TPU values

ranging from 20 seconds to 300 seconds. Thus, it can be

suggested that for high mobility and high-density

scenarios, FSR can be configured to a lower TPU value

of 20 seconds to minimize energy consumption. For

moderate node mobility, high density and high traffic

load networks, FSR can be selected over DSR by

configuring the former with a TPU value of 50 seconds.

5. Conclusions

This paper explores the performance and the

associated tradeoffs for the FSR protocol relative to the

DSR protocol for MANETs under varying scenarios of

network density, node mobility, and traffic load using a

comprehensive simulation based analysis. Conclusions

and suggestions are made with respect to the

configuration of the FSR protocol in order to yield

better performance than DSR under specific scenarios

based on the results observed in the simulations.

A significant tradeoff has been observed in the

performance of FSR regarding the hop count per path.

For lower TPU values, FSR has been discovered to

obtain shorter paths due to the increased frequency of

route update messages. As the TPU value is increased,

FSR has been observed to incur higher hop count values

due to lower update frequency. Consequently, this leads

to the persistence of stale routes, which generates longer

hop paths. We have identified the TPU values that will

generate paths with hop count comparable to DSR. It

has been discovered that at low mobility levels, FSR yields shorter paths.

In high density networks characterized with high

traffic load, even at higher TPU values, FSR has a

significantly lower control message overhead compared

to DSR and yet achieves a packet delivery ratio of at

least 90%. The same trend has been noticed with respect to energy consumption at high node density and moderate to high mobility values, with FSR expending less energy on routing and topology maintenance than DSR.

6. References

[1] C. Siva Ram Murthy and B. S. Manoj, “Routing

Protocols for Ad Hoc Wireless Networks,” Ad Hoc

Wireless Networks: Architectures and Protocols,

Chapter 7, pp. 299 – 364, Prentice Hall, June 2004.

[2] C. E. Perkins and P. Bhagwat, “Highly Dynamic

Destination Sequenced Distance Vector Routing for

Mobile Computers,” Proceedings of ACM (Special

Interest Group on Data Communications)

SIGCOMM, pp. 234 – 244, October 1994.

[3] P. Jacquet, P. Muhlethaler, T. Clausen, A. Laouiti,

A. Qayyum and L. Viennot, “Optimized Link State

Routing Protocol for Ad Hoc Networks,”

Proceedings of the IEEE International Multi Topic

Conference, pp. 62 – 68, Pakistan, December 2001.

[4] D. B. Johnson, D. A. Maltz, and J. Broch, “DSR:

The Dynamic Source Routing Protocol for Multi-

hop Wireless Ad hoc Networks,” Ad hoc

Networking, edited by Charles E. Perkins, Chapter

5, pp. 139-172, Addison-Wesley, 2001.

[5] C. E. Perkins and E. M. Royer, “Ad hoc On-Demand

Distance Vector Routing,” Proceedings of the 2nd

IEEE Workshop on Mobile Computing Systems and

Applications, pp. 90-100, February 1999.

[6] G. Pei, M. Gerla and T.-W. Chen, “Fisheye

State Routing: A Routing Scheme for Ad Hoc

Wireless Networks,” Proceedings of the

International Conference on Communications, pp.

70 -74, New Orleans, USA, June 2000.

[7] S. Jaap, M. Bechler and L. Wolf, “Evaluation of

Routing Protocols for Vehicular Ad Hoc Networks

in City Traffic Scenarios,” Proceedings of the 5th

International Conference on Intelligent

Transportation Systems and Telecommunications,

Brest, France, June 2005.

[8] E. Johansson, K. Persson, M. Skold and U. Sterner,

“An Analysis of the Fisheye Routing Technique in

Highly Mobile Ad Hoc Networks,” Proceedings of the IEEE 59th Vehicular Technology Conference, Vol. 4, pp. 2166 – 2170, May 2004.

[9] T-H. Chu and S-I. Hwang, “Efficient Fisheye State

Routing Protocol using Virtual Grid in High-

density Ad Hoc Networks,” Proceedings of the 8th

International Conference on Advanced

Communication Technology, Vol. 3, pp. 1475 –

1478, February 2006.

[10] Ns-2 Simulator: http://www.isi.edu/nsnam/ns/

[11] C. Bettstetter, H. Hartenstein and X. Perez-Costa,

“Stochastic Properties of the Random Waypoint

Mobility Model,” Wireless Networks, pp. 555-567,

Vol. 10, No. 5, September 2004.

[12] L. M. Feeney, “An Energy Consumption Model for

Performance Analysis of Routing Protocols for

Mobile Ad hoc Networks,” Journal of Mobile

Networks and Applications, Vol. 3, No. 6, pp. 239-

249, June 2001.

[13] G. Bianchi, “Performance Analysis of the IEEE

802.11 Distributed Coordination Function,” IEEE

Journal of Selected Areas in Communications, Vol.

18, No. 3, pp. 535-547, March 2000.


Page 121: ADCOM 2009 Conference Proceedings

ADCOM 2009

NETWORK OPTIMIZATION

Session Papers:

1. Angeline Ezhilarasi G and Shanti Swarup K , “Optimal Network Partitioning for Distributed Computing Using Discrete Optimization”

2. Suman Kundu and Uttam Kumar Roy, “An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an Asynchronous Distributed Systems”

3. Amit Kumar Mishra, “A SAL Based Algorithm for Convex Optimization Problems”


Page 122: ADCOM 2009 Conference Proceedings

Optimal Network Partitioning for Distributed Computing Using Discrete Optimization

G. Angeline Ezhilarasi, Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, INDIA, [email protected]

Dr. K. S. Swarup, Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, INDIA, [email protected]

Abstract— This paper presents an evolutionary-based discrete optimization (DO) technique for optimal network partitioning (NP) of a power system network. The algorithm divides the network model into a number of sub-networks optimally, in order to balance distributed computing and parallel processing of power system computations and to reduce the communication overhead. The partitioning method is illustrated on the IEEE Standard 14 Bus, 30 Bus and 118 Bus Test Systems and compared with other existing methods. The performance of the algorithm is studied using the test systems with different configurations.

Keywords-Network Partitioning, Discrete Particle Swarm Optimization.

I. INTRODUCTION
The power system is a large interconnected complex network involving computation-intensive applications and highly nonlinear dynamic entities that are spread across a vast area. Under normal as well as congested conditions, centralized control requires powerful computing facilities and multiple high-speed communication links at the control centers. Under certain circumstances, a failure in a remote part of the system might spread instantaneously if the control action is delayed. This lack of response may cripple the entire power system, including the centralized control center itself. An effective way to monitor and control a complex power system is to intervene locally at the places where there is a disturbance and prevent the problem from propagating through the network. Hence distributed computing can greatly enhance the reliability and improve the efficiency of power system monitoring and control.

To simulate and implement distributed computing in a power system, the large interconnected network must be torn into sub-networks in an optimal way. The partitioning should balance the size of the sub-networks against the number of interconnecting tie lines in order to reduce the overall parallel execution time.

Over the past decades, a number of algorithms have been proposed in the literature for optimal network tearing. The techniques include dynamic programming and heuristic clustering approaches. Optimization techniques such as simulated annealing, genetic algorithms [1] and tabu search [2] have also been used for network tearing. For these optimization problems, the cost function is formed such that it reflects the features of parallel and distributed processing. However, these methods are computation intensive and involve procedures based on natural selection, crossover and mutation. They also require a large population size and occupy more memory.

This paper presents the application of evolutionary-based discrete particle swarm optimization to the problem of network partitioning. The main advantages of the PSO algorithm are: simple concept, easy implementation, robustness of control parameters, and computational efficiency when compared with mathematical algorithms and other heuristic optimization techniques. Recently, PSO has been successfully applied to various fields of power system optimization such as power system stabilizer design, reactive power and voltage control, dynamic security border identification, economic dispatch and optimal power flow.

The rest of the paper is organized as follows. Section 2 deals with the formulation of the objective function for the network partition problem; the aim of the optimization is to minimize the cost function, which is a measure of the execution time of the applications in the torn network. An overview of DPSO and its implementation for the NP problem is given in Section 3. The algorithm is tested on IEEE standard test systems, and the simulation results are discussed in Section 4. The case studies demonstrate the validity of the algorithm, which attains a near-optimal solution with a small population size and less computational effort.

II. PROBLEM FORMULATION
The objective of the problem is to optimally assign each node of a large interconnected network to a sub-network, subject to constraints. The resulting sub-networks are used for efficient distributed computing and parallel processing of power system analysis. The allocation of the nodes to the sub-networks should be such that the number of nodes in each sub-network and the number of tie lines connecting the sub-networks are well balanced. Hence the conventional cost function [3], which models the computational performance of the partitioned network, is taken as the fitness function for solving this optimization problem.

The objective is to minimize

F = αM² + βL³    (1)


Page 123: ADCOM 2009 Conference Proceedings

where
F — Partition Index
M — Maximum number of nodes in a sub-network
L — Total number of branches between all the sub-networks
α, β — Weighting factors

The first term in the fitness function penalizes the maximum number of nodes in a sub-network, thereby influencing the load balance in the distributed processing of any application. The second term relates to the communication of data between the processes and hence focuses on the number of branches linking the sub-networks. The total fitness function value reflects the overall computation time of power system analysis problems under distributed or parallel processing.
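For illustration, a minimal sketch of evaluating this partition index for one candidate assignment; the encoding here (a flat node-to-cluster map and a branch list) is our simplification of the particle matrix introduced later in Section III:

    def partition_index(cluster_of, branches, alpha=1.0, beta=1.0):
        """F = alpha * M**2 + beta * L**3 for one candidate partition.

        cluster_of: list mapping each node index to its sub-network id
        branches:   iterable of (i, j) node pairs, one per network branch
        """
        sizes = {}
        for c in cluster_of:
            sizes[c] = sizes.get(c, 0) + 1
        M = max(sizes.values())          # largest sub-network
        L = sum(1 for i, j in branches   # tie lines between sub-networks
                if cluster_of[i] != cluster_of[j])
        return alpha * M ** 2 + beta * L ** 3

With α = β = 1 this reproduces the costs reported later: for example, M = 7 and L = 3 give F = 49 + 27 = 76, the cost of the 2-cluster 14 Bus partition in Table 1.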

III. IMPLEMENTATION OF NETWORK PARTITIONING

A. Overview of Discrete Optimization
Particle Swarm Optimization (PSO) was developed by Kennedy and Eberhart through simulation of bird flocking in a two-dimensional space. In this search space, every feasible solution is called a particle, and several such particles in the search space form a group. The particles tend to optimize an objective function using the knowledge of their own best position attained so far and the best position of the entire group. Hence the particles in a group share information among themselves, leading to increased efficiency of the group. The original PSO treats nonlinear optimization problems with continuous variables. However, practical engineering problems are often combinatorial optimization problems, for which Discrete Particle Swarm Optimization (DPSO) can be used.

In a physical n-dimensional search space, the position of a particle p is represented as a vector X_p = (X_1, X_2, X_3, ..., X_n) and its velocity as V_p = (V_1, V_2, V_3, ..., V_n). Let Pbest_p be the best position attained by particle p so far, and Gbest the best position attained by the whole group so far. In DPSO [4][5], the particles are initially set to binary values randomly. The probability of the particle making a decision is a function of the current position, velocity, Pbest and Gbest. The velocity of the particle, given by equation (2), determines a probability threshold. The sigmoid function shown in equation (3) imposes limits on the velocity updates. The threshold is constrained within the range [0, 1] such that higher velocities likely choose 1 and lower velocities choose 0. The position update is then done using the velocity, as shown in equation (4).

V_p^(k+1) = ω·V_p^k + C1·rand1·(Pbest_p^k − X_p^k) + C2·rand2·(Gbest^k − X_p^k)    (2)

S(V_p^new) = 1 / (1 + e^(−V_p^new))    (3)

If rand < S(V_p^new) then X_p^new = 1; else X_p^new = 0    (4)

where
ω — Weight parameter
C1, C2 — Weight factors
rand, rand1, rand2 — Random numbers between 0 and 1
X_p^(k+1), X_p^k — Position of the particle at the (k+1)th and kth iteration
V_p^(k+1), V_p^k — Velocity of the particle at the (k+1)th and kth iteration
Pbest_p^k — Best position of particle p until the kth iteration
Gbest^k — Best position of the group until the kth iteration
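As a hedged illustration of equations (2) through (4), the following sketch updates one flat binary position vector; the inertia weight is fixed at 0.65, the midpoint of the range quoted in the next subsection, and the column constraints of the actual particle matrix are handled separately by the mutation step described later:

    import math
    import random

    def dpso_step(x, v, pbest, gbest, w=0.65, c1=2.0, c2=2.0):
        """One update of a binary position vector per equations (2)-(4)."""
        new_x, new_v = [], []
        for xi, vi, pi, gi in zip(x, v, pbest, gbest):
            vi = (w * vi
                  + c1 * random.random() * (pi - xi)    # pull toward Pbest
                  + c2 * random.random() * (gi - xi))   # pull toward Gbest
            s = 1.0 / (1.0 + math.exp(-vi))             # sigmoid, eq. (3)
            xi = 1 if random.random() < s else 0        # threshold, eq. (4)
            new_x.append(xi)
            new_v.append(vi)
        return new_x, new_v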

B. Evolutionary Based Discrete Optimization
The PSO gains self-adapting properties by incorporating any one of the evolutionary techniques such as replication, mutation and reproduction. In order to improve the convergence of PSO, mutation is generally performed on the weight factors. Also, if there is no significant change in the Gbest for a considerable amount of time, then mutation can be applied. In this work, mutation is used to update the particles, as they are constituted by binary values only. The entire position update process of the particles is done based on a mutation probability normally above 0.85 [6]. This ensures that the particles are not trapped in their local optimum and do not deviate far from the current position either. The process of the algorithm and its implementation aspects are described in detail in the following sections.

1) Generation of the Particles
The objective of the network partition problem is to allocate every node of the power system network to a sub-network, such that the nodes are equally distributed and the number of lines linking them is minimum. Hence the structure of a particle is framed as a matrix of dimension (nc x nn), where ‘nc’ is the number of clusters or sub-networks (a user-defined quantity) and ‘nn’ is the total number of nodes in the power system network. It is ensured that each node is assigned to exactly one cluster, i.e., each column of the particle array always sums to 1. The velocity of the particles corresponds to a threshold probability. Initially, all velocities are set to zero vectors, and all solutions, including the Pbest and Gbest, are undefined.

The position of particle ‘p’ in the search space is created as follows:
Step 1) Set j = nc, the number of sub-networks, and set k = 1, where k varies from 1 to nn.
Step 2) Generate a random number R1 in the range


Page 124: ADCOM 2009 Conference Proceedings

[1, nn/nc].
Step 3) Set the particle entries to 1 for nodes k through k + R1 in the current row, and set k = k + R1.
Step 4) Set j = j − 1; if j > 0, go to step 2; otherwise go to step 5.
Step 5) Repeat steps 1 to 4 for all the particles.
Step 6) Stop the initialization process.

       N1 N2 ... Nnn-1 Nnn
C1      1  1  0  0  0  0
C2      0  0  1  0  0  0
...
Cnc-1   0  0  0  1  0  0
Cnc     0  0  0  0  1  1

Figure 1. Structure of Particles

The particle structure of the network partition problem is shown in Figure 1 for a system with ‘nc’ clusters or sub-networks and ‘nn’ nodes.
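A minimal sketch of these initialization steps (hypothetical helper, assuming nn ≥ nc; the last cluster absorbs the remaining nodes so that every column of the particle sums to 1):

    import random

    def init_particle(nc, nn):
        """Assign contiguous runs of nodes to each of nc clusters."""
        particle = [[0] * nn for _ in range(nc)]
        k = 0
        for c in range(nc):
            if c == nc - 1:
                run = nn - k                  # last cluster takes the rest
            else:
                remaining = nc - c - 1        # clusters still to be filled
                upper = min(max(1, nn // nc), nn - k - remaining)
                run = random.randint(1, max(1, upper))
            for node in range(k, k + run):
                particle[c][node] = 1
            k += run
        return particle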

2) Evaluation of the Particles
The particles in the solution space are evaluated by means of the fitness function given by equation (1). The fitness function is such that the results of the optimization problem balance the computational load on the processors and reduce the communication overhead as well. The choice of the exponents of M and L determines the order of solution times required for the sub-network solution and the full solution of the interconnected network in a typical parallel processor solution. Once the particles are evaluated, the Pbest and Gbest are selected from the swarm in that iteration as follows:

Step 1) Set j = 1 and let p be the number of particles.
Step 2) If F(X_j) improves on the fitness of Pbest_j, set Pbest_j = X_j.
Step 3) If the best F(X_j) in the swarm improves on the fitness of Gbest, set Gbest = X_j.
Step 4) Set j = j + 1; if j < p, go to step 2; otherwise go to step 5.
Step 5) Stop the evaluation process.

3) Modification of the Particles
To modify the particles in the solution space for the next iteration, the velocities of the particles are obtained from equation (2). In this process of updating the velocity, the weight factors must be known a priori. It has been shown that, irrespective of the problem, the following parameters are appropriate: C1 = C2 = 2.0, ωmax = 0.9, ωmin = 0.4. In this paper, the weighting function is kept constant for all iterations and is taken as the average of its range [7-9]. Once the velocities are updated for the next iteration, the particles are updated based on the sigmoid function given by equation (3). Since this is a discrete optimization problem, there exist constraints on the redundancy of the nodes in the sub-networks. The particles, as depicted in Figure 2, are therefore modified using the following procedure based on a high mutation probability:

Step 1) Set N = the number of nodes and M = the number of clusters.
Step 2) Select a column at random from 1 to N.
Step 3) Find the row index whose element is 1.
Step 4) Select a row at random from 1 to M whose element is 0.
Step 5) Flip the elements using the condition given by equation (4).
Step 6) Repeat the above steps for all the particles.

Before:
       N1 N2 ... Nnn-1 Nnn
C1      1  0  0  0  0  0
C2      0  1  1  0  0  0
Cnc-1   0  0  0  1  0  0
Cnc     0  0  0  0  1  1

After:
       N1 N2 ... Nnn-1 Nnn
C1      1  0  0  0  0  0
C2      0  0  1  0  0  0
Cnc-1   0  1  0  1  0  0
Cnc     0  0  0  0  1  1

Figure 2. Modification of Particle Structure
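The modification procedure can be sketched as follows (our illustration: one randomly chosen node is moved to a different randomly chosen cluster, which preserves the one-assignment-per-node constraint; p_mutate reflects the high mutation probability cited above):

    import random

    def mutate_particle(particle, p_mutate=0.85):
        """Move one randomly chosen node to a different cluster."""
        if random.random() > p_mutate:
            return
        nc, nn = len(particle), len(particle[0])
        col = random.randrange(nn)                            # pick a node
        row = next(r for r in range(nc) if particle[r][col])  # its cluster
        new_row = random.choice([r for r in range(nc) if r != row])
        particle[row][col], particle[new_row][col] = 0, 1     # flip both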

4) Stopping Criteria

Generally, for evolutionary algorithms, the solution is reached if the fitness function remains constant for a considerable number of iterations, or a maximum number of iterations can be fixed. In this paper the latter is followed.

IV. CASE STUDIES
To assess the efficiency of the proposed method of network partitioning, it was tested using the data of the standard IEEE 14 Bus, 30 Bus and 118 Bus test systems. The results obtained are compared with those of other methods like Simulated Annealing (SA) and Genetic Algorithm (GA). Simulation was done using Matlab on a high performance computing Linux cluster; it is a 2048-node Linux cluster which aids parallel and distributed processing. The parameters [10] used for simulation of the algorithm are as follows: population size = 100, maximum iterations = 50, mutation rate = 0.1.

The number of clusters is varied depending upon the size of the network, but for comparison purposes the results for 2 and 3 clusters are discussed here. Figure 3(a) shows the IEEE 14 Bus system partitioned into two clusters optimally, and Figure 3(b) shows a worst-case partition of the same system present


Page 125: ADCOM 2009 Conference Proceedings

in the population. Similarly Figure 4(a) shows the IEEE 30 Bus System partitioned into three clusters. This is the optimal partition obtained from the DPSO applied to the network partitioning problem and Figure 4(b) shows a worst case partition of the same system present in the population with the same configuration.

Figure 3(a). Optimal Partition of 14 Bus System

Figure 3(b). Viable Partitioning of 14 Bus System

The partitions so obtained can be used for distributed computing and parallel processing of large-scale interconnected power systems. This will enhance real-time simulation and can also be applied in grid computing. Table 1 shows the maximum number of nodes in a sub-network and the number of branches linking the sub-networks. It also gives an account of the cost of the partition and the execution time. It can be concluded from the results that, as the size of the system increases, a better partition can be obtained by increasing the number of clusters. It is clear that 2 clusters are optimal for the 14 Bus system and 3 clusters are optimal for the 30 Bus system and higher. This observation is highlighted in Table 1.

Figure 4(a). Optimal Partition of 30 Bus System

Figure 4(b). Viable Partitioning of 30 Bus System

TABLE 1. COMPARISON OF COST AND TIME FOR PARTITIONING IEEE STANDARD SYSTEMS INTO 2 AND 3 CLUSTERS WITH α = 1 AND β = 1

Test Case | No. of Clusters | M | L | Cost | Time (Sec)
14 Bus  | 2 | 7  | 3  | 76   | 4.14
14 Bus  | 3 | 8  | 4  | 128  | 6.68
30 Bus  | 2 | 16 | 7  | 599  | 7.96
30 Bus  | 3 | 12 | 7  | 487  | 15.36
118 Bus | 2 | 59 | 18 | 9313 | 32.87
118 Bus | 3 | 42 | 18 | 7596 | 61.54


Page 126: ADCOM 2009 Conference Proceedings

In order to test the efficiency of the algorithm under different conditions, simulation is performed with different configurations of parameters such as weights, population size and mutation rates. Table II shows the partitions of the 14 Bus and 30 Bus systems with different weight factors.

TABLE II. COMPARISON OF COST FOR IEEE 14 BUS AND 30 BUS SYSTEMS WITH DIFFERENT CONFIGURATIONS

Weight | No. of Clusters | Test System | M | L | Cost
α = 1, β = 1 | 3 | 14 Bus | 8  | 4 | 128
α = 1, β = 1 | 3 | 30 Bus | 12 | 7 | 487
α = 3, β = 1 | 3 | 14 Bus | 5  | 5 | 200
α = 3, β = 1 | 3 | 30 Bus | 16 | 6 | 984
α = 1, β = 3 | 3 | 14 Bus | 5  | 5 | 400
α = 1, β = 3 | 3 | 30 Bus | 15 | 6 | 873

The performance of the optimization algorithm is shown by means of the convergence characteristics in Figures 5(a) and 5(b). Simulation was done with the standard parameters mentioned earlier, and convergence was attained when the system was partitioned into two and three clusters. Unlike SA [1] and GA [3], DPSO reaches a near-optimal solution in fewer iterations and with fewer particles in the population.

Figure 5(a). Convergence of 14 Bus System

Figure 5(b). Convergence of 30 Bus System

V. CONCLUSION
This paper presents a Discrete Particle Swarm Optimization method for optimal tearing of a power system network. The algorithm is simple and can be used for clustering-related problems in any field. In this method, DPSO is used to minimize the cost function, which is actually an estimate of the execution time of a sub-network in the distributed computing environment. The algorithm was implemented in a high performance computing environment which supports distributed computing and parallel processing. The simulation results show that DPSO can find a near-optimum solution under different operating conditions. The torn sub-networks can aid parallel processing, thereby improving the speed of intensive power system computations.

REFERENCES

[1] M. R. Irving and M. J. H. Sterling, “Optimal network tearing using simulated annealing,” IEE Proceedings on Generation, Transmission and Distribution, Vol. 137, No. 1, Jan 1990, pp. 69-72.
[2] C. S. Chang, L. R. Lu and F. S. Wen, “Power system network partitioning using Tabu Search,” Electric Power Systems Research, Vol. 49, 1999, pp. 55-61.
[3] H. Ding, A. A. El-Keib and R. Smith, “Optimal clustering of power networks using genetic algorithms,” Electric Power Systems Research, Vol. 30, 1994, pp. 209-214.
[4] Jong-Bae Park, Ki-Song Lee, Joong-Rin Shin and Kwang Y. Lee, “A Particle Swarm Optimization for Economic Dispatch With Nonsmooth Cost Functions,” IEEE Transactions on Power Systems, Vol. 20, No. 1, February 2005.
[5] Qian-Li Zhang, Xing Li and Quang-Anh Tran, “A Modified Particle Swarm Optimization Algorithm,” Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.
[6] Li-Yeh Chuang, Hsueh-Wei Chang, Chung-Jui Tu and Cheng-Hong Yang, “Improved binary PSO for feature selection using gene expression data,” Computational Biology and Chemistry, Elsevier, Vol. 32, 2008, pp. 29-38.
[7] X. H. Shi, X. L. Xing, Q. X. Wang, L. H. Zhang, X. W. Yang, C. G. Zhou and Y. C. Liang, “A Discrete PSO Method for Generalized TSP Problem,” Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.
[8] H. Shayeghi, M. Mahdavi and A. Kazemi, “Discrete Particle Swarm Optimization Algorithm Used for TNEP Considering Network Adequacy Restriction,” International Journal of Electrical, Computer, and Systems Engineering, 3:1, 2009.
[9] Zhongxu Li, Yutian Liu, Rushui Liu and Xinsheng Niu, “Network Partition for Distributed Reactive Power Optimization in Power Systems,” IEEE International Conference on Networking, Sensing and Control, 6-8 April 2008, pp. 385-388.
[10] P. Kanakasabapathy and K. Shanti Swarup, “Optimal Bidding Strategy for Multi-unit Pumped Storage Plant in Pool-Based Electricity Market Using Evolutionary Tristate PSO,” IEEE International Conference on Sustainable Energy Technologies, ICSET 2008, 24-27 Nov. 2008, pp. 95-100.


Page 127: ADCOM 2009 Conference Proceedings

An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an Asynchronous Distributed Systems

Suman Kundu, Department of Information Technology, Jadavpur University, Salt Lake, Kolkata, [email protected]

Dr. Uttam Kr. Roy, Department of Information Technology, Jadavpur University, Salt Lake, Kolkata - 700098, [email protected]

Abstract—In a highly dynamic asynchronous distributed network, node failure (or recovery) and link failure (or recovery) trigger topological changes. In many cases, reconstructing the minimum spanning tree after each such topological change is very much required.

In this paper, we describe a distributed algorithm based on message passing to reconstruct the minimum spanning tree after a link failure. The algorithm assumes that no further topological changes occur during its execution. The proposed algorithm requires significantly fewer messages to reconstruct the spanning tree in comparison to other existing algorithms.

I. INTRODUCTION

A distributed network consists of several nodes and connections among them. Each node is a computational unit, and the connections between them can send and receive messages in a duplex manner. Multiple paths may exist between a pair of nodes. A Minimum Spanning Tree (hereafter referred to as MST) of such a network is the minimally connected tree that contains all the nodes of the network. Applications of MST include effective communication in distributed systems, effective file searching and sharing in peer-to-peer networks, gateway routing in local area networks, bandwidth allocation in multi-hop radio networks, and other computational scenarios. Usually, a cost is associated with each link. The cost may indicate the distance between two nodes, the time required to send or receive data packets, the bandwidth of the communication channel, or any other parameter. An MST always has the minimum cumulative cost within the network.

A distributed system is dynamic in nature, i.e., in any distributed network, topological changes occur with respect to time. A change can occur due to the deletion or recovery of nodes and links. In many situations, it is important to reconstruct the MST after each topological change. The main hurdle in reconstructing the MST arises from the asynchronous nature of the system. Moreover, a node only knows local information. If a topological change occurs, it must be propagated to each node via message communication. It is also possible that some part of the network gets the latest knowledge, whereas some portion does not. The algorithm should address this issue as well.

Several algorithms for constructing an MST in distributed systems were proposed in the last three decades. Most of these MST construction algorithms are applicable only to a static topology. In this paper, we propose an algorithm based on message passing to reconstruct an MST that works seamlessly even in a dynamic topology. Our algorithm considers a single link failure and assumes that no further topological changes occur during the execution of the algorithm. It is also shown that the total number of messages required to reconstruct the spanning tree is significantly less in comparison to other existing algorithms.

The rest of the paper is organized as follows: Section II describes the related work and an overview of our result. Section III gives the description of the distributed algorithm, its analysis and a proof of correctness. In Section IV, we provide simulation results; Section V concludes the overall algorithm, and finally, in Section VI, we point out the further research areas we are working on.

II. RELATED WORK

In their pioneering paper [2], Gallager, Humblet and Spira proposed one of the first distributed protocols to construct an MST, in the year 1983. The protocol of [2] is further improved in the protocols of [3], [4], [5], [6], [7] and [8]. In [9], some flaws of [5] are rectified. All these protocols for constructing an MST were developed for a static topology. Some of them address message efficiency and some of them address time efficiency as the performance measure for the algorithm.

However, distributed systems are dynamic, as described in the previous section. Researchers are working on protocols which are resilient in nature and adapt to topological changes. A few such algorithms are given in [1], [12] and [14]. In their paper [10], B. Das and M. C. Loui provide a serial and a parallel algorithm to address a similar problem, later improved by Nardelli, Proietti and Widmayer in their paper [11]. These algorithms do not address the distributed version of the problem. In paper [13], P. Flocchini, T. M. Enriquez, L. Pagli, G. Prencipe and N. Santoro provided a distributed version of the same problem. In [13], the authors provided


Page 128: ADCOM 2009 Conference Proceedings

the precomputed node replacement scheme. In our paper, we provide an improved version of the distributed algorithm of [1] for a single link failure. In the following subsection, we describe the response of [1] to a link failure.

A. Basic Algorithm of [1]

In the algorithm of [1], C. Chang, I. A. Cimett and S. P. R. Kumar proposed a resilient algorithm which reconstructs the MST after link failure and recovery. The complexity of the algorithm for a single link failure is O(e), where e is the number of links in the network.

A link failure is a process of fragment expansion, i.e., the failure of a link which is part of the MST breaks the MST into two different fragments. The failed link initiates the process of recovery in the adjacent nodes. The initiator node generates a new fragment identity. In the algorithm of [1], the authors suggest two approaches to generate the new fragment identity such that no conflict occurs between subsequent topological changes. The first approach is to include the identity of all nodes of the fragment in the fragment identity. The second approach is to maintain a counter for each link; this counter counts the link failures, and its value is included along with the weight of the link and the identity of the node when generating the new fragment identity. In the first approach, the fragment identity of a large fragment becomes very large. So, the second approach is more efficient in terms of message size.
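For illustration, the second scheme amounts to a constant-size tuple (hypothetical helper; the field order follows the w(u, v), u, c form used below):

    def new_fragment_identity(link_weight, node_id, failure_count):
        """Constant-size fragment identity: the failed link's weight, the
        generating node's id and the link's failure counter. Tuples compare
        lexicographically, giving a total order independent of fragment size."""
        return (link_weight, node_id, failure_count)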

Suppose a link w(u, v) is broken at some time, and let u be the parent of v. In response to the link failure, u generates a new identity for its fragment of the form (w(u, v), u, c), where c is the counter of topological changes of the link w(u, v). Then u forwards this fragment identity to the root of its fragment. In the case of v, after generating the fragment identity (w(u, v), v, c), it marks itself as the root of its fragment. After getting the new fragment identity, the root of each of the two fragments changes its fragment identity, starts broadcasting REIDEN<id> over the tree links and waits for acknowledgments. Any intermediate node, upon getting the REIDEN<id> message, changes its fragment identity to the id and sends the same message to its child nodes. A leaf node, after getting the REIDEN<id>, changes its fragment identity and sends REIDEN ACK<id> to its parent. An intermediate node sends REIDEN ACK<id> to its parent only after getting REIDEN ACK<id> from all of its children. Receipt of the REIDEN ACK<id> message indicates that all nodes of the fragment are aware of the new identity value. The root now changes its state to find and initiates the find minimum outgoing edge (hereafter referred to as MOE) phase by sending a FINDMOE message to its children. When a node receives a FINDMOE message, it changes its state to find. In the find state, each node starts to send a TEST<id> message via each non-tree link. A TEST<id> message is answered by either ACCEPT<id> or REJECT<id>. An ACCEPT<id> message indicates that the edge is outgoing, leading to another fragment. An important thing to remember here is that the ACCEPT or REJECT message should return the identity number of the TEST message; this helps to determine whether the message belongs to the current failure or a previous one. After identifying the MOE, a node propagates FINDMOE ACK<w(MOE)> upward, where w(MOE) is the locally known best outgoing weight, either that of its own MOE or of an MOE received from its children (whichever is minimum). After receiving FINDMOE ACK<w(MOE)>, the root sends CHANGE ROOT<id> along the same path on which it received the MOE. The CHANGE ROOT<id> reaches the node on which the MOE of the fragment is incident. That node marks itself the new root of the fragment and sends a CONNECT message over the MOE. The connect subroutine works the same as in algorithm [2]: it merges the two fragments sharing the same MOE, and the next iteration starts.

B. Overview of Our Results

When considering a single link failure, our approach provides a significant improvement over the algorithm of [1] in the total number of messages required during reconstruction. Also, the message size for some control messages is slightly improved.

After a link failure, the fragment which contains the root node of the MST is referred to as the root fragment in this paper. If the root fragment contains E′ edges, then our algorithm requires 2 × E′ fewer messages to reconstruct the MST. Our approach is to reuse the previously known fragment identity (historical data, so to speak) for the root fragment. However, how the algorithm evolves if another link failure occurs during the execution is still under investigation.

III. ALGORITHM TO RECONSTRUCT THE MST AFTER LINK FAILURE

We closely followed the response of the algorithm of [1] to a single link failure and found some areas of improvement. In the following subsections, we describe the network model, our observations regarding the algorithm of [1], our contribution to improving the algorithm, a description of the modified distributed algorithm, an analysis of the outcome and a proof of correctness.

A. Network Model

The communication model for the algorithm is an asynchronous network represented by an undirected weighted graph of N nodes. The graph is represented by G(V, E), where V is the set of nodes and E ⊂ V × V is the set of links. Each node is a computing unit consisting of a processor, a local memory, and an input and output queue. The input (output) queue is an unlimited-size buffer for sending and receiving messages. A unique identification number is associated with each node (node i represents the node with the identification number i).

Each link (u, v), assigned a fixed weight w(u, v), is a bidirectional communication line between node u and node v. Each node has only local information, i.e., each node is aware of its identification number and the weights of its incident links. After construction of the MST, each node is aware of two additional pieces of information: first, the adjacent edge leading to its parent node in the MST, and second, the adjacent edges leading to its child nodes in the MST.
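A minimal sketch of the per-node local state implied by this model (field names are ours, not from the paper):

    from dataclasses import dataclass, field
    from typing import Dict, Optional, Set

    @dataclass
    class NodeState:
        """Local knowledge of one node after MST construction."""
        node_id: int
        link_weights: Dict[int, float] = field(default_factory=dict)  # neighbor -> w
        parent: Optional[int] = None          # edge leading to the parent
        children: Set[int] = field(default_factory=set)  # edges to children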


Page 129: ADCOM 2009 Conference Proceedings

Nodes can communicate only via messages. Messages may be lost due to link failure during transmission. However, if the link is functioning, messages can be sent from either end and are received by the other end within a finite, undeterminable time, without error and in sequence.

Also, if a link failure occurs, the failure event triggers the recovery process at each end of the failed link, and the recovery process is initiated by the corresponding node.

B. Observation

In the algorithm of [1], the reconstruction process works in two phases. In Phase-I, the root node informs each node of the fragment of the new fragment identity, and in Phase-II, each node finds its own MOE and forwards the MOE to the root. The root then identifies the MOE of the fragment. Finally, the fragment sends a CONNECT message via the MOE.

After a failure, each fragment changes its fragment identity to a new one. Also, each message passed to a neighbor contains the fragment identity along with the control information. This is used because, if overlapping link failures occur, message responses can be filtered based on the fragment identity in such a way that only the current failure is processed during the execution.

C. Our Contribution

Our contribution to the algorithm is that we can use historical data, namely the previously known fragment identity, for one fragment. When we use the previously known fragment identity, the fragment with the older identity enters its Phase-II without executing Phase-I. For our algorithm, we use the previously known fragment identity for the root fragment. Also, the FAILURE message propagating from the failed link to the root of the root fragment no longer needs to carry a newly generated fragment identity; that means the FAILURE message size is also reduced. The difficulty with this approach is that, when a TEST<id> message is received, it is possible that the fragment identity of the receiving node is not updated yet. The node may be part of the same fragment or of another fragment still in Phase-I (propagation of the new fragment identity not yet completed). However, if a node receives a TEST message in Phase-II, then its fragment identity is correct. So, to avoid a conflicting response to a TEST<id> message, the response is delayed until the node enters Phase-II.

Also, it is assumed that no further failures occur during the execution, so it is possible to reduce the size of the control messages. For example, the ACCEPT and REJECT messages do not need to carry the fragment identity back to the sender.

D. Description of the protocol

In the beginning, each node maintains a collection of its adjacent edges, sorted by link cost. During its lifetime, an adjacent link can have one of the following statuses:

1) Basic - the link is yet to be processed
2) Parent - the link leads to the parent
3) Child - the link leads to a child

Fig. 1. MST of a random network

4) Rejected - the link leads to a node included in the same fragment

5) Down - the link is not working

It is assumed that the MST has already been constructed using some distributed protocol. A link failure triggers the recovery process at both ends of the link. Suppose a failure occurs on the link e = (u, v, w, c), where u and v are the nodes connected by the link, w is the weight of the link, and c is the status-change count of the link. If the link is not included in the MST, i.e., it is in either the Basic or the Rejected state, then u and v simply change the link status to Down and do nothing else.

Fig. 2. A non-MST link failure

Otherwise, the nodes mark the link Down and respond in the following manner:

• When the previous status of the link is Parent - the node marks itself as the root of the newly created fragment. It then marks all its Rejected links as Basic. This is necessary because such links may now lead to other fragments due to the topological change. The node generates a new fragment identity, resets its own fragment identity, and enters Phase-I by sending the INIT<fid> message to its children. Here fid is the new fragment identity as described in [1], i.e., it includes the weight and the failure count along with the identity of the node. If u is the parent of v on the failed edge e, then after the failure v marks itself as root, changes its fragment identity to fid = (w(u, v), v, c), and initiates Phase-I by sending this fid along with the INIT message. After receiving the INIT<fid> message, a node changes its


Page 130: ADCOM 2009 Conference Proceedings

fragment identity to fid and marks all its Rejected links as Basic. It then forwards the message to its children. If the node is a leaf node, it returns a FINISH message to its parent. Each intermediate node waits to receive FINISH messages from all its children and then sends a FINISH message to its parent. A FINISH message received by v (i.e., the root) indicates that every node of the fragment knows the current fragment identity. Then v starts Phase-II by sending the FINDMOE message.

• When previous status of the link is Child - the linkforward the FAILURE message to upward. Note that thenode did not generate new fragment identity to forwardalong with FAILURE message. When the root nodedetects the failure, it initiates the Phase-II directly bysending FINDMOE message to its children.

Fig. 3. MST link failure and response of u and v

• A node receiving a FINDMOE message immediately enters the finding state. In the finding state, each node finds its local MOE. To find the local MOE, a node picks the minimum-weight adjacent edge in the Basic state and sends TEST<fid> over it.

Fig. 4. Phase-II initiated by root of the fragment with FINDMOE message

• A node receives a TEST<fid> message. There are two cases to consider -

– Node executing in Phase-I: the response is delayed until the node itself enters Phase-II by receiving FINDMOE from its parent.

Fig. 5. TEST message and response

– Node executing in Phase-II: in this case, a TEST<fid> message is answered with ACCEPT if the node's fragment identity differs from fid, or with REJECT if its fragment identity is the same as fid.

• Upon receiving a REJECT message, the node picks the next best edge in the Basic state and sends TEST<fid> to test it. If it instead gets an ACCEPT message, which indicates that it has found its local MOE, the node waits for its children's responses. After finding its local MOE, a leaf node propagates the best weight to its parent via a REPORT<wt> message.

• When a REPORT<wt> message is received, an intermediate node compares its local MOE with the wt received from the child and updates its MOE accordingly. After getting REPORT<wt> messages from all of its children, it sends a REPORT<wt> carrying the best weight it knows to its parent. Thus the best weight is propagated to the root node of the fragment. At that point the root sends a CHANGE ROOT message along the path that leads to the MOE.

• The node incident to the MOE of the fragment receives CHANGE ROOT and marks itself as the root of the fragment. It then sends a CONNECT message over the MOE and merges with the fragment sharing the same MOE.

Fig. 6. Root of the fragment is changed and a CONNECT message is sent to merge the fragments
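To make the message flow concrete, the sketch below models the per-node handlers described above in Python. This is our own single-process illustration, not the authors' implementation: the class and method names (Node, on_link_failure, on_test, and so on) are invented for exposition, and the message layer is reduced to a print stub.

# A single-process sketch of the per-node handlers described above.
# Names are illustrative; a real node runs this state machine over an
# asynchronous message layer instead of the print stub used here.

BASIC, PARENT, CHILD, REJECTED, DOWN = range(5)

class Node:
    def __init__(self, nid, links):
        self.id = nid
        self.links = dict(links)   # neighbour id -> (weight, status)
        self.fid = None            # fragment identity
        self.phase = None          # 'I' or 'II'
        self.delayed_tests = []    # TESTs held while still in Phase-I

    def send(self, nbr, msg, *args):            # message-layer stub
        print(f"{self.id} -> {nbr}: {msg} {args}")

    def on_link_failure(self, nbr, count):
        weight, status = self.links[nbr]
        self.links[nbr] = (weight, DOWN)
        if status == PARENT:
            # New-fragment side: reopen Rejected links, take a fresh
            # identity (weight, own id, failure count), start Phase-I.
            for n, (w, s) in self.links.items():
                if s == REJECTED:
                    self.links[n] = (w, BASIC)
            self.fid, self.phase = (weight, self.id, count), 'I'
            for n, (w, s) in self.links.items():
                if s == CHILD:
                    self.send(n, 'INIT', self.fid)
        elif status == CHILD:
            # Root-fragment side: the old identity is reused, so the
            # FAILURE message carries no new fid (the size saving above).
            parent = [n for n, (w, s) in self.links.items() if s == PARENT]
            if parent:
                self.send(parent[0], 'FAILURE')
            else:
                self.on_findmoe()   # this node is the root itself

    def on_findmoe(self):
        self.phase = 'II'
        for nbr, fid in self.delayed_tests:      # answer TESTs held back
            self.send(nbr, 'REJECT' if fid == self.fid else 'ACCEPT')
        self.delayed_tests.clear()
        basic = [(w, n) for n, (w, s) in self.links.items() if s == BASIC]
        if basic:                                # probe cheapest Basic edge
            self.send(min(basic)[1], 'TEST', self.fid)

    def on_test(self, nbr, sender_fid):
        if self.phase != 'II':
            self.delayed_tests.append((nbr, sender_fid))  # delay response
        else:
            self.send(nbr, 'REJECT' if sender_fid == self.fid else 'ACCEPT')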


Page 131: ADCOM 2009 Conference Proceedings

E. Analysis

Compared with the algorithm of [1], in our approach the root fragment directly enters Phase-II. So the INIT<fid> and FINISH messages (REIDEN<id> and REIDEN ACK<id> in the algorithm of [1]) of Phase-I are not required for the root fragment. Suppose the root fragment has E′ edges after the failure. Executing Phase-I would then require sending E′ INIT<fid> messages over E′ links, and likewise E′ FINISH messages over E′ links. That means the reconstruction process of our algorithm requires E′ + E′ = 2E′ fewer messages than the protocol described in [1].

Comparing our approach with the protocol of [1] in terms of message size, some control messages contain very few bits relative to those of [1]. For example, the FAILURE message in the root fragment contains only the control information indicating that a failure has occurred; no fragment identity is sent along with it. As we assume that no further failure occurs during the execution of the protocol, the ACCEPT and REJECT messages also carry only control information; no fragment identity is returned with them.

1) Complexity: Consider a network with N nodes and E edges. The initial MST contains N nodes and N − 1 edges before the failure. Also, let the root fragment contain N′ nodes and E′ edges. If the height of the root fragment is h′, then propagating the failure message to the root of the root fragment requires O(h′) messages. Propagating the new fragment identity in the other fragment requires O(N − N′) messages, because this information travels over tree links. Similarly, sending the FINDMOE request and merging two fragments require O(N) messages, since these messages are also sent over tree links. However, finding the MOE of the fragments requires sending messages over O(E) links of the network. So the message complexity of the algorithm on a link failure is O(E).

F. Proof of Correctness

First, we prove several lemmas used in the distributed algorithm, and then we show that the algorithm generates the minimum-weight spanning tree upon completion.

Lemma III-F.1. Before any topological change, each node of the network has information about the tree links incident to it.

Proof: It is assumed that the MST is initially constructed using some distributed protocol. In our simulation, we use the algorithm of [7] to construct the MST. Each node maintains a list of incident edges and their statuses as described in Section III-D, i.e., nodes are aware of the links leading to their parent and the links leading to their children. These parent and child links are the tree links included in the MST. Hence, each node is aware of the tree links incident to it before any topological change.

Lemma III-F.2. The roots of the fragments receive failure notification within a finite time.

Proof: A failed link breaks the MST into two fragments and notifies both ends of the link of the failure. The node attached to the root fragment propagates the FAILURE message toward the root. The other node marks itself root of the new fragment and generates the new fragment identity. Each message is assumed to reach its destination in finite time and in sequence whenever the link is working. Also, it is assumed that no further failure occurs during the execution. Hence, the failure notification is correctly received by the root of each fragment in finite time.

Lemma III-F.3. On each node, the fragment identity is updated before the start of Phase-II, according to the latest link failure.

Proof: The root fragment reuses the previously known fragment identity of the existing MST, so it does not need to update the fragment identity; it enters Phase-II directly, and all nodes belonging to the root fragment are already aware of the fragment identity. For the other fragment, the newly generated identity is propagated to the child nodes by the INIT<fid> message. A leaf node returns a FINISH message to its parent after updating its fragment identity. Only after receiving FINISH messages from all of its children does an intermediate node send a FINISH message to its parent. As messages are assumed to reach their destination in sequence and in finite time, a FINISH message received by the root of the fragment implies that every node has already updated its fragment identity. The root node then initiates Phase-II for that fragment. Thus, when a node is in Phase-II, its fragment identity is always updated with the latest link failure.

Lemma III-F.4. Each fragment starts its Phase-II execution in finite time.

Proof: As no link failure occurs during the execution of the protocol, all messages of Phase-I are properly responded to by the nodes in finite time, and Phase-I terminates when a FINISH message is received by the root of each fragment. The root node then initiates Phase-II by sending the FINDMOE message. Hence, each fragment starts its Phase-II in finite time.

Lemma III-F.5. Each fragment finds its MOE within a finite time.

Proof: After getting the FINDMOE message, each node sends a TEST message over its minimum-weight non-tree link (one not in the Down status) to test whether the link leads to another fragment. The node then waits for the response from the other end. The TEST message correctly reaches the other end, because the link is not marked Down and message loss is excluded by assumption. The response to a TEST message is delayed until the receiving node starts executing Phase-II. From the previous lemma, the TEST message is responded to within a finite time, because the responder node enters Phase-II within a finite time. The response is also known to be correct, because in Phase-II the fragment identity has already been updated after the failure. Now, let El be the local minimum outgoing


Page 132: ADCOM 2009 Conference Proceedings

edge and Ec the minimum outgoing edge forwarded by its children. Then an intermediate node forwards MOE = min(El, Ec) to its parent. Thus, at the root node, the MOE of the fragment is computed as MOE = min(El, Ec) within a finite time.

Theorem III-F.1. The algorithm reconstructs the spanning tree in finite time, and the resulting spanning tree is the minimum-weight spanning tree of the network.

Proof: When the minimum-weight outgoing edge MOE is determined by the root, the fragment changes its root to the node on which the MOE is incident. That node then sends the CONNECT message through the MOE. If the MOE found by one fragment is also the MOE of the other fragment, the two fragments merge into a single fragment. If no other fragment remains, this is the desired spanning tree. As merging is only possible if both fragments agree that the MOE is common, the algorithm clearly does not produce any cyclic path.

Considering the case where only a single link failure occurs, we can easily derive that the MOE of one fragment is also the MOE of the other fragment (since network links are uniquely weighted). Hence, the fragments merge at the MOE and produce a spanning tree.

Moreover, there is no possibility of any other outgoing edge with weight less than the MOE, because in the process only the minimum-weight edge is filtered and forwarded from the leaf nodes to the root node (MOE = min(El, Ec), from the last lemma). Finally, the root node determines the MOE of the fragment. So the merging occurs at the minimum possible weight edge between the two fragments. Hence, the merged spanning tree has the minimum collective weight in the system.

Thus, upon termination, the algorithm reconstructs the spanning tree of minimum weight, i.e., the MST of the network.

IV. EXPERIMENT AND RESULTS

To evaluate the output, we simulated several network graphs using Network Simulator v2 (NS2 2.29) [15]. We constructed the initial MST using the algorithm of [7]. The figures below show the results of one experiment.

Fig. 7. Initial Graph and Constructed MST

The example graph has the vertex set V, edge set E, and corresponding weight set W given below:

Vertices:

V = {0, 1, 2, 3, 4, 5}

Fig. 8. Recovered MST after failure (in red)

Edges:

E = {e1 = (0, 5), e2 = (0, 2), e3 = (0, 1), e4 = (1, 2), e5 = (1, 4), e6 = (1, 3), e7 = (2, 4), e8 = (2, 3)}

Weights:

W = {w(e1) = 4, w(e2) = 8, w(e3) = 15, w(e4) = 10, w(e5) = 3, w(e6) = 5, w(e7) = 6, w(e8) = 7}
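As a cross-check, the following self-contained Python sketch (ours, not the NS2 simulation code) builds this graph, computes the initial MST with Kruskal's algorithm, and recomputes the tree after deleting e7:

# Illustrative sketch: build the example graph, compute the MST with
# Kruskal's algorithm, and recompute it after the failure of e7 = (2, 4).

def kruskal(n, edges):
    """edges: list of (weight, u, v); returns the list of MST edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two fragments
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

edges = [(4, 0, 5), (8, 0, 2), (15, 0, 1), (10, 1, 2),
         (3, 1, 4), (5, 1, 3), (6, 2, 4), (7, 2, 3)]

print("initial MST:", kruskal(6, edges))
# e7 = (2, 4) fails: drop it and recompute. A distributed protocol
# instead repairs the two fragments with one new edge (the fragment MOE).
survivors = [e for e in edges if (e[1], e[2]) != (2, 4)]
print("recovered MST:", kruskal(6, survivors))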

The number of messages required for the algorithm to terminate is tabulated in Table I. In Scenario 1, link e7 fails and initiates the recovery process. Rerunning the construction algorithm (i.e., the algorithm of [7]) after the failure requires 114 messages, whereas performance improves if the reconstruction algorithm (i.e., the algorithm of [1]) is used. Our modified algorithm takes fewer messages still than the existing reconstruction algorithm of [1].

TABLE I. LINK FAILURE VS. REQUIRED MESSAGES

Messages required for:

Broken Link | MST Construction | Algorithm of [1] | Modified Algorithm
Scenario 1: Single link failure
e7          | 114              | 59               | 55
Scenario 2: Single link failure
e6          | 114              | 54               | 46
Scenario 3: Link failures one after another
e7          | 114              | 59               | 55
e6          |                  | 50               | 50
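These savings are consistent with the 2E′ analysis above: the modified algorithm saves 59 − 55 = 4 messages in Scenario 1, corresponding to a root fragment with E′ = 2 edges, and 54 − 46 = 8 messages in Scenario 2, i.e., E′ = 4.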

V. CONCLUSION

In this paper, we have presented a distributed algorithm for reconstructing a Minimum Spanning Tree after the deletion of a link. The problem can also be solved using the protocol of [1]. We showed that, for the single-link-deletion scenario, our protocol reconstructs the MST with 2E′ fewer messages than the protocol of [1], where E′ is the number of links in the root fragment. If we consider an MST of large depth where the link failure occurs very close to a leaf node (i.e., E′ is much greater than E − E′), then our algorithm performs much better.


Page 133: ADCOM 2009 Conference Proceedings

However, when E′ is zero, i.e., the link failure occurs at the root node, the algorithm completes without any improvement in the total number of messages. Scenario 3 illustrates this: for the second link failure there is no improvement in the total number of messages.

VI. FURTHER WORK

When a TEST<fid> message is delayed until the node enters its finding state, we could use that same TEST<fid> to determine whether the adjacent edge should be Rejected or Accepted, instead of sending another TEST<fid> message over the same edge. However, this approach may lead to other difficulties due to the asynchronous nature of the system, and we are currently working on it.

Whenever the failure occurs at the root or very close to the root, the improvement is close to zero. We are working on making the algorithm use historical data in such a way that it yields improvements in those scenarios as well.

We are also investigating how the algorithm can be modified so that it tolerates topological changes during its execution.

REFERENCES

[1] C. Cheng, I. A. Cimet, and S. P. R. Kumar, "A protocol to maintain a minimum spanning tree in a dynamic topology," Computer Communications Review, vol. 18, no. 4, pp. 330-338, Aug. 1988.

[2] R. Gallager, P. Humblet, and P. Spira, "A distributed algorithm for minimum-weight spanning trees," ACM Transactions on Programming Languages and Systems, vol. 5, no. 1, pp. 66-77, January 1983.

[3] F. Chin and H. Ting, "An almost linear time and O(n log n + e) messages distributed algorithm for minimum-weight spanning trees," in Proc. 26th IEEE Symp. on Foundations of Computer Science, pp. 257-266, 1985.

[4] E. Gafni, "Improvement in the time complexities of two message optimal protocols," in Proc. ACM Symp. on Principles of Distributed Computing, 1985.

[5] B. Awerbuch, "Optimal distributed algorithm for minimum weight spanning tree, counting, leader election, and related problems," in Proc. Symp. on Theory of Computing, pp. 230-240, May 1987.

[6] J. Garay, S. Kutten, and D. Peleg, "A sub-linear time distributed algorithm for minimum-weight spanning trees," in Proc. 34th IEEE Symp. on Foundations of Computer Science, pp. 659-668, November 1993.

[7] Gurdip Singh and Arthur J. Bernstein, "A highly asynchronous minimum spanning tree protocol," Distributed Computing, vol. 8, no. 3, pp. 151-161, March 1995.

[8] M. Elkin, "A faster distributed protocol for constructing minimum spanning tree," in Proc. ACM-SIAM Symp. on Discrete Algorithms, pp. 352-361, 2004.

[9] Michalis Faloutsos and Mart Molle, "Optimal distributed algorithm for minimum spanning trees revisited," in Proc. 14th Annual ACM Symposium on Principles of Distributed Computing, pp. 231-237, 1995.

[10] B. Das and M. C. Loui, "Reconstructing a minimum spanning tree after deletion of any node," Algorithmica, vol. 31, pp. 530-547, 2001.

[11] E. Nardelli, G. Proietti, and P. Widmayer, "Nearly linear time minimum spanning tree maintenance for transient node failures," Algorithmica, vol. 40, pp. 119-132, 2004.

[12] Hichem Megharbi and Hamamache Kheddouci, "Distributed algorithms for constructing and maintaining a spanning tree in a mobile ad hoc network," in First International Workshop on Managing Context Information in Mobile and Pervasive Environments, 2005.

[13] P. Flocchini, L. Pagli, G. Prencipe, and N. Santoro, "Distributed computation of all node replacements of a minimum spanning tree," in Euro-Par, volume 4641 of LNCS, pp. 598-607, Springer, 2007.

[14] B. Awerbuch, I. Cidon, and S. Kutten, "Optimal maintenance of a spanning tree," J. ACM, vol. 55, no. 4, Article 18, 45 pages, September 2008.

[15] Network Simulator version 2 (NS2), URL: http://www.isi.edu/nsnam/ns/


Page 134: ADCOM 2009 Conference Proceedings

A SAL Based Algorithm for Convex Optimization Problems

Amit Kumar Mishra

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, India

Email: [email protected]

Abstract—A new successive approximation logic (SAL) based iterative optimization algorithm for convex optimization problems is presented in this paper. The algorithm can be generalized to multi-variable quadratic objective functions. There are two major advantages of the proposed algorithm. First, the proposed algorithm takes a fixed number of iterations, which depends not on the objective function but on the search span and on the desired resolution. Secondly, for an n-variable objective function, if the number of data points considered in the span is N, then the algorithm takes just n log2 N iterations.

Index Terms—Quadratic objective function, iterative optimization

I. INTRODUCTION

The solution of convex optimization problems is a well-studied area with a number of existing iterative algorithms. However, the performance of these algorithms depends on how far the starting point is from the solution and on the nature of the objective function [1], [2]. In the current paper we present an iterative algorithm based on successive approximation logic (SAL). SAL has been used successfully in a range of applications, from analog-to-digital converters [3] to coordinate rotation digital computer (CORDIC) architectures [4]. In the proposed algorithm we first discretize the search domain and represent the points using the binary number system. The starting point is an all-zero binary number. The bits are updated starting from the most significant bit (MSB) down to the least significant bit (LSB). The update rule is based on the slope of the objective function at the given candidate solution.

Some of the major advantages of this algorithm are as follows. First, it is simple and easily implementable on digital hardware. Secondly, irrespective of the starting point, the search takes exactly B iterations, where B is the number of bits used to represent each point in the search space. Thirdly, each iteration is computationally light, involving two evaluations of the objective function. The algorithm needs only the sign of the gradient at a point, not its exact magnitude. The optimization error is less than one LSB, i.e., 2^−B. Finally, for an n-variable objective function, if the number of data points considered in the span is N, then the algorithm takes just n log2 N iterations.

In the present paper we have not handled the problem of boundary constraints. It is further assumed that the search space has a single extremum. Lastly, we deal only with the maxima-search problem; a solution to the minima-search

Fig. 1: Generic single variable maximization problem

problem can easily be achieved by incorporating trivial modifications to the algorithm.

We also show a simple extension of the algorithm to multi-variable optimization (MVO), with an illustration of the algorithm for bivariate optimization.

The next section expounds the algorithm for the single-variable optimization (SVO) problem. Section 3 describes the algorithm for MVO. Section 4 compares the performance of the proposed algorithm with that of some classic algorithms from the literature. The last section concludes the paper.

II. THE ALGORITHM FOR SVO

The problem definition for SVO is as follows. Given a function f : A → R, we seek the point xO such that f(xO) ≥ f(x) for all x in the search space. Throughout this paper we deal with the maximization problem; the extension of the algorithm to a minimization problem is trivial.

Figure 1 shows the generic single-variable maximization problem. P1 and P2 are two generic points in the search space. In an iterative optimization, if P1 is the resulting point of the current iteration, the next iteration should move the point towards the right, and if the resulting point is P2, the next iteration should move the point towards the left. This strategy is shown in Algorithm 1.

It may be noted here that the update depends only on the sign of the slope, not on its exact magnitude. This allows computationally simpler methods to be used for estimating the slope.


Page 135: ADCOM 2009 Conference Proceedings

The above updating is done using the digital successive approximation algorithm. First, the search space is sampled and given a digital representation using the binary number system; the number of bits used for each point depends on the desired accuracy of the algorithm. The starting point of the update is always 0. Let xi be the i-th estimate of xO, and let B be the number of bits used to represent a point in the search space. In the i-th iteration (i ∈ [1, B]), the (B − i)-th bit (with bits indexed from 0 at the LSB) is updated in accordance with Algorithm 1. The update rule for the i-th iteration is given in Algorithm 2.

Algorithm 1 Updating algorithm for xi

1: if slope(xi) ≥ 0 then
2:   xi+1 > xi
3: else
4:   xi+1 < xi
5: end if

Algorithm 2 Updating algorithm for the i-th iteration

1: if slope(xi) ≥ 0 then
2:   (B − i)-th bit of X = 1
3: else
4:   (B − i)-th bit of X = 0
5: end if

The complete pseudo-code of the algorithm for SVO is shown in Algorithm 3. Table I gives a short description of the functions used in the pseudo-code.

Algorithm 3 Find xO, given the search space boundaries xh and xl, and the desired resolution in the search space δx

1: B ⇐ ceil(log2((xh − xl)/δx))
2: δxup ⇐ (xh − xl)/2^B
3: arg ⇐ zeros(1, B)
4: sl ⇐ 0
5: for i = 1 to B do
6:   arg(i) ⇐ 1
7:   xi ⇐ xl + bin2int(arg) ∗ δxup
8:   sl ⇐ Hs(findslope(xi))
9:   arg(i) ⇐ sl
10: end for
11: xO ⇐ xB
12: return xO

A. Explanation of the pseudo-code

The algorithm needs three inputs, viz. the boundary points of the search space, xh and xl, and the desired resolution in the search space, δx. In step 1, the number of bits B required for the problem is estimated. From this the updated resolution δxup is calculated (δxup ≤ δx). The variable sl contains the slope of the function at xi in the i-th iteration. arg is the binary number whose bits are updated in each iteration.

TABLE I: Short description of the functions used in the pseudo-codes

Function name | Description
ceil          | ceiling function
Hs            | Heaviside step function
zeros         | zeros(1, K) gives a K-bit binary number with all bits set to 0
bin2int       | converts a binary number to the equivalent integer
findslope     | finds the slope of the cost function at the given argument(s)

In B iterations, all the bits of arg are updated. xB is then assigned to xO and returned as the answer.
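To make the procedure concrete, here is a small runnable Python sketch of Algorithm 3 (our own illustration, not code from the paper). It assumes, as the paper does, a single extremum in the span and ignores boundary handling; findslope is approximated by a symmetric difference, which is one admissible choice since only the sign of the slope is used.

import math

def sal_maximize(f, xl, xh, dx):
    """SAL-based single-variable maximization: a sketch of Algorithm 3.
    Assumes a single extremum in [xl, xh]; boundary handling is ignored,
    as in the paper."""
    B = math.ceil(math.log2((xh - xl) / dx))   # step 1: number of bits
    dx_up = (xh - xl) / 2 ** B                 # step 2: updated resolution
    arg = [0] * B                              # MSB-first bit vector
    h = dx_up / 4                              # probe step for the slope sign
    for i in range(B):                         # exactly B iterations
        arg[i] = 1                             # tentatively set the next bit
        xi = xl + int("".join(map(str, arg)), 2) * dx_up
        slope = f(xi + h) - f(xi - h)          # only the sign matters
        arg[i] = 1 if slope >= 0 else 0        # Heaviside of the slope
    return xl + int("".join(map(str, arg)), 2) * dx_up

# Example: a concave quadratic peaking at x = 1.3.
print(sal_maximize(lambda x: -(x - 1.3) ** 2, 0.0, 4.0, 1e-6))  # ≈ 1.3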

III. THE ALGORITHM FOR MVO

For multi-variable optimization (MVO), an updating algorithm similar to the SVO case is used. All the variables of the search space are digitized to the desired resolution and represented using the binary number system. Instead of applying the algorithm to each dimension separately, all the dimensions are fused together by interleaving the binary representations of the individual dimensions. Hence, if B bits are used to represent each dimension, the updating algorithm is applied to a DB-bit number, where D is the dimension of the MVO problem. In the interleaving, the bits of equal significance are placed together. A complete run of the algorithm therefore needs DB iterations.

In general, if K_SVO is the number of operations required by an SVO algorithm, a naive extension to a D-dimensional MVO would need K_SVO^D operations. Using the current algorithm, however, the number of operations for a D-variable optimization problem is D·K_SVO. This results in a substantial speed-up of the algorithm for MVO problems.

A. Pseudo-code for bi-variate optimization

As an example, Algorithm 4 gives the pseudo-code for a bivariate optimization problem. This algorithm needs six inputs, viz. the boundary points of the search space, xh, xl and yh, yl, and the desired resolutions in both dimensions, δx and δy. In steps 1 and 2, the numbers of bits, B1 and B2, required in the two dimensions are estimated. To keep the algorithm simple, a uniform number of bits is assigned to both dimensions by taking the larger of B1 and B2 as the number of bits representing the search space in each dimension. Accordingly, the resolutions in the two dimensions are updated to δxup and δyup in steps 4 and 5. The variable sl contains the slope of the function at (xi, yi) in the i-th iteration. (argx, argy) are the binary numbers for the two dimensions, whose bits are updated in each iteration. arg is the 2B-bit binary number whose odd-positioned bits are derived from argx and even-positioned bits from argy. In 2B iterations, all the bits of arg are updated. The final estimates (x2B, y2B) are assigned to (xO, yO) and returned as the answer.


Page 136: ADCOM 2009 Conference Proceedings

Algorithm 4 Find (xO, yO), given the search space boundaries (xh, yh) and (xl, yl), and the desired resolutions in the search space (δx, δy)

1: B1 ⇐ ceil(log2((xh − xl)/δx))
2: B2 ⇐ ceil(log2((yh − yl)/δy))
3: B ⇐ max(B1, B2)
4: δxup ⇐ (xh − xl)/2^B
5: δyup ⇐ (yh − yl)/2^B
6: arg ⇐ zeros(1, 2B)
7: sl ⇐ zeros(1, 2B)
8: argx ⇐ zeros(1, B)
9: argy ⇐ zeros(1, B)
10: for i = 1 to 2B do
11:   arg(i) ⇐ 1
12:   for j = 1 to B do
13:     argx(j) ⇐ arg(2j − 1)
14:     argy(j) ⇐ arg(2j)
15:   end for
16:   xi ⇐ xl + bin2int(argx) ∗ δxup
17:   yi ⇐ yl + bin2int(argy) ∗ δyup
18:   sl(i) ⇐ Hs(findslope(xi, yi))
19:   arg(i) ⇐ sl(i)
20: end for
21: xO ⇐ x2B
22: yO ⇐ y2B
23: return (xO, yO)
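A corresponding Python sketch of the bivariate case follows (again ours, with the same symmetric-difference slope standing in for findslope); it interleaves the bits of the two dimensions exactly as in Algorithm 4:

import math

def sal_maximize_2d(f, xl, xh, yl, yh, dx, dy):
    """SAL-based bivariate maximization: a sketch of Algorithm 4."""
    B = max(math.ceil(math.log2((xh - xl) / dx)),
            math.ceil(math.log2((yh - yl) / dy)))
    dxu, dyu = (xh - xl) / 2 ** B, (yh - yl) / 2 ** B
    arg = [0] * (2 * B)            # interleaved bits: x owns odd positions
    for i in range(2 * B):         # (1-indexed), y owns even positions
        arg[i] = 1                 # tentatively set the next bit
        argx, argy = arg[0::2], arg[1::2]   # de-interleave the dimensions
        xi = xl + int("".join(map(str, argx)), 2) * dxu
        yi = yl + int("".join(map(str, argy)), 2) * dyu
        # Slope along the coordinate owning bit i; only its sign is used.
        if i % 2 == 0:
            slope = f(xi + dxu / 4, yi) - f(xi - dxu / 4, yi)
        else:
            slope = f(xi, yi + dyu / 4) - f(xi, yi - dyu / 4)
        arg[i] = 1 if slope >= 0 else 0
    argx, argy = arg[0::2], arg[1::2]
    return (xl + int("".join(map(str, argx)), 2) * dxu,
            yl + int("".join(map(str, argy)), 2) * dyu)

print(sal_maximize_2d(lambda x, y: -(x - 1.0) ** 2 - (y + 0.5) ** 2,
                      -2.0, 2.0, -2.0, 2.0, 1e-5, 1e-5))  # ≈ (1.0, -0.5)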

IV. COMPARISON WITH SOME STANDARD OPTIMIZATION ALGORITHMS BASED ON NUMERICAL EXPERIMENTS

In this section we compare the proposed SAL-based optimization (SALO) algorithm with some existing powerful optimization algorithms reported in the open literature [5], [6]. The comparisons are made on the basis of performance in optimizing a scaled sine-valley function and the scaled Rosenbrock function. Comparisons are made with the classic BFGS algorithm, Yuan's modified BFGS algorithm [6], and the usual trust-region method with curvilinear path (UTRCP) [5].

The algorithm has been validated on the following functions:

1) Problem 1: The first problem is the sine-valley function given by:

f(x1, x2) = 100[x2 − sin(x1)]² + 0.25x1².   (1)

The starting point for this problem was (3π/2, −1). The solution is (0, 0).

2) Problem 2: The second problem is Rosenbrock's function given by:

f(x1, x2) = 100(x2 − x1²)² + (1 − x1)².   (2)

The starting point for this problem was (−1.2, 1.0). The solution is (1, 1).

The problems were solved to a resolution of 10^−8, i.e.,

TABLE II: Comparison of the proposed algorithm with some standard optimization algorithms

Prob. no. | BFGS [6] | MBFGS [6] | UTRCP [5] | SALO
          | NI/NF    | NI/NF     | NI/NF     | NI/NF
1         | 40/57    | 39/54     | 22/35     | 17/35
2         | 33/45    | 34/45     | 22/38     | 17/35

||∇f(xk)|| ≤ 10^−8. The algorithms are compared on two factors, viz. the number of iterations needed (NI) and the number of function evaluations involved (NF).

Table II gives the consolidated results. It can be observed that the performance of the proposed SALO algorithm is better than or comparable to some of the best algorithms in the literature. However, the classic algorithms are designed to work on any unconstrained function; the generalization of the SALO algorithm along the same lines is in progress.

V. CONCLUSIONS AND DISCUSSIONS

We have presented an algorithm based on the digital successive approximation principle for convex optimization problems. The algorithm can be applied directly to SVO and, with minor modifications, to MVO. Some of the major advantages of this algorithm are as follows:

• The number of iterations is fixed and equal to the number of bits used to represent numbers in the sample space; it is independent of the cost function.

• The resolution of the algorithm is chosen by the user and is fixed for a given number of bits, irrespective of the cost function.

• A D-dimensional MVO takes DB iterations. This greatly reduces the computational complexity of an MVO.

• The complete algorithm runs in the digital domain and hence is highly amenable to implementation on digital computers.

• The proposed algorithm also works for finding non-smooth maxima or minima, provided there are no local maxima or minima.

We have tested the algorithm on a range of quadratic functions with different numbers of variables. In all cases the algorithm found the maximum with error < 2^−B.

We have not discussed the boundary problem in this paper. Still, because of the above-mentioned advantages, the algorithm promises to be a useful practical solution for convex optimization problems in any domain.

REFERENCES

[1] R. Fletcher, Practical Methods of Optimization. Wiley Interscience, 1987.

[2] M. Powell, "How bad are the BFGS and DFP methods when the objective function is quadratic?" Mathematical Programming, vol. 34, pp. 34-47, 1986.


Page 137: ADCOM 2009 Conference Proceedings

[3] J. F. Wakerly, Digital Design: Principles and Practices. Prentice Hall, 1999.

[4] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electronic Computers, vol. EC-8, pp. 330-334, 1959.

[5] Y. Xiao and F. Zhou, "Nonmonotone trust region methods with curvilinear path in unconstrained optimization," Computing, vol. 48, pp. 303-317, 1992.

[6] Y.-X. Yuan, "A modified BFGS algorithm for unconstrained optimization," IMA Journal of Numerical Analysis, vol. 11, pp. 325-332, 1991.


Page 138: ADCOM 2009 Conference Proceedings

ADCOM 2009

WIRELESS SENSOR NETWORKS

Session Papers:

1. V. V. S. Suresh Kalepu and Raja Datta, “Energy Efficient Cluster Formation using Minimum Separation Distance and Relay CH’s in Wireless Sensor Networks”

2. Pankaj Gupta, Tarun Bansal and Manoj Misra, “An Energy Efficient Base Station to Node Communication Protocol for Wireless Sensor Networks”

3. R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain, “A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks”


Page 139: ADCOM 2009 Conference Proceedings


Energy Efficient Cluster Formation using Minimum Separation Distance

and Relay CH’s in Wireless Sensor Networks

V. V. S. Suresh Kalepu and Raja Datta, Member, IEEE

Department of Electronics and Electrical Communication Engineering

Indian Institute of Technology Kharagpur, Kharagpur-721302, India

Email: ,

Abstract — In this work we propose a scheme to select relay nodes for forwarding network data when a minimum separation distance (MSD) is maintained between cluster heads in a cluster-based sensor network. This prolongs network lifetime by spreading out the cluster heads and lowering the average communication energy consumption through optimal selection of the next node for data delivery. The work also includes a study of the above protocol under varying network areas. We also propose another cluster-based routing protocol for large network areas, which improves the MSD routing protocol by introducing minimum spanning trees (MST), instead of direct communication, to connect the nodes within clusters. We have performed extensive simulations to show that the proposed method outperforms existing techniques.

Keywords: Wireless Sensor Network, MSD, TDMA, LEACH, PEGASIS, CH, Minimum Spanning Tree.

I. Introduction

Wireless sensor networks consist of hundreds to thousands of low-power, multi-functioning sensor nodes, operating in an unattended environment, with limited computational and sensing capabilities. Recent developments in low-power wireless integrated micro-sensor technologies have made these sensor nodes available in large numbers, at a low cost, to be employed in a wide range of applications in military and national security, environmental monitoring, and many other fields [1]. In contrast to traditional sensors, sensor networks offer a flexible proposition in terms of ease of deployment and multiple functionalities. In classical sensors, the placement of the nodes and the network topology need to be predetermined and carefully engineered. However, in the case of modern wireless sensor nodes, their compact physical dimensions permit a large number of sensor nodes to be randomly deployed in inaccessible terrains. In addition, the nodes in a wireless sensor network are also capable of performing other functions such as data processing and routing, whereas in traditional sensor networks special nodes with computational capabilities have to be installed separately to achieve such functionalities.

In order to take advantage of these features of wireless sensor nodes, we need to account for certain constraints associated with them. In particular, minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Since the sensor nodes are equipped with small, often irreplaceable, batteries with limited power capacity, it is essential that the network be energy efficient in order to maximize the life span of the network [1, 2].

In this paper, we propose a method to select relay nodes to forward the aggregated data by considering Link Cost Factors (LCF). This work includes another efficient cluster-based routing protocol for large network areas, which improves the MSD routing protocol by introducing Minimum Spanning Trees (MST), instead of direct communication, to connect nodes within clusters. The rest of the paper is organized as follows: the important existing protocols, and the improvements subsequently proposed on them, are described in Section II, and the radio power model used for the simulations is presented in Section III. Section IV describes the drawbacks of existing protocols and the proposed algorithm. The results, duly supported by the relevant plots of performance characteristics and the related analysis, are presented in Section V, and Section VI concludes the paper.

II. Related work

2.1. Cluster-Based Routing Protocol

The popular existing hierarchical routing protocol in sensor networks is Low-Energy Adaptive Clustering Hierarchy (LEACH).

LEACH [3] is a TDMA cluster-based approach in which a node elects itself to become cluster head with some probability and broadcasts an advertisement message to all the other nodes in the network. A non-cluster-head node selects a cluster head to join based on the received signal strength. Being a cluster head is more energy consuming than being a non-cluster-head node, since the cluster head needs to receive data from all cluster members in its cluster and then send the data to the base station. All nodes in the network have the potential to be cluster head during some periods of time. The TDMA scheme starts every round with a set-up phase to organize the clusters. After the set-up phase, the system is in a steady-state phase for a certain amount of time. The steady-state phase consists of several cycles in which all nodes have their transmission slots periodically. The nodes send their data to the cluster head, which aggregates the data and sends it to the base station at the end of each cycle. After a certain amount of time, the TDMA round ends and the network re-enters the set-up phase. LEACH


Page 140: ADCOM 2009 Conference Proceedings


has the drawback that the clusters are not evenly distributed, owing to its random rotation of local cluster heads.

Power-Efficient Gathering in Sensor Information Systems (PEGASIS), another clustering-based routing protocol, further enhances network lifetime by increasing local collaboration among sensor nodes [5]. In PEGASIS, nodes are organized into a chain using a greedy algorithm, so that each node transmits to and receives from only one of its neighbors. In each round, a randomly chosen node from the chain transmits the aggregated data to the base station, thus reducing the per-round energy expenditure compared to LEACH.

III. Network and Radio Models

The Network Model and Architecture

Our proposed protocol relies on the realization that the base station is a high-energy node with a large energy supply. Thus, it utilizes the base station to control the coordinated sensing task performed by the sensor nodes. In this article we assume a sensor network model, similar to those used in [3, 6], with the following properties:

• A fixed base station is located far away from the sensor nodes.
• The sensor nodes are energy constrained, with a uniform initial energy allocation.
• The nodes are equipped with power control capabilities to vary their transmit power.
• Each node senses the environment at a fixed rate and always has data to send to the base station.
• All sensor nodes are immobile.

The two key elements considered in the design of the protocol are the sensor nodes and the base station. The sensor nodes are geographically grouped into clusters and are capable of operating in two basic modes:

• The cluster head mode
• The sensing mode

In the sensing mode, the nodes perform sensing tasks and transmit the sensed data to the cluster head. In cluster head mode, a node gathers data from the other nodes within its cluster, performs data fusion, and routes the data to the base station through other cluster head nodes. The base station in turn performs the key tasks of cluster formation, randomized cluster head selection, and CH-to-CH routing path construction.

The Radio Model

As shown in Fig. 1, a typical sensor node consists of four major components: a data processor unit; a micro-sensor; a radio communication subsystem consisting of transmitter/receiver electronics, antennae, and an amplifier; and a power supply unit [1]. Although energy is dissipated in all of the first three components of a sensor node, we mainly consider the energy dissipation associated with the radio component, since the core objective of this article is to develop an energy-efficient network-layer protocol that improves network lifetime. In addition, energy dissipated during data aggregation in the cluster head nodes is also taken into account.

Figure 1. Major components and energy cost parameters of a sensor node.

In our analysis, we use the same radio model as discussed in [9]. The transmit and receive energy costs for the transfer of an l-bit data message between two nodes separated by a distance of d meters are given by Eqs. 1 and 2, respectively:

E_Tx(l, d) = E_elec-tx · l + E_amp(l, d)   (1)

E_Rx(l) = E_elec-rx · l   (2)

where E_Tx in Eq. 1 denotes the total energy dissipated in the transmitter of the sensor node, and E_Rx in Eq. 2 represents the energy cost incurred in the receiver of the destination node. The parameters E_elec-tx and E_elec-rx in Eq. 1 and Eq. 2 are the per-bit energy dissipations for transmission and reception, respectively. E_amp(l, d) is the energy required by the transmit amplifier to maintain an acceptable signal-to-noise ratio in order to transfer data messages reliably. As in [6], we use both the free-space propagation model and the two-ray ground propagation model to approximate the path loss sustained due to wireless channel transmission. Given a threshold transmission distance d0, the free-space model is employed when d < d0, and the two-ray model is applied when d ≥ d0. Using these two models, the energy required by the transmit amplifier is given by

E_amp(l, d) = ε_fs · l · d², if d < d0; ε_tr · l · d⁴, if d ≥ d0   (3)

where ε_fs and ε_tr denote the transmit amplifier parameters corresponding to the free-space and two-ray models, respectively, and d0 is the threshold distance given by

d0 = sqrt(ε_fs / ε_tr)   (4)

We assume the same set of parameters used in [3] for all experiments throughout the article, including the energy cost for data aggregation.

IV. Energy Efficient Routing Protocol using MSD and Relay CHs

The proposed routing technique is an extension of LEACH. It uses a centralized cluster formation algorithm, i.e., cluster formation is carried out by the BS. The protocol uses the same steady-state protocol as LEACH. During the set-up phase, the base station receives information from each node about


Page 141: ADCOM 2009 Conference Proceedings


its current location and energy level. After that, the base station runs the centralized cluster formation algorithm to determine the cluster heads and clusters for that round. Once the clusters are created, the base station broadcasts the information to all the nodes in the network. Each node, except the cluster heads, determines its local TDMA slot, used for data transmission, and then sleeps until it is time to transmit data to its cluster head, i.e., until the arrival of its slot.

In our method, during the set-up phase we form clusters using the minimum separation distance method proposed by Ewa Hansen and Jonas Neander [7], which overcomes the drawback of the LEACH protocol by spatially distributing the cluster heads. A simple algorithm to find and select cluster heads is described below.

4.1 Cluster head selection algorithm

We randomly choose a node among the eligible nodes to become cluster head, while also making sure that the chosen nodes are separated by at least the minimum separation distance (if possible) from the other cluster head nodes.

Algorithm: CH selection algorithm

MSD = minimum separation distance
dc = number of desired cluster heads
energy(n) = remaining energy of node n

In the cluster head selection part, cluster heads are randomly chosen from a list of eligible nodes. To determine which nodes are eligible, the average energy of the remaining nodes in the network is calculated. In order to spread the load evenly, only nodes with energy levels above the average are eligible.

If a randomly chosen node is too close, i.e., within the minimum separation distance of any already chosen cluster head, a new node has to be chosen to guarantee the minimum separation distance. This process iterates until the desired number of cluster heads is attained. If no node can be found outside the minimum separation distance of the chosen heads, any node among the eligible nodes is chosen to become cluster head. When all cluster heads have been chosen and separated, generally by at least the minimum separation distance, clusters are created in the same way as in LEACH. A sketch of this selection procedure is given below.

4.2 Our Approach

In WSNs, asymmetric communication is possible: the base station can reach all the sensor nodes directly, while some sensor nodes cannot reach the base station directly and need other nodes to forward their data; hence routing schemes are necessary.

As the network size increases, the transmission distance within a cluster increases, and thereby the energy consumption increases.

In our approach we present the designed energy-efficient cluster-based routing protocol to overcome the above drawbacks. This section covers the selection of relay cluster heads to forward the data when a minimum separation distance between clusters is maintained. We also present an efficient routing technique for large sensor network areas. To forward the aggregated data from the CHs to the BS, relay nodes are required. The selection of relay nodes is described below.

4.2.1 Selection of Forwarding CHs

Once the CHs are identified and the nodes are clustered relative to their distance from the CHs, routing towards the base station (BS) is initiated. First, a CH checks whether the BS is within communication range. If so, data is sent directly to the BS. Otherwise, the data from the CHs in the sub-network is sent over a multi-hop route to the BS. Here, the selection of a relay node (RN) is made to maximize the link cost factor (LCF), which takes into account energy, end-to-end delay, and the distance from the RN to the BS.

Initially, a CH broadcasts HELLO packets to all CH nodes in range and receives ACK packets from all the relay candidates that are in communication range. The ACK packets contain information such as the node ID, available energy, processing delay at the node, and distance from the BS. RNs that are further away from the BS than the current node do not respond to the HELLO packets. If one of the ACK packets was sent by the BS itself, then the BS is selected as the next hop, ending the route discovery procedure. Otherwise, the current node builds a list of potential RNs from the ACKs and selects the optimal RN using the LCF parameter. The same procedure is carried out for all hops to the BS. The advantage of this routing method is a reduction in the number of relay nodes that have to forward data in the network; hence the scheme reduces overhead and minimizes the number of hops and the communication due to flooding.

The LCF from a node to its next hop node is given by Eq. 5, as a function of the delay to reach the next hop node, the distance between the next hop node and the BS, and the energy remaining at the next hop node:

(5)

In Eq. 5, consideration of the remaining energy at the next hop node increases network lifetime; the distance from the next hop node to the BS reduces the number of hops and the end-to-end delay; and the delay


Page 142: ADCOM 2009 Conference Proceedings


incurred to reach the next hop node mitigates channel fading problems and processing delay. When multiple RNs are available for routing a packet, the optimal RN is selected as the one with the maximum LCF.
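The relay discovery step can be sketched as follows. The Ack record mirrors the ACK contents listed above, while the exact LCF expression is left to Eq. 5, so the scoring function is passed in by the caller; the example score at the end is only an assumed form consistent with the qualitative description:

from dataclasses import dataclass

@dataclass
class Ack:                 # fields mirror the ACK contents described above
    node_id: int
    energy: float          # remaining energy at the candidate
    delay: float           # delay to reach the candidate
    dist_to_bs: float      # candidate's distance from the BS

def choose_next_hop(acks, my_dist_to_bs, bs_id, lcf):
    """Pick the next hop among relay candidates; 'lcf' scores an Ack
    per Eq. 5 (whose exact form is defined in the paper)."""
    if any(a.node_id == bs_id for a in acks):
        return bs_id                       # BS in range: route ends here
    # Candidates farther from the BS than us never answer HELLO,
    # but filter defensively anyway.
    candidates = [a for a in acks if a.dist_to_bs < my_dist_to_bs]
    if not candidates:
        return None                        # no forward progress possible
    return max(candidates, key=lcf).node_id

# Example scoring, an assumed form: favour high energy, low delay,
# and proximity to the BS.
score = lambda a: a.energy / ((1 + a.delay) * (1 + a.dist_to_bs))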

4.2.2 Using MST in Intra-Cluster Communication for Large Sensor Networks

In this protocol the main idea is to use MSTs to replace direct communication in one layer of the network: intra-cluster. The average transmission distance of each node can be reduced by using MSTs instead of direct transmissions, and thus the energy dissipated in transmitting data is reduced.

Figure 2 (a), (b): (a) direct communication in LEACH; (b) MST communication within a cluster for a large network area.

In each cluster, all nodes including the CH are connected by an MST, with the CH acting as the leader that collects data from the whole tree. The CHs then use relay nodes to forward the data to the BS. Data fusion is performed along the tree routes. The larger the network area, the greater the reduction in transmission distance; thus, this protocol is more energy efficient.

In direct transmission, the routing path information is simple: each node only needs to know of, and send data to, its CH. In trees, however, each node must know the next node to which it should send data. So the MSTs are formed by the BS. A sketch of this intra-cluster tree construction appears below.
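As a rough illustration of the energy argument (our sketch; coordinates and cluster membership are assumed inputs), Prim's algorithm builds the intra-cluster tree, and the average hop length can be compared with the direct-to-CH distances:

import math

def intra_cluster_mst(pos, members, ch):
    """Prim's algorithm over Euclidean distances: returns tree edges
    (parent, child) rooted at the cluster head 'ch'. A sketch only."""
    dist = lambda a, b: math.dist(pos[a], pos[b])
    in_tree, edges = {ch}, []
    while len(in_tree) < len(members):
        # cheapest edge leaving the current tree
        p, c = min(((p, c) for p in in_tree
                    for c in members if c not in in_tree),
                   key=lambda e: dist(*e))
        in_tree.add(c)
        edges.append((p, c))
    return edges

# Average hop distance over the tree vs. direct-to-CH distance: shorter
# hops cut the d^2 / d^4 amplifier cost, especially in large areas.
pos = {0: (0, 0), 1: (40, 5), 2: (80, 10), 3: (120, 0)}
tree = intra_cluster_mst(pos, set(pos), ch=0)
avg_tree = sum(math.dist(pos[p], pos[c]) for p, c in tree) / len(tree)
avg_direct = sum(math.dist(pos[0], pos[n]) for n in pos if n != 0) / 3
print(avg_tree, avg_direct)   # ~40.6 vs ~80.3 in this toy layout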

V. Performance evaluation

To assess the performance of the proposed routing protocols, we simulated the MSD and MST routing protocols in the C language.

Nodes: 100
Network size: 100 m × 100 m
Base station location: (50, 175)
Radio propagation speed: 3 × 10^8 m/s
Processing delay: 50 µs
Radio speed: 1 Mbps
Data size: 500 bytes

Table 1. Characteristics of the test network.

For an evaluation to be meaningful, the performance of the proposed protocol should be compared with the performance of certain well-known existing energy-aware protocols, namely LEACH and PEGASIS. Performance is measured by the quantitative metrics of average energy dissipation, system lifetime, total data messages successfully delivered, and the number of nodes that remain alive.

For these simulations, we consider a random network configuration with 100 nodes, where each node is assigned an initial energy of 1 J. Furthermore, the data message size for all simulations is fixed at 500 bytes, and the packet header for each type of packet is 25 bytes long. In the simulations we varied the minimum separation distance between cluster heads in order to observe the effects on the energy consumption of the network. We also investigated whether the number of clusters used, together with the minimum separation distance, has any effect on the energy consumption. The minimum separation distance was varied between 30 and 45 meters, and the number of clusters between 2 and 8.

Figure 3: Number of messages received, varying the MSD and the number of clusters

Figure 4. Distribution of sensor nodes and cluster formation with MSD = 30 m

In Figure 3, we see how the minimum separation distance affects the energy consumption, i.e., the number of messages received at the base station during the lifetime of the network. We also see how the number of clusters used affects the energy consumption in the

[Figure 3 plot: number of messages received at the BS vs. number of clusters (2-8), with curves for msd = 30, 35 and 40 m]


Page 143: ADCOM 2009 Conference Proceedings


network. Further, we see that when using 5 clusters and a minimum separation distance of 30 meters between cluster heads, the base station receives the most messages; this gives the most energy-efficient configuration. Figure 4 shows the distribution of sensor nodes and the formation of clusters with a minimum separation distance of 30 m.

The improvement gained through the MSD-with-relay-CHs protocol is exemplified by the system lifetime graph in Fig. 5. This plot shows the number of nodes that remain alive over the number of rounds of activity for the 100 m × 100 m network scenario. With the MSD protocol, all the nodes remain alive for 920 rounds, while the corresponding numbers for LEACH and PEGASIS are 510 and 825, respectively. Furthermore, if system lifetime is defined as the number of rounds for which 75 percent of the nodes remain alive, the proposed protocol exceeds the system lifetime of LEACH by 30 percent, and a 5 percent improvement in system lifetime is observed over PEGASIS.

Figure 5. Number of alive nodes as the rounds increase

Figure 6. A comparison of the MSD protocol's average energy dissipation with other clustering-based protocols.

Figure 6 shows the average energy dissipation of the protocols under study over the number of rounds of operation. This plot clearly shows that the MSD-with-relay-CHs protocol has a much more desirable energy expenditure curve than those of LEACH and PEGASIS. On average, the MSD protocol exhibits a reduction in energy consumption of 40 percent over LEACH. This is because all the cluster heads in LEACH transmit data directly to the distant base station, which in turn causes significant energy losses in the cluster head nodes.

Next we analyze the number of data messages received by the base station for the three routing protocols under consideration. For this experiment, we again simulated the 100 m × 100 m network topology, where each node begins with an initial energy of 1 J. Figure 7 shows the total number of data messages received by the base station versus the average energy dissipation. The plot clearly illustrates the effectiveness of the proposed protocol in delivering significantly more data messages than its counterparts. Moreover, the results in Fig. 7 confirm that the MSD protocol delivers the most data messages per unit of energy of the schemes compared. In the final experiment, we evaluate the performance of the routing protocols as the area of the sensor field is increased. For this simulation, 100 nodes are randomly placed in a square field of varying network area, with the base station located at least 75 m away from the closest sensor node; results were obtained over 25 different network topologies for each network area instance. Figure 8 shows the number of alive nodes after 900 rounds as the network area is varied, comparing the performance of the MSD-with-relay-CHs protocol, LEACH, PEGASIS, and the MST protocol.

Figure 7. Total number of data messages received at the base station as a function of average energy dissipation.

Figure 8. Number of nodes alive as a function of network area.

Clearly, the MSD protocol outperforms both LEACH and PEGASIS as the network area increases up to 300 m. As the network area increases further, the MST protocol performs better than the other three protocols. This is mainly because LEACH does not ensure that the cluster heads are



Page 144: ADCOM 2009 Conference Proceedings


uniformly placed across the whole sensor field. As a result, the cluster head nodes in LEACH can become concentrated in a certain region of the network, in which case nodes from the "cluster-head-deprived" regions dissipate a considerable amount of energy transmitting their data to a faraway cluster head. The use of the greedy algorithm in PEGASIS results in a gradual increase in neighbor distances, which in turn increases the communication energy cost for those PEGASIS nodes that have far neighbors. As shown, increasing neighbor distances have a significant effect on PEGASIS' performance when the area of the sensor field is increased.

Figure 9. Network lifetimes in a network area of 400 m × 400 m

Figure 9 compares the performance of the MSD routing protocol and the MST routing protocol for a network area of 400 m. The plots also show that the tree topology makes the protocol perform better than direct transmission in larger network areas: the MST protocol uses a tree topology, while the MSD protocol uses direct intra-cluster communication. The simulation results show that MST is an elegant solution for large network areas.

VI. Conclusion

We presented a simple energy-efficient cluster formation algorithm for the AROS architecture. The simulations showed that using a minimum separation distance between cluster heads, together with forwarding CHs, improves energy efficiency compared to LEACH and PEGASIS, measured by the number of messages received at the base station. Using 5 clusters and a minimum separation distance of 30 meters between cluster heads is the most efficient configuration for our simulated network.

Using an MST within the cluster is an elegant solution for large sensor networks. The simulation results show that this approach reduces the distance between a cluster head and its cluster member nodes, thereby reducing the transmission energy when a cluster member node communicates with its cluster head. The results show that it is more energy efficient than LEACH and PEGASIS for large sensor networks.

References

1) I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "A Survey on Sensor Networks," IEEE Communications Magazine, 40(8):102-114, 2002.

2) J. N. Al-Karaki and A. E. Kamal, "Routing Techniques in Wireless Sensor Networks: A Survey," IEEE Wireless Communications, vol. 11, pp. 6-28, Dec. 2004.

3) W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-Efficient Communication Protocols for Wireless Microsensor Networks (LEACH)," Proc. of the 33rd Hawaii International Conference on System Sciences, Volume 8, pp. 3005-3014, January 04-07, 2000.

4) R.-S. Chang and C.-J. Kuo, "An Energy Efficient Routing Mechanism for Wireless Sensor Networks," IEEE, 2006.

5) S. Lindsey and C. Raghavendra, "PEGASIS: Power-Efficient Gathering in Sensor Information Systems," Proc. of the 2002 IEEE Aerospace Conference, pp. 1-6, March 2002.

6) S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, "A Centralized Energy-Efficient Routing Protocol for Wireless Sensor Networks," IEEE Communications Magazine, vol. 43, pp. 8-13, 2005.

7) Ewa Hansen, Jonas Neander, Mikael Nolin and Mats Björkman, "Energy-Efficient Cluster Formation for Large Sensor Networks using a Minimum Separation Distance," in Proceedings of the Fifth Annual Mediterranean Ad Hoc Networking Workshop (MedHocNet), Lipari, Italy, June 2006.

8) G. Huang, Xiaowei Li, and Jing He, "Dynamic Minimum Spanning Tree Routing Protocol for Large Wireless Sensor Networks," IEEE, 2006.

9) Wendi B. Heinzelman and Anantha P. Chandrakasan, "An Application-Specific Protocol Architecture for Wireless Microsensor Networks," IEEE Transactions on Wireless Communications, vol. 1, pp. 660-670, October 2002.

10) Nauman Israr and Irfan Awan, "Multihop Routing Algorithm for Inter-Cluster Head Communication," 22nd UK Performance Engineering Workshop, Bournemouth, UK, pp. 24-31, July 2006.

11) M. Younis, M. Youssef, and K. Arisha, "Energy-aware management for cluster-based sensor networks," Computer Networks, 43:649-668, Dec. 2003.

12) A. Manjeshwar and D. P. Agrawal, "TEEN: A Routing Protocol for Enhanced Efficiency in Wireless Sensor Networks," Parallel and Distributed Processing Symposium, Proceedings 15th International, pp. 2009-2015, April 2001.

13) A. Manjeshwar and D. P. Agrawal, "APTEEN: A Hybrid Protocol for Efficient Routing and Comprehensive Information Retrieval in Wireless Sensor Networks," Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, pp. 195-202, April 2002.

14) J. Chang and L. Tassiulas, "Maximum lifetime routing in wireless sensor networks," IEEE/ACM Trans. Networking, 12(4):609-619, 2004.

15) A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, "Wireless Sensor Networks for Habitat Monitoring," WSNA'02, September 2002.

[Plot for Figure 9: number of nodes alive vs. number of rounds (500-1500) for MSD and MST-in-cluster.]

Page 145: ADCOM 2009 Conference Proceedings

An Energy Efficient Base Station to Node Communication Protocol for Wireless Sensor Networks

Pankaj Gupta Department of E.C.E.

Indian Institute of Technology Roorkee Roorkee-247667, India [email protected]

Tarun Bansal Department of Computer Science

University of Texas at Dallas Richardson TX, USA

[email protected]

Manoj Misra Department of E.C.E.

Indian Institute of Technology Roorkee Roorkee-247667, India [email protected]

Abstract—Inexpensive sensors capable of significant computation and wireless communication, but with limited energy resources, are now available. Once deployed, the small sensor nodes are usually inaccessible to the user, and thus replacement of the energy source is not feasible. Hence, energy efficiency is a key design issue that needs to be addressed in order to improve the life span of the network. Several network layer protocols like LEACH, BCDCP, and PEDAP have proved very useful and efficient for Node to Base Station (BS) communication. However, these centralized protocols have no explicit support for BS to Node communication. In some scenarios, BS to Node communication may be very frequent, and in such cases the trivial solution of flooding may prove very costly. We introduce here the M-way Search Tree Based Base station to Node communication protocol (MSTBBN) for Wireless Sensor Networks, which can be used to provide efficient BS to Node communication. MSTBBN can be used with any of the centralized data-centric protocols (like LEACH-C) without any significant message overhead. Our solution provides efficient communication with a time complexity of O(h) hops, where h is the height of the BS-rooted tree constructed by the underlying routing protocol.

Keywords- Wireless Sensor Networks (WSN), Base station to node communication, Wireless communication, Routing protocol, Energy Efficiency.

I. INTRODUCTION

A wireless sensor is a battery-operated device capable of sensing physical quantities. In addition to sensing, it is capable of wireless communication, data storage, and a limited amount of computation and signal processing. A wireless sensor network (WSN) consists of hundreds to thousands of such low-power multifunctioning sensor nodes, operating in an unattended environment, with limited computational and sensing capabilities to achieve a common objective [1]. A WSN has one or more base stations (or sinks) which collect data from all sensor devices. These base stations (BS) are the interface through which the WSN interacts with the outside world.

Recent developments in low-power wireless integrated microsensor technologies have made these sensor nodes available in large numbers, at a low cost, to be employed in a wide range of applications in military & national security, environmental monitoring, and many other fields [2]. The technology promises to revolutionize the way we live, work, and interact with the physical environment.

In contrast to traditional sensors, sensor networks offer a flexible proposition in terms of the ease of deployment and multiple functionalities. In classical sensors, the placement of the nodes and the network topology need to be predetermined and carefully engineered. However, in the case of modern wireless sensor nodes, their compact physical dimensions permit a large number of sensor nodes to be randomly deployed in inaccessible terrains. In addition, the nodes in a wireless sensor network are also capable of performing other functions such as data processing and routing, whereas in traditional sensor networks, special nodes with computational capabilities have to be installed separately to achieve such functionalities.

In order to take advantage of these features of wireless

sensor nodes, we need to account for certain constraints associated with them. In particular, minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Since the sensor nodes are equipped with small, often irreplaceable, batteries with limited power capacity, it is essential that the network be energy efficient in order to maximize the life span of the network [1, 3].

Recent advances in wireless sensor networks have led to

many new routing protocols specifically designed for sensor networks where energy awareness is an essential consideration. These routing mechanisms have considered the characteristics of sensor nodes along with the application and architecture requirements. Most of the attention, however, has been given to designing protocols for routing


Page 146: ADCOM 2009 Conference Proceedings

data from sensor nodes to the base station. These protocols use flooding for data transmission from the BS to individual nodes [3]. Obviously, flooding proves to be a very costly solution. This paper presents a novel method for the base station to communicate with nodes in the network, which could be used in scenarios where BS to node communication is frequent, for example, where the BS has to update some parameters at a particular node.

The rest of the paper is organised as follows. In Section II, we discuss previous work in this area. Section III presents an outline of the sensor network model we used and our assumptions. Our protocol is then described in Section IV. In Section V, we analyze the performance of the algorithm using simulations. Lastly, the paper is concluded in Section VI with pointers to future work.

II. RELATED WORK

An adaptive clustering scheme called Low-Energy Adaptive Clustering Hierarchy (LEACH) is proposed in [4], which tries to reduce the number of nodes communicating directly with the BS. The protocol achieves this by forming a few clusters (elected randomly), where each cluster-head (CH) collects the data from nodes in its cluster, fuses it and sends the result to the BS. LEACH-C [4] uses a centralized cluster formation algorithm to guarantee k nodes in the cluster and minimize the total energy spent by the non-cluster-head nodes, by evenly distributing the CHs throughout the network. UDACH proposed in [5] works on similar lines; however, here the cluster heads are selected based upon the residual energy of each node.

Another clustering based approach called BCDCP [6] makes clusters of equal size to ensure similar power dissipation. PEDAP [7] follows a minimum spanning tree organisation with the BS as the root, improving the total lifetime and scalability. In PEGASIS [8], a chain is constructed among the sensor nodes so that each node receives from and transmits to a close neighbour.

The protocols discussed above and other data-oriented centralized network protocols allow efficient node to BS communication. However, when the base station has to communicate with the nodes, these protocols rely on flooding, where each receiver is obligated to retransmit any packet it has not seen before in the network. Network wide flooding reduces the network capacity by sending information to mobile hosts which are not supposed to receive it, thus increasing the traffic load and packet collision rate. This also leads to an increase in individual node power consumption. Obviously this proves to be a very costly solution, especially in industrial deployments where the base station has to frequently communicate the values of various parameters to the nodes; consider, for example, a deployment where nodes monitor temperature as directed by the BS and the sampling frequency is different at each node.

Various alternatives to blind flooding have been proposed. Typically these techniques aim at minimising the number of retransmissions of broadcast messages. One of the alternatives proposed is randomised forwarding, where each node forwards a packet to its neighbours with a probability p. This scheme was termed gossiping [9]. The typical value of p lies in the range 0.65 to 0.75 for acceptable reliability of data delivery. However, this probabilistic forwarding increases delay in data delivery. Directed flooding proposed in [10] sends data in a specific directional virtual aperture instead of broadcasting. Only nodes within this virtual aperture forward packets, and thus power consumption is reduced while maintaining a low overhead. However, it is very difficult to decide the size of the aperture: if the aperture is small, the adjusting times will be large, while increasing the aperture gives less adjustment but reduces the benefit of directed flooding because of increased overhead. The alternatives proposed in [11, 12] require 1 or 2 hop neighbour information at each node and thus do not scale with increasing node density.

Many solutions are available in the literature to solve these problems and avoid flooding. The first solution, LEAR, proposed in [13], is inspired by Dynamic Source Routing [14]: the base station puts the complete path to the destination node in the packet. Intermediate nodes read the path in the received packets and use it to forward them to the destination. However, this solution incurs the overhead of carrying the whole path inside the packet. When the network size is large with multiple hops, the overhead of carrying the whole path in the packet proves very costly.

The second solution, given by Hyun-sook Kim et al. in [15], requires nodes to maintain information about all their children. Whenever a node receives a packet from the BS, it forwards it to one of its children on the basis of the final destination address. This solution is also not scalable, as the size of the routing tables grows with increasing network density.

III. SYSTEM MODEL AND ASSUMPTIONS

The system consists of the following components:
Node: This refers to the sensor nodes. Sensor nodes are the heart of the network; they are in charge of collecting and processing data and routing it to the sink. In other words, sensor nodes can sense data from the environment, perform simple computations and transmit this data wirelessly to a command center, either directly or in a multi-hop fashion through neighbors.
Base Station: The base station is a sensor node responsible for getting requests for data collection from applications. The BS is also responsible for calculation of the routing tree according to

Page 147: ADCOM 2009 Conference Proceedings

the underlying routing protocol. The base station thus coordinates and controls the sensing tasks performed by sensor nodes.
Destination Node: This is the node to which the BS wants to send information.
Intermediate Nodes: These are sensor nodes which come in the path between the BS and the destination node. Intermediate nodes forward data packets based upon the underlying routing protocol.
Network topology: This is a connectivity graph where the vertices are sensor nodes and the edges are communication links. In a wireless network, a link represents a one-hop connection, and the neighbours of a node are those within its radio range.

The following were our assumptions, which are consistent with the assumptions made in the literature [4, 6, 7, 8]:

• The base station is a high-energy node with a large amount of energy supply.
• Sensor nodes are homogeneous and energy constrained with uniform initial energy.
• Sensor nodes are equipped with power control capabilities to vary their transmitted power.
• Sensor nodes exhibit no mobility.
• All sensor nodes communicate through wireless links over a single shared channel.
• Links between two sensor nodes are bidirectional.
• Each node knows its current energy and location (using GPS [16] or other localization mechanisms).
• A message sent by a node is received correctly within a finite time by all one hop neighbours [17].
• The network topology does not change during network operation.
• Each node can be identified uniquely by its identifier.
• Single hop broadcast refers to the operation of sending a packet to all single-hop neighbours.

Most of these restrictions have been placed in order to simplify the solution. By slightly modifying the proposed protocol, these restrictions can be easily removed. In the sensing mode, the nodes perform sensing tasks and transmit the sensed data. In cluster head mode (wherever assumed), a node gathers data from other nodes within its cluster, performs data fusion, and routes the data to the base station. The base station in turn performs the key tasks of cluster formation, cluster head selection, and routing tree construction.

IV. PROTOCOL ARCHITECTURE

In all the centralized routing protocols like LEACH-C [4], PEDAP [7] etc., the routing tree is calculated by the base station at the beginning of each round. This calculation is done by the BS on the basis of parameters like the geographical position of the nodes, residual energy and other heuristic parameters. The base station then broadcasts this routing tree to the nodes, which on receiving it rebroadcast it. Finally, all the nodes in the field are aware of the routing tree. Nodes then use the information available in the routing tree to route their data to the BS. Although this setup allows nodes to transfer their data to the base station efficiently, it provides no support for BS to individual node communication.

In MSTBBN, we propose that along with the routing tree, the BS will assign each node a key KMSTBBN (inspired by the concept of m-way search trees [18]), which will later be used by the BS to communicate efficiently with any particular node without any packet or memory overhead. MSTBBN is not limited to any particular routing algorithm: any of the centralized algorithms (like LEACH-C or PEDAP) can be used to provide the underlying routing capabilities. MSTBBN provides BS to node communication in O(h) hops and with O(h) time and message complexity, where h is the height of the routing tree constructed by the underlying routing protocol with the BS as root.

Next, we describe the working of the MSTBBN protocol in the following four phases.

Phase 1: Calculation of routing tree by BS.

At the beginning of each round, the routing tree is calculated by the underlying routing protocol. For example, fig. 1 shows the routing tree calculated by LEACH-C with the BS, Cluster Heads (CHs) and Nodes.

Phase 2: Allocation of KMSTBBN by BS.

In the next phase, using the idea of m-way search trees, the BS assigns a key KMSTBBN to each node. The following rules are followed by the BS while assigning keys to the nodes, where KMSTBBN(X) refers to the key assigned to node X according to MSTBBN (a sketch of one way to generate such keys follows the list):

• Rule 1: KMSTBBN(N1) > KMSTBBN(N2) where N1 is a child of N2. E.g., in fig. 2, KMSTBBN(A) = 0, which is the lowest in the whole tree, since A is the root (BS).

Figure 1. Routing tree calculated by BS in LEACH-C


Page 148: ADCOM 2009 Conference Proceedings

• Rule 2: KMSTBBN(N1) < KMSTBBN(N2) where N2 is the right sibling of N1. E.g., in fig. 2, KMSTBBN(B) < KMSTBBN(C), since C is the right sibling of B, where KMSTBBN(B) = 1 and KMSTBBN(C) = 6.

• Rule 3: KMSTBBN(N3) > KMSTBBN(N1) where N1 is in the subtree rooted at N2 and N3 is the right sibling of N2. E.g., in fig. 2, every key in the subtree rooted at C is smaller than KMSTBBN(D), since D is the right sibling of C, where KMSTBBN(C) = 6 and KMSTBBN(D) = 12, and both C and D are children of A (the base station).

• Rule 4: KMSTBBN(N1) < KMSTBBN(N3) where N3 is in the subtree rooted at N2 and N1 is the left sibling of N2. E.g., in fig. 2, KMSTBBN(C) < KMSTBBN(Z), since C is the left sibling of D and Z is in the subtree rooted at D, where KMSTBBN(C) = 6, KMSTBBN(D) = 12 and KMSTBBN(Z) = 14.
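These four rules are exactly the properties of a preorder numbering of the routing tree. As an illustration only (a sketch in Python; the exact tree shape and the leaf names b1-b4, c1-c5, d1, d3, d4 are our assumptions, chosen to reproduce the key values quoted in the rules), keys satisfying Rules 1-4 could be generated as follows:

# Sketch: assign KMSTBBN keys by a preorder traversal of the routing tree.
# The tree shape below is an assumption, not taken verbatim from fig. 2.

def assign_kmstbbn(root, children):
    """Return a dict node -> KMSTBBN(node) satisfying Rules 1-4."""
    keys = {}
    counter = 0

    def visit(node):
        nonlocal counter
        keys[node] = counter   # a parent always gets a smaller key than
        counter += 1           # any node in its subtree (Rule 1)
        for child in children.get(node, []):
            visit(child)       # a subtree is numbered completely before the
                               # next right sibling starts (Rules 2, 3 and 4)

    visit(root)
    return keys

tree = {
    'A': ['B', 'C', 'D'],                  # base station and its cluster heads
    'B': ['b1', 'b2', 'b3', 'b4'],
    'C': ['c1', 'c2', 'c3', 'c4', 'c5'],
    'D': ['d1', 'Z', 'd3', 'd4'],
}
print(assign_kmstbbn('A', tree))
# A:0, B:1, b1-b4:2-5, C:6, c1-c5:7-11, D:12, d1:13, Z:14, d3:15, d4:16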

Observe that this method of assigning keys converts the routing tree into a search tree. Moreover, it ensures that the node to BS routing paths in the resulting tree are the same as those calculated by the underlying protocol. This is important, as MSTBBN does not require any alteration to the routing tree created by the underlying protocol. In Phase 4, we will explain how these keys are used to provide efficient BS to node communication.

Phase 3: BS broadcasts keys to nodes.

The BS next broadcasts the KMSTBBN of every node. This broadcast is done in a similar fashion to that of the underlying protocol (like LEACH-C). Moreover, this broadcast can be piggybacked by the BS while broadcasting the routing tree, thus further cutting down the network overhead. When a node receives these packets, it records its own KMSTBBN and the KMSTBBN of its child with the highest value of KMSTBBN (or the base station can assign each node this value). Thus the memory requirement of MSTBBN is O(1). For example, Table I shows the KMSTBBN values stored at node C.

TABLE I. KMSTBBN’S STORED AT NODE C

KMSTBBN(C): 6
KMSTBBN(C’s child with maximum KMSTBBN): 11

Phase 4: Base station to node communication.

Now, since the key assigning procedure has converted the routing tree into a search tree, we can use traditional searching algorithms similar to those of m-way search trees for data transmission. Whenever the BS has to communicate with any node, the BS broadcasts the data to its children. Nodes which receive packets follow the flowchart shown in fig. 3 for further forwarding of the data packets.

Figure 2. Routing tree with key values assigned to nodes

E.g., in fig. 2, if base station A has to send data to node Z, where KMSTBBN(Z) = 14 = KMSTBBN(destination), it will forward its data to node D with KMSTBBN(D) = 12. However, owing to the broadcast nature of the wireless medium, nodes B and C will also receive the transmitted data (apart from node D). According to the flowchart shown in fig. 3, a node i on receiving the packet will check whether KMSTBBN(destination) (obtained from the packet) lies between the KMSTBBN of node i and the KMSTBBN of i’s child with the highest value of KMSTBBN. If not, node i will drop the packet instead of processing it. Hence node i will only forward the packet if the destination node is in its subtree. Here, nodes B and C will drop the packet since Z is in neither B’s nor C’s subtree. Node D will forward the packet to its children, since Z is in D’s subtree. Except for Z, the rest of D’s children will discard this packet (since the packet is neither intended for them, nor does the destination lie in their subtree), and in this way the packet finally reaches its destination. It can easily be seen that these decisions follow directly from the traditional m-way tree search algorithms.

Figure 3. Flowchart describing how a received packet is handled by node i.
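The forwarding decision of fig. 3 can be summarised in a few lines. The following is a minimal sketch (Python for illustration; the parameter names are hypothetical, and the handling of leaf nodes, which store no child key, is our assumption):

# Sketch of the per-packet decision of fig. 3 at a node i.

def handle_packet(my_key, max_child_key, dest_key):
    """my_key:        KMSTBBN of node i
       max_child_key: KMSTBBN of i's child with the highest key (Table I);
                      for a leaf we pass my_key itself (an assumption)
       dest_key:      KMSTBBN(destination) carried in the packet"""
    if dest_key == my_key:
        return "deliver"          # packet is meant for this node
    if my_key < dest_key <= max_child_key:
        return "rebroadcast"      # destination lies in this node's subtree
    return "drop"                 # destination not in subtree; ignore packet

# With the keys of fig. 2 (child key 16 for D from the assumed tree above)
# and destination Z (key 14):
print(handle_packet(1, 5, 14))    # node B -> drop
print(handle_packet(6, 11, 14))   # node C -> drop
print(handle_packet(12, 16, 14))  # node D -> rebroadcast
print(handle_packet(14, 14, 14))  # node Z -> deliver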

V. PERFORMANCE EVALUATION

In order to evaluate our scheme, we simulated MSTBBN over LEACH-C [4] and PEDAP [7] as underlying routing protocols on ns-2 [19, 20] and compared its performance with flooding and directed flooding. We used the following model in our simulation studies:


Page 149: ADCOM 2009 Conference Proceedings

• The Wireless Sensor Network consisted of randomly (uniformly) distributed nodes in a square field of size 100 x 100 m².
• Each sensor node was assigned an initial energy of 20 Joules.
• The size of a data message was set to 500 bytes.
• The first order radio model as described in [4] was used to calculate the energy consumption for receiving and transmitting a data packet (a sketch of this model follows the list).
• The base station was located at the centre of the field.
• One round was defined as the duration of time from when the BS initiates sending of a data packet to a node to the time when this node receives the packet.
• The radio range of nodes was set to 40 m.
• Performance was measured by the quantitative metrics of average energy dissipated per node and number of packets forwarded.
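For concreteness, the sketch below restates the first order radio model of [4] as it is usually written; the constants Eelec = 50 nJ/bit and εamp = 100 pJ/bit/m² are the values commonly used with that model and are our assumption here, since this paper does not list them.

# Sketch: first order radio model of [4] (constants assumed, see above).

E_ELEC = 50e-9      # J/bit, electronics energy per bit
EPS_AMP = 100e-12   # J/bit/m^2, transmit amplifier energy per bit per m^2

def tx_energy(bits, distance_m):
    # Energy to transmit `bits` over `distance_m`, with d^2 path loss.
    return E_ELEC * bits + EPS_AMP * bits * distance_m ** 2

def rx_energy(bits):
    # Energy to receive `bits`.
    return E_ELEC * bits

# One 500-byte data message over the 40 m radio range used above:
bits = 500 * 8
print(tx_energy(bits, 40.0))   # 0.00084 J
print(rx_energy(bits))         # 0.0002 J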

A. MSTBBN over LEACH-C

In this section we present simulation results where we ran the MSTBBN algorithm over LEACH-C [4] as the underlying routing protocol.

Energy dissipated with rounds: In the first experiment, we evaluated the average energy dissipated by nodes in the network with increasing number of rounds. For this experiment, we simulated a deployment of 100 nodes. In each round the base station randomly chooses a destination node and sends a data packet to it. The number of rounds was varied from 600 to 900. Fig. 4 shows the comparison of MSTBBN with flooding, while fig. 5 shows the comparison of MSTBBN with directed flooding. For directed flooding, the virtual aperture was varied from Θ = π/2 to Θ = 3π/2.

The simulation results clearly demonstrate that as we increase the number of rounds, MSTBBN achieves significant energy savings compared to both flooding and directed

Figure 4. Average energy dissipated per node with increasing number of rounds

Figure 5. Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds.

flooding (for all values of the virtual aperture Θ). This is because MSTBBN limits receptions and retransmissions of packets by assigning KMSTBBN to all nodes and then performing efficient routing on the basis of this assigned key. This results in reduced redundant packet transmissions and thus increased energy savings. The average energy efficiency of MSTBBN compared to flooding was observed to be 15.52%.

Energy dissipated with node density: In the next experiment, we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 6 shows the comparison of MSTBBN with flooding, while fig. 7 shows the comparison of MSTBBN with directed flooding. Energy dissipation in MSTBBN is nearly constant, while for both flooding and directed flooding energy dissipation keeps increasing with increasing number of nodes. This is because in the case of MSTBBN, only CHs forward data packets, while in the case of both flooding and directed flooding, comparatively more nodes receive flooded packets, which leads to more flooding, thereby increasing overall energy dissipation.

Figure 6. Average energy dissipated per node with increasing number of nodes

[Plots for Figures 4-6: average energy dissipated (in Joules) vs. number of rounds (600-900) for MSTBBN, flooding and directed flooding (Θ = π/2 to 3π/2), and average energy dissipated (in Joules) vs. number of nodes for MSTBBN and flooding.]

Page 150: ADCOM 2009 Conference Proceedings

Figure 7. Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Number of packets forwarded with node density: In this experiment, we evaluated the number of packets forwarded by nodes by varying the number of nodes from 25 to 400. The experiment was done for 100 rounds, where the BS chooses a random destination node in each round for sending a data packet. Fig. 8 and fig. 9 show the comparison of MSTBBN with flooding and directed flooding respectively. These results support our interpretation of the nearly constant energy dissipation in LEACH-C.

B. MSTBBN over PEDAP

In this section we present simulation results where we ran the MSTBBN algorithm over PEDAP [7] as the underlying routing protocol.

Energy dissipated with rounds: In this experiment, we evaluated the average energy dissipated by nodes in the network with increasing number of rounds. For this experiment, we simulated a deployment of 100 nodes. In each round the BS randomly chooses a destination node and sends a data packet to it. The number of rounds was varied from 600

Figure 8. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Flooding with varying node density.

Figure 9. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Figure 10. Average energy dissipated by nodes in PEDAP using MSTBBN and flooding versus number of rounds

to 900. Fig. 10 shows the comparison of MSTBBN over PEDAP with flooding, while fig. 11 shows its comparison with directed flooding. For directed flooding we varied the size of the virtual aperture from Θ = π/2 to Θ = 3π/2. The average energy efficiency of MSTBBN compared to flooding was observed to be 8.48%.

Figure 11. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds

[Plots for Figures 7-11: average energy dissipated (in Joules) and number of packets forwarded (in thousands) vs. number of nodes (25-400) for MSTBBN, flooding and directed flooding (Θ = π/2 to 3π/2) over LEACH-C, and average energy dissipated (in Joules) vs. number of rounds (600-900) over PEDAP.]

Page 151: ADCOM 2009 Conference Proceedings

As we increase the number of rounds, the simulation results clearly indicate that MSTBBN is more energy efficient than flooding. Except for Θ = π/2, MSTBBN is also always more efficient than directed flooding. However, when we compare these results with those for LEACH-C, i.e. with fig. 4 and fig. 5, we find that MSTBBN is less efficient over PEDAP. The reason is that PEDAP is based on a spanning tree while LEACH-C is a single-level cluster based protocol, so the height of the BS-rooted tree in the case of PEDAP is larger, while for LEACH-C it is constant at 2. Thus fewer nodes receive and forward packets in the case of LEACH-C, which can be seen by comparing fig. 8 and fig. 14.

Energy dissipated with node density: In the next experiment we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 12 shows the comparison of MSTBBN with flooding, while fig. 13 shows its comparison with directed flooding. MSTBBN is thus an efficient protocol compared to flooding, while in the case of directed flooding, for some values of Θ, the energy dissipated is nearly the same as for MSTBBN.

Number of packets forwarded with node density: In this experiment, we evaluated the number of packets forwarded by nodes by varying the number of nodes from 25 to 400. The experiment was done for 100 rounds, where the BS chooses a random destination node in each round for sending a data packet. Fig. 14 shows the comparison of MSTBBN with flooding, while fig. 15 shows the comparison with directed flooding.

Figure 12. Average energy dissipated by nodes in PEDAP using MSTBBN and Flooding with varying node density.

Figure 13. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Figure 14. Number of packets forwarded by nodes in PEDAP using MSTBBN and Flooding with varying node density.

Figure 15. Number of packets forwarded by nodes in PEDAP using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

[Plots for Figures 12-15: average energy dissipated (in Joules) and number of packets forwarded (in thousands) vs. number of nodes (25-400) in PEDAP for MSTBBN, flooding and directed flooding (Θ = π/2 to 3π/2).]

Page 152: ADCOM 2009 Conference Proceedings

VI. CONCLUSIONS AND FUTURE WORK

Minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Most of the attention, however, has been given to designing protocols for routing data from sensor nodes to the base station. In this paper we presented MSTBBN, a novel scheme for base station to node communication in O(h) hops, where h is the height of the tree constructed by the underlying routing protocol. h can vary from 1 (in LEACH-C) to as much as N (in PEGASIS), where N refers to the number of nodes in the whole network. Currently no protocol like the proposed one exists for base station to node communication. Existing protocols are based upon multicasting protocols, which are not energy efficient.

Our protocol can work with any underlying centralized routing protocol, like LEACH-C or BCDCP, without any noticeable modification or message overhead. The routing tree constructed by the underlying protocol is converted to a search tree using the keys assigned in MSTBBN. Furthermore, our protocol does not incur any message or memory overhead compared to the existing solutions, making our solution scalable with increasing number of nodes as well as density. This results in an energy efficient scheme for base station to node communication, which we verified using ns-2 simulations.

Recent developments increasingly call for scenarios where the sensed data must be delivered to multiple base stations. This forms one of the areas for future research, where MSTBBN could be extended to serve scenarios with multiple base stations. Further, we also assumed that all the sensor nodes in the network are stationary. For further research, MSTBBN could be explored as a solution for slightly mobile networks. This could be done by adjusting the update interval of the keys based upon node mobility.

REFERENCES

[1] Th. Arampatzis, J. Lygeros, and S. Manesis, “A survey of applications of wireless sensors and wireless sensor networks,” in Proceedings of the 2005 IEEE International Symposium on Intelligent Control / Mediterranean Conference on Control and Automation, pp. 719-724, 27-29 June 2005.
[2] Yingshu Li, My T. Thai, and Weili Wu (Eds.), “Wireless Sensor Networks and Applications,” Springer Series on Signals and Communication Technology, 2008, ISBN 978-0-387-49591-0.
[3] Carlos de Morais Cordeiro and Dharma Prakash Agrawal, “Ad Hoc & Sensor Networks: Theory and Applications,” World Scientific Publishing Company, 2006, ISBN 981-256-681-3.
[4] Wendi B. Heinzelman, Anantha P. Chandrakasan, and Hari Balakrishnan, “An application-specific protocol architecture for wireless microsensor networks,” IEEE Transactions on Wireless Communications, vol. 1, no. 4, pp. 660-670, October 2002.
[5] Jin-Young Choi, Joon-Sic Cho, Seon-Ho Park, and Tai-Myoung Chung, “A clustering method of enhanced tree establishment in wireless sensor networks,” in Proc. 10th Int. Conf. on Advanced Communication Technology (ICACT), Feb. 2008, pp. 1103-1107.
[6] S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, “A centralized energy-efficient routing protocol for wireless sensor networks,” IEEE Communications Magazine, vol. 43, no. 3, pp. S8-S13, Mar. 2005.
[7] Huseyin Ozgur Tan and Ibrahim Korpeoglu, “Power efficient data gathering and aggregation in wireless sensor networks,” SIGMOD Record, Vol. 32, No. 4, pp. 66-71, December 2003.
[8] S. Lindsey and C. S. Raghavendra, “PEGASIS: Power-efficient gathering in sensor information systems,” in Proceedings of the IEEE Aerospace Conference, 2002.
[9] S. Hedetniemi and A. Liestman, “A survey of gossiping and broadcasting in communication networks,” Networks, Vol. 18, No. 4, pp. 319-349, 1988.
[10] R. Farivar, M. Fazeli, and S. G. Miremadi, “Directed Flooding: A fault-tolerant routing protocol for wireless sensor networks,” 2005 Systems Communications (ICW'05, ICHSN'05, ICMCS'05, SENET'05), pp. 395-399, 2005.
[11] Hai Liu, Xiaohua Jia, Peng-Jun Wan, Xinxin Liu, and Frances F. Yao, “A distributed and efficient flooding scheme using 1-hop information in mobile ad hoc networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 5, pp. 658-671, May 2007.
[12] Trong Duc Le and Hyunseung Choo, “PIB: an efficient broadcasting scheme using predecessor information in multi-hop mobile ad-hoc networks,” in Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, January 31 - February 1, 2008, Suwon, Korea.
[13] Kyungtae Woo, Chansu Yu, Dongman Lee, Hee Yong Youn, and Ben Lee, “Non-blocking, localized routing algorithm for balanced energy consumption in mobile ad hoc networks,” in Proceedings of the Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'01), p. 117, August 15-18, 2001.
[14] David B. Johnson, David A. Maltz, and Josh Broch, “DSR: The dynamic source routing protocol for multi-hop wireless ad hoc networks,” in Ad Hoc Networking, edited by Charles E. Perkins, Chapter 5, pp. 139-172, Addison-Wesley, 2001.
[15] Hyun-sook Kim and Ki-jun Han, “A power efficient routing protocol based on balanced tree in wireless sensor networks,” in Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA'05), pp. 138-143, February 6-9, 2005.
[16] N. Bulusu, J. Heidemann, and D. Estrin, “GPS-less low cost outdoor localization for very small devices,” IEEE Personal Communications, Vol. 7, 2000.
[17] Chalermek Intanagonwiwat, Ramesh Govindan, and Deborah Estrin, “Directed diffusion: A scalable and robust communication paradigm for sensor networks,” in Proceedings of the Sixth Annual International Conference on Mobile Computing and Networking (MobiCom '00), August 2000, Boston, Massachusetts.
[18] Sartaj Sahni, “Data Structures, Algorithms, and Applications in C++,” edited by June Waldman, Chapter 11, pp. 525-527, McGraw-Hill, 2000.
[19] ns-2. http://www.isi.edu/nsnam/
[20] K. Fall, “ns Notes and Documents,” The VINT Project, UC Berkeley, LBL, USC/ISI, and Xerox PARC, Feb. 2000, available at http://www.isi.edu/nsnam/ns-documentation.html.

Page 153: ADCOM 2009 Conference Proceedings

A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks

R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain

Dept. of Computer Science & Automation, Indian Institute of Science, Bangalore, India.

hansdah, neerajkumar, [email protected]

Abstract—A base station in a wireless sensor network (WSN) needs to frequently broadcast messages to sensor nodes, since broadcast communication is used in many applications such as network query, time synchronization, multi-hop routing, etc. One of the main problems of broadcast communication in WSNs is source authentication. Source authentication means that the receivers of broadcast data have to verify that the received data really originated from the claimed source and was not modified on the way. This problem is complicated due to untrusted receivers and an unreliable communication environment where the sender does not retransmit the lost packets. In this paper, we propose a novel scheme for authenticating messages from each node of the WSN at the base station using a Diffie-Hellman key. Most existing schemes for broadcast authentication using a hash key chain are limited to single hop WSNs only. Using the above technique for source node authentication, we extend the broadcast authentication scheme using hash key chains to multi-hop wireless sensor networks. The number of transmissions of packets is also reduced by using some selective paths during the broadcast and as a result, the storage and communication overhead is also reduced. The analysis and experiments show that our protocol is efficient and practical, and achieves better performance than the previous approaches.

I. INTRODUCTION

A wireless sensor network consists of a collection of low cost, low power, and multifunctional sensor nodes. Some of the designated nodes, called base stations, facilitate computation within the WSN as well as communication with the outside world. A WSN usually has a single base station. The base station controls the sensor nodes as well as collects data reported by the sensor nodes. A WSN essentially can monitor events of practical importance either periodically, or whenever they occur, over any geographical area such as forests, buildings etc. As a result, WSNs have the potential to provide practical solutions to many problems of these types. Some of the potential applications of WSNs are environmental and habitat monitoring, monitoring of civil structures like buildings, bridges etc., target tracking for military as well as civilian applications, monitoring health conditions of patients and so on.

Security of many of the applications of WSNs is very critical to the systems using them. There are many types of attacks that can be made on WSNs; a survey of the attacks can be found in [1], [2], [3]. One of the important operations of a WSN is that the base station needs to broadcast messages to the sensor nodes occasionally. But the messages need to be authenticated at each sensor node, ensuring that they have come from the base station only. This problem is known as the broadcast authentication problem. If a global shared secret key is used to authenticate these messages, malicious nodes can either modify these messages when they have to rebroadcast them or masquerade as the base station, even in a single hop WSN, since they already have the key. A solution to this problem [4] is to use a hash key chain. The first key of the chain is usually distributed to each node of the WSN using some mechanism. The first message is encrypted using the first key of the chain. The key next to the first key of the chain is sent along with the first message; it can be used to authenticate the first message, and so on. The problem in a multi-hop environment, which is quite common in WSNs, is that the sensor nodes which receive the broadcast directly from the base station can modify the messages before rebroadcasting. In a single hop WSN, this problem does not arise. Also, the solution given above ensures that malicious nodes cannot masquerade as the base station since they do not have the next key. A few solutions to the above problem have been proposed in the literature [5], [6], [7]. Of these solutions, some use digital signatures [5], [6], and others use a one way hash key chain [7]. The solutions which use digital signatures are quite heavy on the meager resources of sensor nodes. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key. We also use this scheme to propose a broadcast authentication protocol using hash key chains for multi-hop WSNs. An important feature of our protocol is that it is fault-tolerant to node failures.

The rest of the paper is organized as follows. In Section II, we give a brief survey of related works. Assumptions and definitions for the proposed protocol are described in Section III. In Section IV, we describe our proposed scheme. Security and performance analysis of the protocol is described in Section V. In Section VI, we discuss our simulation results. Conclusions are given in Section VII.

II. SURVEY OF RELATED WORKS

Broadcast authentication is an essential service in WSNs. Symmetric key based message authentication codes [8] cannot be directly used in resource-constrained wireless sensor networks, since a compromised receiver can easily impersonate the sender. On the other hand, asymmetric key based digital signature schemes [9], which are typically used for broadcast authentication in traditional networks, are too expensive to be

Page 154: ADCOM 2009 Conference Proceedings

used in WSNs, due to the high computation involved in signature verification. As a result, several broadcast authentication protocols have been proposed for resource constrained WSNs [4], [7], [10], [11], [12], [13].

Perrig et al. have proposed a broadcast authentication protocol named µTESLA [4], and it is the first protocol proposed for broadcast authentication in WSNs. This protocol is based on a one way hash key chain. µTESLA uses the key chain to emulate public key cryptography with delayed key disclosure. A key is initially chosen, and the remaining keys are generated using a one way hash function. The first key of this chain (the last key produced by the hash function) is used to encrypt the first message to be broadcast by the base station, and this key is distributed to each node of the WSN apriori. The sender divides the time period for broadcast into multiple intervals, and in each interval it uses one key, starting with the first key. At the end of each interval, it discloses the next key, which makes it possible to authenticate the messages that were sent encrypted with the previous key. However, the receiving node needs to verify that the next key had not yet been disclosed when it received the messages. After receiving a packet, if the receiver can ensure that the packet was sent before the next key was disclosed, the receiver buffers this packet and authenticates it later after receiving the next key. The protocol has certain drawbacks. It requires loose time synchronization between sender and receiver. Individual authentication as well as instantaneous authentication is not available in µTESLA. More storage space is required at the receiver side to buffer the packets until the next key is received. Many WSN applications are real time applications. Hence, to minimize the delay in authentication of real time data, the maximum number of additional packets that are received before a packet is authenticated should be small. Nonetheless, there would be some delay before a broadcast packet can be authenticated, and therefore µTESLA is not suitable for real time applications.

To increase the scalability of µTESLA, Liu and Ning have proposed a multilevel µTESLA [10]. The basic idea of this protocol is to predetermine and broadcast the parameters such as the key chain commitments, instead of the unicast based message transmission used in µTESLA. Even though it improves the scalability of µTESLA, it still suffers from certain drawbacks like the requirement of time synchronization, more buffer storage, etc.

A broadcast authentication protocol called BiBa (Bins and Balls) [11] has been proposed by Perrig, and it uses a one time digital signature scheme to authenticate the source. In BiBa, the signer precomputes t random values, called SEALs (SElf Authenticating vaLues). For each SEAL si, the signer generates a public key fsi = Fsi(0), where Fsi() is a one-way function, and these public keys are transferred to the receiver to authenticate SEALs at the receiver end. For each message M, the signer computes GH(M) for all SEALs s1 to st, where GH(M) is a particular instance from a family of one-way functions whose range is 0 to n−1 (i.e., n possible output values). The signer generates a signature ⟨si, sj⟩, where GH(M)(si) = GH(M)(sj) and si ≠ sj, and sends the message M with the signature ⟨si, sj⟩ to the receiver. After receiving the message, the receiver authenticates the received message by authenticating the signature using the previously obtained public keys. The advantage of BiBa is fast verification and a short signature, but BiBa takes a longer signing time and uses a larger public key size to authenticate the signer.

To make an improvement over public key size and signing time, Reyzin et al. have proposed a new one-time signature scheme called HORS (Hash to Obtain Random Subset) [12], which reduces the time needed to sign the message and verify the signature. It also reduces the key and signature sizes in comparison to the ones used in BiBa, making HORS the faster one-time signature scheme. The security of BiBa depends upon the random-oracle model, while the security of HORS relies on the assumption of the existence of one-way functions. HORS is computationally efficient, requiring a single hash evaluation to generate the signature and a few hash evaluations for verification, as compared to BiBa. Still, this protocol has a large public key size, which is not suitable in a WSN environment without additional modifications. Signing each packet would definitely provide secure broadcast authentication, but it still has considerable overhead for signing and verifying packets and also uses more bandwidth.

An efficient broadcast authentication scheme [13], proposed by Shang-Ming, is also based on a one-time digital signature scheme. Compared to HORS, this scheme requires less storage and communication overhead at the expense of higher computation cost. In this scheme, key generation is the same as that used in the HORS scheme. This scheme makes an improvement over the HORS scheme by reducing the large key size, but still the public key size is large and the computational overhead per message is also large.

Bekara et al. have proposed a hop-by-hop broadcast source authentication protocol for WSNs [7] to overcome DOS attacks; it limits the effect of an attack to the one hop neighbours only. In this protocol, the authors use different key chains for different hops of the network, where the maximum hop count of the network can be deduced from the maximum propagation delay in the network. This protocol consists of three phases, i.e., an initialization phase, a data broadcast phase and a data buffering/verification phase. In the initialization phase, the base station divides time into fixed time intervals, generates a separate hash key chain for each hop of the network, and stores the first key of each key chain and the duration of the intervals in each sensor node. In the data broadcast phase, the base station computes the MAC for the data it sends in the current time interval by using the current key of each key chain, and broadcasts the data together with the MACs. Later, it discloses the keys of the current time interval one after another. In the data buffering/verification phase, after receiving the broadcasted data, each node in a particular hop buffers the data until the associated key corresponding to the hop number and time interval is disclosed. If the data packet is authentic, it forwards it to the next hop. With an increase in the size of the network and the number of nodes, the protocol requires

Page 155: ADCOM 2009 Conference Proceedings

a larger number of hash key chains, which leads to a demand for more storage space. Hence, the protocol is not scalable and has a large storage overhead. Since each node stores one key of each key chain, authentication of the broadcast messages using each of the keys introduces extra computation overhead at each node. Even though the protocol claims that nodes need to buffer data for a duration less than the key disclosure delay as in µTESLA, it still suffers from delay in authentication.

Since symmetric key based cryptography is not suitable for broadcast authentication, most of the proposed protocols have used asymmetric key based cryptography. Among these protocols, a few use the public key concept to achieve asymmetric key based cryptography and the others use the hash key chain technique. The protocols which use the public key concept suffer from a large public key size and large computational overhead, and the protocols which use the hash key chain technique achieve broadcast authentication for single hop networks only, with the exception of the protocol proposed in [7]. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key, and also use this scheme to propose a broadcast authentication protocol for multi-hop wireless sensor networks using a multilevel hash key chain.

III. ASSUMPTIONS AND OBJECTIVES

In this section, we first state the assumptions about WSNs that we make for the proposed broadcast authentication protocol, and then give a brief description of the hash key chain which is used in our protocol. Finally, we briefly describe the objectives that we aim to achieve with our proposed scheme.

A. Assumptions

We make the following assumptions for our proposed protocol to ensure authentication of each node and also to broadcast the messages securely over the whole WSN.

• The WSN has a single base station, and the sensor nodes are static.
• Each sensor node has a unique ID.
• The WSN is connected, i.e., there always exists a path between any pair of sensor nodes.
• The base station is trusted, but the broadcast medium is not trusted, i.e., opponents can eavesdrop on the messages being transmitted.
• The base station is secure, i.e., no one can tamper with it or extract information from it, and it is sufficiently powerful to perform cryptographic computations.

B. Hash Key Chain

Fig. 1. Authentication request packet from node to a base station

A hash key chain of length n + 1 consists of a sequence of keys kn → kn−1 → kn−2 → . . . → k1 → k0, where kn is the initial key and each arrow denotes an application of an arbitrary hash function h, i.e., ki−1 = h(ki). An important property of a hash key chain is that ki−1 can be derived from ki (0 < i ≤ n) but not vice versa. The key k0 is referred to as the first key of the chain since it is used first in any application. Usually, the sender has the key chain, and the receivers have the first key k0. The signature on all messages signed with key ki by the sender can be verified at the receivers using key ki+1 after it becomes available at the receivers.
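As an illustration of these properties, the following sketch generates a chain and verifies a disclosed key (Python; SHA-256 is our assumed instantiation of h, which the text leaves arbitrary):

# Sketch of a hash key chain; h is assumed to be SHA-256 here.
import hashlib

def h(key):
    return hashlib.sha256(key).digest()

def make_chain(k_n, n):
    # Returns [k_n, k_{n-1}, ..., k_1, k_0], where k_{i-1} = h(k_i).
    chain = [k_n]
    for _ in range(n):
        chain.append(h(chain[-1]))
    return chain

def verify_disclosed(k_known, k_new):
    # A receiver holding an authentic k_{i-1} checks a newly disclosed k_i.
    return h(k_new) == k_known

chain = make_chain(b"initial secret k_n", 100)
k0, k1 = chain[-1], chain[-2]            # receivers are preloaded with k0
print(verify_disclosed(k0, k1))          # True: h(k1) = k0
print(verify_disclosed(k0, b"forged"))   # False: cannot fake the next key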

C. Objectives of the Proposed Protocol

We aim to achieve the following with the proposed protocol.
(i) Messages sent by each node of the WSN to the base station are fully authenticated at the base station.
(ii) Compromise of a single node should not affect other nodes of the WSN, i.e., other nodes are not compromised.
(iii) Intermediate nodes must not be able to modify the broadcast messages.

IV. THE PROPOSED PROTOCOL

In this section, we present our proposed scheme for broadcast authentication. Our proposed protocol uses a multilevel hash key chain and Diffie-Hellman keys for authentication of messages sent by sensor nodes to the base station, and also of messages broadcast by the base station. The broadcast authentication protocol proposed in this paper is based on a novel scheme using Diffie-Hellman keys to authenticate each sensor node at the base station. The scheme is described in the following subsection.

A. Authentication of Sensor Node at the Base Station

The scheme to authenticate sensor nodes at the base station uses a Diffie-Hellman key between a pair of nodes. A Diffie-Hellman key is generated from the multiplicative group Z*_p = {1, 2, . . . , p − 1}, where p is a large prime number. Let g be a generator of Z*_p. In this scheme, the base station is assigned a unique private key α, 1 < α < p − 1, and each sensor node i is assigned a unique private key βi, 1 < βi < p − 1. Now the following is preloaded in each sensor node i.

1. A key ki = f((g^βi mod p)^α mod p), where f is any suitable function. We refer to the key ki as the Diffie-Hellman key.
2. g^βi mod p

It is assumed that the βi's have been chosen in such a way that βi ≠ βj ⇒ ki ≠ kj. This property can be ensured at the time of generating the βi's, which would mean that each node has a unique key. The base station is preloaded just with α and p. A message M sent by a sensor node to the base station has the generic format shown in Figure 1.

It is important that g^βi mod p is not encrypted. The message M may or may not be encrypted according to the requirement. Upon receipt of the message M at the base station, the base station computes the key ki, and verifies the authenticity and

Page 156: ADCOM 2009 Conference Proceedings

the integrity of the message. The important features of the scheme are as follows.

1. The base station stores only two values, viz. α and p, to authenticate any message from any sensor node.
2. At the minimum, each sensor node only needs to compute the MAC of the message that it sends to the base station.
3. The private key βi of each node, and the prime p, are not stored in the sensor node i.
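A small numeric sketch of the scheme follows (Python; the toy prime and generator, and the choice of f as a SHA-256 hash of the shared value, are our assumptions for illustration only, as the paper leaves f and the parameters open):

# Sketch: node authentication at the BS via the preloaded Diffie-Hellman key.
import hashlib, hmac

p = 2**127 - 1           # toy prime, for illustration only
g = 3                    # assumed generator of Z*_p
alpha = 0x5DEECE66D      # BS private key, 1 < alpha < p - 1
beta_i = 0x1234567       # node i's private key (never stored on the node)

def f(x):
    return hashlib.sha256(x.to_bytes(16, "big")).digest()

# Preloaded in node i (note: beta_i and p themselves are NOT stored there):
g_beta = pow(g, beta_i, p)        # g^beta_i mod p, later sent in the clear
k_i = f(pow(g_beta, alpha, p))    # k_i = f((g^beta_i mod p)^alpha mod p)

# Node i sends [g_beta, M, MAC_{k_i}(M)] as in Figure 1:
M = b"sensor reading"
mac = hmac.new(k_i, M, hashlib.sha256).digest()

# The BS, preloaded only with alpha and p, recomputes k_i and verifies:
k_i_at_bs = f(pow(g_beta, alpha, p))
expected = hmac.new(k_i_at_bs, M, hashlib.sha256).digest()
print(hmac.compare_digest(expected, mac))   # True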

B. Informal Description of the Broadcast Authentication Protocol

Since sensor nodes are resource and energy constrained, it is important that any broadcast operation consume as little of each node's resources and energy as possible. Keeping this in mind, we initially construct a broadcast tree using which messages are broadcast. In order that the intermediate nodes which rebroadcast messages cannot modify the messages, we divide the sensor nodes into groups, and each group is initialized with a different hash key chain. Using this mechanism, we ensure that a message cannot be modified when it moves from a node in one group to a node in another group. We use the Diffie-Hellman key of each node to distribute the first key of the hash key chain of a group to each member node of the group. If the same tree is used repeatedly for broadcast, the energy of the internal nodes of the tree would dry up quite fast. Therefore, we periodically restructure the broadcast tree by taking into account the remaining energy of each node. After reconstruction, the nodes with higher remaining energy become the internal nodes. As a result, our broadcast authentication protocol consists of the following four phases, and they are elaborated upon in the following subsections.

1) Broadcast tree establishment and group formation.
2) Authentication and key distribution.
3) Message broadcast phase.
4) Periodic restructuring of the tree.

C. Broadcast Tree Establishment and Group Formation

In this section, we present an algorithm for the construction of the broadcast tree, using an approach similar to the one given in [14], and for dividing the sensor nodes into groups. The broadcast tree essentially has the following structure. The base station is at level 0, which is the highest level. All the sensor nodes which can be directly reached from the base station are at level 1, the next lower level. When the sensor nodes at level 1 broadcast a message, the new sensor nodes which receive these messages are at level 2, and so on. The algorithm for the construction of the broadcast tree given below designates some of the nodes with higher remaining energy as internal nodes of the tree at each level 1 ≤ i < n − 1, where n is the total number of levels in the tree. The value of n depends on the extent of the geographical area over which the sensor nodes have been deployed and the power used to broadcast messages, which essentially determines the communication range of a broadcast. When the internal nodes of the broadcast tree at level i broadcast a message, all the sensor nodes at level i + 1 receive the message. The procedure CBT given below describes the algorithm for the construction of the broadcast tree at each node i.

Procedure CBT;
begin
    Initialize its cost to ∞;
    SET_FLAG = false; ACK_NODE = -1; Timer_Flag = RESET;
    while (node i receives an ADV message from node j) do
        if (Timer_Flag = RESET) then
            Wait for more advertisement messages for time duration t1;
            Timer_Flag = SET;
        end
        if (cost_i > cost_j + 1/RE_i) then
            cost_i = cost_j + 1/RE_i;
            lev_i = lev_j + 1;
            SET_FLAG = true; ACK_NODE = src;
        end
        if ((Timer_Flag = SET) and (time duration t1 expired)) then
            Break;
        end
    end
    if (SET_FLAG = true) then
        Broadcast a new advertisement message having cost cost_i and level lev_i;
        Send acknowledgment to node ACK_NODE;
    end
    Wait for possible ACK message;
    if (ACK message received) then
        Node i is an internal node;
    else
        Node i is a leaf node;
    end
end
# cost_j = cost of node j from where the advertisement message has been received.
# RE_i = remaining energy of node i.
# cost_i = Σ_{j ∈ path} 1/RE_j

Algorithm 1: Broadcast tree construction phase of node i

given above describes the algorithm for the construction of the broadcast tree at each node i.

Each node in the WSN stores the parent node ID and its level number, along with the associated least cost of the path to the base station through the parent node. At the very beginning, each node except the base station sets its cost field, parent node ID, and level number to ∞, -1, and -1 respectively. The base station sets both its cost field and level number to 0 and sets its parent node ID to its own ID. Initially, the base station broadcasts an advertisement message ADV with its node ID, level number, and cost as shown in Figure 2. After receiving the first ADV message, a sensor node i waits for a fixed duration of time to receive additional ADV messages. It then chooses, among the nodes from which it received an ADV message, a


Page 157: ADCOM 2009 Conference Proceedings

Fig. 2. Advertisement(ADV) message format

node j that would result in a path with the least cost from the base station to the node i itself. When the node i broadcasts its ADV message, it piggybacks an ACK message for node j. When the node j receives the ACK, it comes to know that it is an internal node of the tree. If a node does not receive an ACK message, it concludes that it is a leaf node. The difference between leaf nodes and internal nodes of a broadcast tree is that a leaf node never rebroadcasts a broadcast message. The base station is initially given an estimate of how long it would take for the construction of the broadcast tree. After this duration elapses, the base station can start using the tree. To maintain the integrity of the ADV message, we can use a shared secret key for the whole network, which is preloaded before the deployment of the whole network. This key will not be required afterwards.
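As a small illustration of the parent-selection step in Algorithm 1, a sketch in Python (hypothetical message representation): node i accumulates ADV messages during the wait window t1 and keeps the sender minimising cost_j + 1/RE_i.

def choose_parent(adv_messages, re_i):
    # adv_messages: list of (sender_id, cost_j, lev_j) collected during t1.
    # re_i: remaining energy of node i.
    best, cost_i, lev_i = None, float("inf"), -1
    for sender_id, cost_j, lev_j in adv_messages:
        candidate = cost_j + 1.0 / re_i
        if cost_i > candidate:
            cost_i, lev_i, best = candidate, lev_j + 1, sender_id
    return best, cost_i, lev_i  # best is ACK_NODE in Algorithm 1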

To prevent the intermediate nodes from modifying a broadcast message, one could assign a different hash key chain to each level of the tree. But this would run into problems, as the number of levels is not known a priori, and the level of a node may also change after the reconstruction of the tree. Hence, we divide the sensor nodes into groups based on their levels in the initially constructed broadcast tree, as follows.

Let lev_i = level of node i. Then group_i, the group number of node i, is assigned as follows.

group_i = 0, if lev_i = 1
group_i = ((((lev_i − 1) % (MAX_GROUP − 1)) + 2) % (MAX_GROUP − 1)) + 1, if lev_i > 1    (1)

where MAX_GROUP is the maximum number of groups used for the whole network. The actual number of groups (ANOG) may be less than or equal to MAX_GROUP, and the groups are numbered from 0 to (ANOG−1). Equation (1) is illustrated by an example in Table I, with MAX_GROUP equal to 4.

TABLE I

lev_i   : 1 2 3 4 5 6 7 8 ...
group_i : 0 1 2 3 1 2 3 1 ...
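Equation (1) is simple to evaluate; a minimal sketch that reproduces Table I:

MAX_GROUP = 4  # maximum number of groups, fixed a priori

def group_of(lev_i: int) -> int:
    # Equation (1): level 1 maps to group 0; deeper levels cycle through 1..MAX_GROUP-1.
    if lev_i == 1:
        return 0
    return ((((lev_i - 1) % (MAX_GROUP - 1)) + 2) % (MAX_GROUP - 1)) + 1

print([group_of(l) for l in range(1, 9)])  # [0, 1, 2, 3, 1, 2, 3, 1]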

The construction of the broadcast tree is illustrated in Figure 3. From Figure 3, it is clear that the broadcast tree construction algorithm always chooses a path from the base station to a node whose intermediate nodes have higher remaining energy compared to the nodes in other possible paths. A packet from a node to the base station is sent along the tree towards the parent.

Fig. 3. Broadcast tree

Fig. 4. Authentication request(ARQ) message format

D. Authentication and Key Distribution

After the level number and group number are assigned to a sensor node i during the construction of the broadcast tree, it sends an authentication request (ARQ) message to the base station through its parent. On receipt of the authentication request message, the base station replies to the sensor node with an authentication request reply (ARR) message which, apart from other information, contains the first key of the hash key chain of the group to which the node i belongs. We first describe the format of ARQ and ARR messages, followed by the format of data packet (DP) messages, which are broadcast by the base station. The format of the ARQ message is as shown in Figure 4. Each node sends an ARQ message to the base station through the path created during the broadcast tree construction and group formation phase. To reduce collisions, the ARQ message from each node is sent after a random delay. After an ARQ message is received at the base station, it is authenticated as shown in Figure 5.

Upon receipt of an ARQ message from a node i, the base station sends an authentication request reply (ARR) message to the node with the first key of the hash key chain of the group to which the node i belongs. The format of the ARR message is as shown in Figure 6. The ARR message for node i contains the first key of the hash key chain of the first group and the first key of the hash key chain of the group to which node i belongs, encrypted with key ki. The use of these keys in the message broadcast phase is described in the next section. The base station generates a separate hash key chain for each group of nodes. We denote the hash key chain of group i as S_in →h→ S_i(n−1) →h→ S_i(n−2) →h→ ... →h→ S_i1 →h→ S_i0. The maximum number of groups MAX_GROUP is independent of the number of sensor nodes or the number of levels, and it is fixed a priori. The group of a node is as given by equation (1).
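A minimal sketch of such a chain, assuming SHA-256 as the hash h (the paper does not fix a specific hash): the base station generates the chain from a random seed S_in, and a node holding S_ik can authenticate a later-released S_i(k+1) with one hash evaluation.

import hashlib
import os

def make_key_chain(n: int, seed: bytes = None):
    # chain[0] = S_in, ..., chain[n] = S_i0, with S_i(k-1) = h(S_ik).
    chain = [seed or os.urandom(32)]
    for _ in range(n):
        chain.append(hashlib.sha256(chain[-1]).digest())
    return chain

def verify_released_key(held_key: bytes, released_key: bytes) -> bool:
    # A node holding S_ik accepts S_i(k+1) iff h(S_i(k+1)) == S_ik.
    return hashlib.sha256(released_key).digest() == held_key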

E. Message Broadcast Phase

After each node has received the ARR message, the base station can broadcast the data packet. The format of


Page 158: ADCOM 2009 Conference Proceedings

Fig. 7. Data Packet(DP) message format

Fig. 5. Authentication of individual node

Fig. 6. Authentication request reply(ARR) message format

the data packet is as shown in Figure 7. The data packet message contains the message type, the message, the actual number of groups (ANOG), the encryption of the next key for each group by the corresponding previous key, i.e., E_{S_ik}[S_i(k+1)], 0 ≤ i ≤ (ANOG−1), and a MAC of this information using each of the previous keys, as shown in Figure 7. To ensure confidentiality of messages, the required part of each data packet may be encrypted with S_0k; but if confidentiality of messages is not required, this encryption is not necessary. Upon receipt of a data packet message, a node can verify the authenticity of the message using its current key for the hash key chain, and also receive the next key of the chain. An intermediate node cannot modify the message, since it does not have the current key for each of the other groups. For the very same reason, it cannot extract the next key of other groups.
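A receiver-side sketch of this check, with assumed primitives (HMAC-SHA256 for the MAC, and the cipher left abstract as a decrypt callback), for a node in group i holding current key S_ik:

import hashlib
import hmac
from typing import Optional

def process_data_packet(current_key: bytes, msg: bytes, mac: bytes,
                        enc_next_key: bytes, decrypt) -> Optional[bytes]:
    # 1. Authenticity/integrity: only holders of S_ik can produce this MAC.
    expected = hmac.new(current_key, msg, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, mac):
        return None  # modified in transit by a node outside group i
    # 2. Recover the next key from E_{S_ik}[S_i(k+1)] and check it against the chain.
    next_key = decrypt(current_key, enc_next_key)
    if hashlib.sha256(next_key).digest() != current_key:
        return None  # not the hash-chain predecessor of S_ik
    return next_key  # becomes the node's new current key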

F. Periodic Restructuring of the Tree

Each broadcast tree contains a few internal nodes, and the rest of the nodes are leaf nodes. An internal node always consumes more energy, because it transmits its own packets as well as those of the leaf nodes. This reduces the remaining energy of the internal nodes. Hence, a node should not act as an internal node for a long duration. To overcome this problem, we restructure the broadcast tree periodically, so that the nodes which have acted as internal nodes in the current broadcast tree do not act as internal nodes in the next broadcast tree. This ensures that the whole network survives for a longer duration and all the nodes die at nearly the same time. Since the current key of the hash key chain of the first group is available at every node, it is used to ensure the integrity of the ADV message during the restructuring of the broadcast tree.

However, as we change the broadcast tree periodically, the level of each node may change. Since the group number of each node is generated from the node's level number in the initial tree, it remains unchanged under restructuring of the tree, and the authentication process remains unaffected by restructuring.

V. SECURITY AND PERFORMANCE ANALYSIS

In this section, we discuss the parameters that influence the security and performance of our protocol. In our protocol, we have used a parameter, called α, which strengthens the security. As mentioned earlier, α is preloaded and known only to the base station. After completion of broadcast tree construction and group formation, every node i sends an authentication request packet to the base station as shown in Figure 4. This authentication request packet contains the node ID, level number, g^βi mod p, and a message authentication code (MAC). If any attacker gets this packet, it will not be useful to him, because the attacker cannot get any key or generate any key from this packet even if the packet is not encrypted. If he tries to modify the packet, the base station can easily identify the modified packet by computing the MAC using a key tmp_key_i, which has been calculated by the base station as shown in Figure 5. Since the attacker does not know the key ki, he is unable to compute the MAC. Hence, to break the security, α must be known to the attacker, which is not possible because α is preloaded and known only to the base station.

Now we discuss the security of the data packets, which are broadcast by the base station after the authentication and key distribution phase.

Let us assume that the broadcast packet is received by some attacker in the first hop. Suppose he tries to change the message msg0. If he changes the message, he would have to change the MACs, including MAC_{S1,0}(x). But we have mentioned earlier that the key chain of the first group is known only to the receivers which have already been authenticated


Page 159: ADCOM 2009 Conference Proceedings

TABLE II. Overhead at the base station

                                   Proposed Protocol                               H2BSAP
Computation overhead per packet    MAX_GROUP × MAC_op + (MAX_GROUP + 1) × ENC_op   l × MAC_op
Transmission overhead per packet   MAX_GROUP × |MAC| + [MAX_GROUP × |key|]         l × |MAC| + [l × |key|]
Storage overhead                   MAX_GROUP × (n + 1) × |key|                     l × (n + 1) × |key|

by the base station as group one nodes. But, to compute MAC_{S1,0}(x), he has to know the key S1,0, which is not possible. If the attacker modifies the broadcast packet using its own key and forwards the modified packet to the next level, then the next-level nodes would identify the incorrect packet through their MAC if their group is different from that of the sender, and reject the packet.

It is possible that two different nodes in consecutive levels belong to the same group after restructuring of the broadcast tree. In this case, some nodes in the next lower level may not be able to detect a modification. But other nodes will be able to detect it, and the event can be reported to all the nodes in the vicinity. Nonetheless, this situation would be very rare, as only the internal nodes broadcast, and it is very unlikely that a node would become an internal node and its level would also get increased or decreased. Our protocol is also fault-tolerant to node failures, as periodic restructuring of the broadcast tree eliminates failed nodes from the internal nodes of the tree. Besides, even if an internal node has failed, a broadcast might still go through other neighbouring internal nodes.

Tables II and III show the comparisons between our proposed protocol and the H2BSAP protocol [7] with respect to computation, transmission, and storage overhead at the base station and at other nodes, respectively. Table IV describes the notations used in Tables II and III. As compared to the H2BSAP protocol, our proposed protocol uses a smaller number of key chains, which depends upon MAX_GROUP, whereas in H2BSAP the maximum number of key chains depends on the depth of the network. Since MAX_GROUP is fixed, our proposed protocol can support a network of any depth, but H2BSAP can support a maximum of up to 15 hops, as given in [7]. As we have already discussed, each node in the proposed protocol keeps only two keys, i.e., the key of the key chain of the first group and the key of the key chain of the group to which it belongs. Hence, the proposed protocol always has less overhead as compared to H2BSAP, and authentication in the proposed protocol is always immediate.

VI. SIMULATION STUDIES

We have studied the performance of our broadcast authentication scheme using the Castalia simulator [15]. All the nodes in the network are randomly deployed. For the simulation purposes, we vary the number of nodes from 50 to 225 in

TABLE III. Overhead at other nodes

                                   Proposed Protocol                          H2BSAP
Computation overhead per packet    MAC_op + 2 × DEC_op                        MAC_op + [l × Hash_op]
Transmission overhead per packet   MAX_GROUP × |MAC| + [MAX_GROUP × |key|]    (l − r) × |MAC| + [l × |key|]
Storage overhead                   2 × |key|                                  l × |key|

TABLE IV. Notations

MAC_op: MAC operation               Hash_op: hash operation
ENC_op: encryption operation        DEC_op: decryption operation
|key|: key length                   |MAC|: MAC length
n: size of key chain                l: maximum hop count
MAX_GROUP: maximum no. of groups    r: hop distance from the BS

a fixed area of 100 × 100 m². The network parameters, such as transmission range, transmission rate, sensitivity, transmission power, etc., for this simulation study are similar to the parameters specified in the CC2420 [16] data sheet and the TelosB [17] data sheet. We have taken the initial energy of each node to be 29160 joules for 2 AA batteries, as given in the Castalia simulator. The energy consumption for the different radio modes used in this simulator is given in Table V. For this simulation, we assume that the clocks of all the nodes are synchronized. The simulation was carried out for both a realistic as well as an ideal channel. We have used the TelosB node hardware platform specification for our simulation and have also used the "tunable protocol" provided by Castalia as the MAC layer protocol. A broadcast packet is generated randomly with uniform distribution in every 2-second interval at the base station.

Figure 8 shows the total number of transmissions needed to broadcast a packet for different sizes of the network. In this figure, we have compared our protocol with the previously proposed scheme with respect to the number of transmissions made. Our approach gives better performance as compared to the previously proposed scheme, because our approach generates a broadcast tree in which only the internal nodes forward the packet over the network. This reduces the number of transmissions required to broadcast a packet over the network.

Figure 9 shows the number of authenticated nodes and the average number of nodes that received the broadcast packet for different sizes of WSNs. From this figure, we can see that with an increase in the density of the network, the percentage of authenticated nodes decreases. This happens due to collisions of the authentication request packets. It can be reduced by increasing the random delay before the authentication request packet is transferred.

Figure 10 shows the minimum, average, and maximum delivery time of a broadcast packet in the network for different node densities of the network.

VII. CONCLUSIONS

In this paper, we have proposed a broadcast authentication protocol for multi-hop wireless sensor networks using Diffie-Hellman keys and hash key chains. The protocol is based on


Page 160: ADCOM 2009 Conference Proceedings

TABLE V. Radio Characteristics

Radio mode    Energy Consumption (mW)
Transmit      57.42
Receive       62
Listen        62
Sleep         1.4

Fig. 8. Number of transmissions required to broadcast a packet, with flooding and without flooding (x-axis: number of nodes, 50–225; y-axis: number of packets transmitted, 0–2500).

a novel scheme to authenticate each sensor node at the base station using a Diffie-Hellman key. For this purpose, the base station needs to store only two values instead of a separate shared secret key for each node. Compromising a single node will not affect other nodes, and even a compromised node will not be able to do any damage as far as the broadcast is concerned. The proposed protocol exhibits many nice properties, including individual authentication, instant authentication, and low overhead in communication and storage. It also improves over the existing broadcast authentication schemes in many aspects.

REFERENCES

[1] X. Ren, "Security methods for wireless sensor networks," in Mechatronics and Automation, Proceedings of the 2006 IEEE International Conference on, June 2006, pp. 1925–1930.

[2] T. Zia and A. Zomaya, "Security issues in wireless sensor networks," in Systems and Networks Communications, 2006. ICSNC '06. International Conference on, Oct. 2006, pp. 40–40.

[3] Y. Wang, G. Attebury, and B. Ramamurthy, "A survey of security issues in wireless sensor networks," Communications Surveys & Tutorials, IEEE, vol. 8, no. 2, pp. 2–23, 2006.

[4] A. Perrig, R. Szewczyk, V. Wen, D. Culler, and J. D. Tygar, "SPINS: security protocols for sensor networks," in MobiCom '01: Proceedings of the 7th Annual International Conference on Mobile Computing and Networking. New York, NY, USA: ACM Press, 2001, pp. 189–199. [Online]. Available: http://dx.doi.org/10.1145/381677.381696

[5] S. Yamakawa, Y. Cui, K. Kobara, and H. Imai, "Lightweight broadcast authentication protocols reconsidered," in Wireless Communications and Networking Conference, 2009. WCNC 2009. IEEE, April 2009, pp. 1–6.

[6] P. Ning, A. Liu, and W. Du, "Mitigating DoS attacks against broadcast authentication in wireless sensor networks," ACM Trans. Sen. Netw., vol. 4, no. 1, pp. 1–35, 2008.

[7] C. Bekara, M. Laurent-Maknavicius, and K. Bekara, "H2BSAP: A hop-by-hop broadcast source authentication protocol for WSN to mitigate DoS attacks," in Communication Systems, 2008. ICCS 2008. 11th IEEE Singapore International Conference on, Nov. 2008, pp. 1197–1203.

[8] H. Krawczyk, M. Bellare, and R. Canetti, "HMAC: Keyed-hashing for message authentication," Internet RFC 2104, February 1997.

[9] R. H. Brown and A. Prabhakar, "Digital signature standard (DSS)." [Online]. Available: http://www.itl.nist.gov/fipspubs/fip186.htm

Fig. 9. Number of authenticated nodes, and average number of nodes that received the broadcast packet (x-axis: number of nodes, 50–225; y-axis: count, 20–180).

Fig. 10. Minimum, average and maximum delivery time for a broadcast packet (x-axis: number of nodes, 50–225; y-axis: delivery time in seconds, 0–0.04).

[10] D. Liu and P. Ning, "Multilevel µTESLA: Broadcast authentication for distributed sensor networks," ACM Trans. Embed. Comput. Syst., vol. 3, no. 4, pp. 800–836, 2004.

[11] A. Perrig, "The BiBa one-time signature and broadcast authentication protocol," in CCS '01: Proceedings of the 8th ACM Conference on Computer and Communications Security. New York, NY, USA: ACM, 2001, pp. 28–37.

[12] L. Reyzin and N. Reyzin, "Better than BiBa: Short one-time signatures with fast signing and verifying," in ACISP '02: Proceedings of the 7th Australian Conference on Information Security and Privacy. London, UK: Springer-Verlag, 2002, pp. 144–153.

[13] S.-M. Chang, S. Shieh, W. W. Lin, and C.-M. Hsieh, "An efficient broadcast authentication scheme in wireless sensor networks," in ASIACCS '06: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security. New York, NY, USA: ACM, 2006, pp. 311–320.

[14] F. Ye, A. Chen, S. Lu, and L. Zhang, "A scalable solution to minimum cost forwarding in large sensor networks," in Computer Communications and Networks, 2001. Proceedings. Tenth International Conference on, 2001, pp. 304–309.

[15] "Castalia: a simulator for wireless sensor networks," http://castalia.npc.nicta.com.au/pdfs/Castalia User Manual.pdf.

[16] "CC2420 data sheet," http://www.stanford.edu/class/cs244e/papers/cc2420.pdf.

[17] "TelosB data sheet," http://www.xbow.com/Products/Product pdf files/Wirelesspdf/TelosBDatasheet.pdf.


Page 161: ADCOM 2009 Conference Proceedings

ADCOM 2009
GRID SCHEDULING

Session Papers:

1. Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri “Energy-efficient Scheduling of Grid Computing Clusters”

2. Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S., "Energy Efficient High Available System: An Intelligent Agent Based Approach"

3. Amit Agarwal and Padam Kumar “A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments”


Page 162: ADCOM 2009 Conference Proceedings

Energy-efficient Scheduling of Grid Computing Clusters

Tapio Niemi
Helsinki Institute of Physics, Technology Programme
CERN, CH-1211 Geneva 23
[email protected]

Jukka Kommeri
Helsinki Institute of Physics, Technology Programme
CERN, CH-1211 Geneva 23
[email protected]

Ari-Pekka Hameri
HEC, University of Lausanne, CH-1015 Lausanne, Switzerland
[email protected]

Abstract—Energy efficiency is an increasingly important component of computation costs in scientific computing. We have studied different scheduling settings with different hardware for high-throughput computing, trying to minimise the electricity usage of computing jobs. Instead of the common practice of one-task-per-CPU-core scheduling in grid clusters, we have tested variations of different scheduling methods based on the idea of fully loading the computing nodes. Our tests showed that running multiple tasks simultaneously can decrease energy usage per computing task by over 40% and improve the throughput of the computing node by up to 100% when running a high-energy physics (HEP) analysis application. The trade-off is that the processing times of individual tasks are longer, but in cases such as HEP computing, in which the tasks are not time critical, only the total throughput is important.

I. INTRODUCTION

Energy consumption has become one of the main costs of computing, and several methods to improve the situation have been suggested. The focus of research has been on hardware and infrastructure aspects. Most of the computing centres and computing clusters of research institutes focus on high-performance computing, trying to optimise the processing time of individual computing jobs. Jobs can have strict deadlines or require massive parallelism. In high-throughput computing, the aim is slightly different, since individual jobs are not time critical and the aim is to optimise the total throughput over a longer period of time.

In computing-intensive sciences, such as high-energy physics (HEP), energy-efficient solutions are important. For example, the Worldwide LHC Computing Grid (WLCG), to be used to analyse the data that the Large Hadron Collider of CERN will produce, includes tens of thousands of CPU cores. At this scale, even a small system optimisation can offer noticeable energy and cost savings. Since scientific computing, and especially high-energy physics computing, has special characteristics, energy-optimisation methods can also be tailored for it. The main characteristics in this sense are: large sets of similar kinds of jobs, data-intensive computing, no time criticality, no preceding conditions, and no intercommunication between jobs or their tasks, i.e. high parallelism. In spite of this special nature, improving energy efficiency in cluster and grid computing for HEP has mostly focused on the same infrastructure issues as in general HPC computing, such as cooling and purchasing energy-efficient hardware. As far as we know, there are not many studies focusing on optimising the system configuration and scheduling settings for grid computing.

In this paper, we focus on a typical grid computing problem: how to process a large set of jobs efficiently. We try to optimise the energy efficiency and the total processing time of the set of jobs by choosing an optimal scheduling policy. In this sense our focus is closer to high-throughput computing than high-performance computing. Basically, the problem is similar to production management in any manufacturing process. This kind of optimisation problem can lead to a trade-off situation: improving energy efficiency can weaken throughput. However, our tests indicated that these two aims are not necessarily contradictory, meaning that optimising system throughput also improves its energy efficiency.

Our method is based on the observation that computers should run at full power or be turned off, since the fixed power consumption is around 50% of the full power of the server. Since computers can run multiple tasks simultaneously in a CPU core using time-sharing techniques, this naturally leads to a load-based scheduling policy. Our previous tests [1] indicated that the load should not mean only the processor load but all components of the computer, including memory usage, processor load, and I/O traffic. In the current study we tested different computing hardware: single-core servers, low-energy mini-PCs, and modern multicore systems commonly used in computer centres. Our test software included applications utilising different resources of the computer and a HEP analysis application.

The basic terminology used in this paper is:
- A task is the smallest entity of processing work. The task starts, retrieves/reads its possible input file, processes the data, and possibly writes its output file.
- A job is a collection of tasks. In the general situation, tasks can have precedence relations, but in our case tasks are independent.


Page 163: ADCOM 2009 Conference Proceedings

- A computing node, i.e. node, is a part of the computing cluster. It has one or more CPU cores, a fixed amount of memory and disk space, and a network connection with some fixed capacity. The node schedules its jobs independently to its CPU cores.
- Energy efficiency means how many similar jobs can be processed using the same amount of electricity.
- Computing efficiency, i.e. the system throughput, means how many similar jobs can be processed in a time unit.

The paper is organised as follows. In the Background section we explain the common concepts of scheduling, and Section III reviews related literature. After that, the methodology used is described in Section IV. The tests are explained in Section V and the results in Section VI. Finally, conclusions are given in Section VII.

II. BACKGROUND

A. Scheduling

Scheduling determines in which order and to which computing nodes computing tasks should be allocated. How individual computers schedule their own processes is outside our topic. Scheduling problems can be classified according to the following properties:
- online / offline
- knowledge of the jobs
- knowledge of the computing resources

There is a lot of research on scheduling multiprocessor systems. More formally (e.g. following [2] or [3]), the scheduling problem can be defined as follows: We have m machines Mj (j = 1, ..., m) (i.e. computing nodes in our case) and n jobs Ji (i = 1, ..., n) to be processed. A schedule S is an allocation of time intervals on machines for each job. The challenge is to find an optimal schedule for the jobs when certain constraints exist. A schedule is called optimal if it minimises a given optimality criterion. The criterion can be, for example, time, cost, or the usage of some resource. Table I illustrates a schedule.

M1 | J1 |    |
M2 | J2 | J2 | J1
M3 | J3 | J3 | J3
         Time →

TABLE I
A SCHEDULE

The optimality criterion can be defined in several ways. If the finishing time of job Ji is denoted by Ci, its cost is denoted by fi(Ci). The usual cost functions are called bottleneck objectives and sum objectives. The bottleneck is the maximum value of the cost functions over all jobs, while the sum is the summed value. The cost function can be defined in several ways; the most common ones are makespan, total flow time, and weighted flow time. When designing a schedule, there are different objective functions to be minimised, such as the completion time of the last job or the total completion time, i.e. the sum of all completion times.
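In this notation, the two families of objectives can be written compactly (a standard formulation following [2], [3], added here for concreteness):

\[ \max_{1 \le i \le n} f_i(C_i) \quad \text{(bottleneck objective)}, \qquad \sum_{i=1}^{n} f_i(C_i) \quad \text{(sum objective)}, \]

with the makespan \(C_{\max} = \max_i C_i\) and the total flow time \(\sum_i C_i\) obtained by taking \(f_i(C_i) = C_i\).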

Often there are several objectives, such as processing time and energy efficiency in our case. Then the overall objective is a (weighted) sum of the sub-objectives. This often leads to a Pareto-optimal schedule.

B. Cluster Schedulers

There are various batch scheduling systems – also called job schedulers or distributed resource management systems – available, such as Torque 1, OpenPBS 2, LSF 3, Condor 4, and Sun Grid Engine 5. These systems have different features, but their basic functionality is very similar.

In our tests we used the Sun Grid Engine (SGE) [4] of Sun Microsystems, which is also commonly used in grid computing clusters. It has various features to control scheduling. Scheduling is done based on the load of the computing nodes and the resource requirements of the jobs. SGE supports checkpointing and the migration of checkpointed jobs among computing nodes. In addition to batch jobs, interactive and parallel jobs are also supported. SGE accounts for the resources, such as CPU, memory, and I/O, that a job has used. SGE contains an accounting and reporting console ("ARCo") that stores accounting data in an SQL database for later analysis.

In SGE, scheduling is done at fixed intervals (the default setting) or triggered by some event, such as a new job submission. The scheduler finds the best nodes for pending jobs based on, for example, the resource requirements of the job, the load of the nodes, and the relative performance of the nodes. By default the scheduler dispatches the jobs to the queues, i.e. nodes, in the order in which they have arrived. If several queues are identical, the selection is random.

It is also possible to change the scheduling algorithm, but only one algorithm is shipped with the default distribution. However, the scheduling can be controlled in four ways: 1) dynamic resource management, 2) queue sorting, 3) job sorting, and 4) resource reservation. Here we focus only on queue and job sorting. In queue sorting, the queue instances of computing nodes are ranked in the order in which the scheduler should use them. The ranking possibilities for queues are, for example: system load, scaled system load, user-defined system load, or fixed order. The job sorting can be done, for example, in the following ways: ticket-based job priority, urgency- or POSIX-based priorities, and user- or group-based quotas.

III. RELATED WORK

Venkatachalam and Franz [5] give a detailed overview of techniques that can be used to reduce the energy consumption of computer systems. There are several studies on different parts of the topic, such as optimising processors by dynamic voltage scaling (e.g. [6] and [7]); optimising disk systems (e.g. [8],

1 www.clusterresources.com/pages/products/torque-resource-manager.php
2 http://www.openpbs.org
3 www.platform.com/Products/Workload-Management
4 www.cs.wisc.edu/condor
5 http://gridengine.sunsource.net


Page 164: ADCOM 2009 Conference Proceedings

[9], and [10]); network optimisation (e.g. [11]); and compilers (e.g. [12]). There are also several studies on pure energy issues. For example, Lefurgy et al. [13] suggest a method to control the peak power consumption of servers. The method is based on power measurement information from each computing server. Controlling peak power makes it possible to use smaller and more cost-effective power supplies.

Scheduling is a widely studied topic, but there is little work on scheduling as an energy saving method. Instead, some works suggest clearly opposite approaches: for example, Koole and Righter [14] suggest a scheduling model in which tasks are replicated to several computers. However, the authors do not estimate how much more resource is needed when the same tasks (or at least parts of them) are computed several times. Fu et al. [15] present a scheduling model that is able to restart batch jobs. They give an efficient algorithm to solve the problem, but they do not touch on resource usage.

There also exist several studies relevant to our topic. Kurowski et al. [16] study two-level hierarchical grid scheduling. Their approach takes into account all stakeholders of grid computing systems. It does not require the time characteristics of jobs to be known, and in it a set of jobs at the grid level is scheduled simultaneously to the local computing resources.

Edmonds [17] studies non-clairvoyant scheduling in multiprocessor environments. In his model, the jobs can have arbitrary arrival times, and execution characteristics can change.

Wang et al. [18] have studied optimal scheduling methods in the case of identical jobs and different computers. They aim to maximise the throughput and minimise the total load. They give an online algorithm to solve the problem.

Shivam et al. [19] present a learning scheduling model, while Srinivasa Prasanna and Musicus [20] give a theoretical scheduling model in which the number of processors allocated to a task can be a continuous variable, and it is possible to allocate all processors to one task if needed.

Medernach [21] has studied the workload of a grid computing cluster in order to compare different scheduling methods. The idea of the work was to find ways in which the users of the cluster can be grouped to characterise their usage. The scheduling is based on the one-job-per-CPU-core idea.

Etsion and Tsafrir [22] compared commercial workload management systems, focusing on their scheduling systems and default settings. According to the authors, the default settings are often used by administrators as-is, or they are only slightly modified.

Aziz and El-Rewini [23] have studied online scheduling algorithms based on evolutionary algorithms in the grid context. Ges et al. [24] have studied the scheduling of irregular I/O-intensive parallel jobs. They note that CPU load alone is not enough, but all other system resources (memory, network, storage) must be taken into account in scheduling decisions. Santos-Neto et al. [25] have studied scheduling in the case of data-intensive data mining applications.

IV. METHODOLOGY

A. Problem Description

We assume having a large set of tasks – large compared to the number of CPU cores available – organised as a job. Further, we assume that jobs do not have deadlines, all of them arrive at time zero, there are no precedence relations between them or among the tasks in them, and the number of tasks is much larger than the number of computing nodes. These assumptions are usually true in HEP computing, and they make the needed scheduling algorithm simpler. Figure 1 illustrates the situation.

Our scheduling problem can be divided into two independent steps:
1. Finding the optimal load combination for the computing node. Optimal means that a job can be run using the smallest possible amount of energy in the minimal time.
2. Scheduling jobs to the computing nodes in such a way that all computing nodes are as close as possible to the optimum state (i.e. Step 1).

In an important special case, in which all tasks inside a job are identical, the problem simplifies into the form: how many tasks to run simultaneously on a computing node. Then the problem is how to measure the load of the node and define the optimum load level. Generally, it is important to notice that we did not try to minimise the processing time of an individual task but that of a large job containing several tasks. We do not include the local process scheduling on the node in our study.
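As an illustration of Step 1, a minimal sketch (hypothetical data layout, not the authors' tooling) of choosing the tasks-per-core setting that minimises energy per job from measured runs, breaking ties in favour of throughput:

def best_tasks_per_core(measurements):
    # measurements: list of (tasks_per_core, jobs_per_hour, wh_per_job) tuples.
    return min(measurements, key=lambda m: (m[2], -m[1]))[0]

# With figures of the kind reported later in Table IV (Opteron, physics jobs):
print(best_tasks_per_core([(1, 188, 1.34), (3, 376, 0.77)]))  # -> 3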

Fig. 1. Scheduling system: a cluster-level scheduler dispatches jobs from the cluster queue to node queues; each node's scheduler allocates its own resources (CPU cores, memory, disk, network) at node level.

Briefly, our hypothesis is:
Running several tasks simultaneously in a CPU core improves energy efficiency and throughput compared to running only one task.


Page 165: ADCOM 2009 Conference Proceedings

Possible reasons for this can be:
1. If tasks have I/O access, there would be idle time for the CPU core because of slow disks and network.
2. If tasks have intensive memory access, there would be idle time for the CPU core because of slow main memory access.

B. Test Method and Environment

Our test method was to execute a job, i.e. a large set of tasks, and measure the time and electricity consumed during the test run. The same test job was run with different cluster configurations to find the optimum one. We ran our tests on different test environments:

- A Xeon test cluster including one front-end and three computing nodes running Sun Grid Engine [4]. The nodes had two single-core Intel Xeon 2.8 GHz processors (Supermicro X6DVL-EG2 motherboard, 2048 KB L2 cache, 800 MHz front-side bus) with 2 gigabytes of memory and 160 gigabytes of disk space.
- A Dell PowerEdge SC1435 computer with two 4-core AMD Opteron 2376 2.3 GHz processors and 32 gigabytes of memory.
- A cluster of three EeeBox mini computers with an Intel Atom N270 and 2 gigabytes of memory.

The operating system used with the Xeon and Opteron machines was Rocks 5.0 with kernel version 2.6.18. The EeeBoxes required newer drivers, so with them Rocks 5.2 with kernel version 2.6.24.4 was used. The effect of the kernel was tested and found to be nonexistent. The electricity consumption of the computing nodes was measured with a Watts Up Pro electricity meter. We tested the accuracy of our test environment by running the same tests several times with exactly the same settings. The differences between the runs were around ±1%, both in time and in electricity consumption. We developed and customised some tools to make the testing process easier. The test runs were submitted using a Perl script that automatically set the wanted cluster parameters and stored all information in a relational database. We assumed knowing the type of a job and the characteristics of the hardware in advance. The test applications were real HEP analysis applications and dummy test applications simulating CPU-intensive, memory-intensive, and disk-intensive applications.

V. TESTS

To find out why our hypothesis is valid, we formed tests to test our assumptions:
1. I/O access delays: we used two similar test applications, one having intensive I/O access and the other one no I/O access at all.
2. Memory access delays: we used two similar test applications, one having intensive memory access and the other one using very little memory.

We tested two different scenarios:
1. In the simplest case all tasks, all jobs, and all computing nodes were identical.
2. In the second case we had different jobs but identical computing nodes. Then the problem is how to allocate tasks to computing nodes.

The following scheduling methods were tested, focusing on the first two mentioned resources:
- The default scheduling settings of SGE, with job slots equal to the number of CPU cores. Currently this is often used in grid clusters.
- Slot-based: a fixed number of jobs per CPU core (sketched below).
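A minimal sketch of the contrast (hypothetical structures, not SGE internals): the default policy admits one task per core, while the slot-based policy admits up to slots_per_core tasks per core.

def dispatch(pending, nodes, slots_per_core=1):
    # pending: list of task IDs; nodes: dict name -> {"cores": c, "running": r}.
    placement = []
    for task in list(pending):
        for name, state in nodes.items():
            if state["running"] < state["cores"] * slots_per_core:
                state["running"] += 1
                placement.append((task, name))
                pending.remove(task)
                break
    return placement

With slots_per_core=1 this mimics the default one-job-slot-per-core setting; slots_per_core=3 approximates the load levels found to be optimal for the multicore machine later in the paper.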

A. Basic Tests

We used the following applications for the basic tests (a sketch of these dummy loads is given after the list):
1. The I/O test application was a simple program that wrote and read 300 MB files multiple times. First it created a file containing numbers generated using the process ID as a "random" seed, making all files unique. Files were named using the PID to avoid simultaneous writing/reading of the same file. After generating the file, the contents were copied to another file 20 times. Each time was a bit different (a small shift in numerical values) to avoid buffering.
2. The CPU test application used here was a long loop calculating floating-point multiplications. A remainder of the index variable was also used to make compiler optimisation harder.
3. The memory test application reserved memory (200 MB) and filled it with numbers. After that it read a part of the memory and wrote it to another part. This was done multiple times.
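A minimal sketch of dummy loads in the same spirit (Python stand-ins; sizes and repetition counts follow the descriptions above, all other details are assumptions):

import os

def io_load(size_mb=300, copies=20):
    # Write a file of roughly size_mb MB, unique to this process, then copy it.
    name = f"io_test_{os.getpid()}.txt"
    chunk = (str(os.getpid()) + " ") * (1024 * 1024 // 8)  # roughly 1 MB of text
    with open(name, "w") as f:
        for _ in range(size_mb):
            f.write(chunk)
    for c in range(copies):  # the original shifted values per copy to defeat buffering
        with open(name) as src, open(f"{name}.{c}", "w") as dst:
            dst.write(src.read())

def cpu_load(n=10**7):
    # Long floating-point loop; the modulo on the index hinders optimisation.
    x = 1.0
    for i in range(n):
        x = x * 1.000001 if i % 7 else x * 0.999999
    return x

def mem_load(size_mb=200, rounds=10):
    # Reserve size_mb MB, then repeatedly read one part and write it to another.
    buf = bytearray(size_mb * 1024 * 1024)
    half = len(buf) // 2
    for _ in range(rounds):
        buf[half:] = buf[:half]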

We performed two types of tests: 1) running identical test applications, and 2) mixing the applications. In the mixed test we submitted test applications to the test clusters in the following order: CPU, memory, I/O, and CPU, i.e. two CPU tests per one set of memory and I/O tests.

B. Test with Physics Software

We used a CMS analysis application in our tests. The input data for the test was from the CRAFT (CMS Running At Four Tesla) experiment, which used cosmic ray data recorded with the CMS detector at the LHC during 2008 [26]. The detector was used in a similar way as in the future LHC experiments. Our test application uses the CMSSW framework, and it is very close to the analysis applications that will be used


Page 166: ADCOM 2009 Conference Proceedings

Fig. 2. I/O usage of a single HEP analysis job (x-axis: time in minutes, 0–5; y-axis: write and read speed in blocks/s, 0–1800).

Fig. 3. Memory usage of a single HEP analysis job (x-axis: time in minutes, 0–6; y-axis: used memory in megabytes, 0–350).

when LHC collision data is available. The analysis software reads the input file (350 MB), performs the analysis, and writes the output file (1.6 MB) to the local disk. The disk I/O during the application execution is shown in Figure 2 and the memory usage in Figure 3.

VI. RESULTS

The results of our basic tests are shown in Table II, those of the mixed basic tests in Table III, and those of the physics analysis tests in Table IV. Generally, running more than one task in a CPU core improved throughput and decreased electricity consumption. Figure 4 illustrates this in the case of the multicore computer. However, the improvements depended heavily on the application and the hardware. In the modern multicore environment, running 3–4 tasks per CPU core gave the best results, while in the older single-core cluster 1–2 tasks per CPU were the best. The single-core cluster was also the only environment in which running multiple tasks can decrease the efficiency. This can partially be related to the low amount of memory.

There are big differences in energy efficiency among different hardware. Figure 5 illustrates this in our test environment: the modern multicore computer is over 7 times more energy efficient than the older single-core one. Even the modern low-energy mini-PC is not very energy-efficient in heavy computing tasks. However, because of its very low power consumption and low idle power, it could be efficient in some other tasks.

According to our tests, a multicore computer with a sufficient amount of memory is the best hardware for physics computing. With this hardware, running multiple analysis tasks per CPU core gave the best improvement: running three simultaneous tasks per core increased throughput by 100% and decreased electricity consumption by 43% compared to the situation of running only one task per CPU. It is remarkable that the one-task-per-CPU-core configuration in the multicore environment uses more energy per task than the optimised mini-PC environment does.

Fig. 4. Improvements on 2x4 core machine

Fig. 5. Wh/job performance in physics jobs


Page 167: ADCOM 2009 Conference Proceedings

TABLE II. Basic tests

Hardware           Test    Environment  Jobs/core  Jobs/hour  Wh/job  Avg. W/core  Avg. W/node
Xeon cluster       Memory  normal       1          60         11.88   119.17       715
                           optimal      3          91         8.44    127.50       765
                   Disk    normal       1          119        6.06    119.67       718
                           optimal      1          119        6.06    119.67       718
                   CPU     normal       1          176        3.79    111.33       668
                           optimal      1          176        3.79    111.33       668
2x4 core Opteron   Memory  normal       1          90         2.91    32.50        260
                           optimal      3          91         2.88    32.75        262
                   Disk    normal       1          320        0.79    31.25        250
                           optimal      4          493        0.59    36.25        290
                   CPU     normal       1          407        0.6     30.38        243
                           optimal      3          461        0.56    31.75        254
Eeebox cluster     Memory  normal       1          22         2.05    14.83        44.5
                           optimal      4          32         1.48    15.27        45.8
                   Disk    normal       1          38         1.27    15.53        46.6
                           optimal      3          65         0.76    16.13        48.4
                   CPU     normal       1          24         1.89    14.70        44.1
                           optimal      2          30         1.52    14.73        44.2

TABLE III. Mixed basic tests

Hardware           Jobs/core  Jobs/h  Wh/job  Jobs/h/node  Avg. power/node (W)  %-throughput  %-electricity
Xeon cluster       4          95      7.18    31.67        227.00               17%           -14%
2x4 core Opteron   4          258     0.77    258.00       258.00               38%           -22%
Eeebox cluster     3          43      1.09    14.33        15.27                105%          -50%

TABLE IV. Physics test results

Hardware           Environment  Jobs/core  Jobs/hour  Wh/job  %-throughput  %-electricity
Xeon cluster       normal       1          127        5.65
Xeon cluster       optimal      2          124.3      5.61    -2.1%         -0.8%
2x4 core Opteron   normal       1          188        1.34
2x4 core Opteron   optimal      3          376        0.77    100%          -42.54%
Eeebox cluster     normal       1          33         1.53
Eeebox cluster     optimal      3          45         1.24    36.6%         -19.0%


Page 168: ADCOM 2009 Conference Proceedings

VII. CONCLUSION AND FUTURE WORK

Our tests showed that both energy efficiency and throughput can be remarkably improved by running several tasks simultaneously in each CPU core. However, the results depended highly on the hardware used. The biggest improvements in physics computing, in both throughput (100%) and energy consumption (43%), were achieved on the modern 2 × 4 CPU core computer. The same hardware was clearly the most energy efficient, too. Our future work includes the development of a load-based scheduler that is able to automatically find the best number of tasks per computing node by using the utilisation rates of the different resources of the computing system.

ACKNOWLEDGEMENTS

We would like to thank Arttu Klementtila, who helped us implement a part of the test applications and execute the test runs, and Magnus Ehrnrooth's Foundation for a grant for the test hardware.

REFERENCES

[1] T. Niemi, J. Kommeri, K. Happonen, J. Klem, and A.-P. Hameri, "Improving energy-efficiency of grid computing clusters," in Advances in Grid and Pervasive Computing, 4th International Conference, GPC 2009, Geneva, Switzerland, 2009, pp. 110–118.

[2] M. Pinedo, Scheduling: Theory, Algorithms, and Systems. Springer, 2008.

[3] P. Brucker, Scheduling Algorithms. Springer, 2007.

[4] Beginner's Guide to Sun Grid Engine 6.2: Installation and Configuration, Sun Microsystems, 2008.

[5] V. Venkatachalam and M. Franz, "Power reduction techniques for microprocessor systems," ACM Comput. Surv., vol. 37, no. 3, pp. 195–237, 2005.

[6] R. Ge, X. Feng, and K. W. Cameron, "Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005, p. 34.

[7] N. Kappiah, V. W. Freeh, and D. K. Lowenthal, "Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005.

[8] D. Essary and A. Amer, "Predictive data grouping: Defining the bounds of energy and latency reduction through predictive data grouping and replication," Trans. Storage, vol. 4, no. 1, pp. 1–23, 2008.

[9] Q. Zhu, Z. Chen, L. Tan, Y. Zhou, K. Keeton, and J. Wilkes, "Hibernator: helping disk arrays sleep through the winter," in SOSP '05: 20th ACM Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2005, pp. 177–190.

[10] X. Li, Z. Li, Y. Zhou, and S. Adve, "Performance directed energy management for main memory and disks," Trans. Storage, vol. 1, no. 3, pp. 346–380, 2005.

[11] S. Conner, G. M. Link, S. Tobita, M. J. Irwin, and P. Raghavan, "Energy/performance modeling for collective communication in 3-D torus cluster networks," in SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2006.

[12] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Reducing instruction cache energy consumption using a compiler-based strategy," ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 3–33, 2004.

[13] C. Lefurgy, X. Wang, and M. Ware, "Server-level power control," in ICAC '07: Proceedings of the Fourth International Conference on Autonomic Computing. Washington, DC, USA: IEEE Computer Society, 2007.

[14] G. Koole and R. Righter, "Resource allocation in grid computing," J. Scheduling, vol. 11, no. 3, pp. 163–173, 2008.

[15] R. Fu, T. Ji, J. Yuan, and Y. Lin, "Online scheduling in a parallel batch processing system to minimize makespan using restarts," Theor. Comput. Sci., vol. 374, no. 1-3, pp. 196–202, 2007.

[16] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz, "A multicriteria approach to two-level hierarchy scheduling in grids," J. Scheduling, vol. 11, no. 5, pp. 371–379, 2008.

[17] J. Edmonds, "Scheduling in the dark," Theor. Comput. Sci., vol. 235, no. 1, pp. 109–141, 2000.

[18] C.-M. Wang, X.-W. Huang, and C.-C. Hsu, "Bi-objective optimization: An online algorithm for job assignment," in Advances in Grid and Pervasive Computing, GPC 2009, Geneva, Switzerland, 2009, pp. 223–234.

[19] P. Shivam, S. Babu, and J. Chase, "Active and accelerated learning of cost models for optimizing scientific applications," in VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 535–546.

[20] G. N. S. Prasanna and B. R. Musicus, "The optimal control approach to generalized multiprocessor scheduling," Algorithmica, vol. 15, no. 1, pp. 17–49, 1996.

[21] E. Medernach, "Workload analysis of a cluster in a grid environment," in Job Scheduling Strategies for Parallel Processing, 11th International Workshop, JSSPP 2005. Springer, 2005.

[22] Y. Etsion and D. Tsafrir, "A short survey of commercial cluster batch schedulers," Hebrew Univ. of Jerusalem, Tech. Rep. 2005-13, 2005.

[23] A. Aziz and H. El-Rewini, "On the use of meta-heuristics to increase the efficiency of online grid workflow scheduling algorithms," Cluster Computing, vol. 11, no. 4, pp. 373–390, 2008.

[24] L. F. Ges, P. Guerra, B. Coutinho, L. Rocha, W. Meira, R. Ferreira, D. Guedes, and W. Cirne, "AnthillSched: A scheduling strategy for irregular and iterative I/O-intensive parallel jobs," in Job Scheduling Strategies for Parallel Processing, 11th International Workshop, JSSPP 2005. Springer, 2005.

[25] E. Santos-Neto, W. Cirne, F. Brasileiro, A. Lima, R. Lima, and C. Grande, "Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids," in Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004, pp. 210–232.

[26] D. Acosta and T. Camporesi, "Cosmic success," CMS Times, November 2008.


Page 169: ADCOM 2009 Conference Proceedings

Energy Efficient High Available System: An Intelligent Agent Based Approach

Ankit Kumar, R.K. Senthil Kumar, B.S. Bindhumadhava
Centre for Development of Advanced Computing,
'C-DAC Knowledge Park', Opp. HAL Aeroengine Division,
No. 1, Old Madras Road, Byappanahalli, Bangalore-560038, India.
ankitk, senthil, [email protected]

Abstract— For achieving high availability in agent based applications of a mission critical nature, a replica of the real agent system can be created. However, even the replica of the real agent system may fail due to various reasons, like node failure, failure in a communication link, etc., which may lead to agent loss. Another major issue with such systems is the wastage of energy, since the replica has to be in the active mode (full power mode) all the time, which is harmful for the environment. The need for improved energy management in these types of systems has become essential for many reasons, like reduced energy consumption and compliance with environmental standards.

To overcome these issues, we present an intelligent agent based approach for efficient energy management in these systems, and also for agent loss prevention, by creating a replica 'on demand' of the real agent system using an efficient election algorithm (to find the system best suited for replication) designed for dynamic networks.

KEY WORDS
Agent System, Mobile Agents, Election Algorithm, Energy Management

I. INTRODUCTION & MOTIVATION
In distributed systems, there are various activities, like electronic commerce, network management, process control applications, and defence applications, which are mobile agent based. A mobile agent is a self-managed software program performing a particular task, which is capable of autonomously migrating through a heterogeneous network. An agent can exist only on nodes which have an agent system running on them. An agent system provides an execution environment to mobile agents. In CMAF (C-DAC Mobile Agent Framework) [1], agent systems are classified into two categories: the real agent system and proxy agent systems. Pluggable services like registry, communication, and user interface provide functionalities to the agent system. These services are called system agents, since they work for the agent system. In a single network domain, there is only one real agent system, and all the other agent systems run as proxy agent systems. A proxy agent system has a smaller load compared to a real agent system. The real agent system maintains a registry of all the agent systems running in the network, whereas a proxy agent system does not. Mobile agent execution is initiated from the real agent system, and an agent can migrate to any proxy agent system which is registered with the real agent system.

Mobile agent based applications of a mission critical nature require a high degree of dependability and consistency. Despite the rapid evolution in all aspects of computer technology, both computer hardware and software are prone to numerous failure conditions, which may lead to the termination of these applications. So providing high availability for these types of applications becomes increasingly important. High availability refers to the availability of resources in a system in the wake of component failures in that system. High availability in agent based applications can be achieved by detecting node failures and reconfiguring the system appropriately, so that the workload of the real agent system can be taken over by another node in the system, called the replica. Fault tolerance for the replica is then also required, to prevent the loss of agents performing critical applications. Checkpointing [3] of the real agent system or the replica is not a good approach; we intend to achieve this goal by applying replication 'on demand'.

Instead of making the system run in full power mode all the time, which leads to wastage of energy, we put it into the sleep state and bring it to the active state only when required. To achieve this, we propose an intelligent agent called the "green agent".

In this paper, we present an approach to ensure the reliability of the real agent system using replication based on an election algorithm for dynamic networks. In this approach, we optimize performance by replicating only the real agent system and running all the other agent systems as proxy agent systems with minimal load.

The rest of the paper is organized as follows. In Section 2, we describe the 'agent based highly available environment'. The election algorithm for dynamic networks is discussed in Section 3. Section 4 discusses the agent based energy efficient high available system. Performance evaluation is explained in Section 5. Section 6 presents the conclusions.

II. AGENT BASED HIGHLY AVAILABLE ENVIRONMENT
In our proposed approach, the 'agent based highly available environment', we use replication to address agent failure due to node failure or agent system failure. Replication increases the dependability and availability of a system.

In this model, all the agents running in the registered agent systems are checkpointed in the real agent system. On failure of a proxy agent system, the agents which were abnormally terminated continue their execution from the


Page 170: ADCOM 2009 Conference Proceedings

real agent system using the last checkpointed state. To achieve dependability of agents, the real agent system is replicated.

The location of the real agent system and its replica are maintained in all the agent systems as realLocFile and replicaLocFile respectively.

Whenever a new proxy agent system comes up, it gets the location of the real agent system and its replica from the neighbour agent system. The proxy agent system then registers itself with the real agent system and its replica as shown in Figure 1.

Figure 1. Agent Based Highly Available Environment

When the real agent system fails, the replica will take over control and hence become the new real agent system. The realLocFile in the real agent system is updated with the new location. The agents which were blocked will continue their execution in the new real agent system. These self-healing and self-configuring properties make our system a highly available and self-aware environment for agent execution.

Whenever a replica becomes the new real agent system, another replica is created. The location of the current real agent system and its replica is updated in all the agent systems. The replication of the real agent system is based on an election algorithm for dynamic networks.

III. ELECTION ALGORITHM FOR DYNAMIC NETWORKS
The agent systems running in a distributed network form a hierarchical structure with the real agent system as the root. Since proxy agent systems can register and unregister at any time, this network is dynamic in nature. In the proposed system, we use an election algorithm to select the best-suited agent system from the proxy agent systems running in the network and reconfigure the selected agent system to create a replica of the real agent system. This algorithm handles the dynamic nature of the network.

Generally, the aim of an election algorithm [4] is to elect a node from a fixed set of nodes. Some of the most common applications of election algorithms are key distribution [5], routing coordination [6], sensor coordination [7], and general control [8], [17]. Nowadays, election algorithms are also being used in mobile agent based applications. In the case of mobile agent based networks, the election algorithm should adapt to the dynamic nature of the network, and it should elect the agent system based on its performance.

Some existing election algorithms work for static networks [9], [10], [11], [7], [12], [13], [14] or assume that the network is static in nature [15], [16]. Existing election algorithms designed for dynamic networks use random selection of a node [17]. Sudarshan Vasudevan, Jim Kurose and Don Towsley proposed in their paper an election algorithm for mobile ad hoc networks based on extrema-finding [18]. There are also other extrema-finding algorithms [8] and clustering algorithms for mobile networks [19], [20]. But these algorithms are not used in our approach, since they require a high amount of message passing between nodes, which would increase the overhead.

In our approach, we propose an election algorithm for dynamic mobile agent based networks. This algorithm selects an agent system among the proxy agent systems based on the performance-related characteristics of the system. Since the processor speed and the amount of memory required for a proxy agent system and a real agent system are the same, we consider only the hard disk space and the load average for replica election. During the startup of the real agent system, election is triggered by creating a mobile agent called ElectionAgent. On failure of the real agent system, the replica takes over control and creates a new replica by reconfiguring the proxy agent system selected by the ElectionAgent.

We now describe the operation of the election algorithm for mobile agent based networks. In Section A, we explain the algorithm for electing the best proxy agent system. An algorithm for performance comparison of proxy agent systems is given in Section B. Section C describes the process of updating the replica location in all the proxy agent systems.

A. Algorithm for Electing Best Proxy Agent System

When the real agent system starts up, it triggers the election for the best proxy agent system by creating a mobile agent called ElectionAgent.

We describe the election process by explaining the methods used by the ElectionAgent. Table I shows the different methods used by the ElectionAgent.

Table I. Methods used by ElectionAgent

Method        Purpose
getList       gets the list of all proxy agent systems
move          migrates to a particular proxy agent system
getInfo       gets the attributes of the proxy agent system
moveBack      moves back to the real agent system
compareInfo   compares the information gathered
updateList    checks the registry for new proxy agent systems

1) getList: The real agent system maintains a registry of all the agent systems running in the network. The ElectionAgent makes a list of all the proxy agent systems from this registry, where each entry contains the agent system name and its location.

2) move: The agent takes the first entry from the list and tries to get the communication service of that agent system by using its location. Once the communication service is received, the ElectionAgent migrates to that particular proxy agent system. If the communication service is not available, the ElectionAgent will retry up to ten times. After these retries, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) getInfo: After the successful migration to the proxy agent system, the ElectionAgent continues its execution. The election of an agent system is made on the basis of the hard disk space and the load average.

4) moveBack: The ElectionAgent, having gathered this information from the proxy agent system, gets the communication service of the real agent system and migrates back to it using that service.

5) compareInfo: After migrating back to the real agent system, the information which was gathered is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. The tmpInfoFile contains the name, location and attributes of the best proxy agent system.

6) updateList: The ElectionAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.

For each proxy agent system entry in the list, the ElectionAgent gets the information about that agent system, compares it with the previous information in the tmpInfoFile and updates the tmpInfoFile with the best value. Finally, the tmpInfoFile, which contains the information of the best proxy agent system, is renamed as infoFile. The ElectionAgent keeps on executing, and hence we assume that the infoFile always contains the best proxy agent system. When the real agent system fails, the replica becomes the new real agent system. The realLocFile in the real agent system is updated with the new location, and the agent system contained in the infoFile is reconfigured as the new replica. The replicaLocFile in the real agent system is updated with the new replica location, and the locations of the real agent system and its replica are updated in all the proxy agent systems.
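To make the flow concrete, the following is a minimal, self-contained Python sketch of one ElectionAgent pass over the registry. Migration is only simulated, get_info stands in for the attribute gathering performed on each proxy agent system, and the simplified comparison used here is refined by the rules given in Section B; all names are illustrative, not part of the paper's implementation.

    def election_pass(registry, get_info):
        """registry: list of proxy agent system names.
        get_info: name -> (free_disk, load_average).
        Returns the (name, info) pair that ends up in the infoFile."""
        best = None                                # contents of tmpInfoFile
        for name in list(registry):                # getList
            info = get_info(name)                  # move, getInfo, moveBack
            # compareInfo (simplified): lower load wins, ties broken by disk
            if best is None or (info[1], -info[0]) < (best[1][1], -best[1][0]):
                best = (name, info)
            # updateList: newly registered proxies would be appended here
        return best                                # renamed as infoFile

    # Example with three proxy agent systems:
    attrs = {'P1': (120, 0.8), 'P2': (200, 0.3), 'P3': (90, 0.3)}
    print(election_pass(list(attrs), attrs.get))   # -> ('P2', (200, 0.3))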

B. Algorithm for Performance Comparison

We describe an algorithm for comparing the performance of proxy agent systems based on the information gathered by the ElectionAgent. Here we consider the hard disk space and the load average of the proxy agent systems for comparison. The load average is the average number of processes in the kernel's run queue during an interval.
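On a Unix-like host, these two attributes can be gathered with Python standard-library calls, as in the illustrative sketch below; the path and the printed output are assumptions, not values from the paper.

    import os, shutil

    def proxy_attributes(path="/"):
        # free hard disk space and the 1-minute load average, the two
        # performance attributes used for replica election
        free_bytes = shutil.disk_usage(path).free
        load_1min = os.getloadavg()[0]     # available on Unix-like systems
        return (free_bytes, load_1min)

    print(proxy_attributes())   # e.g. (123456789012, 0.42)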

We represent a proxy agent system with free hard disk space h and load average l as (h, l). Let (h1, l1) be the proxy agent system contained in the tmpInfoFile and (h2, l2) be the proxy agent system that is to be compared. The best proxy agent system is selected based on the different conditions given below.

Condition 1: h1 = h2

In this case, we compare l1 and l2 and select the proxy agent system having the lower load average.


l1 > l2 => (h2, l2)
l1 < l2 => (h1, l1)
l1 = l2 => (h1, l1)

Condition 2: There is a negligible difference between h1 and h2.

In this case, we give more priority to the load average for the selection of the proxy agent system.

l1 > l2 => (h2, l2)
l1 < l2 => (h1, l1)

When there is no difference between l1 and l2, we select the proxy agent system having comparatively more free hard disk space.

l1 = l2 => (h1, l1), if h1 > h2
l1 = l2 => (h2, l2), if h2 > h1

Condition 3: There is a significant difference between h1 and h2.

In this case, we give priority to either the load average or the free hard disk space for the selection of the proxy agent system, based on the criteria below.

Here we consider that a system with a single CPU is overloaded if the load average is greater than 1. So if exactly one of the systems has a load average less than 1, we select that system.

l1 > 1 and l2 < 1 => (h2, l2)
l1 < 1 and l2 > 1 => (h1, l1)

When l1 < 1 and l2 < 1, or l1 > 1 and l2 > 1, we have two cases. If the system having comparatively more free hard disk space also has the lower load average, we select that system. Otherwise, we select the system having the lower load average when the difference between the load averages is significant; when the difference is negligible, the system with comparatively more free hard disk space is selected.

1) h1 > h2
l1 < l2 => (h1, l1)
l1 = l2 => (h1, l1)
l1 > l2 => (h2, l2), if the difference between l1 and l2 is significant
l1 > l2 => (h1, l1), if the difference between l1 and l2 is negligible

2) h1 < h2
l1 > l2 => (h2, l2)
l1 = l2 => (h2, l2)
l1 < l2 => (h1, l1), if the difference between l1 and l2 is significant
l1 < l2 => (h2, l2), if the difference between l1 and l2 is negligible
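The rules above translate directly into a comparison function. The following is a minimal Python sketch; since the paper does not quantify what counts as a 'negligible' or 'significant' difference, the thresholds EPS_DISK and EPS_LOAD are illustrative assumptions.

    EPS_DISK = 0.05   # relative disk difference treated as negligible (assumed)
    EPS_LOAD = 0.1    # load-average difference treated as negligible (assumed)

    def better_candidate(a, b):
        """Return the better of two proxy agent systems a and b, each given
        as (free_disk_space, load_average)."""
        (h1, l1), (h2, l2) = a, b
        disk_diff = abs(h1 - h2) / max(h1, h2, 1)

        # Conditions 1 and 2: equal or negligibly different disk space, so
        # the lower load average wins, ties broken by more free disk space.
        if disk_diff <= EPS_DISK:
            if l1 != l2:
                return a if l1 < l2 else b
            return a if h1 >= h2 else b

        # Condition 3: significant disk-space difference. A single-CPU system
        # is considered overloaded when its load average exceeds 1, so a
        # non-overloaded system is preferred outright.
        if l1 > 1 and l2 < 1:
            return b
        if l1 < 1 and l2 > 1:
            return a

        # Both overloaded or both non-overloaded.
        bigger, smaller = (a, b) if h1 > h2 else (b, a)
        lb, ls = bigger[1], smaller[1]
        if lb <= ls:
            return bigger                   # more disk space, no higher load
        if lb - ls > EPS_LOAD:
            return smaller                  # load difference is significant
        return bigger                       # load difference is negligible

    # Example: significant disk difference, both systems non-overloaded:
    print(better_candidate((200, 0.6), (90, 0.3)))   # -> (90, 0.3)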

C. Updating Replica Location in Proxy Agent Systems

The location of the real agent system and its replica is updated in all the proxy agent systems by the real agent system. We describe this process below.

The entries of all the proxy agent systems which are registered with the real agent system are added into a list. Each entry in the list contains the agent system name and its location. The location of the real agent system and its replica are retrieved from the realLocFile and the replicaLocFile. This information is sent by the real agent system to the first agent system entry in the list of agent systems using the communication service. In the proxy agent system, the entries of the realLocFile and replicaLocFile are updated with the location of the new real agent system and its replica. The real agent system checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.
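A minimal Python sketch of this update broadcast is given below. The communication service is simulated with direct dictionary updates, and the field names mirror the realLocFile/replicaLocFile convention used above; everything here is illustrative.

    def broadcast_locations(real_loc, replica_loc, proxies):
        """proxies: name -> {'realLocFile': ..., 'replicaLocFile': ...}."""
        for name, files in proxies.items():
            # the real agent system sends [real_loc, replica_loc] to each
            # proxy, which overwrites its stored locations
            files['realLocFile'] = real_loc
            files['replicaLocFile'] = replica_loc

    proxies = {p: {'realLocFile': 'R1', 'replicaLocFile': 'R2'}
               for p in ('P1', 'P2', 'P4')}
    broadcast_locations('R2', 'R3', proxies)
    print(proxies['P1'])   # {'realLocFile': 'R2', 'replicaLocFile': 'R3'}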

Example: We illustrate the operation of the algorithm with an example. Let us consider a network of agent systems as shown in Figure 2(a). It consists of one real agent system (R1), one replica (R2) and three proxy agent systems (P1, P2, P3). The location of the real agent system and its replica is shown in all the agent systems as [R1,R2].

R1 will now create the ElectionAgent (EA). EA maintains a list of all proxy agent systems as [P1,P2,P3]. It moves to P1 as shown in Figure 2(b). After reaching the proxy agent system P1, EA collects the information about P1, i.e., I(P1), and moves back to the real agent system R1. The information which was gathered is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. This is represented as P1, as shown in Figure 2(c).

Before EA migrates to the next proxy agent system, it updates the list of proxy agent systems. In this example, a new proxy agent system P4 is added to the list. The updated list is [P2,P3,P4], as shown in Figure 2(d). EA moves to each agent system in the list, gets the information and saves the best value. Finally, the tmpInfoFile, which contains the information of the best proxy agent system, is renamed as infoFile; we assume that P3 is the best proxy agent system, as shown in Figure 2(e). EA continues its execution and updates the infoFile each time.

When R1 fails, R2 will become the new real agent system and the location of the real agent system in R2 is updated as [R2,R2]. Now R2 will fetch the best proxy agent system entry from the infoFile i.e., [P3]. P3 is reconfigured as the new replica R3 and the location of the replica in R2 is updated as [R2, R3] as shown in Figure 2(f).

The real agent system R2 updates the location of the real agent system and its replica in all the proxy agent systems. R2 has a list containing the entries of all proxy agent systems registered with it, [P1,P2,P4]. The location of the real agent system and its replica, [R2,R3], is retrieved from the realLocFile and replicaLocFile. The real agent system sends this information to the first proxy agent system entry in the list, P1, using the communication service, as shown in Figure 3(a). In the proxy agent system P1, the location of the real agent system and its replica, [R1,R2], is updated to [R2,R3]. For each proxy agent system entry in the list, the real agent system updates the entries of the realLocFile and replicaLocFile with the location of the new real agent system and its replica, as shown in Figure 3(b). In Figure 3, we assume that EA is running and show only the entries in the tmpInfoFile and infoFile.


Figure 2. Election Process in Proposed Environment

Figure 3. Updation Process in Proposed Environment

Figure 4. Working of Green Agent


IV. AGENT BASED ENERGY EFFICIENT HIGH AVAILABLE SYSTEM

Since we choose the replica dynamically 'on demand' using an election algorithm in the 'agent based highly available environment', the replica is idle most of the time and is put to use only when it needs to be synchronized with the real agent system. During this idle time the replica runs in full power mode, which leads to a considerable amount of energy wastage.

Modern computer systems are equipped with special utilities that allow users to either manually or automatically schedule their computers to switch to sleep mode for energy saving, a task typically performed by the administrator [22, 23]. In practice, in highly available systems it is not possible for an administrator to determine exactly when the replica will not be in use, which leads to significant energy wastage. To reduce this wastage, we introduce an agent based energy management approach for highly available systems. This approach is first implemented for the replica of the real agent system of our 'agent based highly available environment' and then extended to other systems running as proxy agent systems.

In our approach, we put the system into sleep mode using an agent known as the 'Green Agent'. Figure 4 illustrates the working of the Green Agent.

A. Working of Green Agent

When the real agent system starts up, it triggers the energy saving process by creating a mobile agent called GreenAgent.

We describe the working of the GreenAgent by explaining the methods it uses. Table II shows the different methods used by the GreenAgent.

The real agent system has a special registry which contains the status of every other agent system (whether it is in active mode or sleep mode) along with its location; this registry is periodically updated by the GreenAgent.

Table II. Methods used by GreenAgent

Method       Purpose
getList      gets the list of all active proxy agent systems
move         migrates to a particular proxy agent system
checkInfo    gets specific parameters of the proxy agent system and checks them to decide whether that system should sleep or remain active
moveBack     moves back to the real agent system
makeSleep    puts to sleep all systems whose flag is set to 'ready to sleep'
updateList   checks the registry for new proxy agent systems

1) getList: The GreenAgent gets the entries of all the active proxy agent systems from the registry and makes a list. Each entry in the list contains the agent system name and its location.

2) move: It takes the first agent system entry from the list of agent systems and tries to get the communication service of that agent system by using its location. Once the communication service is received, the GreenAgent migrates to that particular proxy agent system. If the communication service is not available, the GreenAgent will retry up to ten times. If it is still not available after these retries, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) checkInfo: After the successful migration to the proxy agent system, the GreenAgent continues its execution. It checks specific parameters such as CPU load, the number of running applications/processes and mouse/keyboard activity in order to detect whether the current system has reached an idle state; if so, it sets the status flag corresponding to this agent system from 'active' to 'ready to sleep' in the real agent system.

4) moveBack: The GreenAgent now gets the communication service of the real agent system and migrates back to it using that service.

5) makeSleep: The GreenAgent finds all the agent systems whose status flags are set to 'ready to sleep', puts all these systems into standby mode by calling a special API function on them, and sets their flags to 'sleep'.

6) updateList: The GreenAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems with its status.

The real agent system checks the status flag of a proxy agent system before sending any other mobile agent to it. Only if the flag is set to 'active' does it send the agent to the proxy agent system; if the flag is set to 'sleep' or 'ready to sleep', it first brings the corresponding system into active mode, sets the flag to 'active' and then sends the agent there.

The replica is updated by the real agent system through the synchronization process. It needs to be in the active state only during synchronization; the rest of the time it should be in the sleep state. As no agent is running on the replica, the real agent system directly puts it into the active and sleep states as and when required: after each synchronization the real agent system sets the replica to standby mode, and before the next synchronization it sets the replica back to full power mode. Through this approach we achieve an efficient energy management technique for highly available systems.
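As a concrete illustration, the following is a minimal, self-contained Python sketch of one GreenAgent cycle. Migration and the platform standby call are simulated with print statements, and the idle-detection thresholds are assumptions for illustration, not values taken from the paper.

    def is_idle(cpu_load, process_count, input_events):
        # assumed idle criteria: low CPU load, few processes, no recent input
        return cpu_load < 0.1 and process_count < 5 and input_events == 0

    def green_agent_cycle(registry, probe):
        """registry: name -> status ('active'/'ready to sleep'/'sleep').
        probe: name -> (cpu_load, process_count, input_events), standing in
        for the parameters gathered after migrating to each system."""
        for name in [n for n, s in registry.items() if s == 'active']:
            print("GreenAgent migrates to", name)        # move
            if is_idle(*probe[name]):                    # checkInfo
                registry[name] = 'ready to sleep'
            print("GreenAgent migrates back")            # moveBack
        for name, status in registry.items():            # makeSleep
            if status == 'ready to sleep':
                print("Putting", name, "into standby")   # special API call
                registry[name] = 'sleep'
        return registry

    # Example: P1 is idle, P2 is busy.
    systems = {'P1': 'active', 'P2': 'active'}
    samples = {'P1': (0.02, 3, 0), 'P2': (0.9, 40, 12)}
    print(green_agent_cycle(systems, samples))   # P1 -> 'sleep', P2 -> 'active'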

V. PERFORMANCE EVALUATION

For comparing the performance of a mobile agent on CMAF [1] and the 'agent based highly available environment', four different situations were simulated.


1) ACP represents normal execution of an agent in CMAF [1]. This involves migration time, processing time and agent checkpointing [2] time.

2) ASFT represents normal execution of an agent in the fault tolerant agent system, i.e., the 'agent based highly available environment'.

3) ASFT-fP represents execution of an agent in the 'agent based highly available environment' on failure of the real agent system when the agent is in a proxy agent system. In this case, we assume that the agent moves back from the proxy agent system to the replica before the updation of the realLocFile and the replicaLocFile.

4) ASFT-fR represents execution of an agent in the 'agent based highly available environment' on failure of the real agent system when the agent is in the real agent system itself.

Our algorithm was simulated to study the impact of agent size on its execution time. A simulation environment was set up with one real agent system, one replica and one proxy agent system. For each simulated situation, an agent was sent from the real agent system to the proxy agent system with increasing size. The results of the simulation are shown in Figure 5(a). The difference between the ACP and ASFT curves is due to the replication overhead. There is a constant difference between ASFT and ASFT-fP. This is due to the delay taken by the agent to realise that the real agent system has failed and that it has to move to the replica. The significant difference between ASFT and ASFT-fR is due to the time taken by the replica to take over control on failure of the real agent system, the time taken for reconfiguration of the proxy agent system into a replica, and the restoration overhead. Since the updation of the realLocFile and replicaLocFile in the proxy agent systems is done after the restoration of agents, it does not affect the ASFT-fR curve. Moreover, this updation process does not add much delay.

Another simulation environment was set up to analyze the influence of the number of nodes on agent execution time. For each of the four simulated situations, the agent execution time was measured while increasing the number of nodes. Figure 5(b) shows the results of the simulation. From the figure, we observe that there is a small difference between the ACP and ASFT curves due to the replication overhead. We can also see that the difference between ASFT and ASFT-fP, as well as ASFT-fR, is almost constant.

The energy management system was tested on a network consisting of 12 computers. CMAF was installed on each of the 12 machines and the total power consumed by the 12 computers was measured. After this, the GreenAgent was triggered on all of the 12 machines and the power consumed by all the machines was measured again. The total power consumed by the machines was measured over a significant time in both cases, and a graph of power consumed vs. time was plotted from the results obtained. Figure 5(c) shows the power variations with and without the GreenAgent. From the graph it can be clearly concluded that a significant amount of energy is saved if we use the GreenAgent.

Figure 5. Simulation Results: (a) agent execution time (s) vs. agent size (kb) for ACP, ASFT, ASFT-fP and ASFT-fR; (b) agent execution time (s) vs. number of nodes for the same four situations; (c) power consumed (in units of 5 W) vs. time of day for replication with and without the GreenAgent.


VI. CONCLUSION

In this paper, we have proposed an 'agent based highly available environment' for achieving high availability in agent based mission critical applications by creating a replica 'on demand' for a real agent system. We achieve this by using an election algorithm designed for dynamic networks to find the best-suited proxy agent system in the network. Since we create the replica dynamically 'on demand', we can avoid agent loss in such systems.

In this paper we have also considered the issue of heavy energy wastage in such types of systems and proposed an intelligent agent based efficient energy management approach for saving energy.

Finally, from the simulations, we found that the agent execution delay due to replication does not add much overhead. We can summarise that there is a significant delay in agent execution only when the real agent system fails. Our approach enhances the availability and self awareness of the agent system and provides a highly reliable environment for the execution of mobile agent based mission critical applications at the expense of replication overhead.

We have also observed and compared the energy consumed by the systems with and without the ‘Green Agent’. The result clearly shows that a considerable amount of energy is saved by using our energy management approach as compared to the normal approach.

Our future work will concentrate on the performance optimization of the agent based energy efficient highly available environment.

REFERENCES

[1] S. Venkatesh, B. S. Bindhumadhava and Amrit Anand Bhandari, "Implementation of automated Grid software management tool: A Mobile Agent based approach", Proc. of Int'l Conf on Information and Knowledge Engineering, June 2006, pages 208-214.

[2] Banupriya, Manju Abraham and B. S Bindhumadhava, "Fault Tolerance for Mobile Agents", Proc. of Int'l Conf on Wireless Networks, June 2007.

[3] Eugene Gendelman, Lubomir F. Bic and Michael B. Dillencourt, “An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems”.

[4] N.Lynch, “Distributed Algorithms”, Morgan Kaufmann Publishers, Inc, 1996.

[5] B. DeCleene et al., “Secure Group Communication for Wireless Networks”, Proc. of MILCOM 2001, VA, October 2001.

[6] C. Perkins and E. Royer, “Ad-hoc On-Demand Distance Vector Routing”, Proc. of the 2nd IEEE WMCSA, New Orleans, LA, February 1999, pp. 90-100.

[7] W. Heinzelman, A. Chandrakasan and H. Balakrishnan, “Energy-Efficient Communication Protocol for Wireless Microsensor Networks”, Proc. of HICSS, 2000.

[8] K. Hatzis, G. Pentaris, P. Spirakis, V. Tampakas and R. Tan, “Fundamental Control Algorithms in Mobile Networks”, Proc. of 11th ACM SPAA, March 1999, pages 251-260.

[9] R. Gallager, P. Humblet and P. Spira, “A Distributed Algorithm for Minimum Weight Spanning Trees”, ACM Transactions on Programming Languages and Systems, vol.4, no.1, pages 66-77, January 1983.

[10] D. Peleg, “Time Optimal Leader Election in General Networks”, Journal of Parallel and Distributed Computing, vol.8, no.1, pages 96-99, January 1990.

[11] D. Coore, R. Nagpal and R. Weiss, “Paradigms for Structure in an Amorphous Computer”, Technical Report 1614, Massachusetts Institute of Technology Artificial Intelligence Laboratory, October 1997.

[12] D. Estrin, R. Govindan, J. Heidemann and S. Kumar, “Next Century Challenges: Scalable Coordination in Sensor Networks”, Proc. of ACM MOBICOM, August 1999.

[13] S. Vasudevan, B. DeCleene, N. Immerman, J. Kurose and D. Towsley, “Leader Election Algorithms for Wireless Ad Hoc Networks”, Proc. of IEEE DISCEX III, 2003.

[14] A. Amis, R. Prakash, T. Vuong, and D.T Huynh, “MaxMin D-Cluster Formation in Wireless Ad Hoc Networks”, Proc. of IEEE INFOCOM, March 1999.

[15] M. Aguilera, C. Delporte-Gallet, H. Fauconnier and S. Toueg, “Stable leader election”, LNCS 2180, p. 108 ff.

[16] G. Taubenfeld, “Leader Election in presence of n-1 initial failures”, Information Processing Letters, vol.33, no.1, pages 25-28, October 1989.

[17] N. Malpani, J. Welch and N. Vaidya, “Leader Election Algorithms for Mobile Ad Hoc Networks”, Fourth International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communications, Boston, MA, August 2000.

[18] Sudarshan Vasudevan, Jim Kurose and Don Towsley, “Design and Analysis of a Leader Election Algorithm for Mobile Ad Hoc Networks”.

[19] C. Lin and M. Gerla, “Adaptive Clustering for Mobile Wireless Networks”, IEEE Journal on Selected Areas in Communications, 15(7):1265-75, 1997.

[20] P. Basu, N. Khan and T. Little, “A Mobility based metric for clustering in mobile ad hoc networks”.

[21] G. Newsham and D. Tiller, “Energy Consumption of Desktop Computers: Measurements and Saving Potentials”, IEEE Transactions on Industry Applications, Jul/Aug 1994, pp. 1065-1072.

[22] Nordman B., Kinney K., Piette M.A, Webber C., "User Guide to Power Management in PCs and Monitors", University of California, Berkeley, January 1997


A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments

Amit Agarwal and Padam Kumar
Department of Electronics and Computer Engineering
Indian Institute of Technology Roorkee
Roorkee, India
aamitdec, [email protected]

Abstract—Scheduling workflow applications in a highly dynamic and heterogeneous grid environment is a complex NP-complete optimization problem. It may require several different criteria to be considered simultaneously when evaluating the quality of a solution or schedule. The two most important scheduling criteria frequently addressed by current grid research are the execution time and the economic cost. This paper presents an efficient bi-criteria scheduling heuristic for workflows called the Duplication-based Bi-criteria Scheduling Algorithm (DBSA). The proposed approach comprises two phases: (1) Duplication-based Scheduling, which optimizes the primary criterion, i.e. execution time; (2) Sliding Constraint Schedule Optimization, which optimizes the secondary criterion, i.e. economic cost, keeping the primary criterion within a sliding constraint. The sliding constraint is defined as a function of the primary criterion to determine how much the final solution can differ from the best solution found in primary scheduling. The experimental results reveal that the proposed approach generates schedules which are fairly optimized for both economic cost and makespan, while keeping the makespan within defined constraints, for executing workflow applications in the grid environment.

Keywords- grid computing; bi-criterion scheduling; optimization; workflow applications; DAG.

I. INTRODUCTION

Grid [1] is a unified computing platform which consists of a diverse set of heterogeneous resources distributed over a large geographical region, inter-connected over high speed networks and the Internet. A workflow application can be defined as a collection of tasks with precedence constraints that are executed in a well-defined order to achieve a specific goal [2]. Scheduling workflow applications in a grid with characteristics of dynamism, heterogeneity, distribution, openness, voluntariness, uncertainty and deception is a complex optimization problem, and several different criteria need to be considered simultaneously to obtain a realistic schedule. In general, minimization of the total execution time (or makespan) of the schedule is applied as the most important scheduling criterion [3-6]. The current grid computing systems are based on system-centric policies, whose objectives are to optimize system-wide metrics of performance, i.e. total execution time or makespan. The convergence of grid computing toward the service-oriented approach is fostering a new vision where economic aspects represent central issues to burst the adoption of computing as a utility [7]. In current economic market models [8, 9, 10], economic cost (the cost of executing a workflow application over the grid) has been considered as another important scheduling criterion to employ user-centric policies.

Considering multiple criteria enables us to propose a more realistic solution. Thus, an effective multi-criteria scheduling algorithm is required for executing workflows over the grid while assuring high-speed communication and reducing the task execution time and economic cost. A workflow type of application can be modeled as a Directed Acyclic Graph (DAG) in which nodes represent the executable tasks and the directed edges represent the inter-task data and control dependencies. Since the DAG scheduling problem in a grid is NP-complete, we have emphasized heuristics for scheduling rather than exact methods. In [8, 10, 11, 12, 13, 14], several scheduling algorithms have been proposed which minimize the makespan and economic cost of the schedule, but only a few of them address workflow types of applications. In [8, 13], Buyya et al. propose multi-objective planning for workflow scheduling approaches for utility grids. In [10], a quality of service optimization strategy for multi-criteria scheduling has been presented for the criteria of payment, deadline and reliability. In [11], Wieczorek et al. present an efficient bi-criterion scheduling algorithm called the Dynamic Constraint Algorithm (DCA) based on a sliding constraint, which uses a list-based heuristic for primary scheduling. In [12], Dogan and Ozguner show another tradeoff, between makespan and reliability, using a sophisticated reliability model for computation and network performance. Our work considers a different specification for two specific criteria (i.e. makespan and economic cost).

In [15], Deelman et al. describe three different workflow scheduling strategies, namely full-plan-ahead scheduling, in-time local scheduling, and in-time global scheduling. In just-in-time scheduling (in-time local scheduling), the scheduling decision for an individual task is postponed as long as possible and performed just before the task execution starts (a fully dynamic approach). In full-ahead planning (full-plan-ahead), the whole workflow is scheduled before its execution starts (a fully static approach). In this paper, we adopted full-ahead planning as it does not incur run-time overheads and scheduling complexity on a federation of geographically distributed computing resources known as a computational grid. Research shows that heuristics performing best in a static environment (e.g. HLD [4], HBMCT [6]) have the highest potential to perform best in a more accurately modeled grid environment.


Scheduling heuristics can be categorized as list-based scheduling, cluster-based scheduling and duplication-based scheduling. By and large, current multi-criteria scheduling approaches have adopted list-based scheduling heuristics (such as HEFT [3], HBMCT [6]) for primary scheduling. An extensive literature survey shows that duplication-based heuristics [4-5, 16] generate remarkably shorter schedules compared with list-based and cluster-based heuristics. The duplication approach utilizes the idle time slots (scheduling holes) for task duplication which, in turn, reduces the communication cost. The duplication strategy generates more optimized alternative schedules which help to minimize the overall schedule length. This motivates us to adopt the duplication based approach for primary scheduling to optimize the makespan (primary criterion).

In secondary scheduling, our objective is to optimize the economic cost (secondary criterion) of the schedule while keeping the makespan within the defined sliding constraint. Fig. 1 illustrates that a primary solution (M1, C1) can be obtained by considering the primary criterion, i.e. makespan, in primary scheduling, yielding a makespan of length M1 while the economic cost is C1. In secondary scheduling, the above schedule is optimized for the economic cost, dragging the makespan from M1 to M2 (M2 is the maximum allowable schedule length), which yields a schedule with makespan M2 and the reduced economic cost C2. This approach generates schedules which are remarkably more optimized both in terms of execution time and economic cost as compared to other related algorithms.

In general, bi-criteria optimization yields a set of solutions (a Pareto set) rather than a single solution. Each solution in a Pareto set is called a Pareto optimum, and when these solutions are plotted in the objective space they are collectively known as the Pareto front. The main objective of a multi-criteria optimization problem is to obtain the Pareto front. In the literature [17], two approaches, LOSS and GAIN, were proposed to compute the weight values for a given DAG. In LOSS, the initial assignment is done for optimal makespan using an efficient DAG scheduling algorithm [3], whereas in GAIN, the initial assignment is done by allocating tasks to the cheapest machines in order to reduce the economic cost as much as possible. In this paper, we consider the LOSS approach, where the initial assignment is done using a duplication-based scheduling approach inspired by our earlier research work [16] rather than HEFT [3], since the duplication-based heuristic produces a shorter makespan.

Figure 1. A bi-criteria optimization process

The rest of the paper is organized as follows. Section II describes the bi-criteria scheduling problem and related terminology. Section III presents the bi-criteria scheduling approach. The proposed bi-criteria scheduling algorithm (DBSA) is described in section IV. In section V, simulation results are presented and discussed. Section VI concludes the proposed research work.

II. BI-CRITERIA SCHEDULING PROBLEM

A. Workflow Application Model

A workflow scheduling problem can be defined as the assignment of available grid resources to different workflow tasks. A workflow can be modeled as a DAG, as shown in fig. 2, and it can be represented by W = (N, T, E, C), where N is a set of n computational tasks, T is a set of task computation volumes (i.e., one unit of computation volume is one million instructions), E is a set of communication arcs or edges that show precedence constraints among the tasks, and C is the set of communication data from parent tasks to child tasks (i.e., one unit of communication data is one Kbyte). The value of T_i ∈ T is the computation volume for the task n_i ∈ N. The value of c_ij ∈ C is the communication data transferred along the edge e_ij ∈ E from task n_i to task n_j, for n_i, n_j ∈ N.

Figure 2. A workflow application (values are in Kbytes)

The execution time (makespan) can be defined as the total time between the finish time of the exit task and the start time of the entry task in the given DAG, and the economic cost is the summation of the economic costs of all workflow tasks scheduled on different resources, which can be computed as:

    Economic Cost (EC) = \sum_{j=1}^{m} C_j                          (1)

where m is the total number of available resources in the grid and C_j is the execution cost of the tasks scheduled on a resource p_j, which can be calculated as:

    C_j = PBT(p_j) \times M_j                                        (2)


where M_j is the per unit time cost of executing tasks on a resource p_j and PBT(p_j) is the total busy time consumed by tasks scheduled on resource p_j.

TABLE I. SHOWING COMPUTATION COST, B-LEVEL AND TASK SEQUENCE FOR DAG IN FIGURE 2

In this model, the cost of the idle time slots between the scheduled tasks on any resource is also considered in the economic cost, as it is difficult for the grid scheduler to schedule other workflow tasks in these idle slots. Thus, the total execution time (makespan) can be expressed as:

    makespan = AFT(n_{exit}) - AST(n_{entry})                        (3)

where AFT and AST are the actual finish and actual start times of the exit task and entry task respectively. The normalized schedule length (NSL) of a schedule can be calculated as:

    NSL = \frac{makespan}{\sum_{n_i \in CP_{min}} \min_{p_j \in P} w_{ij}}

The denominator is the summation of the minimum execution costs of tasks on the CP_{min} [3].

TABLE II. RESOURCE CAPACITY

Resources                              p1    p2    p3    p4
Processing Capacity A(p_i) (in MIPS)   220   350   450   310

TABLE III. MACHINE PRICE

Resource p_i    Machine cost per MIPS M(p_i) (in Dollar $)
1               1.0
2               2.5
3               3.0
4               2.0

B. Grid Resource Model

A grid resource model can be represented by an undirected weighted graph G = (P, Q, A, B), as shown in fig. 3, where P = {p_i | p_i ∈ P, i = 1, 2, ..., p} is the set of p available resources, A = {A(p_i) | i = 1, 2, ..., p} is the set of execution rates (Table II), where A(p_i) is the execution rate for resource p_i, Q = {q(p_i, p_j) | q(p_i, p_j) ∈ Q, i = 1, 2, ..., p} is the set of communication links connecting pairs of distinct resources, where q(p_i, p_j) is the communication link between p_i and p_j, and B = {\beta(p_i, p_j) | i = 1, 2, ..., p} is the set of data transfer rates (bandwidths), where \beta(p_i, p_j) is the data transfer rate between resources p_i and p_j (fig. 3). In our model, task executions are assumed to be non-preemptive, and the intra-processor communication cost between two tasks scheduled on the same resource is considered to be zero.

C. Bi-criteria Performance Criteria

The computation cost of task n_i on p_j is w_ij (see Table I). If resource p_j is not capable of processing the task n_i, then w_ij = \infty. In a grid, some resources may not always be in a fully connected topology. Therefore, bandwidths between such resources are computed by searching alternative paths between them with maximum allowable bandwidths. The communication cost between task n_i scheduled on resource p_m and task n_j scheduled on resource p_n can be computed as:

    c_{ij} / \beta(p_m, p_n)

Figure 3. Grid with 4 resources (Bandwidth is in Kbps)

In this model, we ignore the communication startup costs of resources, and the intra-processor communication cost is negligible. A workflow of tasks is submitted to the grid scheduler [14], where tasks are queued in non-increasing order of their b-level.

The b-level (bottom level) of task n_i can be defined as the longest directed path, including execution cost and communication cost, from task n_i to the exit task in the given DAG. It can be computed recursively as:


    b_i = \bar{w}_i + \max_{n_j \in succ(n_i)} (\bar{c}_{ij} + b_j)

where succ(n_i) refers to the immediate successors of task node n_i, \bar{w}_i is the mean computation cost of task n_i, and \bar{c}_{ij} is the mean communication cost between tasks n_i and n_j.
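The recursion translates directly into code. Below is a minimal Python sketch of the b-level computation using memoization; the mean costs and the small DAG are illustrative and are not the workflow of fig. 2.

    from functools import lru_cache

    succ = {'n1': ['n2', 'n3'], 'n2': ['n4'], 'n3': ['n4'], 'n4': []}
    w = {'n1': 10, 'n2': 20, 'n3': 15, 'n4': 5}    # mean computation costs
    c = {('n1', 'n2'): 3, ('n1', 'n3'): 4,
         ('n2', 'n4'): 2, ('n3', 'n4'): 6}         # mean communication costs

    @lru_cache(maxsize=None)
    def b_level(ni):
        # b_i = w_i + max over successors n_j of (c_ij + b_j);
        # an exit task's b-level is its own mean computation cost
        if not succ[ni]:
            return w[ni]
        return w[ni] + max(c[(ni, nj)] + b_level(nj) for nj in succ[ni])

    # tasks are queued in non-increasing order of b-level
    order = sorted(succ, key=b_level, reverse=True)
    print([(n, b_level(n)) for n in order])
    # -> [('n1', 40), ('n2', 27), ('n3', 26), ('n4', 5)]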

The optimization goal of bi-criteria scheduling is to obtain the schedule with minimum schedule cost. It can be expressed in terms of a performance metric called effective schedule cost (ESC), which can be computed as:

    ESC = NSL \times EC

III. BI-CRITERIA SCHEDULING APPROACH

In this paper, we consider the makespan as the primary criterion and the economic cost as the secondary criterion. We define the sliding constraint for the primary criterion, i.e. how much the final solution may differ from the best solution found for the primary criterion. The primary scheduler adopts an efficient duplication scheduling approach to minimize the schedule length as much as possible. This schedule is then forwarded to the secondary scheduler. The secondary scheduler optimizes the schedule produced by the primary scheduler to minimize the economic cost. In secondary scheduling, some of the duplicated tasks are removed and the schedule is modified such that the makespan after removing those duplicated tasks remains within the maximum allowable execution length.

Further, it investigates those tasks in the schedule which have been duplicated on other resources. Such tasks may become unproductive if their descendant tasks are receiving input data from their duplicated versions; such unproductive tasks or sub-schedules are removed in order to reduce the economic cost. If the makespan of the resulting schedule is less than the upper limit of the defined sliding constraint, it can be further modified to reduce the economic cost. The schedule is modified by swapping tasks among resources (from costlier to cheaper resources) wherever this reduces the economic cost of the schedule while keeping the makespan within the upper limit.

IV. DBSA ALGORITHM

The proposed algorithm can be divided into two phases: (1) Primary Scheduling, optimizing the makespan; (2) Secondary Scheduling, optimizing the economic cost while keeping the makespan within the defined sliding constraint. The pseudo code for the proposed algorithm is described in Algorithm I for primary scheduling and Algorithm II for secondary scheduling. An efficient duplication-based scheduling heuristic has been applied for the primary scheduling [16]. It generates a preliminary solution sol_w^prel ∈ SC, with the total costs of the primary criterion and the secondary criterion denoted as c_1^prel and c_2^prel, respectively. The set SC contains all possible schedules for workflows to be executed over the grid [11].

Algorithm (DBSA)
Input:
    A DAG (workflow) W with task computation and communication costs.
    A set of available resources P with cost of execution per unit time.
    Sliding constraint L (10%, 25%, 50% and 75% of the makespan c_1^prel).

The secondary scheduling optimizes the primary solution for the secondary criterion, generating the best possible solution sol_w^final ∈ SC and the total costs c_1^final and c_2^final of the primary and secondary criteria. The sliding constraint is equal to L, such that the primary criterion cost can be increased from c_1^prel to c_1^prel + L. We can calculate the maximum allowable execution time T_max of the workflow with cheapest economic cost C_min using a cost optimization algorithm such as GreedyCost [13]. Similarly, the maximum allowable economic cost C_max of the workflow with the shortest possible execution time T_min can be computed using a time optimization algorithm such as HED [16].

Algorithm I: Primary Scheduling

Construct a priority-based task sequence based on highest b-level
for (each unscheduled task n_i in the task sequence)
    Let finish time F_i of task n_i be infinite
    for (each capable resource p_j)
        Compute finish time F_ij of task n_i on resource p_j
        Construct task predecessor list pred_list(n_i)
        Initialize temp_list(n_i, p_j) to zero
        if (pred_list)
            for (each predecessor d_k not scheduled on p_j)
                if (duplication of d_k on p_j reduces the finish time F_ij)
                    Add d_k to temp_list(n_i, p_j)
                    Update finish time F_ij
                endif
            endfor
        endif
        if (F_ij < F_i)
            F_i = F_ij
            r_i = p_j
            if (temp_list)
                Copy temp_list(n_i, p_j) into duplicate_list(n_i, p_j)
            endif
        endif
    endfor
    Assign task n_i to resource r_i and update schedule S
    if (duplicate_list)
        Duplicate tasks from duplicate_list to r_i and update schedule S
    endif
endfor
Compute makespan c_1^prel (Equation 3) and economic cost c_2^prel (Equation 1) from schedule S
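A heavily simplified Python sketch of the per-task selection step in Algorithm I follows: for each candidate resource the finish time is computed, the gain from duplicating predecessors is supplied by a caller-provided function, and the resource giving the earliest finish time wins. All data structures and helper functions are illustrative assumptions, not the paper's implementation.

    def place_task(ni, resources, finish_time_on, duplication_gain):
        """finish_time_on(ni, pj) -> finish time of ni on pj without
        duplication; duplication_gain(ni, pj) -> (finish time with
        duplication, duplicated predecessors).
        Returns (resource, finish_time, duplicates)."""
        best = (None, float('inf'), [])           # F_i starts at infinity
        for pj in resources:
            fij = finish_time_on(ni, pj)          # finish time, no duplication
            fij_dup, dups = duplication_gain(ni, pj)
            chosen = []
            if fij_dup < fij:                     # duplicating predecessors helps
                fij, chosen = fij_dup, dups
            if fij < best[1]:                     # keep the earliest finish time
                best = (pj, fij, chosen)
        return best

    # Toy usage with two resources; duplication only pays off on p1:
    ft = lambda ni, pj: {('n1', 'p1'): 12.0, ('n1', 'p2'): 9.0}[(ni, pj)]
    dg = lambda ni, pj: (8.0, ['n0']) if pj == 'p1' else (9.5, [])
    print(place_task('n1', ['p1', 'p2'], ft, dg))   # -> ('p1', 8.0, ['n0'])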


The schedule produced by primary scheduling is illustrated in fig. 4(a) for the workflow shown in fig. 2. The total execution time (or makespan) of this schedule is 16. This schedule yields a total economic cost of 18.81$, computed as per the machine costs described in Table III using equations (1) and (2). Further, we apply the secondary scheduling to optimize the economic cost while keeping the makespan within the sliding constraint. In fig. 4(b), the above schedule is optimized after removing some duplicated tasks whose removal keeps the makespan within the maximum allowable limit; this reduces the economic cost of the schedule to 17.31$. Again, we identify those tasks or sub-schedules which have been duplicated and remove those whose descendant tasks are receiving input data from their duplicated versions, which reduces the economic cost of the schedule to 14.32$, as shown in fig. 4(c). The tasks in the above schedule are then rescheduled on the cheapest resources wherever this reduces the economic cost while keeping the makespan within the maximum allowable limit. The schedule in fig. 4(c) is modified to reschedule tasks from resource P3 to P4, which reduces the economic cost to 9.32$ while the makespan is kept below 18 (+10% of the makespan in primary scheduling).

V. SIMULATION RESULTS AND ANALYSIS

The algorithms described in section IV have been simulated and implemented for the evaluation of different random task graphs or DAGs of different sizes (100, 200, 300, 400 and 500) with different degrees of parallelism, i.e. maximum outdegree of nodes in the DAG (2, 4, 6, 8 and 10). The algorithms have been executed and compared on grids of heterogeneous clusters of different sizes (5, 10, 15, 20 and 25) with 4 resources in each cluster. The proposed algorithm (DBSA) has been compared with DCA [11] on the performance metric of effective schedule cost (ESC), as described in section II, with respect to workflows and grids of different sizes. The algorithms have been run under the same conditions for a fair comparison: for each workflow, each algorithm is run to find the best possible secondary criterion cost while keeping the primary criterion within the defined sliding constraint.

Figure 4. Schedule of bi-criteria approach using (a) Primary Scheduling, (b) to (d) Secondary Scheduling with sliding constraint (+10%) of makespan

Algorithm II: Secondary Scheduling

Let L (maximum allowable schedule length) = c_1^prel + 10% of c_1^prel
if (duplicate_list && c_1^prel <= L)
    Copy tasks into list A and sort them in non-decreasing order of start time
    for (each duplicated task a_i in A)
        Compute schedule length SL without considering a_i in schedule S
        if (SL <= L)
            c_1^final = SL
            Remove a_i from S and update S and A
        endif
    endfor
    Construct list B of tasks from A that were duplicated
    Sort list B in non-decreasing order of task start time
    for (each task b_i in B)
        Compute schedule length SL without considering b_i in schedule S
        if (SL <= L)
            c_1^final = SL
            Remove b_i from S and update S and B
        endif
    endfor
endif
Compute economic cost c_2^final of the optimized schedule S
for (each task n_i scheduled on resource p_j in S)
    Construct list R of capable resources, in non-decreasing order of machine cost, whose machine cost is less than M(p_j)
    for (each resource p_k in R)
        Reschedule task n_i to resource p_k for c_1^final <= L
        Compute economic cost EC' (using Equation 1)
        if (EC' < c_2^final)
            Update schedule S
            c_2^final = EC'
        endif
    endfor
endfor
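The acceptance test that drives both passes of Algorithm II can be isolated in a few lines. The sketch below is a simplified Python illustration of the sliding-constraint logic, assuming abstract makespan and cost functions rather than the paper's schedule data structures.

    def optimize(schedule, makespan_of, cost_of, candidates, c1_prel, slide=0.10):
        """candidates: functions mapping a schedule to a modified schedule
        (e.g. removing a duplicated task or moving a task to a cheaper
        resource). A modification is kept only if the makespan stays within
        L and the economic cost decreases."""
        L = c1_prel * (1 + slide)       # maximum allowable schedule length
        best = schedule
        for modify in candidates:
            trial = modify(best)
            if makespan_of(trial) <= L and cost_of(trial) < cost_of(best):
                best = trial
        return best

    # Toy usage: a schedule is a (makespan, cost) pair; each candidate
    # trades some makespan for a cheaper schedule.
    mk, cost = (lambda s: s[0]), (lambda s: s[1])
    cands = [lambda s: (s[0] + 1.0, s[1] - 3.0),    # accepted (within L)
             lambda s: (s[0] + 3.0, s[1] - 5.0)]    # rejected (exceeds L)
    print(optimize((16.0, 18.81), mk, cost, cands, c1_prel=16.0))
    # -> (17.0, 15.81)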


The algorithm is run for primary scheduling for both criteria (i.e. makespan and economic cost) to find the best and worst solutions for the primary criterion (c_1^best and c_1^worst), which yield the maximum sliding constraint, i.e. the difference |c_1^worst - c_1^best|. The algorithm is run for three different sliding constraint values: 25%, 50% and 75% of the difference |c_1^worst - c_1^best|. The simulated results and graphs reveal that the proposed bi-criteria scheduling approach outperforms the DCA algorithm in terms of both economic cost and schedule length. In fig. 5 and fig. 6, the DBSA algorithm yields a reduced effective schedule cost (ESC) as compared to DCA over the grid. The simulation parameters for modeling workflows and grid environments are presented in Table IV.

TABLE IV. GRID ENVIRONMENT LAYOUTS

Parameter                      Range
Number of grid resources       [20, 100]
Resource bandwidth             [100 Mbps, 1 Gbps]
Number of tasks                [100, 500]
Computation cost of tasks      [50, 2000] ms
Data transfer size             [20 Kbytes, 20 Mbytes]
Resource capability (MIPS)     [220, 580]
Execution cost (per MIPS)      [1, 5] $ per MIPS

Figure 5. Effect of grid sizes on effective schedule cost (effective schedule cost vs. number of resources, for DBSA and DCA)

Figure 6. Effective schedule cost on different workflows (effective schedule cost vs. number of tasks, for DBSA and DCA)

VI. CONCLUSIONS

In this paper, a novel bi-criteria workflow scheduling approach has been presented and analyzed. We have proposed an efficient scheduling algorithm called the Duplication-based Bi-criteria Scheduling Algorithm (DBSA), which optimizes both the makespan and economic cost of the schedule. The schedule generated by the DBSA algorithm is much more optimized than those of other related bi-criteria algorithms in respect of both makespan and economic cost. The algorithms have been implemented to schedule different random DAGs onto different grids of heterogeneous clusters of various sizes. Different variants of the algorithm were modeled and evaluated.

REFERENCES

[1] I. Foster and C. Kesselman, “The Grid 2: Blueprint for a New Computing Infrastructure”, Morgan Kaufmann Pub.,Elsevier Inc., 2004.

[2] Zhiao Shi and Jack J. Dongarra, "Scheduling workflow applications on processors with different capabilities", Elsevier, 2005.

[3] H. Topcuoglu, S. Hariri, and M. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. In IEEE Transactions on Parallel and Distributed Systems, volume 13(3), pages 260–274, March 2002.

[4] Savina Bansal, Padam Kumar and Kuldip Singh, Dealing with Heterogeneity Through Limited Duplication for Scheduling Precedence Constrained Task Graphs, Journal of Parallel and Distributed Computing, 65(4): 479-491, Apr 2005.

[5] A. Dogan and F. Ozguner, LDBS: A Duplication Based Scheduling Algorithm for Heterogeneous Computing Systems, Proceedings of the Int’l Conf. on Parallel Processing, pp. 352-359, Aug 2002.

[6] R. Sakellariou and H. Zhao. A hybrid heuristic for DAG scheduling on heterogeneous systems. In 13th IEEE Heterogeneous Computing Workshop (HCW’04), Santa Fe, New Mexico, USA, April 2004.

[7] Nadia Ranaldo and Eugenio Zimeo, Time and Cost-Driven Scheduling of Data Parallel Tasks in Grid Workflows, IEEE Systems Journal VOL. 3, NO. 1, pp.104-120, MARCH 2009.

[8] J. Yu, R. Buyya, and C. K. Tham, Cost-based Scheduling of Scientific Work on Applications on Utility Grids, in Proceedings of the 1st IEEE International Conference on e-Science and Grid Computing (e-Science 2005), IEEE. Melbourne, Australia: IEEE CS Press, Dec. 2005.

[9] C. Ernemann, V. Hamscher and R. Yahyapour. Economic Scheduling in Grid Computing. In Proceedings of the 8th Workshop on Job Scheduling Strategies for Parallel Processing, Vol. 2537 of Lecture Notes in Computer Science, Springer, pages 128–152, 2002.

[10] Chunlin Li and Layuan Li, Utility-based QoS optimization strategy for multi-criteria scheduling in Grid, in JDPC, 2006.

[11] Wieczorek, M.; Podlipnig, S.; Prodan, R.; Fahringer, T., "Bi-criteria Scheduling of Scientific Workflows for the Grid," Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on , vol., no., pp.9-16, 19-22 May 2008.

[12] A. Dogan and F. Ozguner, "Biobjective Scheduling Algorithms for Execution Time-Reliability Trade-off in Heterogeneous Computing Systems", Comput. J., vol. 48, no. 3, pp. 300-314, 2005.

[13] J. Yu and R. Buyya, Scheduling Scientific Workflow Applications with Deadline and Budget Constraints using Genetic Algorithms, Scientific Programming Journal, vol. 14, no. 1,pp:217-230, 2006.

[14] Agarwal, A. and Kumar, P. "An Effective Compaction Strategy for Bi-criteria DAG Scheduling in Grids", Int. J. of Communication Networks and Distributed Systems (IJCNDS), Inderscience Publishers, in press.

[15] E. Deelman, J. Blythe, Y. Gil, and C. Kesselman. Workflow Management in GriPhyN. Grid Resource Management, State of the Art and Future Trends. pages 99–116, 2004.

[16] Agarwal, A. and Kumar, P. "Economical Duplication Based Task Scheduling for Heterogeneous and Homogeneous Computing Systems," Advance Computing Conference, 2009. IACC 2009. IEEE International, vol., no., pp.87-93, 6-7 March 2009.

[17] E. Tsiakkouri, H. Zhao, R. Sakellariou and M. D. Dikaiakos, "Scheduling Workflows with Budget Constraints", In Proceedings of the CoreGRID Workshop "Integrated Research in Grid Computing", S. Gorlatch and M. Danelutto, Eds., Nov. 2005, pp. 347-357.


ADCOM 2009

HUMAN COMPUTER INTERFACE - 2

Session Papers:

1. Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar, "Towards Geometrical Password for Mobile Phones"

2. Md Sahidullah, Sandipan Chakroborty and Goutam Saha, "Improving Performance of Speaker Identification System Using Complementary Information Fusion"

3. Narayanan Palani, "Right Brain Testing - Applying Gestalt psychology in Software Testing"


Towards Geometrical Password for Mobile Phones

Mozaffar Afaque
Dept of Computer Science, Indian Institute of Technology, Kharagpur, India
[email protected]

M. Sarosh Umar
Dept of Computer Engg, Aligarh Muslim University, Aligarh, India
[email protected]

Najaf Zaidi
Design Engineer I, TR&D, ST Microelectronics, Greater Noida, India
[email protected]

Mohammed A. Qadeer
Dept of Computer Engg, Aligarh Muslim University, Aligarh, India
[email protected]

Abstract — Mobile cell phones have brought a revolution to the modern world. They have become profound instruments of social as well as financial transformation. Mobile phones today not only hold the key to communication problems but can also be a suitable medium to facilitate commercial and financial transactions. There is an urgent need to establish ways to authenticate people over cell phones. The current method of authentication uses an alphanumeric username and password. The textual password scheme is convenient but has its drawbacks: alphanumeric passwords are usually easy to guess, offer limited possibilities and are easily forgotten. With financial transactions at stake, the need of the hour is a collection of robust schemes for authentication. Graphical passwords are one such scheme, offering a plethora of options and combinations. We propose a scheme which is simple for the user and robust at the same time. A graphical password formed by drawing geometries provides a larger password space and at the same time allows users to use their photographic memory, making it easy to remember. The proposed scheme is suitable for all touch sensitive mobile phones.

Keywords: User authentication, graphical password, smart phone security, geometrical password.

I. INTRODUCTION

Cell phones have become a necessity. The ability to keep in touch with family and business associates and access to email are not the only reasons for the increasing demand for cell phones. Today's technically advanced cell phones are capable of not only receiving and placing phone calls but can also conveniently store data, take pictures and connect to the internet. These features have allowed them to become successful mediums in the field of e-commerce. With the foray of mobiles into the world of finance, a method to authenticate users and their transactions was required. Textual passwords were the first choice, not because they were robust, but because they were easy to implement. With this choice we opted into a catch-22 situation where, if the passwords are easy to remember, they are also easy to crack or guess, and when they are complex they are easily forgotten.

Most of the passwords chosen by users in a textual password system are dictionary based, which makes the cracker's (one who tries to guess your password) job easier. Armed with a dictionary of 250,000 words, a cracker could compare their encryptions with those already stored in the password file in a little more than five minutes [1]. Even if edited words are included, it adds only 14 to 17 additional tests per word, which adds another 1,000,000 words to the list of possible passwords for each user [1]. In this paper we demonstrate a graphical grid based password scheme which aims at providing a huge password space along with ease of use. We also analyze its strength by examining the success of the brute force technique. In this scheme we try to make the password easy for the user to remember and more complex for the attacker.

II. RELATED WORK

Many papers have been published in recent years with a vision of a graphical technique for user authentication. Primarily there are just two methods, based on recall and recognition respectively. Traditionally both methods have been realized through the textual password space, which makes them easy to implement and at the same time easy to crack.

Figure 1: VisKey SFR


The study shows a 90% recognition rate after viewing 2560 pictures for a few seconds each [2]. Clearly the mind of Homo sapiens is best suited to respond to a visual. A recall based password approach is VisKey [3], which is designed for PDAs. In this scheme, to create a password, users have to tap spots in sequence. Since PDAs have a smaller screen, it is difficult to point to the exact location of a spot. Theoretically it provides a large password space, but not enough to withstand a brute force attack if the number of spots is less than seven [4]. Passfaces [5] is an image recognition based password scheme in which the user chooses different relevant pictures that describe a story. A recent study of graphical passwords [6] says that people are more comfortable with graphical passwords, which are easier to remember. In a recall based scheme, the user has to remember and reproduce the password.

Figure 2: DAS scheme.

Jermyn et al. [7] proposed a technique called “Draw-a-Secret” (DAS), which allows the user to draw their unique password (figure 2). In DAS, the user-defined drawing, made by stylus strokes in the case of a PDA, is recorded, and the user has to reproduce the same drawing to authenticate himself. The DAS scheme also allows passwords consisting of dots only, as in the example shown in figure 3.

Figure 3: Example of a DAS password consisting only of dots.

Research shows that people optimally recall only 6 to 8 points in a pattern [8], and that the number of successful recalls decreases drastically beyond 3 or 4 dots [9]. Our main motivation is to increase the password space. The user can choose a geometrical shape of their choice on a device with a graphical user interface, such as a PDA, which also keeps the password storage compact. In our scheme, users draw geometrical shapes with fixed end points and place dots at different locations, producing filled triangles, in such a way that the chances of remembering those positions are better.

III. DRAWING GEOMETRY

Drawing geometry is a graphical password scheme in which the user draws a geometrical object on the screen. Through this scheme we are targeting devices such as mobiles, notebook computers and hand-held devices like Personal Digital Assistants (PDAs) which have a graphical user interface. Since these devices accept graphical input, interesting geometries can be drawn using a stylus. In this scheme there is an m×n grid, and each cell is further divided into four parts by diagonal lines as shown in figure 4.

Figure 4: Grid provided to the user and some simple geometrical shapes drawn by the user.


Page 186: ADCOM 2009 Conference Proceedings

In figure 4, we have considered a 4×5 grid, keeping in mind the typical screen size of today's PDAs and their width-to-height ratio. Depending on the screen size this can be changed to a justifiable number of rows and columns. With this size (4×5) we have a total of 5×4 = 20 blocks, and each block has four triangles, so the number of possible triangles is (20 blocks) × (4 triangles/block) = 80. Similarly, each block has 4 small diagonal lines, giving (20 blocks) × (4 lines/block) = 80 lines. We also have the lines that result from joining adjacent grid points horizontally and vertically: 4×6 = 24 horizontal and 5×5 = 25 vertical segments, for a total of 24 + 25 = 49 lines. In that way we have a total of

p(5,4) = 80 + 80 + 49 = 209    (1)

These 209 objects can be used to choose a password by drawing some of them in an efficient manner. A password is the selection of certain lines and triangles: when a triangle is selected it is filled with some color, and when a line is selected its color changes (it gets highlighted). Any combination of selected lines and triangles forms a password, as shown in figure 5. In this way, highlighted lines and filled triangles provide a larger password space. Filling a triangle and highlighting a line are done with the stylus of the PDA, either by putting a dot inside the triangle or by dragging the stylus across the line. Research shows that as the number of dots increases, they become more difficult to remember; in this scheme the highlighted lines and filled triangles form a geometric shape, and it is the shape that must be recalled, not the dots. Moreover, we provide an “Invert” button which, with a single click, converts all highlighted lines to un-highlighted and vice versa, and does the same for filled triangles; this at least doubles the password space within the practical limit of password length. A line that is not inclined at 45°, 0° or 90°, i.e. a line not parallel to the diagonal, horizontal or vertical lines (let us call these non-parallel lines), can also be drawn by joining two points after enabling such drawing with the button labeled “Line”. Crossing the same line again cancels the effect of highlighting (figure 6); in general, crossing the same line an even number of times cancels the highlighting. The user does not need to recall the strokes, only the resulting geometry. Using the inversion operation, as shown in figure 7, the user can deselect all currently highlighted lines and triangles and select all the unselected ones.
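To make the count concrete, here is a minimal Python sketch of eq. (1), generalized to an m×n grid, together with the size of the resulting selection space discussed later in Section VII; the function name and layout are ours, not part of the scheme:

```python
def count_objects(m, n):
    """Count selectable objects in an m x n block grid, following eq. (1):
    4 triangles and 4 diagonal half-lines per block, plus the horizontal
    and vertical unit segments joining adjacent grid points."""
    blocks = m * n
    triangles = 4 * blocks        # each block is cut into 4 triangles
    diagonals = 4 * blocks        # 4 small diagonal lines per block
    horizontals = m * (n + 1)     # m segments per row of points, n+1 rows
    verticals = n * (m + 1)       # n segments per column, m+1 columns
    return triangles + diagonals + horizontals + verticals

objects = count_objects(4, 5)     # the paper's 4x5 grid
print(objects)                    # 209
print(f"{2 ** objects:.4e}")      # ~8.2275e+62 possible selections (Sec. VII)
```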

Figure 5: Drawing solid triangle

Note that the inversion does not apply to non-parallel lines. Figure 8 shows a password made using parallel and non-parallel lines. To draw such a line, the stylus is dragged from one point to another; the start and end points are decided by where the stylus touches the screen and where it leaves it. As illustrated in figure 8, if the stylus touches the screen at some coordinate (x, y) between two vertical lines va and vb and two horizontal lines ha and hb (the nearest grid lines, each within half a cell width of the point P) such that va ≤ x < vb and ha ≤ y < hb, the grid point nearest to P is taken. The same strategy is adopted for the end point, where the stylus leaves the screen. If the lines drawn by the user are in fact parallel but were drawn by the non-parallel procedure, the scheme detects this automatically, and such lines are treated as parallel lines.
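A minimal sketch of this end-point snapping, under our reading of the description (the function names and the rounding rule are illustrative):

```python
def snap_to_grid(x, y, cell_w, cell_h):
    """Snap a stylus coordinate (x, y) to the nearest grid point. The
    nearest vertical lines satisfy va <= x < vb (vb = va + cell_w), and
    similarly for the horizontal lines; rounding picks the closer one."""
    col = round(x / cell_w)   # index of nearest vertical grid line
    row = round(y / cell_h)   # index of nearest horizontal grid line
    return col, row

def stylus_line(start, end, cell_w, cell_h):
    """A non-parallel line is defined by snapping where the stylus
    touches the screen and where it leaves it."""
    return (snap_to_grid(*start, cell_w, cell_h),
            snap_to_grid(*end, cell_w, cell_h))
```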

Figure 6: Drawing lines


Page 187: ADCOM 2009 Conference Proceedings

Figure 7: Inversion of drawn geometry

Figure 8: Example of non-parallel lines

The grid shown on screen is for the user's convenience. A password drawn on an invisible grid is shown in figure 7, which also illustrates the inversion.

IV. TEXT SIMULATION

The technique described above can be used to write any textual password. In the example shown, the word “IMAGINE” is written vertically to accommodate more letters on the screen; the letter E is (purposely) missing, as shown in figure 9. If the password contains more words, multiple screens (frames) can be used to accommodate them.

Figure 9: Example of textual password

This allows users to use textual passwords in a graphical way. The letters can be drawn in any direction, and any letter can be entered at any position on the screen, as per the user's convenience.

V. EXTENSION FOR POSITION INDEPENDENCE AND MULTISTAGE

So far we have considered that the shapes and their locations together constitute the password. If the user has drawn the letter ‘A’ but fails to recall its position, the password is still rejected. The scheme can be extended to accommodate such cases: the location of the figure can be ignored if the shape is correct (as illustrated in figure 10), so the same shape pattern at the two circled locations is treated as the same password. Doing so obviously decreases the password space, but this can be compensated by increasing the number of grid cells. As we have seen, text can be drawn, but the size of a PDA limits the grid size. We can therefore have multiple stages for drawing shapes, i.e. one shape in a first frame, followed by the next frame, and so on. The user selects a “more” button (not shown) to move to the next fresh blank frame, on which more letters or shapes can be drawn. Although we could not write the full word IMAGINE in one frame, with multiple stages we can write the first few letters, say IMA, in the first frame and the rest, GINE, in the second. Multistage entry increases the time required to enter the password, but it also gives a huge password space: a password word such as GRAPH can be split by the user as GR and APH, or GRA and PH, and so on for two stages. The number of stages will normally be small, but by not fixing it we gain a very large password space.


Page 188: ADCOM 2009 Conference Proceedings

Figure 10: Example of position independence

VI. STORAGE OF PASSWORD

Since there is no need to store any image, only the password itself needs to be stored. As we have seen, for a 4×5 grid there are 209 possible objects if non-parallel lines are not considered; if we number the objects 0, 1, 2, …, 208, then 209 bits are sufficient to store such a password. An extra bit is kept for inversion, indicating whether the password is inverted or not, to avoid extra computation while entering the password. To include non-parallel lines, each such line is stored as the coordinates of its two end points (start point and end point). A fixed number of bits first records how many such lines there are, followed by the end-point coordinates of each line (10 bits per line). So if the number of non-parallel lines is np, the total number of bits required to store the password, taking 10 bits per non-parallel line, is

209 + 1 + 10 + np × 10 = 220 + np × 10.

So this scheme does not take much space to store the password, unlike many graphical schemes [10].
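For illustration, a small Python sketch of this storage layout; the bitmask packing and the function names are our own rendering of the description, not the paper's implementation:

```python
def password_bits(num_objects=209, np_lines=0):
    """Storage cost from Sec. VI: one bit per numbered object, one
    inversion flag, a fixed 10-bit field holding the number of
    non-parallel lines, and 10 bits (two grid-point coordinates)
    for each such line."""
    return num_objects + 1 + 10 + 10 * np_lines

def encode(selected, inverted=False):
    """Pack a set of selected object indices into an integer bitmask,
    with the inversion flag in the low bit."""
    mask = 0
    for obj in selected:
        mask |= 1 << obj
    return (mask << 1) | int(inverted)

print(password_bits(np_lines=2))   # 240 bits for two non-parallel lines
```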

Figure 11: Variation of password space with increase in number of grids.


Page 189: ADCOM 2009 Conference Proceedings

VII. SECURITY ANALYSIS

As we have seen in eq. (1), we have 209 objects, each of which is either selected or not. Considering only these 209 objects and excluding the non-parallel lines, we have a total of 2^209 ≈ 8.2275×10^62 possibilities, which is a huge password space; the scheme is therefore very robust from a security point of view even without non-parallel lines. If we also consider non-parallel lines, an additional 220 objects are added, each again either selected or not, so the total number of possible passwords is 2^(209+220) = 2^429 ≈ 1.386×10^129. The password space clearly grows exponentially with the number of rows or columns, as shown in figure 11. Devices with bigger screens (such as ATMs) can have many more rows and columns. With this large password space it is very difficult to mount a brute force attack, and even if the user chooses a graphical representation of text, he is far less susceptible to dictionary attacks. The password space computed above is for the simple case of a 4×5 grid and single-stage password entry; including position independence and multistage entry increases it many fold. Since no special assumption was made for text simulation, the password space remains the same even when the scheme is used as a textual password scheme.

VIII. CASE STUDY

We requested 25 users to try this scheme and share their experience with us. When asked to rate the ease of use of the new methodology on a scale of 1-10, we got an average of 8.5. Twenty out of twenty-five users said they found it easier to remember passwords as graphical geometries. Twenty-one out of twenty-five users could reproduce their passwords after an interval of one week.

IX. CONCLUSION AND FUTURE WORK

In this paper we have proposed a graphical password scheme in which the user draws simple geometrical shapes consisting of lines and solid triangles. The user does not need to remember the way in which the password was drawn, only the final geometrical shape. The scheme gives a larger password space and is competent in resisting brute force attacks, and its way of storing the password requires less space than other graphical schemes. The scheme is immune to shoulder surfing as long as the screen of the hand-held device is visible only to the user; however, when employed on PCs and ATM machines it is susceptible to shoulder surfing. To make it more robust against shoulder surfing, we would have to take into account the order in which the various components of the geometrical shape were drawn, i.e. which line or triangle was selected first, which next, and so on. This consideration would limit the scheme's vulnerability to shoulder surfing and would also expand the password space.

REFERENCES

1. Daniel V. Klein, “Foiling the Cracker: A Survey of, and Improvements to, Password Security”.

2. Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychonomic Science, 19(2):73-74, 1970.

3. SFR-IT-Engineering, http://www.sfrsoftware.de/cms/EN/pocketpc/viskey/, Accessed on January 2007.

4. Muhammad Daniel Hafiz, Abdul Hanan Abdullah, Norafida Ithnin, Hazinah K. Mammi “Towards Identifying Usability and Security Features of Graphical Password in Knowledge Based Authentication Technique”, in Proceedings of the Second Asia International Conference on Modelling & Simulation, IEEE Computer Society.

5. D. Davis, F. Monrose, and M. Reiter. On User Choice in Graphical Password Schemes. In 13th USENIX Security Symposium, 2004.

6. J. Thorpe and P. van Oorschot. Graphical Dictionaries and the Memorable Space of Graphical Passwords. In 13th USENIX Security Symposium, 2004.

7. I. Jermyn, A. Mayer, F. Monrose, M. Reiter, and A. Rubin. The Design and Analysis of Graphical Passwords. 8th USENIX Security Symposium, 1999.

8. R.-S. French. Identification of Dot Patterns From Memory as a Function of Complexity. Journal of Experimental Psychology, 47:22–26, 1954.

9. S.-I. Ichikawa. Measurement of Visual Memory Span by Means of the Recall of Dot-in-Matrix Patterns. Behavior Research Methods and Instrumentation, 14(3):309–313, 1982.

10. Xiaoyuan Suo, Ying Zhu, G. Scott Owen, “Graphical Passwords: A Survey”, in Proceedings of the 21st Annual Computer Security Applications Conference (ACSAC 2005), IEEE Computer Society.

11. Konstantinos Chalkias, Anastasios Alexiadis, George Stephanides “A Multi-Grid Graphical Password Scheme”.

12. Julie Thorpe, P.C. van Oorschot, “Towards Secure Design Choices for Implementing Graphical Passwords”, in Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC’04), IEEE Computer Society.

Page 190: ADCOM 2009 Conference Proceedings

13. Julie Thorpe, P.C. van Oorschot, Anil Somayaji, “Pass-thoughts: Authenticating With Our Minds”.

14. Phen-Lan Lin, Li-Tung Weng, Po-Whei Huang, “Graphical Passwords Using Images with Random Tracks of Geometric Shapes”, in Proceedings of the 2008 Congress on Image and Signal Processing, IEEE Computer Society.

15. Sonia Chiasson, P.C. van Oorschot, and Robert Biddle, “Graphical Password Authentication Using Cued Click Points”.

16. M. W. Calkins. Short studies in memory and association from the Wellesley College Laboratory. Psychological Review, 5:451-462, 1898.

17. M. A. Borges, M. A. Stepnowsky, and L. H. Holt. Recall and recognition of words and pictures by adults and children. Bulletin of the Psychonomic Society, 9:113-114, 1977.


Page 191: ADCOM 2009 Conference Proceedings

Improving Performance of Speaker Identification System Using Complementary Information Fusion

Md. Sahidullah, Sandipan Chakroborty and Goutam Saha
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, India, Kharagpur-721 302

Email: [email protected], [email protected], [email protected]
Tel: +91-3222-283556/1470, Fax: +91-3222-255303

Abstract—Feature extraction plays an important role as a front-end processing block in the speaker identification (SI) process. Most SI systems utilize features such as Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), or Linear Predictive Cepstral Coefficients (LPCC) for representing the speech signal. Their derivations are based on short-term processing of the speech signal, and they try to capture the vocal tract information while ignoring the contribution of the vocal cords. Vocal cord cues are equally important in the SI context, as information such as the pitch frequency and the phase of the residual signal can convey important speaker-specific attributes that are complementary to the information contained in spectral feature sets. In this paper we propose a novel feature set extracted from the residual signal of LP modeling. Higher-order statistical moments are used here to capture the nonlinear relationships in the residual signal. To exploit this complementarity, the vocal cord based decision score is fused with the vocal tract based score. The experimental results on two public databases show that the fused system outperforms single spectral features.

Index Terms—Speaker Identification, Feature Extraction, Higher-order Statistics, Residual Signal, Complementary Feature.

I. INTRODUCTION

Speaker identification is the process of identifying a person by his/her voice signal [1]. A state-of-the-art speaker identification system requires a feature extraction unit as a front-end processing block, followed by an efficient modeling scheme. Vocal tract characteristics such as formant frequencies and their bandwidths are supposed to be unique to each human being, and the basic target of the feature extraction block is to characterize this information. The feature extraction process also represents the original speech signal in a compact format while emphasizing the speaker-specific information, and it should represent the original signal in a robust manner. Most speaker identification systems use Mel Frequency Cepstral Coefficients (MFCC) or Linear Prediction Cepstral Coefficients (LPCC) in the feature extraction block [1]. MFCC is a modification of the conventional Linear Frequency Cepstral Coefficient, designed keeping in mind the auditory system of human beings [2]. The LPCC, on the other hand, is based on time-domain processing of the speech signal [3]; the conventional LPCC was later also modified, motivated by the perceptual properties of the human ear [4]. Like the vocal tract, vocal cord information

also contains speaker-specific information [5]. The residual signal, which can be obtained from the Linear Prediction (LP) analysis of the speech signal, contains information related to the source, i.e. the vocal cords. Earlier, Auto-associative Neural Networks (AANN), Wavelet Octave Coefficients of Residues (WOCOR), residual phase, etc. were used to extract information from the residual signal. In this work we introduce higher-order statistical moments to capture information from the residual signal, and we integrate the vocal cord information with the vocal tract information to boost the performance of the speaker identification system. The log-likelihood scores of both systems are fused together to get the advantage of their complementarity [6], [7]. The speaker identification results on both databases show that by combining the two systems, the performance can be improved over baseline spectral-feature-based systems.

This paper is organized as follows. In section II we first review the basics of linear prediction analysis, followed by the proposed feature extraction technique. The speaker identification experiments and results are presented in section III. Finally, the paper is concluded in section IV.

II. FEATURE EXTRACTION FROM RESIDUAL SIGNAL

In this section we first explain the conventional method of deriving the residual signal by LP analysis. The proposed feature extraction process is described subsequently.

A. Linear Prediction Analysis and Residual Signal

In the LP model, the (n − 1)-th to (n − p)-th samples of the speech wave (n, p are integers) are used to predict the n-th sample. The predicted value of the n-th speech sample [3] is given by

$$\hat{s}(n) = \sum_{k=1}^{p} a(k)\, s(n-k) \qquad (1)$$

where $\{a(k)\}_{k=1}^{p}$ are the predictor coefficients and $s(n)$ is the $n$-th speech sample. The value of $p$ is chosen such that it can effectively capture the real and complex poles of the vocal tract in a frequency range equal to half the sampling frequency. The Prediction Coefficients (PC) are determined by


Page 192: ADCOM 2009 Conference Proceedings


Fig. 1. Example of two speech frames (top), their LP residuals (middle) and corresponding residual moments (bottom).

minimizing the mean square prediction error [1], where the error is defined as

$$E = \frac{1}{N} \sum_{n=0}^{N-1} \big(s(n) - \hat{s}(n)\big)^2 \qquad (2)$$

where the summation is taken over all $N$ samples. The set of coefficients $\{a(k)\}_{k=1}^{p}$ which minimize the mean-squared prediction error are obtained as the solutions of the set of linear equations

$$\sum_{k=1}^{p} \phi(j,k)\, a(k) = \phi(j,0), \qquad j = 1, 2, 3, \ldots, p \qquad (3)$$

where

$$\phi(j,k) = \frac{1}{N} \sum_{n=0}^{N-1} s(n-j)\, s(n-k) \qquad (4)$$

The PC $\{a(k)\}_{k=1}^{p}$ are derived by solving the recursive equation (3). Using $\{a(k)\}_{k=1}^{p}$ as model parameters, equation (5) represents the fundamental basis of the LP representation: any signal can be defined by a linear predictor and its prediction error,

$$s(n) = -\sum_{k=1}^{p} a(k)\, s(n-k) + e(n) \qquad (5)$$

The LP transfer function can be defined as

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a(k)\, z^{-k}} = \frac{G}{A(z)} \qquad (6)$$

where $G$ is the gain scaling factor for the present input and $A(z)$ is the $p$-th order inverse filter. The LP coefficients themselves can be used for speaker recognition, as they contain speaker-specific information such as the vocal tract resonance frequencies and their bandwidths.

The prediction error $e(n)$ is called the residual signal, and it contains all the complementary information that is not contained in the PC. It is worth mentioning here that the residual signal conveys vocal source cues such as the fundamental frequency, pitch period, etc.
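As an illustration of Section II-A, here is a short numpy sketch that solves the normal equations (3) under the usual autocorrelation approximation $\phi(j,k) \approx r(|j-k|)$ and obtains the residual by inverse filtering; a plain linear solve stands in for the recursive (Levinson-Durbin) solution, and the code is our sketch, not the authors':

```python
import numpy as np

def lp_residual(frame, p=17):
    """Estimate LP coefficients a(k) from the normal equations (3)-(4),
    then return the residual e(n) = s(n) - s_hat(n)."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # autocorrelation r(k) = (1/N) * sum_n s(n) s(n+k)
    r = np.array([frame[:N - k] @ frame[k:] for k in range(p + 1)]) / N
    # Toeplitz system R a = [r(1) ... r(p)]^T
    R = np.array([[r[abs(j - k)] for k in range(p)] for j in range(p)])
    a = np.linalg.solve(R, r[1:])
    # predicted signal s_hat(n) = sum_{k=1}^{p} a(k) s(n-k), as in eq. (1)
    s_hat = np.convolve(frame, np.concatenate(([0.0], a)))[:N]
    return frame - s_hat
```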

B. Statistical Moments of Residual Signal

The residual signal introduced in Section II-A generally has a noise-like behavior and a flat spectral response. Though it contains vocal source information, it is very difficult to characterize it perfectly. In the literature, Wavelet Octave Coefficients of Residues (WOCOR) [7], Auto-associative Neural


Page 193: ADCOM 2009 Conference Proceedings

[Fig. 2 block diagram: Windowed Speech Frame → LP Analysis (LP Coeff) → Inverse Filtering → Magnitude Normalization → Higher Order Moment Computation → Residual Moment Feature]

Fig. 2. Block diagram of the residual moment based feature extraction technique.

Network (AANN) [5], residual phase [6], etc. have been used to extract the residual information. It is worth mentioning here that higher-order statistics have shown significant results in a number of signal processing applications [8] when the nature of the signal is non-Gaussian. Higher-order statistics have also drawn the attention of researchers for retrieving information from LP residual signals [9]. Recently, the higher-order cumulant of the LP residual signal was investigated [10] for improving the performance of speaker identification systems.

The higher-order statistical moments of a signal parameterize the shape of its distribution [11]. Let the distribution of a random signal $x$ be denoted by $P(x)$ and its mean by $\mu$; the central moment of order $k$ of $x$ is then

$$M_k = \int_{-\infty}^{\infty} (x - \mu)^k \, dP \qquad (7)$$

for $k = 1, 2, 3, \ldots$. On the other hand, the characteristic function of the probability distribution of the random variable is given by

$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{jtx} \, dP = \sum_{k=0}^{\infty} M_k \frac{(jt)^k}{k!} \qquad (8)$$

From this equation it is clear that the moments $M_k$ are the coefficients of the expansion of the characteristic function; hence they can be treated as one set of expressive constants of a distribution. Moments can also effectively capture the randomness of the residual of autoregressive modeling [12].

In this paper, we use higher-order statistical moments of the residual signal to parameterize the vocal source information. The feature derived by the proposed technique is termed the Higher Order Statistical Moment of Residual (HOSMR). The different blocks of the proposed feature extraction technique are shown in fig. 2.

First the residual signal is normalized to the range $[-1, +1]$. Then the central moment of order $k$ of a residual signal $e(n)$ is computed as

$$m_k = \frac{1}{N} \sum_{n=0}^{N-1} \big(e(n) - \mu\big)^k \qquad (9)$$

where $\mu$ is the mean of the residual signal over a frame. As the range of the residual signal is normalized, the first-order moment (i.e. the mean) becomes zero. The higher-order moments (for $k = 2, 3, 4, \ldots, K$) are taken as vocal source features, as they represent the shape of the distribution of the random signal. The lower-order moments are a coarse parametrization, whereas the higher orders give a finer representation of the residual signal. In fig. 1, the LP residual signal of a frame is shown along with its higher-order moments. It is clear from the figure that, for the lower-order moments, both the even- and odd-order values are highly distinguishable.
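A compact sketch of the HOSMR computation of eq. (9); retaining K moments starting at order 2 is our reading of the index range "k = 2, 3, 4 … K":

```python
import numpy as np

def hosmr(residual, K=6):
    """HOSMR feature: normalize the residual to [-1, +1], then take K
    central moments starting at order 2 (the first-order moment vanishes
    after mean removal), as in eq. (9)."""
    residual = np.asarray(residual, dtype=float)
    peak = np.max(np.abs(residual))
    e = residual / peak if peak > 0 else residual   # magnitude normalization
    mu = e.mean()
    return np.array([np.mean((e - mu) ** k) for k in range(2, K + 2)])
```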

C. Fusion of Vocal Tract and Vocal Cord Information

In this section we propose to integrate the vocal tract and vocal cord parameters for identifying speakers. Although the two approaches differ significantly in performance, the ways they represent the speech signal are complementary to one another. Hence, it is expected that combining the advantages of both features will improve [13] the overall performance of the speaker identification system. The block diagram of the combined system is shown in fig. 3. Spectral features and residual features are extracted from the training data in two separate streams. Speaker modeling is then performed for the respective features independently, and the model parameters are stored in the model database. At the time of testing, the same process is adopted for feature extraction. The log-likelihoods of the two different features are computed w.r.t. their corresponding models. Finally, the output scores are weighted and combined.

We use score-level linear fusion which, to get the advantages of both systems and their complementarity, can be formulated as follows:

$$LLR_{combined} = \alpha \, LLR_{spectral} + (1 - \alpha) \, LLR_{residual} \qquad (10)$$

where $LLR_{spectral}$ and $LLR_{residual}$ are the log-likelihood ratios calculated from the spectral and residual based systems, respectively. The fusion weight is decided by the parameter $\alpha$.
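Equation (10) in code form (a one-liner; the function name is ours):

```python
def fuse_scores(llr_spectral, llr_residual, alpha=0.5):
    """Score-level linear fusion of eq. (10); alpha = 0.5 weighs the vocal
    tract (spectral) and vocal cord (residual) streams equally, as in the
    experiments of Section III."""
    return alpha * llr_spectral + (1.0 - alpha) * llr_residual
```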

III. SPEAKER IDENTIFICATION EXPERIMENT

A. Experimental Setup

1) Pre-processing stage: In this work, the pre-processing stage is kept the same across the different feature extraction methods. It is performed in the following steps (a small code sketch of these steps follows the list):

∙ Silence removal and end-point detection are done using an energy threshold criterion.

∙ The speech signal is then pre-emphasized with a 0.97 pre-emphasis factor.

∙ The pre-emphasized speech signal is segmented into frames of 20 ms each with 50% overlap, i.e. the total number of samples in each frame is N = 160 (sampling frequency Fs = 8 kHz).

∙ In the last step of pre-processing, each frame is windowed using the Hamming window

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) \qquad (11)$$

where N is the length of the window.
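A numpy sketch of these pre-processing steps, assuming the parameters stated above (the energy-based silence removal is omitted, and the names are illustrative):

```python
import numpy as np

def preprocess(signal, fs=8000, frame_ms=20, overlap=0.5, pre_emph=0.97):
    """Pre-emphasis, framing into 20 ms frames with 50% overlap
    (N = 160 samples at 8 kHz), and Hamming windowing."""
    x = np.asarray(signal, dtype=float)
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])   # pre-emphasis filter
    N = int(fs * frame_ms / 1000)                    # samples per frame (160)
    hop = int(N * (1 - overlap))                     # 80-sample hop = 50% overlap
    window = np.hamming(N)       # w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)), eq. (11)
    frames = [x[i:i + N] * window for i in range(0, len(x) - N + 1, hop)]
    return np.array(frames)
```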


Page 194: ADCOM 2009 Conference Proceedings

Fig. 3. Block diagram of Fusion Technique: Score level fusion of Vocal tract (short term spectral based feature) and Vocal cord information (Residual).

2) Classification & identification stage: The Gaussian Mixture Modeling (GMM) technique is used to obtain a probabilistic model for the feature vectors of a speaker. The idea of GMM is to use a weighted sum of multivariate Gaussian functions to represent the probability density of the feature vectors, given by

$$p(\mathbf{x}) = \sum_{i=1}^{M} p_i \, b_i(\mathbf{x}) \qquad (12)$$

where $\mathbf{x}$ is a $d$-dimensional feature vector, $b_i(\mathbf{x})$, $i = 1, \ldots, M$ are the component densities and $p_i$, $i = 1, \ldots, M$ are the mixture weights or priors of the individual Gaussians. Each component density is given by

$$b_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu_i)^{T} \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right\} \qquad (13)$$

with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The mixture weights must satisfy the constraints $\sum_{i=1}^{M} p_i = 1$ and $p_i \geq 0$. The Gaussian Mixture Model is parameterized by the means, covariances and mixture weights of all component densities and is denoted by

$$\lambda = \{p_i, \mu_i, \Sigma_i\}_{i=1}^{M} \qquad (14)$$

In SI, each speaker is represented by a GMM and is referred to by his/her model. The parameters of $\lambda$ are optimized using the Expectation Maximization (EM) algorithm [14]. In these experiments, the GMMs are trained with 10 iterations, with clusters initialized by the vector quantization [15] algorithm.

In the identification stage, the log-likelihood score of the feature vectors of the utterance under test is calculated as

$$\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda) \qquad (15)$$

where $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$ is the set of feature vectors of the test utterance.

In the closed-set SI task, an unknown utterance is identified as an utterance of the particular speaker whose model gives the maximum log-likelihood. This can be written as

$$\hat{S} = \arg\max_{1 \leq k \leq S} \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_k) \qquad (16)$$

where $\hat{S}$ is the identified speaker from the speaker model set $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_S\}$ and $S$ is the total number of speakers.
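A sketch of the GMM enrollment and the decision rule of eqs. (15)-(16), using scikit-learn's GaussianMixture as a stand-in for the paper's VQ-initialized, 10-iteration EM training:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_per_speaker, M=16):
    """Fit one GMM per enrolled speaker (eqs. 12-14). Each element of
    features_per_speaker is an (n_frames, d) array of training vectors."""
    return [GaussianMixture(n_components=M).fit(f) for f in features_per_speaker]

def identify(models, test_features):
    """Closed-set decision of eqs. (15)-(16): sum the per-frame
    log-likelihoods under each speaker model and pick the best one."""
    scores = [m.score_samples(test_features).sum() for m in models]
    return int(np.argmax(scores))
```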

3) Databases for experiments:

YOHO Database: The YOHO voice verification corpus [1], [16] was collected while testing ITT's prototype speaker


Page 195: ADCOM 2009 Conference Proceedings

verification system in an office environment. Most subjects were from the New York City area, although there were many exceptions, including some non-native English speakers. A high-quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech was not passed through a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker, there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In this work, a closed-set text-independent speaker identification problem is attempted where all 138 speakers are considered client speakers. For each speaker, all 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model, while for testing, 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluated the identification accuracies.

POLYCOST Database: The POLYCOST database [17] was recorded as a common initiative within the COST 250 action during January–March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. Each session consists of 14 items, two of which (the MOT01 and MOT02 files) contain speech in the subject's mother tongue. The database was collected through the European telephone network, and the recording was performed with ISDN cards on two XTL SUN platforms at an 8 kHz sampling rate. In this work, a closed-set text-independent speaker identification problem is addressed where only the mother tongue (MOT) files are used. The specified guideline [17] for conducting closed-set speaker identification experiments is adhered to, i.e. ‘MOT02’ files from the first four sessions are used to build a speaker model, while ‘MOT01’ files from session five onwards are taken for testing. As with the YOHO database, all speakers (131 after deletion of three speakers) in the database were registered as clients.

4) Score calculation: For the closed-set speaker identification problem, the identification accuracy defined in [18] and given by equation (17) is used:

$$\text{Percentage of identification accuracy (PIA)} = \frac{\text{No. of utterances correctly identified}}{\text{Total no. of utterances under test}} \times 100 \qquad (17)$$
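In code, eq. (17) is simply:

```python
def pia(correct, total):
    """Percentage of identification accuracy, eq. (17)."""
    return 100.0 * correct / total

# illustrative numbers only, not results from the paper:
print(pia(5214, 5520))   # 94.456...
```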

B. Speaker Identification Experiments and Results

The performance of the speaker identification system based on the proposed HOSMR feature is evaluated on both databases. The order of the LP analysis is kept at 17, and 6 residual moments are taken to characterize the residual information. We conducted experiments with the GMM-based classifier for different model orders; the identification results are shown in Table I. The identification performance on its own is quite low, because the vocal cord parameters are not the only cues for identifying speakers, though they make an inherent contribution to recognition. At the same time they contain information which is not contained in the spectral features, so it is the combined performance of the two systems that should be observed. We conducted SI experiments using two major kinds of baseline features: some based on LP analysis (LPCC and PLPCC) and others (LFCC and MFCC) based on filterbank analysis. The feature dimension is set at 19 for all kinds of features for a fair comparison. In the LP-based systems, 19 filters are used for all-pole modeling of the speech signal. On the other hand, 20 filters are used in the filterbank-based systems, and 19 coefficients are taken for the Linear Frequency Cepstral Coefficients (LFCC) and MFCC after discarding the first coefficient, which represents the dc component. Detailed descriptions are available in [19], [20]; the derivation of the LP-based features can be found in [1], [4], [21].

The performance of the baseline SI systems and the fused systems for different features and different model orders is shown in Table II and Table III for the POLYCOST and YOHO databases respectively. In these experiments we take equal evidence from the two systems and set the value of α to 0.5. The results for the conventional spectral features follow the results shown in [22]. The POLYCOST database consists of speech signals collected over a telephone channel; the improvement for this database is more significant than for YOHO, which is microphonic. The experimental results show significant performance improvement of the fused SI system over the spectral-only systems for various model orders.

TABLE I
SPEAKER IDENTIFICATION RESULTS ON POLYCOST AND YOHO DATABASES USING THE HOSMR FEATURE FOR DIFFERENT MODEL ORDERS OF GMM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6).

Database    Model Order    Identification Accuracy
POLYCOST    2              19.4960
            4              21.6180
            8              19.0981
            16             22.4138
YOHO        2              16.8841
            4              18.2246
            8              15.1268
            16             18.2246
            32             21.2138
            64             21.9565

IV. CONCLUSION

The objective of this paper is to propose a new technique to improve the performance of conventional speaker identification systems, which are based on spectral features representing only vocal tract information. The higher-order statistical moments of the residual signal are derived and treated as a parameter carrying vocal cord information, and the log-likelihoods of the two systems are fused together. The experimental results on two popular speech corpora show that significant improvement can be obtained with the combined SI system.

REFERENCES

[1] J. P. Campbell, “Speaker recognition: a tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep 1997.

[2] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 28, no. 4, pp. 357–366, Aug 1980.


Page 196: ADCOM 2009 Conference Proceedings

TABLE II
SPEAKER IDENTIFICATION RESULTS ON POLYCOST DATABASE SHOWING THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM AND THE FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (α) = 0.5).

Feature   Model Order   Baseline System   Fused System
LPCC      2             63.5279           71.4854
          4             74.5358           78.9125
          8             80.3714           81.6976
          16            79.8408           82.8912
PLPCC     2             62.9973           65.7825
          4             72.2812           75.5968
          8             75.0663           77.3210
          16            78.3820           80.5040
LFCC      2             62.7321           71.6180
          4             74.9337           78.1167
          8             79.0451           81.2997
          16            80.7692           83.4218
MFCC      2             63.9257           69.7613
          4             72.9443           76.1273
          8             77.8515           79.4430
          16            77.8515           79.5756

TABLE III
SPEAKER IDENTIFICATION RESULTS ON YOHO DATABASE SHOWING THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM AND THE FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (α) = 0.5).

Feature   Model Order   Baseline System   Fused System
LPCC      2             80.9420           84.7101
          4             88.9855           91.0870
          8             93.8949           94.7826
          16            95.6884           96.2862
          32            96.5399           97.1014
          64            96.7391           97.2826
PLPCC     2             66.5761           72.5543
          4             76.9203           81.0507
          8             85.3080           87.7717
          16            90.6341           91.9022
          32            93.5326           94.3116
          64            94.6920           95.3986
LFCC      2             83.0072           85.8152
          4             90.3623           91.7935
          8             94.6196           95.4891
          16            96.2681           96.6848
          32            97.1014           97.3551
          64            97.2464           97.6268
MFCC      2             74.3116           78.6051
          4             84.8551           86.9384
          8             90.6703           92.0290
          16            94.1667           94.6920
          32            95.6522           95.9964
          64            96.7935           97.1014

[3] B. S. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” The Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, 1974.

[4] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[5] S. M. Prasanna, C. S. Gupta, and B. Yegnanarayana, “Extraction of speaker-specific excitation information from linear prediction residual of speech,” Speech Communication, vol. 48, no. 10, pp. 1243–1261, 2006.

[6] K. Murty and B. Yegnanarayana, “Combining evidence from residual phase and MFCC features for speaker recognition,” Signal Processing Letters, IEEE, vol. 13, no. 1, pp. 52–55, Jan. 2006.

[7] N. Zheng, T. Lee, and P. C. Ching, “Integration of complementary acoustic features for speaker recognition,” Signal Processing Letters, IEEE, vol. 14, no. 3, pp. 181–184, March 2007.

[8] A. Nandi, “Higher order statistics for digital signal processing,” Mathematical Aspects of Digital Signal Processing, IEE Colloquium on, pp. 6/1–6/4, Feb 1994.

[9] E. Nemer, R. Goubran, and S. Mahmoud, “Robust voice activity detection using higher-order statistics in the LPC residual domain,” Speech and Audio Processing, IEEE Transactions on, vol. 9, no. 3, pp. 217–231, Mar 2001.

[10] M. Chetouani, M. Faundez-Zanuy, B. Gas, and J. Zarader, “Investigation on LP-residual representations for speaker identification,” Pattern Recognition, vol. 42, no. 3, pp. 487–494, 2009.

[11] C.-H. Lo and H.-S. Don, “3-D moment forms: their construction and application to object identification and positioning,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 11, no. 10, pp. 1053–1064, Oct 1989.

[12] S. G. Mattson and S. M. Pandit, “Statistical moments of autoregressive model residuals for damage localisation,” Mechanical Systems and Signal Processing, vol. 20, no. 3, pp. 627–645, 2006.

[13] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, no. 3, pp. 226–239, Mar 1998.

[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1–38, 1977.

[15] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Transactions on Communications, vol. COM-28, no. 4, pp. 84–95, 1980.

[16] A. Higgins, J. Porter, and L. Bahler, “YOHO speaker authentication final report,” ITT Defense Communications Division, Tech. Rep., 1989.

[17] H. Melin and J. Lindberg, “Guidelines for experiments on the POLYCOST database,” in Proceedings of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59–69.

[18] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” Speech and Audio Processing, IEEE Transactions on, vol. 3, no. 1, pp. 72–83, Jan 1995.

[19] S. Chakroborty, A. Roy, S. Majumdar, and G. Saha, “Capturing complementary information via reversed filter bank and parallel implementation with MFCC for improved text-independent speaker identification,” in Computing: Theory and Applications, 2007. ICCTA ’07. International Conference on, March 2007, pp. 463–467.

[20] S. Chakroborty, “Some studies on acoustic feature extraction, feature selection and multi-level fusion strategies for robust text-independent speaker identification,” Ph.D. dissertation, Indian Institute of Technology, 2008.

[21] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. First Indian Reprint: Pearson Education, 2003.

[22] D. Reynolds, “Experimental evaluation of features for robust speaker identification,” Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 4, pp. 639–643, Oct 1994.


Page 197: ADCOM 2009 Conference Proceedings

Right Brain Testing
Applying Gestalt psychology in Software Testing

Narayanan Palani, Student in Computer Applications, BITS-Pilani, India.

[Tel:+91-9900465054 E-mail: [email protected]]

Abstract— Applying Gestalt psychology in software testing can open innovative paths and new techniques in web testing methodologies. It can address critical testing problems such as test blindness, condition testing in complex conditions, and web security testing.

I. INTRODUCTION

Software testing is an art, and complete testing ensures that the application meets the customer's expectations. A tester can use various testing methods to address the key challenges in application testing. Applying psychology in software testing can address a few recurring flaws in testing, such as ‘test blindness’, ‘application security constraints’, and ‘desired functionality and requirement mismatch’.

II. SCOPE:
The application of psychology in software testing can be achieved and well utilized by:

Providing psychological lab trainings to the SQA team.

Holding sessions with a research psychologist to develop frameworks that address critical problems in software testing.

Having SQA teams assist research scholars and psychoanalysts in ‘Collaborative Testing Assignments’ in which innovative testing methods can be explored.

III. RIGHT BRAIN TESTING:
It analyses the application using diagrammatic, graphical and analytical representations (using the right side of the brain); hence it is apt to call it ‘right brain testing’, which aims to find the right behavior of the AUT (application under test).

IV. APPLICATION OF GESTALTISM IN SOFTWARE TESTING:
Gestalt psychology is a theory of mind and brain positing that the operational principle of the brain is holistic. The Gestalt effect refers to the form-forming capability of our senses, particularly with respect to the visual recognition of figures/objects and whole forms instead of just a collection of simple lines and curves.

It can be applied to software testing to address the challenges of critical testing aspects in content testing, navigation testing, usability testing and user interface testing.

V. APPLICATION OF LAWS OF GESTALTISM IN SOFTWARE TESTING:

A. Law of Closure:
The mind may experience elements it does not perceive through sensation, in order to complete a regular figure (that is, to increase regularity).

Application: Figures, fonts and objects of the AUT may not be perceived through the presentation alone; the mind completes the regular information (that is, increases regularity).

B. Law of Similarity:
The mind groups similar elements into collective entities or totalities. This similarity might depend on relationships of form, color, size, or brightness.

Application: Collective entities of the AUT depend upon a broad range of user information, so it is necessary to present collective entities in similar places. This can be analyzed as a usability issue to improve the customer experience.

C. Law of Proximity:
Spatial or temporal proximity of elements may induce the mind to perceive a collective or totality.

Application: Spatial or temporal proximity of elements in the AUT may induce the mind to perceive a collective or totality. It should not make any negative impact, and it can be tested for user experience. This can be achieved by training a Subject Matter Expert (SME) to test the AUT using Gestalt psychology.

D. Law of Symmetry (figure-ground relationships):
Symmetrical images are perceived collectively, even in spite of distance.

Application: Symmetrical images/data/items in the GUI are perceived collectively, even in spite of distance. The tester should therefore make sure that the information makes the right impact and gives a user-friendly experience in navigation (in navigation testing).

E. Law of Continuity:
The mind continues visual, auditory, and kinetic patterns.

Application: The mind continues visual, auditory, and kinetic patterns across similar pages/menus in the AUT. So the user


Page 198: ADCOM 2009 Conference Proceedings

expects similar objects in consecutive steps/functionalities and processes. Grouping similar objects/information together in the GUI can overcome this issue (in user interface testing).

F. Law of Common Fate:
Elements with the same moving direction are perceived as a collective or unit.

Application: Elements with the same objective and scope should be collective in the AUT. This makes for a good customer experience over the navigation period and gives a clear flow for testing in usability testing.

VI. APPLICATION OF GESTALT PROPERTIES IN SOFTWARE TESTING:

A. Application of Emergence:
An SME can identify ‘strong emergence’ and ‘weak emergence’ of components/functions in web pages by applying emergence to testing. A tester can validate linkages and parameter passing mechanisms by applying ‘emergence’ to track the various components/scripts which communicate together at different stages of user interaction. In user interface testing, this addresses critical aspects like application-user interaction problems and the ‘strong emergence’ of application components with interrupts. In integration testing, a tester can easily track the problems that arise when units are combined strongly (strong emergence) or weakly (weak emergence); this gives a clear distinction between strong and weak emergence of components.

Figure 1. Strong Emergence of Unit1-3, Weak Emergence: Unit1-2, Unit2-3 in Integration testing.

Figure 2. Weak emergence in pictures 1 and 2; strong emergence in picture 3.

B. Application of Reification:

Figure 3: Example picture representation of reification. A complete three-dimensional shape is visible in the picture, whereas in actuality no such thing is drawn.

Reification is the constructive or generative aspect of perception of an object/process in the AUT. Misunderstanding of test requirements or wrong assumptions about functionalities can introduce new bugs into the application. A strong understanding of reification can open innovative ways to track down hidden defects in the AUT. The tester can create a table to compare reification points with his understanding of the test functionalities.



Page 199: ADCOM 2009 Conference Proceedings

C. Application of Multistability:

Figure 4: Edgar Rubin's ‘Rubin figure’, an example of multistability.

Classification of paths to test in the Rubin figure: 1. Left face only, 2. Right face only, 3. Left face and right figure, 4. Right face and left face, 5. White color of figure, 6. Black color of figure, 7. Size and color ratio.

Combinations for testing in Rubin Figure:

1-5, 2-5, 3-5, 4-5, 1-6, 2-6, 3-6, 4-6, 5-6

Instability between two or more alternative interpretations is known as ‘multistability’. When a test requirement is documented, it can be understood in two or more alternative ways, and content testing done according to one understanding can lead to incorrect testing. By practising the ‘multistability’ methods of Gestaltism, a tester can clearly derive the combinations and types of test requirements. It enables the tester to observe the variety of viewpoints in the requirement specification and to raise them with the developers if needed.

D. Application of Invariance:

Figure 5: An ornamental pattern in which dozens of features are processed simultaneously, representing ‘Gestalt invariance’.

Recognition of objects independent of rotation, translation, and scale, as well as variations like elastic deformations, different lighting, and different component features, is monitored using ‘invariance testing’ in the AUT.

VII. APPLICATION OF ‘PRODUCTIVE THINKING’ IN USABILITY TESTING:

Formula:

PTR = [Combinations × Test Techniques] / Time

where PTR is the Productive Thinking Result, which analyses the test factors. It can be customized, based on evaluations, to identify the weightage of the productive thinking process.

VIII. REPRODUCTIVE THINKING AND ITS REPRESENTATION IN TESTING:
Solving a problem from previous experience through reproductive thinking can be applied to all kinds of software testing methods, but the tester must document the observations and innovative new approaches for future testing reference.

IX. APPLICATION OF THE ‘RULE OF FIGURE’ IN GRAPHICAL USER INTERFACE (GUI) TESTING:
For example, look at the bottom of a page where the letters got cut off but you still knew what it was telling you. It can also be the concept of looking at a picture that contains two optical illusions. This is known as the ‘Rule of Figure’. It can be the right option for testing the GUI of the AUT.

X. TECHNIQUES THAT CAN BE DERIVED FROM THE ‘RULE OF FIGURE’:

View the AUT partially, imagine the remaining GUI, and take notes on those expectations to compare later.

Analyse the sizes and formats of the various diagrams, reports and graphical representations in the AUT.

XI. VIEWS WHICH GO AGAINST GESTALT PSYCHOLOGY:

The Three-Process View
The Neo-Gestalts View

It is necessary to consider views which oppose Gestaltism when they are useful for testing. The three-process view and the Neo-Gestalts view criticise Gestaltism, but the properties of these views are very useful for exploring innovative software testing approaches.


Page 200: ADCOM 2009 Conference Proceedings

XII. APPLICATION OF THE NEO-GESTALTS VIEW:

A. Motion-like properties:
These can be applied to find the rate of change of a particular functionality/process and its bugs.

B. Rhythmic properties:
The relative timing of a particular process across various test versions and builds can be explored by applying ‘rhythmic properties’ to regular testing perspectives.

Ex: ‘Cash Transaction’ in e-Banking applications.

XIII. APPLICATION OF THE THREE-PROCESS VIEW:

A. Selective-Encoding Insight (SEI):
SEI involves distinguishing what is important in a problem from what is irrelevant. It helps the tester to identify customer specifications clearly from the various communications with customers.

B. Selective-Comparison Insight (SCPI):
SCPI identifies information by finding a connection between acquired knowledge and experience. It reveals hidden flows and flaws in the regular ‘data flows’ of ‘path testing’.

C. Selective-Combination Insight (SCBI):
SCBI identifies a problem through understanding the different components and putting everything together. It addresses integration test engineers and various unit-level integration testing.

XIV. LIMITATIONS OF GESTALT PSYCHOLOGY:

Descriptive representation

Diagrammatic analysis of problems

A. Avoid a descriptive approach and derive straightforward methods:
Gestalt psychology is descriptive rather than exploratory. It is mandatory to establish a standard way of representing tests, and testers must understand the methods and techniques behind exploratory definitions of psychological applications in software testing in order to find bugs productively.

B. From diagrammatic representation to systematic definitions and formulas:
Gestaltism deals with diagrams, pictorial representations and various observations. It is essential to apply these techniques in a specific testing process to find ‘time-specific objectives’ to test. This delivers a good understanding of the AUT for various testing activities.

XV. CONCLUSION:
Complete testing of an AUT is a big challenge for SQA, and innovative new techniques can address this problem and meet the expectations of customers. The application of Gestalt psychology in software testing is a new path in which software testers benefit from innovative techniques to find abnormal defects in the AUT. It is a challenging research area, and it can address the key challenges of software testing with innovative techniques and methods.

XVI. ACKNOWLEDGMENT

I am heartily thankful to my faculties, whose encouragement, guidance and support from the initial to the final level enabled me to develop an understanding of the subject.

Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of this research initiative.

XVII. REFERENCES:
[1] “Lessons Learned in Software Testing” by Cem Kaner, James Bach, Bret Pettichord.
[2] “Software Testing Techniques” by Boris Beizer.
[3] “Game Testing All in One” by Charles P. Schultz, Robert Bryant, Tim Langdel.
[4] http://en.wikipedia.org/wiki/Gestalt_psychology
[5] http://www.optimum-web.co.uk/whatcontent.htm
[6] http://www.contentmanager.net/magazine/article_244_testing_in_content_management_projects.html
[7] http://machineslikeus.com/the-constructive-aspect-of-visual-perception
[8] http://en.wikipedia.org/wiki/Rubin_vase
[9] http://drezdel123.wordpress.com/2009/03/
[10] http://www.allgraphicdesign.com/graphicsblog/2008/03/04/the-rules-of-the-gestalt-theory-and-how-to-apply-it-to-your-graphic-design-layouts/


Page 201: ADCOM 2009 Conference Proceedings

ADCOM 2009
MOBILE AD-HOC NETWORKS

Session Papers:

1. Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath, “Intelligent Agent based QoS Enabled Node Disjoint Multipath Routing”

2. Adil Erzin, Soumyendu Raha and V. N. Muralidhara, “Close to Regular Covering by Mobile Sensors with Adjustable Ranges”

3. Dipankaj Medhi, “Virtual Backbone Based Reliable Multicasting for MANET”


Page 202: ADCOM 2009 Conference Proceedings


Intelligent Agent based QoS Enabled Disjoint Multipath Routing in MANETs

*Vijayashree Budyal, #S. S. Manvi, **S. G. Hiremath, *Kala K. M.

*ECE Department, Basaveshwar Engg. College, Bagalkot, India
#ECE Department, Reva Institute of Tech. and Mgmt., Bangalore, India

**ECE Department, G M Institute of Technology, Davangere, India

Abstract: A mobile ad hoc network (MANET) is an infrastructureless, multihop, wireless, and frequently changing network. To support multimedia applications such as video and voice, MANETs require an efficient routing protocol and Quality of Service (QoS) mechanism. QoS support in MANETs is an important issue, as best-effort routing is not efficient for supporting multimedia applications: whenever there is a link break on the route, best-effort protocols need to initiate a new route discovery process, which results in a high routing load. On-demand Node-Disjoint Multipath Routing (NDMR) alleviates these problems and reduces routing overhead. This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications; it considers bandwidth and delay as QoS metrics for optimal path computation in Mobile Ad hoc Networks (MANETs). A mobile agent is employed to find QoS paths and to select an optimal path among them. The performance of the scheme is evaluated for packet delivery ratio, end-to-end delay, QoS acceptance ratio and route discovery time for various network scenarios.

1. Introduction

Mobile ad hoc networks are infrastructureless networks that can be rapidly deployed. They are characterized by multihop wireless connectivity and a frequently changing network topology [1]. The design of efficient and reliable routing protocols in such a network is a challenging issue. Ad hoc On-demand Distance Vector (AODV) and Dynamic Source Routing (DSR) are the two most widely studied on-demand ad hoc routing protocols. The limitation of both is that they build and rely on a single path for each data transmission; whenever there is a link break on the route, the protocols need to initiate a new route discovery process, resulting in high routing overheads [2].

Multipath routing is one of the solutions that aim to establish multiple paths between source and destination, and many benefits of multipath routing have been explored [4]. On-demand Node-Disjoint Multipath Routing (NDMR) has two novel aspects compared to other on-demand multipath protocols: it reduces routing overhead dramatically, and it achieves multiple node-disjoint routing paths (no node except the source node and destination node is common to the multiple paths) [5].

An important issue in multimedia communications is routing of application data based on QoS requirements. QoS routing is a method of finding QoS routes between a source and destination; if a proper QoS route is identified, the applications will receive the guaranteed services [6]. A biologically inspired QoS routing algorithm based on a swarm-intelligence-inspired routing technique is described in [7].

Agent technology is emerging as a new paradigm in the areas of artificial intelligence and computing. Agents are said to be the next-generation components in software development because of their inherent structure and behaviour, which can be used to facilitate Internet services.

Agents are autonomous programs situated within a programming environment. Agents achieve their goals by collecting the relevant information from the host without affecting the local processing. They have certain special properties, mandatory and orthogonal (supplementary), which distinguish them from standard programs. The mandatory properties are autonomy, reactivity, proactivity and temporal continuity. The orthogonal properties are communication, mobility, learning and believability. An agent must possess the mandatory properties, which are compulsory; the orthogonal properties enhance the capabilities of agents and provide a strong notion of agency. An agent may or may not possess the orthogonal properties.


Page 203: ADCOM 2009 Conference Proceedings

Agents can be classified as local/user-interface agents, networked agents, distributed artificial intelligence (AI) agents and mobile agents. Networked agents and user-interface agents are single-agent systems, whereas the other two types are multi-agent systems [9].

From the literature, we notice that dynamic QoS path computation based on rapidly changing network conditions, capable of providing adaptability, flexibility, software reuse and customizability, has not been addressed. This paper proposes an intelligent agent based QoS enabled NDMR (IANDMR), built on NDMR for supporting multimedia applications, which considers bandwidth and delay as QoS metrics for optimal path computation. A mobile agent is employed to find QoS paths and to select an optimal path among them.

2. QoS Enabled NDMR

This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR for supporting multimedia applications, which identifies a set of multiple paths that meet the QoS requirements of a particular application and selects the path that leads to the highest overall resource efficiency. The scheme uses a mobile agent to perform this operation. Every node in the network comprises an agent platform to support mobile agents. In this section, we describe NDMR in brief and explain the functioning of QoS enabled NDMR.

2.1 Node-disjoint multipath routing protocol

The node-disjoint multipath routing protocol (NDMR) was developed by Xuefei Li [5] by modifying and extending AODV to enable the path-accumulation feature of DSR in route request packets. It can efficiently discover multiple paths between source and destination nodes with low broadcast redundancy and minimal routing latency.

In the route discovery process, the source creates a route request packet (RREQ) containing the message type, source address, current sequence number of the source, destination address, broadcast ID and route path. The source node then broadcasts the packet to its neighbouring nodes. The broadcast ID is incremented every time the source node initiates a RREQ, forming, together with the source node address, a unique identifier for the RREQ. Finding node-disjoint multiple paths with low overhead is not straightforward when the network topology changes dynamically. NDMR route computation has three key features that help it achieve low broadcast redundancy and avoid introducing a broadcast flood in MANETs: path accumulation, decreasing multipath broadcast routing packets (using shortest routing hops), and selecting node-disjoint paths.
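For concreteness, the RREQ contents listed above might be laid out as in the following C++ sketch. The field names and types here are our assumptions for illustration, not part of the NDMR specification:

#include <cstdint>
#include <vector>

// Illustrative RREQ layout for NDMR-style path accumulation. The paper
// specifies only the contents: message type, source address, source
// sequence number, destination address, broadcast ID, and route path.
struct RouteRequest {
    uint8_t  messageType;            // RREQ marker
    uint32_t sourceAddr;
    uint32_t sourceSeqNum;           // current sequence number of the source
    uint32_t destAddr;
    uint32_t broadcastId;            // incremented per RREQ; together with
                                     // sourceAddr it uniquely identifies it
    std::vector<uint32_t> routePath; // grows by one node ID per forwarding hop
};

// Each intermediate node appends its own address before rebroadcasting.
inline void appendSelf(RouteRequest& rreq, uint32_t myAddr) {
    rreq.routePath.push_back(myAddr);
}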

In NDMR, AODV is modified to include path accumulation in RREQ packets. When the packets are broadcast in the network, each intermediate node appends its own address to the RREQ packet. When a RREQ packet finally arrives at its destination, the destination is responsible for judging whether or not the route path is node-disjoint. If it is, the destination creates a route reply packet (RREP) containing the node list of the whole route path and unicasts it back, along the reverse route path, to the source that generated the RREQ packet.

When an intermediate node receives a RREP packet, it updates its routing table and reverse routing table using the node list of the whole route path contained in the RREP packet. When a duplicate RREQ is received, the possibility of finding node-disjoint multiple paths is zero if it is simply dropped, for it may have arrived from another path. But if all duplicate RREQ packets are broadcast, this generates a broadcast storm and dramatically decreases performance. To avoid this problem, a novel approach is introduced in NDMR: recording the shortest routing hops to keep paths loop-free and decrease the routing broadcast overhead. When a node receives a RREQ packet for the first time, it checks the node list of the route path, calculates the number of hops from the source node to itself, and records this as the shortest number of hops in its reverse routing table. If the node receives a duplicate RREQ packet, it computes the number of hops and compares it with the shortest number of hops in its reverse routing table. If the number of hops is larger than the shortest number of hops in the reverse routing table, the RREQ packet is dropped. Only when it is less than or equal to the shortest number of hops does the node append its own address to the node list of the route path in the RREQ packet and broadcast it to its neighboring nodes again.
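A minimal sketch of this duplicate-suppression rule, assuming a reverse routing table keyed by (source address, broadcast ID); the table layout and names are illustrative:

#include <cstdint>
#include <map>
#include <utility>

// Keep the shortest hop count seen per (source, broadcast ID) and forward
// a duplicate RREQ only if it arrives over a path no longer than that.
struct ReverseEntry { uint32_t shortestHops; };

class ReverseTable {
    std::map<std::pair<uint32_t, uint32_t>, ReverseEntry> entries_;
public:
    // Returns true if the node should append itself and rebroadcast.
    bool shouldForward(uint32_t src, uint32_t bcastId, uint32_t hopsFromSource) {
        auto key = std::make_pair(src, bcastId);
        auto it = entries_.find(key);
        if (it == entries_.end()) {                    // first copy seen
            entries_[key] = {hopsFromSource};
            return true;
        }
        if (hopsFromSource > it->second.shortestHops)  // longer path: drop
            return false;
        it->second.shortestHops = hopsFromSource;      // less or equal: forward
        return true;
    }
};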

The destination node is responsible for selecting and recording multiple node-disjoint paths. When receiving the first RREQ packet, the destination records the list of node IDs of the entire route path in its reverse route table and sends a RREP packet along the reverse route path. When the destination receives a duplicate RREQ, it compares the node IDs of the entire route path in the RREQ to all of the existing node-disjoint paths in its reverse routing table. If there is no common node (except the source and destination nodes) between the node IDs from the RREQ and the node IDs of any node-disjoint path in the destination's reverse table, the route path in the current RREQ is node-disjoint and is recorded in the destination's reverse routing table. Otherwise, the current RREQ is discarded.
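The destination-side disjointness test can be sketched as follows; representing a path as a node-ID list with the source first and the destination last is an assumption for illustration:

#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// A candidate route is accepted only if it shares no node (other than
// the source and destination endpoints) with any already-recorded path.
// Assumes each path has at least the two endpoint entries.
bool isNodeDisjoint(const std::vector<uint32_t>& candidate,
                    const std::vector<std::vector<uint32_t>>& recordedPaths) {
    // Intermediate nodes of the candidate (endpoints excluded).
    std::set<uint32_t> mid(candidate.begin() + 1, candidate.end() - 1);
    for (const auto& path : recordedPaths)
        for (std::size_t i = 1; i + 1 < path.size(); ++i)
            if (mid.count(path[i]))
                return false;   // common intermediate node found
    return true;                // disjoint: record it and send a RREP
}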

2.2 Functioning of QoS enabled NDMR

We now describe the QoS routing metrics and the agency at each node considered in the proposed work.

2.2.1 QoS routing metrics

To describe the QoS metrics used in the proposed scheme, consider the network as an undirected graph G(V, E), where V is a set of nodes and E is a set of edges. A path P from source s to destination d, P(s, d), is a sequence of edges belonging to E. The proposed scheme uses the residual bandwidth (bw_e) and delay (d_e) of a link for QoS routing of an application. The QoS of an application is specified as Q = {bw_min, D}, where bw_min is the minimum bandwidth required by the application and D is the bounded end-to-end delay for delivery of information. The application can be viewed by a user with acceptable QoS by guaranteeing the bw_min and D metrics. The concave and additive properties are defined below for a path P = l1, l2, . . ., ln, where ln is the nth link and m(P) is the metric value on the path P.

• Additive: A metric m is said to be additive for a given path P if m(P) = m(l1) + m(l2) + · · · + m(ln).

• Concave: A metric m is said to be concave for a given path P if m(P) = min{m(l1), m(l2), · · · , m(ln)}.

P(s, d) should satisfy the following bandwidth and delay criteria (1) and (2) for an application to begin and progress:

$$bw_{p(s,d)} = \min_{(i,j) \in P(s,d)} bw_e(i,j) \ge bw_{min} \quad \cdots (1)$$

$$delay_{p(s,d)} = \sum_{(i,j) \in P(s,d)} d_e(i,j) \le D \quad \cdots (2)$$

The notations $bw_{p(s,d)}$ and $delay_{p(s,d)}$ are the bandwidth and delay values associated with the optimal path P(s, d), respectively, whereas $bw_e(i,j)$ and $d_e(i,j)$ denote the residual bandwidth and delay on the link connecting nodes i and j on the path P(s, d), respectively.
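A small sketch of how criteria (1) and (2) combine on a candidate path: residual bandwidth is evaluated concavely (minimum over links) and delay additively (sum over links). The Link record is hypothetical:

#include <algorithm>
#include <limits>
#include <vector>

// Per-hop record; field names are illustrative assumptions.
struct Link { double residualBandwidth; double delay; };

// True if the path satisfies Q = {bwMin, D}; assumes a non-empty path.
bool satisfiesQoS(const std::vector<Link>& path, double bwMin, double D) {
    double bw = std::numeric_limits<double>::infinity();
    double delay = 0.0;
    for (const Link& l : path) {
        bw = std::min(bw, l.residualBandwidth); // concave metric, criterion (1)
        delay += l.delay;                       // additive metric, criterion (2)
    }
    return bw >= bwMin && delay <= D;
}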

2.2.2 Agency at each Node

We assume that every node in the network maintains an agency, as shown in Fig 1. An agency consists of a BlackBoard, a Communication Manager Agent, a Delay and Bandwidth Estimator (D&B) agent, and a QoS negotiator/re-negotiator agent.

• BlackBoard: a shared knowledge-base structure, which is read and updated by the agents as and when required. It consists of information such as the residual bandwidth and delays of the links connected to the node, as shown in Fig 2.

• Communication Manager Agent (CMA): a static agent running at each node to serve applications requesting on-demand QoS routes and to support route-finding operations when the node acts as an intermediate node. This agent is responsible for creating the other agents and updating the data on the BlackBoard. All operations, such as communication, updating the BlackBoard and reading the BlackBoard, take place with the permission of the communication manager agent.

Fig 1. Agency at each Node

Node    Residual-bandwidth (Mbps)    Delays (secs)
4       :                            :
3       :                            :
4       :                            :

Fig 2. Entry of BlackBoard at node 4


Page 205: ADCOM 2009 Conference Proceedings

Algorithm 1: Functions of the Communication-Manager-Agent

To maintain the QoS requirements of an application by performing dynamic negotiation/re-negotiation.

Begin
1. Receive a connection request for a source and destination with QoS requirements (maximum bandwidth and delay required).
2. Trigger the QoS-negotiator/re-negotiator agent to negotiate the resources to find a QoS route (Algorithm 2a).
3. Trigger the D&B-Estimator agent (Algorithm 3a).
4. Trigger the D&B-Estimator agent periodically to observe the QoS of the application in the network (Algorithm 3b).
5. If there is a QoS violation, trigger the QoS-negotiator/re-negotiator agent to re-negotiate the resources with the nodes on the established path (Algorithm 2b).
6. Repeat steps 4-5 until the session is completed.
7. Dispose of the created agents.
8. Stop.
End.

• QoS Negotiator/Re-negotiator agent: a mobile agent used to find a QoS route (a route satisfying the bandwidth and delay requirements) from the source to the destination at the beginning of a session, as well as whenever required. It negotiates/re-negotiates the resources on the path. It follows the principles of the NDMR routing protocol while establishing multiple paths between the source and the destination, collecting the neighbour connectivity and resource information (bandwidth availability, delays). Finally, it chooses a maximum-bandwidth, minimum-delay path among them for resource reservation.

Algorithm 2a: Negotiation phase

To find a QoS route considering the bandwidth and delay parameters.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements from the Communication-Manager-Agent.
2. The agent migrates from the source to its neighbours until it reaches the destination. While traversing, it collects the resource availability from each of the intermediate nodes.
3. The agent routes from the destination back to the source as per the pre-existing routing of the NDMR routing protocol.
4. When the agent reaches the source, it finds a set of multiple QoS paths that satisfy the required resources:
   • Prune all edges/links in the collected connectivity/resource information that do not provide the desired bandwidth and delay.
   • Find K node-disjoint paths (no node is common to more than one path) by following the principles of the NDMR routing protocol.
5. If QoS path(s) are available, select the best QoS path (the path with the widest bandwidth and lowest delay) and reserve the resources on the path. Else, inform the communication manager agent that no QoS path is available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.
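Steps 4 and 5 of Algorithm 2a amount to a simple selection over the node-disjoint candidates that survive pruning; a sketch, with QosPath as an assumed per-path summary:

#include <vector>

// Assumed per-path summary produced after pruning and disjoint-path
// extraction; names are illustrative.
struct QosPath {
    std::vector<unsigned> nodes;
    double bottleneckBw;   // min residual bandwidth over the path
    double totalDelay;     // sum of link delays over the path
};

// Pick the widest-bandwidth path, breaking ties by lowest total delay.
const QosPath* selectOptimalPath(const std::vector<QosPath>& candidates) {
    const QosPath* best = nullptr;
    for (const auto& p : candidates) {
        if (!best ||
            p.bottleneckBw > best->bottleneckBw ||
            (p.bottleneckBw == best->bottleneckBw &&
             p.totalDelay < best->totalDelay))
            best = &p;
    }
    return best;   // nullptr means "QoS path not available"
}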

Algorithm 2b: Re-negotiation phase

To re-negotiate resources whenever a QoS violation or congestion/failure is detected during a session.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements to be re-negotiated from the Communication-Manager-Agent.
2. It migrates along the specified path, visiting every node on the path.
3. After reaching the destination, it checks whether the re-negotiation succeeded at all the visited nodes.
4. If the re-negotiation is successful, inform the Communication-Manager-Agent of the newly negotiated QoS values on the existing path and go to step 6. Else, find the multiple QoS paths that satisfy the required resources (as in step 4 of Algorithm 2a).
5. If QoS path(s) are available, select the best QoS path, reserve the resources on the path and inform the communication manager agent. Else, inform the communication manager agent that no QoS path is available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.

• Delay and Bandwidth Estimator Agent: an agent that monitors the node. Such agents are created for each node by the Communication-Manager-Agent. The Delay and Bandwidth Estimator agent computes the bandwidth and delay for its node and updates the BlackBoard at regular intervals. Bandwidth is calculated by monitoring the flow of information on the links of the node and the available residual bandwidth. Delay is computed by averaging the queuing delays of all the packets on a link within a given time interval.

Algorithm 3a: The Delay and Bandwidth Estimator agent computes bandwidth and delay

Begin
1. Compute the bandwidth and delay as follows:

$$bw_{p(s,d)} = \min_{(i,j) \in P(s,d)} bw_e(i,j) \ge bw_{min}$$

$$delay_{p(s,d)} = \sum_{(i,j) \in P(s,d)} d_e(i,j) \le D$$

where $bw_{p(s,d)}$ and $delay_{p(s,d)}$ are the bandwidth and delay values associated with the path P(s, d), respectively, and $bw_e(i,j)$ and $d_e(i,j)$ denote the residual bandwidth and delay on the link connecting nodes i and j on the path P(s, d), respectively.
2. Inform the Communication-Manager-Agent of the calculated values.
3. Dispose of the agent.
4. Stop.
End.

Algorithm 3b: The Delay and Bandwidth Estimator agent periodically observes bandwidth and delay variations

Begin
1. Observe the bandwidth and delay at each node of the specified path by traversing from destination to source.
2. In the event of any QoS violation, the agent informs the Communication-Manager-Agent.
3. Dispose of the agent.
4. Stop.
End.

3. Simulation

The proposed scheme is simulated in various network scenarios using the C programming language to verify its performance and operational effectiveness. In this section we describe the simulation model and the simulation procedure.

3.1 Simulation model

The proposed model has been simulated in various network scenarios on a Pentium-4 machine using the C++ programming language to evaluate the performance and effectiveness of the approach. The simulated network topology area is A × B sq. m, with a total of N nodes, a link capacity of C Mbps, and a given transmission range. To simulate node mobility, nodes move in any of 8 directions over a distance of d m, with speed varying between 0 and 12 mph (meters per hour). The QoS requirements of an application are specified as Q = {bandwidth, delay}, where all the metrics of Q are randomly distributed. The principles of NDMR routing are used to generate node-disjoint multipaths and create routing tables at each node before applying the QoS enabled NDMR scheme.

3.2 Simulation procedure

To illustrate the results of the proposed scheme, the simulation inputs are: A = 300 m, B = 300 m; the number of nodes varies between 10 and 25; transmission range = 100 m; node speed varies between 0 and 12 mph (meters per hour); propagation delay varies between 1 and 5 secs; data rate varies between 2 and 5 Mbps; packet size = 1 KB; the number of services is assumed constant.

The simulation procedure is as follows:

Begin
1. Create a network topology with a random number of nodes.
2. Randomly select the source node and destination node.
3. Deploy the proposed scheme.
4. Compute the performance parameters.
End.

The following performance metrics are used to evaluate the scheme:

• Packet delivery ratio: the ratio of the number of data packets delivered to the destination node to the number of data packets transmitted by the source node.
• Route discovery time: the time required to find the optimal QoS path.
• Average end-to-end delay: the average time a data packet takes to travel from source to destination; it includes all possible delays caused by queuing and retransmission, plus acknowledgement.
• Bandwidth utilization ratio: the ratio of the sum of the utilized bandwidth of all the links to the total bandwidth of the network.
• QoS acceptance ratio: the ratio of the number of QoS-fulfilled paths to the number of existing multipaths.

4. Results

Figure 3 shows that the route discovery time of the proposed scheme is higher than that of the NDMR scheme, because the mobile agents need time to traverse the network to find an optimal path.

Fig 3. Route discovery time vs. no. of nodes

Fig 4. End-to-End Delay Vs. no. of nodes

As shown in Figure 4, the average end-to-end delay of the proposed scheme is less than that of NDMR because the proposed scheme selects the optimal path.

Fig 5. QoS Acceptance ratio Vs Delay required

As shown in Figure 5, the acceptance ratio of IANDMR is higher than that of NDMR at higher delay requirements, because a larger number of optimal paths is then available.

Fig 6. Packet Delivery ratio Vs. no. of nodes

We experimented by injecting a certain percentage of link failures; the mobile agent then tries to find another QoS route for the same application. Figure 6 shows the resulting increase in packet delivery ratio of the IANDMR scheme over NDMR.

5. Conclusions

An intelligent-agent-based QoS enabled NDMR scheme using the bandwidth and delay metrics for feasible path computation has been proposed. A comparison of NDMR and the proposed QoS enabled scheme is presented in terms of average end-to-end delay, packet delivery ratio, network bandwidth utilization ratio and route discovery time for sparse and dense network scenarios. The results demonstrate that the average end-to-end delay, packet delivery ratio and network bandwidth utilization of the proposed scheme are better than those of the NDMR scheme. The performance of the scheme depends on the richness of the network connectivity information gathered by the mobile agent; that is, the scheme performs better in dense networks. The agent's visibility of its visited nodes also plays an important role in improving the QoS acceptance ratio. Important benefits of the agent-based scheme compared with traditional methods of software development are flexibility, adaptability, software reusability and maintainability.

Acknowledgments

We are very thankful to the reviewers for useful comments that helped us improve the quality of the paper.


Page 208: ADCOM 2009 Conference Proceedings

References

[1] Jun-Zhao Sun, "Mobile Ad Hoc Networking: An Essential Technology for Pervasive Computing", proc. IEEE International Conference on Info-tech and Info-net, Beijing, vol. 3, pp. 316-321, 2001.

[2] Liza Abdul Latiff, Norsheila Fisal, "Routing Protocols in Wireless Mobile Ad Hoc Network - A Review", proc. IEEE 9th Asia-Pacific Conference on Communications, vol. 2, pp. 600-604, 2003.

[3] Ahmed Al-Maashri, Mohamed Ould-Khaoua, "Performance Analysis of MANET Routing Protocols in the Presence of Self-Similar Traffic", proc. IEEE 31st Conference on Local Computer Networks, pp. 801-807, Nov 2006.

[4] Hongxia Sun, Herman D. Hughes, "Adaptive Multi-path Routing Scheme for QoS Support in Mobile Ad-hoc Networks", www.scs.org/getDoc.cfm?id=2454.

[5] Xuefei Li and Laurie Cuthbert, “On-demand Node- Disjoint Multipath Routing in Wireless Ad hoc Networks”, proc. IEEE 29th Annual International Conference on Local Computer Networks, U.S.A, pp. 419-420, Nov 2004.

[6] Chenxi Zhu and M. Scott Corson, "QoS routing for mobile ad hoc networks", proc. 21st Annual Joint Conference of the Computer and Communications Societies, vol. 2, pp. 958-967, Jun 2002.

[7] Zhenyu Liu, Marta Z. Kwiatkowska, and Costas Constantinou, "A Biologically Inspired QoS Routing Algorithm for Mobile Ad Hoc Networks", International Journal of Wireless and Mobile Computing, 2006.

[8] Jennings, N.R, “An agent-based approach for building complex software systems”, Communications of ACM, vol. 44, pp. 35-41, 2001.

[9] Manvi S.S, and Venkataram P, "Applications of agent technology in communications: a review", Computer Communications Journal, pp. 1493-1508, 2004.

[10] Chess, D., Harrison, C., and Kershenbaum, A., "Mobile agents: are they a good idea?", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 1222, pp. 25-45, 2006.

[11] Lange, D.B., and Oshima, M, “Seven good reasons for mobile agents”, Communications of ACM, vol. 42, pp. 88-89, 1999.


Page 209: ADCOM 2009 Conference Proceedings

Close to Regular Covering by Mobile Sensors with Adjustable Ranges

A.I. Erzin    V.N. Muralidhara    S. Raha

Abstract— A mobile wireless sensor network consisting of a set of mobile sensors with adjustable sensing and communication ranges is analyzed for coverage. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement, and in sleep mode preserves its energy. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of the sensors in comparison with static sensors in the class of regular covers [2], [3], [4].

Index Terms— Adjustable ranges, coverage, energy efficiency, wireless sensor network.

I. INTRODUCTION

A wireless sensor network (WSN) is composed of a large number of sensor nodes deployed densely close to an area of interest and connected by a wireless interface. Wireless sensor networks constitute the platform for a broad range of applications such as national security, surveillance, military, health care, and environmental monitoring. A sensor node in a WSN is typically equipped with a radio transceiver, a microcontroller and a power supply, and every node in the network has very limited processing, storage and energy resources. In most real-world applications it is almost impossible to replenish the power resources; hence energy optimization is the most important issue in WSNs.

Suppose the WSN is represented by the set $J$, $|J| = m$, of mobile sensors with adjustable sensing and communication ranges, which are distributed randomly over the plane region $O$ of area $S$. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement. In sleep mode a sensor preserves its energy. Let the monitoring and communication areas of every sensor be disks of certain radii centered at the sensor [2], [3], [5], [6].

We say that the region $O$ is covered if every point in $O$ belongs to at least one monitoring disk. The lifetime of a WSN is the number of time rounds during which the region $O$ is covered by connected active sensors [7]. Observe that by maximizing the lifetime of a WSN, we are actually maximizing the time period over which the region is covered by the sensor nodes with their limited energy resources. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of the sensors in comparison with static sensors in the class of regular covers [2], [3], [4]. In the model that we consider, the sensor nodes can adjust their sensing and communication ranges by consuming some energy. In this paper, we show that the mobility of the sensor nodes can be exploited to improve the lifetime of the WSN.

This research was supported jointly by the Russian Foundation for Basic Research (grant 08-07-91300-IND-a) and by the Department of Science and Technology of the Government of India (grant INT/RFBR/P-04).

A.I. Erzin is with the Sobolev Institute of Mathematics, Russian Academy of Sciences, Novosibirsk, Russia, and the Novosibirsk State University, Novosibirsk, Russia.

V.N. Muralidhara was with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India. At present he is a faculty member at the International Institute of Information Technology (IIIT) Bangalore.

S. Raha is with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India.

II. FIXED GRID

Let the region $O$ be tiled by regular triangles (tiles) with side $R\sqrt{3}$. These triangles form a regular grid with the set of grid nodes $I$. Suppose each sensor has energy storage $q > 0$. For any sensor, the sensing energy consumption per time round depends on the sensing range $r$ (the radius of the disk) and equals $SE = \mu_1 r^a$, $\mu_1 > 0$, $a \ge 2$; the communication energy consumption per time round depends on the distance $d$ and equals $CE = \mu_2 d^b$, $\mu_2 > 0$, $b \ge 2$; and the energy consumption per time round during motion depends on the speed $v$ and equals $ME = \mu_3 v^c$, $\mu_3 > 0$, $c > 0$. We suppose that during motion a sensor does not consume energy for sensing and communication.

Fig. 1. Covering model A1

If all sensors have the same sensing range $R$ and are placed in the grid nodes, then the covering model, which we call A1 [8], [4] (Fig. 1), is optimal with respect to the sensing energy consumption (or covering density). In the model A1 each triad of neighboring disks of radius $R$ with centers in the nodes of a triangle has one common point in the center of the tile. In cover A1 each sensor, located in node $i$, must cover the disk of radius $R$ centered at node $i$ (we call it disk $i$). The density of the cover is $D_{A1} = 2\pi/\sqrt{27} \approx 1.2091$ [4], [8], and the sensing energy consumption of every sensor is $SE_{A1} = \mu_1 R^a$. The communication distance for each sensor in A1 is $R\sqrt{3}$, hence the communication energy consumption is $CE_{A1} = \mu_2 (R\sqrt{3})^b$. Therefore, the lifetime of one sensor is

$$t_{A1} = \frac{q}{\mu_1 R^a + \mu_2 (R\sqrt{3})^b}.$$

Since the minimal number of grid nodes is $N \approx 2S/(R^2\sqrt{27})$ [4], the lifetime of cover A1 is

$$L_{A1} \approx \frac{t_{A1}\, m}{N} \approx \frac{q m \sqrt{27}}{2S\left(\mu_1 R^{a-2} + \mu_2 R^{b-2}(\sqrt{3})^b\right)}.$$

Let the sensors be distributed uniformly over the region $O$, and let the parameter $a_{ij} = 1$ if $i$ is the closest grid node to sensor $j$ (i.e., the distance between $j$ and $i$ is $d_{ij} = \min_{k \in I} d_{kj}$), and $a_{ij} = 0$ otherwise. Denote the set $J_i = \{j \in J \mid a_{ij} = 1\}$. Then the sensors inside the regular hexagon $i$ with center in node $i$ and sides at distance $\delta = R\sqrt{3}/2$ from the center are in the set $J_i$ (Fig. 2). We reasonably suppose that a sensor $j$ in $J_i$ (or in hexagon $i$) must cover disk $i$. Then if $j$ is located at distance $r$ away from grid node $i$, it must increase its sensing range by $r$ in order to cover disk $i$. Moreover, if the distance between node $i$ and sensor $j_1 \in J_i$ is $r_1$, and the distance between node $k$ and sensor $j_2 \in J_k$ is $r_2$, then in order to guarantee communication between the neighboring sensors $j_1$ and $j_2$, it is necessary to increase the communication ranges of $j_1$ and $j_2$ by at least $r_1 + r_2$ units in total. But additionally, every sensor $j \in J_i$ can move towards node $i$ during some time rounds in order to be nearer to $i$. For the sake of simplicity, we suppose that the speed of every sensor takes one of two values, $0$ or $v$. Therefore, if sensor $j \in J_i$ is moving, then its speed $v$ and direction (towards grid node $i$) are known.

Let us consider the concentric circles of radii $\delta_k = kv$, $k = 1, 2, \ldots, K = \delta/v$. Denote the set $J_i^k = \{j \in J_i \mid \delta_{k-1} < d_{ij} \le \delta_k\}$. Then any sensor $j \in J_i^k$ can reach node $i$ in at most $k$ time rounds.

Since the energy resource of each sensor is limited by $q$, if a sensor $j \in J_i^k$ moves for $l$ time rounds and, as a result, consumes $l\mu_3 v^c$ units of its energy, then, taking into account the remaining sensor-to-node distance $(k-l)v$, it can be active during

$$t_k(l) \approx \frac{q - l\mu_3 v^c}{\mu_1\left(R + (k-l)v\right)^a + \mu_2\left(R\sqrt{3} + 2(k-l)v\right)^b}$$

time rounds. The function $t_k(l)$ is concave, so one can find $T_k = t_k(l_k) = \max_{0 \le l \le k} t_k(l)$ in $O(\log K)$ time. For example, when $q = 365$, $\mu_1 = 0.5$, $\mu_2 = 0.25$, $\mu_3 = 1.0$, $v = 0.15$, $R = a = b = c = 2$ and $k = 6$, one gets $l_k = k = 6$, and the lifetime of a sensor $j \in J_i^k$ equals $t_k(l_k) = 65.61$.

Fig. 2. Sensors Inside the Regular Hexagon
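Under the formula as reconstructed above, the maximizing number of moving rounds $l_k$ can also be found by a direct scan over $l$ (the concavity allows the $O(\log K)$ search noted in the text). The C++ sketch below uses, for illustration, the parameter set of the mobile-grid example that appears later in the text ($q = 365$, $\mu_1 = 0.5$, $\mu_2 = 0.25$, $\mu_3 = 1$, $v = 1$, $a = b = c = 2$, $R = 6$) for ring $k = 5$, where the scan confirms $l_k = k$:

#include <cmath>
#include <cstdio>

// t_k(l) as reconstructed above: remaining energy over the sensing plus
// communication energy rates at the remaining distance (k - l) * v.
double t_k(int k, int l, double q, double mu1, double mu2, double mu3,
           double v, double R, double a, double b, double c) {
    double rest = (k - l) * v;                 // remaining distance to node i
    double energyLeft = q - l * mu3 * std::pow(v, c);
    return energyLeft / (mu1 * std::pow(R + rest, a) +
                         mu2 * std::pow(R * std::sqrt(3.0) + 2.0 * rest, b));
}

int main() {
    const int k = 5;
    int bestL = 0;
    double best = t_k(k, 0, 365, 0.5, 0.25, 1.0, 1.0, 6.0, 2, 2, 2);
    for (int l = 1; l <= k; ++l) {             // plain scan over 0..k
        double t = t_k(k, l, 365, 0.5, 0.25, 1.0, 1.0, 6.0, 2, 2, 2);
        if (t > best) { best = t; bestL = l; }
    }
    // Prints l_k = 5, matching the text's observation that l_k = k here.
    std::printf("l_k = %d, t_k(l_k) = %.2f\n", bestL, best);
    return 0;
}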

Since the sensors are distributed uniformly, there are $N_k \approx m\pi(2k-1)v^2/S$ sensors in every set $J_i^k$. Let the first active sensors be initially located in $J_i^1$, and suppose that they do not move; they are active during

$$L_1 = \frac{qN_1}{\mu_1 (R+v)^a + \mu_2 (R\sqrt{3} + 2v)^b}$$

time periods. During the time $L_1$ the sensors in $J_i^2$ can move towards the grid node $i$, and then they can be active during $L_2 = N_2 \max_{0 \le l \le \min\{2, L_1\}} t_2(l)$ time periods. Therefore, during the time $\Lambda_{k-1} = \sum_{l=1}^{k-1} L_l$ the sensors in $J_i^k$ can move towards the grid node $i$, and then they can be active during $L_k = N_k \max_{0 \le l \le \min\{k, \Lambda_{k-1}\}} t_k(l)$ time periods. As a result, we get the lifetime of the sensors as

$$\Lambda_\delta = \Lambda_K = \sum_{k=1}^{K} L_k.$$

For example, when $q = 365$, $\mu_1 = 0.5$, $\mu_2 = 0.25$, $\mu_3 = 1.0$, $v = 1$, $a = b = c = 2$, $R = 6$, we have $K = 5$ and $l_k = k$ for each sensor $j \in J_i^k$, $k = 1, 2, \ldots, K$. Let us compare the lifetime $\Lambda_0$ of the WSN in this example when the sensors are static with the lifetime $\Lambda_\delta$ of the WSN when the sensors are mobile. We have

$$\Lambda_0 \approx \sum_{k=1}^{K} \frac{q N_k}{\mu_1 (R + kv)^a + \mu_2 (R\sqrt{3} + 2kv)^b} \approx \frac{2 \cdot 365\, m\pi}{S} \sum_{k=1}^{5} \frac{2k-1}{40 + 8(1+\sqrt{3})k + 3k^2}$$

$$\approx \frac{2 \cdot 365\, m\pi}{S}\left(\frac{1}{64.85} + \frac{3}{95.71} + \frac{5}{132.56} + \frac{7}{175.42} + \frac{9}{224.28}\right) \approx \frac{374\, m}{S}$$

and

$$\Lambda_\delta \approx \frac{qN_1}{\mu_1 (R+v)^a + \mu_2 (R\sqrt{3} + 2v)^b} + \sum_{k=2}^{K} \frac{(q - l_k\mu_3 v^c)N_k}{\mu_1 (R + (k-l_k)v)^a + \mu_2 (R\sqrt{3} + 2(k-l_k)v)^b}$$

$$\approx \frac{m\pi}{S}\left(\frac{365}{62.89} + \frac{1}{45}\sum_{k=2}^{5} (365-k)(2k-1)\right) \approx \frac{621\, m}{S}.$$

In this example the motion of the sensors gives a considerable gain in lifetime in comparison with the static case; the static case may be advantageous only when the energy consumption for movement, $ME = \mu_3 v^c$, is relatively large. In any case, the optimal value of $l_k$ can be zero, and if moving is disadvantageous, the sensors simply will not move. Therefore, the model with mobile sensors is always at least as good as the one with static sensors.
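As a numeric check, the C++ sketch below evaluates the closed form for $\Lambda_\delta$ above with the example parameters; it lands within the text's intermediate rounding of the quoted $\approx 621\, m/S$:

#include <cmath>
#include <cstdio>

// Evaluates Lambda_delta (in units of m/S) for the example q=365,
// mu1=0.5, mu2=0.25, mu3=1, v=1, a=b=c=2, R=6, K=5, l_k=k. The common
// factor m*pi*(2k-1)*v^2/S from N_k is absorbed into each term.
int main() {
    const double PI = 3.14159265358979;
    const double q = 365.0, mu1 = 0.5, mu2 = 0.25, mu3 = 1.0;
    const double v = 1.0, R = 6.0;
    const double s3 = std::sqrt(3.0);

    // k = 1: these sensors stay put and widen their sensing range by v.
    double sum = q / (mu1 * std::pow(R + v, 2) +
                      mu2 * std::pow(R * s3 + 2.0 * v, 2));    // 365/62.89
    // k = 2..5: the sensors reach the grid node (l_k = k), so the
    // denominator collapses to (mu1 + 3*mu2) * R^2 = 45.
    for (int k = 2; k <= 5; ++k)
        sum += (q - k * mu3 * std::pow(v, 2)) * (2 * k - 1) /
               ((mu1 + 3.0 * mu2) * R * R);

    std::printf("Lambda_delta ~ %.0f m/S\n", PI * sum);        // prints ~623
    return 0;
}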

III. FREE GRID

The previous results depend on the parameter $\delta$ and are obtained in the case of a fixed grid. Suppose that the number of sensors $N_k$ in $J_i^k$ is sufficiently large for each $k = 1, 2, \ldots, K = \delta/v$. If the grid is wandering, then we may relocate it several times without changing its size (the new grid node $i_n$ is relocated from the previous position $i_{n-1}$ by a distance $2\delta$ to the right or down, as in Fig. 3), and the WSN's lifetime can be increased as follows. Let us set $\delta = v$ and suppose that during the first time round, when a part of the sensors in $J_{i_1}^1$ are active, the other sensors in every $J_{i_n}^1$, $n \ge 1$, move to the grid node $i_n$. The number of sensors in each $J_{i_1}^1$ which are active during the first time round is $n'_1 \approx (\mu_1(R+v)^a + \mu_2(R\sqrt{3}+2v)^b)/q$, and we suppose that $n'_1 \le N_1$. These sensors do not move and must increase their sensing ranges by $v$. During the first time round $N_1 - n'_1$ sensors in each set $J_{i_1}^1$ will reach the grid node $i_1$, and it is not necessary to increase their sensing ranges to cover $O$. Moreover, each sensor in $J_{i_n}^1$, $n \ge 2$, will reach $i_n$ during the first time round. The number of sensors in every set $J_{i_n}^1$, $n \ge 2$, is $N_1$, and the number of these sets (new grid nodes) is $n'_2 \ge \lfloor \frac{R\sqrt{3}}{2v} \rfloor^2 - 1$ (the value $n'_2 + 1$ is the number of disks of radius $\delta$ packed in a rhombus with side $R\sqrt{3}$), where $\lfloor A \rfloor$ is the integer part of $A$. Every sensor located outside the sets $J_{i_n}^1$, $n \ge 1$, has two time rounds to reach the nearest grid node (Fig. 3). The number of such sensors in the rhombus is

$$n'_3 \approx \frac{R^2\sqrt{27}\, m}{2S} - (n'_2 + 1)N_1 \ge \frac{mR^2}{4S}\left(2\sqrt{27} - 3\pi\right) \approx \frac{0.24\, mR^2}{S}.$$

Fig. 3. Relocation of Grids with Movement of a Sensor

Since $N_1 \approx m\pi v^2/S$, the WSN's lifetime in this case is

$$\Lambda_v \approx 1 + (N_1 - n'_1 + N_1 n'_2)\,\frac{q - \mu_3 v^c}{\mu_1 R^a + \mu_2 (R\sqrt{3})^b} + n'_3\,\frac{q - 2\mu_3 v^c}{\mu_1 R^a + \mu_2 (R\sqrt{3})^b}$$

$$\approx 1 + \left(\frac{3R^2 m\pi}{4S} - \frac{\mu_1(R+v)^a + \mu_2(R\sqrt{3}+2v)^b}{q}\right)\frac{q - \mu_3 v^c}{\mu_1 R^a + \mu_2 (R\sqrt{3})^b} + \frac{6mR^2}{25S} \cdot \frac{q - 2\mu_3 v^c}{\mu_1 R^a + \mu_2 (R\sqrt{3})^b}.$$

For the last example (when $q = 365$, $\mu_1 = 0.5$, $\mu_2 = 0.25$, $\mu_3 = 1.0$, $v = 1$, $a = b = c = 2$, $R = 6$), the lifetime of the sensor network with a free grid is $\Lambda_v \approx 747m/S$, the WSN's lifetime in the case of a fixed grid is $\Lambda_\delta \approx 621m/S$, the lifetime of cover A1 is $L_{A1} \approx 758m/S$, and the lifetime of the static WSN is $\Lambda_0 \approx 374m/S$. Thus in this example one gets $2\Lambda_0 \approx 1.2\Lambda_\delta \approx 1.01\Lambda_v \approx L_{A1}$ and $\Lambda_0 < \Lambda_\delta < \Lambda_v < L_{A1}$. The inequalities $\Lambda_0 \le \Lambda_\delta \le L_{A1}$ and $\Lambda_v \le L_{A1}$ are always true. The inequality $\Lambda_\delta \le \Lambda_v$ depends on the parameters. Thus if the energy consumption for motion per unit time, $ME = \mu_3 v^c$, is considerably large, one may get the inequality $\Lambda_0 > \Lambda_v$. Let us change only the value of $\mu_3$ in the last example and set $\mu_3 = 180$. Then $\Lambda_\delta$ is the same as $\Lambda_0 \approx 374m/S$, but $\Lambda_v \approx 333m/S < \Lambda_0$.

IV. GRID SIZE OPTIMIZATION

The above results depend on the radius $R$ which, in turn, determines the tile size. The lifetime $L_{A1}$ of model A1 does not depend on $R$. Suppose for simplicity $a = b = c = 2$, $v = 1$ and $l_k = k$ for any $k \le R\sqrt{3}/2$, and let us find the optimal value of $R \in [2, 8]$ giving the maximum of the WSN lifetimes $\Lambda_0(R)$, $\Lambda_\delta(R)$ and $\Lambda_v(R)$. In this case, for the considered models, the lifetimes are

$$\Lambda_0(R) \approx \frac{m\pi q}{S} \sum_{k=1}^{R\sqrt{3}/2} \frac{2k-1}{\mu_1(R+k)^2 + \mu_2(R\sqrt{3}+2k)^2}$$

$$\Lambda_\delta(R) \approx \frac{m\pi}{S}\left(\frac{q}{\mu_1(R+1)^2 + \mu_2(R\sqrt{3}+2)^2} + \frac{1}{(\mu_1 + 3\mu_2)R^2} \sum_{k=2}^{R\sqrt{3}/2} (q - k\mu_3)(2k-1)\right)$$

$$\Lambda_v(R) \approx \frac{m(2.6q - 2.83\mu_3)}{S(\mu_1 + 3\mu_2)} + 1 - \frac{\mu_1(R+1)^2 + \mu_2(R\sqrt{3}+2)^2}{qR^2} \cdot \frac{q - \mu_3}{\mu_1 + 3\mu_2}$$

The function $\Lambda_v(R)$ is non-decreasing, and when, for example, $q = 36$, $\mu_1 = 1.0$, $\mu_2 = 0.2$, $\mu_3 = 5.0$, then $\max_{R\in[2,8]} \Lambda_v(R) = \Lambda_v(R_v) \approx 57m/S$, with the optimal $R_v = 8$. $\Lambda_0(R)$ and $\Lambda_\delta(R)$ are multi-extremal functions, and we get $\max_{R\in[2,8]} \Lambda_0(R) = \Lambda_0(R_0) \approx 20m/S$ when $R_0 \approx 7$, and $\max_{R\in[2,8]} \Lambda_\delta(R) = \Lambda_\delta(R_\delta) \approx 33m/S$ when $R_\delta \approx 2.4$.

Further details can be found in [9].

REFERENCES

[1] S. Slijepcevic and M. Potkonjak, "Power efficient organization of wireless sensor networks," in ICC, 2001, pp. 472-476.

[2] H. Zhang and J. Hou, "Maintaining sensing coverage and connectivity in large sensor networks," Ad Hoc & Sensor Wireless Networks, pp. 89-124, 2005.

[3] J. Wu and S. Yang, "Energy-efficient node scheduling models in sensor networks with adjustable ranges," Int. J. of Foundations of Computer Science, no. 16, pp. 3-17, 2005.

[4] R. Williams, "The geometrical foundation of natural structure," A Source Book of Design; Dover Pub. Inc.: New York, pp. 51-52, 1979.

[5] M. Cardei, J. Wu, and M. Lu, "Improving network lifetime using sensors with adjustable sensing ranges," Int. J. of Sensor Networks, no. 1, pp. 41-49, 2006.

[6] J. Wu and F. Dai, "Virtual backbone construction in MANETs using adjustable transmission ranges," IEEE Trans. on Mobile Computing, no. 5, pp. 1188-1200, 2006.

[7] M. Cardei and J. Wu, "Energy-efficient coverage problems in wireless ad-hoc sensor networks," Computer Communications, no. 29, pp. 413-420, 2006.

[8] R. Kershner, "The number of circles covering a set," American Journal of Mathematics, no. 61, pp. 665-671, 1939.

[9] A. I. Erzin, "Close to regular plane covering by mobile sensors," Abstracts of Int. Conf. on Optimization and Applications (OPTIMA-2009), Petrovac, Montenegro, pp. 25-26, 2009.


Page 213: ADCOM 2009 Conference Proceedings

Virtual Backbone Based Reliable Multicasting for MANET

Dipankaj G Medhi ADG, Evolving Systems Network India Ltd

[email protected]

Abstract

In this paper, we propose a distributed reliable

multicasting scheme that uses a virtual backbone infrastructure to transmit the packets reliably over the unreliable communication channel of a mobile ad-hoc network (MANET). This novel approach consists of two phases of execution. In the first phase, we used a distributed clustering algorithm that extract a d-hop dominating set and interconnect them to from a backbone. The backbone infrastructure is dynamically changed to reflect the underlying topology condition. Coverage of the backbone node is automatically adjusted in response to communication link quality and node mobility. The reliable multicasting mechanism uses this infrastructure to transmit the data in the second phase of execution, which is the prime area of our investigation. We introduces a NACK based localized loss recovery mechanism, in which the cluster-head act as the source node temporarily. Besides protecting the source from unnecessary retransmission, this approach reduces global congestion caused by control message and retransmission. If the loss packet can’t recover locally, a global loss recovery mechanism is triggered that pull back the lost packet globally. Moreover, it introduces an ACK based one hop reliable packet delivery scheme to reduce control message and retransmitted data packet overhead. Simulation result demonstrates a potential packet delivery ratio in high mobility conditions. 1. Introduction

A Mobile Ad Hoc Network (MANET) is an autonomous system consisting of a collection of wireless nodes that operates without any infrastructure. MANETs are envisioned to support advanced applications such as battlefield and disaster relief operations, temporary event networks, vehicular networks, or any other application that requires a network on demand. The applications targeted by MANETs are group oriented, and multicast is undoubtedly an efficient means of group-oriented communication. Although some applications, such as temporary event networks (e.g., audio/video conferencing), can tolerate packet loss/error, other applications, such as battlefield applications, are loss sensitive. Hence, an inevitable need for reliable multicast arises. However, the reliable multicast solutions proposed for wire-line networks [1, 2, 3, 4, 5] are not efficient to deploy in MANETs. Reliable multicast is a challenging research problem due to the salient characteristics of ad-hoc networks, such as an infrastructure-less, dynamic network topology, error-prone wireless transmission media, node mobility, and limited bandwidth and resources. We accept this challenge and propose a novel solution to this problem in this paper.

Within the broad scope of group communication, this research work addresses the fundamental problem of reliable multicasting. Many approaches to this problem have been proposed in the literature. It has also been shown that the performance of multicast protocols designed as extensions of conventional routing protocols is not attractive under stress conditions. For example, Jetcheva et al. [6, 7] show that the performance of ADMR, MAODV and ODMRP drops below 70% (packet delivery ratio) as the number of sources increases. Similarly, Zhu et al. [8] demonstrated that the performance of MAODV degrades to around 80% when the group size is 5 and the maximum speed of an individual node is 20 m/sec. ADMRP [6] exhibits a packet delivery ratio of up to 95% with a pause time of 800 seconds, a speed of 20 m/sec, and three groups with 3 sources and 10 receivers per group. In a nutshell, all these protocols exhibit intolerably high packet loss rates under moderate to high mobility. Further, flooding is not an alternative approach for reliable multicasting either, as Obraczka [9] demonstrates that "when mobility intervals are very small and node speed is sufficiently high, even flooding becomes unreliable".


Page 214: ADCOM 2009 Conference Proceedings

Since data packet losses are inevitable in a wireless environment, special attention must be paid to improving the packet delivery ratio to achieve reliability. The usual way of recovering a lost packet is by sending feedback from the receiver(s) to the sender; based on the feedback, the sender may retransmit the lost packet. This mechanism can introduce a request-message explosion problem in extreme conditions. To rectify this problem, ReMHoc [10] randomly slows down the recovery process. However, slowing down the recovery process at individual nodes increases end-to-end delay, so this is not a suitable solution. To overcome this problem, AG [11], ReACT [12], RALM [13] and RMA [14] introduce localized loss recovery mechanisms, in which the feedback message is sent to a set of nearby nodes. To get back the lost packet in any localized loss recovery system, the receiver must obtain information about the recovery nodes that hold the lost packet. Some of the existing approaches (for example, AG [11] and ReACT [12]) use an explicit search mechanism to find the recovery node, which in turn introduces control-message overhead. Other approaches (for example, RALM [13] and RMA [14]) maintain a list containing the information of all the receiver nodes; this information is collected by flooding control messages in the entire network, or at least in a part of the network. Further, most of the existing approaches depend on an underlying multicast protocol (for example, RALM, ReMHoc, ReACT).

Keeping all these points in mind, we propose a new approach for reliable multicasting that utilizes a virtual backbone infrastructure. Constructing a virtual backbone for routing in MANETs is not new, but none of the existing reliable multicasting mechanisms explores this area. In doing so, we developed a distributed clustering algorithm that extracts a dominating set and interconnects its members to form a backbone. The rest of the nodes are associated with one of the cluster-heads and form a forest of trees of varying depth. The backbone construction mechanism is similar to ADB [15] and MobDHop [16], except for the cluster-head selection criteria, which properly reflect the dynamic environment of a MANET. Next, we propose an approach for reliable multicasting over the backbone infrastructure to achieve guaranteed delivery of data packets. Further, we develop an ACK-based one-hop loss detection and recovery mechanism to achieve a high degree of reliability. Moreover, we use a NACK-based localized loss recovery mechanism with the cluster-head as the recovery node. Thus, our approach does not demand an explicit search for a recovery node, and hence the network does not suffer from control-message overhead. If a lost packet cannot be recovered locally (within the cluster), a global loss recovery mechanism is proposed, in which the lost packet is pulled from another group (cluster). Further, in our mechanism, the sender does not have to keep any information about the receiver nodes.

The rest of the paper is as follows. Section 2 describes the backbone construction process. Section 3 provides a theoretical correctness proof of the approach. Section 4 explains the mechanism used in this research work to achieve reliable multicasting. Section 5 summarizes this work.

2. Virtual Backbone Construction Process

The virtual backbone infrastructure allows a smaller subset of nodes to participate in forwarding control/data packets and helps reduce the cost of flooding. Most virtual backbone construction algorithms for ad hoc networks are based on the connected dominating set problem, but like ADB and MobDHop, we construct a d-hop connected dominating set that creates a forest of varying-depth trees, each of which is rooted at a backbone node.

2.1 The Cluster-head Selection Criteria

For the construction of the virtual backbone, the choice of criteria for selecting a cluster-head is not trivial. The uniqueness of our virtual backbone creation approach lies in its cluster-head selection criteria. For the efficient working of the proposed approach, the following cluster-head selection criteria need to be considered:

Mobility: The most important factor to be considered in the cluster-head selection process is mobility. In order to avoid frequent cluster-head (re)selection, the cluster-head should be relatively stable. A dynamic node changes its geographical position very frequently, and hence the nodes associated with it also change very frequently. If such a node is selected as cluster-head, frequent breakdowns of the backbone will take place.

Unlike others, we believe that mobility cannot be measured by speed alone (WCA), by Normalized Link Failure Frequencies alone (VDBM), or by Link Life Time alone (RMA). The speed of an individual node does not reflect the surrounding environment. Similarly, Normalized Link Failure Frequencies (VDBM/ADM) reflect the dynamic condition of the area surrounding a node in terms of the number of link failures per neighbor; but sometimes a link may appear failed not because the neighbor is unavailable, but because a Neighbor Discovery request packet was lost at the MAC layer (due to buffer overflow or the hidden/exposed terminal problem). Moreover, keeping track of each link to determine the Link Lifetime may be resource-consuming in a multicasting environment. So, a combination of speed and the Network Layer Link Failure Frequency (NLLFF) is used to represent the true mobility scenario of an individual node at the network layer. A simplified approach for measuring NLLFF is used in this work.

Degree: To reduce the overhead associated with the cluster-head and to achieve proper load balancing, there should be a limit on the number of nodes that can be associated with a cluster-head. So, unlike other approaches (e.g., WCA), in this work preference in the cluster-head selection process is given to a node that is not highly loaded (i.e., whose degree is lower than a threshold degree).

Node ID: If the values of the above two criteria for selecting the cluster-head are the same, the conflict is resolved by node ID: the node with the lowest ID is selected as cluster-head in this situation.

Based on these criteria, a weight is assigned to every node as follows:

$$W_i^t = speed_i^t \times Speed\_Factor + NLLFF_i^t \times NLLFF\_Factor + Degree_i^t \times Degree\_Factor$$

where $W_i^t$ is the weight of node $i$ at time $t$, $NLLFF_i^t$ is the Network Layer Link Failure Frequency of node $i$ at time $t$, $Degree_i^t$ is the degree of node $i$ at time $t$, and $Speed\_Factor$, $NLLFF\_Factor$ and $Degree\_Factor$ are the multiplicative factors for the speed, NLLFF and degree of node $i$, respectively.

The Network Layer Link Failure Frequency (NLLFF) reflects the dynamic condition of the surrounding area by measuring how frequently the neighbor table of the current node changes. To measure the NLLFF, the node temporarily remembers its neighbor information. After every $\Delta t$ interval of time, the node compares this remembered information with the newly gathered neighbor information; the difference gives the number of links expired. The LFF for node $i$ at time $t$ can be calculated as

$$LFF_i^t = \frac{NumberOfLinksExpired_i^t}{Degree_i^t}$$

and the Network Layer Link Failure Frequency at time $t$ is smoothed over the past history as

$$NLLFF_i^t = \alpha \cdot NLLFF_i^{t-\Delta t} + (1-\alpha) \cdot LFF_i^t$$

where $\alpha < 1$ is the smoothing factor for the past history ($NLLFF$ at time $t - \Delta t$).
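A C++ sketch of this weight computation follows. The smoothing update is written as reconstructed from the text, and the multiplicative factors are left as parameters since no concrete values are given:

#include <cstddef>

// Per-node state assumed for illustration.
struct NodeState {
    double speed;        // current speed of the node
    double nllff;        // smoothed network-layer link failure frequency
    std::size_t degree;  // current number of neighbors
};

// LFF at time t: links expired since the last neighbor-table snapshot,
// normalized by the node's degree.
double instantaneousLFF(std::size_t linksExpired, std::size_t degree) {
    return degree ? static_cast<double>(linksExpired) / degree : 0.0;
}

// Exponential smoothing with factor alpha < 1 weighting the past history;
// the exact form of this update is a reconstruction from the text.
double updateNLLFF(double prevNLLFF, double lff, double alpha) {
    return alpha * prevNLLFF + (1.0 - alpha) * lff;
}

// Weight of node i at time t; a lower weight makes a better cluster-head.
double nodeWeight(const NodeState& n, double speedFactor,
                  double nllffFactor, double degreeFactor) {
    return n.speed * speedFactor + n.nllff * nllffFactor +
           n.degree * degreeFactor;
}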

2.2 Cluster-head Selection Process

The core selection process at each node begins after the node has waited for a random period of time, usually long enough to allow the node to have heard Hello packets from all of its neighbors. This process decides whether a node should serve as a core or become the child of an existing core. Fig 1 demonstrates the cluster-head election and tree construction procedure.

To keep the number of backbone nodes as low as possible, the cluster-head election procedure has to maintain two constraints:

WT_CONSTRAIN: limits the maximum cumulative weight from the core to a child node. Because of this constraint, the size of a tree in a highly dynamic area will be smaller.

DEPTH_CONSTRAIN: It limits the maximum depth a tree can have.

To maintain the backbone structure in a dynamic environment, every node exchanges a periodic NEIGH_UPDATE message within its neighborhood. Upon receiving a NEIGH_UPDATE message, a node updates its NIT table and performs the following checks:

If the NEIGH_UPDATE message sender is a new core node, compare the weight of the new core node with the weight of the current core node. If the weight of the new core node is less than that of the current core node, the node can join the new core node, provided this does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.

If the message sender is a tree node with a better cumulative weight, the current node can join that tree, provided this does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.


Page 216: ADCOM 2009 Conference Proceedings

If the message contains no better information, the node continues with its current status.

Let NIT ← One Hop Neighbor Table
    q ← min_wt(NIT)   /* returns the least-weighted node's information */
    MyBackboneStatus = NONMEMBER

VirtualBackboneCreation()
1.  q ← min_wt(NIT);
2.  if (my_wt < q→wt)
3.      MyCoreID = MyOwnID;
4.      MyParentID = MyOwnID;
5.      MyBackboneStatus = CLUSTERHEAD;
6.      OneHopBroadcast_MyStatus(MyCoreID, MyParentID, Wt, DistanceToCore);
7.  else
8.      if (my_wt = q→wt)
9.          if (my_ID < q→ID)
10.             MyCoreID = MyOwnID;
11.             MyParentID = MyOwnID;
12.             MyBackboneStatus = CLUSTERHEAD;
13.             OneHopBroadcast_MyStatus(MyCoreID, MyParentID, Wt, DistanceToCore);
14. for (;;)
15.     on receiving MyStatus(CoreID, ParentID, Wt, DistanceToCore)
16.     if ((my_wt + Wt) < WT_TH              /* weight constraint not violated */
            && (DistanceToCore + 1) < DEPTH_TH)   /* depth constraint not violated */
17.         if (my_wt > Wt)                   /* I am heavier */
18.             MyCoreID = CoreID;
19.             MyParentID = ParentID;
20.             MyBackboneStatus = MEMBER;
21.             CumulativeWt = my_wt + Wt;
22.             DistanceToCore++;
23.             OneHopBroadcast_MyStatus(MyCoreID, MyID, CumulativeWt, DistanceToCore);
24.         else if (my_wt == Wt)             /* my weight is the same */
25.             if (my_ID < senderID)         /* resolve the conflict */
26.                 MyCoreID = CoreID;
27.                 MyParentID = ParentID;
28.                 MyBackboneStatus = MEMBER;
29.                 CumulativeWt = my_wt + Wt;
30.                 DistanceToCore++;
31.                 OneHopBroadcast_MyStatus(MyCoreID, MyID, CumulativeWt, DistanceToCore);
        else
32.         wait to be covered by another node
33.     if (wait-to-be-covered time expired && MyBackboneStatus == NONMEMBER)
34.         MyCoreID = CoreID;
35.         MyParentID = ParentID;

Figure 1: Backbone Construction Process

2.3 An Illustrative Example

Figures 2 to 6 demonstrate the backbone creation process. Figure 2 shows the initial configuration of the nodes in the network with individual node IDs. Dotted circles with equal radius represent the fixed transmission range of each node.

Figure 2. Initial Configuration of node

Figure 3. Neighbor nodes with weight

Figure 4. Cluster with clusterhead

Fig 3 shows the neighbor nodes with the corresponding weight of every node; this is the resultant weight after executing the backbone construction process. The cluster-head selection procedure is executed in a purely distributed manner and elects nodes 3, 4 and 7 as cluster-heads. For example, node 7 is the minimum-weight node in its neighborhood (node 2, node 14 and node 9), hence it declares itself as cluster-head and broadcasts this information to its neighborhood. All the other nodes gradually join a cluster-head. Finally, a cluster is formed locally, as shown in Fig 4. From Fig 4, it is observable that node 9 is in the transmission range of node 7 and node 4, so this node will exchange core node information with node 7 and node 4. Thus node 7 and node 4 will be able to learn of each other.

Figure 5. The backbone

Figure 6. The Logical View

Figure 5 shows the overall logical view after executing the proposed backbone construction algorithm. Figure 6 shows the logical view of the virtual structure, which will be used in the rest of this paper for simplicity.

3. Correctness Proof

Assumption: To study the correctness of the approach from a theoretical perspective, an ideal network situation has to be assumed; in other words, there is no queuing delay or packet loss in the network. We represent the MANET as a unit disk graph G = (V, E), where V is the set of nodes in the vicinity and E is the set of bidirectional links between neighboring nodes. After execution of the algorithm, the graph is divided into two subsets, $V_c = \{c : c \text{ is a core node}, c \in V\}$ and $V_m = \{m : m \text{ is a member node}, m \in V\}$. Clearly, $V_c \cup V_m = V$. Two nodes are considered neighbors if and only if their geographical distance is no more than a given transmission range r. Let $N_1(V)$ denote the set of all nodes that are in V or have a direct neighbor in V. The set $N_1(V)$ is a dominating set of V which covers all the nodes given by $V - N_1(V)$. In general, the d-hop subgraph $G_d(v)$, induced from the d-hop information of v, is $(N_d(v), E_d(v))$. $N_d(v)$ denotes the d-hop neighbor set of node v. In other words, $N_0(v) = \{v\}$ and

$$N_d(v) = \bigcup_{u \in N_{d-1}(v)} N_1(u) \cup N_{d-1}(v)$$

for $d \ge 1$. $E_d(v)$ denotes the set of links between d-hop neighbors. The local execution of the algorithm by any node v generates $N_d(v)$.
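Operationally, $N_d(v)$ is a breadth-first search from v truncated at depth d; a C++ sketch, where the adjacency-list graph representation is an assumption for illustration:

#include <queue>
#include <set>
#include <utility>
#include <vector>

// Collects every node reachable from v in at most d hops, per the
// recursive definition above: N_0(v) = {v}, and each N_k(v) extends
// N_{k-1}(v) by the 1-hop neighbors of its members.
std::set<int> dHopNeighborhood(const std::vector<std::vector<int>>& adj,
                               int v, int d) {
    std::set<int> seen{v};                 // N_0(v) = {v}
    std::queue<std::pair<int, int>> q;     // (node, distance from v)
    q.push({v, 0});
    while (!q.empty()) {
        auto [u, dist] = q.front(); q.pop();
        if (dist == d) continue;           // do not expand beyond d hops
        for (int w : adj[u])
            if (seen.insert(w).second)     // newly discovered node
                q.push({w, dist + 1});
    }
    return seen;
}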

Lemma 1 (Correctness): Every node can be a member of only one cluster.

Proof: A cluster is identified by its core ID. Every core node sends a MyStatus message (lines 6 and 13 of Fig 1), and this message is received by all its 1-hop neighbors; this creates $N_1(v)$. The weights of these neighbors are obviously greater than that of the core node, so they are covered by that core node. These nodes (covered by the current core node) can extend the coverage to other nodes (lines 23 and 31 of Fig 1) to construct $N_2(v)$, so those nodes also belong to the same cluster.

When d in $N_d(v)$ exceeds DEPTH_THRESHOLD and the wait time has expired (line 33), the node creates its own cluster. Hence, every node can determine its cluster, and only one cluster.

Lemma 2 (Time-boundedness): The algorithm terminates in a finite amount of time.

Proof: Initially all nodes are NONMEMBER. At any instant of time t1, there exists at most one NONMEMBER node having the minimum (Weight, Degree, ID), because of the uniqueness of the metric. At time t2 ≥ t1 + (initial wait time), this NONMEMBER node turns into a core node, and its neighbors turn into MEMBERs after the message propagation delay. Thus the number of NONMEMBER nodes decreases by at least 1. In the worst case, after T ≤ |N| × (message propagation delay) all NONMEMBER nodes are exhausted. Hence the lemma follows.

Theorem 1 (Correctness and time-boundedness): The algorithm generates a d-hop dominating set within a finite time.

Proof: A core node expands its coverage to d hops (Lemma 1). Moreover, from Lemma 2, the set of CORE nodes forms a dominating set. Hence, the algorithm generates a d-hop dominating set.

Theorem 2: No two nodes in D are neighbors, where D is the d-hop dominating set generated by the execution of the algorithm.

Proof: This can be proved by contradiction. Assume that two core nodes i, j ∈ D, the dominating set, are neighbors. During the execution of the algorithm, the lowest-weighted node among neighbors is selected as core. So, if i and j are both in D, each would have to have a smaller (Weight, Degree, ID) than the other, which is not possible. Hence, the theorem is proved.

4. Virtual Backbone Based Reliable Multicasting Protocol

Now we describe the proposed reliable multicasting protocol, which uses the virtual infrastructure constructed by the above-mentioned approach. For guaranteed delivery of data packets, we use an ACK-based technique to ensure reliable one-hop data delivery and a receiver-initiated NACK-based approach for data recovery at the receiver side.

4.1 One Hop Reliable Packet Delivery: ACK based approach

One of the main goals of this research work is to reduce the bottleneck at the sender node by limiting retransmission requests. To achieve this goal, an ACK-based retransmission mechanism for successful packet delivery to the neighbor node is used. Consider the scenario shown in Fig 7, in which sender S is sending data packets to receiver R via intermediate nodes i1 and i2, and packet no. 4 is lost between nodes S and i1. With the one-hop reliable packet delivery scheme, this loss is identified and recovered locally. In a purely receiver-initiated NACK-based scheme, the receiver would try to recover the data packet (packet no. 4) by sending a retransmit request (NACK) over the links (R, i2), (i2, i1), (i1, S); the links (R, i2) and (i2, i1) would thus be overburdened by retransmission requests. We believe that this burden can be reduced by providing an ACK-based system in which loss recovery is done at the one-hop neighbor. Moreover, it reduces the bottleneck at the sender due to NACK messages.

Figure 7. Packet loss during transmission from sender to receiver

In order to recover from one-hop packet loss, every node maintains a packet sent table. As soon as it transmits/forwards a data packet, it makes an entry in that table with ACK_Recv = 0. As soon as a node receives a data packet, it sends back an ACK to the sender. Upon receiving an ACK, the sender sets ACK_Recv in the packet sent table for the particular packet and the corresponding destination node (i.e., for the <MCastAddr, SeqNo, TimeStamp, DestID> combination). Periodically, every node checks its packet sent table and retransmits a data packet to the one-hop destination node if it has not received an ACK from that node. This process continues for MAX_RETRY times (in our simulation, this value is 2). After MAX_RETRY attempts, the node presumes that the neighbor node has gone out of range.
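A C++ sketch of this packet sent table bookkeeping follows; the table layout and the callback hooks are illustrative assumptions, while MAX_RETRY = 2 follows the simulation value given in the text:

#include <cstdint>
#include <map>
#include <tuple>

constexpr int MAX_RETRY = 2;   // value used in the paper's simulation

// Each transmitted packet is keyed by <MCastAddr, SeqNo, TimeStamp, DestID>.
using PacketKey = std::tuple<uint32_t, uint32_t, uint64_t, uint32_t>;
struct SentEntry { bool ackRecv = false; int retries = 0; };

struct PacketSentTable {
    std::map<PacketKey, SentEntry> entries;

    void onSend(const PacketKey& k) { entries[k]; }        // ACK_Recv = 0
    void onAck(const PacketKey& k)  { entries[k].ackRecv = true; }

    // Periodic sweep: retransmit unacknowledged entries, or give up and
    // treat the neighbor as out of range after MAX_RETRY attempts.
    template <typename RetransmitFn, typename GoneFn>
    void sweep(RetransmitFn retransmit, GoneFn neighborGone) {
        for (auto it = entries.begin(); it != entries.end();) {
            SentEntry& e = it->second;
            if (e.ackRecv)              { it = entries.erase(it); continue; }
            if (e.retries >= MAX_RETRY) { neighborGone(it->first);
                                          it = entries.erase(it); continue; }
            ++e.retries;
            retransmit(it->first);
            ++it;
        }
    }
};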

4.2 Intra Cluster Operation

The reliable multicasting protocol explained in this research work performs its operation at two levels: intra-cluster (local) and inter-cluster (backbone). Inter-cluster operation is based on the concept of a rendezvous point, where all control messages and data packets are directed to the cluster-head of the cluster to which the node belongs.

4.2.1 Multicast Joining

Each node maintains a Multicast Member Joining Table (MMIT) that keeps information about the child nodes that participate in the multicast process. To join a multicast group, a multicast member node registers itself with its own cluster-head. For this purpose, node i announces its existence by sending a JOIN_REQ message to its parent (ParentID_i). The parent node updates its MMIT, sets itself as a forwarder and forwards the message towards the cluster-head. This process continues until the JOIN_REQ message reaches the cluster-head. As soon as the cluster-head receives a JOIN_REQ message, it makes an entry in its MMIT. Thus the forwarding nodes create a multicast tree rooted at the core.

Similarly, a node sends a LEAVE_REQ message whenever it wants to leave the group. The LEAVE_REQ message is processed in the same way as the JOIN_REQ message: as soon as a parent node/cluster-head hears a LEAVE_REQ message, it simply removes the node's entry from its MMIT.
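A sketch of the join handling (Python; fields such as parent_id, mmit and is_cluster_head mirror the structures named above but are otherwise illustrative assumptions):

    def handle_join_req(node, join_req):
        """Propagate a JOIN_REQ up the tree, updating the MMIT at each hop
        until it reaches the cluster-head (cf. Section 4.2.1)."""
        child = join_req["sender"]
        node.mmit.add((join_req["group"], child))   # record the joining child
        if node.is_cluster_head:
            return                                   # tree is rooted here (the core)
        node.is_forwarder = True                     # set self as a forwarder
        join_req["sender"] = node.node_id
        node.send(node.parent_id, join_req)          # continue towards the cluster-head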

4.2.2 Data Transmission

After construction of the backbone and the forest of varying-depth trees, any node can start transmitting data packets via the backbone or the tree. If the source is a tree member, it simply forwards the data packet to its parent node and sends it to all multicast member child nodes (consulting the MMIT). Upon receiving a packet from a child node, the parent node forwards the data packet until it reaches the cluster-head. If, during this forwarding, there is a multicast member on the path towards the cluster-head, it simply


Page 219: ADCOM 2009 Conference Proceedings

receives the packet and sends it to the upper layer. When a core node receives a data packet from a downstream node, the packet is buffered and forwarded to the other core nodes by the backbone-level multicast process. The multicast source inserts a sequence number, the multicast group address and a time stamp (packet creation time) in every outgoing packet to uniquely identify it.

4.2.3 NACK based Error Recovery

The one-hop packet delivery mechanism improves the packet delivery ratio between a pair of nodes. However, this supportive mechanism may not guarantee successful delivery of a packet to the destination node in highly dynamic situations. Hence, a NACK-based error recovery mechanism is adopted at both the local and backbone levels.

In our local loss recovery mechanism, the cluster-head acts as the recovery node; hence, unlike other systems, our approach does not need an explicit search for the recovery node. Instead of sending a retransmission request as soon as it detects a lost packet, a receiver periodically informs the cluster-head about the packets it has not yet received. The major advantage of this mechanism is that it helps reduce NACK explosion at the core node. Each NACK message includes a sequence number R and a vector V. The sequence number R indicates that all packets up to R have been received successfully, and each flag in the vector corresponds to the sequence number of a lost packet.
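For concreteness, the (R, V) encoding can be sketched as follows (Python; the fixed window size is our assumption, since the paper does not specify the vector length):

    def build_nack(received_seqs, window=32):
        """Encode a NACK as (R, V): R = highest sequence number with all
        packets up to it received; V = flags marking lost packets after R."""
        r = 0
        while (r + 1) in received_seqs:
            r += 1
        lost_flags = [(r + 1 + k) not in received_seqs for k in range(window)]
        return r, lost_flags

    def lost_packets(r, lost_flags):
        # Recover the sequence numbers the receiver is asking for.
        return [r + 1 + k for k, flag in enumerate(lost_flags) if flag]

    # Example: packets 1-5 and 7 received; 6, 8, 9 missing
    r, v = build_nack({1, 2, 3, 4, 5, 7}, window=4)
    assert r == 5 and lost_packets(r, v) == [6, 8, 9]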

4.3 Inter Cluster Operations

The previous section explained how to perform the multicast operation inside a cluster. To achieve data forwarding among different clusters, a core node receiving a data packet must distribute it among the other core nodes.

4.3.1 Data Transmission

Any node in the backbone may play one of the following two roles – Core and Forwarder. A core node may be a multicast source or receiver, or it may simply act as a distributor of data packets among other nodes.

If a core node is a source node, it sends the data packet to all other cluster-heads via forwarder node(s). Moreover, it sends the data packet to all multicast member children within the cluster by consulting the MMIT. Whenever a cluster-head receives a data packet for the first time, it buffers the packet, sends back an ACK to the sender and performs a cluster-level multicast. Moreover, it forwards the data packet to other nearby core nodes listed in the Core Information Table (CIT). If the received packet is an old one, it is simply discarded and an ACK is sent back to the sender. Upon receiving a data packet, a forwarder node simply delivers it on the link towards the cluster-head.

4.3.2 NACK Based Error Recovery

As soon as a cluster-head receives a new data packet, it buffers that packet to satisfy future NACK requestors. Due to the inherent characteristics of MANETs, a cluster-head may need to pull data packets from another cluster-head to satisfy the needs of its own cluster members. For this purpose, a NACK-based error recovery mechanism is also required at the backbone level.

Upon receiving a NACK message from downstream, the cluster-head checks the availability of the data packet in its buffer. If the packet is available, it sends the data packet back to the requester. Otherwise, it makes an entry in its NACK_Table (a table that keeps track of the NACK messages received by a node), generates a NACK request message with its own ID in the requestor field, and hands the request message over to the neighbor leading to the nearest cluster-head. This helps reduce the lost-packet recovery latency for the receiver.
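A minimal sketch of this backbone-level recovery step (Python; the node object, its buffer, NACK_Table and routing helper are illustrative assumptions standing in for the structures named above):

    def handle_backbone_nack(node, nack, requester):
        """Backbone-level NACK handling at a cluster-head (illustrative)."""
        seq = nack["seq"]
        if seq in node.buffer:
            # Packet was buffered earlier: satisfy the requester directly.
            node.send(requester, node.buffer[seq])
        else:
            # Record the pending request in NACK_Table, then pull the packet
            # from a nearby cluster-head with our own ID in the requestor field.
            node.nack_table.setdefault(seq, []).append(requester)
            forwarded = {"seq": seq, "requestor": node.node_id}
            node.send(node.neighbor_towards_nearest_cluster_head(), forwarded)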

5. Simulations and Performance Evaluation

To analyze the performance of the proposed approach, we have conducted experiments using GloMoSim 2.02, a library-based simulator for mobile networks designed by Scalable Network Inc. In our experiments, 50 nodes are placed randomly in a 1000 m by 1000 m area. Constant bit rate (CBR) traffic is generated by the application, with each payload being 512 bytes. UDP is used at the transport layer, and 802.11 is used as the MAC layer protocol with the two-ray path loss model. All experiments are run for 15 minutes of simulation time.

5.1 Performance Analysis

To analyze the performance of the proposed reliable multicast protocol, the following metrics were used –

Average Packet Delivery Ratio: defined as the number of received data packets divided by the number of data packets generated. This metric measures the effectiveness and reliability of the protocol.


Page 220: ADCOM 2009 Conference Proceedings

Average End-to-End Delay: the mean delay over all packets received by all the receivers. This metric evaluates the protocol's timeliness and efficiency.
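The defining formula appears to have been lost in transcription; a standard definition consistent with the surrounding text (a hedged reconstruction, not necessarily the authors' exact expression) is

    \bar{D} = \frac{1}{N} \sum_{p=1}^{N} \left( t^{recv}_{p} - t^{send}_{p} \right)

where N is the total number of data packets received by all receivers, and t^{send}_p, t^{recv}_p are the creation and reception times of packet p (the creation timestamp is carried in every packet, as noted in Section 4.2.2).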

First, we study the behavior of the proposed approach under varying mobility. For this purpose we fixed the radio range of each node at 12.0 dB, with a packet inter-departure interval of 200 ms and 50 nodes in the vicinity.

Fig 8 plots the packet delivery ratio against node mobility. The packet delivery ratio achieved was above 99%. However, with increasing mobility (mostly above 30 m/sec) and an increasing number of groups, there is a slight downward trend in packet delivery ratio (less than 0.3%). This degradation is due to two major factors:

At extreme mobility, the backbone becomes unstable as the number of re-affiliations increases; so, for a fraction of a second (during cluster-head handoff), a node may not be able to receive any packets.

During the loss recovery process, more and more control messages are generated in the network, which in turn increases congestion, and as a result more packets are lost.

[Figure: Packet Delivery Ratio (0.9–1.02) vs. Mobility (m/sec, 0–50), one curve per Multicast Group = 1, 2, 3]

Figure 8: Effect of mobility on PDR (Interdeparture Interval 200ms, Tx Range 12dB)

Fig 9 shows the end-to-end delay comparison for different multicast group sizes. There is a clear, sharp increase in end-to-end delay with increasing mobility and number of senders. The end-to-end delay experienced by receivers in a small group is encouraging; Multicast Group 1 in Fig 9 evidences this.

[Figure: Average End-to-End Delay (0–3.5 sec) vs. Mobility (m/sec, 0–50), one curve per Multicast Group = 1, 2, 3]

Figure 9: Effect of mobility on end-to-end delay

Fig 10 and Fig 11 show the packet delivery ratio with varying inter-departure time at mobility 0 and 50 m/sec, respectively. In both situations, the packet delivery ratio achieved by the proposed approach is above 99%. However, with decreasing packet inter-departure interval and increasing number of groups, there is a slight downward trend in packet delivery ratio (less than 0.1%). This degradation is due to the high contention the network experiences as the traffic rate and network load grow.

[Figure: Packet Delivery Ratio (0.9–1.02) vs. Packet Inter-departure Interval (100–500 ms), one curve per Multicast Group = 1, 2, 3]

Figure 10: Effect of Traffic rate on Packet Delivery Ratio (mobility 0, Tx Range 10)


Page 221: ADCOM 2009 Conference Proceedings

[Figure: Packet Delivery Ratio (0.9–1.02) vs. Inter-departure Interval (100–500 msec), one curve per Multicast Group = 1, 2, 3]

Figure 11: Effect of Traffic rate on Packet Delivery Ratio (mobility 50 m/sec, Tx Range 10)

Fig 12 and Fig 13 show the average end-to-end delay with varying inter-departure time at mobility 0 and 50 m/sec, respectively. As seen in Fig 12, the end-to-end delay increases steadily as the packet inter-departure interval decreases and the number of group members increases. Although the results are promising at higher inter-departure intervals, the end-to-end delay grows to approximately 2 seconds with a larger number of sources and a lower packet inter-departure interval.

[Figure: Average End-To-End Delay (0–3 sec) vs. Packet Inter-departure Interval (100–500 msec), one curve per Multicast Group = 1, 2, 3]

Figure 12: Effect of Traffic rate on End-to-end delay (mobility 0)

[Figure: Average End-To-End Delay (0–3 sec) vs. Inter-departure Interval (100–500 msec), one curve per Multicast Group = 1, 2, 3]

Figure 13: Effect of Traffic rate on End-to-end delay (mobility 50 m/sec, Tx Range 10)

6. Conclusion and Future Directions

This research work proposed a reliable multicasting approach for mobile ad hoc networks. We used a different cluster-head selection criterion to find the d-hop dominating set; for load balancing, the highest-degree node is given less weight-age in this process. We then used the virtual infrastructure for reliable multicasting. The error control mechanism combines a one-hop ACK-based reliable data packet delivery approach, which helps reduce global packet retransmission requests and hence increases the performance of the network. Unlike other approaches, the localized loss recovery mechanism used here does not demand an explicit search for a recovery node. Besides protecting the source from unnecessary retransmission requests, it reduces the global congestion caused by control messages and retransmissions. Through extensive simulation, we evaluated the performance of the proposed approach over a wide range of MANET scenarios. It shows a promising packet delivery ratio of up to 1, which was one of our main objectives.


Page 222: ADCOM 2009 Conference Proceedings



Page 223: ADCOM 2009 Conference Proceedings

ADCOM 2009
DISTRIBUTED SYSTEMS

Session Papers:

1. Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika, “Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation”

2. Sundar Raman S and Varalakshmi P, “Trust in Mobile Ad Hoc Service GRID”

3. Soumitra Pal and Abhiram Ranade, “Scheduling Light-trails on WDM Rings”


Page 224: ADCOM 2009 Conference Proceedings


Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation

Achyanta Kumar Sarmah1,2, Smriti K. Sinha1 and Shyamanta M. Hazarika1

1 School of Engineering, Tezpur University, Assam, India, (achinta, smriti, smh)@tezu.ernet.in
2 Rajiv Gandhi Indian Institute of Management, Shillong, Meghalaya, India, [email protected]

Abstract

Repositories of Security Patterns (SPs) developed over the years are based on security templates that are essentially different from one another, each trying to capture security solutions at different levels of abstraction under different perspectives. This lack of uniformity among the repositories leaves the user without a proper organization of SPs to choose from. In addition, no representation of SPs has facilitated retrieval of inter-dependent patterns in a context, even though patterns are always related to one another in a context. In this paper, we carry forward the idea of the Security Pattern Lattice (SPL) proposed in our previous work [1] and attempt to arrive at a SP template that would cover the different existing SP repositories. We conceptualize a Security Concept as comprising an extension of SPs and an intension of Security Requirements. We also introduce the concept of a Multi-context in a SPL that allows a user to search for a concept with a given SP and retrieve its related patterns.

Keywords: Multi-context, Security Pattern, Security Requirement, SPL

I. Introduction

Experts and developers working on security are primarily concerned with the architecture and design of security solutions. In contrast, a user wants a convenient and simple way of incorporating security solutions into a system without needing to understand the complexity of their architecture and design. The SP is an engineering approach to bridge this gap. It documents a reusable solution to a recurring security problem in a context. In essence, it captures the expert's knowledge to address a security problem.

For a structural design of a security concept we can attach certain conceptual elements to a SP and define a template. The available collections and repositories of SPs are found to follow some template structure. However, these templates define SPs at different levels of abstraction, and the SPs themselves differ in their perspectives. As such, a common structure to encompass all these patterns is still missing, one which would allow us to select, implement and deploy a pattern at the user level without requiring an understanding of the engineering and design details of the pattern. In our previous work [1] in this regard, we organized Security Patterns as a Concept Lattice. Carrying forward with this, we attempt to arrive at a SP template that covers the different existing SP repositories at different levels of abstraction and perspective. We introduce the concept of a Multi-context in a SPL that allows a user to navigate to a concept in the SPL and then select the exact pattern wanted.

II. Related works on SP

A host of SPs have been proposed to date. The authors in each case define a security template at a certain level of abstraction and then enumerate SPs from some perspective. For example, in [2], seven patterns related to the application and network domains are proposed at the design level. In [3], twenty-three patterns related to J2EE applications, identity management, web services and service provisioning are proposed at the deployment and functional level. In [4], fourteen patterns related to the application and network domains are proposed at the implementation and deployment level. Apart from such attempts at enumerating security patterns, there have also been attempts at creating a common repository of these patterns, as in [5], [6]. However, these repositories serve chiefly as documentation of the existing patterns and their templates, and are devoid of a structured algorithmic formalism that would allow a developer to directly extract a required pattern from the repository and implement


Page 225: ADCOM 2009 Conference Proceedings

it in any platform. This chiefly happens for two reasons: firstly, the templates used are different from one another and specifically suit only the abstraction and perspective for which they are used; secondly, a hierarchy-based organizational scheme for security patterns, which are related to one another, is missing. A hierarchy-based organizational scheme allows users a one-point navigation facility to search for patterns based on any characteristic. In [1], we decided on a security template of our own based on Christopher Alexander's definition of pattern, and attempted to organize patterns by exploiting results from FCA and constructing a concept lattice of security patterns, the SPL. Here we propose a generic template as discussed above.

III. SP template

A. SP templates from related works

In this section we summarize some of the templates used in the field of SPs. Our aim in reviewing these templates is not to arrive at a template structure that would encompass all of them and give us uniformity of representation; it is rather to detect the various perspectives of SPs. We then attempt to capture these perspectives in the semantics of our proposed template and arrive at an algorithmic structure. Uniformity of representation allows proper and efficient selection, while the algorithmic structure helps us translate a pattern into any implementing framework.

1) Markus Schumacher's pattern template: Based on the terminology provided in the Common Criteria, Schumacher in [7] proposes a template with the elements Name, Context, Problem and Solution, along with some other optional elements that can improve the comprehension of a SP.

2) Kienzle and Elder's template: Kienzle et al. in [8] consider four levels of abstraction for SPs, viz:

a) Concepts: These encompass general strategies and are represented by abstract nouns that cannot be directly implemented by developers, for example "least privilege".

b) Classes of patterns: A class represents a general problem area that could have multiple solutions.

c) Patterns: A pattern is specific enough to allow basicproperties to be specified and trade-off analysis to beconducted against other patterns.

d) Examples: An example is typified by sample code. Itis the most immediately useful, but in a very narrowcontext.

The authors take an object-oriented approach and propose a template at the third level of abstraction with the elements Pattern name, Abstract, Aliases, Problem, Solution, Static Structure, Dynamic Structure, Implementation issues, Common attacks, Known uses, Sample code, Consequences, Related patterns, and References. The elements of this template exhibit different perspectives:

• Documentation: Pattern name, Abstract, Aliases, Known uses, Consequences.
• Functionality: Problem, Solution.
• State: Static structure, Dynamic structure.
• Environmental: Related patterns, References, Common attacks.
• Implementation: Implementation issues, Sample code.

3) Sun's Core SP template for J2EE by Nagappan et al.: Motivated by the concept of Security by Default [3], a notion that ensures security at all OSI levels, the authors put forward a security design methodology with the following stages:

a) Define security requirements
b) Candidate security architecture
c) Perform risk and trade-off analysis
d) Identify SPs and create a security design
e) Implement a prototype
f) Validation testing and auditing

The security template used for security patterns in this case has Problem, Forces, Solution, Structure, Strategies, Consequences, Reality checks, and Security actors and risks as its elements. The reality-check element in this template considers the perspective of testing resources for the applicability of a pattern, though at a very conceptual level.

4) Microsoft's Web Service Security template: Microsoft classifies patterns for Web Service security into architectural, design and implementation patterns. For this purpose it considers the elements Name, Context, Problem, Forces and Solution, with semantics similar to the elements of the other templates.

Most of these templates adhere to the basic elements of a security pattern, viz: Name, Context, Problem and Solution, with the addition of some optional elements. In our case, we consider a pattern to have an algorithmic representation. We therefore build upon these basic elements and propose a template as follows:

CONTEXT – These are the preconditions that need to be met for applying the SP. For example, authentication always needs to be performed before authorization.

A precondition here exhibits an inter-pattern relationship or dependency between patterns. It could be implemented as a vector Vect<pattern, perspective>, where each element of the vector is a tuple <pattern, perspective>, and the pattern in question is applicable for one or more of the elements of Vect.
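A minimal sketch of this representation (Python; the paper does not fix a concrete encoding, so the class and field names here are illustrative):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass(frozen=True)
    class Precon:
        """One element of the precondition vector: a <pattern, perspective> tuple."""
        pattern: str
        perspective: str

    @dataclass
    class SecurityPattern:
        name: str
        context: List[Precon] = field(default_factory=list)  # Vect<pattern, perspective>

        def depends_on(self, pattern_name: str) -> bool:
            # True if some precondition references the given pattern.
            return any(p.pattern == pattern_name for p in self.context)

    # Example: authorization requires authentication at the design level
    authz = SecurityPattern("Authorization",
                            context=[Precon("Authentication", "design")])
    assert authz.depends_on("Authentication")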


Page 226: ADCOM 2009 Conference Proceedings

PROBLEM – This defines a situation that would require the pattern to be applied. For example, authentication elements such as cards and passes tend to become namesakes after some time.

SOLUTION – These are the security algorithms/measures to be applied to solve the above problem. For example, a digital signature is a solution for authorization.

CONSEQUENCES – The context of a SP is usually given in terms of some variables or objects. Consequences give us the changes in the values of the context, or introduce/remove them. Apart from these, we could have a number of optional elements that could serve as selection criteria for patterns from a repository. Some of these optional elements may be:

• Aliases
• Known uses or examples
• Abstract
• Sample code
• Common attacks and risks

IV. Security Requirements

SRs are in general considered to be those system security policies that constrain functional requirements. They are expected to provide information about the desired level of security for a system. While identifying and specifying SRs, a common problem is that they tend to be accidentally replaced with security-specific architectural constraints that may unnecessarily prevent the security team from using the most appropriate security mechanisms (patterns, in our case) for meeting the true underlying SRs. Keeping this in mind, we need a structured way of representing security requirements that can distinguish between the security engineering artifact and the domain concepts that the requirement satisfies. We base our SRs on the CIA model of information security and define a template for identifying them as follows:

• CONCERN: This is the security concern represented by the requirement. Concern in our case has three components based on the CIA model of information security, viz: Confidentiality, Integrity and Availability. Extending these components to other perspectives of security, we may have an enumeration of security concerns as Availability, Identification, Authentication, Immunity, Integrity, Intrusion, Non-repudiation, Privacy, Security auditing, Survivability, Physical protection, and System maintenance security.

• ENGINEERING ARTIFACT: These are objects or artifacts that are used for enforcing some security requirement in a system. Any requirement may have one or more engineering artifacts related to it. For example, authorization is usually enforced by an Access Control List or a User Role.

• DOMAIN CONCEPT: These are concepts like role, session and subject which are specific to a certain domain and represent some security requirement.

We illustrate the relationships between the above elements with the example in Figure 1:

[Figure: the Authorization concern linked via <refines> to the engineering artifacts Access Control List, RBAC and Capability, which in turn <depend> on the domain concepts Subjects, Objects, Privileges, Roles and Sessions]

Fig. 1. Elements of Authorization [9]

V. Security Pattern Lattice (SPL)

The SPL is a lattice-theory-based approach to organizing SPs. Over the years, a corpus of SPs has been developed and is expanding continuously. A need of the hour is to address how to organize the security patterns within such a corpus so as to enable application developers to navigate through the library of patterns and select suitable patterns without ambiguity. However, pattern organization is a nontrivial task; the problem of organization is not restricted to SPs but applies to the corpus of patterns in other domains as well. SPs can be defined as a partially ordered set, which in turn can be organized in the form of a lattice. This was exploited in our work [1] to present the SPL using results from FCA. We consider a SP template with the elements <NAME, ALIASES, TASK, PARAMETER, EXHIBIT>, and define a Trust Element (TE):

Definition 1: A Trust Element is a property or a statement about an entity in a context which is otherwise unknown and whose absence makes the entity vulnerable to certain attacks. Here an entity could be a subject, an object or a situation in a given context.

Observing that the set of TEs and the set of SPs exhibit a Galois connection, where minimizing one of them maximizes the other and vice versa, we define a security context and a security concept based on the principles of FCA, and build a formal concept lattice for SPs in the application context, calling it the SPL. The set of SPs serves as the extension while the set of TEs serves as the intension.


Page 227: ADCOM 2009 Conference Proceedings

Thereafter, we attempt to classify SPs within the SPL by the use of scaling techniques.

VI. Multi-context in a SPL

We extend our idea of a SPL to accommodate the search for a specific pattern in a security concept as follows:

The conceptual TE is formalized as a SR with a template. The user searches for SPs on the basis of a SR. This takes the user to a security concept in the SPL whose SPs satisfy the provided SR. To reach the specific SP desired, the user launches a second-level search with some other parameter that discriminates among the SPs in the security concept at hand. In the approach presented here, the second-level search is done on the basis of a precondition, in other words a related pattern. The related pattern submitted by the user is searched for in the precondition vector of each of the patterns in the extent of the security concept at hand, and the search returns all those patterns whose precondition vector contains it. Since in our template of SP the precondition element is multi-valued, we represent a security concept as another lattice, say <Gc, Mc, Ic>, where

Gc = the set of security patterns in the present concept,
Mc = the union of all preconditions belonging to all patterns in Gc,
Ic = the set of relationships between elements of Gc and Mc that tells us which precondition is applicable to which pattern.

In this context, the atomic concept having only the desired precondition as its intension gives us the necessary SPs.

VII. Generating and navigating in a multi-context SPL

A. Concept Generating Algorithms

Generation and navigation in a concept lattice involve iterating through all possible concepts in a given context. Hence, the complexity of any algorithm that generates or navigates in a concept lattice depends upon the number of concepts in the given context, and the number of concepts is exponential in the size of its input context. This case is similar to finding the power set of a set, where the complexity increases exponentially with the size of the set. So, from the standpoint of worst-case complexity, an algorithm generating all concepts and/or the concept lattice can be considered optimal if it generates the lattice in polynomial time delay¹ or in space linear in the number of all concepts [10].

Algorithm 1 presented here is a concept generating algorithm based on the Next-Closure [11], [12] algorithm by Ganter.

Boundary conditions for algorithm 1:
• Single-attribute concepts of the context are given.
• No new attribute or object is added during concept generation.

Data structures involved and input to algorithm 1:
• OBJECTS = array of objects.
• ATTRIBUTES = array of attributes.
• NO_ATR = total number of attributes.
• NO_OBJ = total number of objects.
• CONCEPTARRAY = an array whose elements are CONCEPTs.
• SUPREMUM = the CONCEPT with all objects and null attribute.
• MAXEXTENT = the extent with all objects from OBJECTS as members.
• SINGLEATTRIBUTECONCEPTS = list of concepts whose intents have only one attribute.

Output from algorithm 1:
CONCEPTLATTICE – the generated lattice of the mined concepts.

Procedures involved in algorithm 1:
• VAL: function to build the bitstring from the indices of a subset.
• UPDATESUBINTENTS: function to update the list of subintents. Input parameters: INTENTS, the list of intents; SUBINTENT, the subintent for all intents in INTENTS.
• RETRIEVEPOSITIONS: function to retrieve the set bit positions in an integer.

Classes involved in algorithm 1:
• SUBSET_ITERATOR (a class which allows us to iterate through the subsets of a given set)
  Data:
  A) SET_SIZE: size of the parent set.
  B) SUBSET_SIZE: size of the subsets.
  C) SUBSET_INDICES: list of indices of the current subset.
  D) FIRST_TIME: flag to check whether the iterator has been called for the first time.
  Functions:

¹ An algorithm for listing a family of combinatorial structures is said to have polynomial delay if it executes at most polynomially many computation steps before either outputting each next structure or halting.


Page 228: ADCOM 2009 Conference Proceedings

A) INCREMENT_SUBSET_INDICES(): increment the present subset indices to produce the next subset indices.
  B) NEXT_SUBSET(): function to return the indices of the next subset on the fly.

• CONCEPT (a class that represents a concept)
  Data:
  A) CONCEPT: a 16-bit integer that represents a concept. Its higher-order NO_ATR bits represent the INTENT and its lower-order NO_OBJ bits represent the EXTENT.
  B) SUB_INTENTS: list of INTENTs of the CONCEPTs subsumed by this CONCEPT.
  C) SUP_INTENTS: list of INTENTs of the CONCEPTs that subsume this CONCEPT.
  Functions:
  A) GETEXTENT(): function to return the EXTENT of this CONCEPT.
  B) GETINTENT(): function to return the INTENT of this CONCEPT.
  C) ADD_SUPINTENTS(): function to add superintents of this CONCEPT.
  D) ADD_SUBINTENTS(): function to add subintents of this CONCEPT.
  E) ADD_SUPINTENT(): function to add a single intent of this CONCEPT.

B. Extending algorithm 1 to facilitate multi-context in the concept lattice

In algorithm 2, described below, we extend algorithm 1 to facilitate multi-context in the generated concept lattice and allow the user to search for a pattern based on two criteria: a security requirement and related patterns.

Boundary conditions: same as for algorithm 1.

Data structures involved and input to the algorithm: the data structures and input of algorithm 1, extended with the following:
• NO_PERS = total number of perspectives.
• VECT_PRECON = vector of preconditions whose elements are objects of the class PRECON described below.
• VECT_SELECTEDSP = vector that holds the selected SPs.

Output from the algorithm: VECT_SELECTEDSP populated with the selected SPs corresponding to the selection criteria provided, i.e. SECREQ and PREREQ.

Procedures involved: same as for algorithm 1.

Classes involved: the set of classes of algorithm 1, extended with the following classes:

• PRECON (a class that represents a precondition)
  Data:
  A) PATTERN: the pattern whose presence is required as the precondition.
  B) PERSPECTIVE: the perspective in which PATTERN should be applied as a precondition for the parent pattern.
  Functions:
  A) isPRECON(PATTERN pat): returns true if pat is the same as PATTERN, false otherwise.
  B) GETPAT(): function that returns the PATTERN.
  C) GETPERSPECTIVE(): function that returns the PERSPECTIVE.

• SECPAT (a class representing a SP)
  Data:
  A) PROBLEM: textual definition of the security problem at hand.
  B) SOLUTION: architectural and design specification for the pattern.
  C) CONTEXT: a vector representing the preconditions.
  D) CONSEQUENCES.
  Functions:
  A) GETNEXTPRECON(): function that returns the next PRECON for this SP.
  B) MOVEFIRST_CONTEXT(): positions the CONTEXT vector at the first index.
  C) MOVENEXT_CONTEXT(): positions the CONTEXT vector at the next index.
  D) MOVEABS_CONTEXT(POS): positions the CONTEXT vector at the index POS.
  E) isFIRSTPRECON(): boolean function that checks whether the CONTEXT vector is at its first position.
  F) isLASTPRECON(): boolean function that checks whether the CONTEXT vector is at its last position.
  G) isAFTERLASTPRECON(): boolean function that checks whether the next index is out of bounds.

Selection criteria for a pattern in the generated lattice:
• SECREQ: the security requirement for which an applicable pattern is to be searched.
• PREREQ: the pattern which must exist as a precondition for the pattern(s) applicable to the SECREQ provided.


Page 229: ADCOM 2009 Conference Proceedings

Variables: CUR_ATTRIBUTE_INDICE = 0
repeat
    Find all subsets of ATTRIBUTES of size CUR_ATTRIBUTE_INDICE + 1.
    CUR_INTENT = 0, CUR_EXTENT = MAXEXTENT.
    foreach subset S from the previous step do
        CUR_INTENT = S
        CUR_EXTENT = intersection of the EXTENTs of the CONCEPTs from SINGLEATTRIBUTECONCEPTS corresponding to each element in S.
        if CUR_EXTENT <> 0 then
            1) NEW_CONCEPT = (CUR_INTENT, CUR_EXTENT)
            2) Update SUP_INTENTS of NEW_CONCEPT with CUR_INTENT
            3) Update SUB_INTENTS of each CONCEPT corresponding to each element in CUR_INTENT with NEW_CONCEPT
            4) Append NEW_CONCEPT to CONCEPTARRAY
        end
    end
    CUR_ATTRIBUTE_INDICE = CUR_ATTRIBUTE_INDICE + 1; CUR_INTENT = 0
until CUR_ATTRIBUTE_INDICE = NO_ATR

Algorithm 1. Algorithm to produce a concept lattice from a given context
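For concreteness, the bottom-up core of Algorithm 1 can be sketched as follows (Python; extents are modelled as frozensets rather than the paper's bit-packed 16-bit integers, and the sub-/super-intent bookkeeping is omitted):

    from itertools import combinations

    def generate_concepts(attributes, attr_extent):
        """Enumerate formal concepts by intersecting the extents of
        single-attribute concepts, as in Algorithm 1.
        attr_extent: dict mapping each attribute to its extent (frozenset of objects)."""
        concepts = []
        for size in range(1, len(attributes) + 1):
            for subset in combinations(attributes, size):
                extent = frozenset.intersection(*(attr_extent[a] for a in subset))
                if extent:  # skip empty extents, as the algorithm does
                    concepts.append((frozenset(subset), extent))
        return concepts

    # Toy context: patterns (objects) described by trust elements (attributes)
    ctx = {
        "authN": frozenset({"SP1", "SP2"}),
        "authZ": frozenset({"SP2", "SP3"}),
    }
    for intent, extent in generate_concepts(["authN", "authZ"], ctx):
        print(sorted(intent), "->", sorted(extent))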

Variables: CUR_ATTRIBUTE_INDICE = 0
repeat
    Find all subsets of ATTRIBUTES of size CUR_ATTRIBUTE_INDICE + 1.
    CUR_INTENT = 0, CUR_EXTENT = MAXEXTENT, TAR_CONCEPT = 0.
    foreach subset S from the previous step do
        CUR_INTENT = S
        CUR_EXTENT = intersection of the EXTENTs of the CONCEPTs from SINGLEATTRIBUTECONCEPTS corresponding to each element in S.
        if CUR_EXTENT <> 0 and SECREQ ∈ CUR_INTENT then
            TAR_CONCEPT = (CUR_INTENT, CUR_EXTENT); TAR_EXTENT = CUR_EXTENT
            foreach pattern in TAR_EXTENT do
                if the current pattern CUR_PAT in TAR_EXTENT is valid then
                    search the PRECONs of CUR_PAT for a match with PREREQ; if a match is found, add CUR_PAT to VECT_SELECTEDSP
                end
            end
        end
    end
    CUR_ATTRIBUTE_INDICE = CUR_ATTRIBUTE_INDICE + 1
until CUR_ATTRIBUTE_INDICE = NO_ATR

Algorithm 2. Algorithm for searching in a multi-context SPL
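Extending the previous sketch with the two-level selection of Algorithm 2 (again illustrative; SECREQ and PREREQ are plain strings here, and generate_concepts and ctx are reused from the sketch above):

    def select_patterns(attributes, attr_extent, precons, secreq, prereq):
        """Return patterns lying in a concept whose intent contains SECREQ
        and whose precondition vector mentions PREREQ (cf. Algorithm 2).
        precons: dict mapping pattern -> set of precondition pattern names."""
        selected = []
        for intent, extent in generate_concepts(attributes, attr_extent):
            if secreq in intent:
                for pattern in extent:
                    if prereq in precons.get(pattern, ()) and pattern not in selected:
                        selected.append(pattern)
        return selected

    precons = {"SP2": {"Authentication"}, "SP3": set()}
    print(select_patterns(["authN", "authZ"], ctx, precons,
                          secreq="authZ", prereq="Authentication"))  # -> ['SP2']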

Complexity of the proposed algorithm: The complexity of algorithm 2 for navigating in a multi-context SPL can be compared with that of subset generation for a set of cardinality n. The complexity of the subset generation algorithm is O(2^n), where n is the cardinality of the parent set. In this case, the maximum number of concepts to be searched is the total number of concepts in the SPL, which equals the cardinality of the power set of the set of attributes, ATTRIBUTES. But the actual number of valid concepts will be less in most cases, and the search may reach a target much earlier in the lattice.


Page 230: ADCOM 2009 Conference Proceedings

Let X < 2^{NO_ATR}, where X is the number of valid concepts, and define:
C1 = time required to perform an indexed search for an extent or intent = linear and constant;
C2 = time required for one computation = linear and constant;
C3 = time required to perform one bit-level operation = linear and constant;
C4 = time required to perform one conditional operation = linear and constant;
C5 = time to retrieve an extent or intent of a concept = linear and constant;
C6 = average time required to search linearly for an attribute in an intent = linear;
C7 = average time required to search for a given precondition in a pattern = linear.

We proceed to compute the analytical complexity of our algorithm based on X as follows.

Number of comparisons/computations to search for a concept:
1) Search for SECREQ in the intent of the current CONCEPT = C6
2) Retrieve the extent if the search at step 1 is successful = C5 + 1
In the worst case, the total number of comparisons that may be required to search for a concept in the SPL is X * (C6 + C5 + 1).

Number of comparisons to search for patterns in a concept with a given precondition:
1) Average number of patterns in the extent of a given concept = (NO_OBJ + 1)/2
2) Average time required to search for a precondition in a pattern = C7
3) Total time to search for patterns in a concept with a given precondition = C7 * (NO_OBJ + 1)/2

Total number of comparisons/computations for selecting patterns in the SPL based on a SECREQ and a precondition = (comparisons to search for a concept) + (comparisons to search for patterns in a concept with a given precondition) = X * (C6 + C5 + 1) + C7 * (NO_OBJ + 1)/2.

Since C5 is constant, taking C5 ≈ 1 we have: number of comparisons/computations ≈ X * (C6 + 2) + C7 * (NO_OBJ + 1)/2. In this expression C6 is linear in the number of attributes and C7 is linear in the number of patterns, so the terms C7 * (NO_OBJ + 1)/2 and (C6 + 2) are of linear order. Hence the overall complexity of the expression depends upon the order of X, i.e. it is O(X). Consequently, the algorithm is of polynomial order (and hence feasible) if X is at most polynomial. The order of X depends on the sparseness or denseness of the initial context that gives all the single-attribute concepts: the denser the context, the higher the order of X, and correspondingly lower for a sparse context.

VIII. Conclusions and future directions

Analysing the existing templates for SPs, we attempt to capture the various perspectives in which security patterns are considered. Building on the SPL proposed by the authors earlier, a concept lattice organizing security patterns in a formal manner, the present work extends the SPL to include a multi-context within it. For this, an all-encompassing template for security patterns is considered. The applicability of all SPs is also considered in light of various SRs based on the CIA model. Casting the SPs and SRs into an FCA framework as in the SPL, the authors propose a generating algorithm for the SPL based on concept-generating tools like Next-Closure, object exploration and attribute exploration. Observing that patterns in a domain always work under a system of forces in which patterns are related to one another, the idea of multi-context is introduced in the SP template as PRECON, a set of patterns related to the current one. The generating algorithm is accordingly extended to facilitate navigation to a particular pattern based on a provided SR and related SP. The present algorithm facilitates navigation based on only one characteristic of a SP, i.e. the CONTEXT or PRECONDITION. Also, the generating algorithm works only with a finite context: if a new SP is to be added, a rerun of the algorithm is required. Future work in this direction could make provision for adding a new concept to the existing lattice without re-running the whole generating algorithm. Facilities could also be incorporated into the navigation algorithm to allow search on any characteristic of a SP. With the algorithms at hand for generation and navigation, applications could be developed that automatically build a view of the generated lattice on any platform.

References

[1] A. K. Sarmah, S. M. Hazarika, S. K. Sinha, Security pattern lattice: A formal model to organize security patterns, in Proceedings of the 19th International Conference on Database and Expert Systems Application (2008) 292–296.

[2] J. Yoder, J. Barcalow, Architectural patterns for enabling application security, in Proc. of PLoP 1997.

[3] C. Steel, R. Nagappan, R. Lai, Core Security Patterns, Prentice Hall, 2007.

[4] S. Romanosky, Security design patterns part 1, v1.1 (2001).

[5] D. Kienzle, M. Elder, D. Tyree, J. Edwards-Hewitt, Security patterns repository v1.0.


Page 231: ADCOM 2009 Conference Proceedings

[6] M. Hafiz, Security patterns and secure software architecture, 51st tutorial at the International Conference on Object Oriented Programming, Systems, Languages and Applications.

[7] S. Markus, R. Utz, Security Engineering With Patterns, Springer-Verlag New York, Inc., 2003.

[8] D. M. Kienzle, M. C. Elder, D. S. Tyree, Introduction: security patterns template and tutorial, retrieved from citeseerx.ist.psu.edu/viewdoc/summary10.1.1.131.2464.

[9] M. Dan, R. Indrakshi, R. Indrajit, H. S. Hilde, Building security requirement patterns for increased effectiveness early in the development process, in Proc. of the Symposium on Requirements Engineering for Information Security, Paris.

[10] S. Kuznetsov, S. Obiedkov, Comparing performance of algorithms for generating concept lattices, Journal of Experimental and Theoretical Artificial Intelligence (2002) 189–216.

[11] B. Ganter, Two basic algorithms in concept analysis, Technical Report preprint, TH Darmstadt.

[12] B. Ganter, K. Reuter, Finding all closed sets: A general approach, Order 8 (1991) 283–290.


Page 232: ADCOM 2009 Conference Proceedings

Trust in Mobile Ad Hoc Service GRID

P. Varalakshmi1, S. Thamarai Selvi2, S. Sundar Raman3

Department of Information Technology, Madras Institute of Technology, Anna University Chennai, Chennai-600044, India

Email : [email protected], [email protected], [email protected]

Abstract

A mobile ad-hoc network (MANET) is a kind of wireless ad-hoc network: a self-configuring network of mobile routers connected by wireless links. The routers/mobile nodes are free to move randomly and organize themselves arbitrarily; thus, the network's wireless topology may change rapidly and unpredictably. The mobile devices forming the ad hoc network can be laptops, PDAs and mobile phones, and these devices can be integrated to form an infrastructure known as a grid. In order to effectively share and use these heterogeneous resources, we visualize a grid overlay on this network. The major challenges in forming a grid over an ad hoc network are its infrastructure-less nature and the implementation of trust. Trust is calculated from the reputation of each node, and the reputation differs based on the behavior of the node for a given job. We use the existing mobile ad hoc grid architecture and add the trust computation. This allows the ad hoc grid nodes to trust the remaining nodes for the assignment of jobs. Trust management is an effective method to maintain the credibility of the system and keep entities honest.

1. Introduction

Grid computing initially focused on large-scale resource sharing, innovative applications, and the achievement of high-performance computing. Today, the Grid approach suggests the development of a distributed service environment that integrates a wide variety of resources with various quality-of-service capabilities to support scientific and business problem-solving environments. A grid service is a web service that provides a set of well-defined interfaces and follows specific conventions. When we take Grid services to mobile devices, it becomes a real challenge to deploy such core Grid services, given their requirements in terms of space and computational power, especially in the case of hand-held devices. A Mobile Ad Hoc Network (MANET) is an autonomous collection of mobile users (nodes) that communicate over

relatively bandwidth-constrained wireless links. Due to mobility, the network topology may change rapidly and unpredictably over time. The network is decentralized: network organization and message delivery must be executed by the nodes themselves, i.e., routing functionality is incorporated into the mobile nodes. We use the underlying connectivity and routing protocols that exist in ad-hoc networks to develop the Mobile Ad-Hoc Service Grid (MASGRID). MASGRID is thus a dynamic, secure, coordinated resource-sharing arrangement among mobile devices and can be referred to as a "Mobile Virtual Organization". Implementing trust evaluation over MASGRID improves the overall performance and thus increases the security of the ad hoc grid environment.

2. Proposed Work

In a traditional grid environment, trust evaluation is carried out based on the job success rate and user feedback. In MASGRID, in addition to the above factors, the reputation of a node can be evaluated based on mobility and power constraints. The power constraint can be evaluated based on the battery capability of the device considered as a resource. In MASGRID, each node acts as both a grid user and a grid resource provider; hence each node has to maintain the evaluation factors in its own database.

2.1 System Architecture

Adding trust evaluation to the Mobile Ad Hoc Service Grid for proper resource management and job submission requires the present MASGRID architecture to be reconfigured. The architecture consists of two services: 1) the Resource Discovery Service and 2) the Resource Access Service. Along with these, a watchdog service must be added.

The Resource Discovery Service (RDS) and Resource Access Service (RAS) use the underlying ad hoc network protocols for their functionality.

The Resource Lookup Table (RLT) is used to maintain the resource information accessible to a particular node.


Page 233: ADCOM 2009 Conference Proceedings

Resource Discovery Service – RDS is a service that is used to find a particular resource node using RLT.

Resource Access Service – RAS is a service that is used to execute a job on a remote grid node and keep track of the jobs submitted.

Watchdog Service is used to check the job execution status and update the trust factor in the database.

Each node maintains three tables in its database: node_info, job_sent_info and job_received_info. The node_info table contains the neighbor listing and the available resource information; the trust and mobility factor information for each neighbor node is updated in this table. The job_sent_info table contains the job submission information; each job entry contains a status flag, which shows job success, and the node to which the particular job was assigned. The job_received_info table contains the jobs accepted from neighbors; it is maintained as a queue, and job execution is carried out on a First-In First-Out basis.

Consider a node i that needs to execute its job j in the MASGRID environment. It gets the neighbor listing from its node_info table. Each node in the neighbor listing is assigned a part of job j, so node i's job_sent_info table is updated with the neighbor nodes' information, and the corresponding neighbor nodes' job_received_info tables are later updated with node i's information. The watchdog service checks the job execution feasibility at the neighbor nodes and updates the status flag of the job in both the job_sent_info and job_received_info tables.

2.2. Trust Evaluation

The trust evaluation is based on the job success rate of a node. Each node calculates the job success rate for its neighbors and updates the trust factor in its database.

Job Success Rate – the ratio of the number of jobs executed successfully by a node to the total number of jobs assigned to that node. The job success rate thus greatly influences the trust and reputation value of a node. It is calculated as in Eq. (1):

Ti = JSi / TJi    (1)

Here, Ti is the trust value of node i, JSi is the number of jobs completed successfully by node i, and TJi is the total number of jobs assigned to node i. Each node can behave either well or badly, and both behaviors have an equal probability of 0.5. Hence each node assigns this initial value to all neighbor nodes, so that a node is initially partially trusted for job submissions; later, the trust value changes based on the job success history of the particular node.
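A minimal sketch of this bookkeeping (Python; the 0.5 prior and the update rule follow the description above, while the table layout is an illustrative assumption):

    class TrustTable:
        """Per-node trust records for neighbors, per Eq. (1): T_i = JS_i / TJ_i."""

        def __init__(self):
            self.records = {}  # neighbor id -> (jobs succeeded, jobs assigned)

        def trust(self, neighbor):
            js, tj = self.records.get(neighbor, (0, 0))
            return js / tj if tj else 0.5  # 0.5 prior: initially partially trusted

        def record_job(self, neighbor, succeeded):
            js, tj = self.records.get(neighbor, (0, 0))
            self.records[neighbor] = (js + (1 if succeeded else 0), tj + 1)

    t = TrustTable()
    print(t.trust("n1"))        # 0.5 before any history
    t.record_job("n1", True)
    t.record_job("n1", False)
    print(t.trust("n1"))        # 0.5 = 1 success / 2 assigned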

2.3. Evaluation of Mobility Factor

The mobility factor is based on how mobile a node is relative to the other nodes in the neighbor list. Each node itself moves towards its destination, so a neighbor node's speed cannot be predicted merely from its displacement in a unit interval of time; the interval of a node is instead identified based on relative motion. The mobility factor is calculated as in Eq. (2):

mobility_nodej = 1 – (interval(nodej) / max_interval(nodej))    (2)

The interval function returns the units of time taken by the node to reach its destination from a source point. The max_interval function returns the maximum interval among the neighbor nodes of node j. Thus, for a node j with m neighbors, one node (the one having the maximum interval) has mobility factor 0, and all the remaining nodes have mobility factors relative to that node. The mobility factor of every node is 0 initially, and each simulation interval updates it, so the mobility factor lies in the range [0, 1]. This factor identifies the stability of the network. The random waypoint mobility model is chosen for random positioning and for keeping all nodes in random movement; this gives a relatively realistic model for mobile ad hoc networks, in which each node is free to move to any destination, as shown in Figure 1.

[Figure 1: nodes i and j at times t and t+1, with Di the distance between node i and node j at time t and Di+1 the distance at time t+1]

Using Di and Di+1, a relative velocity between node i and node j can be found and used for the computation of the mobility factor.
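As an illustration of Eq. (2) (a Python sketch under the stated definition; the interval values would come from the simulator):

    def mobility_factor(intervals, node):
        """Eq. (2): mobility = 1 - interval(node) / max_interval over neighbors.
        intervals: dict of node id -> time units to reach the destination."""
        return 1.0 - intervals[node] / max(intervals.values())

    intervals = {"A": 10, "B": 4, "C": 8}
    # "A" has the maximum interval, so its mobility factor is 0 (most stable);
    # faster-moving neighbors get factors closer to 1.
    print({n: round(mobility_factor(intervals, n), 2) for n in intervals})
    # -> {'A': 0.0, 'B': 0.6, 'C': 0.2}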


Page 234: ADCOM 2009 Conference Proceedings

2.4. Job Submission to the Next Hop

Further broadcasting of jobs can be carried out once the neighbor nodes are found not capable enough to finish the job. The broadcasting of these jobs is restricted by a hop count limit. This increases the job success rate, since each node searches for capable nodes in the later hop region.

[Figure 2: a job forwarded from node j to node i at hop 1, and on to node k at hop 2]

The scenario shown in Figure 2 occurs only when node i is not capable of finishing the job received from some node j. Node i then assigns the same job, with node j as the source node, to a neighbor, say node k, and receives the status; the status is also updated for node j. Though job submission to the next hop proved to work well for small numbers of jobs, there are some demerits for higher job counts: when the number of jobs rises, node i merely broadcasts jobs and fails to receive the later jobs from node j. Since the number of resources available for a job increases, the job success rate also increases with the hop count.

3. Simulation

The simulation of trust evaluation in MASGRID has been carried out using Java and GloMoSim, with MySQL as the database keeping the trust and mobility factors. The mobility model chosen is the random waypoint mobility model, which makes the system react similarly to a mobile ad hoc environment. The following three comparisons are carried out in the simulation:

1. Trusted vs. untrusted MASGRID
2. Trusted vs. untrusted mobility
3. Single-hop vs. next-hop job submissions

For each node, separate tables are maintained so that it can keep the trust and mobility values for its neighbors separately and assign jobs to the neighbors accordingly. The jobs are generated randomly and assigned to the given number of nodes. Each job is characterized by its instructions and input/output file sizes.

The neighbor listing for each node is updated every simulation interval based on the transmission range and the current position of each node. The ad hoc devices are identified and configurations are assigned randomly to each node; all the nodes collectively act as the ad hoc environment. For each simulation, the following values are configured:

Number of nodes = 30
Transmission range = 400
Terrain = 2000, 2000
Number of jobs = 100 – 800
Max_length, Min_length = 1000, 100
Max_file_size, Min_file_size = 1000, 10

After simulation, the performance evaluation is carried out and the graphs are plotted.

4. Results and Discussion

In Figure 3, a graph of job success rate against the number of jobs is plotted. The job success rate for a set of nodes is calculated as the cumulative job success rate of all nodes divided by the number of nodes. Suppose there are m nodes chosen for simulation and node i has job success rate JSi; then the cumulative job success rate for n jobs is given by Eq. (3):

Job Success Rate_n = (Σ i=1..m JSi) / m    (3)

So, for each increment in the number of jobs, the corresponding job success rates are calculated and plotted in all the cases. In Figure 3, simulated results are shown for MASGRID without trust vs. with trust; there is an upgrade in the performance of MASGRID when trust is considered in assigning jobs to neighbor nodes. Increasing the number of jobs reduces the job success rate for the untrusted case, whereas the trusted case improves its performance.
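For instance, Eq. (3) is just the mean of the per-node rates (Python; the rates below are illustrative):

    def cumulative_job_success_rate(per_node_rates):
        """Eq. (3): mean of the individual nodes' job success rates."""
        return sum(per_node_rates) / len(per_node_rates)

    print(cumulative_job_success_rate([0.95, 0.99, 0.90]))  # ~0.9467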

[Figure 3: Trusted vs. Untrusted MASGRID – Job Success Rate (0–100%) vs. Number of jobs (100–800), curves WITH TRUST and WITHOUT TRUST]


Page 235: ADCOM 2009 Conference Proceedings

In Figure 4, a graph of job success rate against the number of jobs shows the simulated results for mobility without trust vs. mobility with trust. Though both initially have the same performance, increasing the number of jobs gives better performance only for the trusted-mobility case. Compared to the previous analysis, this graph shows fluctuations, which stem from the mobility model's movement pattern; however, the untrusted-mobility curve never beats the trusted-mobility curve. The effect would be clearer in a more mobile environment where the nodes move more randomly.

[Figure 4: Trusted mobility vs. Untrusted mobility – Job Success Rate (0–100%) vs. Number of jobs (100–800), curves TRUSTED MOBILITY and UNTRUSTED MOBILITY]

In Figure 5, a graph of job success rate against the number of jobs shows the simulated results for first-hop vs. next-hop submission. In this graph, the next-hop values are almost twice those of the first-hop simulation; here the hop count is 2. Increasing the hop count with a proper and efficient flooding scheme improves the overall performance and hence increases the overall job success rate.

[Figure 5: Single Hop vs. Next Hop – Job Success Rate (0–120%) vs. Number of jobs (100–800), curves NEXT HOP and SINGLE HOP]

5. Conclusions

Trust implementation greatly improves the performance of MASGRID through proper resource utilization and an increased job success rate. Taking the mobility factor into consideration improves the stability of job submission and raises the confidence that nodes will remain stable until the result of an execution is received by the sender. The mobility factor also avoids submitting jobs to nodes that are found to be highly mobile and less stable; such nodes, once they receive a job, may suddenly leave the transmission range, making it almost impossible to detect their availability in the ad hoc environment. Extending job submission to the next hop increases the scalability of job submissions and avoids assigning jobs to inefficient nodes again.


Page 236: ADCOM 2009 Conference Proceedings

Scheduling Light-trails on WDM Rings

Soumitra Pal and Abhiram Ranade

Department of Computer Science and Engg., Indian Institute of Technology Bombay,

Powai, Mumbai 400076, India.

Abstract—We consider the problem of scheduling communication on optical WDM (wavelength division multiplexing) networks using the light-trails technology. We give two online algorithms which we prove to have competitive ratios O(log n) and O(log^2 n) respectively. We also consider a simplification of the problem in which the communication pattern is fixed and known beforehand, for which we give a solution using O(c + log n) wavelengths, where c is the congestion and a lower bound on the number of wavelengths needed. While congestion bounds are common in communication scheduling, and we use them in this work, it turns out that in some cases they are quite weak. We present a communication pattern for which the congestion bound is an O(log n / log log n) factor worse than the best lower bound. In some sense this pattern shows the distinguishing character of light-trail scheduling. Finally we present simulations of our online algorithms under various loads.

I. INTRODUCTION

Light-trails [1] are considered to be an attractive solution to the problem of bandwidth provisioning in optical networks. The key idea is the use of optical shutters which are inserted into the optical fiber, and which can be configured to either let the optical signal through or block it from being transmitted into the next segment. By configuring some shutters on (signal let through) and some off (signal blocked), the network can be partitioned into subnetworks in which multiple communications can happen in parallel on the same light wavelength. In order to use the network efficiently, it is important to have algorithms for controlling the shutters.¹

In this paper we consider the simplest scenario: two fiber optic rings, one clockwise and one anticlockwise, passing through a set of some n nodes, where typically n < 20 because of technological considerations. At each node of a ring there are optical shutters that can be used either to block or to forward the signal on each possible wavelength. The optical shutters are controlled by an auxiliary network ("out-of-band channel"). It is to be noted that this network is typically electronic, and the shutter switching time is of the order of milliseconds, as opposed to optical signals which have frequencies of gigahertz.

For this setting we give three algorithms for controlling the shutters, or bandwidth provisioning. The first two consider dynamic traffic, i.e. communication requests arrive and depart in an online manner and have to be serviced as soon as they arrive. The algorithm must respond very quickly in this case.

¹Notice that in the on mode, light goes through a shutter without being first converted to an electrical signal – this is one of the major advantages of the light-trail technology.

The third algorithm considers stationary traffic. In this case, our algorithm can be allowed to take more time, because the computed configuration will be used for a long time since the traffic pattern does not change. For both problems, our objective is to minimize the number of wavelengths needed to accommodate the given traffic.²

The input to the stationary problem is a matrix B, in which B(i, j) gives the bandwidth demanded between nodes i and j. We give an algorithm which schedules this traffic using O(c + log n) wavelengths, where

c = max_k Σ_{i,j : i ≤ k < j} B(i, j)

is the maximum congestion at any link. The congestion as defined above is a lower bound, and so our algorithm can be seen to use a number of wavelengths close to the optimal. The reader may wonder why the additive log n term arises in the result. We show that there are communication matrices B for which the congestion is as small as 1, but which yet require Ω(log n / log log n) wavelengths. In some sense, this justifies the form of our result.

For the online problem, we use the notion of competitive analysis [2], [3], [4]. Specifically, we establish that our first algorithm is O(log n)-competitive, i.e. it requires at most an O(log n) factor more wavelengths as compared to the best possible algorithm, including an unrealistic algorithm which is given all the communication requests in advance. A multiplicative O(log n) factor might be considered too large to be relevant for practice (and indeed it is an important open problem whether a lower factor can be proved); however, the experience with online algorithm design is that such algorithms often give good hints for designing practical algorithms. We establish that our second algorithm for the online problem is O(log^2 n)-competitive; nevertheless we mention it because it is a simplified version of the first algorithm and it seems to perform better in our simulations.

That brings us to our final contribution: we simulate two algorithms based on our online algorithms for some traffic models. We compare them to a baseline algorithm which keeps the optical shutter switched off at only one node for each wavelength. Note that at least one node should switch off its optical shutter, otherwise the light signal will interfere with itself after traversing around the ring. We find that, except for the case of very low traffic, our algorithms are better than the

²If our analysis indicates that some λ wavelengths are needed while only λ₀ are available, then effectively the system will have to be slowed down by a factor λ/λ₀. This is of course only one formulation; there could be other formulations which allow requests to be dropped and analyse what fraction of requests are satisfied.

Page 237: ADCOM 2009 Conference Proceedings

baseline. For very local traffic, our algorithms are in fact much superior.

The rest of the paper is organized as follows. We begin in Section II by comparing our work with previous related work. In Section III we give the details of our algorithm for the stationary problem. Section IV gives an example instance of the stationary problem where the congestion lower bound is weak. We describe our two algorithms for the online problem in Section V. We give results of simulation of our online algorithms in Section VI.

II. PREVIOUS WORK

Our problem as formulated is in fact similar to the problem of reconfigurable bus architectures [5], [6]. These have been proposed for standard electrical communication; like the optical shutter in light-trails, there is a switch which connects one segment of the bus to another, and which can be set on or off. Again, even in this model, the switches are slow as compared to the data rates on the buses. So from an abstract viewpoint this model is very similar to ours.

While there is much work in the reconfigurable bus literature, it mostly concerns regular interconnection patterns, such as those arising in matrix multiplication, list ranking and so on [7], [8], [9], [10]. The only work we know of dealing with random communication patterns is in relation to the PARBUS architecture. Such patterns are handled using standard techniques such as Chernoff bounds [11]. We do not know of any work which discusses how to schedule arbitrary irregular communication patterns in this setting. This is probably understandable because reconfigurable bus architectures have mostly been motivated as special purpose computers, except for the PRAM simulation motivation of PARBUS where the communication becomes random. However, if the network is used for general purpose computing, it does make sense to have algorithms to provision bandwidth for arbitrary irregular patterns, as we do here.

After the light-trail technology was introduced in [1], much work has been published in the literature. For example, [12] has a mesh implementation of light-trails for general networks. The paper [13] implements a tree-shaped variant of light-trails, called clustered light-trails, for general networks. The paper [14] describes 'tunable light-trails' in which the hardware at the beginning works just like a simple light-path but can later be tuned to act as a light-trail. There is some preliminary work on multi-hop light-trails [15] in which transmissions are allowed to go through a sequence of overlapping light-trails. Survivability in case of failures is considered in [16] by assigning each transmission request to two disjoint light-trails.

Even with this basic hardware implementation, there are different works solving different design problems. Several objectives are mentioned in the seminal paper [17] – to minimize the total number of light-trails used, to minimize queuing delay, to maximize network utilization, etc. Most of the work in the literature seems to solve the problem by minimizing the total number of light-trails used [18], [19], [20], [21]. Though the

paper [19] suggests that minimizing the total number of light-trails also minimizes the total number of wavelengths, this may not always be true. For example, consider a transmission matrix in which B(1, 2) = B(3, 4) = 0.5 and B(2, 3) = 1. To minimize the total number of light-trails used, we create two light-trails on two different wavelengths: transmission (2, 3) is put in one light-trail and transmissions (1, 2) and (3, 4) are put in the other light-trail. On the other hand, to minimize the total number of wavelengths, we put each of them in a separate light-trail on a single wavelength. We believe that minimizing the number of light-trails (while fixing the number of wavelengths) is an appropriate objective for the online case, where this is a measure of the work done by the scheduler. In our opinion, for the stationary problem, the number of wavelengths is a better measure. There are a few other models as well, e.g. [22] minimizes the total number of transmitters and receivers used in all light-trails.

The general approach followed in the literature to solve the stationary problem is to formulate the problem as an integer linear program (ILP) and then to solve the ILP using standard solvers. The papers [18], [19] give two different ILP formulations. However, solving these ILP formulations takes prohibitive time even at moderate problem sizes since the problem is NP-hard. To reduce the time to solve the ILP, the paper [20] removed some redundant constraints from the formulation and added some valid inequalities to reduce the search space. However, the ILP formulation still remains difficult to solve.

Heuristics have also been used. The paper [20] solves the problem in a general network. It first enumerates all possible light-trails of length not exceeding a given limit. Then it creates a list of eligible light-trails for each transmission and a list of eligible transmissions for each light-trail. Transmissions are allocated in an order combining descending order of bandwidth requirement and ascending order of number of eligible light-trails. Among the eligible light-trails for a transmission, the one with a higher number of eligible transmissions and a higher number of already allocated transmissions is given preference. The paper [21] used another heuristic for the problem in a general network. For a ring network, [19] used three heuristics.

For the problem on a general network, [16] solves two subproblems. The first subproblem considers all possible light-trails on all the available wavelengths as bins and packs the transmissions into compatible bins with the objective of minimizing the total number of light-trails used. The second subproblem assigns these light-trails to wavelengths. The first subproblem is solved using three heuristics, and the second problem is solved by converting it to a graph coloring problem where each node corresponds to a light-trail and there is an edge between two nodes if the corresponding light-trails conflict with each other.

For the online problem, a number of models are possible. From the point of view of the light-trail scheduler, it is best if transmissions are not moved from one light-trail to another during execution, which is the model we use. It is also

Page 238: ADCOM 2009 Conference Proceedings

appropriate to allow transmissions to be moved, with some penalty. This is the model considered in [19], [23], where the goal is to minimize the penalty, measured as the number of light-trails constructed. The distribution of the transmissions that arrive is another interesting issue. It is appropriate to assume that the distribution is fixed, as has been considered in many simulation studies including our own. For our theoretical results, however, we assume that the transmission sequence can be arbitrary. The work in [19] assumes that the traffic is an unknown but gradually changing distribution. They use a stochastic optimization based heuristic which is validated using simulations. The paper [20] considers a model in which transmissions arrive but do not depart. Multi-hop problems have also been considered, e.g. [24]. An innovative idea to assign transmissions to light-trails using online auctions has been considered in [25].

A. Remarks

As may be seen, there are a number of dimensions along which the work in the literature may be classified: the network configuration, the kind of problem attempted, and the solution approach. Network configurations starting from simple linear arrays/rings [9], [19], [23] to fully structured/unstructured networks [8], [16], [18], [20], [21], [24] have been considered, both in the optical communication literature as well as the reconfigurable bus literature. The stationary problem as well as the dynamic problem has been considered, with additional minor variations in the models. Finally, three solution approaches can be identified. First is the approach in which scheduling is done using exact solutions of Integer Linear Programs [18], [19], [20]. This is useful for very small problems. For larger problems, using the second approach, a variety of heuristics have been used [16], [19], [20], [21]. The evaluation of the scheduling algorithms has been done primarily using simulations. The third approach could be theoretical. However, except for some work related to random communication patterns [11], we see no theoretical analysis of the performance of the scheduling algorithms.

In contrast, our main contribution is theoretical. We give algorithms with provable bounds on performance, both for the stationary and the online case. Our work uses the competitive analysis approach [2] for the online problem. We use techniques of approximation algorithms to solve the stationary problem. To our knowledge, this competitive analysis and approximation algorithm approach to solving the light-trail scheduling problem has not been used in the literature. We also give simulation results for the online algorithms.

III. THE STATIONARY PROBLEM

In this section, instead of considering two unidirectional rings, we consider a linear array of n nodes, numbered 0 to n−1. Communication is considered undirected. This simplifies the discussion; it should be immediately obvious that all results directly carry over to the two directed rings mentioned in the introduction.

The input is a matrix B with B(i, j) denoting the bandwidth requirement for the transmission from node i to node j, without loss of generality as a fraction of the bandwidth of a single wavelength. The goal is to schedule these in the minimum number of wavelengths w. The output must give w as well as a partitioning of each wavelength into a set of segments. The partitioning may be specified as an increasing sequence of numbers (what we refer to as a configuration) between 0 and n − 1; if u appears in the sequence it indicates that the shutter in node u is off, otherwise the shutter in node u is on. The segment between two off shutters is a light-trail. A transmission from i to j can be assigned to a light-trail L only if u ≤ i, j ≤ v, where u, v are the endpoints of the light-trail. Further, the sum of the required bandwidths assigned to any single light-trail must not exceed 1.
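To make the configuration encoding concrete, here is a minimal sketch (Python, with our own helper names) that enumerates the light-trails of a wavelength configuration and checks the two feasibility constraints just stated:

def light_trails(config):
    """A configuration is an increasing list of nodes with off shutters;
    the light-trails are the segments between consecutive off shutters."""
    return list(zip(config, config[1:]))

def valid_assignment(config, assignment):
    """assignment maps a light-trail (u, v) to a list of (i, j, b) transmissions.
    Each transmission must lie inside its light-trail, and the total bandwidth
    on any light-trail must not exceed the wavelength capacity 1."""
    trails = set(light_trails(config))
    for (u, v), transmissions in assignment.items():
        if (u, v) not in trails:
            return False
        if sum(b for (_, _, b) in transmissions) > 1:
            return False
        if any(not (u <= i and j <= v) for (i, j, _) in transmissions):
            return False
    return True

# Example: shutters off at nodes 0, 2, 4 give light-trails (0, 2) and (2, 4).
ok = valid_assignment([0, 2, 4], {(0, 2): [(0, 1, 0.5), (1, 2, 0.4)]})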

It is customary to consider two variations: non-splittable, in which a transmission must be assigned to a single light-trail, and splittable, in which a transmission can be split into two or more transmissions by dividing up the bandwidth, and the resultant transmissions can be assigned to different light-trails. Our results hold for both variations.

We will use c_l(S) to denote the congestion induced on a link l by a set S of transmissions. This is simply the total bandwidth requirement of those transmissions from S requiring to cross link l. Clearly max_l c_l(S), the maximum congestion over all links, is a lower bound on the number of wavelengths needed. We use c(S) to denote max_l c_l(S). Finally, if t is a transmission, then we abuse notation to write c_l(t), c(t) instead of c_l({t}), c({t}) for the congestion contributed by t only, which is equal to the bandwidth requirement of t.

The key observation behind our algorithm for the stationary problem is: if all transmissions go the same distance in the network, then it is easy to get a nearly optimal schedule. Thus we partition the transmissions into classes based on the length of the transmission. We then stitch back the separate schedules.

Define the length of a transmission to be the distance between the origin and the destination. Transmissions with length between 2^(i-1) (non-inclusive) and 2^i (inclusive) are said to belong to the ith class, where 0 ≤ i ≤ ⌈log₂(n − 1)⌉.

Let R denote the set of all transmissions, and R_i the set of transmissions in class i. Class 0 is served simply by putting shutters off at every node. Clearly, ⌈c(R_0)⌉ wavelengths will suffice for the splittable case, and twice that many for the non-splittable (using ideas from bin-packing [26]). For R_1 also it is easily seen that O(⌈c(R_1)⌉) wavelengths will suffice. So for the rest of this paper we only consider classes 2 and larger.

A. Schedule Transmissions of Class i

Our aim is to partition R_i further into sets S_0, S_1, . . ., each with congestion at most some constant value, so that overall it does not take many wavelengths. We start with T_0 = R_i, and in general, given T_j, we construct S_j and T_{j+1} = T_j \ S_j as follows:

We add transmissions greedily into S_j starting from the leftmost link l and moving right, i.e. for each l we pick transmissions crossing it and move them into S_j until we have removed

Page 239: ADCOM 2009 Conference Proceedings

at least unit congestion from c_l(T_j) or reduced c_l(T_j) to 0. Then we move to the next link. So, at the end the following condition holds:

∀l : c_l(S_j) = c_l(T_j) if c_l(T_j) ≤ 1, and c_l(S_j) ≥ 1 otherwise.    (1)

However, to make sure that c(S_j) is not large, we move transmissions back from S_j into T_j, in the reverse order in which they were added, so long as condition (1) remains satisfied. Now we claim the following:

Lemma 1. Construction of S_j, T_{j+1} from T_j takes polynomial time and c(S_j) < 4.

Proof: For the first part, it can be seen that the construction takes at most n|T_j| time in the pick-up step and also in the move-back step.

For the second part, at the end of the move-back step, for any transmission t ∈ S_j there must exist a link l such that c_l(S_j) < 1 + c(t), otherwise t would have been removed. We call l a sweet spot for t. Since c(t) ≤ 1, we have c_l(S_j) < 2 for any sweet spot l.

Now consider any link x. Of the transmissions through x, let L_1 (L_2) denote transmissions having their sweet spot on the left (right) of x. Consider y, the rightmost of the sweet spots of transmissions t ∈ L_1. Note first that c_y(S_j) < 2. Also, all transmissions in L_1 pass through both x and y. Thus c_x(L_1) = c(L_1) = c_y(L_1) ≤ c_y(S_j) < 2. Similarly, c_x(L_2) < 2. Thus c_x(S_j) = c_x(L_1) + c_x(L_2) < 4. But since this applies to all links x, c(S_j) < 4.
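The pick-up/move-back construction can be transcribed almost directly; the following sketch (our own code, transmissions as (i, j, b) triples with i < j, where link l joins nodes l and l+1) is one way to realize it:

def congestion(S, l):
    """Congestion induced on link l by the set S of transmissions."""
    return sum(b for (i, j, b) in S if i <= l < j)

def cond1(S, T, n):
    """Condition (1): for every link, c_l(S) = c_l(T) if c_l(T) <= 1,
    and c_l(S) >= 1 otherwise."""
    for l in range(n - 1):
        ct, cs = congestion(T, l), congestion(S, l)
        if (ct <= 1 and abs(cs - ct) > 1e-9) or (ct > 1 and cs < 1):
            return False
    return True

def build_Sj(T, n):
    """One round of the construction: returns (S_j, T_{j+1}) from T_j."""
    remaining, S, order = list(T), [], []
    for l in range(n - 1):                       # pick-up, left to right
        while congestion(S, l) < 1 and congestion(remaining, l) > 0:
            t = next(x for x in remaining if x[0] <= l < x[1])
            remaining.remove(t)
            S.append(t)
            order.append(t)
    for t in reversed(order):                    # move-back, reverse order
        trial = [x for x in S if x is not t]
        if cond1(trial, T, n):
            S = trial
            remaining.append(t)
    return S, remaining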

Next we show that not too many S_j will be constructed.

Lemma 2. Suppose S_j is created for class i. Then j ≤ c(R_i).

Proof: Suppose S_j contains a transmission that uses some link l. The construction process above must have removed at least unit congestion from l in every previous step 0 through j − 1. Hence j ≤ c_l(R_i) ≤ c(R_i).

Every transmission in S_j has length at least 2^(i-1) + 1, and must cross some node whose number is a multiple of 2^(i-1). The smallest-numbered such node is called the anchor of the transmission. The trail-point of a transmission is the rightmost node numbered with a multiple of 2^(i-1) that is on the left of the anchor. If the transmission has its trail-point at a node with number of the form t·2^(i-1), then we define t mod 4 as its phase.
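For concreteness, a small sketch computing the class, anchor, trail-point and phase of a transmission under these definitions (we read the anchor as the smallest multiple of 2^(i-1) strictly to the right of the origin; the helper names are ours):

import math

def classify(src, dst):
    """Class, anchor, trail-point and phase of a transmission (src, dst),
    src < dst. Classes 0 and 1 are handled separately in the paper."""
    length = dst - src
    i = max(1, math.ceil(math.log2(length)))   # length in (2^(i-1), 2^i]
    assert i >= 2, "classes 0 and 1 are scheduled separately"
    step = 2 ** (i - 1)
    anchor = (src // step + 1) * step          # crossed, since length >= step + 1
    trail_point = anchor - step                # rightmost multiple left of anchor
    phase = (trail_point // step) % 4
    return i, anchor, trail_point, phase

# Example: classify(5, 12) -> class 3 (length 7), anchor 8, trail-point 4, phase 1.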

Lemma 3. The set S_j can be scheduled using O(1) wavelengths.

Proof: We partition S_j further into sets S_j^p containing transmissions of phase p. Note that the transmissions in any S_j^p either overlap at their anchors, or do not overlap at all. This is because if two transmissions in S_j^p have different anchors, then these two anchors are at least 2^(i+1) distance apart. Since the length of a transmission is at most 2^i, the two transmissions cannot intersect.

So for the set S_j^p, consider 4 wavelengths, each having shutters off at nodes numbered (4q + p)·2^(i-1). Clearly, for the splittable case, the transmissions will be accommodated in these wavelengths, since c(S_j^p) < 4. For the non-splittable case, 8 wavelengths will suffice, using standard bin packing ideas [26].

Thus all of S_j can be accommodated in at most 16 wavelengths for the splittable case, and at most 32 wavelengths for the non-splittable case.

Theorem 4. The entire set R_i can be scheduled such that at each link x there are O(c_x(R_i) + 1) light-trails.

Proof: We first consider the light-trails as constructed in Lemma 3. In that construction, it is possible that some light-trails contain links that are not used by any of the transmissions associated with the light-trail. In such cases we shrink the light-trails by removing the unused links (which can only be near either end of the light-trail, because all transmissions assigned to a light-trail overlap at their anchor).

Let j be the largest such that x has a transmission from S_j. Then we know that c_x(R_i) ≥ j. For each k = 0, 1, . . . , j we have O(1) light-trails at x as described above. Thus we have a total of O(j + 1) = O(c_x(R_i) + 1) light-trails at x.

B. Merge Light-trails of All Classes Together

If we simply collect together the wavelengths as allocated above, we would get a bound of O(c log n). Note, however, that if two transmissions, one in class i and the other in class j, are spatially disjoint, then they could possibly share the same wavelength. Given below is a systematic way of doing this, which gets us the sharper bound.

We know that at each link l there are a total of O(c_l(R_i) + 1) light-trails for class i. Thus the total number of light-trails at l is O(c_l(R) + log n), summing over all classes.

Think of each light-trail as an interval, giving us a collection of intervals such that any link l has at most O(c_l(R) + log n) = O(c + log n) intervals. Now this collection of intervals can be colored using O(c + log n) colors. So we put all the intervals of the same color in the same wavelength.
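The last step is ordinary interval coloring, for which the classic left-to-right greedy sweep suffices; a sketch:

import heapq

def color_intervals(intervals):
    """Greedy interval coloring by a left-to-right sweep: intervals with the
    same color are pairwise disjoint, so each color class fits on one
    wavelength. The number of colors equals the maximum overlap."""
    colored = []
    free, busy, ncolors = [], [], 0        # busy: heap of (right end, color)
    for (u, v) in sorted(intervals):
        while busy and busy[0][0] <= u:    # recycle colors of finished intervals
            heapq.heappush(free, heapq.heappop(busy)[1])
        color = heapq.heappop(free) if free else ncolors
        if color == ncolors:
            ncolors += 1
        colored.append(((u, v), color))
        heapq.heappush(busy, (v, color))
    return colored, ncolors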

IV. ON THE CONGESTION LOWER BOUND

We now consider an instance of the stationary problem. For convenience, we assume m = n − 1 = 2^k for some k, and take all logarithms to base 2. All the transmissions have the same bandwidth requirement α = 1/(log m + 1).

First, we have a transmission going from 0 to 2^k. Then a transmission from 0 to 2^(k-1) and a transmission from 2^(k-1) to 2^k. Then 4 spanning one-fourth the distance, and so on. Thus we have transmissions of log m + 1 classes, each class having transmissions of the same length. In class i ∈ {0, 1, . . . , log m} there are 2^i transmissions B(s_ij, d_ij) = α, where s_ij = jm/2^i, d_ij = (j + 1)m/2^i and j = 0, 1, . . . , 2^i − 1. All other entries of B are 0. This is illustrated in Fig. 1 for n = 17.
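A generator for this instance, following the construction above (a sketch; k and the tuple representation are ours):

def hard_instance(k):
    """n = 2^k + 1 nodes; class i holds 2^i transmissions of length m / 2^i,
    each of bandwidth alpha = 1 / (k + 1), exactly as described above."""
    m = 2 ** k
    alpha = 1.0 / (k + 1)
    transmissions = []
    for i in range(k + 1):
        span = m // (2 ** i)
        for j in range(2 ** i):
            transmissions.append((j * span, (j + 1) * span, alpha))
    return transmissions

# For k = 4 (n = 17) this yields the instance of Fig. 1: congestion is exactly 1
# on every link, yet, as argued below, 3 wavelengths are needed.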

Clearly the congestion of this pattern is uniformly 1. Consider an optimal solution. There has to be a light-trail in which the first transmission, from node 0 to m, is scheduled. Thus we must have a wavelength with no off shutters except at node 0 and node m. In this wavelength, it is easily seen

Page 240: ADCOM 2009 Conference Proceedings

Fig. 1. An example instance where the congestion bound is weak (n = 17; nodes 0–16, classes i = 0, . . . , 4).

that the longest transmissions should be scheduled. So we start assigning transmissions of the first few classes to this light-trail. Suppose all the transmissions of classes 0 through l are assigned. Then we have a total of 1 + 2 + 4 + · · · + 2^l = 2^(l+1) − 1 transmissions assigned to this light-trail. The total bandwidth requirement of these transmissions must be less than 1. This gives us (2^(l+1) − 1) · (1/(log m + 1)) ≤ 1, implying l ≤ log(log m + 2) − 1 ≈ log log m.

For the subsequent classes of transmissions, we allocate a new wavelength and create light-trails by putting shutters off at nodes numbered multiples of m/2^(l+1). It can be seen that again the transmissions of about the next log log m classes can be put in these light-trails. We repeat this process until all transmissions are assigned.

In each wavelength we assign transmissions of log log m classes. There are in total (1 + log m) classes. Thus the total number of wavelengths needed is ⌈(1 + log m)/log log m⌉ = O(log n / log log n), rather than the congestion bound of 1. For the example in Fig. 1, using this procedure, we have log log m = 2. Thus we require ⌈(1 + log m)/log log m⌉ = 3 wavelengths. The first wavelength is used for the transmissions of classes 0 and 1, the second wavelength for classes 2 and 3, and the third for class 4.

V. THE ONLINE PROBLEM

In the online case, the transmissions arrive dynamically. An arrival event has parameters (s_i, d_i, r_i), respectively giving the origin, destination, and bandwidth requirement of an arriving transmission request. The algorithm must assign such a transmission to a light-trail L such that s_i, d_i belong to the light-trail, and at any time the total bandwidth requirement of the transmissions assigned to any light-trail is at most 1. A departure event marks the completion of a previously scheduled transmission. The corresponding bandwidth is released and becomes available for future transmissions. The algorithm must make assignments without knowing about subsequent events.

Unlike the stationary problem, the congestion at any link may change over time. Let c_lt(S) denote the congestion induced on a link l at time t by a set S of transmissions. This is simply the total bandwidth requirement of those transmissions from S requiring to cross link l at time t. The congestion bound c(S) is max_l max_t c_lt(S), the maximum congestion over all links over all time instants.

For the online problem, we present two algorithms, SEPARATECLASS and ALLCLASS, having competitive ratios O(log n) and O(log^2 n) respectively. They are inspired by the analysis of the algorithm for the stationary problem, as may be seen.

In both online algorithms, when a transmission request arrives, we first determine its class i and trail-point x (defined in Section III-A). The transmission is allocated to some light-trail with end nodes x and x + 2^(i+1). However, the algorithms differ in the way a light-trail is chosen from the candidate light-trails.

A. Algorithm SEPARATECLASS

In this algorithm, every allocated wavelength is assigned a class label i and a phase label p, and has shutters off at nodes (4q + p)·2^(i-1) for all q, i.e. it is configured to serve only transmissions of that class and phase. Whenever a transmission of class i and phase p is to be served, it is only served by a wavelength with the same labels. If such a wavelength is found, and the light-trail starting at its trail-point has space, then the transmission is assigned to that light-trail. If no such wavelength is found, then a new wavelength is allocated, and labeled and configured as above.

When a transmission finishes, it is removed from its associated light-trail. The wavelength can be relabeled only when there are no transmissions in any of its light-trails.
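In outline, the allocation step of SEPARATECLASS can be sketched as follows (Python for illustration; Wavelength, the load bookkeeping and the function names are ours, not the paper's):

class Wavelength:
    """A wavelength labeled with (class, phase); load[x] is the bandwidth
    currently assigned to the light-trail whose trail-point is x."""
    def __init__(self, cls, phase):
        self.cls, self.phase, self.load = cls, phase, {}

def separate_class_assign(wavelengths, cls, phase, trail_point, r):
    """Reuse a wavelength with matching labels whose light-trail has room,
    else allocate a fresh labeled wavelength (as described above)."""
    for w in wavelengths:
        if (w.cls, w.phase) == (cls, phase) and w.load.get(trail_point, 0) + r <= 1:
            w.load[trail_point] = w.load.get(trail_point, 0) + r
            return w
    w = Wavelength(cls, phase)
    w.load[trail_point] = r
    wavelengths.append(w)
    return w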

Lemma 5. Suppose, at some point of time, among the wavelengths allocated by the algorithm, x wavelengths had non-empty light-trails of the same class and phase across a link l. Then l must have congestion Ω(x) at some instant.

Proof: Number these wavelengths in the order in which they got allocated. Suppose the xth one was allocated due to a transmission t. This could only happen because t could not fit in the first x − 1 wavelengths.

For the splittable case this can only happen if the previous x − 1 wavelengths contained congestion at least x − 1 − c(t) at the anchor of t when t arrived. But this is Ω(x), giving us the result.

For the non-splittable case, suppose that c(t) ≤ 0.5. Then each of the first x − 1 light-trails must have congestion of at least 0.5 when t arrived, giving congestion Ω(x). So suppose c(t) > 0.5. Let k be the largest such that wavelength k contains a transmission t′ with c(t′) ≤ 0.5. If no such k exists, then clearly the congestion is Ω(x). If k exists, then all the wavelengths higher than k have congestion at least 0.5 when t arrived. And the wavelengths lower than k had congestion

Page 241: ADCOM 2009 Conference Proceedings

at least 0.5 when t′ arrived. So at one of the two time instants the congestion must have been Ω(x).

Theorem 6. SEPARATECLASS is O(log n)-competitive.

Proof: Suppose that SEPARATECLASS uses w wavelengths. We will show that the best possible algorithm (including off-line algorithms) must use at least Ω(w/log n) wavelengths.

Consider the time at which the wth wavelength was allocated. At this time w − 1 wavelengths are already in use, and of these, w′ = (w − 1)/(4 log n) must have the same class and phase. Among these w′ wavelengths consider the one which was allocated last, to accommodate some light-trail L serving some newly arrived transmission. At that time, each of the previously allocated w′ − 1 wavelengths was nonempty in the extent of L. By Lemma 5, the congestion is Ω(((w − 1)/(4 log n)) − 1) = Ω(w/log n). This is a lower bound on any algorithm, even off-line.

B. Algorithm ALLCLASS

This is a simplification of the previous algorithm in that allocated wavelengths are not labeled. When a transmission arrives, if a light-trail of its class and phase capable of including it is found, then the transmission is assigned to it. If no such light-trail is found, then an attempt is made to create such a light-trail from the unused portions of any of the existing wavelengths. If such a light-trail can be created, then it is created and the transmission is placed in it. Otherwise a new wavelength is allocated, the required light-trail is created, and the rest of the wavelength is considered unused.

When a transmission finishes, it is removed from its associated light-trail. If this makes the light-trail empty, then we consider it unused. Then we check if there are adjacent unused light-trails on the same wavelength. If so, we merge them by turning on the off shutter between them.
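The departure handling just described amounts to coalescing adjacent unused segments; a small sketch (our own representation of a wavelength as (u, v, in_use) segments):

def merge_unused(trails):
    """Merge adjacent unused light-trails on one wavelength. trails is a
    sorted list of (u, v, in_use) segments; merging two unused neighbours
    corresponds to turning the shutter at their shared endpoint back on."""
    merged = [trails[0]]
    for (u, v, in_use) in trails[1:]:
        pu, pv, p_in_use = merged[-1]
        if not in_use and not p_in_use and pv == u:
            merged[-1] = (pu, v, False)      # shutter at node u turned on
        else:
            merged.append((u, v, in_use))
    return merged

# Example: [(0, 4, False), (4, 8, False), (8, 12, True)]
#       -> [(0, 8, False), (8, 12, True)]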

Theorem 7. ALLCLASS is O(log^2 n)-competitive.

Proof: Suppose ALLCLASS uses w wavelengths. We will show that an optimal algorithm will use at least Ω(w/log^2 n). Clearly, we may assume w = Ω(log^2 n).

We first prove that there must exist a point of time in the execution of ALLCLASS at which there are w/(4 log n) light-trails crossing the same link.

Number the wavelengths in the order of allocation. Consider the transmission t for which the wth wavelength was allocated for the first time. Let L be the light-trail used for t. Clearly, the wth wavelength had to be allocated because the other wavelengths contained light-trails overlapping with L. Of these, if at least w/(4 log n) light-trails crossed either end of L, then we are done. If this fails, there must be at least w′ = w − 1 − w/(2 log n) wavelengths that have light-trails which are strictly contained inside the extent of L. Let L′ be the light-trail allocated on the w′th of these wavelengths. Note that L′ is strictly smaller than L. Thus we can repeat the above argument using L′ and w′ in place of L and w respectively, at most log n times, and if we fail each time, we will end up with a light-trail L′′ such that there are at least w′′ wavelengths with light-trails conflicting with L′′, where w′′ = w − log n − log n·(w/(2 log n)) = w/2 − log n ≥ w/(4 log n) for w = Ω(log^2 n). But L′′ is a single link, and so we are done.

Of these w/(4 log n) light-trails, at least w/(16 log^2 n) must have the same class and phase. But then Lemma 5 applies, and hence there is a link having congestion at least w/(16 log^2 n). But this is a lower bound on the number of wavelengths required by any algorithm, including an offline algorithm.

VI. SIMULATIONS

We simulate our two online algorithms and a baseline algorithm on a pair of oppositely directed rings, with nodes numbered 0 through n − 1 clockwise.

We use slightly simplified versions of the algorithms described in Section V (but easily seen to have the same bounds): basically we only use phases 0 and 2. Any transmissions that would go into a class i, phase 1 (or phase 3) light-trail are contained in some class i + 1 light-trail (of phase 0 or 2 only), and are put there. We define a class i, phase 0 light-trail to be one that is created by putting off shutters at nodes j·n/2^i for different j, suitably rounding when n is not a power of 2. A light-trail with class i and phase 2 is created by putting off shutters at nodes j·n/2^i + n/2^(i+1), again rounding suitably. For ALLCLASS, there is a similar simplification. Basically, we use light-trails having end nodes at j·n/2^i and (j + 1)·n/2^i, or at j·n/2^i + n/2^(i+1) and (j + 1)·n/2^i + n/2^(i+1). As before, in SEPARATECLASS, we require any wavelength to contain light-trails of only one class and phase; whereas in ALLCLASS, a wavelength may contain light-trails of different classes and phases.
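The shutter placement of the simplified simulation can be written down directly (the rounding choice is ours, since the text only says "suitably rounding"):

def shutter_nodes(n, cls, phase):
    """Off-shutter positions for a class `cls`, phase 0 or 2 wavelength,
    following the description above; a sketch."""
    assert phase in (0, 2)
    step = n / 2 ** cls
    offset = 0 if phase == 0 else n / 2 ** (cls + 1)
    return sorted({round(j * step + offset) % n for j in range(2 ** cls)})

# Example: n = 20, class 2, phase 0 -> shutters off at nodes [0, 5, 10, 15].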

For the baseline algorithm, in each ring we use a single off shutter at node 0. Transmissions from lower-numbered nodes to higher-numbered nodes use the clockwise ring, and the others the counterclockwise ring.

A. The simulation experiment

A single simulation experiment consists of running the algorithms on a certain load, characterized by parameters λ, D, r_min and α, for 100 time steps. In our results, each data-point reported is the average of 150 simulation experiments with the same load parameters.

In each time step, all nodes j that are not busy transmitting generate a transmission (j, d_j, r_j) active for t_j time units. After that the node is busy for t_j steps, after which it generates another transmission as before. The transmission duration t_j is drawn from a Poisson distribution with parameter λ. The destination d_j of a transmission is picked using the distribution D discussed later. The bandwidth is drawn from a modified Pareto distribution with scale parameter 100 × r_min and shape parameter α. The modification is that if the generated bandwidth requirement exceeds the wavelength capacity 1, it is capped at 1.
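A sketch of this load generator (Python standard library only; the scale parameter 100 × r_min is taken literally from the text, and the destination distribution shown is Uniform):

import math, random

def poisson(lam):
    """Sample a Poisson(lam) variate (Knuth's method; fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def generate_request(j, n, lam, r_min, alpha):
    """One transmission from node j: uniform destination, capped-Pareto
    bandwidth, Poisson duration, per the load model described above."""
    d = random.choice([x for x in range(n) if x != j])
    r = min(1.0, 100 * r_min * random.paretovariate(alpha))  # cap at capacity 1
    t = poisson(lam)
    return (j, d, r, t)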

We experimented with α = 1.5, 2, 3 and λ = 0.01, 0.1, but report results only for α = 1.5 and λ = 0.01; results for the other values are similar. We tried four values, 0.01, 0.1, 0.25

Page 242: ADCOM 2009 Conference Proceedings

and 0.5, for r_min. Here we report the results for r_min = 0.01 and 0.5. We considered four different distributions D for selecting the destination node of a transmission. 1) Uniform: we select a destination uniformly at random from the n − 1 nodes other than the source node. 2) UniformClass: we first choose a class uniformly from the ⌈log(n/2)⌉ + 1 possible classes. It should be noted that there can be a destination at a distance of at most n/2 in any direction, since we schedule along the direction requiring the shortest path. 3) Bimodal: first we randomly choose one of two possible modes. In mode 1, a destination is selected from the two immediate neighbors, and in mode 2, a destination is chosen uniformly from the nodes other than the two immediate neighbors. For applications where transmissions are generated by structured algorithms, local traffic, i.e. unit or short distances (e.g. √n for mesh-like communications), would dominate. Here, for simplicity, we create a bimodal traffic which is a mixture of completely local and completely global. 4) ShortPreferred: we select destinations at shorter distances with higher probability. In fact, we first choose a class i in the range 0, . . . , ⌈log(n/2)⌉ with probability 1/2^(i+1) and then select a destination uniformly from the possible destinations in that class. We report the results only for the distributions Uniform and Bimodal.

B. Results

Fig. 2 shows the results for the 4 load scenarios. For each scenario, we report the number of wavelengths required by the 3 algorithms and the measured congestion as defined in Section V. Each data-point is the average of 150 simulations (each of 100 time steps) with the same parameters on rings having n = 5, 6, . . . , 20 nodes. We say that the two scenarios corresponding to r_min = 0.01 have low load and the remaining two scenarios (r_min = 0.5) have high load.

For low load, the baseline algorithm outperforms our algorithms. At this level of traffic, it does not make sense to reserve different light-trails for different classes. However, as the load increases, our algorithms outperform the baseline algorithm.

For the same load, it is also seen that our algorithms become more effective as we change from the completely global Uniform distribution to the more local Bimodal distribution. This trend was also seen with the other distributions we experimented with.

Though we could not show analytically that ALLCLASS is always better than SEPARATECLASS, our simulations show that ALLCLASS performs better. It may be noted that our algorithm for the stationary case mixes up the light-trails of different classes, which suggests that ALLCLASS might work better.

VII. CONCLUSIONS AND FUTURE WORK

It can be shown that the non-splittable stationary problem is NP-hard, using a simple reduction from bin-packing. We do not know if the splittable problem is also NP-hard. We gave an algorithm for both variations of the stationary problem which takes O(c + log n) wavelengths. It will also be useful to improve the lower bound arguments; as Section IV shows, congestion is not always a good lower bound. This may lead to a constant factor approximation algorithm for the problem.

In the online case we gave two algorithms which we proved to have competitive ratios O(log n) and O(log^2 n) respectively. In practice we found that the second algorithm was better, and showing this analytically is an important open problem.

Our online model is very conservative in the sense that once a transmission is allocated on a light-trail, the light-trail cannot be modified. However, other models allow light-trails to shrink/grow dynamically [17]. It will be useful to incorporate this (with some suitable penalty, perhaps) into our model.

It will also be interesting to devise special algorithms that work well given the distribution of arrivals.

ACKNOWLEDGMENT

We would like to thank Ashwin Gumaste for encouragement, insightful discussions and patient clearing of doubts related to light-trails.

REFERENCES

[1] I. Chlamtac and A. Gumaste, "Light-trails: A solution to IP centric communication in the optical domain," Lecture Notes in Computer Science, pp. 634–644, 2003.

[2] A. Borodin and R. El-Yaniv, Online Computation and Competitive Analysis. Cambridge University Press, New York, NY, USA, 1998.

[3] D. Sleator and R. Tarjan, "Amortized efficiency of list update and paging rules," Communications of the ACM, vol. 28, no. 2, pp. 202–208, 1985.

[4] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, "Competitive snoopy caching," Algorithmica, vol. 3, no. 1, pp. 79–119, 1988.

[5] R. Wankar and R. Akerkar, "Reconfigurable architectures and algorithms: A research survey," IJCSA, vol. 6, no. 1, pp. 108–123, 2009.

[6] K. Bondalapati and V. Prasanna, "Reconfigurable meshes: Theory and practice," in Fourth Workshop on Reconfigurable Architectures, IPPS, 1997.

[7] K. Li, Y. Pan, and S. Zheng, "Parallel matrix computations using a reconfigurable pipelined optical bus," Journal of Parallel and Distributed Computing, vol. 59, no. 1, pp. 13–30, 1999.

[8] C. Subbaraman, J. Trahan, and R. Vaidyanathan, "List ranking and graph algorithms on the reconfigurable multiple bus machine," in Parallel Processing, 1993. ICPP 1993. International Conference on, vol. 3, 1993.

[9] Y. Pan, M. Hamdi, and K. Li, "Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system," Future Generation Computer Systems, vol. 13, no. 6, pp. 501–513, 1998.

[10] Y. Wang, "An efficient O(1) time 3D all nearest neighbor algorithm from image processing perspective," Journal of Parallel and Distributed Computing, vol. 67, no. 10, pp. 1082–1091, 2007.

[11] S. Rajasekaran and S. Sahni, "Sorting, selection, and routing on the array with reconfigurable optical buses," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 11, pp. 1123–1132, 1997.

[12] A. Gumaste and I. Chlamtac, "Mesh implementation of light-trails: a solution to IP centric communication," in Proceedings of the 12th International Conference on Computer Communications and Networks, ICCCN'03, pp. 178–183, 2003.

[13] A. Gumaste, G. Kuper, and I. Chlamtac, "Optimizing light-trail assignment to WDM networks for dynamic IP centric traffic," in The 13th IEEE Workshop on Local and Metropolitan Area Networks, LANMAN'04, 2004, pp. 113–118.

[14] Y. Ye, H. Woesner, R. Grasso, T. Chen, and I. Chlamtac, "Traffic grooming in light trail networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005.

[15] A. Gumaste, J. Wang, A. Karandikar, and N. Ghani, "MultiHop Light-Trails (MLT) - A Solution to Extended Metro Networks," Personal Communication.

[16] S. Balasubramanian, W. He, and A. Somani, "Light-Trail Networks: Design and Survivability," The 30th IEEE Conference on Local Computer Networks, pp. 174–181, 2005.

Page 243: ADCOM 2009 Conference Proceedings

Fig. 2. Simulation results: number of wavelengths W against number of nodes n, for AllClass, SeparateClass, Baseline and the congestion bound, under the Uniform and Bimodal distributions, at (a) low load and (b) high load.

[17] A. Gumaste and I. Chlamtac, "Light-trails: an optical solution for IP transport [Invited]," Journal of Optical Networking, vol. 3, no. 5, pp. 261–281, 2004.

[18] J. Fang, W. He, and A. Somani, "Optimal light trail design in WDM optical networks," in IEEE International Conference on Communications, vol. 3, June 2004, pp. 1699–1703.

[19] A. Gumaste and P. Palacharla, "Heuristic and optimal techniques for light-trail assignment in optical ring WDM networks," Computer Communications, vol. 30, no. 5, pp. 990–998, 2007.

[20] A. Ayad, K. Elsayed, and S. Ahmed, "Enhanced Optimal and Heuristic Solutions of the Routing Problem in Light Trail Networks," Workshop on High Performance Switching and Routing, HPSR'07, pp. 1–6, 2007.

[21] B. Wu and K. Yeung, "OPN03-5: Light-Trail Assignment in WDM Optical Networks," in IEEE Global Telecommunications Conference, GLOBECOM'06, 2006, pp. 1–5.

[22] S. Balasubramanian, A. Kamal, and A. Somani, "Network design in IP-centric light-trail networks," in 2nd International Conference on Broadband Networks, IEEE Broadnets'05, 2005, pp. 41–50.

[23] A. Lodha, A. Gumaste, P. Bafna, and N. Ghani, "Stochastic Optimization of Light-trail WDM Ring Networks using Benders Decomposition," in Workshop on High Performance Switching and Routing, HPSR'07, 2007, pp. 1–7.

[24] W. Zhang, G. Xue, J. Tang, and K. Thulasiraman, "Dynamic light trail routing and protection issues in WDM optical networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005, pp. 1963–1967.

[25] A. Gumaste and S. Zheng, "Dual auction (and recourse) opportunistic protocol for light-trail network design," in IFIP International Conference on Wireless and Optical Communications Networks, 2006, p. 6.

[26] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation algorithms for bin packing: a survey," in Approximation Algorithms for NP-hard Problems, pp. 46–93, 1997.

Page 244: ADCOM 2009 Conference Proceedings

ADCOM 2009
FOCUSSED SESSION ON RECONFIGURABLE COMPUTING


Page 245: ADCOM 2009 Conference Proceedings

AES and ECC Cryptography Processor with Runtime Configuration

Samuel Antao, Ricardo Chaves, Leonel Sousa

Instituto Superior Tecnico/INESC-ID

Technical University of Lisbon

Email: {sfan,rjfc,las}@inesc-id.pt

Abstract—In today's society, the number of applications that require cryptographic support keeps growing. The functionality and security of these applications rely on the capability of cryptographic accelerators to provide adequate performance metrics while maintaining flexibility. In this paper a programmable cryptographic processor prototype, supporting AES and EC (Elliptic Curve) ciphering, is presented. This processor consists of up to 12 programmable processing units. We explore and present results for the dynamic reconfiguration of these processing units, allowing the runtime replacement of AES by EC units (or vice-versa) according to the application needs. By combining programmability and runtime reconfiguration, both flexibility and performance can be improved. Moreover, the reconfiguration capability allows the required hardware area to be further reduced, since these functionalities are multiplexed in time. The presented prototype is supported by a Xilinx XC4VSX35 FPGA, consisting of 6 static EC units and 6 reconfigurable AES/EC units, running simultaneously. This processor is able to cipher a 128-bit AES block in 22.9 µs and perform an EC point multiplication in 2.02 ms. The full reconfiguration of a processing unit can be achieved in less time than an EC multiplication.

I. INTRODUCTION

Currently, most applications require security and authentication services. Several protocols have been designed to provide such requirements to these applications, and they are used in a variety of devices: from smart cards, wireless sensors, cell phones, and laptops, which usually need a small number of connections, to high-end servers that have to establish thousands of connections. For such a wide variety of devices, there is also a wide range of different demands. The following highlights the key features that have to be considered:

• performance, supporting high throughput and low latency;
• low cost, using cheap platforms and mass-produced computing elements;
• compactness, allowing the coexistence of different applications in a small pool of resources;
• flexibility, allowing adjustment to different needs;
• low power, enhancing battery savings, reducing the costs in energy and heat sinks, and increasing autonomy.

Security and authentication protocols are often supported by two main types of cryptographic functions: symmetric and asymmetric. The former allows two entities that share a secret to establish a secure and confidential communication, while the latter allows two entities to create a shared secret without any previously agreed information.

Regarding asymmetric algorithms, the Elliptic Curve (EC) cryptosystem has emerged as a reliable and effective alternative to the widely used Rivest-Shamir-Adleman (RSA) algorithm. EC cryptosystems have the advantage of providing greater security per bit of the secret key. Therefore, smaller keys need to be used and stored. Consequently, more compact and bandwidth-efficient cryptosystems can be developed, since smaller keys need to be transmitted.

Although symmetric algorithms do not offer the same properties as asymmetric ones in terms of secret key construction, they are simpler, more compact, and more efficiently computed, allowing for better area and throughput metrics. Thus, their usage is mandatory for some applications. Currently, one of the most widely used algorithms for symmetric cryptography is the Advanced Encryption Standard (AES) [2].

Although both symmetric and asymmetric algorithms have been shown to provide good performance metrics, their complexity is still considerable. To overcome this problem, hardware accelerators are employed. Several accelerators have been proposed, supported on Application Specific Integrated Circuit (ASIC) solutions [3], Field Programmable Gate Arrays (FPGA) [4], Graphical Processing Units (GPU) [5], [6], and Instruction Set Architecture (ISA) extensions for general purpose processors [7]. While the flexibility of the solutions increases as we move from ASICs to general purpose solutions, the performance decreases. The ASIC approach allows for fast and low power solutions, but with limited adaptability and higher design costs. General purpose processor solutions allow for optimal programmability, but achieve relatively low performance at higher power costs. GPU solutions allow for the utilization of a large amount of parallel hardware structures at a reduced cost, because of the massive production driven by the gaming market. However, the GPU's datapath is not optimized for cryptographic procedures, the parallelism extraction for cryptography is limited, and the power consumption is significant. FPGA solutions are a compromise between the high performance/low power of the ASIC and the flexibility/low cost of general purpose


Page 246: ADCOM 2009 Conference Proceedings

processors. Moreover, FPGAs allow programmable solutions to be combined with reconfiguration capabilities, providing adaptable datapaths. FPGAs can thus be considered an advisable option to efficiently support a wider range of cryptographic algorithms and procedures.

This paper proposes a general cryptographic processor supported on an FPGA. This programmable processor was designed to take advantage of the reconfigurable capabilities of an FPGA to achieve good performance metrics and enhanced flexibility. The processor proposed in this work aims to provide support for the majority of security and authentication protocols, introducing microcoded AES and EC cores, and a true Random Number Generator (RNG), supported on oscillator rings, to generate secrets. Very few attempts have been made in the related art to combine AES and EC arithmetic into a single arithmetic body. The efficiency of such approaches is compromised by the difference in the size of the datapath (m ≥ 163 for EC versus m = 8 for AES), requiring the use of different irreducible polynomials and thus different reduction structures. Our approach is different: instead of sharing the datapath for the AES and EC arithmetic, we create individual, compact and high-performance AES and EC cores that share the same microcoded control unit. With this approach, and using the reconfiguration capabilities of the FPGA, it becomes very easy and efficient to dynamically trade AES and EC cores, depending on the requirements. A compact and flexible cryptographic processor with good performance metrics is obtained. With an RNG associated with the processing units, the secret keys of the protocols can be locally computed and directly stored in the processing units' memory. Avoiding the communication of secret keys makes the system more secure and resistant to external attacks.

The paper is organized as follows. In Section II we provide a brief introduction to AES and EC arithmetic. In Section III we present the details of the reconfigurable architecture used. In Section IV we describe the system layout for handling the runtime configuration of processing units. Section V presents results for the developed prototype, and Section VI draws some conclusions about the developed work.

II. AES AND EC CRYPTOGRAPHY

In this section we briefly introduce the arithmetic that the proposed processor supports.

A. AES arithmetic

The AES algorithm is composed of three main operations: the key expansion, the ciphering, and the deciphering. In the key expansion operation, the key, of 16, 24 or 32 bytes, is expanded in order to obtain 176, 208, or 240 bytes, depending on the initial size. This expanded key is divided into sets of 16 bytes, and each set is used in one round of the ciphering/deciphering operation. The number of rounds depends on the key size used. The key and data used in the ciphering/deciphering rounds are organized in a common way: in a 4×4-byte matrix. Each AES round affects each of these matrices' elements using the following elementary operations:

• byte additions over GF(2^8), which correspond to an 8-bit bitwise exclusive OR (XOR) operation;
• the non-linear function S(.), often called an SBox, and its inverse; this function can be computed with multiplications and inversions over GF(2^8) with the irreducible polynomial I(x) = x^8 + x^4 + x^3 + x + 1;
• data matrix multiplication with constant matrices, with the irreducible polynomial I(x);
• a matrix row rotating shift operation.

Further details about these operations and how they are applied can be found in [2].
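As an illustration of the field arithmetic behind these operations, here is a standard GF(2^8) multiplication modulo I(x) (a textbook construction, not code from the paper):

def gf256_mul(a, b):
    """Multiply two bytes over GF(2^8) modulo I(x) = x^8 + x^4 + x^3 + x + 1
    (0x11B), the kind of operation behind the SBox and column constants."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a              # conditional add (XOR) of the current multiple
        carry = a & 0x80
        a = (a << 1) & 0xFF     # multiply a by x
        if carry:
            a ^= 0x1B           # reduce by I(x) when x^8 appears
        b >>= 1
    return p

# Example: gf256_mul(0x57, 0x83) == 0xC1, the classic FIPS-197 test value.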

B. EC arithmetic

An EC over GF(2^m) is a set composed of a point at infinity O and the points P_i = (x_i, y_i) ∈ GF(2^m) × GF(2^m) that satisfy the following equation:

y_i^2 + x_i·y_i = x_i^3 + a·x_i^2 + b,    a, b ∈ GF(2^m).    (1)

By establishing the addition operation over the EC points and by applying it recursively, it is possible to obtain multiplication by a scalar. It is known to be computationally hard to invert this operation, since it is difficult to determine, from the result of the recursive addition of an EC point, how many times this point was added. This is known as the Elliptic Curve Discrete Logarithm Problem (ECDLP), which supports the security of EC cryptosystems.

The EC point addition and doubling (addition of a point to itself) are performed with operations over the underlying field GF(2^m) applied to the points' coordinates. These GF(2^m) operations are addition, multiplication, squaring and inversion, modulo an irreducible polynomial of degree m. Details about how these operations can be efficiently performed can be found in [8].

III. CRYPTOGRAPHIC PROCESSOR ARCHITECTURE AND DETAILS

In this work we developed a prototype of a cryptographic accelerator supported on reconfigurable hardware, namely a prototyping board powered by a Xilinx Virtex 4 FPGA [9]. The aim of this prototype is to support the majority of protocols that need asymmetric and symmetric cryptographic schemes, and also the secure generation of secret keys for these protocols. A schematic overview of the proposed processor organization is presented in Figure 1. The processor is composed of several processing units (PUs), responsible for computing the cryptographic procedures. An RNG is also included in order to generate the secret data (such as the private keys). The processor has an I/O interface to communicate with and receive data (public keys, plain texts, ciphered texts) to/from the main controller, which we herein call the host of the processor. This interface is also used to provide commands, such as start commands for the processing units (PUs) or write/read commands for data and instructions. Through this interface the host can query the processor for, e.g., the availability of PUs, or check whether the required tasks are already done. When the host sends a write command to any PU, it also defines the origin


Page 247: ADCOM 2009 Conference Proceedings

of the data to be written, namely external data or internal data read from the RNG. Thus, the host can use the secret information without having to touch or know it.

All the PUs run according to the microcode stored in a centralized instruction memory. For this, each PU has its own microprogram counter (µPC) and startup addresses to run and control the flow of the correct program. An arbiter controls access to the instruction memory according to a priority policy, and signals a PU when the memory retrieves a valid microinstruction for it.

Each PU contains its local data memory, which is addressed according to the received microinstructions. Input data and temporary data, as well as the final results, are stored in this local memory. This memory can be accessed by the host when the PU is set to the idle state, through specific microinstructions directly provided by the host, making it possible to read and write data from/to the PUs. The width of the data memory, as well as the details of the available arithmetic units, is customizable according to the type of the PU. Different types of PUs support different cryptographic procedures.

With this modular architecture, the PUs share the same

control through the instruction memory while facilitating the

replacement of a given PU by another one. This makes it possible to take full advantage of the reconfiguration capabilities of the electronic devices.

A. PU for AES

The architecture of the AES PU is presented in Figure 2a.

This architecture is composed of a data RAM of 512 positions,

a ROM and two adders. The ROM implements a look up table

for the non-linear function S(.) and its inverse S−1(.) (see

Section II-A). We also include in this ROM the operations

2S(x), 3S(x), 9x, 11x, 13x, and 14x, where x represents the

ROM address. With these operations, we are able to perform

the multiplications with the constant matrices.

Since the computation of the AES is performed over GF(2^8), the datapath and memory width is 8 bits. Given that x has 8 bits, each of the 8 look-up operations above requires a 256-entry table, so the ROM has 2048 entries of 8 bits.

This amount of data fits a single BRAM present in the Xilinx

Virtex 4 technology. Furthermore, since these BRAMs are dual port, the same ROM can be used for two PUs.

[Figure] Fig. 1: Processor Organization Overview (host communication interface, random number generator, microinstruction RAM with an accessing arbiter, and n processing units, each with its own data RAM, arithmetic, and control with µPC).

[Figure] Fig. 2: Architecture of the processing units: (a) AES PU; (b) EC PU.

The two adders

at the input and output of the ROM perform the required

additions for the AES arithmetic. With this architecture the

following basic operations are implemented, where L(.) is a

look-up result: R(c) = R(a) + R(b), R(c) = R(a) + L(R(b)), R(c) = L(R(a) + R(b)), and R(c) = L(R(b)), where a,

b, and c are the addresses provided by the microinstruction.

An operation to load a constant directly to the memory is

also implemented. The byte shift operations can be handled by appropriately addressing the data, since each address corresponds to one byte. Regarding the flow control of the program, three jump-related operations, plus an end operation, are implemented:

• jmpset: set the value of an indexing counter;

• jmpinc: jump if the value associated with this jump instruction matches the value in the indexing counter; the indexing counter is then incremented;

• jmpdec: similar to jmpinc, but decrements the index-

ing counter;

• end: determines the end of the program, and thus the PU

becomes idle.

It is also possible to add the indexing counter value, multiplied by 16, to the data addresses. This makes it easy to step through the 16-byte matrices in which the AES data is organized in the data BRAM, depending on the indexing

counter. All these functionalities, including the choice of the

ROM look-up and the usage of the indexing counter data in

the addresses, are mapped in microinstructions of 36 bit width.

Each microinstruction runs in 3 clock cycles: one cycle to read

the data, one cycle to read the ROM, and another cycle to

write the result.
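The following Python model illustrates the flow control just described (the indexing counter, the jump operations, end, and the +16 x counter addressing). The microinstruction encoding, the "touch" stand-in for a data operation, and the exact compare-then-increment semantics are assumptions made for this sketch; the 3-cycle timing is not modeled.

# Illustrative software model of the AES PU flow control (encoding invented).
def run_aes_pu(program, mem):
    pc, idx = 0, 0
    while True:
        op, value, target = program[pc]
        if op == "jmpset":
            idx = value                      # set the indexing counter
        elif op == "jmpinc":
            jump = (value == idx)            # jump when the values match
            idx += 1                         # counter is incremented
            if jump:
                pc = target
                continue
        elif op == "jmpdec":
            jump = (value == idx)
            idx -= 1                         # decrementing variant
            if jump:
                pc = target
                continue
        elif op == "touch":                  # stand-in for a data operation:
            mem[value + 16 * idx] ^= 0xFF    # address offset by 16 * counter
        elif op == "end":
            return mem                       # PU becomes idle
        pc += 1

state = bytearray(64)
run_aes_pu([("jmpset", 2, None),             # idx = 2
            ("touch", 0, None),              # modifies byte 0 of matrix 2
            ("jmpinc", 2, 4),                # 2 == idx, so skip ahead
            ("touch", 1, None),              # skipped by the jump
            ("end", None, None)], state)
assert state[32] == 0xFF and state[49] == 0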

B. PU for EC

The EC PU builds on our previous work presented in [8]

for polynomial basis field arithmetic. The architecture of this

compact and flexible PU is similar to the one of the AES

PU, and is depicted in Figure 2b. There is also a data BRAM

where the field elements (of size m ≥ 163) are split and

stored in 21-bit words. The arithmetic logic supports two-


Page 248: ADCOM 2009 Conference Proceedings

word operand additions and Karatsuba-Ofman

multiplications.

The microcode adopted for the EC PU can be classified

into two main microinstruction types. The complex microin-

structions (type I) are performed over field elements, while

the lower complexity microinstructions (type II) operate over

words. There is a type I reserved microinstruction that corre-

sponds to a customizable sequence of type II operations.

Type I instructions are used to access the memory (read

and write) and the key register (key), to compute the m-bit add, squaring, and reduction operations (add, sqr and red),

and to control the flow by conditionally jumping to a microin-

struction address depending on the key register or by turning

the Processing Unit (PU) to an idle state (jmp and end).

The type II instructions allow for adding and multiplying 2-word operands (eadd and emult). A customizable type I instruction (pers) is reserved, corresponding to a user-defined sequence of type II instructions, and a dedicated instruction (eret) determines the end of a type II sequence and, consequently, the end of the pers instruction.

Another jump instruction is also introduced. When a PU is

placed in the architecture, an ID is assigned to it. This jump in-

struction, called jumpid, is an unconditional jump operation,

and is only executed if the ID in the microinstruction matches the

ID assigned to the PU. If the IDs do not match, this instruction

is ignored. Introducing this instruction allows a program to use

microcode segments of other programs, since this instruction

works as a return instruction that is only considered by a

PU running a specific program. This is useful to shrink the

program sizes by running a single routine needed in different

programs, e.g., the inversion in the scalar multiplication and in

the point addition. This unit also supports an instruction that

signals the end of the program.

The functionality provided by the EC PU can be controlled

by microinstructions of 32-bit width. Since the AES core needs 36-bit instructions, the EC PU also uses 36-bit coded instructions, ignoring the 4 most significant bits. Regarding

the clock cycles required for the instructions, the jump and

word-size addition need 3 cycles, the word-size multiplication needs 5 cycles, the field-size addition needs 13 cycles, and the

reduction and squaring operations need 14 clock cycles.

C. Arbiter

The arbiter controls the access to the microinstruction

memory when there are simultaneous and pending requests.

The arbiter considers a static priority for each PU, where

all the EC PUs have higher priority than the AES PUs.

This is because the EC programs possess microinstructions

that take a larger number of clock cycles compared with the AES microinstructions. Thus, the AES microinstructions are more likely to efficiently fill the clock cycles between the EC requests than the opposite, resulting in a better efficiency of the whole system.
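A minimal sketch of this static-priority policy, assuming requests are presented as (pu_id, is_ec) pairs and that ties between PUs of the same class are broken by PU id (the tie-break rule is our assumption):

# Fixed-priority arbitration: every EC PU outranks every AES PU, and one
# microinstruction is issued per clock cycle to the winning requester.
def arbitrate(pending):
    """pending: iterable of (pu_id, is_ec) fetch requests; returns a pu_id."""
    requests = list(pending)
    if not requests:
        return None
    return min(requests, key=lambda r: (not r[1], r[0]))[0]

assert arbitrate([(0, False), (3, True), (1, True)]) == 1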

D. True Random Number Generator

A true RNG is also included to generate the secrets that

lead to the private keys. Hence, since the private keys are not communicated by the device, no entity external to the device, other than the host, is capable of obtaining them, at least without implementing sophisticated attacks, such as Differential Power Analysis attacks [10].

[Figure] Fig. 3: Random bits generator (ring oscillators with a reset signal, individually sampled and then XORed into the output random bit).

The randomness source of the RNG is the jitter of an

oscillator. In a digital device, such as an FPGA, these os-

cillators can be obtained with combinatorial rings of an odd

number of logical inverters. To obtain a random bitstream,

we can implement several of these oscillators, obtain the

logic exclusive OR for all the outputs of each oscillator, and

sample the obtained signal with a frequency lower than the

frequency of the oscillators [11]. An FPGA implementation

of such RNG was reported in [12] for an Altera Cyclone

II FPGA. The authors in [12] suggest an improvement to

the method presented in [11]. They suggest to sample the

output of each oscillator prior to the exclusive-OR operation. This

suggestion is based on the observation that the combinatorial

logic responsible for computing the exclusive-OR operation

may not have enough switching speed between events at

the inputs. For the RNG designed in this paper we followed

this suggestion, which resulted in the circuit presented in

Figure 3. We also introduced a reset signal to halt the oscillators and the random bitstream generation, reducing the power consumption when the RNG is not being used. A shift register was appended to the output of the RNG to

store the random data and allow it to be readily read.
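The structure of this generator (sample each ring first, then XOR, as suggested in [12]) can be mimicked in a few lines; the jitter model below is entirely invented and only illustrates the sampling structure, not the physical entropy source:

# Structural toy model of the TRNG: sample each ring oscillator first,
# then XOR the sampled bits.
import random

def trng_stream(n_bits, n_osc=20, cycles_per_sample=25.0, jitter=0.05):
    phase = [random.random() for _ in range(n_osc)]
    for _ in range(n_bits):
        bit = 0
        for i in range(n_osc):
            # advance by a nominal number of oscillator periods plus jitter
            phase[i] += cycles_per_sample + random.gauss(0.0, jitter)
            bit ^= int(phase[i] * 2) & 1   # sampled square-wave level
        yield bit

bits = list(trng_stream(1000))
assert set(bits) <= {0, 1}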

IV. RUNTIME RECONFIGURATION

The proposed processor is specially designed to efficiently

support runtime reconfiguration. The modularity of the pro-

cessor makes it easy to configure different processing units without affecting the behavior of the others, and fulfills the runtime needs of the host by better adapting

the computation to the protocols being used. Our design is

supported by a Xilinx Virtex 4 FPGA, allowing for the Xilinx

dynamic reconfiguration flow for this processor.

The only concern regarding the control of the dynamic

reconfiguration is related to the dummy requests placed in the arbiter by the PU under reconfiguration, due to the unexpected behavior of the PU outputs during reconfigu-

ration. To overcome this issue, the architecture contains an

enable register that can be accessed by the host. When the

host disables the PU that is going to be reconfigured, the

valid requests of that PU are ignored by the arbiter. After the


Page 249: ADCOM 2009 Conference Proceedings

reconfiguration, when the host enables the PU, a reset pulse

is generated for that PU to set it to the idle state.
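The handshake can be summarized as a short host-side sequence; the driver object and its method names below are hypothetical placeholders, not a real Xilinx API:

# Pseudo-driver for the disable/reconfigure/enable handshake (names invented).
class Prototype:
    def disable_pu(self, pu): print(f"arbiter now ignores PU {pu}")
    def icap_write(self, words): print(f"streamed {len(words)} config words")
    def enable_pu(self, pu): print(f"PU {pu} reset pulse -> idle state")

def reconfigure(board, pu, bitstream_words):
    board.disable_pu(pu)              # mask the PU's dummy arbiter requests
    board.icap_write(bitstream_words) # stream the partial bitstream
    board.enable_pu(pu)               # re-enable; hardware resets the PU

reconfigure(Prototype(), 2, [0] * 26)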

In order to support both AES and EC processing units,

the reconfigurable zones should cover the resources required

by the most demanding implementation loaded in that zone.

Between the two considered PUs, the most demanding in terms

of resources is the EC PU, due to the wider datapath (21-bit

instead of 8-bit) and greater complexity. For these reasons, the

reconfigurable zones are sized to fit an EC PU.

Since several PUs compete for access to the instruction memory, conflicts can occur, and some PUs may stall waiting for their requests to be fulfilled. The conflict penalty will increase as the number of PUs attached to one of the instruction memory ports increases. This effect has to be taken into

account when setting the number of PUs in the design and,

consequently, the number of reconfigurable zones. Each of

the AES operations requires 3 clock cycles to perform, while

an EC operation requires from 3 to 14 cycles to perform.

This means that the average number of clock cycles per instruction in the AES PUs is lower than that of the EC PUs. Thus, the

AES PUs will generate more conflicts than the EC PUs. Since

the arbiter can issue one instruction per clock cycle, only a

maximum of 3 AES PUs can ideally operate at the same time

without conflicts. Adding a fourth PU with lower priority than the

others will cause this fourth PU to stall until one of the others

finishes the ongoing computation. This observation determines

the number of the required reconfigurable zones, which is 3

per instruction memory port. Thus, the system can have up

to 3 AES PUs per instruction memory port, implemented in

the reconfigurable zones. The system can have more static

EC PUs according to the conflicts that the user admits or to

the available resources. Considering a dual-port instruction memory, the number of reconfigurable zones can be doubled to 6.

The use of dual-port memories also helps to reduce the resources used in the design of the AES PUs. With a dual-port look-up ROM, the same memory can be implemented statically outside the PUs and shared by two AES PUs, as Figure 4 suggests. Moreover, this procedure keeps the look-up information out of the configuration data, enhancing the compactness of the bitstream and the configuration speed. Another issue that has to be considered while reconfiguring the PUs is the number of signals that

cross the reconfigurable zone boundary, since the path of these

signals through the boundary has to be directly instantiated.

This instantiation, except for the clock signal when provided

by a global buffer, is performed recurring to directional slice

bus macros. These bus macros are provided with the Xilinx

ISE tools that support dynamic reconfiguration. The number

and type of the required macros is determined by the number

of PU input and output signals. Each bus macro occupies a Configurable Logic Block (CLB), which corresponds to 4

slices, and supports up to 8-bit signals. To determine the

number of bus macros, the maximum number of inputs and

maximum number of outputs in both the PUs types (AES

and EC) have to be considered.

[Figure] Fig. 4: AES PUs with shared look-up memory (two reconfigurable AES PUs sharing one static dual-port look-up ROM).

For the proposed design, a maximum of 89 input signals (⌈89/8⌉ = 12 bus macros) and 64 output signals (⌈64/8⌉ = 8 bus macros) is required, corresponding to a total of 20 bus macros.
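A one-line check of this bus-macro count (8-bit signals per macro):

import math
# Each directional slice bus macro carries up to 8 bits.
n_macros = math.ceil(89 / 8) + math.ceil(64 / 8)
assert n_macros == 12 + 8 == 20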

V. EXPERIMENTAL RESULTS AND RELATED WORK

The proposed design was successfully implemented and

experimentally tested on a prototyping board powered by

a Xilinx XC4VSX35-10 FPGA [9]. We implemented and

evaluated different combinations of the number of AES and

EC cores. These implementations refer to EC arithmetic

over GF(2^163) and AES arithmetic with a 128-bit key size.

The FPGA programming files were obtained from a VHDL

description of the hardware, synthesized with the Synplify

Premier C-2009.06 tools and Placed&Routed with the ISE

9.2.04i PR14 tools. The Virtex 4 technology supports the

handling of dynamic reconfiguration using the Internal Config-

uration Access Port (ICAP). The advantage of using this port

is the possibility to directly instantiate it and couple it with the remaining design, including the communication logic, which can write the reconfiguration bitstream directly to this port.

The Virtex 4 FPGA contains block RAMs that provide

true dual port capabilities. This allows for all the memories

employed in the design (instruction, data, and look-up) to be

dual port, saving resources. As discussed in Section IV, the

maximum number of AES PUs competing for an instruction

memory port can be up to 3. Thus, we will use 6 reconfigurable

zones that can be reconfigured with an EC or AES PU. We

also implement another 6 (3 per instruction BRAM port) static

EC PUs. Thus the design can have up to 12 PUs working

simultaneously, and up to 6 PUs can be AES PUs. The reason

for implementing only 6 static PUs is related to the slice resource constraints and to the increasing number of con-

flicts while accessing the instruction BRAM. We considered

an instruction memory with 1024 36-bit instructions to contain

all the routines for EC and AES arithmetic.

The static design contains the required logic to implement

the communication with the host, the random numbers genera-

tion, the AES look-up memories, and the 6 static EC PUs. The

required resources to implement the static design are 8,446

slices and 11 BRAMs (2 for the instruction memory, 3 for the

AES look-up ROMs, and 6 for the data storage in the 6 static

EC PUs).


Page 250: ADCOM 2009 Conference Proceedings

[Figure] Fig. 5: Processor layout with different reconfigurations (reconfigurable zones and bus macros): (a) no reconfigurable PUs; (b) AES PUs only; (c) ECC PUs only.

There are 6 reconfigurable zones in the design with rectan-

gular shape of 13 CLB width and 21 CLB height (13 × 21 × 4 = 1092 total slices). Considering the size of the Virtex 4

configuration frames which have 1 CLB width and 16 CLB

height, the reconfiguration of a reconfigurable area requires

the communication of 26 frames. The different reconfigurable

zones do not overlap the reconfiguration frames of the others.

For this, the bottom boundaries of the reconfigurable zones

are at the CLB coordinates 0, 32, and 64 (slices 0, 64, and

128). The layout of the system, as well as the bus macros

location, is depicted in Figure 5 for different contents of the

reconfigurable zones after place and routing. Each PU employs

1 BRAM for its data memory. The reconfigurable AES PUs

require 157±2 slices and the EC PUs require 943±7 slices.

The variation in the slice resources employed by each PU is

due to the slightly different placing of the resources by the

tools for the different reconfiguration zones. The occupation

of the reconfigurable zones by the PUs is 14% and 86% for

the AES and EC PUs, respectively. Although the occupation

of the reconfigurable zones is not complete, the margin of

free resources allow to enhance the routing delays. Regarding

the complete design, the required resources are 14,092 slices

(92% of the total resources) with all the reconfigurable zones

implementing EC PUs and 9,387 (61% of the total resources)

with all the reconfigurable zones implementing AES PUs.

Considering the reconfigurable zones completely occupied, the

required resources for the complete design are 14,998 slices

(98% of the complete resources). The obtained system can run

at the maximum frequency of 100.3 MHz.

The reconfiguration bitstreams were generated in com-

pressed format, using the appropriate Xilinx tools options.

The minimum and maximum sizes in 32-bit words of the runtime reconfiguration bitstreams are 30662 and 31067 for

EC PUs, and 27898 and 29500 for the AES PUs. Although the

reconfiguration area is the same for the AES and EC PUs, the

AES PUs result in approximately 5% smaller bitstreams due to

the lower utilization of resources, allowing for a slightly higher

compression. The reconfiguration time is directly correlated

with the bitstream size and the clock frequency. The ICAP in

Virtex 4 technologies allows writing a 32-bit reconfiguration

word in each clock cycle. The maximum ICAP working

frequency is 100 MHz [13], thus we expect that the maximum

reconfiguration time is 31067/100 MHz ≈ 310 µs. However, in the developed prototype, the reconfiguration bitstream is communicated from outside the device and written directly to the ICAP, so the reconfiguration time is limited by the communication process. The communication is performed

through a PCI bus, working at 33 MHz. Hence, we use the

same bus frequency to drive the ICAP, with the incoming data immediately transferred to the ICAP. With this, we obtain a maximum reconfiguration time of 31067/33 MHz ≈ 941 µs.
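Reproducing both time estimates (one 32-bit word per ICAP clock cycle):

# Worst-case bitstream of 31067 words, written one word per cycle.
words = 31067
print(words / 100e6 * 1e6)   # ~310.7 us at the 100 MHz ICAP limit
print(words / 33e6 * 1e6)    # ~941.4 us when driven by the 33 MHz PCI clock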

In the next subsection we present the results specific for the

RNG and PUs operation.

A. Random Number Generator

In order to validate the implementation of the RNGs,

random bitstreams were collected from the processor and their

randomness tested using a battery of tests. Two main batteries

of tests are used for this purpose: the National Institute of

Standards and Technology (NIST) test [14] and the Diehard

one [15]. For the implementation proposed in [12], in order to

pass both batteries of tests successfully, the authors obtained

a RNG with 25 oscillators of 3 inverters each, sampled at


Page 251: ADCOM 2009 Conference Proceedings

100 MHz. The option of using 3 inverters is justified by the

enhanced compactness of the implementation.

The randomness of the bitstream is enhanced if the number

of oscillators increases and/or the sampling frequency decreases.

For the processor herein presented, using 3 inverters per oscil-

lator, the number of required oscillators to pass both NIST and

Diehard tests at 100 MHz, which is the operating frequency

for the prototype, was shown to be 20. Each oscillator is

implemented within a CLB, resulting in a very compact RNG.

B. Processing Units

Using the proposed architecture and microcode format, we

were able to program the EC scalar multiplication and point

addition in 401 instructions, and the AES key expansion,

ciphering and deciphering in 253 instructions. The total latency

for the EC PUs is 201,661 clock cycles for the EC scalar

multiplication and 4,796 clock cycles for the point addition.

The latency for the key expansion and ciphering/deciphering

in our AES PU is 610 clock cycles and 2,290 clock cycles,

respectively.

Performance metrics for different combinations of simulta-

neously working PUs in the cryptographic processor are pre-

sented in Table I. This metrics are measured at the prototype

operating frequency, 100 MHz.

The evaluation in Table I uses 1 EC scalar multiplication

and 88 consecutive AES ciphering operations, because the

time consumption of one individual EC point multiplication

is approximately the time of 88 AES operations, allowing a

fair analysis. Although the instruction memory has two ports,

we focus our analysis on a single arbiter individually, thus one

of the instruction memory ports. This analysis holds for both

arbiters, even if the configuration of the PUs attached to them

is different.

An EC point multiplication produces a result in 2.02 ms

if no conflicts occur, thus the proposed design provides a

throughput of 496 Op/s for only one PU. For 6 EC PUs

running simultaneously, the throughput is 1,536 Op/s, which

is lower than 6 times the throughput for one PU, due to

the conflicts accessing the instructions. Performing the same

analysis for the AES arithmetic, considering the ciphering of

128 bit blocks, the proposed processor provides a throughput

from 5.6 Mbit/s for 1 PU to 16.8 Mbit/s for the 3 PUs. In

this case, the throughput of the system scales directly with

the number of PUs, since all the instructions for the 3

AES PUs competing for the instruction memory take the

same 3 clock cycles, thus no conflicts will occur. Intermediate

configurations can be useful for the dynamic requirements of

the host.

We also introduce an efficiency metric in Table I. This

efficiency captures the impact of the request conflicts resolved by the instruction memory arbiter: it measures the ratio of time used for useful computing by all the operating

PUs within a specific time interval. To perform this efficiency

measurement we programmed all the PUs to run consecutively

the same program, and after a specific time interval T mea-

sured in clock cycles the number of complete EC (nEC) and

AES (nAES) operations were counted. The efficiency (E) is

given by:

E = (n_EC · T_EC + n_AES · T_AES) / (n_PU · T); (2)

where T_EC and T_AES are the times of a single EC and a single AES operation without memory-access conflicts, measured in clock cycles, respectively, and n_PU is the number of

active PUs. From Table I, it can be observed that the efficiency

is very close to 100% for configurations with less than 4 PUs.

This result arose from the fact that an instruction takes at

least 3 clock cycles to complete, thus the number of conflicts

in the arbiter is negligible. Moreover, for the other

configurations the efficiency is always greater than 61%.
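Equation (2) transcribes directly into code; the default operation times below are the conflict-free latencies reported earlier (201,661 cycles for an EC scalar multiplication and 2,290 cycles for an AES ciphering):

# Direct transcription of Eq. (2); all times are in clock cycles.
def efficiency(n_ec, n_aes, t_ec=201_661, t_aes=2_290, n_pu=1, t=1):
    return (n_ec * t_ec + n_aes * t_aes) / (n_pu * t)

# One EC PU finishing one conflict-free scalar multiplication in exactly
# T_EC cycles yields E = 1.0, matching the 100% rows of Table I.
assert efficiency(1, 0, n_pu=1, t=201_661) == 1.0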

Comparing the presented results with the related work is

not straightforward, since different technologies and different

metrics are used by different authors. Nonetheless, we intro-

duce some related art results to comparatively evaluate our

design.

In [4] a compact AES/EC design is proposed, supported on

a Xilinx Virtex XCV800 platform running at 41 MHz. Several

Logical Units (LUs) that support the basic field operations over

GF(2^8) are organized in two reconfigurable modes: a Single-

Instruction-Multiple-Data (SIMD) mode that support the AES

arithmetic, and a Single-Instruction-Single-Data (SISD) mode

that supports the EC arithmetic. This design does not support

simultaneous EC and AES arithmetic, since the LUs must

be reconfigured to reuse resources. It offers a

throughput of 3.8 Mbit/s for the AES ciphering (128 bit key),

and a point multiplication (in GF(2^163)) latency of 5.36 ms.

Our AES throughput when using one PU is 5.5 Mbit/s (1.4

times higher) and the latency for the EC point multiplication

is 2.02 ms (2.65 times lower). The design in [4] occupies

220K gates (approx. 2329 Slices), which is 2.1 times more

than one reconfigurable zone in our design. In [4], the sharing

of the datapath between the AES and EC results in the splitting

of an operation into smaller ones, when these operations could

be more efficiently computed in dedicated hardware or using

look-up tables. This could justify the lower performance metric

of this design.

In [3] a 0.18µm ASIC solution operating at 100 MHz is

proposed. In this solution the AES and EC arithmetic share

most of the multipliers and registers. With 56K gates, the authors

in [3] state that a throughput of 64 Mbit/s for the AES, and

a latency of 1.8 µs for a field multiplication can be achieved.

Considering that 983 field multiplications and 650 squaring

operations are required for implementing the EC multiplication

algorithm, we estimate that the EC point multiplication latency

would be >2.9 ms. The design proposed herein is able to

perform the EC point multiplication 1.4 times faster. Although

our AES throughput is lower, our design can operate AES and

EC simultaneously and offer a flexibility and programmability

that an ASIC solution cannot.

In [16], a compact solution for AES supported by a Xilinx

XC2S15 FPGA running at 67 MHz is proposed. This design is

supported by two main arithmetic units: a multiply accumulate,

and a byte substitution unit, to support the non-linear function


Page 252: ADCOM 2009 Conference Proceedings

TABLE I: Performance metrics for different combinations of simultaneously working PUs.

# ECC PUs | # AES PUs | Latency (K clk cycles) | Latency (ms) | ECC throughput (Op/s) | AES throughput (Mbit/s) | Efficiency (%)
0 | 0 | - | - | - | - | -
1 | 0 | 201.7 | 2.02 | 496 | - | 100.00
2 | 0 | 201.7 | 2.02 | 992 | - | 100.00
3 | 0 | 201.7 | 2.02 | 1488 | - | 100.00
4 | 0 | 342.3 | 3.42 | 1169 | - | 82.50
5 | 0 | 344.9 | 3.45 | 1450 | - | 71.60
6 | 0 | 390.5 | 3.91 | 1536 | - | 61.67
0 | 1 | 201.5 | 2.02 | - | 5.59 | 99.98
1 | 1 | 206.9 | 2.07 | 483 | 5.44 | 99.08
2 | 1 | 223.5 | 2.24 | 895 | 5.04 | 96.61
3 | 1 | 348.8 | 3.49 | 860 | 3.23 | 81.80
4 | 1 | 354.5 | 3.55 | 1128 | 3.18 | 71.24
5 | 1 | 391.4 | 3.91 | 1278 | 2.88 | 61.59
0 | 2 | 201.5 | 2.02 | - | 11.18 | 99.98
1 | 2 | 208.1 | 2.08 | 481 | 10.83 | 98.57
2 | 2 | 348.7 | 3.49 | 574 | 6.46 | 79.27
3 | 2 | 350.3 | 3.50 | 856 | 6.43 | 70.72
4 | 2 | 385.9 | 3.86 | 1037 | 5.84 | 61.20
0 | 3 | 201.5 | 2.02 | - | 16.77 | 99.98

required in the AES. These units are controlled by microin-

structions and a microprogram counter controls the program

flow and branches. This design achieves a throughput of 2.2

Mbit/s occupying 124 slices and 2 BRAMs. Our AES PU

offers a throughput 2.5 times higher with 1092 slices allocated

for its reconfigurable zone and 4 BRAMs. These 4 BRAMs

are the minimum required for an AES PU to operate in the

herein proposed design.

VI. CONCLUSIONS

In this paper, a microcoded and customizable cryptographic

processor prototype is presented, capable of efficiently com-

puting the AES and EC algorithms, as well as the generation

of secrets through a RNG. The adopted approach relies on

efficient and compact EC and AES processing units that share

the same control from a central microinstruction memory,

allowing simultaneous computing of AES and EC routines.

With this processor, customization can be performed by adding

processing units according to the processing needs. Additional

configuration can be achieved in runtime through the dynamic

reconfiguration capabilities of the FPGA. These characteris-

tics make this processor highly adaptable and flexible. The

reconfiguration time for a single PU is shorter than an EC multiplication, resulting in negligible impact on the system

performance if several reconfigurations need to be performed.

The proposed processing units, which provide the computing

power of the processor, have been shown to be very compact

and suitable for embedded systems, supporting AES and EC

with configurations fitting reconfiguration zones of 1092 slices

each, and throughputs up to 1536 Op/s for EC and 16.8 Mbit/s

for AES. Another advantage of the proposed processor is the

inclusion of a compact true RNG in the architecture. This true

RNG allows for the internal generation of secrets (such as

private keys), thus enhancing the system security.

REFERENCES

[1] National Institute of Standards and Technology, "Federal Information Processing Standards Publication 186-3: Digital Signature Standard," June 2009.
[2] ——, "Federal Information Processing Standards Publication 197: Advanced Encryption Standard," November 2001.
[3] J. Wang, X. Zeng, and J. Chen, "A VLSI implementation of ECC combined with AES," Proc. 8th International Conference on Solid-State and Integrated Circuit Technology, pp. 1899–1904, March 2006.
[4] W. Lim and M. Benaissa, "Subword parallel GF(2^m) ALU: an implementation for a cryptographic processor," Proc. IEEE Workshop on Signal Processing Systems, pp. 63–68, Aug. 2003.
[5] R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptography," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 79–99, Aug. 2008.
[6] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography," Proc. IEEE International Conference on Signal Processing and Communications, pp. 65–68, Nov. 2007.
[7] O. Kocabas, E. Savas, and J. Grossschadl, "Enhancing an Embedded Processor Core with a Cryptographic Unit for Speed and Security," Proc. International Conference on Reconfigurable Computing and FPGAs, pp. 409–414, Dec. 2008.
[8] S. Antao, R. Chaves, and L. Sousa, "Compact and Flexible Microcoded Elliptic Curve Processor for Reconfigurable Devices," Proc. 17th IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM, March 2009.
[9] Annapolis Micro Systems, Inc., Wildcard 4 Summary Description, 2007, http://www.annapmicro.com/wc4.html.
[10] P. Kocher, J. Jaffe, and B. Jun, "Differential Power Analysis," Proc. 19th Annual International Cryptology Conference, Advances in Cryptology, CRYPTO, vol. 1666, pp. 388–397, 1999.
[11] B. Sunar, W. Martin, and D. Stinson, "A provably secure true random number generator with built-in tolerance to active attacks," IEEE Transactions on Computers, vol. 56, no. 1, p. 109, 2007.
[12] K. Wold and C. Tan, "Analysis and Enhancement of Random Number Generator in FPGA Based on Oscillator Rings," Proc. International Conference on Reconfigurable Computing and FPGAs, ReConFig, pp. 385–390, 2008.
[13] Xilinx, Inc., Virtex-4 FPGA Data Sheet: DC and Switching Characteristics, version 3.7, 2009, http://www.xilinx.com/support/documentation/data_sheets/ds302.pdf.
[14] National Institute of Standards and Technology, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, Special Publication 800-22, Revision 1," August 2008, http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf.
[15] G. Marsaglia, "Diehard Battery of Tests of Randomness," 1995, http://stat.fsu.edu/pub/diehard/.
[16] T. Good and M. Benaissa, "AES on FPGA from the Fastest to the Smallest," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 427–440, September 2005.


Page 253: ADCOM 2009 Conference Proceedings

The Delft Reconfigurable VLIW Processor

Stephan Wong, Fakhar Anjam
Computer Engineering Laboratory
Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: [email protected], [email protected]

Abstract—In this paper, we present the rationale and design of the Delft reconfigurable and parameterized VLIW processor called ρ-VEX. Its architecture is based on the Lx/ST200 ISA developed by HP and STMicroelectronics. We implemented the processor on an FPGA as an open-source softcore and made it freely available. Using the ρ-VEX, we intend to bridge the gap between general-purpose and application-specific processing through parameterization of many architectural and organizational features of the processor. The parameters include: instruction set (number and type of supported instructions), the number and type of functional units (FUs), issue-width (number of slots), register file size, and memory bandwidth. The parameters can be set in a static or dynamic manner in order to provide the best performance or the best utilization of available resources on the FPGA. A complete toolchain including a C compiler and a simulator is freely available. Any application written in C can be mapped to the ρ-VEX processor. This VLIW processor is able to exploit the instruction level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. This project creates research opportunities in the domain of softcore embedded VLIW processor prototyping, as well as designs that can be used in high-performance applications.

Keywords: Reconfigurable computing, FPGA, softcore, ILP, VLIW.

I. INTRODUCTION

Very Long Instruction Word (VLIW) processors can be used to increase the performance beyond normal Reduced Instruction Set Computer (RISC) architectures [1]. While RISC architectures only take advantage of temporal parallelism (by utilizing pipelining), VLIW architectures can additionally take advantage of spatial parallelism by using multiple functional units (FUs) to execute several operations simultaneously. A VLIW processor improves performance by exploiting Instruction Level Parallelism (ILP) in a program. Field-programmable gate arrays (FPGAs) have become a widely used tool for rapid prototyping, providing both flexibility (as in software programming) and performance (as in dedicated hardware). Nowadays, FPGAs are moving beyond their simple prototyping beginnings towards mainstream products being utilized in many markets: general-purpose, high-performance, and embedded.

For an application to take advantage of the performance improvement from an FPGA, it must possess inherent parallelism, or the application source code should be structured in such a way as to expose its parallelism. Applications in different domains such as multimedia, bio-informatics, wireless communication, and numerical analysis contain a lot of ILP, as they have many independent repetitive calculations. VLIW processors such as the Lx/ST200 [1] from HP and STMicroelectronics and the TriMedia [2] from NXP can exploit the ILP found in an application by means of a compiler. By issuing multiple operations in one instruction, a VLIW processor is able to accelerate an application many times compared to a RISC system [1][3].

This paper presents the design of an open source, extensible and reconfigurable softcore VLIW processor. The processor architecture is based on the VEX (VLIW Example) Instruction Set Architecture (ISA), as introduced in [4], and is implemented on an FPGA. Parameters of the VLIW processor such as the number and type of functional units (FUs), supported instructions, memory bandwidth, and register file size can be chosen based on the application and the available resources on the FPGA. A software development toolchain including a highly optimizing C compiler and a simulator for VEX is made freely available by Hewlett-Packard (HP) [5]. We additionally present a development framework to optimally utilize the processor. Any application written in C can be executed on the processor implemented on the FPGA. The ISA can be extended with custom operations, and the compiler is able to generate code for the custom hardware units, further enhancing the performance.

The remainder of the paper is organized as follows. Section II explains the rationale behind the project. In Section III, some previous work related to softcore processors is discussed. The VEX VLIW processor architecture and the available software toolchain are discussed in Section IV. Section V presents the design and implementation details of our softcore VLIW processor ρ-VEX. Finally, conclusions are presented in Section VI.

II. THE RATIONALE

The utilization of reconfigurable hardware (with the most common nowadays: field-programmable gate array (FPGA)) has increased tremendously in the past years due to their inherent parallelism1 that can be exploited in order to improve the execution of many applications, e.g., multimedia, bio-informatics, and many large-scale scientific computing applications. Many approaches have been adopted to exploit recon-

1 There are multiple factors that played a role, e.g., lowering cost of ownership, but these are not mentioned as the discussion is focused on performance.


Page 254: ADCOM 2009 Conference Proceedings

figurable hardware, but no single all-encompassing solution has emerged, as each usually performs well only for its particular environment or supported application(s). However, many of these solutions are hampered not by their ingenious designs but by the lack of tools to fully exploit the solution for more general-purpose cases. Therefore, we proposed the ρ-VEX processor as a reconfigurable and extensible VLIW softcore processor to bridge the gap between application-specific and general-purpose processing. In the following, we first highlight the advantages of our choice for a VLIW processor as a starting point:

• simple hardware: One of the main advantages of VLIW processors is that their hardware design is relatively simple compared to RISC processors, as there is no need for complex instruction decoders (e.g., for out-of-order execution) in hardware, since the compiler has already taken care of the instruction scheduling. This means that the hardware we need to implement on the FPGA can be kept simple and, therefore, higher clock frequencies can be achieved to improve performance. Furthermore, additional parallelism can be provided by simply adding more issue slots or functional units.

• availability of existing tools: Compilers for VLIW processors are readily available, and research and development effort in improving them is still ongoing. Moreover, for the VEX that we have chosen as a basis, a simulator is available to investigate the performance gains for different architectural instances of the VEX processor. This means we can exploit existing compilers (and simulators) and future advancements without the need to dedicate much effort to their development.

• no need for language translations: Another benefit of using an existing VLIW architecture and its toolchain is that there is no longer a need for translators and automatic synthesis tools. Nowadays, e.g., when looking at C-2-VHDL tools, restrictions must be placed on the C constructs before they can be utilized for the purpose of automatic hardware synthesis, and sometimes code rewriting is necessary to achieve improved performance. In the latter case, the (software) programmer needs to possess hardware knowledge, which is not always the case. This means we can take any existing code and compile it to our VLIW processor without rewriting code and without requiring the programmer to have hardware knowledge. We see a clear motivation for a reconfigurable VLIW processor between hardware design using automatic synthesis tools (starting from C) and manual design, as adequate performance can be achieved after just the compilation time.

The choice for a VLIW processor clearly has its advantages and in the following, we will discuss reconfigurability-specific benefits we foresee:

• static resource sharing: When the size of the reconfigurable hardware structure and the available hardware area are known beforehand, one or several pre-configured VLIW softcore(s) can be instantiated and configured on the FPGA. In this manner, a short trade-off study, e.g., via a simulator or model, can determine the parameters most suited for the available hardware and targeted application(s) at hand. This scenario is most suited to the embedded design environment, as the requirements and platform are usually well-known and fixed. The sharing of resources between multiple VLIW processors is also pre-determined.

• dynamic resource sharing: When neither the application nor the precise characteristics of the attached reconfigurable hardware are known at design time, the most appropriate scenario is to allow for dynamic resource sharing. In this scenario, enough resources are instantiated to allow for sharing among the multiple VLIW processors running on the same chip. The method of how to do this is under investigation, and initial solutions have already been proposed.

• on-the-fly resource instantiation: When new resources are needed, they can be instantiated on-the-fly. Similarly, when they are no longer needed, their space can be freed and dedicated to other applications.

The most promising solution to implement is most certainly the combination of the second and the third benefit mentioned above. On the other hand, one must not lose sight of certain intrinsic disadvantages of VLIW architectures that prevented them from becoming mainstream processors. However, we believe that these disadvantages are mainly due to their fixed design, and many of these disadvantages can be mitigated when being implemented on reconfigurable hardware. We will highlight several issues2 in the following and how they could be addressed:

• varying instruction word widths: Different applications contain different levels of parallelism (this is true even within the same application). In order to fully exploit this, more issue slots should be used, leading to longer (and therefore, different) instruction widths. Moreover, using a different number of instructions can lead to a different encoding scheme of the VLIW instructions, thereby varying their length again. This issue can be easily dealt with by the reconfigurable nature of a reconfigurable and parameterized VLIW processor, as different instruction decoders can be instantiated. This can be achieved with or without reconfiguring the issue slots (in the latter, unused issue slots can be shared among other different softcores).

• high number of NOPs: Due to the traditionally fixed implementation nature of VLIW processors, their organization may not completely match the parallelism inherent in the application, leading to a high number of NOPs being scheduled. This leads to an under-utilization of the available resources (in some cases to over 50%). Instead

2 The length of this paper does not allow for an extensive discussion of the shortcomings of VLIW processors and how they can be addressed. Therefore, we only mention the most important ones.


Page 255: ADCOM 2009 Conference Proceedings

of idling issue slots, the reconfigurable VLIW processor can reconfigure the issue slots or reduce their number, i.e., either physically or by enabling sharing.

• unbalanced issue slots: This issue is tightly coupled with the previous issue, as it is one of the causes for the scheduling of NOPs, since functional units might not be available across all issue slots. This issue can be addressed by adding more functional units per issue slot.

Having stated how a reconfigurable and parameterized VLIW can overcome the traditional shortcomings of a VLIW processor, we will highlight in the following how such a reconfigurable VLIW processor can be used in two likely scenarios:

1) stand-alone general-purpose processor: In this scenario, complete applications (or application threads) run on the VLIW processor. The implementation of the processor can be fixed during the execution of multiple applications, but our envisioned reconfigurable VLIW processor should be able to adapt itself to different applications (or even to code portions within a single application).

2) application-specific co-processor: In this scenario, only specific kernels that require acceleration are compiled to the VLIW processor. The benefits are: (1) no need for code rewriting, (2) avoidance of using complex tools such as C-2-VHDL translators, and (3) manual design of accelerators can be skipped. We have to note again that we are not stating that there is no need for the aforementioned actions or tools, but they can be avoided when the VLIW processor is capable of providing good enough performance within the requirements (such as power and area) set.

Having stated the above, we present an advantage due to the existence of a reconfigurable and parameterized VLIW processor, namely instruction-set architecture (ISA) emulation. This means that we can implement different ISAs on top of the VLIW processor and ensure that each emulation is the most efficient. This has the obvious advantage that applications compiled for different architectures can be executed without code recompilation (cumbersome) or software code emulation (slow). Moreover, having the mentioned ability allows for the following scenarios:

1) ISA extension emulation: When new ISA extensions are being introduced, much research and development effort is needed in order to ensure market acceptance. However, with a reconfigurable ISA emulator it is possible to implement and ship the (draft) extension to potential end-users for actual use and evaluation. Furthermore, bug reports can lead to further improvements before the extension is fixed in hardware. The latter is still needed since the performance and power utilization of reconfigurable hardware is usually not optimal. However, early-on experience of developers can lead to a much earlier market adoption of the intended ISA extension.

2) instantiation of dedicated processor organizations: When new processors are released, in many cases code recompilation is needed to take advantage of new organizational improvements. This need can be relaxed, as dedicated organizational features can be provided in the reconfigurable hardware for particular already-compiled code.

3) relaxation of backwards compatibility: Rarely used instructions can be implemented in reconfigurable hardware and their implementation can be instantiated when needed. This means that complex instruction decoding hardware can be avoided, leading to simpler hardware design and potentially lower power consumption.

By no means is our research in the design of the ρ-VEX processor finished, and there are still many open questions that need to be solved. However, discussing them is beyond the scope of this paper. In the remainder of this paper, we highlight several other similar approaches and describe in more depth the design of our ρ-VEX processor and its current development status.

III. RELATED WORK

In the literature, few softcore VLIW processors with a complete toolchain can be found. The first VLIW softcore processor found in the literature is Spyder [6]. The design and implementation of Spyder marked the beginning of reconfigurable VLIW softcore processors. Spyder consists of three reconfigurable units. A compiler toolchain was made available. One of the drawbacks of Spyder was that both the processor architecture and the compiler were designed from scratch. Because the designer had to put effort into both directions, the processor did not evolve extensively.

Instance-specific VLIW processors are presented in [7][8]. These architectures are specific implementations for some applications, and do not represent a more general VLIW processor. A VLIW processor with a reconfigurable instruction set is presented in [9]. An FPGA-based design of a VLIW softcore processor is presented in [10]. Additionally, this processor is able to execute custom hardware operations. It has an ISA that is binary-code compatible with the Altera NIOS-II soft processor. To support this architecture, a compilation and design automation flow are described for programs written in C. The compilation scheme consists of Trimaran [11] as the front-end and the extended NIOS-II as the back-end. Due to the licensed Altera NIOS-II, this VLIW design is less flexible and not open source.

In [12], a modular design of a VLIW processor is reported. Certain parameters of the processor architecture could be altered in a modular fashion. The lack of a good software toolchain and the absence of parametric extensibility limited the use of this architecture. In [13], the architecture and micro-architecture of a customizable soft VLIW processor are presented. Additionally, tools are discussed to customize, generate and program this processor. Performance and area trade-offs achieved by customizing the processor's datapath and ISA are evaluated. The limitation is the absence of a compiler. In [14], the design and architecture of a VLIW


Page 256: ADCOM 2009 Conference Proceedings

microprocessor is presented without any toolchain, which restricts the processor's usability.

In [3], we presented the design and implementation of a reconfigurable VLIW softcore processor called ρ-VEX. In addition, a development framework to utilize the processor is presented. The processor architecture is based on the VLIW Example (VEX) ISA, as introduced in [4]. VEX represents a scalable technology platform that allows variation in many aspects, including instruction issue-width, organization of functional units, and instruction set. A software development toolchain for the VEX architecture [5] is freely available from Hewlett-Packard (HP). The ρ-VEX processor is open-source and implemented on an FPGA. Different parameters such as the number and types of functional units, supported instructions, memory bandwidth, and size of the register file can be chosen based on the application requirements and available resources on the FPGA. Initially, an instruction ROM file had to be generated for each application to be run on the processor, and the design had to be re-synthesized along with the instruction ROM file. Now, boot-loader-like functionality has been added, and the executable files can be downloaded to the instruction memory and executed directly, avoiding the necessity of resynthesis.

IV. THE VEX VLIW PROCESSOR: HARDWARE AND RELATED SOFTWARE

Compared to superscalar and RISC processors, a VLIW architecture requires a more powerful compiler due to more complex operation scheduling. [4] presents the definition of the VLIW design philosophy as: "The VLIW design philosophy is to design processors that offer ILP in ways completely visible in the machine-level program and to the compiler".

A. The VEX System

VEX stands for VLIW Example. It is a system developed according to the VLIW philosophy by Hewlett-Packard (HP). VEX includes three basic components [4]:

1) The VEX ISA: The VEX Instruction Set Architecture (ISA) is a 32-bit clustered VLIW ISA that is scalable and customizable to individual application domains. The VEX ISA is loosely modeled on the ISA of the HP/ST Lx (ST200) family of VLIW embedded cores [1]. The VEX ISA is scalable because different parameters of the processor, such as the number of clusters, FUs, registers, and latencies, can be changed. The VEX ISA is customizable because special-purpose instructions can be defined in a structured way.

2) The VEX C Compiler: Based on trace scheduling, the VEX C compiler is an ISO/C89 compiler. It is derived from the Lx/ST200 C compiler, which itself is a descendant of the Multiflow C compiler. A very flexible programmable machine model determines the target architecture, which is provided as input to the compiler. This means that architecture exploration of the VEX ISA is possible with this compiler, without the need to recompile the compiler itself.

3) The VEX Simulation System: The VEX simulator is an architectural-level simulator that uses compiled simulator technology to achieve faster execution. It additionally provides a set of POSIX-like libc and libm libraries (based on the GNU newlib libraries), a simple built-in cache simulator (level-1 cache only), and an Application Program Interface (API) that enables other plug-ins used for modeling the memory system.

A VEX software toolchain including the VEX C compiler and the VEX simulator is made freely available by Hewlett-Packard Laboratories [5]. The reason behind choosing the VEX architecture for our project is the scalability and customizability of the VEX ISA and the availability of the free C compiler and simulator, which can be used for architecture exploration.

B. The VEX Instruction Set Architecture

VEX offers a 32-bit clustered VLIW ISA. VEX models a scalable technology platform for embedded VLIW processors that allows variations in the parameters of the processor. Following the VLIW design philosophy, the parameters of the processor, such as issue width, FUs, register files, and processor instruction set, can be varied. The compiler is responsible for scheduling the instructions. Along with basic data and operation semantics, VEX includes many features for compiler flexibility in scheduling multiple concurrent operations. Some of these features are [4]:

• Parallel execution units, such as integer ALUs and multipliers.
• Parallel memory pipelines, including access to multiple data memory ports.
• Data prefetching and other locality hints supported by the architecture.
• A large multiported shared register file made visible by the architecture.
• Partial predication through select operations.
• Multiple condition registers to make an efficient branch architecture.
• Long immediate operands can be encoded in the same instruction.

Table I presents the parameters that could be changed for the VEX VLIW processor.

The most basic unit of execution in VEX is an operation,which is equivalent to a typical RISC-style instruction. An

Table I
THE VEX DESIGN PARAMETERS

Processor Resource | Design Parameters
Functional Units | Number of FUs, type, supported instructions, degree of pipelining
Register File | Register size, register file size, number of read ports, number of write ports
Load/Store Unit | Number of memory ports, memory latency, cache size, line unit
Interconnection Network | Number and width of buses, forwarding connections between units


Page 257: ADCOM 2009 Conference Proceedings

encoded operation in the VEX system is called a syllable. Multiple syllables are combined to form an instruction, which is an atomic unit of execution in a VLIW processor. The instruction issue-width is the number of syllables in an instruction that could be issued, and it depends on the number of FUs in the processor. An instruction having multiple syllables or operations is issued every cycle by the compiler to the multiple execution units of the VLIW processor, which is the main reason for its performance advantage compared to a RISC processor, which has an issue-width of one.

1) Multicluster Organization: The number of read ports of the shared multiported register file in a VLIW processor is twice the issue-width, and the number of write ports is equal to the issue-width (assuming that each FU requires two input operands and writes one output as a result). Therefore, the issue-width is proportional to the product of the number of read and write ports of the shared register file. The resource/area requirement for a multiported register file is directly proportional to the product of the number of read and write ports; therefore, these parameters, and hence the issue-width, are not scalable to a large extent. To reduce this pressure on the number of read and write ports of the shared register file, VEX defines a clustered architecture [4]. Using modular execution clusters, VEX provides scalability of issue-width and functionality. A cluster is a collection of register files and a tightly coupled set of FUs.
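The port-scaling argument in numbers, under the stated assumption of two read operands and one written result per FU:

# Shared register file ports for an n-issue VLIW: ~2n reads, n writes,
# with area growing roughly with read_ports * write_ports.
def shared_regfile_ports(issue_width):
    return 2 * issue_width, issue_width      # (read ports, write ports)

for w in (2, 4, 8, 16):
    r, wr = shared_regfile_ports(w)
    print(f"{w}-issue: {r}R/{wr}W ports, area ~ {r * wr}")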

VEX clusters are numbered from zero. Cluster 0 is a special cluster that must always be present in any VEX implementation, because the control operations execute on this cluster. Different clusters have different unit/register mixes, but a single Program Counter (PC) and a unified I-cache control them all, so that they run in lockstep or proper sequence [1]. The structure of a VEX multicluster architecture is depicted in Figure 1.

Figure 1. The VEX Multicluster Organization

FUs within a cluster can only access registers in the same cluster. VEX provides a simple pair of send-receive instruction primitives that move data among registers on different clusters. These intercluster copy operations may consume resources in both the source and destination cluster and may require more than one cycle (pipelined or not). There is only a single instruction cache (I-cache), but different data cache (D-cache) ports and/or private memories can be associated with each cluster. This means that VEX allows multiple memory accesses to execute simultaneously. Figure 1 depicts multiple D-cache blocks, attached by a crossbar to different clusters, which allows a variety of memory configurations. VEX clusters obey the following set of rules [4]:

• Each cluster has the ability to issue multiple operations in the same instruction.
• Different clusters can have different issue-widths and different types of operations.
• Different clusters can have different VEX ISAs, and not all clusters have to support the entire VEX ISA.
• All units within a cluster are indistinguishable, or equally likely for selection. This means that the operations to be executed by a cluster do not have to be assigned to particular units within this cluster. Assigning operations to the units within a cluster is the job of the hardware decoding logic.

2) Structure of the Default VEX Cluster: The default single VEX cluster is a 4-issue VLIW core, as depicted in Figure 2, and consists of the following units [4]:

• Four 32-bit integer ALUs
• Two 16x32 multipliers (MULs)
• One Load/Store Unit
• One Branch Unit
• 64 32-bit general-purpose registers (GRs)
• 8 1-bit branch registers (BRs)

This cluster can issue up to four operations per instruction. These operations can be integer ALU, MUL, or Load/Store operations. All FUs are directly connected to registers, and no FU is directly connected to another FU. The two types of register banks are GR and BR; both are multiported and shared register files. Memory units support only load and store operations, i.e., operations that act on memory and save results directly in memory are not supported by the VEX system. The branch unit (control unit) in the default cluster is used for program sequencing and is present only in cluster 0 in the case of a multicluster machine.
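As a compact summary of these defaults, the following C sketch (our own illustration; the struct and field names are hypothetical and not part of the VEX toolchain) models the default cluster configuration:

#include <stdio.h>

/* Hypothetical model of the default VEX cluster parameters
 * listed above; all names here are illustrative only. */
typedef struct {
    int issue_width;   /* syllables per instruction   */
    int num_alu;       /* 32-bit integer ALUs         */
    int num_mul;       /* 16x32 multipliers           */
    int num_mem;       /* load/store units            */
    int num_br;        /* branch units                */
    int gr_registers;  /* 32-bit general-purpose regs */
    int br_registers;  /* 1-bit branch regs           */
} vex_cluster_cfg;

static const vex_cluster_cfg default_cluster = {
    .issue_width = 4, .num_alu = 4, .num_mul = 2,
    .num_mem = 1, .num_br = 1,
    .gr_registers = 64, .br_registers = 8
};

int main(void)
{
    printf("default cluster: %d-issue, %d GRs, %d BRs\n",
           default_cluster.issue_width,
           default_cluster.gr_registers,
           default_cluster.br_registers);
    return 0;
}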

C. The VEX C Compiler

The VEX C compiler is derived from the Lx/ST200 C compiler, which is itself derived from the Multiflow C compiler [15]. The Multiflow compiler includes high-level optimization algorithms based on Trace Scheduling [16]. The VEX C compiler is provided as a part of the freely available VEX toolchain by HP. The compiler supports legacy (pre-ANSI) C as well as ISO/ANSI C. The toolchain has a command-line interface and is provided in the form of binaries. Different command-line options are provided for the compiler and the toolchain. Applications can be compiled with profiling flags, and GNU gprof can be used to visualize the profile data. Because the VEX processor is scalable and customizable,


Page 258: ADCOM 2009 Conference Proceedings

Figure 2. The Default VEX Cluster

the compiler supports this scalability and customizability. To compile for a different configuration, the compiler is provided with configuration information in the form of a Machine Model Configuration (fmm) file. To include a custom instruction, the application code is annotated with pragmas. Different compiler pragmas are available to improve the performance. Refer to [4] for details on how to use the VEX compiler.
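As an illustration of the kind of annotation involved (a sketch of our own; the pragma name below is an assumption from memory, and the authoritative syntax is documented in [4]), a loop kernel might be annotated as follows:

/* Illustrative only: the pragma name is assumed, not verified
 * against the VEX toolchain manual [4]. */
void saxpy(int n, int a, int *x, int *y)
{
    int i;
#pragma unroll_amount(4, 1)      /* assumed loop-unrolling pragma */
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}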

D. The VEX Simulation System

The VEX toolchain provides tools that allow C programs compiled for a VEX VLIW configuration to be simulated on a host workstation. The VEX simulator is a fast compiled simulator (CS) that translates a VEX binary to a host computer binary. It first converts the VEX binary to C and then, using the C compiler of the host, generates a host executable. The compiled simulation workflow is depicted in Figure 3.

The VEX simulator produces instrumentation code to count execution cycles and other statistics and generates a log file at the end of simulation. This log file has all the statistical information that can be analyzed for performance analysis and architecture exploration. The simulator provides a simple cache simulation library, which models an L1 instruction and data cache. The default cache simulation can be replaced by a user-defined library. In addition, the VEX simulator includes support for gprof, and the different statistical files generated at the end of simulation can be used with the gprof tool for analysis and profiling of the simulated application. Refer to [4] for details on how to use the VEX simulator.

V. THE ρ-VEX SOFTCORE VLIW PROCESSOR

In [3], we presented the design and implementation of a reconfigurable and extensible softcore VLIW processor. We implemented a single-cluster standard configuration of the VEX

Figure 3. The VEX Simulation Flow

machine for our processor, called ρ-VEX. Figure 4 depicts the organization of our 32-bit, 4-issue VLIW processor implemented on an FPGA. The ρ-VEX processor consists of fetch, decode, execute and writeback stages. The fetch unit fetches a VLIW instruction from the attached instruction memory and splits it into syllables that are passed on to the decode unit. In the decode stage, the instructions are decoded and the register contents used as operands are fetched from the register file. The actual operations take place in either the execute unit or in one of the parallel CTRL or MEM units. ALU and MUL operations are performed in the execute stage. This stage is implemented parametrically, so that the number of ALU and MUL functional units can be adapted. All jump and branch operations are handled by the CTRL unit, and all data memory load and store operations are handled by the MEM unit. All write activities are performed in the writeback unit to ensure that all targets are written back at the same time. The different write targets can be the GR register file, BR register file, data memory or PC.
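The following toy C model (ours; the real design is VHDL, and this sketch ignores fetch, decode, MUL, CTRL, and MEM details) illustrates the behavior in which all four syllables execute and all write targets are written back at the same time:

#include <stdio.h>

/* Toy model of the rho-VEX 4-issue execute/writeback flow.
 * All types, names, and encodings here are hypothetical. */
typedef struct { int opcode, src1, src2, dst; } syllable_t;
typedef struct { syllable_t s[4]; } vliw_instr_t;   /* one 4-issue word */

static int GR[64];   /* general-purpose register file */

static int alu(int opcode, int a, int b)
{
    switch (opcode) {
    case 0:  return a + b;   /* ADD */
    case 1:  return a - b;   /* SUB */
    default: return 0;       /* NOP */
    }
}

static void run_cycle(const vliw_instr_t *vi)
{
    int result[4];
    /* EXECUTE: all four syllables operate in parallel */
    for (int i = 0; i < 4; i++)
        result[i] = alu(vi->s[i].opcode,
                        GR[vi->s[i].src1], GR[vi->s[i].src2]);
    /* WRITEBACK: all targets are written back at the same time */
    for (int i = 0; i < 4; i++)
        GR[vi->s[i].dst] = result[i];
}

int main(void)
{
    GR[1] = 5; GR[2] = 7;
    vliw_instr_t vi = { .s = { {0,1,2,3},    /* GR[3] = GR[1]+GR[2] */
                               {1,1,2,4},    /* GR[4] = GR[1]-GR[2] */
                               {2,0,0,0},    /* NOP */
                               {2,0,0,0} } };/* NOP */
    run_cycle(&vi);
    printf("GR[3]=%d GR[4]=%d\n", GR[3], GR[4]);   /* 12 and -2 */
    return 0;
}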

The ρ-VEX implements all of the 73 operations of the VEX operation set. It additionally supports reconfigurable operations, as the VEX compiler supports the use of custom instructions via pragmas inside the application code. In the current ρ-VEX prototype, it takes only a few lines of VHDL code to add a custom operation to the architecture. One of the 24 available reserved opcodes can be chosen, and a provided template VHDL function can be extended with the custom functionality. Currently, the following properties of ρ-VEX are parametric:

• Syllable issue-width
• Number of ALU units
• Number of MUL units
• Number of GR registers (up to 64)
• Number of BR registers (up to 8)
• Types of accessible FUs per syllable
• Width of memory busses

To optimally utilize the processor, a development

framework is provided, which consists of compiling a piece of C code with the VEX compiler and then generating a VHDL instruction ROM file by assembling the assembly file with our assembler [3]. The ROM file is then synthesized with the rest of the processor VHDL design files.

As the target reconfigurable technology, a Xilinx Virtex-II Pro (XC2VP30) FPGA was chosen, embedded on the XUP V2P development board by Digilent. All experiments were performed on a non-pipelined ρ-VEX system with 32 general-purpose registers (GR). A data memory of 1 kB, implemented using BlockRAM, was connected to ρ-VEX to store results. The issue-width of ρ-VEX was varied between 1, 2 and 4. All configurations had the same number of ALU units as their issue-width. The 2- and 4-issue ρ-VEX configurations had 2 MUL units. The application code was loaded into the instruction memory before synthesis. We developed a debugging UART interface to transmit data via the serial RS-232 protocol. This interface invoked a transmission of the hexadecimal representation of the data memory contents, as well as the contents


Page 259: ADCOM 2009 Conference Proceedings

Figure 4. The ρ-VEX VLIW Processor

of the internal ρ-VEX cycle counter register. Synthesis results for the ρ-VEX processor are presented in Table II.

A. Recent Developments

The following design improvements have been added to the original ρ-VEX processor:

• The assembler has been extended and now generates a binary executable file for the processor. The hardware design is modified and BlockRAM is used as the instruction memory. The executable file can be downloaded into the instruction memory of the already placed processor on the FPGA using a serial port on the PC and the FPGA development board; therefore, re-synthesis and re-implementation of the processor when changing the application is not required.

• We implemented a dynamically reconfigurable register file for the ρ-VEX processor to reduce the resources required by the multiported register file [17]. The VEX architecture supports up to 64 multiported shared registers in a register file for a single-cluster VLIW processor. This register file accounts for a considerable amount of area when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration, allowing the creation of dedicated register file sizes for different applications. The processor can dynamically create its own register file composed of the actual number of registers the application needs. This means that valuable area can be freed and utilized for other implementations running on the same FPGA when the full register size is not needed. Our design needs 924 slices on a Virtex-II Pro device for dynamically placing a chunk of 8 registers, and places registers in multiples of 8 to simplify the design. The processor thus does not need to permanently instantiate all 64 registers, which require 8594 slices, thereby considerably reducing the slice utilization at run time without increasing the cycle count.
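As a rough illustration of the saving (our own extrapolation, assuming the per-chunk cost scales linearly, which [17] does not explicitly state):

    slices(n) ≈ ⌈n/8⌉ × 924

so an application needing only 16 registers occupies about 2 × 924 = 1848 slices instead of 8594, roughly a 78% reduction in register-file area.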

Table II: SYNTHESIS RESULTS FOR ρ-VEX PROCESSOR

ρ-VEX     Slices        Max. Frequency
1-issue   1895 (13%)    89.44 MHz
2-issue   5105 (37%)    89.44 MHz
4-issue   10433 (76%)   89.44 MHz


VI. CONCLUSIONS

In this paper, we presented the design and implementation of a reconfigurable softcore VLIW processor based on the Lx/ST200 ISA, developed by HP and STMicroelectronics. Our processor design, called ρ-VEX, is parametric: different parameters such as the number and type of functional units, supported instructions, memory bandwidth, and register file size can be chosen depending upon the application and the available resources on the FPGA. A toolchain including a C compiler and a simulator is freely available. We provide a development framework to optimally utilize the reconfigurable VLIW processor. Any application written in C can be mapped to the VLIW processor on the FPGA. This VLIW processor is able to exploit the instruction-level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. We described our rationale for the ρ-VEX processor and presented the possible advantages it can provide when used as a general-purpose processor or an application-specific co-processor.

REFERENCES

[1] P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood, "Lx: A Technology Platform for Customizable VLIW Embedded Processing", in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA 00), June 2000, pp. 203-213.
[2] TriMedia Processor Series. http://www.nxp.com/.
[3] S. Wong, T.V. As, and G. Brown, "ρ-VEX: A Reconfigurable and Extensible Softcore VLIW Processor", in IEEE International Conference on Field-Programmable Technology (ICFPT 08), Taiwan, December 2008.
[4] J. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann, 2004.
[5] Hewlett-Packard Laboratories. VEX Toolchain. [Online]. Available: http://www.hpl.hp.com/downloads/vex/.
[6] C. Iseli and E. Sanchez, "Spyder: A Reconfigurable VLIW Processor using FPGAs", in FPGAs for Custom Computing Machines, January 1993, pp. 17-24.
[7] C. Grabbe, M. Bednara, J.V.Z. Gathen, J. Shokrollahi, and J. Teich, "A High Performance VLIW Processor for Finite Field Arithmetic", in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 03), April 2003.
[8] M. Koester, W. Luk, and G. Brown, "A Hardware Compilation Flow for Instance-Specific VLIW Cores", in Proceedings of the 18th International Conference on Field Programmable Logic and Applications (FPL 08), September 2008.
[9] A. Lodi, M. Toma, F. Campi, A. Cappelli, and R. Canegallo, "A VLIW Processor with Reconfigurable Instruction Set for Embedded Applications", IEEE Journal of Solid-State Circuits, vol. 38, no. 11, November 2003, pp. 1876-1886.
[10] A.K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, "An FPGA-based VLIW Processor with Custom Hardware Execution", in Proceedings of the 13th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA 05), New York, NY, USA: ACM, 2005, pp. 107-117.
[11] http://www.trimaran.org/.
[12] V. Brost, F. Yang, and M. Paindavoine, "A Modular VLIW Processor", in IEEE International Symposium on Circuits and Systems (ISCAS 2007), April 2007, pp. 3968-3971.
[13] M.A.R. Saghir, M. El-Majzoub, and P. Akl, "Customizing the Datapath and ISA of Soft VLIW Processors", in High Performance Embedded Architectures and Compilers (HiPEAC 07), LNCS 4367, pp. 276-290, Springer-Verlag Berlin Heidelberg, 2007.
[14] W.F. Lee, VLIW Microprocessor Hardware Design for ASICs and FPGA. McGraw-Hill, 2008.


Page 260: ADCOM 2009 Conference Proceedings

[15] P.G. Lowney et al., "The Multiflow Trace Scheduling Compiler", The Journal of Supercomputing, 7(1/2), 51-142, 1993.
[16] J. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction", IEEE Trans. on Computers, C-30(7), 478-490, 1981.
[17] S. Wong, F. Anjam, and M.F. Nadeem, "Dynamically Reconfigurable Register File for a Softcore VLIW Processor", accepted for publication in DATE 2010.


Page 261: ADCOM 2009 Conference Proceedings

Run-time Reconfiguration of Polyhedral Process Networks Implementations

Hristo Nikolov, Todor Stefanov, Ed Deprettere
Leiden Institute of Advanced Computer Science

Leiden University, The Netherlands
nikolov, stefanov, [email protected]

Abstract

Run-time reconfigurable computing is a novel computing paradigm which offers greater functionality with a simpler hardware design and reduced time-to-market. Although reconfigurable technology is constantly advancing, reconfigurable computing is hardly employed in real systems due to the difficulties associated with realizing and managing the reconfiguration process. In this paper, we address a particular design challenge, namely, the execution management of the dynamic (reconfigurable) modules. We propose a general and technology-independent approach for modeling and implementation of run-time execution management for applications modeled as polyhedral process networks. By exploiting the main characteristics of the polyhedral process networks, the approach guarantees consistent executions of reconfigurable implementations. We do not focus on low-level implementation issues of the reconfiguration process itself, since the latter is not (directly) related to the execution management we propose and is therefore out of the scope of this paper.

1 Introduction

When we talk about (re)configurable computing, we usually consider FPGA-based system designs. Such systems retain the execution speed of "fixed" hardware while having a great deal of functional flexibility, because the logic within the FPGA can be changed if or when it is necessary. As a result, hardware bug fixes and upgrades can be administered as easily as their software counterparts. For example, in order to support a new version of a network protocol, one can redesign the internal logic of the FPGA and send the enhancement to the affected customers by email. Once they have downloaded the new logic design to the system and restarted it, they will be able to use the new

version of the protocol. Evolving from configurable computing, reconfigurable computing goes one step further by providing manipulation of the logic within the FPGA at run time. That is, the design of the hardware may change in response to the demands placed upon the system while it is running. Here, the FPGA acts as an execution engine for a variety of different hardware functions, much as a CPU acts as an execution engine for a variety of software threads. A particular example of run-time reconfigurable computing is the so-called dynamic partial reconfiguration (DPR). Partial reconfiguration is the process of configuring a portion of a field programmable gate array while the other part is still running/operating. DPR allows critical parts of the design to continue operating while a partial design is loaded into the FPGA.

Reconfigurable computing has two major advantages. First, it is possible to achieve greater functionality with a simpler hardware design. Because not all of the logic must be present in the FPGA at all times, the cost of supporting additional features is reduced to the cost of the memory required to store the logic design. The second advantage is reduced time-to-market. Most importantly, the logic design remains flexible up to, and even after, the product is shipped. This allows an incremental design flow, a luxury usually not available to typical hardware designs. One can even ship a product that meets the minimum requirements and add features after deployment. Moreover, in a networked product like a set-top box or cellular telephone, it may even be possible to make such enhancements without customer involvement. In run-time reconfigurable computing, a main consideration is the overhead introduced by the reconfiguration process itself. If reconfiguration is performed too often, this overhead can become a bottleneck, limiting system performance. Therefore, the ratio of execution time to reconfiguration time has to be kept reasonably high.
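To put the last point in simple terms (our own formulation, not from the paper): if a module executes for time t_e between reconfigurations and each reconfiguration takes t_r, the fraction of time lost to reconfiguration is

    t_r / (t_e + t_r)

so keeping the ratio t_e/t_r at 10 or above bounds the overhead below roughly 9%.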


Page 262: ADCOM 2009 Conference Proceedings

1.1 Problem Statement

The principal benefits of using dynamic (partial) reconfiguration (DPR) are the ability to execute larger hardware designs with fewer gates and to realize the flexibility of a software-based (multi-threaded) solution while retaining the execution speed of a more traditional, hardware-based approach. However, this comes at the price associated with the difficulties in realizing run-time reconfigurable computing. First, the provided design flows are weak and mostly experimental. It is not possible to model DPR during all the steps of a system development. For instance, SystemC can be used for the first high-level steps, but it is then difficult to use other tools, e.g., HW/SW partitioning tools, simply because DPR is not integrated by the tool vendors. For the low-level steps, it is (almost) impossible to simulate and validate the designs before the platform is integrated into the final board. As a result, designers are overwhelmed with too many and very low-level details in order to "get it right", making reconfigurable computing a highly error-prone and time-consuming task.

In addition to the lack of tool support, a major challenge when using dynamic reconfiguration is the execution management of the dynamic (reconfigurable) modules. This includes both spatial and temporal management. The latter is especially important in realizing reconfigurable implementations with consistent run-time behavior. Consistency here means that any reconfigurable implementation and execution generates results equivalent to its non-reconfigurable counterpart for the same application. The challenge in realizing an execution management is further exacerbated by the complexity of today's applications, especially in the domain of multimedia embedded systems. Usually, such systems consist of multiple compute modules that operate in a globally asynchronous fashion. If these modules require reconfiguration, i.e., they are dynamic, it is very easy to violate consistency at run time. This resembles very much the challenges in software multi-threading: common problems with thread synchronization include deadlock and the inability to (correctly) compose program fragments that are correct in isolation [3, 6]. In general, it is not known how a programmer can come up with a multi-threaded program with a correctness guarantee. The same problems arise in reconfigurable computing as well, i.e., there is no correctness guarantee for applications demanding and implementing reconfiguration at run time. We address this issue, and in this paper we present an approach based on conditions defining "safe" points when reconfiguration may occur. The main contribution of the proposed approach is that if the defined conditions

are respected, consistent system executions are guaranteed while allowing asynchronous reconfiguration of different dynamic modules at run time.

The remaining part of the paper is organized as follows. In Section 2, we discuss the scope of the approach and the main assumptions it relies on. Section 3 presents the solution approach. Implementation details are discussed in Section 4. Section 5 concludes the paper.

2 Scope of Work

One of the main assumptions in our work is that we consider only dataflow-dominated applications in the realm of multimedia, imaging, and signal processing that naturally contain tasks communicating via streams of data. Such applications are very well modeled by using the parallel dataflow model of computation (MoC) called Kahn Process Network (KPN) [4]. The KPN model we use is a network of concurrent autonomous processes that communicate data in a point-to-point fashion over bounded FIFO channels, using a blocking read/write on an empty/full FIFO as a synchronization mechanism. Each process in the network performs a sequential computation concurrently with the other processes. A well-known characteristic of KPNs is that their MoC is deterministic: for a given input data, one and the same output data is always produced, and this input/output relation does not depend on the order in which the processes are executed. As the control is incorporated into the processes, no global scheduler is present. A minimal software sketch of such a blocking FIFO channel is given below.
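The following C sketch (ours, for illustration only; it is not the Daedalus/Espam API) shows one bounded FIFO channel with blocking read/write connecting a two-process network:

#include <pthread.h>
#include <stdio.h>

/* One bounded FIFO channel with blocking read/write,
 * as used by the KPN/PPN model described above. */
#define DEPTH 4

typedef struct {
    int buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t m;
    pthread_cond_t not_full, not_empty;
} fifo_t;

static void fifo_init(fifo_t *f)
{
    f->head = f->tail = f->count = 0;
    pthread_mutex_init(&f->m, NULL);
    pthread_cond_init(&f->not_full, NULL);
    pthread_cond_init(&f->not_empty, NULL);
}

static void fifo_write(fifo_t *f, int v)     /* blocks on a full FIFO  */
{
    pthread_mutex_lock(&f->m);
    while (f->count == DEPTH) pthread_cond_wait(&f->not_full, &f->m);
    f->buf[f->tail] = v; f->tail = (f->tail + 1) % DEPTH; f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->m);
}

static int fifo_read(fifo_t *f)              /* blocks on an empty FIFO */
{
    pthread_mutex_lock(&f->m);
    while (f->count == 0) pthread_cond_wait(&f->not_empty, &f->m);
    int v = f->buf[f->head]; f->head = (f->head + 1) % DEPTH; f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->m);
    return v;
}

static fifo_t ch;

static void *producer(void *arg)             /* process P1 */
{
    (void)arg;
    for (int i = 0; i < 8; i++) fifo_write(&ch, i * i);
    return NULL;
}

static void *consumer(void *arg)             /* process P2 */
{
    (void)arg;
    for (int i = 0; i < 8; i++) printf("%d\n", fifo_read(&ch));
    return NULL;
}

int main(void)
{
    pthread_t p1, p2;
    fifo_init(&ch);
    pthread_create(&p1, NULL, producer, NULL);
    pthread_create(&p2, NULL, consumer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return 0;
}

With a blocking write on a full FIFO and a blocking read on an empty one, the printed output is independent of how the two threads are scheduled, which is the determinism property noted above.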

To represent KPNs, we use polyhedral descriptions; therefore, we call our KPNs polyhedral process networks (PPNs). PPNs are a specific case of KPNs: PPNs are static, and everything about the execution of the process network is known at compile time. Moreover, PPNs execute in finite memory, and the amount of data communicated through the FIFO channels is also known. We are interested in this subset of KPNs because they are analyzable, e.g., FIFO buffer sizes and execution schedules are decidable, and SW/HW synthesis from them is possible.

A PPN is implemented as a heterogeneous multiprocessor system-on-chip (MPSoC) using the Daedalus design methodology [1, 10]. In such MPSoCs, the processing components are programmable processors and dedicated HW compute modules (IP cores). The latter may provide run-time reconfiguration. In this paper, we consider fixed communication topologies, i.e., a communication topology cannot be reconfigured in a target MPSoC. Hence, reconfiguration can be applied only on the dedicated dynamic IP cores.


Page 263: ADCOM 2009 Conference Proceedings

An IP core implements the main computation of a PPN process, which behaves like a function call. Therefore, the computation performed by a reconfigurable IP core has to resemble a function call as well. This means that for each input data read by the IP core, the core is executed and it produces output data after an arbitrary delay. In addition, to guarantee seamless integration within the dataflow of the considered heterogeneous systems, an IP core must have unidirectional data interfaces at the input and the output that do not require random access to read and write data from/to memory. Additional information about the IP cores is given in Section 4.

3 Solution approach

In this section, we discuss the solution approach which allows for run-time reconfiguration of PPN processes in a way that guarantees consistent and deterministic PPN executions on the considered MPSoCs. For illustrative purposes, we use an example presented in Section 3.1. The PPN model is briefly introduced in Section 3.2. It contains parameters which may change values at run time. The concept of modeling process networks containing dynamic parameters was introduced recently in [7]. We use the same approach as in [7] to preserve consistency of PPN executions, and in addition, we use the parameter values to trigger reconfiguration of particular processes (i.e., IP cores) at run time. In the proposed solution approach, we do not discuss technical details about how FPGA partial reconfiguration is realized, since it is highly vendor dependent and out of the scope of this work. Instead, we discuss when reconfiguration is actually safe to happen (in terms of consistency). This is based on conditions which have to be respected at run time. The conditions are discussed in Section 3.4.

3.1 Illustrative example

Below, we present a part of a multi-format video encoding application. Usually, encoding algorithms work on a YUV color space while, naturally, the input video information is represented in an RGB color space. Therefore, an initial conversion to YUV is required, and then specific processing on the Y, U, and V image components is performed. Figure 1 illustrates this basic scenario, which we will use as our illustrative example. Figure 1(a) depicts a high-level view of an MPSoC system in which the input RGB stream is converted by processing component Conv to Y, U, and V streams. They are further processed in parallel by

Figure 1. Motivating example: (a) processing without reconfiguration (components Conv, Yproc, Uproc, Vproc); (b) processing with reconfiguration (Conv reconfigures a single Proc module according to the Y, U, V parameters); (c) PPN with dynamic parameters (processes P1 and P2, data channel d, control channel c).

processing components Yproc, Uproc, and Vproc, respectively. Figure 1(b) depicts the same RGB-to-YUV conversion and processing, however implemented on a system with run-time reconfiguration. In this version, there is one dynamic module Proc which is used to process the YUV data. Depending on the data that needs to be processed, Proc is dynamically reconfigured by the Conv component. The implementation of the reconfiguration process must avoid any undetermined behavior; therefore, explicit handshake logic is required for correct management of the reconfiguration. For brevity, these details are omitted in Figure 1(b).

In our example, we use only the type of the processed data to illustrate a scenario of reconfigurable computing. However, depending on the required level of flexibility, additional information, e.g., frame size (standard or high definition), type of encoding (MJPEG, MPEG4, or DivX), etc., can also be used for reconfiguration of the system at run time. Moreover, due to performance limitations, for example, the quality of the encoding may need to be constrained as well. In our approach to reconfigurable computing, we capture reconfiguration information at the application level, i.e., in the polyhedral process network model we use to specify application behavior. More precisely, different configuration possibilities are defined by a set of parameters and their values in a PPN. Our illustrative example is represented as a PPN in Figure 1(c). It consists of two processes, P1 and P2, connected through one dataflow channel (d). P1 implements the RGB-to-YUV conversion and P2 realizes the processing of the Y, U, and V components. The information about what type of image component is to be processed is specified by a parameter. In order to transfer parameter values between the processes, we use control FIFO channels, i.e., channel c in Figure 1(c). At run time, the parameter values are used to trigger proper reconfigurations.

As is the case with all dataflow models, the main question here is whether the PPNs with dynamic parameters are consistent. Consistency has to do with


Page 264: ADCOM 2009 Conference Proceedings

the balancing of the production and consumption of tokens in the network. When this balancing is dependent on dynamic parameters, consistency conditions may be violated. In the remaining part of the paper, we discuss how we address this problem in order to guarantee consistent executions of applications modeled as PPNs on platforms using run-time partial reconfiguration.

3.2 Polyhedral (Kahn) process networks (PPN)

The parallelism in our PPNs is expressed at the level of the application tasks, as a process implements a single application task only. A process of a PPN consists of a function, input ports, output ports, and control. The function specifies how data tokens from input streams are transformed to data tokens on output streams. The function also has input and/or output arguments. The input and output ports are used to connect a process to FIFO channels in order to read data tokens, initializing the function input arguments, and to write data generated as a result of the function execution. The control specifies how many times the function is executed and which ports to read/write at every execution, i.e., at every iteration (firing) of the process. The control of a process can be compactly represented mathematically in terms of linearly bounded sets of iterator vectors using the polytope model [2]. A process has a Process Domain (DM) which is the set of all iterator vectors. Each iterator vector corresponds to one and only one integral point in a polytope. Formally,

DM = P(p) ∩ Z^n,

where P(p) is a parametric polytope,

P(p) = { i ∈ Q^n | A·i ≥ B·p + C },

where i is an iteration vector, A, B, and C are integral matrices of appropriate dimensions, and p ∈ Z^m is a parameter vector with an affine range R(p),

R(p) = { p ∈ Z^m | D·p ≥ E },

where D and E are integral matrices of appropriate dimensions. We use the values of the parameter vector's elements to determine different configuration options at run time.
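For a concrete instance (an example of our own, not taken from the paper), consider a process whose control is the single loop for (i = 1; i <= N; i++) with one parameter N ranging over 1 ≤ N ≤ 3. The loop bounds i ≥ 1 and i ≤ N are captured by

    A = (1, -1)^T,  B = (0, -1)^T,  C = (1, 0)^T,  giving A·i ≥ B·N + C,

and the parameter range by

    D = (1, -1)^T,  E = (1, -3)^T,  giving D·N ≥ E.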

3.3 Process network instance

In our approach to model dynamic parameters, we introduce the notion of a PPN instance, which is defined by the current values of the elements of the parameter vector. Consider the PPN representing

1   // Execution of process P1
2   while( 1 )
3   {   // Execution cycle
4       read_parameter( N1 );
5       for ( int i=1; i<=N1; i=i+1 ) {
6           read( a, x );
7           execute_P1( x, &y );
8           write( y, b );
9       }
10  }

Figure 2. PPN and process execution cycle: (a) PPN with dynamic parameters (processes P1 and P2, with parameter channels N1 and N2 and function arguments x, y); (b) structure of a process, shown as the listing above.

a producer-consumer pair, shown in Figure 2(a). N1 and N2 are the FIFO channels carrying the parameters N1 for process P1 and N2 for process P2, respectively. Each parameter can take values within a fixed range. PPN(N1, N2) denotes an instance of the PPN. There is generally a relation between the parameters, in this example between N1 and N2; therefore, some instances PPN(N1, N2) are invalid. For the PPN in Figure 2(a), the different instances are:

Parameter ranges:     PPN instances PPN(N1,N2):

1 ≤ N1 ≤ 3;           PPN(1,1); PPN(1,2); PPN(1,3)
1 ≤ N2 ≤ 3;           PPN(2,1); PPN(2,2); PPN(2,3)
N2 ≥ N1;              PPN(3,1); PPN(3,2); PPN(3,3)

Instances PPN(2,1), PPN(3,1), and PPN(3,2) are invalid because they violate the condition N2 ≥ N1. Similarly, instance PPN(2,4) is invalid because N2 is out of its range. Figure 2(b) shows the structure of a process that we propose to deal with dynamic parameters. Network instances are selected by reading parameter values at run time. For this purpose, we add a read-parameters phase (line 4) prior to the actual processing (lines 5-9). Because reading parameters and data processing are repeated (possibly an infinite number of times), we call this a process execution cycle (lines 3-9). When all processes in a PPN have performed an execution cycle, a network instance has performed an execution.
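The validity rules above translate directly into a check that can be evaluated at design time; a minimal C sketch (ours, with hypothetical names) follows:

#include <stdbool.h>
#include <stdio.h>

/* Validity check for PPN(N1,N2) instances of Figure 2(a),
 * per the ranges and relation given above. */
static bool is_valid_instance(int n1, int n2)
{
    return 1 <= n1 && n1 <= 3 &&   /* range of N1 */
           1 <= n2 && n2 <= 3 &&   /* range of N2 */
           n2 >= n1;               /* inter-parameter relation */
}

int main(void)
{
    printf("PPN(2,1): %s\n", is_valid_instance(2, 1) ? "valid" : "invalid");
    printf("PPN(1,3): %s\n", is_valid_instance(1, 3) ? "valid" : "invalid");
    return 0;
}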

Definition 3.1 (Consistency of a PN instance): A PN instance is consistent if, after an execution, the number of tokens written to any channel is equal to the number of tokens read from it.

3.4 Preserving the consistency

The validity of the PPN instances is a necessary but not a sufficient condition to preserve the PPN


Page 265: ADCOM 2009 Conference Proceedings

consistency when changing parameter values at run time. A valid set of parameters corresponds to a valid (and consistent) PPN instance. However, the transition from one valid instance to another valid instance at an arbitrary point may violate the consistency of the instances and the PPN execution. In order to transfer new values for parameters to a process of the PPN at run time, i.e., to select a new PPN instance, we use control channels with a FIFO organization and a blocking read/write synchronization mechanism. In addition, we define the following three conditions which are sufficient to preserve consistency when changing parameter values dynamically at run time.

C1: Parameter sets have to correspond to valid network instances.

C2: A valid parameter set has to initiate a network instance execution.

C3: Processes may read new parameters from a valid set (corresponding to the selection of a new valid network instance) only after they have completed a process execution cycle.

In other words, parameter values may be changed (reconfiguration may take place) either before or after an execution cycle of the processes. This is taken into account by the proposed execution cycle of a process, illustrated in Figure 2(b). Note that the defined conditions are valid only for consistent PPN instances. Therefore, a consistency check of a PPN instance is required, either at design time or at run time. In our approach, a consistency check is performed at design time, since everything about the execution of a PPN is known. For more details about the defined conditions and the approach to deal with dynamic parameters at run time, we refer to [7], where the presented approach has been generalized for the SBF MoC [5].

4 Implementation

We consider that reconfiguration is applied on HW IP cores integrated in an MPSoC generated by Espam [8, 9]. To integrate an IP core, Espam generates a HW Module (HM) around an IP core taken from a library. To describe how reconfiguration based on parameter values is realized with respect to the previously defined conditions, we explain the structure of a HM, shown in Figure 3. For additional details about HW IP core integration with Espam, we refer to [8]. The processes in our PPNs always have the same structure. It reflects the KPN operational semantics, i.e., read-execute-write using a blocking read/write synchronization mechanism. Therefore, a HW Module realizing a process of a PPN has a similar structure,

Figure 3. HW Module top-view: READ, EXECUTE (IP core), and WRITE blocks coordinated by a CONTROL block, connected to input and output FIFOs through the Exist/Read, Enable/Valid, Conf/Done, and Full/Write signal interfaces.

shown in Figure 3, consisting of READ, EXECUTE, and WRITE blocks. The READ and WRITE blocks constitute the communication part of a HM. A set of input data ports belongs to the read unit, and a set of output data ports belongs to the write unit. The number of input/output ports is equal to the number of the edges going into (respectively, out of) the process of a PPN. The read unit is responsible for getting data from the proper channels (FIFOs) at each iteration. The write unit is responsible for writing the result to the proper channels (FIFOs) at each iteration. Selecting a proper channel at each iteration means following a local schedule incorporated into the read and write units. These local schedules are extracted from the PPN specification automatically by the Espam tool.

The EXECUTE block of a HW Module (HM) is actually the dedicated HW IP core to be integrated. It is not generated by Espam but is taken from a library. In order to be incorporated into a HW Module, an IP core has to provide an Enable/Valid control interface. The Enable signal is a control input to the IP core which allows the core to run when there is data to be processed. If input data is not available, or there is no room to store the output of the IP core in the output FIFO channels, then Enable is used to suspend the operation of the IP core. The Valid signal is a control output signal from the IP core used to indicate whether the data on the IP outputs is valid and ready to be written to an output FIFO channel. In addition, the IP core also has to provide an interface for accepting configuration information, illustrated by the Conf/Done signals in Figure 3.

A CONTROL block is added to capture the process behavior, e.g., the number of process firings, and to synchronize the operation of the other three blocks. CONTROL also implements the blocking read/write synchronization mechanism using the Exist/Read and Full/Write signals. Another function of the CONTROL block is to allow the parameter values to be set/modified from outside the HW Module at run time. Below, we present how the CONTROL block


Page 266: ADCOM 2009 Conference Proceedings

implements the reconfiguration process such that the previously defined conditions are respected.

4.1 Respecting the conditions

Recall that the defined conditions are taken into account by the proposed execution cycle of a PPN process, shown in Figure 2(b). Therefore, to respect the conditions and to preserve the consistency of our PPNs, the CONTROL block of a HW Module (see Figure 3) implements this execution cycle.

In the beginning, the CONTROL block reads parameter values from the corresponding control FIFO channels. If data has not been written, the control block stalls, waiting for it. The correctness of the parameter values (i.e., the configuration data) has to be guaranteed (condition C1) by the module generating them. Thus, the combined writing of parameter values and the reading of these parameters by the control block respects condition C2, because only a valid parameter set will cause a PPN process to initiate an execution cycle and, consequently, an execution of a network instance. After reading the control data (e.g., iteration domains and information for configuring the IP core), the CONTROL block initiates an execution cycle. First, it performs an IP (re)configuration, if required, as well as setting control information in the READ and WRITE blocks. After IP core (re)configuration is completed (indicated by signal 'Done'), the control block uses the 'Exist/Read', 'Enable/Valid', and 'Full/Write' interfaces (see Figure 3) to control the execution (cycle) of the HW Module. The end of the cycle is reached when the READ and WRITE blocks have performed all required read and write operations. This is indicated by the corresponding 'Done' signals. After that, the control block is free to initiate another execution cycle (respecting condition C3), i.e., to read new configuration data from the control channels and to repeat the steps described above.
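Expressed as software, the CONTROL block's behavior corresponds to a loop of the following shape (a sketch of our own with stub functions standing in for the FIFO and IP-core interfaces; the real block is hardware inside an Espam-generated HW Module):

#include <stdio.h>
#include <stdbool.h>

/* Software sketch of the CONTROL block's cycle. Stubs stand in
 * for the control FIFOs, the READ/WRITE blocks, and the IP core. */
typedef struct { bool reconfigure_ip; int iterations; } params_t;

static int cycles_left = 2;                 /* run two execution cycles, then stop */

static params_t read_control_fifos(void)    /* blocking read of a valid set (C1, C2) */
{
    params_t p = { .reconfigure_ip = true, .iterations = 4 };
    return p;
}
static void reconfigure_ip_core(void)  { puts("reconfigure IP, wait for Done"); }
static void program_read_write(int n)  { printf("program READ/WRITE for %d firings\n", n); }
static void run_execution_cycle(int n) { printf("execute %d read-execute-write firings\n", n); }

int main(void)
{
    while (cycles_left--) {
        params_t p = read_control_fifos();   /* stalls until parameters are written */
        if (p.reconfigure_ip)
            reconfigure_ip_core();           /* only between cycles: a safe point (C3) */
        program_read_write(p.iterations);
        run_execution_cycle(p.iterations);   /* ends when READ/WRITE signal Done */
    }
    return 0;
}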

4.2 Discussion

By using FIFO control channels with a blocking synchronization mechanism, we keep the KPN semantics of our polyhedral process networks with dynamic parameters, i.e., we have the capability to control the execution without changing the model. Keeping the KPN model means that the deterministic behavior of our PPNs with dynamic parameters is preserved. The FIFO organization of the control channels and the blocking synchronization mechanism (the KPN semantics) keep the right order of selecting new network

instances, i.e., the order in which the parameter sets are generated outside the network and written to the control channels. Since new parameter values are read by the processes after performing an execution cycle, parameter values selecting alternative PPN instances may be written to the control channels while a PPN instance is being executed. In addition, the proposed mechanism allows the processes to read the parameter values independently of each other without violating the conditions defined for preserving the consistency.

Our approach to run-time reconfiguration is applied at two levels: high-level (no FPGA reconfiguration), by setting control registers, and low-level, by reconfiguring the FPGA logic. Since we consider a fixed communication topology, the READ and WRITE units are reconfigured by just writing data to control registers, e.g., the amount of data to be communicated and the particular communication patterns to read/write from/to different FIFO channels. Dynamic partial reconfiguration is applied only on the IP core of a HW Module.

From a design-complexity perspective, the proposed approach of using PPNs with dynamic parameters to capture (run-time) reconfiguration information and to target reconfigurable MPSoC implementations contributes to a simplified (low-level) design effort because:

1. By using the defined conditions and the control FIFOs, explicit handshaking (between processes) is eliminated. In addition, a reconfigurable IP core has to set only a "Done" signal to the CONTROL block after reconfiguration;

2. During the reconfiguration process, the dataflow FIFOs used for communication between the dynamic modules ensure proper operation of the static portion of the design.

5 Conclusions

In this paper, we proposed a general and technology-independent approach for the modeling and implementation of run-time execution management for applications modeled as polyhedral process networks (PPNs) and targeting reconfigurable computing. Based on the characteristics of the PPN formal model of computation, we proposed conditions which define "safe" points when reconfiguration can occur. The main contribution of the presented work is that it guarantees consistent executions of reconfigurable implementations. In addition, the FIFO communication and synchronization mechanism of the polyhedral process networks simplifies design efforts and facilitates automated implementations.


Page 267: ADCOM 2009 Conference Proceedings

References

[1] Daedalus, a system-level design methodology and toolflow, http://daedalus.liacs.nl/.
[2] P. Feautrier, "Automatic parallelization in the polytope model", in The Data Parallel Programming Model, volume 1132 of LNCS, pages 79-103, 1996.
[3] M. Herlihy, "The multicore revolution", in 27th FSTTCS: Foundations of Software Technology and Theoretical Computer Science, pages 1-8, 2007.
[4] G. Kahn, "The Semantics of a Simple Language for Parallel Programming", in Proc. IFIP Congress 74. North-Holland Publishing Co., 1974.
[5] B. Kienhuis and E. Deprettere, "Modeling stream-based applications using the SBF model of computation", Journal of VLSI Signal Processing, 34(3), July 2003.
[6] E. A. Lee, "The Problem with Threads", IEEE Computer, 39(5):33-42, 2006.
[7] H. Nikolov and E. Deprettere, "Parameterized Stream-Based Functions Dataflow Model of Computation", in 6th Int. Workshop on Optimizations for DSP and Embedded Systems (ODES-6), Boston, USA, April 2008.
[8] H. Nikolov, T. Stefanov, and E. Deprettere, "Automated Integration of Dedicated Hardwired IP Cores in Heterogeneous MPSoCs Designed with ESPAM", EURASIP Journal on Embedded Systems, vol. 2008, Article ID 726096, 15 pages, 2008. doi:10.1155/2008/726096.
[9] H. Nikolov, T. Stefanov, and E. Deprettere, "Systematic and automated multiprocessor system design, programming, and implementation", IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 27, March 2008.
[10] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere, "Daedalus: Toward composable multimedia MP-SoC design", in Proc. 45th ACM/IEEE Design Automation Conference (DAC'08), pages 574-579, Anaheim, USA, June 2008.


Page 268: ADCOM 2009 Conference Proceedings

REDEFINE: Optimizations for Achieving High Throughput

Keshavan Varadarajan#, Ganesh Garga#, Mythri Alle#, Alexander Fell#, Ranjani Narayan‡, S K Nandy#‡
#CAD Lab, SERC, Indian Institute of Science, Bangalore, India
‡Morphing Machines, Bangalore, India

Abstract—REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware platform from a general-purpose processor by eliminating certain characteristics, such as the register file and the bypass network. We instead allow explicit write backs to the reservation stations, as in a Transport Triggered Architecture (TTA), but use a dataflow paradigm unlike TTA. The compiler and hardware requirements for such a reconfigurable platform are detailed. A performance comparison of REDEFINE with a GPP yields a 1.91x improvement for the SHA-1 application. The performance can be improved further through the use of custom IP blocks inside the compute elements; this yields a 4x improvement in performance for the Shift-Reduce kernel, which is a part of the field multiplication operation. We also list other optimizations to the platform that improve its performance.

Index terms—REDEFINE, Explicit Transport, HyperOp, Custom FU, Fused HyperOps, Fused HyperOp Pipeline

I. INTRODUCTION

The embedded hardware segment comprises three kinds of silicon offerings, namely Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) and General Purpose Embedded Cores (viz. ARM, Power). Each of these solutions performs differently with regard to efficiency metrics. Efficiency is primarily measured in terms of application throughput and the energy dissipated. ASICs, being highly application specific, deliver high throughput with very low energy dissipation. Due to high non-recurring engineering costs, however, ASICs can be deployed only in high-volume segments. They are unsuitable where changes in standards require changes to the solution, since that would require redesign and redeployment; redeployment may not be possible in all situations. General Purpose Processors (GPPs), on the other hand, deliver lower throughput with higher energy dissipation. However, their applicability to all domains makes them cost-effective to design and manufacture. GPPs are also unsuitable in environments where strict energy constraints are placed on the system. A via media solution is to use a reconfigurable solution such as a Field Programmable Gate Array. Reconfigurable solutions are composed of combinational and sequential circuit elements that are composed to obtain the desired functionality. This solution does not offer the performance of an ASIC, but is flexible enough to emulate a wide range of hardware circuits. The increased scope of application

renders it more cost-effective to design and manufacture. However, FPGAs can be used only if the application (or part of the application) can be placed completely on the FPGA fabric. While this restriction is relaxed with changing technology, which enables more compute and storage requirements to be placed on chip, FPGAs cannot accommodate all applications. This led to the creation of a new class of hardware solutions called runtime reconfigurable solutions. Unlike reconfigurable solutions, runtime reconfigurable solutions are designed for reconfiguration at runtime with reduced overheads, to enable an application to be divided into several sub-tasks for their piece-wise execution. In order to render hardware amenable to runtime reconfiguration, we need to equip the hardware with the ability to switch between configurations with very short latencies, where a configuration includes a sequence of one or more operations. In the case of a GPP, an instruction represents such a configuration, and instructions can be loaded quickly from the instruction memory onto the processor for execution. The GPP undergoes reconfiguration every clock cycle and performs a different set of operations constituting an instruction. However, in order to achieve such reconfiguration, a GPP has a pre-built set of mathematical operations implemented, with all other transformations having to be expressed in terms of these elementary operations. Due to the lower granularity of these operations, the number of operations to be executed increases, causing an increase in execution time as compared to an ASIC. The mathematical operations supported by a processor may not always be amenable to an optimal realization of a given function. In such cases, the FPGA-based combinational circuit building blocks serve as a better means to represent the function. An instance of this is provided in Section VI. Thus a GPP, representing the ideal solution in terms of reconfigurability, serves as a good starting point to develop a runtime reconfigurable hardware solution. We try to eliminate some known inefficiencies in the GPP in order to improve the performance. One of the primary sources of inefficiency is the write back to the register file from the functional unit, as observed by Henk Corporaal [1]. He proposed the use of explicit transports as a means to avoid unnecessary write backs. This involves exposing the interconnect to the compiler infrastructure. A modern processor pipeline includes several stages; the execute and write back stages are shown in Figure 1. The execute stage includes various functional units along with their reservation stations, and the writeback stage includes the



Page 269: ADCOM 2009 Conference Proceedings

register file. These are connected over a bypass network. The bypass network enables distribution of the result operand to all operations that are waiting on this result, so that they can proceed to execution. The bypass network uses a broadcast mechanism to distribute the result operand to all dependent operations. The use of broadcast is a worst-case design, since it assumes that there might be operations in all functional units that are awaiting this result. Our analysis indicates that in most cases a result has a single consumer, and 98% of the results are consumed by at most 3 operations, when measured across several kernels including IDCT, deblocking filter, CAVLC, FIR and FFT. The cumulative density plots are shown in Figure 2 and Figure 3. Also, the bus-based interconnect used in modern superscalar processors [2] is not scalable. However, the use of dataflow computing inside the execute stage makes the design resilient to delays that can be encountered, such as Load-Store delays. We modify the design of the execute and write back stages as shown in Figure 4. The following changes were performed:

• The register file was eliminated, since it is a major source of contention.

• The registers of the reservation stations were made addressable, to enable explicit writes to them.

• The bus-based bypass network was replaced with a 2D interconnect. In this case, we chose a Honeycomb network, since it has the lowest degree per node [3].

Figure 2. Plot showing the cumulative density for different numbers of destinations of an operation, obtained for various kernels.

Figure 3. Plot showing the average cumulative density for different numbers of destinations of an instruction.

In this modified architecture, the results are directly written back to the reservation stations of the destinations. The implementation of such explicit write backs involves addressing the following issues:

• In order to perform explicit transports, the compiler needs to determine all data dependences and issue explicit data transfers between them.

Figure 4. The modified execute and writeback stages: a tile is a Compute Element (CE) comprising a router, an ALU, and a reservation station. The register file has been eliminated and the bypass network is a honeycomb structure.

Figure 1. Functional units (along with reservation stations) and the register file connected through the bypass (bus) network.


Page 270: ADCOM 2009 Conference Proceedings

• Explicit write back can be implemented only if all dependent instructions are in one of the reservation stations. Since this cannot be guaranteed, an external register file is essential to store those operands that cannot be consumed immediately.

• Explicit write back also requires the source operation to know the placement of the destination operations. Dynamic assignment of reservation stations and slots within them increases the complexity of the hardware.

Our implementation of these requirements is elucidated in the subsequent sections. In Transport Triggered Architectures, these requirements were implemented in the context of VLIW processors, whereas we approach the problem in the context of the dataflow paradigm; a toy illustration of the resulting execution model follows. The FPGA perspective of runtime reconfiguration is presented in Section III. Section IV compares our solution with other solutions. In Section V, we present a performance comparison between REDEFINE and a GPP. In the subsequent section, Section VI, we indicate how the performance can be improved through the use of custom IP blocks. Other possible improvements, identified through an analysis of the time spent in various stages, are presented in Section VII. The conclusions and scope for future work are presented in Section VIII.
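The toy C sketch below (ours; REDEFINE's actual transport metadata and NoC protocol are more involved) illustrates the execution model: each operation waits in a reservation-station slot until its operands arrive, then fires and writes its result directly to its consumers' slots, with no shared register file:

#include <stdio.h>
#include <stdbool.h>

/* Toy illustration of explicit transports in a dataflow fabric. */
#define NOPS 3

typedef struct {
    char op;                   /* '+' or '*' */
    int  operand[2];
    bool present[2];
    int  dest_op, dest_slot;   /* explicit transport target (-1: output) */
} rs_slot_t;

static rs_slot_t rs[NOPS] = {
    { '+', {0,0}, {false,false}, 2, 0 },   /* op0: a+b -> op2, slot 0 */
    { '*', {0,0}, {false,false}, 2, 1 },   /* op1: c*d -> op2, slot 1 */
    { '+', {0,0}, {false,false}, -1, 0 },  /* op2: sum -> output      */
};

static void deliver(int op, int slot, int value)
{
    if (op < 0) { printf("result = %d\n", value); return; }
    rs[op].operand[slot] = value;          /* explicit write back to a slot */
    rs[op].present[slot] = true;
}

static void try_fire(int i)
{
    rs_slot_t *s = &rs[i];
    if (!(s->present[0] && s->present[1])) return;   /* operands not ready */
    int r = (s->op == '+') ? s->operand[0] + s->operand[1]
                           : s->operand[0] * s->operand[1];
    deliver(s->dest_op, s->dest_slot, r);            /* forward to consumer */
}

int main(void)
{
    deliver(0, 0, 2); deliver(0, 1, 3);       /* a=2, b=3 */
    deliver(1, 0, 4); deliver(1, 1, 5);       /* c=4, d=5 */
    try_fire(0); try_fire(1); try_fire(2);    /* (2+3) + (4*5) = 25 */
    return 0;
}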

II. REDEFINE: A RUNTIME RECONFIGURABLE ARCHITECTURE

In order to specify explicit transports between two operations, the compiler, RETARGET [4], constructs a dataflow graph from the SSA representation of the application specification (written in the C language) generated by LLVM [5]. LLVM transforms the application into SSA form based on a virtual instruction set architecture (VISA). The operations in the VISA are simple and non-orthogonal. The application is subdivided into smaller units called HyperOps [4]. HyperOps are constructed by grouping together basic blocks such that a total order of HyperOps can be constructed for execution. Explicit transports can be performed between operations of a HyperOp. These explicit transports are transformed into Transport Metadata that is interpreted by the compute element in order to forward results to their destinations. Explicit transports are facilitated through the use of a Network on Chip [6]. Any data communication across HyperOps, i.e., inter-HyperOp data

traffic, is facilitated through the Inter-HyperOp Data Forwarder (Figure 5; [7]), and data is stored in the global wait match unit, which is a part of the Scheduler. For performing transports, the operations of a HyperOp need to know the exact placement of the dependent operations and their position on the reconfigurable fabric. The compiler performs virtual binding of all operations, i.e., each operation assumes that it is placed at location¹ (0,0). The transport directives to dependent operands are determined relative to the current location [8]. The routers support routing based on relative addresses. This arrangement makes the HyperOps relocatable: an operation can be placed at any location as long as the related operations, i.e., operations that supply data to the said operation, or operations that receive data from the said operation, are at offsets that are predetermined by the compiler. The exact location where a HyperOp will be placed is determined at run time by the Resource Binder (Figure 5). A brief description of the various modules shown in Figure 5 is provided below:

• The reconfigurable fabric consists of compute elements

(CE). The CEs include an ALU along with its reservation station and router; a tile comprises a CE and a router. The ALU supports a subset of the operations present in the LLVM VISA. The interconnect employed is a toroidal honeycomb structure [9]. The reservation stations determine which of the ready operations will fire². In our current implementation (Figure 5), 64 tiles are interconnected in an 8x8 configuration, with 12 access routers along the periphery. The access routers provide connectivity to the HyperOp Launcher, the Inter-HyperOp Data Forwarder and the Load-Store Units that are used to access the data memory banks. The access routers serve as gateways of communication between the fabric and the external logic that drives it.

• The HyperOp Launcher is the hardware unit responsible for transferring the compute and transport metadata to the fabric, along with any data and constant operands. The compute metadata specifies the operation to be executed. The launcher is connected to an instruction memory that is split into 5 banks, which enables the parallel transfer of 5 operations between the instruction memory and the HyperOp Launcher.

• The Resource Binder, as described previously, determines the exact location on the fabric where the HyperOp is to be placed. It also keeps track of the busy and unused tiles on the fabric.

• The Scheduler hosts the global wait match unit that serves as the external register file, where data exchanged between HyperOps is stored. Whenever all input operands of a HyperOp instance are available, it is considered for launching on the fabric for execution. The HyperOps thus chosen are then

¹ The position of each CE on the fabric is specified by an ordered pair which gives the position as an offset from the origin along the x and y directions.
² Static scheduling can be used in place of dynamic scheduling; however, since an NoC is employed, dynamic scheduling can schedule other instructions in case of network delays.

Figure 5. Schematic block diagram of REDEFINE: the HyperOp Launcher, Inter-HyperOp Data Forwarder, Resource Binder, and Scheduler (hosting the Global Wait Match Unit) surround the Reconfigurable Fabric.


Page 271: ADCOM 2009 Conference Proceedings

forwarded to the Resource Binder where an appropriate location on the fabric is determined. The HyperOp Launcher transfers compute metadata, transport metadata, constant operands and input operands to the HyperOps.

• The input operands of a HyperOp are not placed at a predetermined position. This necessitates a lookup table that indicates the position in the global wait match unit where the input operands are placed. The Inter-HyperOp Data Forwarder is also responsible for storing loop invariants and employs a sticky token store for this purpose [10].

III. RUNTIME RECONFIGURATION: AN FPGA PERSPECTIVE

FPGAs are composed of Look Up Tables (LUTs) interconnected by a programmable interconnect. The LUTs emulate the behavior of a combinational circuit: the truth table of the circuit is programmed into the LUTs, and more complex combinational circuits are realized by interconnecting LUTs. Programming an FPGA thus involves transferring the truth tables for each LUT and the programming bits for the interconnect, in order to set up paths between communicating LUTs. The fine-grained structure of the LUTs and the programmable interconnect has both advantages and disadvantages. The fine-grained nature of the LUTs makes the FPGA amenable to realizing combinational circuits, and the programmable interconnect renders data transport overhead free, owing to pre-established paths. The primary disadvantage of the fine-grained structure is the high latency incurred in programming the LUTs and the interconnect: an FPGA has a reconfiguration time of the order of milliseconds at best, while a GPP can reconfigure itself every clock cycle. Due to the large latencies involved in reconfiguration, FPGAs are not amenable to runtime reconfiguration. More recently, FPGA vendors have been supporting partial reconfiguration so as to reduce the runtime overhead. However, this alone does not address the problem, as the ratio of the configuration size for an application to the size of its source specification remains quite high. In order to bring down the reconfiguration latency in REDEFINE, we replaced the LUTs with more coarse-grained functional units (viz. adder, shifter), akin to a GPP. This ensures that the amount of information needed to program the Compute Elements (CEs) is quite low (just an opcode). The programmable interconnect is replaced by a packet-switched Network on Chip, trading reconfiguration overhead for hardware complexity. The routers in the NoC have embedded routing logic, which determines the path taken to transfer data from source to destination.

IV. RELATED WORK

Several solutions have been proposed that try to address the power-performance tradeoffs of embedded hardware. Modern embedded processors, viz. the PowerPC, come with a host of domain-specific accelerators that help improve performance while incurring a lower energy overhead for the accelerated application (when compared to a GPP). The recently released Intel Nehalem too has several onboard application-specific accelerators. On the other hand, FPGA vendors ship a general purpose core alongside the FPGA fabric: the core executes software code, while the fabric provides hardware acceleration. Stretch Inc explores a similar solution [11], embedding an FPGA fabric alongside a Tensilica core to support Instruction Set Extensions at the post-fabrication stage. Molen [12] uses a similar hardware fabric as Stretch; however, the compiler for Molen supports C to RTL transformation to automatically program the embedded FPGA. All these solutions help in exploiting the best features of both GPPs and FPGAs. However, none of them reduces the time to reconfigure the hardware platform, which is a critical requirement for runtime reconfigurable architectures. The requirement of lower reconfiguration time has been addressed in several hardware solutions, viz. NEC-DRP and DAP-DNA [13]. These architectures employ reconfigurable functional units and ALUs in place of LUTs. However, they continue to use the programmable interconnect, as in the FPGA. In order to reduce the runtime reconfiguration overhead, multiple configuration planes are used; such a configuration is useful only if the execution times of the application subtasks are sufficiently large to hide the configuration load latencies. The hardware structure shown in Figure 5 is akin to several recently proposed general-purpose architectures, viz. RAW [14], TRIPS [15] and Wavescalar [16]. In the RAW processor, several MIPS cores are repeated in space and interconnected by a 3-level mesh interconnect. The TRIPS and Wavescalar processors use function units and ALUs in place of a full core. All these processors are geared towards better exploitation of available thread level parallelism, whereas our solution is intended for kernel execution acceleration. This difference in emphasis leads to a completely different utilization of resources on the fabric. The requirement for explicit transports stated in section I can be implemented in several ways. Our solution employs explicit transports within a dataflow paradigm. In Transport-Triggered Architectures (TTA) [1] this was achieved in the context of VLIW processors: the compiler computes the explicit transports, and the register file too is made a functional unit into which data can be explicitly transferred. Due to the VLIW nature of the machine, the locations of the operations are known a priori. The technique of explicit transports is amenable to application in other architectural paradigms as well. The primary reason to adopt the dataflow paradigm in REDEFINE, as elucidated in the previous section, is to work around network delays. Techniques such as dynamic scheduling [17] could likewise have been employed in the context of VLIW processors to work around nondeterministic NoC delays.


Page 272: ADCOM 2009 Conference Proceedings

V. COMPARING PERFORMANCE WITH A GPP

This section presents the impact on performance of the architectural changes in REDEFINE as compared to a GPP. The SHA-1 hashing function is executed on both REDEFINE and a GPP. SHA-1 is the most widely used cryptographic hash function. Cryptographic hash functions compute a hash (message digest) of a message, such that a change in the message causes a completely different hash to be generated; this facilitates detection of message modifications. The results of the SHA-1 run are available in Table 1. The GPP run was performed on a Pentium 4 running at 2.26GHz, and the number of cycles was measured using Intel VTune, as described in [7]. The REDEFINE cycle count was obtained through a cycle-accurate simulation³ for an 8x8 fabric of tiles. A computation granularity of 32 bits is used in both the GPP and REDEFINE.

Table 1. Comparison of performance of the SHA-1 function in REDEFINE vs a GPP.

                      REDEFINE            General Purpose Processor
Execution Cycles      111777              21746290
Operating Frequency   100 MHz (@130nm)    2.26 GHz (@90nm)
Total Time Taken      1.118 ms            1.071 ms⁴

A. Analysis of Results

REDEFINE takes 4.2% more time (in seconds) to execute the same program. However, the reported execution time does not take into account the technology node at which each processor was synthesized. After technology normalization, we find that REDEFINE performs 1.91x better than the Pentium processor. This improvement in performance can be attributed to the use of the LLVM compiler [5] (which is known to perform about 30% better for the said application compared to gcc [18]), the use of RISC operations (as opposed to CISC operations in the Pentium 4 processor), the elimination of loads and stores for scalars, and the reduction in register writebacks. In REDEFINE, the average number of CEs used across several HyperOp executions is ~2.7, with a peak utilization of 7 CEs. Thus the effects of variable issue width, as reported in [19], are also seen, whereas the Intel Pentium is a fixed issue width processor. The SHA-1 function implemented in an ASIC can support up to 116 Mbps, compared to ~5 Mbps supported by REDEFINE; REDEFINE therefore performs ~20x worse than an ASIC.

³ A SystemC/C++ based simulator developed in-house.
⁴ The time computed from the cycle count and operating frequency does not match the time reported in milliseconds; the results reported by Intel VTune are reproduced directly.
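As a sanity check (ours, not from the paper), the REDEFINE entry in Table 1 follows directly from the cycle count and operating frequency; per footnote 4, the GPP entry is the VTune-reported time rather than this quotient:

$$t_{\mathrm{REDEFINE}} = \frac{N_{\mathrm{cycles}}}{f} = \frac{111777}{100 \times 10^{6}\,\mathrm{Hz}} \approx 1.118\,\mathrm{ms}$$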

VI. IMPROVING PERFORMANCE THROUGH CUSTOM FUs

The REDEFINE architecture described thus far contains ALUs that are quite general in nature. In order to improve the throughput and bring it as close as possible to that of an ASIC, it is essential to integrate custom IP blocks in the CEs so as to accelerate the computation. However, these IP blocks need to be domain-specific, as opposed to application-specific, so as to avoid the pitfalls of an ASIC. We consider the example of a field multiplier to illustrate this. Field multiplication is an important constituent of several cryptographic kernels, viz. ECDSA and ECC. Field multiplication, unlike normal multiplication, involves a sequence of Shift-Reduce operations (as opposed to a shift in normal multiplication) and XOR operations (as opposed to an add in normal multiplication). The Shift-Reduce operation is compute intensive and is a good candidate for custom FU based acceleration. The design of the CE that contains the custom FU is shown in Figure 6. The results of the runs without and with the Custom Function Unit are shown in Table 2. The performance of REDEFINE with the custom FU is about 4x better than without it. A similar improvement in performance is seen in the context of FFT, as reported in [7].

Table 2. Comparison of execution time of field multiplication without and with Custom FU.

                                   Without Custom FU    With Custom FU
Execution Time (in clock cycles)   1692                 424

VII. MODULE-WISE BREAK-UP OF EXECUTION LATENCIES

While the compute units and fabric contribute a large part of the execution delay, nearly an equal amount is contributed by the Support Logic (composed of the Scheduler, Resource Binder, HyperOp Launcher and Inter-HyperOp Data Forwarder; Figure 5). In Table 3, we tabulate the time spent in these blocks for the SHA and Shift-Reduce functions. The time shown in Table 3 is the average of the time spent in the Support Logic across all executing HyperOps. The Support Logic overhead spans the interval from the arrival of the last operand of a HyperOp at the Inter-HyperOp Data Forwarder up to the selection of the first operation for launch, after the

Figure 6. Modified Compute Element with Custom Functional Unit (ALU, Custom FU, Reservation Station, Transporter to Router).


Page 273: ADCOM 2009 Conference Proceedings

opcodes and data operands are transferred onto the fabric by the HyperOp Launcher. It should be noted that the Inter-HyperOp Data Forwarder continues to receive data even during normal execution, and thus data transport is overlapped with computation. Similarly, the HyperOp Launcher continues to transfer data operands even after the first operation is ready for launch. Of the time spent in the Support Logic, the maximum is spent in the HyperOp Launcher. The Inter-HyperOp Data Forwarder, Scheduler and Resource Binder take fixed time periods to process a HyperOp. However, the time taken by the HyperOp Launcher depends upon the number of operations, input operands and constant operands that need to be transferred, and the position of the identified CEs on the fabric. Since the HyperOp Launcher can transfer data only through the ports along the periphery of the fabric, it incurs different latencies for different CEs. This limits the scalability of the fabric; beyond a certain size, 3-D structures may become essential to support scaling while not incurring high launch delays. The HyperOp Launcher latency is also affected by routing delays due to network congestion, since it employs the NoC to transfer data to the identified CEs.

Table 3. Percentage of execution time spent in Support Logic.

Application                    Percentage time spent in Support Logic
SHA-1                          48.58
Shift-Reduce (w/o Custom FU)   41.49
Shift-Reduce (w/ Custom FU)    48.05

A. Enhancements to Reduce Support Logic Overhead

In order to ameliorate this overhead, several solutions are possible.

• The maximum latencies are incurred in HyperOps that are part of a loop. In these cases the HyperOp's opcodes and constants may be retained on the fabric, and only the new input operands are transferred to the HyperOp after every iteration. These are called persistent HyperOps.

• The persistent HyperOps solution reduces HyperOp launch latency only for frequently executed HyperOps. Another mechanism is to prefetch a HyperOp: the next HyperOp to be launched is statically predicted using compile-time analysis augmented with profiling runs, and this information is used to prefetch HyperOps, so that the launch of the next HyperOp overlaps the execution of the current one.

• In the case of both persistent HyperOps and prefetching, input data needs to be transferred from the Scheduler to the HyperOp Launcher for transfer onto the fabric. In the case of frequently interacting HyperOps, this latency can be eliminated through a Fused HyperOp. In a Fused HyperOp, also referred to as a Custom Instruction in [20], two or more closely interacting HyperOps are merged to form a combined entity in which the inter-HyperOp communication is achieved completely on the fabric. However, since the fabric is a static dataflow machine, it is not possible for one HyperOp to proceed to the next iteration while another HyperOp of the same Fused HyperOp is still executing the previous iteration.

• Further, to improve performance, a pipeline of Fused HyperOps can be formed, referred to as the Fused HyperOp Pipeline; in [20] this was called the Custom Instruction Pipeline. In a Fused HyperOp Pipeline, several instances of the Fused HyperOps are unrolled and data transfer between the instances is accomplished by setting up a pipeline between them. The Fused HyperOp Pipeline is useful for linear communication structures; other applications are known to have different communication structures, as mentioned in [21], and other computation structures may need to be created for them.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the architectural evolution of REDEFINE, a runtime reconfigurable architecture. We used a general purpose processor (GPP) as the starting point for evolving a runtime reconfigurable architecture, since a GPP reconfigures itself every clock cycle. With certain architectural optimizations, i.e. explicit write backs, we were able to achieve a 1.91x performance improvement over a GPP for SHA-1. However, this solution performs 20x worse than an ASIC. To further improve performance, the use of domain-specific custom IP blocks within the CE becomes necessary; runtime reconfigurable hardware with a custom IP block gives a 4x improvement in performance for the Shift-Reduce kernel, the most compute intensive component of field multiplication. Apart from optimizations to the execution fabric, optimizations are also essential in the support logic, viz. the scheduling, placement and launch of application substructures, in order to come close to the performance of an ASIC. In this paper, we have presented only the performance comparison with regard to an ASIC and a GPP. We intend to extend this work to include a detailed power comparison and power-performance comparison with regard to an FPGA. We also intend to develop a complete library of domain-specific custom IPs that can be used within a compute element of the reconfigurable fabric to accelerate cryptographic applications.

REFERENCES

[1] H. Corporaal, Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, 1998.

[2] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, vol. 16, no. 2, pp. 28--40, 1996.

[3] A. N. Satrawala, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "REDEFINE: Architecture of a SoC Fabric for Runtime Composition of Computation Structures," in FPL 2007. International Conference on Field Programmable Logic and Applications., Amsterdam, 2007, pp. 558-561.


Page 274: ADCOM 2009 Conference Proceedings

[4] M. Alle, K. Varadarajan, A. Fell, S. K. Nandy, and R. Narayan, "Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures," in ARC'09 International Workshop on Applied Reconfigurable Computing, vol. Volume 5453/2009, London, U.K, 2009, pp. 204-215.

[5] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in CGO '04: Proceedings of the international symposium on Code generation and optimization, Palo Alto, CA, 2004, p. 75.

[6] N. Joseph, et al., "RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router," in ASAP '08: Proceedings of the 2008 International Conference on Application-Specific Systems, Architectures and Processors, Leuven, Belgium, 2008, pp. 251-256.

[7] A. Fell, et al., "Streaming FFT On REDEFINE-V2: an Application-Architecture Design Space Exploration," in CASES '09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, Grenoble, France, 2009, pp. 127--136.

[8] G. K. Singh, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "A Generic Graph-Oriented Mapping Strategy for a Honeycomb," International Journal on Futuristic Computer Applications, vol. 1, no. 1, pp. xx-xx, 2010.

[9] A. Fell, P. Biswas, J. Chetia, S. K. Nandy, and R. Narayan, "Generic Routing Rules and a Scalable Access Enhancement for the Network-on-Chip RECONNECT," in IEEE International SoC Conference, Glasgow, 2009, pp. xx-xx.

[10] J. R. Gurd, C. C. Kirkham, and I. Watson, "The Manchester prototype dataflow computer," Commun. ACM, vol. 28, no. 1, pp. 34--52, 1985.

[11] Stretch Inc. Stretch S6000 Devices. [Online]. Available: http://www.stretchinc.com/_files/s6ArchitectureOverview.pdf

[12] S. Vassiliadis, et al., "The Molen Polymorphic Processor," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363-1375, Nov. 2004.

[13] Fujitsu Limited, IPflex Inc. (2004, Mar.) IPFlex and Fujitsu Introduce DAP/DNA®-2, the Dynamically Reconfigurable Processor. [Online]. Available: http://www.fujitsu.com/global/news/pr/archives/month/2004/20040317-01.html

[14] M. B. Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, vol. 22, no. 2, pp. 25-35, Mar. 2002.

[15] K. Sankaralingam, et al., "The Distributed Microarchitecture of the TRIPS Prototype Processor," in 39th International Symposium on Microarchitecture (MICRO), Orlando, 2006, pp. 480-491.

[16] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, "WaveScalar," in 36th Annual International Symposium on Microarchitecture (MICRO-36), Washington, DC, USA, 2003, p. 291.

[17] B. R. Rau, "Dynamically scheduled VLIW processors," in MICRO 26: Proceedings of the 26th annual international symposium on Microarchitecture, Austin, Texas, 1993, pp. 80-92.

[18] M. Larabel. (2009, Sep.) GCC vs. LLVM-GCC Benchmarks. [Online]. Available: http://www.phoronix.com/scan.php?page=article&item=apple_llvm_gcc&num=1

[19] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008.

[20] M. Alle, et al., "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 1--48, 2009.

[21] K. Asanovic, et al., "A view of the parallel computing landscape," Commun. ACM, vol. 52, no. 10, pp. 56-67, Oct. 2009.


Page 275: ADCOM 2009 Conference Proceedings

ADCOM 2009 POSTER PAPERS


Page 276: ADCOM 2009 Conference Proceedings

A Comparative Study of Different Packet Scheduling Algorithms with Varied Network Service Load In IEEE 802.16 Broadband Wireless Access Systems

Prasun Chowdhury and Iti Saha Misra
Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India

[email protected] [email protected]

Abstract— In this paper, we conduct a comprehensive performance study of several packet scheduling algorithms found in the literature, such as Strict Priority, WFQ, RR, SCFQ and WRR, under different queue management types (FIFO, RED, RIO, WRED) in the point-to-multipoint mode of OFDM-based WiMAX networks. We focus on comparing the quality of service (QoS) delivered by the various queue scheduling schemes in different specific environments, i.e. under different service loads. Extensive simulation is performed in QualNet 4.5, and the scheme best suited to each particular service load condition is identified.

Keywords- IEEE 802.16; MAC; QoS; Queue Scheduling algorithm; Queue management type

1. Introduction

Despite the numerous scheduling algorithms proposed for WiMAX (Worldwide Interoperability for Microwave Access) [1] networks, there is no comprehensive study that provides a unified platform for comparing such algorithms. The aim of this work is to allow a thorough understanding of the relative performance of representative uplink scheduling algorithms under various queue management types, and subsequently to utilize the results to identify the scheme suited to a specific environment, i.e. a given network load. We compare representative algorithms for uplink traffic over the OFDM WiMAX physical layer using QualNet 4.5.

The remainder of the paper is organized as follows. In section 2, we provide a detailed description of our contribution and the simulations performed in QualNet 4.5. Finally, we conclude the paper in section 3.

2. Our Contribution

2.1 Simulation Setup in QualNet 4.5

We perform our scheduling algorithm comparison in the QualNet 4.5 simulator. The scheduling algorithms were selected and configured through the graphical user interface of QualNet 4.5 to fit each particular environment.

For this purpose we have designed a scenario which is shown in figure 1.

Figure 1. IEEE 802.16 WiMAX scenario in QualNet

In figure 1, we have designed a scenario consisting of one base station and five subscriber stations under one subnet. Each subscriber station has three separate connections to the base station, for the applications UGS, rtPS and nrtPS. We omit the BE service because it supports data streams for which no minimum service level is required and which may therefore be handled on a space-available basis. Each subscriber station is configured with a different IP queue scheduling scheme: SS1 is scheduled with the Strict Priority scheme [1], while SS2, SS3, SS4 and SS5 are scheduled with Weighted Fair Queuing (WFQ) [2], Round Robin (RR) [3], Self-Clocked Fair Queuing (SCFQ) [4] and Weighted Round Robin (WRR) [3] respectively. The IP queue management type of each subscriber station and the subscriber station service load are kept fixed for a particular simulation. In this work, we use four IP queue management types, First In First Out (FIFO) [5], Random Early Detection (RED) [6], RED with In/Out (RIO) [6] and Weighted Random Early Detection (WRED) [7], and three subscriber station service loads, Service Load 1, Service Load 2 and Service Load 3, detailed below. In this way we have carried out a total of twelve simulations in the QualNet 4.5 Animator. More details of these components are available in the QualNet 4.5 Advanced Wireless documentation [6].

The following WiMAX model parameters are used in the configuration:

Radio Type: 802.16 Radio
MAC Protocol: 802.16
BS Frame Duration: 20 ms


Page 277: ADCOM 2009 Conference Proceedings

IP Queue Scheduling: Strict Priority, WFQ, RR, SCFQ, WRR
IP Queue Management Type: FIFO, RED, RIO, WRED
PHY Channel Bandwidth: 20 MHz
The starting times of the simulations are evenly distributed in the interval 0 s – 100 s.

Table 1. Service Load configuration parameters

Service   Load1   Load2   Load3   Interval (sec)   Start time (sec)
UGS       128     256     512     0.01             0
rtPS      512     1024    2048    0.1              0
nrtPS     1024    2048    4096    1                0

2.2 Results and discussions

Figure 2. Average Throughput (bps) comparison of the queue scheduling schemes across Service Loads 1-3.

Figure 3. Average Jitter (sec) comparison of the queue scheduling schemes across Service Loads 1-3.

Figure 4. Average Queuing Delay (sec) comparison for Service Load 1.

Figure 5. Packet loss (%) comparison for Service Load 1.

Figure 6. Queue Service ratio comparison for Service Load 1.

Figure 7. Average queuing delay (sec) comparison for Service Load 2.

Figure 8. Packet loss (%) comparison for Service Load 2.


Page 278: ADCOM 2009 Conference Proceedings

Figure 9. Queue Service ratio comparison for Service Load 2.

Figure 10. Average queuing delay (sec) comparison for Service Load 3.

Figure 11. Packet loss (%) comparison for Service Load 3.

Figure 12. Queue Service ratio comparison for Service Load 3.

For Service Load 1, Service Load 2 and Service Load 3, we observe that the scheduling algorithms SCFQ, RR and WFQ respectively, each combined with the RED queue management type, provide the best QoS support among all the algorithms studied.

3. Conclusion

In the existing literature, many researchers have tried to find flaws in queue scheduling algorithms and have modified them for better QoS support irrespective of the network service load. The present work clearly shows that an algorithm considered good in one specific environment may not provide good QoS in another. It is therefore reasonable to take the network service load into account when modifying packet scheduling algorithms, so that the results are meaningful.

Acknowledgement: The authors deeply acknowledge the support from DST, Govt. of India for this work in the form of the FIST 2007 Project on "Broadband Wireless Communications" in the Department of ETCE, Jadavpur University.

References

[1] K. Wongthavarawat and A. Ganz, "Packet scheduling for QoS support in IEEE 802.16 broadband wireless access systems", International Journal of Communication Systems, vol. 16, issue 1, pp. 81-96, February 2003.

[2] N. Ruangchaijatupon, L. Wang and Y. Ji, "A Study on the Performance of Scheduling Schemes for Broadband Wireless Access Networks", Proceedings of the International Symposium on Communications and Information Technology, pp. 1008-1012, October 2006.

[3] C. Cicconetti, A. Erta, L. Lenzini and E. Mingozzi, "Performance Evaluation of the IEEE 802.16 MAC for QoS Support", IEEE Transactions on Mobile Computing, vol. 6, no. 1, pp. 26-38, January 2007.

[4] Byung-Hwan Choi and Hong-Shik Park, "Rate Proportional SCFQ Algorithm for High-Speed Packet-Switched Networks", ETRI Journal, vol. 22, no. 3, September 2008.

[5] Fei Li, "Fairness Analysis in Competitive FIFO Buffer Management", IEEE International Performance, Computing and Communications Conference (IPCCC 2008), 2008.

[6] QualNet 4.5 Advanced Wireless Model Library, March 2008. http://www.scalablenetworks.com, http://www.qualnet.com

[7] Ming-Jye Sheng and Thomas Mak, "Analysis of Adaptive WRED and CBWFQ Algorithms on Tactical Edge", IEEE, 2008.


Page 279: ADCOM 2009 Conference Proceedings

A Simulation Based Comparison of Gateway Load Balancing Strategies in Integrated Internet-MANET

Rafi-U-Zaman, Khaleel-Ur-Rahman Khan, M.A. Razzaq (Department of CSE, M.J.C.E.T, Hyderabad, India)
A. Venugopal Reddy (Department of CSE, UCE, O.U., Hyderabad, India)
rafi.u.zaman, [email protected], [email protected], [email protected]

Abstract

The interconnection of a wireless mobile ad hoc network with the wired Internet is called Integrated Internet-MANET. This interconnection is facilitated through gateways which act as bridges between the heterogeneous networks. Load balancing of these gateways is a critical issue in Integrated Internet-MANET. In this paper, two gateway load balancing strategies for Integrated Internet-MANET are proposed, based on the load balanced routing protocols WLB-AODV and Modified-AODV. The proposed strategies have been simulated using the ns-2 simulator. The simulation results indicate that the strategy based on WLB-AODV performs better than the one based on Modified-AODV.

1. Introduction

Wireless mobile ad hoc networks (MANETs) are infrastructure-less networks. Various protocols have been proposed to perform routing within an ad hoc network [1]. To extend its usefulness, a MANET needs to be connected to the Internet; we call such an interconnected network an Integrated Internet-MANET. A review of strategies for Integrated Internet-MANET can be found in [2]. Such strategies make use of gateways to interconnect the ad hoc network to the Internet. It is observed that the problem of gateway load balancing has not been adequately addressed in these strategies. The few strategies that exist in the literature for gateway load balancing [3-7] make use of traditional ad hoc routing protocols like DSDV and AODV; none of them uses a specialized load-balanced routing protocol.

In this paper, we present an extended version of the AODV routing protocol, called the Weighted Load Balanced AODV (WLB-AODV) routing protocol. In the second part of the paper, two gateway load balancing strategies are presented, one based on WLB-AODV and the other based on Modified-AODV [8]. Based on a simulation study, it is observed that the proposed strategy based on WLB-AODV outperforms the one based on Modified-AODV.

2. Weighted Load Balanced – Ad Hoc On-Demand Distance Vector (WLB-AODV) Routing Protocol

Modified-AODV is a modified version of AODV wherein the Aggregate Interface Queue Length (AIQL) is used as the path selection criterion instead of hop count. In this mechanism, when an intermediate node receives a Route Request, it adds its interface queue length to the Route Request and forwards it. This aggregation of the queue lengths of all the nodes lying on a path is called the Aggregate Interface Queue Length. The process is repeated until the Route Request reaches the destination. The destination selects the best route based on the AIQL and sends a Route Reply back to the source. Whereas Modified-AODV is based on AIQL alone, the WLB-AODV routing protocol is based on three metrics: hop length (HL), Aggregate Interface Queue Length (AIQL) and Aggregate Routing Table Entry (ARTE).

Hop Length (HL): The distance in number of hops between any two mobile nodes in the mobile ad hoc network. This is the route selection metric used in the original AODV routing protocol.

Aggregate Interface Queue Length (AIQL): Every mobile node maintains a queue of outstanding packets which it has to forward. The longer the queue, the more work that mobile node has to do; hence the queue length of a mobile node reflects its current load, and mobile nodes with longer queue lengths are better avoided. If the queue length exceeds a maximum threshold, any incoming packets will be discarded. The AIQL is the


Page 280: ADCOM 2009 Conference Proceedings

sum of the queue lengths of all mobile nodes lying on a path, as explained in the previous section.

Aggregate of Routing Table Entries (ARTE): Every mobile node maintains a routing table which contains information about valid routes to a set of destinations. If the number of entries is large, the mobile node acts as an intermediate node that knows many destinations and hence will find itself sending more Route Reply messages. Thus, a mobile node with more routing table entries is more likely to be a busy node and is better avoided. The Aggregate Routing Table Entry (ARTE) is the sum of all routing table entries on a route.

In the WLB-AODV routing protocol, a weighted sum of the above three metrics is used in the selection of a route. Every mobile node now maintains the AIQL and ARTE values in its routing table for known destinations, apart from the hop count as in original AODV. The weighted sum is:

Mi = a * HL + b * AIQL + c * ARTE, with a + b + c = 1

where Mi represents the value of the weighted metric for route 'i', and a, b and c represent the weights given to each of the components. A sketch of this route selection is given below.
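As an illustration (a minimal sketch, not the authors' implementation; the candidate routes are hypothetical, while the weights are the values used later in the simulations), route selection under the WLB-AODV metric can be modelled as follows:

```python
# Minimal sketch of WLB-AODV route selection: the destination picks
# the candidate route with the smallest weighted metric
#   Mi = a*HL + b*AIQL + c*ARTE,  with a + b + c = 1.

A, B, C = 0.5, 0.25, 0.25  # weights used in the paper's simulations

def metric(route):
    """route: metrics accumulated by the Route Request along a path."""
    return A * route["HL"] + B * route["AIQL"] + C * route["ARTE"]

# Hypothetical candidate routes gathered from Route Requests.
candidates = [
    {"id": 1, "HL": 3, "AIQL": 40, "ARTE": 20},  # short but heavily loaded
    {"id": 2, "HL": 4, "AIQL": 10, "ARTE": 12},  # longer but lightly loaded
]
best = min(candidates, key=metric)
print("selected route:", best["id"])  # route 2 wins (7.5 vs 16.5)
```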

3. Gateway Load Balancing Strategies for Integrated Internet-MANET using Load Balanced Routing Protocols

The proposed network architecture for gateway load balancing consists of two tiers. The high tier consists of foreign agents and the low tier consists of the mobile nodes which form the mobile ad hoc network, as shown in figure 1.

Figure 1. Network Architecture for Gateway Load Balancing in Integrated Internet-MANET

Two strategies for gateway load balancing are proposed. The first, called Strategy-1, uses the WLB-AODV routing protocol for routing within the mobile ad hoc network. The second, called Strategy-2, uses the Modified-AODV routing protocol. Both strategies have been implemented for the hybrid gateway discovery mechanism.

4. Simulation of the Proposed Strategies

A simulation of the proposed gateway load balancing strategies was carried out to determine their relative effectiveness. The network simulator used was ns-2.31. The simulations were carried out for varying packet sending rates. The simulation parameters common to all the simulations are given in table 1. The constants a, b and c in the metric Mi are given the values a = 0.5, b = 0.25 and c = 0.25.

Table 1. Simulation Parameters

Parameter Name                 Value
Number of mobile nodes         25
Number of source nodes         5
Number of IGW                  3
Number of Correspondent Nodes  3
Topology Size                  1200 x 500 m
Transmission Range             250 m
Traffic Type                   Constant Bit Rate
Mobile Node Speed              20 m/sec
Packet Size                    512 bytes
Pause Time                     5 sec
Mobility Model                 Random Waypoint Model
Carrier Sensing Range          500 m
Simulation Time                900 sec
Packet Sending Rate            5 – 40 packets/sec
Interface Queue Length         50 packets
Advertisement Interval         5 sec
Advertisement Zone             3 hops

The performance is analyzed with respect to the following performance metrics:

Packet Delivery Ratio: the percentage of the number of packets received to the total number of packets sent.

End-to-End Delay: the average overall delay for a packet to traverse from a source node to a destination node.

Normalized Routing Load: the number of routing control packets per data packet delivered at the destination.

Figure 2 shows the comparison of end-to-end delay of the two strategies. It is quite clear that Strategy-1


Page 281: ADCOM 2009 Conference Proceedings

outperforms Strategy-2. This indicates that Strategy-1 successfully chooses less loaded routes, enabling faster delivery of data packets. Figure 3 shows the comparison of the routing load of the two strategies; here a slight, albeit consistent, advantage is observed for Strategy-1. This indicates that Strategy-1 incurs lower routing overhead, with fewer control packet retransmissions, by choosing lightly loaded routes. Figure 4 again establishes the superiority of Strategy-1 over Strategy-2, with a better packet delivery ratio. This is due to the fact that less loaded routes are chosen for data delivery and hence more packets are delivered.

Figure 2. End-to-End Delay as a function of Packet Sending Rate

Figure 3. Normalized Routing Load as a function of Packet Sending Rate

Figure 4. Packet Delivery Ratio as a function of Packet Sending Rate

5. Conclusion

In this paper, two gateway load balancing strategies were proposed, one based on WLB-AODV and the other based on Modified-AODV. Through simulations in ns-2.31, it is observed that the strategy based on WLB-AODV gives performance enhancements over the strategy based on Modified-AODV on the performance metrics End-to-End Delay, Normalized Routing Load and Packet Delivery Ratio.

6. References

[1] E.M. Royer and C-K. Toh, "A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks", IEEE Personal Communications Magazine, pp. 46-55, (1999).

[2] Khaleel Ur Rahman Khan, Rafi U Zaman, A. Venugopal Reddy, "Integrating Mobile Ad Hoc Networks and the Internet: challenges and a review of strategies", Proceedings of IEEE/CREATE-NET/ICST COMSWARE 2008, (2008).

[3] J.H. Zhao, X.Z. Yang and H.W. Liu, "Load-balancing Strategy of Multi-gateway for Ad hoc Internet Connectivity", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), Volume II, pp. 592-596, (2005).

[4] A. Trivino-Cabrera, Eduardo Casilari, D. Bartolome and A. Ariza, "Traffic Load Distribution in Ad Hoc Networks through Mobile Internet Gateways", Proceedings of the Fourth International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks, (2006).

[5] Y-Y Hsu, Y-C Tseng, C-C Tseng, C-F Huang, J-H Fan and H-L Wu, "Design and Implementation of Two-Tier Mobile Ad Hoc Networks with Seamless Roaming and Load-Balancing Routing Capability", Proceedings of the First International Conference on Quality of Service in Heterogeneous Wired/Wireless Networks, pp. 52-58, (2004).

[6] J. Shin, H. Lee, J. Na, A. Park and S. Kim, "Load Balancing among Internet Gateways in Ad Hoc Networks", Proceedings of the 62nd IEEE Vehicular Technology Conference, pp. 1677-1680, (2005).

[7] Q. Le-Trung, P.E. Engelstad, T. Skeie and A. Taherkordi, "Load-Balance of Intra/Inter-MANET Traffic over Multiple Internet Gateways", Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pp. 50-57, (2008).

[8] A. Rani and M. Dave, "Performance Evaluation of Modified AODV for Load Balancing", Journal of Computer Science, vol. 3, issue 11, (2007).


Page 282: ADCOM 2009 Conference Proceedings

ECAR: An Efficient Channel Assignment and Routing in Wireless Mesh Network

Chaitanya P. Umbare, Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Assam, India-781039. Email: [email protected]

Dr. S. V. Rao, Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Assam, India-781039. Email: [email protected]

Abstract—Wireless mesh networking is a promising design paradigm for future generation wireless networks. Wireless mesh networks (WMNs) came into existence to resolve the limitations of, and to significantly improve the performance of, ad-hoc networks, wireless local area networks (WLANs), etc. In a WMN, or any network, channel allocation is done based on the queue states at the different nodes, which depend on how routing is done. It is therefore important to have joint routing and channel allocation algorithms to control various measures of performance such as delay and throughput. For better and more efficient use of multi-radio and advanced physical layer technologies, cross-layer control is required. In this paper, we propose an intelligent channel assignment strategy which results in minimum interference between channels, and efficient routing which results in a reduction of end-to-end delay.

I. INTRODUCTION

A WMN is a multi-hop, self-configured and self-healing network which provides reliability, redundancy and scalability [1]. WMNs consist of two types of nodes: mesh routers and mesh clients. Mesh routers, which form the backbone of the network, are generally not mobile and are equipped with one or more radios operating on the same or different radio technologies. Mesh clients are generally mobile and equipped with one radio only, because of power constraints.

Now we discuss some of the past work related to our protocol.

The use of multiple 802.11 NICs per node is explored for routing in multi-radio, multi-hop WMNs [2]; due to the static channel assignment at all nodes, the throughput improvement is proportional to the number of NICs. The main idea in the multichannel CSMA MAC protocol for multihop wireless networks [3] is to find an optimum channel for every single packet transmission. Channel switching is on a packet-by-packet basis, which requires re-synchronization among the communicating network cards for every packet; this leads to a decrease in network throughput. The centralized channel assignment and routing algorithm for multi-channel wireless mesh networks [4] visits all the virtual links in decreasing order of their expected loads and, upon visiting a link, greedily assigns the channel that leads to minimum interference. This centralized algorithm demands complete information about the network to perform channel assignment and routing.

Distributed channel assignment in multi-radio 802.11 mesh networks [5] presents a distributed, self-stabilizing mechanism that assigns channels to multi-radio nodes in a WMN. The main disadvantage of this protocol is that the assigned channels are not changed once the channel assignment has stabilized. A distributed channel assignment and routing protocol for WMNs [6] was proposed by Raniwala. The assumption made in that paper is that the traffic of all nodes goes to or comes from the gateway nodes only; for other traffic patterns, the protocol does not work. Also, due to uncoordinated allocation by nodes with the same priority, the channel assignment may not converge and may thus cause severe interference among nodes.

In this paper we address two main issues which significantly affect the performance of the network: routing and channel assignment. We try to improve the protocol of [6] through intelligent channel assignment, which significantly reduces the interference between neighboring nodes, and intelligent routing, which avoids a great deal of routing overhead. The next section describes our protocol in detail.

II. PROPOSED SCHEME

ECAR consists of two phases:
1) Load-Balancing Routing
2) Distributed Load-Aware Channel Assignment
These phases are explained in the following sections.

A. Load-Balancing Routing

The main idea in the routing process is as follows (see the sketch below): if the source and destination are in the same subnet and the hop count between the router and the destination is less than or equal to three, the router forwards the data packet using NIC[0] to the appropriate neighbor with the help of the master routing table. If the destination router is in the subtree rooted at this router, the router sends the data packets to the appropriate child using NIC[2] and the inter-domain routing table; otherwise it sends the data packets to its parent using NIC[1] and the inter-domain routing table.
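A minimal sketch of this forwarding rule (our paraphrase, not the authors' code; the data structures are hypothetical):

```python
# Hypothetical sketch of ECAR's forwarding decision: NIC[0] serves
# nearby same-subnet traffic (master routing table), NIC[2] reaches
# children in the routing tree, NIC[1] reaches the parent.

class Router:
    def __init__(self, subnet, master_hops, subtree):
        self.subnet = subnet              # this router's subnet id
        self.master_hops = master_hops    # dst -> hop count (master table)
        self.subtree = subtree            # destinations below this router

    def next_nic(self, dst, dst_subnet):
        if dst_subnet == self.subnet and self.master_hops.get(dst, 99) <= 3:
            return 0  # forward directly via the master routing table
        if dst in self.subtree:
            return 2  # forward downwards to the appropriate child
        return 1      # forward upwards to the parent

r = Router(subnet="A", master_hops={"n7": 2}, subtree={"n9"})
print(r.next_nic("n7", "A"), r.next_nic("n9", "B"), r.next_nic("n4", "B"))
# -> 0 2 1
```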

Each router periodically broadcasts a HELLO message to all its 1-hop neighbors. After receiving a HELLO message, a router updates its neighbor list by considering the hop-count


Page 283: ADCOM 2009 Conference Proceedings

information, and broadcasts this updated information to its 1-hop neighbors. As a result, each router has information about all the routers in the network, with the minimum hop count from itself. In this way, each router builds its master routing table.

In order to build the inter-domain routing table, each gateway node broadcasts an ADVERTISE message to all its 1-hop neighbors with the residual capacity of its uplink. Upon receiving an ADVERTISE message, a router sends a JOIN message to the advertiser if the advertised residual capacity is more than that of its existing gateway; otherwise it does not reply. When a router receives a JOIN message, it adds the sender to its children list and sends an ACCEPT message containing information about the channel to be used for communication. The router also sends a ROUTE-ADD message to its parent, containing the address of the node and its children, and this process continues up to the gateway node. When a router receives an ACCEPT message, it updates its parent entry and sends a LEAVE message to its previous parent. Upon receiving a LEAVE message, a router deletes the forwarding entries for that router and all its children and sends a ROUTE-DELETE message to its parent; the sending of ROUTE-DELETE is recursive. The building of the master routing table and the inter-domain routing table can each be driven by a single message exchange, which avoids the overhead of sending extra control messages.

B. Distributed Load-Aware Channel Assignment

In this section we present a localized, distributed algorithm for assigning a channel to each interface. The NIC used to communicate with the parent router is termed the parent NIC, whereas the child NIC denotes the NIC used to communicate with the router's children. Each WMN router is responsible for assigning a channel to its child NIC. The parent NIC of a router is associated with a unique child NIC of the parent router and is assigned the same channel as that child NIC.

Each router periodically exchanges its individual channel usage information with all of its (k+1)-hop neighbors through a CHANNEL-USAGE packet, where k is the ratio of the interference range to the communication range. After receiving the channel usage information, each router calculates the aggregate traffic load on each channel. Each router then excludes the channels used by its ancestors and the channels used by nodes at the same level, determines the least loaded remaining channel, and assigns it to its child NIC.

The load on routers closer to the gateways is higher, since most of the traffic goes to or comes from the gateway nodes. In order to give them more relay bandwidth, they have higher priority when assigning channels. Each router broadcasts the channel assigned to its child NIC to its children through a CHANNEL-CHANGE packet. Changes in network traffic may change the load on the various channels, which can unbalance the channel load and decrease network throughput. In order to balance the channel load and improve network throughput, each router dynamically changes the channel assigned to its child NIC by executing the above procedure periodically; a sketch follows.
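A minimal sketch of this least-loaded selection (our reading of the rule above, not the authors' code; the data structures are hypothetical):

```python
# Hypothetical sketch of ECAR's load-aware channel choice: exclude
# channels used by ancestors and same-level routers, then pick the
# channel with the smallest aggregate load learned from
# CHANNEL-USAGE packets.

def pick_channel(channels, load, ancestor_ch, same_level_ch):
    allowed = [c for c in channels if c not in ancestor_ch | same_level_ch]
    if not allowed:          # fall back if every channel is excluded
        allowed = channels
    return min(allowed, key=lambda c: load.get(c, 0.0))

channels = list(range(13))                 # 13 channels, as in Table I
load = {0: 5.0, 1: 0.5, 2: 3.0, 3: 0.2}    # aggregate loads (e.g. Mbps)
print(pick_channel(channels, load, ancestor_ch={3}, same_level_ch={1}))
# -> 4: an unused channel has zero load and wins over loaded ones
```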

III. PERFORMANCE EVALUATION

This section gives the implementation details of our scheme in ns-2 [7]. We also present our simulation environment and results.

A. Simulation Environment

The simulation environment is as described in Table I. All simulations are done in ns-2.

TABLE I. SIMULATION PARAMETERS

No.  Parameter                Value
1    Area                     240m x 240m
2    Transmission Range       22.5m
3    Nodes                    25, 50, 75, 100
4    Data Rate                50Kbps – 4Mbps
5    Number of Channels       13
6    Number of NICs/Node      3
7    Number of Flows          5 – 30
8    Number of Gateway Nodes  1 – 4
9    Simulation Time          400 sec

B. Simulation Results

The performance parameters measured in our simulations are average aggregate throughput and average end-to-end delay. Our protocol assigns channels to interfaces intelligently; because of this, the average aggregate throughput is much better in our case than for DCAR [6], as shown in the graphs below.

[Graph: Average Aggregate Throughput (Mbits/sec) vs Number of Nodes; 30 flows, 2 gateway nodes; ECAR vs DCAR.]

[Graph: Average Aggregate Throughput (Mbits/sec) vs Number of Flows; 100 nodes, 3 gateway nodes; ECAR vs DCAR.]


Page 284: ADCOM 2009 Conference Proceedings

[Graph: Average Aggregate Throughput (Mbits/sec) vs Number of Gateway Nodes; 30 flows, 100 nodes; ECAR vs DCAR.]

The following graphs show the effect on average end-to-end delay of varying the number of nodes, with the number of hops between source and destination set to 1, 2 and 3 in the first, second and third graphs respectively.

[Graph: Average End-to-End Delay (msec) vs Number of Nodes, 1 hop; ECAR vs DCAR.]

[Graph: Average End-to-End Delay (msec) vs Number of Nodes, 2 hops; ECAR vs DCAR.]

[Graph: Average End-to-End Delay (msec) vs Number of Nodes, 3 hops; ECAR vs DCAR.]

IV. CONCLUSION AND FUTURE WORK

Our protocol efficiently handles the two fundamental design issues in a multi-channel WMN. First, which of the available non-overlapping radio channels should be assigned to each 802.11 interface in the WMN? Second, how should packets be routed through a multi-channel wireless mesh network? Traffic between nearby routers is forwarded using the master routing table, sending data packets directly to the next hop instead of through the gateway nodes, which reduces routing overhead. Moreover, dynamically changing the assigned channels minimizes interference and hence increases the throughput of the network.

REFERENCES

[1] R. Bruno, M. Conti, and E. Gregori, "Mesh networks: commodity multihop ad hoc networks," IEEE Communications Magazine, vol. 43, pp. 123-131, 2005.

[2] R. Draves, J. Padhye, and B. Zill, "Routing in multi-radio, multi-hop wireless mesh networks," in MobiCom '04: Proceedings of the 10th annual international conference on Mobile computing and networking. ACM Press, 2004, pp. 114-128.

[3] A. Nasipuri, J. Zhuang, and S. Das, "A multichannel CSMA MAC protocol for multihop wireless networks," pp. 1402-1406, 1999.

[4] A. Raniwala, K. Gopalan, and T.-C. Chiueh, "Centralized channel assignment and routing algorithms for multi-channel wireless mesh networks," SIGMOBILE Mob. Comput. Commun. Rev., vol. 8, pp. 50-65, 2004.

[5] B.-J. Ko, V. Misra, J. Padhye, and D. Rubenstein, "Distributed channel assignment in multi-radio 802.11 mesh networks," in Wireless Communications and Networking Conference (WCNC 2007), IEEE, 2007, pp. 3978-3983.

[6] A. Raniwala and T.-C. Chiueh, "Architecture and algorithms for an IEEE 802.11-based multi-channel wireless mesh network," vol. 3, 2005, pp. 2223-2234.

[7] Information Sciences Institute, "NS-2 network simulator," Software Package, 2003, http://www.isi.edu/nsnam/ns/.


Page 285: ADCOM 2009 Conference Proceedings

Rotational Invariant Texture Classification of Color Images using Local Texture Patterns

A. Suruliandi, Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu, India. [email protected]

E.M. Srinivasan, Department of ECE, Government Polytechnic College, Nagercoil, Tamilnadu, India. [email protected]

K. Ramar, Department of CSE, National Engineering College, Kovilpatti, Tamilnadu, India. [email protected]

Abstract— In this paper a new approach that extends the Local Texture Patterns (LTP) texture model to color images is presented. To extract the spatial features of a color image, Gray-Local Texture Patterns (GLTP) are introduced. Contrast is another important property of images; to extract contrast features, Color-Local Contrast Variance Patterns (CLCVP) are proposed. Since much important information contained in an image is revealed by joint distributions of individual features, GLTP/CLCVP is proposed as the textural feature extraction technique for the classification of color images. The performance of the proposed features is evaluated by rotational invariant classification of Outex texture database images. From the experimental results it is observed that GLTP/CLCVP yields a high classification accuracy of 99.32% for color images.

Keywords- Texture Analysis, Texture Classification, Local Texture Patterns, Local Gray scale Texture Patterns, Local Color Contrast Variance Patterns.

I. INTRODUCTION

Texture methods can be categorized as statistical, geometrical, structural, model-based and signal processing approaches. A comparative study of texture measures is reported in [3].

A. Motivation and Justification for the Proposed Approach

Color texture models are essential for at least two reasons. First, most approaches to texture analysis quantify texture by single values such as means, variance, entropy, etc.; much important information contained in the distribution of texture values may thereby be lost. Second, color has considerable importance in image analysis. Color as well as texture has been discussed intensively in the literature, but most known texture models are based on gray-scale images. This is an inadequate restriction in many real world applications of computer vision. These restrictions are the motivation behind the development of color texture models.

He and Wang [1] proposed a texture modeling scheme called the 'Texture Spectrum (TS)'. The TS operator is gray-scale invariant but not rotation invariant. Ojala et al. extended their earlier work in 2002 [4] and proposed a new texture model called 'Local Binary Patterns (LBP)'. The LBP model is gray-scale and rotation invariant. Recently, Suruliandi and Ramar [6] proposed a new texture model, LTP, combining the best features of the TS and LBP models. Hence it is proposed to extend the LTP model and to combine it with a contrast variance feature for the classification of color images.

B. Outline of the Proposed Work

The spatial texture feature used in this study is LTP, which is basically a gray-scale operator; it is extended here to process color image patterns, and the feature GLTP is introduced for this purpose. The LTP is an excellent measure of the spatial structure of local image texture, but it discards the contrast variance of the image, which is an important property. The feature CLCVP is therefore proposed based on contrast variance. Color and texture being complementary, it is expected that their orthogonal property, in the form of a joint distribution, can perform better for texture analysis. Hence, the feature GLTP/CLCVP is proposed as a joint distribution texture model.

C. Organization of the Paper

This paper is organized as follows. Section II describes the basic operators LTP and VAR. The proposed texture feature extraction technique for color images is explained in Section III. The classification principle is illustrated in Section IV. Experimental results are presented in Section V. Discussion and conclusions are presented in Section VI.

II. BASIC OPERATORS

A. Local Texture Patterns (LTP)

The local image texture information can be extracted from a 3 x 3 local neighborhood. Let gc, g1,


Page 286: ADCOM 2009 Conference Proceedings

g2,…,g8 be the pixel values of a local region, where gc is the value of the central pixel and g1, g2,…,g8 are the pixel values of its 8 neighbors. Let the pattern unit P between gc and its neighbor gi (i = 1, 2,…,8) be defined as

$$P(g_c, g_i) = \begin{cases} 0 & \text{if } g_i < (g_c - \Delta g) \\ 1 & \text{if } (g_c - \Delta g) \le g_i \le (g_c + \Delta g) \\ 9 & \text{if } g_i > (g_c + \Delta g) \end{cases} \qquad i = 1, 2, \dots, 8 \tag{1}$$

where Δg is a small positive value representing a threshold set by the user. The values of P can be any three distinct values; here 0, 1 and 9 are chosen to make the pattern labeling process easier, and they have no other significance. Fig. 1 shows a 3 x 3 local region, the P values calculated along the border, and the resulting pattern string.

(a) 3 x 3 local region:
123 110 113
117 120 135
130 125 128

(b) Pattern units matrix for Δg = 4:
1 0 0
1 - 9
9 9 9

(c) Pattern string: 1 0 0 1 9 9 9 9

Fig. 1. (a) 3 x 3 local region. (b) Pattern units matrix for Δg = 4. (c) Pattern string.

To define uniform circular patterns over the pattern string, a uniformity measure U, corresponding to the number of circular spatial transitions in the pattern string, is defined as

$$U = s\big(P(g_c, g_8), P(g_c, g_1)\big) + \sum_{i=2}^{8} s\big(P(g_c, g_i), P(g_c, g_{i-1})\big) \tag{2}$$

where

$$s(X, Y) = \begin{cases} 1 & \text{if } |X - Y| > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

The following rotation and gray-scale shift invariant 'Local Texture Pattern (LTP)' operator is proposed for describing a local image texture:

$$LTP = \begin{cases} \displaystyle\sum_{i=1}^{8} P(g_c, g_i) & \text{if } U \le 3 \\ 73 & \text{otherwise} \end{cases} \tag{4}$$

For U = 0 there exist 3 LTPs (0, 8 and 72); for U = 2 there are 21 LTPs (1 to 7, 9, 16, 18, 24, 27, 32, 36, 40, 45, 48, 54, 56, 63 and 64); and for U = 3 there exist another 21 LTPs (10 to 15, 19 to 23, 28 to 31, 37 to 39, 46, 47 and 55). All other, non-uniform, patterns are grouped under the single label 73. Pattern strings such as 00019000 and 00091000 are considered rotation equivalents, and such patterns generate the same LTP. The total number of LTPs is therefore 46. Since there are a few holes in the LTP numbering scheme, the labels are remapped to a continuous numbering from 1 to 46 using a small lookup table.
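As an illustration, the following sketch (ours, not the authors' code) computes the LTP label of a 3 x 3 region per Eqs. (1)-(4); with the Fig. 1 values read clockwise around the border (the circular order needed by Eq. (2)), it yields the label 38, one of the U = 3 patterns listed above:

```python
# Illustrative sketch (not the authors' code) of the LTP operator,
# Eqs. (1)-(4). Neighbours must be given in circular order around
# the centre so that U counts circular transitions correctly.

def pattern_units(gc, neighbours, dg=4):
    """Pattern unit P(gc, gi) of Eq. (1) for each neighbour."""
    units = []
    for gi in neighbours:
        if gi < gc - dg:
            units.append(0)
        elif gi <= gc + dg:
            units.append(1)
        else:
            units.append(9)
    return units

def ltp(units):
    # U of Eq. (2): circular transitions (i = 0 wraps to the last unit)
    u = sum(units[i] != units[i - 1] for i in range(8))
    return sum(units) if u <= 3 else 73   # Eq. (4)

# Fig. 1 region: centre 120, border read clockwise from the top-left.
units = pattern_units(120, [123, 110, 113, 135, 128, 125, 130, 117])
print(units)       # [1, 0, 0, 9, 9, 9, 9, 1]
print(ltp(units))  # 38 (a U = 3 pattern, before relabelling to 1..46)
```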

B. Contrast Variance (VAR)

The local contrast variance (VAR) is defined by

$$\mathrm{VAR} = \frac{1}{P}\sum_{i=0}^{P-1}\left(g_i - \mu_c\right)^2, \quad \text{where } \mu_c = \frac{1}{P}\sum_{i=0}^{P-1} g_i \tag{5}$$

VAR is, by definition, invariant against gray-scale shifts. In forming a pattern over a local region, either the LTP or VAR can be used as a texture descriptor. Either the joint pair LTP+VAR or the joint distribution LTP/VAR can also be used as texture descriptors.

C. Patterns Unification Procedure

VAR has a continuous-valued output, and hence quantization of its feature space is needed. This is achieved by the patterns unification procedure. A unique code corresponding to the inputs P, Q and R is computed as shown in Figure 2; the variables P, Q and R may represent VAR values. In the figure, a dot in the upper position means the corresponding value is higher, a dot in the lower position means it is lower, and dots on the same line mean the values are equal.

Figure 2. Patterns unification procedure: each of the 13 possible relative orderings of P, Q and R (allowing ties) is assigned a unique code from 1 to 13.

III. PROPOSED TEXTURE MODEL FOR COLOR IMAGES

The procedure for computing GLTP/CLCVP is illustrated in Figure 3.

277

Page 287: ADCOM 2009 Conference Proceedings

Figure 3. Extraction of GLTP/CLCVP joint distribution features. Starting from an RGB image: (1) isolate the R, G and B planes; (2) form an intensity image I = 0.299R + 0.587G + 0.114B; (3) form the LTP plane by applying the LTP operator over I with a 3 x 3 sliding window; (4) compute VAR for the local regions of the R, G and B planes using a 3 x 3 sliding window, forming R, G and B VAR planes; (5) combine the R, G and B VAR planes using the patterns unification procedure to form a unified VAR plane; (6) form the 2D joint distribution histogram GLTP/CLCVP from the LTP plane and the unified VAR plane.

IV. CLASSIFICATION PRINCIPLE

A. Texture Similarities

Similarity between different textures is evaluated by comparing their Pattern Spectra using the log-likelihood ratio, also known as the G-statistic [5].

$$G = 2\left[\sum_{s,m}\sum_{i=1}^{n} f_i \log f_i - \sum_{s,m}\left(\sum_{i=1}^{n} f_i\right)\log\left(\sum_{i=1}^{n} f_i\right) - \sum_{i=1}^{n}\left(\sum_{s,m} f_i\right)\log\left(\sum_{s,m} f_i\right) + \left(\sum_{s,m}\sum_{i=1}^{n} f_i\right)\log\left(\sum_{s,m}\sum_{i=1}^{n} f_i\right)\right] \tag{6}$$

where ‘s’ is a histogram of the texture measure distribution of the test sample image and ‘m’ is a histogram of the texture measure distribution of model sample image, ‘n’ is the total number of bins in the histogram and ‘fi’ is the frequency at bin ‘i’. The value of the G-statistic indicates the possibility that two image texture distributions come from the same population. The more alike the histograms are, the smaller is the value of G.
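A minimal Python sketch of Eq. (6), using the convention 0·log 0 = 0 for empty bins (an assumption; the paper does not state how empty bins are handled):

```python
import numpy as np

def g_statistic(s_hist, m_hist):
    """Log-likelihood G-statistic between two texture histograms, Eq. (6)."""
    f = np.array([s_hist, m_hist], dtype=float)   # rows: sample s and model m

    def xlogx(a):
        a = np.asarray(a, dtype=float)
        return np.where(a > 0, a * np.log(np.maximum(a, 1e-300)), 0.0)

    term1 = xlogx(f).sum()                 # sum over s,m and bins of f log f
    term2 = xlogx(f.sum(axis=1)).sum()     # per-histogram totals
    term3 = xlogx(f.sum(axis=0)).sum()     # per-bin totals over s and m
    term4 = xlogx(f.sum())                 # grand total
    return 2.0 * (term1 - term2 - term3 + term4)
```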

B. k-Nearest Neighbor Classification

The algorithm for k-Nearest Neighbor classification is as follows.

• Have training data (xi, ti), i = 1, 2, ..., n, where xi is the attribute vector of training sample i and ti is the class label of training sample i.

• Have some test point x that we wish to classify.

• Calculate the similarity or dissimilarity between the test point and the training points.

• Find the k training points k1, k2, ..., kk that are closest to the test point.

• Set the classification t of the test point to be the most common class among the k nearest neighbors.

• In the special case k = 1, the algorithm behaves simply as Nearest Neighbor classification.

A code sketch of this procedure, with the G-statistic as the dissimilarity, follows.
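The sketch below assumes the g_statistic function given earlier and represents each training model as a (histogram, class label) pair:

```python
from collections import Counter

def knn_classify(test_hist, models, k=3):
    """k-NN classification with the G-statistic of Eq. (6) as the
    dissimilarity (a sketch; `models` is a list of (histogram, label))."""
    scored = sorted(models, key=lambda m: g_statistic(test_hist, m[0]))
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```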

V. EXPERIMENTS

A. Images Used in the Experiments

The performance of the proposed texture features for color images is demonstrated with textures downloaded from the Outex database [2]. In this study, five texture classes are used for classification. In each texture class, images at three different rotation angles, under three different illuminations and at three different resolutions (100, 300 and 600 dpi) were used. In total, 135 (5 x 3 x 3 x 3) textures were used in the experiments. Figure 4 shows the five texture classes.

B. Rotational Invariant Texture Classification

There are many applications for texture analysis in which rotation-invariance is important. In most approaches to texture classification, it is assumed that the unknown samples to be classified are identical to the training samples with respect to orientation. However, in reality the samples to be classified can occur at arbitrary orientation.

In this experiment, the classifier was trained with samples of illuminant 'inca', rotation angle 0° and resolution 600 dpi. 10 samples, each of size 64 x 64, were extracted from each of the five texture classes shown in Figure 4, giving 50 (5 x 10) training models. For rotational invariant classification, samples with the same illuminant ('inca') and the same resolution (600 dpi) as the training samples, but with the other two rotation angles of each texture, were used to test the classifier. 40 samples, each of size 64 x 64, were extracted from each texture class, giving in total 200 (5 x 40) validation samples. The 3-NN algorithm was used as the classifier: each validation sample was assigned to the class holding the majority among its 3 nearest training models. The results of the classification are shown in Table 1.


Page 288: ADCOM 2009 Conference Proceedings

Figure 4. Textures from the Outex database used in the experiments: (a) Canvas001; (b) Canvas002; (c) Canvas003; (d) Canvas005; (e) Cardboard001.

TABLE 1. ROTATIONAL INVARIANT CLASSIFICATION.

Texture sample:               CV-1-45  CV-1-90  CV-2-30  CV-2-60  CV-3-00  CV-3-10  CV-5-15  CV-5-30  CB-1-15  CB-1-75  Average
Classification accuracy (%):  90.91    97.73    100      100      100      100      100      100      100      100      98.86

C. Comparative Analysis of GLTP/CLCVP for Rotational Invariant Texture Classification

This experiment was conducted to measure the performance of three texture features, GLTP/CLCVP, LBP/CLCVP and TS/CLCVP, for rotational invariant classification. All training textures were chosen with illumination 'inca', a resolution of 600 dpi and a rotation angle of 0°. The validation textures were the same five textures with illumination 'inca' and a resolution of 600 dpi, but with the other two rotation angles. The 3-NN classifier was used as the classification algorithm. The results are tabulated in Table 2.

TABLE 2. ROTATIONAL INVARIANT CLASSIFICATION RESULTS FOR VARIOUS TEXTURE MODELS.

Classification accuracy in %

Texture        TS/CLCVP  LBP/CLCVP  GLTP/CLCVP
CV1-I-600-45   100.00    80.68      94.32
CV1-I-600-90   46.59     77.27      98.86
CV2-I-600-30   100.00    100.00     100.00
CV2-I-600-60   100.00    100.00     100.00
CV3-I-600-05   100.00    98.86      100.00
CV3-I-600-10   100.00    98.86      100.00
CV5-I-600-15   78.41     90.91      100.00
CV5-I-600-30   71.59     90.91      100.00
CB1-I-600-15   100.00    100.00     100.00
CB1-I-600-75   100.00    100.00     100.00
Average        89.66     93.75      99.32

Texture legend examples: CV1-I-600-45 denotes Canvas001 with 'inca' illumination at 600 dpi and a rotation angle of 45°; CB1-I-600-15 denotes Cardboard001 with 'inca' illumination at 600 dpi and a rotation angle of 15°.

VI. DISCUSSION AND CONCLUSION

In this paper, a new texture feature is proposed for color texture modelling. The feature GLTP/CLCVP is introduced as a joint distribution of texture patterns and contrast variance patterns. The performance of the proposed texture feature was studied with respect to rotational invariance. The experimental results reveal that the proposed texture feature GLTP/CLCVP yields promising results for rotational invariant classification problems.

In this work, the performance of the proposed texture feature extraction technique was tested using a k-NN classifier with the G-statistic as the distance measure. In future it is planned to test the performance using other classifiers with different distance measures.

REFERENCES

[1] D.C. He and L. Wang, "Texture Unit, Texture Spectrum and Texture Analysis", IEEE Transactions on Geoscience and Remote Sensing, vol. 28, no. 4, pp. 509-512, 1990.

[2] T. Ojala, T. Mäenpää, M. Pietikäinen, J. Viertola, J. Kyllönen and S. Huovinen, "Outex - A New Framework for Empirical Evaluation of Texture Analysis Algorithms", Proc. 16th Int'l Conf. on Pattern Recognition, 2002. Available online at http://www.outex.oulu.fi

[3] T. Ojala, M. Pietikäinen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.

[4] T. Ojala, M. Pietikäinen and T. Mäenpää, "Multiresolution Gray-scale and Rotation Invariant Texture Classification with Local Binary Patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.

[5] R.R. Sokal and F.J. Rohlf, Introduction to Biostatistics, 2nd ed., W.H. Freeman, 1987.

[6] A. Suruliandi and K. Ramar, "Local Texture Patterns - A univariate texture model for classification of images", Proceedings of the 2008 16th International Conference on Advanced Computing and Communications (ADCOM 2008), Anna University, Chennai, Tamilnadu, India, pp. 32-39, 2008. Available online at IEEE Xplore.


Page 289: ADCOM 2009 Conference Proceedings

Time Synchronization for an Efficient Sensor Network System

Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K., P Deepa Shenoy, and Venugopal K R

Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India

[email protected]

L M Patnaik, Vice Chancellor

Defence Institute of Advanced Technology, Pune, India

Abstract

Time synchronization schemes in Wireless Sensor Networks have been subjected to various security threats and attacks. In this paper we throw light on some of these attacks. We are particularly concerned with the pulse delay attack, which cannot be countered using any cryptographic technique. We propose an algorithm called the Resync algorithm, which not only detects the delay attack but also aims to rectify the compromised node and introduce it back into the network for the synchronization process. In-depth analysis has been done in terms of the rate of success achieved in detecting multiple outliers, i.e. nodes under attack, and the level of accuracy obtained in the offset values after running the Resync algorithm.

Keywords: Time synchronization, Attacks, Security, Recovery, Clock Adversary, Compromised nodes.

1 Introduction

Sensor networks are made up of small devices with sensing and processing facilities, equipped with a low-power radio interface. Industrial control applications are emerging, in contrast to earlier applications, which were dedicated to environmental monitoring and surveillance tasks. Industrial control applications demand much more with respect to security, especially when monitoring of critical equipment is needed.

Time synchronization is imperative for many applications in sensor networks, at many layers of their design. Examples include TDMA radio scheduling, reducing redundant messages by duplicate detection of the same event by different sensors, performing mobile object tracking, and using different sensor nodes to perform ordered logging of events during system debugging, to name a few. The effect of inaccurate time synchronization would be detrimental to all the above applications if the underlying time synchronization protocol were somehow modified by an adversary. In object tracking, for instance, the estimated trajectory of the object would greatly differ from the actual one, because the collaborative data processing and signal processing techniques would be greatly affected. Similarly, the importance of time synchronization in sensor networks can be seen in other control applications.

Contribution: We propose an extension to L-SGS, the Resync algorithm, in which the compromised node is brought back into the synchronization process by running the Resync algorithm on the compromised node. This extension is evaluated against the main L-SGS algorithm based on various parameters.


Page 290: ADCOM 2009 Conference Proceedings

2 Model and Algorithm

Our system consists of a network of sensor nodes which are assumed to be within each other's power ranges. Such sensors are called neighbors of each other. The radio link between the sensor nodes is bidirectional, allowing two-way communication between the nodes.

In the group synchronization algorithm, the nodes synchronize if their delay value has not crossed a pre-calculated threshold value. Both the receiver-to-sender and the sender-to-receiver exchange can fall prey to the pulse delay attack.

2.1 Problem definition

The first task is the implementation of the modified L-SGS algorithm. The algorithm uses the broadcast property of the wireless channel to broadcast messages from the central node to the other nodes. The times Tij are measured by the local clock of the node, where Tij is time Ti as received by node j. Each node has to keep track of four time values Ti, i = 1 to 4, in order to calculate dj and δj, if not compromised. Any of the nodes in the group can serve as the central node, provided that collisions in the wireless channel do not cause any drop of packets to and from the central node. The second task is the implementation of the Resync algorithm, which counters the external attack and re-enters the node into the synchronization process after it has been compromised.

2.2 Algorithm

The modified L-SGS is executed as follows:

Table I: L-SGS Algorithm

1. Nc → Ni [SYNC]*, for all i, excluding c.
2. Ni(T1i) → Nc(T2i) [Ni||Nc||REPLY]
3. Nc(T3i) → Ni(T4i) [Nc||Ni||REPLY]

If d <= D then the offset is δ = ((T2i − T1i) − (T4i − T3i))/2; else node Ni is labeled as a compromised node and runs the RESYNC algorithm.

3 Resync Algorithm

This proposed algorithm is a continuation of the modified L-SGS. Unlike the original L-SGS, the modified L-SGS does not abort when a node gets compromised. Instead, the node which has been identified as compromised runs the Resync algorithm in order to counter the external attack and re-enter the synchronization process. The main idea behind the Resync algorithm is that once the outliers, i.e. the malicious time offsets, have been detected, they must be excluded while calculating the true offsets between the nodes. This is done by calculating the mean of the offsets of the benign nodes to approximate the true time offsets.

Let Γ be the set of time offsets from all the nodes and χ be the set of time offsets from the outliers; the benign time offset set is then Γ − χ. Let μ be the average of the set Γ − χ. If n is the size of the full set of time offsets and k the number of offsets from compromised nodes, then μ is calculated as

μ = (1/(n − k)) Σ_{i=1}^{n−k} (Γ − χ)_i

Therefore, when a node gets compromised, it has to ensure that it gets the true offsets from the nodes which are not compromised in the network. Accordingly, it takes the average of the received time offsets and sets its offset to the calculated mean. The Resync algorithm, which implements this idea, executes as follows:

Table II: Resync Algorithm

1. Nx → Ni [Nx||COMPROMISED]*
2. Ni → Nx [δi||OFFSET]
3. Nx calculates the average of the offsets of the remaining benign group members and adjusts its clock.
4. Nx → Ni [Nx||RESYNC]*

Node Nx represents the node which has been compromised. This has to be communicated to all the other group members taking part in the synchronization process. This is achieved in Step 1, where the compromised node includes its node id and broadcasts it to all the nodes in a COMPROMISED message. In Step 2, as a reply, each of the benign nodes includes the offset it calculated as part of the modified L-SGS in an OFFSET message and transmits it to the compromised node.
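As a minimal Python sketch of the correction step (assuming offsets are given in seconds; the values below are illustrative, not from the paper):

```python
def resync_offset(all_offsets, outlier_offsets):
    """Mean of the benign offsets, used by a compromised node to reset
    its clock offset (a sketch of the µ formula above)."""
    outliers = set(outlier_offsets)
    benign = [o for o in all_offsets if o not in outliers]
    return sum(benign) / len(benign)

# e.g. offsets from 8 group members, two of them under delay attack
print(resync_offset([0.002, 0.003, 0.0025, 0.041, 0.0028, 0.039, 0.0031, 0.0027],
                    [0.041, 0.039]))
```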


Page 291: ADCOM 2009 Conference Proceedings

4 Implementation and Performance Analysis

4.1 Implementation of the modified L-SGS

Our implementation of the modified L-SGS is simulation oriented. The simulation was performed using Castalia, a recently developed simulator built specifically for Wireless Sensor Networks. Castalia is built on top of the open-source network simulator OMNeT++.

For the sample run, nine nodes are taken into account, one acting as the central node. The delays are plotted when no node is compromised. Based on the threshold calculation presented above, the average delay is davg = 0.0198446 s and σ = 0.00794624. The required n = 3, and thus the threshold is calculated to be D = 0.04368332 s. This value is then used to detect outliers in the subsequent runs. Based on the threshold delay calculation, the presence of compromised nodes can be detected as described in the algorithm. The following figure illustrates how the algorithm picks up the presence of a node under pulse delay attack based on the threshold value.
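The reported numbers are consistent with a threshold of the form D = davg + n·σ (an assumption inferred from the values above, not stated explicitly in the text):

```python
d_avg, sigma, n = 0.0198446, 0.00794624, 3
D = d_avg + n * sigma    # assumed threshold form D = d_avg + n*sigma
print(D)                 # 0.04368332, matching the reported threshold
```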

Figure 1: End-to-end delay in a sample run

The scatter graph in Figure 1 plots the uncompromised nodes and their respective delays. All the delays are below the threshold, which is represented by the dashed line.

4.2 Performance Evaluation

The Resync algorithm is responsible for ensuring that if a node is compromised, it broadcasts its status to the other nodes present in its group and, using the time offsets of the benign nodes, is able to correct its own offset.

Figure 2: Run 1 - Comparison of true and calculated offsets

As we can see from Figure 2, nodes 1, 3, 4, 5 and 7, i.e. five out of the eight nodes, have a difference in time offsets between 1 and 10 ms. Thus, as the plot shows, the Resync algorithm has an efficiency > 70% in most cases.

5 Conclusion

The performance of the modified L-SGS is measured using the Successful Detection Rate (SDR) when different delays are introduced into the network. The performance of the Resync algorithm is then measured by comparing the calculated offsets with the true time offsets. Even though in most cases the accuracy of the time offsets is about 70%, the algorithm can be refined in order to achieve a higher accuracy rate.

References

[1] S. Ganeriwal, R. Kumar, M. B. Srivastava, "Timing-sync Protocol for Sensor Networks", in Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys), pp. 138-149, 2003.

[2] H. Song, S. Zhu, G. Cao, "Attack-resilient time synchronization for wireless sensor networks", Ad Hoc Networks, vol. 5, no. 1, pp. 112-125, 2006.


Page 292: ADCOM 2009 Conference Proceedings

Parallel Hybrid Germ Swarm Computing for Video Compression

K. M. Bakwad1, S. S. Pattnaik1, B. S. Sohi2, S. Devi1, B. K. Panigrahi3, M. R. Lohokare1

1National Institute of Technical Teachers' Training and Research, Chandigarh, India
2UIET, Panjab University, Chandigarh, India
3Indian Institute of Technology, Delhi, India

[email protected]

[email protected]

Abstract — This paper proposes Parallel Hybrid Germ Swarm Computing (PHGSC) for real-time video compression. The convergence of Bacterial Foraging Optimization (BFO) is very slow because of its fixed step size, and its performance is heavily degraded for real-time processing. In this paper, the authors first increase the speed of BFO by updating the bacteria positions in parallel instead of serially, which is treated as Parallel Germ Computing (PGC). PGC is then hybridized with GLBest Particle Swarm Optimization (GLBestPSO) to improve its global performance. PHGSC is used to reduce the computational time of motion estimation in video compression. The adaptive step size with prediction, zero motion vectors and the Von Neumann neighborhood topology implemented in PHGSC find the best matching block computationally very fast. The presented PHGSC saves up to 93.36% of computational time when compared with other published methods.

Keywords — Parallel Hybrid Germ Swarm Computing (PHGSC); Global and Local Best Particle Swarm Optimization (GLBestPSO); video compression; computational time; peak signal to noise ratio.

I. INTRODUCTION

Based on the foraging strategies of the E. coli bacterium, K. M. Passino proposed Bacterial Foraging Optimization in 2002 [1]. Bacterial Foraging Optimization [1] is gaining popularity in the research community due to its attractive features.

Motion estimation is a fundamental component of video compression and has been popularly used in video signal processing. Motion estimation accounts for 70 to 90 percent of the computational complexity of video compression. The exhaustive search (ES), or full search, algorithm gives the highest peak signal to noise ratio among all block-matching algorithms but requires the most computational time [2]. To reduce the computational time of the exhaustive search method, many other methods have been proposed: Simple and Efficient Search (SES) [2], Three Step Search (TSS) [3], New Three Step Search (NTSS) [3], Four Step Search (4SS) [4], Diamond Search (DS) [5], Adaptive Rood Pattern Search (ARPS) [6], Novel Cross Diamond Search [7], New Cross-Diamond Search [8], Adaptive Block Matching [9], Efficient Block Matching Motion Estimation [10], Content Adaptive Video Compression [11] and a fast motion estimation algorithm [12]. GA has also been used for fast motion estimation [13]. In this paper, the authors propose the fusion of Parallel Germ Computing with GLBestPSO for motion estimation.

II. PARALLEL HYBRID GERM SWARM COMPUTING

BFO can be classified into serial and parallel BFO; standard BFO is serial. In BFO, if all of the bacteria update their information at the same time, the scheme is treated as Parallel Bacterial Foraging or Parallel Germ Computing (PGC). The Parallel Bacterial Foraging or Parallel Germ Computing developed by the authors can be found in [14]. PGC, when hybridized with GLBestPSO, is called PHGSC. PHGSC has been used here for video compression.

The authors propose an adaptive step size, given in Eq. (1), which is used to predict the best matching block in the reference frame with respect to the macro block in the current frame for which the motion vector is found. In PHGSC, the positions of the bacteria are updated as given in Eq. (2). In the step size equation, W and C are the same as in GLBestPSO [15], as given in Eq. (3) and Eq. (4). Due to the adaptive step equation of PHGSC, the next block search starts near the best matching block of the previous step.

$$\text{Step size} = \mathrm{abs}[M_x + M_y] + r \cdot W \cdot C \tag{1}$$

$$\theta(i, j+1, k) = \theta(i, j, k) + \big(C(i) + \text{Step size}\big)\,\frac{\Delta(i)}{\sqrt{\Delta^T(i)\,\Delta(i)}} \tag{2}$$

$$W = 1.1 - \frac{gbest}{pbest_i} \tag{3}$$

$$C = 1 + \frac{gbest}{pbest_i} \tag{4}$$

where Mx is the horizontal position of the motion vector of the previous block and My is the vertical position of the motion vector of the previous block.
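A small Python sketch of Eqs. (1)-(4); the exact form of Eq. (2) is an assumption recovered from the garbled transcript, and r is taken as a uniform random factor:

```python
import numpy as np

def glbest_coefficients(gbest, pbest):
    """W and C from the GLBestPSO fitness values, Eqs. (3)-(4)."""
    return 1.1 - gbest / pbest, 1.0 + gbest / pbest

def adaptive_step(mx, my, w, c, rng=np.random.default_rng()):
    """Adaptive step size of Eq. (1); (mx, my) is the previous block's
    motion vector and r a uniform random factor."""
    return abs(mx + my) + rng.random() * w * c

def update_position(theta, c_i, step, rng=np.random.default_rng()):
    """Position update of Eq. (2) as reconstructed above."""
    delta = rng.uniform(-1, 1, size=theta.shape)   # tumble direction
    return theta + (c_i + step) * delta / np.sqrt(delta @ delta)
```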

Step by step algorithm of PHGSC


Page 293: ADCOM 2009 Conference Proceedings

Step 1: Initialize the parameters p, S, Nc, Ns, Nre, C(i), i = 1, 2, ..., S, where:

p = dimension of the search space
S = number of bacteria in the population
Nc = number of chemotactic steps
Ns = number of swimming steps
Nre = number of reproduction steps
C(i) = step size taken in the random direction specified by the tumble

J(i, j, k) = fitness value (cost) of the i-th bacterium in the j-th chemotaxis and k-th reproduction step.
θ(i, j, k) = position vector of the i-th bacterium in the j-th chemotaxis and k-th reproduction step.
Jbest(j, k) = fitness value (cost) of the best position in the j-th chemotaxis and k-th reproduction step.
Jglobal = fitness value (cost) of the global best position in the entire search space.

Step 2: Update the following parameters: J(i, j, k); Jbest(j, k); Jglobal = Jbest(j, k).

Step 3: Reproduction Loop: k = k+1

Step 4: Chemotaxis loop: j = j+1

a) Compute the fitness function J(i, j, k) for i = 1, 2, 3, ..., S.

b) Update Jbest(j, k).

c) Tumble: generate a random vector Δ(i) ∈ ℝ^p with each element Δm(i), m = 1, 2, ..., p, a random number on [−1, 1].

d) Compute θ for i = 1, 2, ..., S.

e) Swim:
i) Let m = 0 (counter for swim length).
ii) While m < Ns: let m = m + 1; compute the fitness function J(i, j+1, k) for i = 1, 2, 3, ..., S; update Jbest(j+1, k); if Jbest(j+1, k) < Jbest(j, k) (i.e. doing better), set Jbest(j, k) = Jbest(j+1, k), compute θ for i = 1, 2, ..., S and use this θ(i, j+1, k) to compute the new J(i, j+1, k); else, let m = Ns (this ends the while loop).

Step 6: If j < Nc, go to Step 4; in this case, continue chemotaxis, since the life of the bacteria is not over.

Step 7: The Sr = S/2 bacteria with the highest cost function values die, and the other Sr = S/2 bacteria with the best values split.

Step 8: Update Jglobal from Jbest(j, k).

Step 9: If k < Nre, go to Step 3; otherwise end. A compact code sketch of this loop follows.
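The sketch below is a minimal Python rendering of the loop above; the adaptive step of Eq. (1) is omitted for brevity, so it reduces to the parallel-chemotaxis core, and the toy cost function is an assumption for illustration:

```python
import numpy as np

def phgsc(cost, p, S=5, Nc=10, Ns=4, Nre=4, C=1.0, seed=0):
    """Parallel chemotaxis with swimming and reproduction (a sketch).
    `cost` maps an (S, p) position matrix to S fitness values."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-1, 1, (S, p))
    J = cost(theta)
    j_global = J.min()

    for _ in range(Nre):                        # reproduction loop (Step 3)
        for _ in range(Nc):                     # chemotaxis loop (Step 4)
            delta = rng.uniform(-1, 1, (S, p))  # tumble directions (c)
            unit = delta / np.linalg.norm(delta, axis=1, keepdims=True)
            new_theta = theta + C * unit        # all S bacteria move at once
            new_J = cost(new_theta)
            m = 0
            while m < Ns and new_J.min() < J.min():   # swim while improving
                theta, J = new_theta, new_J
                new_theta = theta + C * unit
                new_J = cost(new_theta)
                m += 1
            j_global = min(j_global, J.min())
        half = theta[np.argsort(J)][:(S + 1) // 2]    # best half splits,
        theta = np.concatenate([half, half])[:S]      # worst half dies
        J = cost(theta)
    return j_global

# toy usage: minimize the sphere function in 2 dimensions
print(phgsc(lambda X: (X ** 2).sum(axis=1), p=2))
```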

III. PHGSC FOR MOTION ESTIMATION

The authors have already used MPPSO [14] and PBFO [16] for motion estimation. The Von Neumann topology is used as the search pattern. In the proposed method, a macro block is treated as a bacterium, and five bacteria are used in PHGSC for motion estimation. The initial position of the block to be searched in the reference frame is the same as that of the block in the current frame for which the motion vector is found. The mean absolute difference (MAD), expressed in Eq. (5), is taken as the objective (cost) function for motion estimation.

$$\mathrm{MAD} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big|\,\mathrm{CurrentBlock}(i, j) - \mathrm{ReferenceBlock}(i, j)\,\big| \tag{5}$$

where M is the number of rows in the frame and N is the number of columns in the frame.

The performance of the proposed method is evaluated by the peak signal to noise ratio, given by Eq. (6):

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{MN}\sum_{i,j}\big(\mathrm{OriginalFrame}(i, j) - \mathrm{CompensatedFrame}(i, j)\big)^2} \tag{6}$$
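A minimal Python sketch of the two quality measures, assuming 8-bit frames:

```python
import numpy as np

def mad(current_block, reference_block):
    """Mean absolute difference, Eq. (5)."""
    return np.abs(current_block.astype(float)
                  - reference_block.astype(float)).mean()

def psnr(original, compensated):
    """Peak signal to noise ratio for 8-bit frames, Eq. (6)."""
    mse = ((original.astype(float) - compensated.astype(float)) ** 2).mean()
    return 10 * np.log10(255.0 ** 2 / mse)
```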

IV. RESULTS AND DISCUSSIONS

The proposed method (PHGSC) has been tested on standard video, i.e. the Caltrain and a lecture-based video sequence. Video sequences with a distance of two frames between the current frame and the reference frame are used to generate frame-by-frame results of the proposed algorithm. To compare the efficiency of the proposed algorithm with existing algorithms, all algorithms were executed on an HP workstation (3.0 GHz CPU, 2 GB RAM) in MATLAB. The performance of PHGSC is compared with that of other methods [2][3][4][5][6], and the results are presented in Table I and Table II. The speed of PHGSC is faster than the published methods, and its PSNR is close to that of the published methods, as shown in Table III. PHGSC saves between 6.06 and 93.36 percent of computational time, with a PSNR gain ranging from -0.1573 to +1.7441 dB. In the suggested method, zero motion is stored directly; the zero motion vectors implemented in PHGSC save computational time while maintaining accuracy.

V. CONCLUSION

This paper presents a new hybrid soft computing technique known as PHGSC. The proposed technique is used for motion estimation in video. Compared to ES, PHGSC gives 0.1573 dB and 0.0189 dB lower PSNR for the Caltrain and the lecture-based video sequence, respectively.


Page 294: ADCOM 2009 Conference Proceedings

PHGSC saves between 6.06 and 93.36 percent of computational time, with a PSNR gain of +0.1726 to +1.7441 dB over the existing non-exhaustive methods. The results show promising improvement in terms of accuracy, while drastically reducing the computational time. The code developed is general in nature and proves to be a useful tool for motion estimation.

REFERENCES

[1] Y. Liu and K. M. Passino, "Biomimicry of Social Foraging Bacteria for Distributed Optimization: Models, Principles, and Emergent Behaviors", Journal of Optimization Theory and Applications, vol. 115, no. 3, pp. 603-628, December 2002.

[2] Jianhua Lu and Ming L. Liou, "A Simple and Efficient Search Algorithm for Block Matching Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 7, no. 2, pp. 429-433, April 1997.

[3] Renxiang Li, Bing Zeng and Ming L. Liou, "A New Three-Step Search Algorithm for Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438-442, August 1994.

[4] Lai-Man Po and Wing-Chung Ma, "A Novel Four-Step Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313-317, June 1996.

[5] Shan Zhu and Kai-Kuang Ma, "A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 9, no. 2, pp. 287-290, February 2000.

[6] Yao Nie and Kai-Kuang Ma, "Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 11, no. 12, pp. 1442-1448, December 2002.

[7] Chun-Ho Cheung and Lai-Man Po, "A Novel Cross-Diamond Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1168-1177, December 2002.

[8] C. W. Lam, L. M. Po and C. H. Cheung, "A New Cross-Diamond Search Algorithm for Fast Block Matching Motion Estimation", 2003 IEEE International Conference on Neural Networks and Signal Processing, Nanjing, China, December 2003, pp. 1262-1265.

[9] Humaira Nisar and Tae-Sun Choi, "An Adaptive Block Motion Estimation Algorithm Based on Spatio-Temporal Correlation", Digest of Technical Papers, International Conference on Consumer Electronics, January 7-11, 2006, pp. 393-394.

[10] Viet-Anh Nguyen and Yap-Peng Tan, "Efficient Block Matching Motion Estimation Based on Integral Frame Attributes", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, pp. 375-385, March 2006.

[11] Jiancong Luo, Ishfaq Ahmad, Yongfang Liang and Vishwanathan Swaminathan, "Motion Estimation for Content Adaptive Video Compression", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, pp. 900-909, July 2008.

[12] Chun-Man Mak, Chi-Keung Fong and Wai-Kuen Cham, "Fast Motion Estimation for H.264/AVC in Walsh-Hadamard Domain", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 6, pp. 735-745, June 2008.

[13] Shen Li, Weipu Xu, Nanning Zheng and Hui Wang, "A Novel Fast Motion Estimation Method Based on Genetic Algorithm", ACTA ELECTRONICA SINICA, vol. 28, no. 6, pp. 114-117, June 2000.

[14] K. M. Bakwad, S. S. Pattnaik, B. S. Sohi, Swapna Devi and M. R. Lohokare, "Parallel Bacterial Foraging Optimization for Video Compression", International Journal of Recent Trends in Engineering (Computer Science), vol. 1, no. 1, pp. 118-122, June 2009.

[15] M. Senthil Arumugam, M. V. C. Rao and Aarthi Chandramohan, "A New and Improved Version of Particle Swarm Optimization Algorithm with Global-Local Best Parameters", Journal of Knowledge and Information Systems (KAIS), Springer, vol. 16, no. 3, pp. 324-350, 2008.

[16] K. M. Bakwad, S. S. Pattnaik, B. S. Sohi, Swapna Devi, Ch. Vidya Sagar, P. K. Patra and Sastry V. R. S. Gollapudi, "Small Population Based Modified Parallel Particle Swarm Optimization for Motion Estimation", IEEE Sixteenth International Conference on Advanced Computing and Communications (ADCOM 2008), Anna University, Chennai, India, December 2008, pp. 367-373.

TABLE I. COMPARISON OF MEAN PSNR OF PROPOSED METHOD AND EXISTING METHODS

Sr. No  Video Sequence  Frames  ES       TSS      SESTSS   NTSS     4SS      DS       ARPS     Proposed
1       Caltrain        30      27.8422  26.2390  25.9408  26.9647  27.4322  27.5123  27.4336  27.6849
2       Lecturer Based  24      35.2214  34.8762  34.8757  34.8467  34.8273  34.8252  34.7248  35.2025

TABLE II. COMPARISON OF COMPUTATIONAL TIME (IN SECONDS) OF PROPOSED METHOD AND EXISTING METHODS

Sr. No  Video Sequence  Frames  ES    TSS   SESTSS  NTSS  4SS   DS    ARPS  Proposed
1       Caltrain        30      3.55  0.45  0.35    0.45  0.43  0.42  0.33  0.31
2       Lecturer Based  24      5.88  0.73  0.56    0.58  0.54  0.52  0.42  0.39

TABLE III. COMPUTATIONAL TIME SAVED AND PSNR GAIN BY PROPOSED METHOD OVER EXISTING METHODS

Measure                                 Video Sequence  ES       TSS      SESTSS   NTSS     4SS      DS       ARPS
Computational time saved by PHGSC (%)   Caltrain        91.26    31.11    11.42    31.11    27.9     26.19    6.06
                                        Lecture Based   93.36    46.57    30.35    32.35    27.71    25       7.14
PSNR gain by PHGSC (dB)                 Caltrain        -0.1573  +1.4459  +1.7441  +0.7202  +0.2527  +0.1726  +0.2513
                                        Lecture Based   -0.0189  +0.3263  +0.3268  +0.3558  +0.3752  +0.3773  +0.4771


Page 295: ADCOM 2009 Conference Proceedings

Texture Classification using Local Texture Patterns:

A Fuzzy Logic Approach

E.M. Srinivasan

Department of Electronics and Communication Engineering, Government Polytechnic College,

Nagercoil, Tamilnadu, India.

[email protected]

A. Suruliandi

Department of CSE, M S University,

Tirunelveli, Tamilnadu, India.

[email protected]

K.Ramar

Department of CSE, National Engineering College,

Kovilpatti, Tamilnadu, India.

[email protected]

Abstract — Texture analysis plays a vital role in image processing. The prospects of texture based image analysis depend on the texture features and the texture model. This paper presents a new texture model, 'Fuzzy Local Texture Patterns (FLTP)', together with the 'Fuzzy Pattern Spectrum (FPS)'. The local image texture is described by FLTP, and the global image texture is described by FPS, the occurrence frequency of FLTP over the entire image. The efficiency of the proposed texture model is tested with texture classification. The results show that the proposed method provides very good and robust performance.

Keywords — Texture Analysis, Texture Classification, Local Texture Patterns, Fuzzy Local Texture Patterns, Fuzzy Pattern Spectrum.

I. INTRODUCTION

Numerous texture modeling techniques have been developed by many researchers. Each method is superior in discriminating particular texture characteristics, but no single texture modeling technique works for all texture images. A comparative study of various texture analysis methods can be found in [5, 9].

A. Motivation and Justification for the Proposed Approach

Barcelo et al. [1] proposed a texture characterization approach, 'Fuzzy Texture Spectrum (FTS)', which is based on the texture model 'Texture Spectrum (TS)' introduced by He and Wang [3, 4, 8]. In the FTS approach, fuzzy logic and fuzzy techniques are added to the TS texture model with due consideration of the uncertainties introduced by noise and by different capture and digitization processes. In this representation scheme, the spectrum requires a total of 6561 bins. For real textures, the FTS method gives a better representation. Moreover, the FTS method provides superior discrimination between textured regions and homogeneous regions.

Recently, Suruliandi and Ramar [7] proposed a new texture modeling approach, 'Local Texture Patterns (LTP)'. They describe the local image texture by LTP and the global image texture by the 'Pattern Spectrum (PS)', which is the occurrence frequency of LTP over the whole image. In the LTP model, the pattern associated with a local texture region of size 3 x 3 is uniform or non-uniform, based on the gray-level difference between the central pixel and its neighbors as well as a uniformity measure computed by a specific rotation scheme. In this approach, the total number of patterns, and hence the number of bins in the histogram, is only 46. The LTP operator is computationally simple and robust against gray-scale and rotational variations.

As noted for the FTU model, the use of fuzzy techniques brings a significant improvement in texture characterization, but the number of bins required for the FTS model is 6561. This large number of bins brings out the local texture information in more detail; at the same time, as the number of bins increases, the computational time complexity of texture analysis also increases. In the case of the LTP model, the total number of bins required is only 46, and hence the model is computationally efficient for texture analysis. Thus it is realised that the fuzzy techniques used in the FTU model may be combined with the LTP model for a progressive approach. Hence, in this paper, it is proposed to introduce a new texture model that incorporates the advantages of both methods.

B. Outline of the Proposed Work

In this paper, a new texture analysis operator ‘Fuzzy Local Texture Patterns (FLTP)’ is proposed. The local image texture is described by FLTP and the global image texture is described by ‘Fuzzy Pattern Spectrum (FPS)’ which is an occurrence frequency of FLTP over the entire image. The performance of the proposed approach is demonstrated with texture classification.

C. Organization of the Paper

This paper is organized as follows. Section II describes the LTP texture model. Section III describes the proposed FLTP texture model. Section IV includes experiments conducted on texture classification of Brodatz [2] images and comparative analysis of various texture models based on classification performance. Section V concludes the work.


Page 296: ADCOM 2009 Conference Proceedings

II. LOCAL TEXTURE PATTERNS (LTP) AND PATTERN SPECTRUM (PS)

A. Local Image Texture Description by LTP

Let gc be the central pixel value and g1, g2,…,g8 be its neighbor pixel values in a 3x3 local region. Let the ‘Pattern Unit’ P, between gc and its neighbors gi (i=1,2, …, 8) be defined as

$$P(g_c, g_i) = \begin{cases} 0, & g_i < (g_c - \Delta g) \\ 1, & (g_c - \Delta g) \le g_i \le (g_c + \Delta g) \\ 9, & g_i > (g_c + \Delta g) \end{cases} \qquad i = 1, 2, \ldots, 8 \tag{1}$$

where Δg is a small positive threshold on the gray value, important in forming the patterns. P can be assigned one of three distinct values: 0, 1 and 9. There are eight P values for a local region, and a Pattern Units matrix is filled with these values. The method of calculating the P values in a 3 x 3 local region and the formation of the Pattern Units matrix and Pattern String are shown in Figure 1.

(a) 3 x 3 local region:
128 115 118
122 125 140
135 130 133

(b) Pattern Units matrix for Δg = 4:
1 0 0
1 . 9
9 9 9

(c) Pattern String: 0 0 9 9 9 9 1 1

Figure 1. (a) 3 x 3 local region (b) Pattern Units matrix for Δg = 4 (c) Pattern String

A 'Uniformity' measure U, which corresponds to the number of spatial transitions encountered circularly in the 'Pattern String', is defined as

$$U = s\big(P(g_c, g_8), P(g_c, g_1)\big) + \sum_{i=2}^{8} s\big(P(g_c, g_i), P(g_c, g_{i-1})\big) \tag{2}$$

where

$$s(X, Y) = \begin{cases} 1, & |X - Y| > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Patterns with a U value of at most 3 are treated as uniform patterns, and the others as non-uniform patterns. The LTP operator for describing a local texture is defined as

$$\mathrm{LTP} = \begin{cases} \sum_{i=1}^{8} P(g_c, g_i), & \text{if } U \le 3 \\ 73, & \text{otherwise} \end{cases} \tag{4}$$

There are 46 LTP in total. As there are a few holes in the LTP numbering scheme, the labels are remapped, using a lookup table, into continuous numbers from 1 to 46.

B. Global Image Texture Description by PS

The PS is the occurrence frequency of all the LTP, with the abscissa indicating the LTP and the ordinate representing its occurrence frequency. The global image texture is described with the help of the PS, which uses the LTP defined earlier as the measure to describe the global texture.

III. PROPOSED TEXTURE MODEL

In this section, the FLTP and FPS texture modeling approach is proposed. The proposed method borrows some basic principles from the LTP and FTS methods.

A. Fuzzy Local Texture Patterns – FLTP

It is noted from Figure 1(b) that the Pattern Units matrix is filled with unique P values (0, 1 or 9). The P values simply represent the relationship between the central pixel and its neighbors within a small 3 x 3 image region. To represent the same information in a more flexible way, each cell of the Pattern Units matrix can be assigned three membership values. Without loss of generality with respect to FTU and LTP, the membership values are directly associated with the degree to which the neighbor pixel is smaller than (0), equal to (1) or greater than (9) the centre pixel.

In a 3 x 3 local image region, let gc be the value of the central pixel and gi (i = 1, 2, ..., 8) the values of its neighbor pixels. With the assumption Δg = 0 in (1), let xi (i = 1, 2, ..., 8) be the difference between gi and gc. Let µ0(xi), µ1(xi) and µ9(xi) be the membership degrees for the values 0, 1 and 9 of xi, respectively. The 'Fuzzy Pattern Unit (FP)' between gc and its neighbors gi (i = 1, 2, ..., 8) is defined as

$$FP(g_c, g_i) = \big(\mu_0(x_i)/0,\; \mu_1(x_i)/1,\; \mu_9(x_i)/9\big), \qquad i = 1, 2, \ldots, 8 \tag{5}$$

If the local region is homogeneous, the difference between gc and gi will be zero or almost zero; µ1(xi) will be high, and µ0(xi) and µ9(xi) will be low. In a textured region, the difference between gc and gi increases, so µ1(xi) decreases while µ0(xi) and µ9(xi) increase.

Based on the above considerations, three membership functions are proposed here, arrived at from heuristic results, with parameters a and b determining the boundary coordinates of xi. The membership functions are given below.

$$\mu_0(x_i) = \begin{cases} 1, & x_i \le -b \\ \dfrac{-(x_i + a)}{b - a}, & -b < x_i < -a \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

$$\mu_1(x_i) = \begin{cases} 0, & |x_i| \ge b \\ \dfrac{b - |x_i|}{b - a}, & a < |x_i| < b \\ 1, & \text{otherwise} \end{cases} \tag{7}$$


Page 297: ADCOM 2009 Conference Proceedings

$$\mu_9(x_i) = \mu_0(-x_i) = \begin{cases} 1, & x_i \ge b \\ \dfrac{x_i - a}{b - a}, & a < x_i < b \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

The degrees to which the pixel gi is negative (smaller), zero (similar), or positive (larger) with regard to the central pixel gc are µ0(xi), µ1(xi) and µ9(xi), respectively. Hence, the FP associated with the central pixel is given by

$$FP = \big(\,(\mu_0(x_1)/0,\; \mu_1(x_1)/1,\; \mu_9(x_1)/9),\; \ldots,\; (\mu_0(x_8)/0,\; \mu_1(x_8)/1,\; \mu_9(x_8)/9)\,\big) \tag{9}$$

The local region can be represented as a Fuzzy Pattern Units matrix as shown in Figure 2. The entries in the matrix are FP values which are calculated using (6, 7, 8, 9).

Figure 2. Fuzzy Pattern Units matrix: each neighbor cell i (i = 1, ..., 8) of the 3 x 3 matrix holds the triple µ0(xi)/0, µ1(xi)/1, µ9(xi)/9, arranged circularly around the central pixel.

From the matrix elements, the FLTP is calculated by the following procedure. Each matrix element contains up to three P values (0, 1 or 9) together with their membership values. From these values, a set of 'Pattern Strings (S)' is constructed.

$$S_k(ps_i) = \begin{cases} P_i^{u}, & \text{if } \mu_u(x_i) = 1 \\ P_i^{u} \text{ and } P_i^{v} \text{ (two strings)}, & \text{if } 0 < \mu_u(x_i), \mu_v(x_i) < 1 \end{cases} \tag{10}$$

where psi (i = 1, 2, ..., 8) is the i-th element of S, and $P_i^{v}$ denotes the P value of the i-th element having membership v. If the i-th matrix element contains a membership degree equal to 1, the i-th element of the string is filled with the corresponding P value. For the other, non-zero membership values, two strings are formed, filled with the two corresponding P values at the i-th position.

We use a new mLTP operator, with minor modifications of (4). Here, mLTP is defined by

$$\mathrm{mLTP} = \sum_{i=1}^{8} ps_i \tag{11}$$

When the membership degree values are 1 in all the matrix elements, there is only one S and one mLTP. If there are n elements in the matrix having two non-zero membership values, the total number of S and mLTP is 2^n.

The degree of each mLTP is obtained by multiplying the eight corresponding membership degrees:

$$\mu(\mathrm{mLTP}) = \prod_{i=1}^{8} \mu_{ps_i}(x_i) \tag{12}$$

So, when this 3 x 3 local region is considered, the central pixel has an associated FLTP, defined by

$$\mathrm{FLTP} = \sum_{k=1}^{K} \mathrm{mLTP}_k \cdot \mu(\mathrm{mLTP}_k) \tag{13}$$

where K is the total number of S or mLTP.
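For illustration, a minimal Python sketch of Eqs. (5)-(13), under the membership functions as reconstructed above; the boundary parameters a and b and the example differences are assumptions:

```python
import numpy as np
from itertools import product

def fltp(diffs, a=2.0, b=8.0):
    """FLTP of a 3x3 region from the eight neighbor-center differences
    x_i (a sketch of Eqs. (5)-(13); a, b are illustrative values)."""
    def mu(x):                      # memberships for P = 0, 1, 9
        m0 = 1.0 if x <= -b else (-(x + a) / (b - a) if x < -a else 0.0)
        m9 = 1.0 if x >= b else ((x - a) / (b - a) if x > a else 0.0)
        m1 = 0.0 if abs(x) >= b else (1.0 if abs(x) <= a
                                      else (b - abs(x)) / (b - a))
        return {0: m0, 1: m1, 9: m9}

    # per position, keep only the P values with non-zero membership
    options = [[(p, m) for p, m in mu(x).items() if m > 0] for x in diffs]
    total = 0.0
    for combo in product(*options):           # one pattern string per combo
        mltp = sum(p for p, _ in combo)       # Eq. (11)
        weight = np.prod([m for _, m in combo])   # Eq. (12)
        total += mltp * weight                # Eq. (13)
    return total

print(fltp([3, -10, -7, -3, 15, 10, 5, 8]))
```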

B. Fuzzy Pattern Spectrum – FPS

Using the procedure outlined in the previous section, the FLTP are calculated. Each FLTP is identified as uniform or non-uniform using (2): if U ≤ 3, the pattern is treated as uniform, otherwise as non-uniform. In some natural textures, non-uniform patterns also help in describing texture characteristics. In the proposed approach, it is decided to have 73 uniform patterns (0 to 72) and another 73 non-uniform patterns (73 to 145). Therefore, there are a total of 146 bins in the occurrence histogram of the FPS.

IV. EXPERIMENTS AND RESULTS

A. Textures used in the Experiments

The textures used in the experiments are taken from the publicly available Brodatz benchmark database; they are shown in Figure 3. The textures are Beach sand, Stone, Sand, Grass and Water. These are typical of the images encountered in remotely sensed image analysis.

Figure 3. Brodatz images: (a) Beach sand (b) Stone (c) Sand (d) Grass (e) Water.

B. Texture Similarity

Similarity between different textures is evaluated by comparing their histograms. The histograms are compared as a test of goodness-of-fit using a nonparametric statistic, the log-likelihood ratio also known as the G-statistic [6]. The G-statistic compares the bins of two histograms and is defined as

$$G = 2\left[\sum_{s,m}\sum_{i=1}^{n} f_i \log f_i - \sum_{s,m}\left(\sum_{i=1}^{n} f_i\right)\log\left(\sum_{i=1}^{n} f_i\right) - \sum_{i=1}^{n}\left(\sum_{s,m} f_i\right)\log\left(\sum_{s,m} f_i\right) + \left(\sum_{s,m}\sum_{i=1}^{n} f_i\right)\log\left(\sum_{s,m}\sum_{i=1}^{n} f_i\right)\right] \tag{14}$$

where s is a histogram of the first image and m is a histogram of the second image, n is the total number of bins in the histogram, and fi is the frequency at bin i.

C. Classification of Brodatz Images using the Proposed

Texture Model

Page 298: ADCOM 2009 Conference Proceedings

An experiment on image classification was conducted to prove the efficiency of the proposed FLTP model. For this study, the Brodatz texture images shown in Figure 3, of size 512 x 512, were taken. Each individual texture image was considered as a model sample, and there were 5 samples in total. Test samples were extracted from the source images, keeping each pixel of the 512 x 512 image as the center of a sample; thus 262144 test samples were extracted irrespective of the sample size. Each test sample was compared against the model samples using (14) and classified into the category of the model sample giving the minimum G value. The test was carried out for test samples of size W equal to 15, 30 and 45. Table 1 shows the results.

TABLE 1. CLASSIFICATION OF BRODATZ IMAGES USING THE PROPOSED FLTP METHOD

Classified samples (total: 262144 samples per test)

Texture      W   Beach sand  Stone   Sand    Grass   Water   Accuracy (%)
Beach sand   15  259905      832     489     918     0       99.15
             30  262038      0       106     0       0       99.96
             45  262144      0       0       0       0       100
Stone        15  3111        249738  0       9295    0       95.27
             30  157         260096  0       1891    0       99.22
             45  0           261756  0       388     0       99.85
Sand         15  31          211     260635  1196    71      99.42
             30  0           0       262144  0       0       100
             45  0           0       262144  0       0       100
Grass        15  121         3504    1806    256713  0       97.93
             30  0           0       0       262144  0       100
             45  0           0       0       262144  0       100
Water        15  0           0       10476   1709    249959  95.35
             30  0           0       3958    0       258186  98.49
             45  0           0       343     0       261801  99.87

D. Quantitative Analysis of Various Texture Models using Classification Accuracy

The performance of the FTS and LTP models was compared with that of the proposed FLTP model. The result of the comparison is tabulated in Table 2.

Using the FTS model, 99.91 percent classification accuracy was obtained. This is due to the fact that the FTS model has very good discriminating power.

The LTP model yields an accuracy of 99.57 percent. The strength of this model is that it is robust against gray-scale and rotational variations, which is an important criterion for real-time applications.

The FLTP method performs well, with a classification accuracy of 99.69 percent. This is due to the fact that the number of local patterns identified is 146, which is sufficiently large to characterize the local spatial patterns.

TABLE 2. CLASSIFICATION ACCURACY OF VARIOUS TEXTURE MODELS

Classification accuracy (%)

Texture model  Beach sand  Stone   Sand    Grass   Water   Avg
FTU            100         99.88   100     100     99.67   99.91
FLTP           100         98.92   100     100     99.52   99.69
LTP            99.90       99.12   99.71   99.12   100     99.57

V. DISCUSSION AND CONCLUSION

In this paper, a new texture characterization technique based on FLTP and FPS is presented. Local patterns are identified by the FLTP method, and these patterns are used to form the FPS, which characterizes the global texture of the given image. The classification results in Table 1 show a high classification accuracy of more than 99% for Brodatz images. It is observed from Table 2 that the classification accuracy of the FLTP model is above 99%, which compares well with the other models. From the results, it is inferred that the proposed model has very good discriminatory power; hence, in future, it is planned to use the FLTP model for texture analysis tasks such as texture segmentation and texture based edge detection.

REFERENCES

[1] A. Barcelo, E. Montseny and P. Sobrevilla, "Fuzzy Texture Unit and Fuzzy Texture Spectrum for Texture Characterization", Fuzzy Sets and Systems, vol. 158, pp. 239-252, 2007.

[2] P. Brodatz, Texture - A Photographic Album for Artists and Designers, Reinhold, New York, 1968.

[3] D.C. He and L. Wang, "Texture Unit, Texture Spectrum and Texture Analysis", IEEE Trans. on Geoscience and Remote Sensing, vol. 28, no. 4, pp. 509-512, 1990.

[4] D.C. He and L. Wang, "Unsupervised Textural Classification of Images Using the Texture Spectrum", Pattern Recognition, vol. 25, no. 3, pp. 247-255, 1992.

[5] T. Ojala, M. Pietikäinen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.

[6] R.R. Sokal and F.J. Rohlf, Introduction to Biostatistics, 2nd ed., W.H. Freeman, 1987.

[7] A. Suruliandi and K. Ramar, "Local Texture Patterns - A univariate texture model for classification of images", Proceedings of the 2008 16th International Conference on Advanced Computing and Communications (ADCOM 2008), pp. 32-39, 2008.

[8] L. Wang and D.C. He, "Texture Classification using Texture Spectrum", Pattern Recognition, vol. 23, pp. 905-910, 1990.

[9] J. Zhang and T. Tan, "Brief Review of Invariant Texture Analysis Methods", Pattern Recognition, vol. 35, pp. 735-747, 2002.


Page 299: ADCOM 2009 Conference Proceedings

Integer Sequence based Discrete Gaussian and Reconfigurable Random Number Generator

Arulalan Rajan, H S Jamadagni

Centre for Electronics Design and Technology, Indian Institute of Science, India

(mrarul,hsjam)@cedt.iisc.ernet.in

Ashok Rao, Dept of E & C, CIT,

Gubbi, Tumkur, India [email protected]

Abstract - A simple random number generation technique based on integer sequences is proposed in this paper. Random integers with a Gaussian distribution were generated and compared with the numbers generated using the proposed technique. The mean square error between the target probability density function and the obtained one is negligible, of the order of 10^-6. Using the proposed technique, one can generate anywhere between 16,000 and 80,000 random integers between 1 and 100 with a Gaussian distribution; depending on the required range, the number of random numbers generated can be varied. The technique lends itself to a very simple hardware implementation that is dynamically reconfigurable on-the-fly to generate random variables with different distributions. Keywords - Integer Sequences, Discrete Gaussian, Random Number, Reconfigurable Hardware

I. INTRODUCTION

Random sequence generators have become indispensable in almost all fields, ranging from communication to finance. Random sequences have a probability distribution function (PDF) associated with them; the most frequently used ones are the uniform, Gaussian, Poisson and Binomial distributions [1]. Of these, uniformly distributed random numbers are generated using linear feedback shift registers [2], while Gaussian distributed random numbers are common in digital communication [3]. There are many techniques available for generating random numbers with a Gaussian (bell-shaped) distribution; these generators have, however, focused on generating random numbers in the interval (0, 1). Not much emphasis has been laid on generating a discrete analogue of the continuous Gaussian distribution. In this paper, we propose a new technique to generate a discrete analogue of Gaussian distributed random numbers. We also present a simple hardware implementation of such a Gaussian random number generator, and propose a dynamically reconfigurable hardware random number generator based on integer sequences.

The paper is organized as follows. In Section 2, we give an overview of existing techniques for generating Gaussian random numbers and their hardware implementations. We propose the new technique based on integer sequences in Section 3. In Section 4, we discuss the results of the proposed technique with regard to the generation and statistical characteristics of the random numbers. We conclude in Section 5.

II. OVERVIEW OF EXISTING TECHNIQUES FOR GAUSSIAN RANDOM NUMBER GENERATION

One of the most commonly used non-uniform, continuous distributions is the Gaussian distribution. A number of Gaussian random number generators have been described in the literature, most of which involve the transformation of uniform random numbers [4]. In this section, we present a quick overview of these techniques, together with a typical discrete analogue of Gaussian distributed random numbers.

The cumulative distribution function (CDF) inversion technique [1], the Box-Muller transform method [5] and its many hardware implementations [6], the rectangle-wedge-tail method by Marsaglia [7], and several other algorithms and implementations [8] for generating Gaussian distributed random numbers have been reported in the literature.

With digital signal processing techniques requiring discrete random numbers, one needs to look at different strategies for generating discrete analogues of continuous random numbers. A simple and straightforward technique is to sample the continuous Gaussian, yielding the sampled Gaussian kernel. The disadvantage of this method is that the discrete function does not have discrete analogues of the properties of the continuous function.

A second approach is to make use of a discrete Gaussian kernel [11] defined by

$$T(n, t) = e^{-t}\, I_n(t) \tag{1}$$

where I_n(t) is the modified Bessel function of integer order n. The complexity of generating the Bessel function in hardware is very high, and hence this approach is not well suited for hardware implementation.

Having given the overview of the existing techniques for generating Gaussian random numbers, we now proceed to discuss our technique to obtain Gaussian random numbers.

III. INTEGER SEQUENCE BASED GAUSSIAN RANDOM NUMBER GENERATION

Integer sequence, as the name implies, is a sequence of integers generated using difference equations or polynomial functions. In our work, we consider a few of the integer sequences, listed in the Online Encyclopedia of Integer Sequences (OEIS) [12], generated using some kind of recursive relations. Table 1 gives the list of integer sequences used for generating random numbers. Fig. 1 shows the plot of some of these sequences.


Page 300: ADCOM 2009 Conference Proceedings

These sequences, of a certain length (determined by the range in which the random numbers are needed), are pairwise convolved and their convolution plots studied. The envelope of each of these convolution plots turns out to be similar to a Gaussian. Fig. 2 shows the convolution of sequences, in which HCS denotes the Hofstadter-Conway sequence and GS the Golomb sequence. The idea of generating a Gaussian distribution follows directly from the plot and is discussed in detail as follows. The indices of the sequence resulting from the convolution of two sequences are taken as the set of random numbers that one can generate; the index typically runs from 1 to L+M−1. We take these numbers n, from 1 to L+M−1, as the values that a random variable X can take. The probability that X = n is given by the value of the convolution at n. We explain this in detail with an example. We take S1 to be the Golomb sequence of length 51, and S2 also to be the Golomb sequence, of length 50. We convolve the two sequences; the sum of the convolution over the entire length of the result,

$$\mathrm{Total} = \sum_{n=1}^{L+M-1} S_3(n), \quad \text{where } S_3(n) = S_1(n) * S_2(n) \tag{2}$$

gives the total number of random numbers that can be generated between 1 and 100. The probability that the random variable X takes the integer value n between 1 and 100 is given by S3(n), the sequence resulting from the convolution, i.e.,

$$P(X = x(n)) = S_3(n)/\mathrm{Total} \tag{3}$$

With the technique for generating Gaussian random numbers described in detail, we now proceed to discuss the hardware implementation of a Gaussian random number generator.

A. Hardware Implementation of the Random Number Generator

A sequence like Fibonacci sequence [11], described by the following recurrence,

( ) ( 1) ( 2), (1) 0, (1) 1a n a n a n a a= − + − = = - (4) is easier to generate in hardware as it involves only a simple recursion. However, this is not the case with most of the other sequences, which involve more than one recursion. To generate these sequences, new and simple strategies were developed. Let us look at the following sequences and discuss in detail the strategies developed for generating the same.

( ) 1 ( ( ( 1)), (1) 1; (2) 2;a n a n a a n a a= + − − = = - (5) ( ) ( ( 1)) ( ( 1)); (1) (2) 1;a n a a n a n a n a a= − + − − = = - (6)

Equations (5) and (6) are used to generate the Golomb sequence [12] and the Hofstadter-Conway sequence [13], respectively. The elements generated from (5) are as follows: 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, ... As seen from (5) and (6), direct hardware implementation of the generating function is not that simple. An alternate approach to generating these sequences, based on their inherent pattern, was proposed in [14]; a software reference model is sketched below.
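To make the recurrences concrete, the following minimal Python sketch (a software reference model only, not the hardware scheme of [14]) generates both sequences with 1-based indexing; the OEIS identifiers follow Table 1.

def golomb(length):
    # Golomb sequence (A001462): a(n) = 1 + a(n - a(a(n-1))), a(1) = 1, a(2) = 2.
    a = [0, 1, 2]  # a[0] is a placeholder so that a[n] is 1-based; length >= 2 assumed
    for n in range(3, length + 1):
        a.append(1 + a[n - a[a[n - 1]]])
    return a[1:]

def hofstadter_conway(length):
    # Hofstadter-Conway sequence (A004001): a(n) = a(a(n-1)) + a(n - a(n-1)), a(1) = a(2) = 1.
    a = [0, 1, 1]
    for n in range(3, length + 1):
        a.append(a[a[n - 1]] + a[n - a[n - 1]])
    return a[1:]

print(golomb(20))  # [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8]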

Having looked at the individual sequence generators, we now proceed to describe the architecture used to generate the random numbers.

Conventional random number generators take mean and variance as inputs and then generate random values. Not deviating much from the conventional scheme, we propose to use the variance as the input. The lengths of the sequences, or the sequences themselves, can be made to depend on the given variance.

On obtaining the lengths of the sequences, we can use a generator as simple as the one proposed in [14] to generate each sequence. The sequence elements are computed for half of the length, and symmetry is forced on the sequence for the other half. Once the sequences are generated, they can be convolved using either a single multiply-accumulate (MAC) unit or multiple MAC units. The result of this convolution is taken as the probability density function of the random variable X; a software sketch of this flow follows.
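Assuming the golomb() helper from the sketch above, the convolution-to-PDF step of equations (2) and (3) can be modeled in a few lines, with numpy's convolve standing in for the MAC units:

import numpy as np

def gaussian_like_pdf(s1, s2):
    # S3 = S1 * S2; the envelope of the convolution is approximately Gaussian.
    s3 = np.convolve(s1, s2)   # length L + M - 1
    total = s3.sum()           # "Total" of equation (2)
    return s3 / total          # P(X = n) of equation (3), for n = 1 .. L+M-1

pdf = gaussian_like_pdf(golomb(51), golomb(50))
print(len(pdf), pdf.sum())     # 100 values summing to 1.0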

The higher-level block diagram of the random number generator shown in figure 4 is then used to generate the random numbers with Gaussian distribution. An approach similar to the one used for generating the sequences can be used here as well. We obtain a pattern from the convolution and store it in the pattern information memory (PIM). The addresses of this memory are precisely the values that the random numbers can take. The addresses are generated using an LFSR, with the initial seed chosen to take the LFSR through all of its states; once all the states have been visited, the seed of the LFSR is changed. The LFSR provides the randomness in the generator's output. The decrement and compare unit (DCU) decrements the content of the location pointed to by the LFSR. The compare logic checks whether the content of the memory location pointed to by the LFSR is zero; in that case, the address is incremented by 1 so that the next element can be output. Since the LFSR value has to remain the same until the random value is output and the memory content is decremented, the LFSR can be made to operate at half the clock rate of the memory. The control engine shown is used to decide the sequence and the length of the sequence, based on the variance input and the distribution type. A behavioral model of this datapath is sketched below.
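In the following behavioral model, Python's random module stands in for the LFSR (the model does not reproduce the LFSR's maximal-length state walk or the half-rate clocking), and the raw convolution values play the role of the PIM contents:

import random

def play_out(counts, num_samples):
    # counts[i]: integer convolution value for random value i + 1, i.e. the
    # number of times that value may still be played out. num_samples must
    # not exceed sum(counts), or the search below will never terminate.
    pim = list(counts)                      # pattern information memory
    samples = []
    for _ in range(num_samples):
        addr = random.randrange(len(pim))   # LFSR stand-in: pseudo-random address
        while pim[addr] == 0:               # compare logic: location exhausted,
            addr = (addr + 1) % len(pim)    # so increment the address by 1
        pim[addr] -= 1                      # decrement unit (DCU)
        samples.append(addr + 1)            # random values run from 1 to L+M-1
    return samples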

A simple modification to figure 4 results in a dynamically reconfigurable random number generator. Here the pattern information memory is segmented, holding the pattern information of multiple convolution results. The higher-order address bits can be made to identify the distribution type, and the lower-order address bits can be used as the values that the random variable can take. Based on the variance and the distributions, the sequences and their lengths can be obtained. Depending on the distribution, one can use the integer sequences directly, convolve them, or perform other operations, followed by writing the result into the pattern information memory. An alternate way to make the hardware reconfigurable is to have another memory segment where the pattern can be stored. The contents of this memory could be written into the pattern information memory of figure 4 on the fly, so that any distribution can be obtained using the same hardware as in figure 4.


Page 301: ADCOM 2009 Conference Proceedings

IV. RESULTS

Sequences listed in table 1 were considered for random number generation. Convolution was performed with the following combinations of the sequences:

• A sequence S1 convolved with itself
• A sequence S1 convolved with another sequence S2

Without loss of generality, the lengths of the sequences were taken to be the same, in order to study the relation between length and variance. The numbers were generated in MATLAB using the proposed technique and compared with Gaussian random numbers generated with the same mean and variance. The mean squared error between the two is also plotted.

As mentioned in section III, figure 1 and figure 2 are the plots of the sequences for various lengths and of their convolution, respectively. Figure 3 shows the histogram plot of the convolution of the Golomb sequence with itself. The CDF plot is shown in figure 5. The variation of length (assuming the convolved sequences have the same length, equal to L) with standard deviation and with variance is shown in figures 6 and 7 respectively. We find that for a Gaussian distribution the length of the sequence and the standard deviation have a linear relation, while the length and the variance have a square relation. Thus the lengths of the sequences can be made dependent on the standard deviation, and hence on the variance.

In the usual Gaussian random number generators, within a given interval, the variance σ² can vary depending on the requirement, and hence the profile or shape of the Gaussian PDF varies. The analogous situation here is to keep the range the same while obtaining a different variance; to achieve this, the lengths of the two sequences can be changed, say from 50 each to 61 and 40, or to any other combination of lengths such that the sequence resulting from the convolution has a length of 100. This is illustrated in figure 8.

Figure 9 compares the estimated PDF with the obtained PDF, and figure 10 gives the mean square error plot of the comparison. We find that the mean square error is of the order of 10^-6. This shows that the technique proposed in this paper is an efficient way to generate random integers following a Gaussian distribution.

V. CONCLUSION

The technique of using integer sequences for random number generation has been proposed in this paper. It has been shown that the convolution of a certain family of integer sequences can be used to generate a discrete Gaussian random variable. The variance of the Gaussian random variable is made to influence the choice of the integer sequences and the lengths of the sequences to be convolved. A simple architecture to generate Gaussian random numbers is also presented, and it has been illustrated that a slight modification to this architecture yields a random number generator reconfigurable across different distributions. The mean square error plot shows that the error is very small. In future work, we propose to explore the use of integer sequences in generating extreme-valued probability distribution functions.

VI. REFERENCES

[1] D. E. Knuth, "Seminumerical Algorithms - The Art of Computer Programming", Vol. 2, 3rd ed., Addison-Wesley, USA, 1998.
[2] S. W. Golomb, "Shift Register Sequences", Aegean Park Press, 1981.
[3] Xilinx, "Additive White Gaussian Noise (AWGN) Core", CoreGen documentation file, 2002.
[4] D. Thomas, P. Leong, J. Villasenor, "Gaussian Random Number Generators", ACM Computing Surveys, Vol. 39, No. 4, Article 11, Oct 2007.
[5] G. E. P. Box and M. E. Muller, "A note on the generation of random normal deviates", The Annals of Math. Statistics, vol. 29, 1958, pp. 610-611.
[6] A. Alimohammad, S. F. Fard, B. F. Cockburn, C. Schlegel, "A Compact and Accurate Gaussian Variate Generator", IEEE Trans. on VLSI Systems, vol. 16, no. 5, 2008, pp. 517-527.
[7] G. Marsaglia, T. A. Bray, "A Convenient Method for Generating Normal Variables", SIAM Review, vol. 6, no. 3, 1964, pp. 260-264.
[8] G. Zhang, P. H. W. Leong, D. Lee, J. D. Villasenor, R. C. C. Cheung, and W. Luk, "Ziggurat-based hardware Gaussian random number generator", in IEEE Intl. Conference on Field Programmable Logic and its Applications, 2005, pp. 275-280.
[9] T. Lindeberg, "Scale-space for discrete signals", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 3, March 1990, pp. 234-254.
[10] N. J. A. Sloane, "The Online Encyclopedia of Integer Sequences", www.research.att.com/~njas/sequences/
[11] www.research.att.com/~njas/sequences/A000045
[12] www.research.att.com/~njas/sequences/A001462
[13] www.research.att.com/~njas/sequences/A004001
[14] A. Rajan, H. S. Jamadagni, A. Rao, "Integer Sequence Window based Reconfigurable FIR Filters", Proc. of the First IEEE Intl. Workshop on Reconfigurable Computing, Dec 2008, India, http://ewh.ieee.org/conf/hprcw/rcw08.html

Table 1. List of Integer Sequences from [10]

Sequence   Generating Function                        Initial Values
A001462    a[n] = 1 + a[n - a[a[n-1]]]                a[1] = 1, a[2] = 2
A004001    a[n] = a[a[n-1]] + a[n - a[n-1]]           a[1] = a[2] = 1
A113886    a[n] = a[a[n-2]] + a[n - a[n-1]]           a[1] = a[2] = 1
A005229    a[n] = a[a[n-2]] + a[n - a[n-2]]           a[1] = a[2] = 1
A098378    a[n] = a[a[a[n-1]]] + a[n - a[a[n-1]]]     a[1] = a[2] = 1
A006158    a[n] = a[a[n-3]] + a[n - a[n-3]]           a[1] = a[2] = 1
A006161    a[n] = a[a[n-1] - 1] + a[n + 1 - a[n-1]]   a[1] = a[2] = 1


Page 302: ADCOM 2009 Conference Proceedings

Figure 1. Integer Sequences with forced symmetry
Figure 2. Convolution of Sequences
Figure 3. Histogram plot for random variable obtained from Golomb sequence convolution
Figure 4. Random Number Generator
Figure 5. Estimated CDF for random variable obtained from Golomb sequence convolution
Figure 6. Standard Deviation vs Length plot for Golomb sequence convolution based distribution
Figure 7. Variance vs Sequence Length
Figure 8. Profile variation with change in lengths for the same range between 1 and 100, but different variance
Figure 9. PDF plots comparison
Figure 10. MSE Plot



Page 303: ADCOM 2009 Conference Proceedings

Parallelization of PageRank and HITS Algorithm on CUDA Architecture

Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal
Department of Electronics & Computer Engineering,
Indian Institute of Technology, Roorkee, India
kicomuec, mickyuec, naresuec, [email protected]

Abstract

Efficiency of any search engine mostly depends on how efficiently and precisely it can determine the importance and popularity of a web document. The PageRank algorithm and the HITS algorithm are widely known approaches for determining the importance and popularity of web pages. Due to the large number of documents available on the World Wide Web, a huge amount of computation is required to determine the rank of web pages, making the task very time consuming. Researchers have devoted much attention to parallelizing PageRank on PC clusters, grids, and multi-core processors like the Cell Broadband Engine to overcome this issue, but with little or no success. In this paper, we discuss the issues in porting these algorithms to the Compute Unified Device Architecture (CUDA) and introduce efficient parallel implementations of these algorithms on CUDA that exploit the block structure of the web, which not only cuts down the computation time but also significantly reduces the cost of the hardware required.

1. INTRODUCTION

In present days, the unceasing growth of the World Wide Web has led to a lot of research in the page ranking algorithms used by search engines to provide the most relevant results to the user for any particular query. The dynamic and diverse nature of the web graph further exacerbates the challenges in achieving optimum results. Web link analysis provides a way to order web pages by studying the link structure of web graphs. PageRank and HITS (Hyperlink-Induced Topic Search) are two of the most popular such algorithms, used by some current search engines in the same or modified form to rank documents based on their link structure. PageRank, originally introduced by Brin and Page [1], is based on the fact that a web page is more important if many other web pages link to it. At its core, it involves continuously iterating over the web graph until the rank assigned to every page converges to a stable value. In contrast to PageRank, the related HITS algorithm, developed by Kleinberg [2], ranks documents on the basis of two scores which it assigns to a particular set of documents dependent on a specific query, although the basis of computation is the same for both. This paper addresses issues related to the parallel implementation of these algorithms and proposes an innovative way of exploiting the block structure of the web existing at a much lower level. Our approach to the parallel implementation of these algorithms on NVIDIA's multi-core CUDA architecture not only reduces the computation time but also requires much cheaper hardware.

2. PAGERANK

2.1. Algorithm

Let Rank(p) denote the rank of web page p from the set of all web pages P. Let Sp be the set of all web pages that point to page p, and let Nu be the outdegree of a page u ∈ Sp. The "importance" given by a page u to the page p through its link is then measured as Rank(u)/Nu, so the total "importance" given to a page p is the sum of the "importance" contributed by all incoming links of p. This is computed iteratively until the page ranks converge. The iteration is as follows.

∀p ∈ P: Rank(p) = (1 − d) + d · Σ_{u ∈ Sp} Rank(u) / N_u … (1)

where d is the "damping factor" from the random surfer model [1]. We use 0.85 as the value of d throughout this paper, as given in [1]. The use of d ensures the convergence of the PageRank algorithm [5].
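For reference, a minimal sequential Python sketch of the iteration in equation (1) is shown below; the dictionary-based graph representation and fixed iteration count are illustrative choices of ours, not the CUDA data layout discussed in section 2.4.

def pagerank(in_links, out_degree, d=0.85, iterations=50):
    # in_links[p]   : pages u that link to p (the set Sp)
    # out_degree[u] : outdegree Nu of page u
    rank = {p: 1.0 for p in in_links}
    for _ in range(iterations):
        rank = {p: (1 - d) + d * sum(rank[u] / out_degree[u] for u in in_links[p])
                for p in in_links}
    return rank

# Toy graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
print(pagerank({0: [2], 1: [0, 2], 2: [1]}, {0: 1, 1: 1, 2: 2}))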

2.2. Related Works

Since PageRank involves a huge amount of computation, many researchers have attempted its parallel implementation with their own approaches. Kamvar et al. [6] exploit the block structure of the web for computing PageRank. Rungsawang and Manaskasemsak used a PC cluster to compute PageRank [3], dividing the input graph into equal sets and computing them on the individual cluster nodes. Rungsawang and Manaskasemsak also implemented a partition-based parallel PageRank algorithm [4] on a PC cluster. Another efficient parallel implementation of PageRank on a PC cluster, by Kohlschutter et al. [5], achieves a gain of 10 times by using the block structure of web pages and reformulating the algorithm as a combination of the Jacobi and Gauss-Seidel methods. The implementation [9] on the multi-core, 8-SPU Cell BE has shown the PageRank algorithm running 22 times slower.

2.3. CUDA Architecture

CUDA™, introduced by NVIDIA, is a general-purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. These GPUs


Page 304: ADCOM 2009 Conference Proceedings

are used as coprocessors to assist the CPU in computationally intensive tasks. More details about this architecture can be found in [11]. Here, we highlight the features of the CUDA architecture that need special mention in relation to our work:
1. SIMT Architecture
2. Asynchronous Concurrent Execution
3. Warps
4. Memory Coalescing

2.4. Porting Issues of Parallel Implementation on CUDA Architecture

Porting issues with the PageRank algorithm are mainly concerned with the hardware restrictions of the CUDA architecture. Some important issues are as follows:

1) Non-coalesced memory access: CUDA has constraints related to memory accesses. The protocol followed by the CUDA architecture for memory transactions ensures that all threads referencing memory in the same segment are serviced simultaneously. Bandwidth is therefore used most efficiently only if the simultaneous memory accesses by the threads in a half-warp belong to the same segment. Due to the uneven and random nature of the indegrees of nodes, memory references sometimes become non-coalesced, hindering the simultaneous servicing of memory transactions and wasting memory bandwidth.

Solution: Nodes generally link within their locality, with few links to farther nodes. To improve the rank calculation of a node, say p, we process on the kernel only those nodes which belong to the locality of p, determined by lower and upper limits; the rest of the nodes are processed on the host processor. We therefore create two link-structured input files: one to be processed by the kernel, containing the nodes lying in the locality, and the other containing the rest of the nodes, to be processed on the host processor. A sketch of this split follows.
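The preprocessing split can be sketched as follows (the node numbering and window parameters are our illustrative assumptions):

def split_by_locality(links, lower, upper):
    # links: (source, destination) pairs of the web graph.
    # A link is "local" if its source lies within the window around its destination.
    kernel_links, host_links = [], []
    for src, dst in links:
        if dst - lower <= src <= dst + upper:
            kernel_links.append((src, dst))  # coalesced-friendly: GPU kernel input
        else:
            host_links.append((src, dst))    # remainder: processed on the host CPU
    return kernel_links, host_links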

2) Divergence in control flow: CUDA requires the execution paths of all threads in a warp to be similar for the threads to execute in parallel; divergent execution paths within a warp cause CUDA to suspend parallel execution and run the threads sequentially (they become serialized), decreasing throughput. As the indegrees of nodes can be very dissimilar, the calculation loop, which iterates over the indegree of a node, can make a thread's control flow diverge from that of the other threads.

Solution: To solve this problem, we exploit the block structure [8] of the web. A careful study reveals that block structure exists even at a smaller level. So we divide all the nodes into blocks and calculate the average indegree for each block separately. The remaining nodes are added to the link-structured input file for the host. When blocks are scheduled on the device's multiprocessors, all threads in a warp then follow a similar execution flow to a much greater extent. The number of calculations on the host can be decreased further by using a constant multiple (the average factor) of the average value. The constant giving peak performance differs between input graphs, depending on the distribution of indegrees among the nodes.

2.5. Results and Observations

We used four different parameters in our experiments: block size, average factor, and the lower and upper limits of the locality range. Increasing the range limits beyond a few thousand either decreases performance or leaves it unchanged, and the block sizes giving a reasonable increase in performance are 32 and 64: with smaller block sizes the number of threads executing in parallel is too small, while with larger block sizes the threads become more divergent. Since the average values for smaller block sizes are very high, increasing the average factor decreases performance, whereas with larger blocks an increase in the average factor increases performance. Using suitable parameters based on the above discussion, we achieved some promising results.

3. HITS

3.1. Algorithm

The HITS algorithm also ranks web pages on the basis of the link structure of the web. For this purpose, it assigns two scores to a web page, namely an Authority Score and a Hub Score. A higher Authority Score means that the given web page is linked to by many documents with high Hub Scores, and a higher Hub Score means that the given document points to many documents with high Authority Scores. In contrast to PageRank, this algorithm is query dependent, and both scores are assigned at run time depending upon the query. For a given query, a relatively small set of relevant documents, the Root Set (R), is retrieved from the web, generally on the basis of occurrences of the words of the query (Q) or TF-IDF. From the Root Set, a Complete Set (C) is then formed by including all documents which either point to at least one document in the Root Set or are pointed to by at least one document in the Root Set. Finally, the scores are assigned to all documents in the Complete Set over a number of iterations; in [7] it is shown that the scores generally converge in 5-10 iterations. The Authority Score Ai is the sum of the Hub Scores Hj of the documents pointing towards document i, and the Hub Score Hi is the sum of the Authority Scores Aj of the documents that document i points to. Since the Root Set and Complete Set, and therefore the Authority and Hub Scores, are calculated at the time of the query, both


Page 305: ADCOM 2009 Conference Proceedings

these scores are query specific. This makes the algorithm quite slow, and hence infeasible for use in real-life situations as it stands. Since the third step (score assignment) is the most time consuming, we present our algorithm to make this part run faster on the CUDA architecture. A reference model of the score iteration is sketched below.
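The following sketch models that third step, with the usual L2 normalization added so the scores converge (the paper does not spell the normalization out); in_links and out_links are assumed to be dictionaries over the Complete Set:

import math

def hits(in_links, out_links, iterations=10):
    # One authority/hub update per iteration; scores typically converge in
    # 5-10 iterations [7].
    auth = {p: 1.0 for p in in_links}
    hub = {p: 1.0 for p in in_links}
    for _ in range(iterations):
        auth = {p: sum(hub[u] for u in in_links[p]) for p in auth}
        scale = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / scale for p, v in auth.items()}
        hub = {p: sum(auth[w] for w in out_links[p]) for p in hub}
        scale = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / scale for p, v in hub.items()}
    return auth, hub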

3.2. Parallel Implementation

Since the computations involved in this algorithm are similar to PageRank, it follows the implementation model discussed in section 2.4. As CUDA allows asynchronous concurrent execution, control returns to the host before the device completes its task, leading to parallel execution of code on the host (CPU) and the device (GPU). So, in order to minimize the computation time, the task of calculating each score is divided between host and device such that both take approximately the same time to compute their part of the job.

3.3. Results of Parallel Implementation of HITS

Since the computations in the HITS algorithm are similar to PageRank, following the same implementation model on a set of 300,000 nodes generated using WebGraph [12], we achieved a significant gain on the CUDA architecture as compared to the CPU.

4. CONCLUSION

In this paper, we demonstrate how to effectively parallelize graph-based algorithms like PageRank and HITS on the CUDA architecture to achieve high performance with much cheaper hardware. Further, if the nodes of PC clusters include CUDA-enabled devices, still better performance can be achieved: for the speedup of 10 on clusters achieved by [3], the performance gain can be increased up to 40 times with a marginal increase in cost. Since the HITS algorithm calculates its scores at query time, its optimization will not only lead to quicker results; more accurate results can also be achieved by increasing the size of the Complete Set. Our approach can also be extended to parallelize other graph-based algorithms operating on sparse graphs. LSI is another information retrieval method, one that uses Singular Value Decomposition to identify patterns between terms and concepts contained in an unstructured collection of text; it too requires relatively high computational performance compared to other IR methods, which has proved to be its main bottleneck.

REFERENCES

[1] S. Brin and L. Page, "The Anatomy of a Large Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, Volume 30, Issue 1-7, April 1998.
[2] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM (JACM), Volume 46, Issue 5, September 1999.
[3] A. Rungsawang and B. Manaskasemsak, "PageRank Computation Using PC Cluster", Proceedings of the 10th European PVM/MPI User's Group Meeting, Venice, Italy, 29 Sep - 2 Oct 2003.
[4] A. Rungsawang and B. Manaskasemsak, "Partition-Based Parallel PageRank Algorithm", Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), Sydney, 4-7 July 2005.
[5] C. Kohlschutter, P. Chirita, and W. Nejdl, "Efficient Parallel Computation of PageRank", Proceedings of the 28th European Conference on Information Retrieval (ECIR), London, United Kingdom, 2006.
[6] S. Kamvar, T. H. Haveliwala, C. D. Manning, G. H. Golub, "Exploiting the Block Structure of the Web for Computing PageRank", Technical Report CSSM-03-02, Computer Science Department, Stanford University, 2003.
[7] Y. G. Saffar, K. S. Esmaili, M. Ghodsi, and H. Abolhassani, "Parallel Online Ranking of Web Pages", The 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-06), UAE, March 2006, pp. 104-109.
[8] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank Computation and the Structure of the Web: Experiments and Algorithms", Proceedings of the 11th World Wide Web Conference, poster track, Honolulu, Hawaii, 7-11 May 2002.
[9] G. Buehrer, S. Parthasarathy, and M. Goyder, "Data Mining on the Cell Broadband Engine", Proceedings of ICS'08, Cairo, Egypt, 20-24 October 2008.
[10] S. Nomura, S. Oyama, T. Hayamizu, and T. Ishida, "Analysis and Improvement of HITS Algorithm for Detecting Web Communities".
[11] NVIDIA CUDA Programming Guide 2.2, NVIDIA Corporation.
[12] WebGraph Laboratory, http://webgraph.dsi.unimi.it/, 2006.


Page 306: ADCOM 2009 Conference Proceedings

Designing Application Specific Irregular Topology for Network-on-Chip

Naveen Choudhary
Department of Computer Science & Engineering
College of Engineering and Technology, Udaipur, India
[email protected]

M.S. Gaur, V. Laxmi
Department of Computer Engineering
Malaviya National Institute of Technology, Jaipur, India
gaurms|[email protected]

Virendra Singh
SERC, Indian Institute of Science, Bangalore, India
[email protected]

Abstract—Network-on-Chip (NoC) has been proposed as a solution for the communication challenges of System-on-Chip (SoC) design in nano-scale technologies. Application-specific SoC design offers the opportunity to incorporate custom NoC architectures that are more suitable for a particular application and may not conform to regular topologies. In this work we propose to generate a custom NoC that maximizes performance under given resource constraints. This being an NP-hard problem, we present a heuristic technique based on a genetic algorithm for the synthesis of custom NoC architectures, along with the requisite routing tables, with the objective of improving the communication load distribution.

Keywords-NP-hard; Network-on-Chip; Optimization; Performance; Cores

I. INTRODUCTION

The Network-on-Chip [1, 2, 6] has been proposed as a promising communication architecture for modern SoC platforms with increasing numbers of processor cores. Several early works favored the use of standard topologies such as meshes, tori, k-ary n-cubes or fat trees, under the assumption that wires can be well structured in such topologies. However, most SoCs are heterogeneous, with cores of different sizes, functionality and communication requirements. Standard topologies can therefore have a structure that poorly matches the application traffic, leading to large wiring complexity after floorplanning as well as significant power and area overhead. Since the traffic characteristics of an application-specific SoC can be well characterized at design time [7], networks with irregular topologies tailored to the application requirements are expected to have an edge over networks with regular topologies. A key problem in NoC design is to ensure that no deadlock situation can block the whole network under the chosen routing algorithm; deadlock-free, topology-agnostic routing algorithms exist, such as up*/down* [3], L-turn [4] and down/up [5]. These algorithms have in common that they are based on turn prohibition, a methodology which avoids deadlock by prohibiting a subset of all turns in the network. In this paper, a genetic algorithm based heuristic is proposed for the design of customized irregular Networks-on-Chip. The presented methodology uses the predefined application communication characteristics to generate an optimized network topology along with the corresponding routing tables. The irregular NoC communication model is defined in Section II. The proposed genetic algorithm based methodology is presented in Section III. Section IV summarizes some experimental results, followed by a brief conclusion in Section V.

II. IRREGULAR NOC COMMUNICATION MODEL

Task graphs are generally used to model the behavior of complex SoC applications at an abstract level. The tasks T are mapped to a set of IP cores (Intellectual Property cores) V which communicate through unidirectional point-to-point abstract channels. In this paper the task-to-core mapping is assumed to be already done. Definitions 1 and 2 define the core graph and the NoC topology graph respectively.

Definition 1. The core graph is a directed graph G(V, E) with each vertex νi ∈ V representing an IP core and each directed edge (νi, νj), denoted ei,j ∈ E, representing the communication between the cores νi and νj. The weight of the edge ei,j, denoted bwi,j, represents the desired bandwidth of the communication from νi to νj.

Definition 2. The NoC topology graph is a directed graph N(U, F) with each vertex υi ∈ U representing a node/tile in the topology and each directed edge (υi, υj), denoted fi,j ∈ F, representing a direct physical communication link/channel between the vertices υi and υj. The weight of the edge fi,j, denoted Abwi,j, represents the available link/channel bandwidth across the edge fi,j.

III. IRREGULAR NOC TOPOLOGY GENERATION METHODOLOGY

As shown in Fig. 1, floorplanning using a methodology like B*-Trees [12], with the objective of minimizing area, can be done as the first step. The irregular topology construction starts by creating a Breadth First Search spanning tree based on the Manhattan distance among the IP cores. The permitted node degree at this stage (nd_treemax), i.e., the number of allowed ports per IP core, is kept less than the actual permitted node degree (ndmax). The initial tree topology is strongly connected and thus provides a path between every pair of nodes; this property is retained throughout the topology generation


Page 307: ADCOM 2009 Conference Proceedings

process. Based on the constructed minimum spanning tree and using Dijkstra's shortest path algorithm, the routing table entries for the routers of the NoC are generated for each edge in the core graph. At this stage the traffic load on these tree paths is assigned according to the bandwidth requirements in the core graph: the basic tree path for (υi, υj) in the NoC topology graph is assigned the traffic load bwi,j, and each edge of the path (υi, υj) is assigned a traffic load equal to the sum of its previously assigned traffic load and bwi,j (see the sketch below). In the next phase of the methodology, a genetic algorithm based heuristic is used for the design of the customized irregular NoC. The proposed genetic algorithm explores the search space to generate an irregular topology with optimized bandwidth load distribution and improved energy requirements.
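The initial load assignment reduces to a simple accumulation over the tree paths; a sketch, under an illustrative dictionary representation of the graphs:

def assign_tree_loads(core_bw, tree_path):
    # core_bw[(i, j)]  : demanded bandwidth bw(i, j) of a core-graph edge
    # tree_path[(i, j)]: list of topology channels on the spanning-tree path i -> j
    link_load = {}
    for (i, j), bw in core_bw.items():
        for channel in tree_path[(i, j)]:
            link_load[channel] = link_load.get(channel, 0) + bw
    return link_load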

A. Solution Representation

Each chromosome is represented by an array of genes, with the maximum size of the gene array equal to the number of edges in the core graph. Each gene contains information about the various possible paths in the NoC topology graph between its <source(υi), destination(υj)> pair. A gene is only permitted to have a maximum of n (a configurable parameter) paths; at least one of these paths is the shortest path through the edges of the minimum spanning tree exclusively, and the rest of the paths are generated by adding shortcuts.

B. Mutation Operators

The following three mutation operations are used to bring variety into the population.

1) Topology-Extension-Mutation: A random number of genes are picked from the selected chromosomes and their paths are checked for the traffic loads assigned to them. If any of the edges/channels of a path is heavily loaded, a suitable shortcut channel is inserted into the topology. The added shortcut is constrained by the maximum permitted channel length emax, due to physical signaling delay, which prevents the algorithm from inserting wires that span long distances across the chip. Similarly, a shortcut is not added between IP cores if it would exceed the given maximum permitted node degree ndmax of either its source or its target core; this constraint prevents the algorithm from instantiating slow routers with a large number of I/O channels, which would decrease the achievable clock frequency due to the internal routing and scheduling delay of the router. A new deadlock-free path is formed, including the added shortcut channel, using Dijkstra's shortest path algorithm in combination with the routing rules of up*/down* routing. The excess load of the selected path is transferred to the channels of the new path if this does not overload the new path's channels; otherwise the shortcut is rejected.
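The two feasibility constraints on a candidate shortcut can be sketched as follows (tile positions and the Manhattan length model are our illustrative assumptions):

def shortcut_allowed(src, dst, tile_pos, node_degree, e_max, nd_max):
    # Reject shortcuts exceeding the maximum channel length e_max (signaling
    # delay) or pushing either endpoint past the permitted node degree nd_max.
    (x1, y1), (x2, y2) = tile_pos[src], tile_pos[dst]
    length = abs(x1 - x2) + abs(y1 - y2)   # Manhattan distance on the floorplan
    return (length <= e_max
            and node_degree[src] < nd_max
            and node_degree[dst] < nd_max)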

2) Topology-Reduction-Mutation: This mutation tries to remove channels that are very lightly loaded from the topology. The load of the path to be removed is transferred to an existing path of the gene having the minimum load on its channels, such that the overall load distribution improves.

Figure 1. Network construction flow using genetic algorithm

3) Energy-Reduction-Mutation: This mutation is performed on randomly selected chromosomes, biased towards the best class of the population in each generation. Each path of every gene of the chromosome is traversed, and we try to find a shorter replacement path by adding a suitable shortcut.

C. Crossover Operator

To achieve crossover, two chromosomes and a random crossover point are selected, and the genes of these chromosomes are mixed over the crossover point to produce two new chromosomes. A new chromosome is accepted only if it leads to an improvement in the cost.

D. Measure of Fitness & Output

The fitness (cost function) measure essentially has two components: (1) the average bandwidth requirement overflow, and (2) the dynamic energy requirement [8, 10] of the traffic for the customized topology. Let X1 be the maximum chromosome energy requirement among all chromosomes in the population, X2 the maximum possible bandwidth requirement of a channel of the NoC topology graph among all chromosomes in the population, Eci the energy requirement of chromosome ci, and Bci the average bandwidth requirement overflow per channel of the NoC topology graph for chromosome ci. The cost of chromosome ci can then be formulated as

Cost(ci) = α · (Eci / X1) + β · (Bci / X2)

where α and β are two empirically determined constants.

The values of α and β were fixed at 0.25 and 0.75 respectively. It may be noted that the best 10% of chromosomes in any generation are transferred directly to the next generation, so that the solution does not degrade between generations. The topology, routing tables and traffic load mapping for the paths of the best output chromosome are accepted as the inputs for the IrNIRGAM (simulator for Irregular topology based NoC Interconnect Routing and Application Modeling) NoC simulator.
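The cost function transcribes directly to code (α = 0.25, β = 0.75 as fixed above):

def chromosome_cost(Ec, Bc, X1, X2, alpha=0.25, beta=0.75):
    # Ec: energy requirement of chromosome ci
    # Bc: average bandwidth-requirement overflow per channel for ci
    # X1: maximum chromosome energy requirement in the population
    # X2: maximum possible channel bandwidth requirement in the population
    return alpha * (Ec / X1) + beta * (Bc / X2)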

IV. EXPERIMENTAL RESULTS

In order to obtain a broad range of different irregular traffic scenarios, we randomly generated multiple core graphs using TGFF [11] with diverse bandwidth requirements for the IP cores. For performance comparison, the NoC simulator IrNIRGAM, which supports irregular topologies, was deployed. IrNIRGAM is an


Page 308: ADCOM 2009 Conference Proceedings

extended version of NIRGAM [13] that supports irregular NoCs with table-based routing. For performance comparison, IrNIRGAM was run for 10000 clock cycles, and network throughput in flits and average flit latency were used as the comparison parameters.

Figure 2. Average Performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing

The proposed genetic algorithm was run for 1000 generations with a population size of 200 to obtain the customized irregular topology. Mutations are performed on 15% of the population and crossover on 30% of the population in each generation. During optimization, the maximum channel length was set to twice the length of a tile. Figure 2 summarizes the performance results, averaged over 50 generated irregular topologies (IrNoC) with a permitted node/core degree of 6 and the number of cores varying between 16 and 81, against 2-D meshes of equal numbers of cores with X-Y [9] and OE [9] routing; for IrNoC, table-based routing was used. Figure 2 shows that the optimized IrNoCs sustain higher throughput and lower transmission latency in all cases. IrNoC with a permitted node degree of 6 achieves 19.4% and 32% more throughput on average, with decreases in average flit latency of 15.2 and 60.3 clock cycles, in comparison to the corresponding 2-D mesh with X-Y and OE routing respectively. Similar tests on IrNoC with a permitted node degree of 4 showed gains of (7.5%, 18.9%) in throughput and (12.4 clock cycles, 57.45 clock cycles) in latency respectively. Figure 3 shows the throughput and latency comparison of IrNoC (with permitted node degree 6) and the 2-D mesh with X-Y and OE routing under varying packet injection intervals in clock cycles.

V. CONCLUSION

A genetic algorithm based methodology was implemented to tailor the NoC topology to the requirements of the application captured in the core graph. In future work, to further analyze the effectiveness of the proposed methodology, we intend to compare it with other application-specific design methodologies proposed in the literature on realistic benchmarks, in addition to incorporating fine-grained energy estimates to provide a multi-objective optimization framework.

Figure 3. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing with varying packet injection interval

REFERENCES

[1] W. J. Dally, B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” in IEEE Proceedings of the 38th Design Automation Conference (DAC), pp. 684–689, 2001.

[2] L. Benini, G. DeMicheli., “Networks on Chips: A New SoC Paradigm,” IEEE Computer Vol. 35, No. 1 pp. 70–78, January 2002.

[3] M. D. Schroeder et al., “Autonet: A High-Speed Self-Configuring Local Area Network Using Point-to-Point Links,” Journal on Selected Areas in Communications, vol. 9, Oct. 1991.

[4] A. Jouraku, A. Funahashi, H. Amano, M. Koibuchi, “L-turn routing: An Adaptive Routing in Irregular Networks,” in International Conference on Parallel Processing, pp. 374-383, Sep. 2001.

[5] Y.M. Sun, C.H. Yang, Y.C Chung, T.Y. Hang, “An Efficient Deadlock-Free Tree-Based Routing Algorithm for Irregular Wormhole-Routed Networks Based on Turn Model,” in International Conference on Parallel Processing, vol. 1, pp. 343-352, Aug. 2004.

[6] U. Ogras, J. Hu, R. Marculescu, “Key research problems in NoC design: a holistic perspective,” IEEE CODES+ISSS, pp. 69-74, 2005.

[7] W.H.Ho, T.M.Pinkston, “A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Patterns,” HPCA 2003, pp. 377-388, Feb 2003.

[8] J.Hu, R.Marculescu,“Energy-Aware Mapping for Tile-based NOC Architectures Under Performance Constraints,” ASP-DAC 2003, Jan 2003.

[9] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks : An Engineering Approach, Elsevier, 2003.

[10] J. Hu, R. Marculescu, “Energy- and performance-aware mapping for regular NoC architectures”, IEEE Trans. on CAD of Integrated Circuits and Systems, 24(4), April 2005.

[11] R. P. Dick, D. L. Rhodes, W. Wolf, “TGFF: task graphs for free,” in Proc Intl. Workshop on Hardware/Software Codesign, March 1998.

[12] Y. C. Chang, Y. W. Chang, G. M. Wu and S. W. Wu, “B*-Trees : A New Representation for Non-Slicing Floorplans,” in Proc. 37th Design Automation Conference, pp. 458-463, 2000.

[13] Lavina Jain, B. M. Al-Hashimi, M. S. Gaur, V. Laxmi, A. Narayanan, “NIRGAM: A Simulator for NoC Interconnect Routing and Application Modelling,” DATE 2007, 2007.


Page 309: ADCOM 2009 Conference Proceedings

QoS Aware Minimally Adaptive XY Routing for NoC

Navaneeth Rameshan∗, Mushtaq Ahmed†, M.S. Gaur‡, Vijay Laxmi§ and Anurag Biyani¶
∗†‡§ Computer Engineering, Malaviya National Institute of Technology Jaipur, India
¶ Jaypee Institute of Information Technology, Noida, India
∗ [email protected], † [email protected], ‡ [email protected], § [email protected], ¶ [email protected]

Abstract—Network-on-Chip (NoC) has emerged as a solution to communication handling in Systems-on-Chip design. A major design consideration is high performance from a router of small size. To achieve this objective, the routing algorithm needs to be simple as well as congestion-aware. QoS is also emerging as one of the design objectives in NoC design. Recent work has shown that deterministic routing does not fare well when traffic in the network increases [6]. An ideal routing algorithm should take congestion awareness into account. In this paper, we propose a new Quality-of-Service (QoS) aware routing algorithm which is simple to implement and adapts partially (considering only minimal paths) to traffic congestion, while meeting different QoS requirements such as Best Effort (BE) and Guaranteed Throughput (GT) [5]. Comparison of our algorithm with other routing algorithms, namely XY, Odd-Even (OE) and DyAd, suggests improved performance in terms of average delay and jitter.

I. INTRODUCTION

NoC needs to support Quality of Service (QoS), which becomes a main concern when a variety of applications share the network, as it becomes necessary to offer guarantees on performance. QoS refers to the capability of the network to provide communication services above certain minimum value(s) of one or more performance metrics such as dedicated bandwidth, controlled jitter and latency [2]. To manage the allocation of resources to communication streams more judiciously and efficiently, network traffic is often divided into a number of classes when QoS is at a premium. Different classes of packets have different QoS requirements and different levels of importance. In general, network traffic can be classified into two traffic classes: (1) Guaranteed Throughput (GT) and (2) Best Effort (BE) [5].

In this paper we propose a new method, Adaptive XY Routing, for routing in NoC and compare its performance with various standard routing algorithms, viz., XY, Odd-Even [3] and DyAd, in the context of QoS in NoC. The open-source simulator NIRGAM [4] is used for comparison with the other routing algorithms.

II. XY AND ODD-EVEN ROUTING

In NoC, one of the simplest routing algorithms is XY routing. In XY routing, the path is determined solely by the addresses of the source and destination nodes in a deterministic manner, i.e., the same path is chosen for a given pair of source and destination nodes irrespective of the traffic situation in the network. A packet is forwarded in the X direction until the current column equals the destination column, and is then routed in the Y direction until the destination node is reached. XY routing is non-adaptive, which leads to bad load balancing and a lack of adaptability to congestion; a sketch of its per-hop decision is given at the end of this section.

Odd-Even routing [3] is a deadlock-free, partially adaptive routing in which turning is restricted to prevent deadlock occurrence. The routing path is governed by the following rule (columns numbered from 0): for any packet, the following turns are not allowed:
1. East-to-North or East-to-South at an even column.
2. North-to-West or South-to-West at an odd column.
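For reference, the per-hop XY decision amounts to the following sketch (using the coordinate convention of the example in Section III, where increasing y is South):

def xy_next_hop(cx, cy, dx, dy):
    # Deterministic XY routing: correct the X coordinate first, then Y,
    # irrespective of traffic (cx, cy: current node; dx, dy: destination).
    if cx != dx:
        return "EAST" if dx > cx else "WEST"
    if cy != dy:
        return "SOUTH" if dy > cy else "NORTH"
    return "IP_CORE"  # deliver to the local core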

III. MAXY: MINIMALLY ADAPTIVE XY ROUTING

We propose a variation of the XY routing algorithm which introduces the capability to adapt to traffic, while still retaining the biggest advantage of XY routing: simplicity. We select the routing direction only from among the (at most two) path-length-reducing directions at any stage, i.e., the algorithm is minimal, and hence free from livelock issues [1]. But despite being minimal, our algorithm is adaptive, and decisions among directions are made at crucial positions en route.

We illustrate the functioning of the proposed algorithm by example. Let us say a packet is to be routed from source node (Sx, Sy) to destination node (Dx, Dy). Unlike XY routing, in which the packet is always routed first in the X direction, here we route the packet with the aim of equalizing the absolute differences between the X and Y coordinates of the current and final nodes. So the first step is to route the packet in the direction which helps equalize these absolute differences. Once the absolute differences are equal, the buffer availability of the two feasible directions is taken into account and the packet is routed in the one with maximum buffer availability. If both have the same buffer availability, a random selection is made, since this leads to equal load-distribution probability among directions with the same number of free output buffer channel(s). The same process (equalizing the coordinates, and using buffer availability when they are equal) is repeated until the packet reaches the destination.

For example, assume that in an 8x8 mesh we have to route a packet from (0,0) to (3,6). In this case, we first compute

For example, assume that in a 8x8 mesh we have to routea packet from (0,0) to (3,6). In this case, first we compute


Page 310: ADCOM 2009 Conference Proceedings

the absolute differences between the current node (initially the source node) and the destination node. Here ∆x = |3-0| = 3 and ∆y = |6-0| = 6, so we route in the direction which reduces |∆y|. Thus we choose South initially, and the current node becomes (0,1). Since |∆y| > |∆x| still holds, the packet is again routed South, making the current node (0,2); once more |∆y| > |∆x|, so South is chosen again and the current position becomes (0,3). At this stage |∆y| = |∆x|, and the algorithm becomes non-deterministic, i.e., the routing path now becomes a function of the buffer availabilities at the nodes. We look at the buffer availability of both favorable directions (South and East in this case). Say the East direction's output buffer has more free channels than the South direction's output buffer at (0,3); the packet is then routed East, making the current node (1,3). Now |∆y| > |∆x|, so the packet is routed South and the next node is (1,4). At (1,4), |∆y| = |∆x| again, so the same method is used to choose between South and East. This process is repeated until we reach the destination node (3,6). Thus one possible path for the packet is: (0,0) → (0,1) → (0,2) → (0,3) → (1,3) → (1,4) → (1,5) → (2,5) → (3,5) → (3,6). The complete algorithm is given in Algorithm 1.

The proposed routing algorithm is free from livelock, as it is minimal in nature, and requires only two virtual channels for deadlock-free routing. The virtual channel (VC) assignment depends on the relative position of the source S and destination D. If D is towards the East (West) of S, packets use the first (second) virtual channel along the Y direction; along the X direction, any virtual channel can be used. This approach breaks the cycle formed in the channel dependency graph, thereby preventing deadlock, as illustrated in Figure 1.
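The assignment rule is a one-liner; handling of the tie case (destination in the same column as the source) is our choice:

def y_channel_vc(sx, dx):
    # Packets whose destination lies to the East of the source use VC 0 on
    # Y links, those heading West use VC 1 (either VC may be used on X links).
    return 0 if dx >= sx else 1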

Fig. 1. Virtual channel assignment to prevent cycle formation and deadlock

IV. RESULTS

To validate the proposed algorithm, a number of simulations were carried out for QoS with the help of the NoC simulator NIRGAM [4].

Algorithm 1 Minimally Adaptive XY Algorithm
Require: Sx, Sy — coordinates of source node; Cx, Cy — coordinates of current node; Dx, Dy — coordinates of destination node
Ensure: Route from (Sx, Sy) to (Dx, Dy)
 1: Cx = Sx  {initialization}
 2: Cy = Sy
 3: while true do
 4:   absX = |Dx - Cx|
 5:   absY = |Dy - Cy|
 6:   if (Cx == Dx) and (Cy == Dy) then
 7:     return IP CORE
 8:   end if
 9:   if Cx > Dx then
10:     dirX = WEST
11:   else
12:     dirX = EAST
13:   end if
14:   if Cy > Dy then
15:     dirY = NORTH
16:   else
17:     dirY = SOUTH
18:   end if
19:   if absX == absY then
20:     if buffer[dirX] > buffer[dirY] then
21:       Route in dirX
22:     else if buffer[dirX] < buffer[dirY] then
23:       Route in dirY
24:     else
25:       Route in random(dirX, dirY)
26:     end if
27:   else if absX > absY then
28:     Route in dirX
29:   else
30:     Route in dirY
31:   end if
32:   Update (Cx, Cy)
33: end while

A. Experimental Setup

In this experimental setup a 5x5, two-dimensional mesh topology, as shown in Figure 2, is selected, in which links 2-7, 7-12, 12-17, 2-1, 1-6, 6-11, 11-16, 16-21 are shared by both traffic types for XY routing, whereas links 2-7, 7-12, 12-17, 17-22, 22-21, 6-11, 11-16, 16-17 are shared by both traffic types for Odd-Even routing. Wormhole switching is employed for both traffic classes. Simulations were run for 5000 clock cycles. Traffic load (as a fraction of capacity) is varied from 10% to 100% in steps of 10%. The network is evaluated on the basis of average latency and jitter. The graphs for varying bandwidth and load are given below.


Page 311: ADCOM 2009 Conference Proceedings

Fig. 2. A 5x5, two-dimensional mesh showing source-destination pairs for GT (1-17, 2-21) and BE (3-21, 6-17) traffic

B. Results and Analysis

For analyzing the average latency and average jitter with varying loads in the context of QoS, the traffic scenario shown in Figure 2 was used. It can be noticed from Figure 3 that the average latency of BE traffic with the MAXY routing algorithm is comparable to, but slightly lower than, OE when the bandwidth reserved for GT is 0.75. As expected, the average latency for XY routing is the highest, as it is deterministic in nature and only one virtual channel is available for a given ratio of GT to BE. The congestion awareness obtained through partial adaptivity in OE and DyAd may come at the expense of a new path with no guarantee on the path length taken. The proposed MAXY algorithm, however, adapts to the congestion caused by the availability of only one VC and chooses an alternative minimal path, thus reducing the average latency of BE traffic. Figure 4 shows that the average jitter for GT traffic with MAXY is either comparable to or lower than OE for different load values when the bandwidth assigned to GT is 0.50, i.e., the available bandwidth is reduced by 25%, which leads to congestion in the VC; MAXY adapts to this congestion better than deterministic routing such as XY.

V. CONCLUSION

From our experimental results, it can be concluded that the proposed QoS-aware MAXY routing algorithm shows improvement in terms of latency and jitter over other routing algorithms such as XY, OE and DyAd in the presence of congestion in the network. With congestion in the network, MAXY performs better as it remains partially congestion-aware; because of its minimal nature it consumes less energy and prohibits any chance of livelock, while deadlock can be prevented using at least two virtual channels. The proposed methodology can thus be used to

Fig. 3. Average Latency of BE traffic for XY, OE, DyAd and MAXY Routing with bandwidth for GT = 0.75

Fig. 4. Average Jitter of GT traffic for XY, OE and MAXY Routing with bandwidth for GT = 0.50

reduce the latency of BE traffic when most of the resources are available for GT, not only ensuring QoS but also aiding in the improvement of the other differentiated traffic classes.

REFERENCES

[1] Justin A. Boyan and Michael L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems, pages 671-678, 1994.
[2] Evgeny Bolotin, Israel Cidon, Ran Ginosar and Avinoam Kolodny. QNoC: QoS architecture and design process for network on chip. Journal of System Architecture, Special issue on Networks on Chip, 2003.
[3] G.-M. Chiu. The odd-even turn model for adaptive routing. IEEE Transactions on Parallel and Distributed Systems, pages 729-738, 2000.
[4] Lavina Jain, B.M. Al-Hashimi, M.S. Gaur and V. Laxmi. NIRGAM: A simulator for NoC interconnect routing and application modeling. In Design, Automation and Test in Europe, April 2007.
[5] M. Millberg, E. Nilsson, R. Thid and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In Proceedings of the Design Automation and Test in Europe, 2004.
[6] Terry Tao Ye, Luca Benini and Giovanni De Micheli. Packetization and routing analysis of on-chip multiprocessor networks. Journal of Systems Architecture, 50(2-3), 2004.


