Journal of Parallel and Distributed Computing 62, 1454–1475 (2002), doi:10.1006/jpdc.2002.1856
An Adaptive Partitioning Algorithm for Distributed Discrete Event Simulation Systems1
Azzedine Boukerche
Parallel Simulation and Distributed Systems (PARADISE) Research Laboratory, Department of Computer
Science, University of North Texas, Denton, Texas 76203-1366
E-mail: boukerche@cs.unt.edu
Received March 29, 2000; accepted January 29, 2002
Biocomputing techniques have been proposed to solve combinatorial
problems elegantly by such methods as simulated annealing, genetic
algorithms and neural networks. In this context, we identify important
optimization problems arising in conservative distributed simulation, such as
partitioning, synchronization and communication overhead minimization. We
propose the use of a simulated annealing algorithm with an adaptive search
schedule to find good (sub-optimal) partitions. The paper discusses the
algorithm, its implementation, and reports on the performance results of
simulating several workload models on a multiprocessor machine.
The results obtained indicate clearly that a partitioning which makes use of
our simulated annealing algorithm significantly reduces the running time of a
conservative simulation and decreases the synchronization overhead of the
simulation model when compared to the Nandy–Loucks partitioning algorithm.
© 2002 Elsevier Science (USA)
1. INTRODUCTION
In recent years, there has been a growing interest in developing efficient solutions
(sequential and parallel) to hard combinatorial optimization problems, which are
based on biological, evolutionary, and/or natural processes. These approaches fall
under the realm of a general paradigm, called biocomputing, which includes such
bio-based methods as genetic algorithms, cellular automata, DNA and neural
networks; as well as methods based upon natural phenomena such as simulated
annealing [20, 41] and mean-field annealing [40]. It turns out that almost all of these
approaches are inherently parallel and distributed in nature.
Over the last two decades, simulations of complex systems have been identified
as an important area to exploit the inherent parallelism present in applica-
tion problems. To this end, a significant body of literature on parallel/distributed
1 This work is supported by UNT Faculty Research Grant and the Texas Advanced Research Program
grant TARP-003594-0092-2001.
simulation has been proposed (see [4, 11, 13]). There are two basic approaches to
parallel simulation: conservative and optimistic. While conservative synchronization
techniques rely on blocking to avoid violation of dependence constraints, optimistic
methods rely on detecting synchronization errors at run-time and then on recovery
using a rollback mechanism. In both approaches, the simulated system is modeled as
a network of logical processes (LP) which communicate only via message passing.
While solving problems in parallel, one encounters a number of optimization
problems like scheduling, partitioning and dynamic load balancing which must be
tackled efficiently if one expects to achieve significant performance gains. Due to the
NP-hard nature of these problems in general, it is highly unlikely that one can obtain
exact solutions whose running times are polynomially bounded in the size of the problem.
Therefore, research has been directed to find fast, approximate (or near-optimal)
solutions.
In this paper, we consider the problem of partitioning a conservative parallel
simulation for execution on a multi-computer. The synchronization protocol makes
use of Chandy–Misra null messages [4, 10, 13]. We propose the use of a simulated
annealing algorithm with an adaptive search schedule for generating good (sub-
optimal) partitions for conservative simulation. This paper discusses the algorithm,
its implementation and reports on the performance results of several simulation
models executed on a multiprocessor machine.
The remainder of this paper is organized as follows. Section 2 introduces the major
issues of parallel discrete event simulation. In Section 3, we review previous and
related work. Section 4 describes our distributed simulation model. Section 5 is
devoted to a description of the partitioning algorithm using the simulated annealing
paradigm, followed by a discussion of the performance results which we obtained.
The conclusion follows.
2. FUNDAMENTALS OF PARALLEL DISCRETE EVENT SIMULATION
In this section, we introduce the basic terminology and major issues pertaining to
parallel discrete event simulation (PDES) which should provide exactly the same
solution to a problem as a sequential simulation. Thus, in specifying and developing
a parallel simulator, it is important to understand the sequential nature of the
simulation.
In a discrete-event simulation, the model evolution is defined by instantaneous
events. Each event corresponds to a transition in a portion of the model state,
composed of state variables, each describing a characteristic of the model. Each event
also has a simulation time associated with it, called timestamp, which defines its
occurrence time. Each event may in turn generate new future events.
The generation of new events and the dependency of their transitions on state
variables that previous events may have updated, define a relation of causal order
(namely, a partial order) among events. Related events are said to be causally
dependent, whereas unrelated ones are called concurrent. In order to guarantee the
correctness, concurrent events may be safely processed in any order in a simulation,
whereas causally dependent events must be processed according to the causal order.
Thus, to ensure the strict chronological order, events are processed one at a time,
resulting in an (apparently) sequential program. A typical template for a sequential
simulation is given in Fig. 1.
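The loop of Fig. 1 can be sketched as follows. This is a minimal Python illustration; the `handle` callback and the dictionary-based state are our assumptions, not the paper's notation:

```python
import heapq

def simulate(initial_events, handle, end_time):
    """Sequential discrete-event loop: repeatedly pop the event with the
    smallest timestamp, update the state, and schedule any new events."""
    future = list(initial_events)          # (timestamp, event) pairs
    heapq.heapify(future)                  # the "event list", as a heap
    state = {"clock": 0.0}
    while future:
        ts, ev = heapq.heappop(future)     # earliest event first
        if ts > end_time:
            break
        state["clock"] = ts                # advance simulation clock
        for new_ts, new_ev in handle(state, ts, ev):
            heapq.heappush(future, (new_ts, new_ev))
    return state
```

Because events are always popped in timestamp order, the loop processes causally dependent events correctly, at the price of being inherently sequential.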
Only by eliminating the event list in its traditional form so as to capture the
interdependence of the process being simulated can additional parallelism be
obtained [10]. This is the objective of parallel simulation. Indeed, parallel simulation
shows a great potential in terms of exploiting the inherent parallelism of the system,
and the underlying concurrency among events to achieve execution speedup. Good
surveys of the literature may be found in [4, 11, 13].
Conceptually, a parallel simulator is composed of a set of LP which interact by
means of messages, each carrying an event and its timestamp, thus called event
messages. Each LP is responsible for managing a subset of the model state, called
local state. Each event E received by an LP represents a transition in its local state.
The events scheduled by the simulation of E are sent as event messages to
neighboring LPs to be simulated accordingly. In a simulation, events must always be
executed in increasing order of timestamps. Anomalous behavior might then result if
an event is incorrectly simulated earlier in real time and affects state variables used
by subsequent events. In the physical model this would represent a situation in which
future events could influence the present. This is referred to as a causality error. Several
synchronization protocols have been proposed to deal with this problem. These
techniques can be classified into two groups: conservative algorithms and optimistic
algorithms. While conservative synchronization techniques rely on blocking to avoid
violation of dependence constraints, optimistic methods rely on detecting synchro-
nization errors at run-time and on recovery using a rollback mechanism. Let us
briefly outline the basic principles of these two approaches.
2.1. Conservative Simulation
Conservative approaches enforce event causality by requiring that each LP
elaborates an event only if it is certain that it will not receive an earlier event.
Consequently, events are always executed in chronological order at any LP. Each
logical process, LP_i, maintains an input queue (l_ij) for each of its neighbors, LP_j, in the
network of logical processes. In the case that one or more neighboring (input) queues
are empty, LPi is blocked because an event with a smaller timestamp than the
timestamp of the waiting events might yet arrive at an empty queue. This mechanism
implies that only unblocked LPs can execute in parallel. If all the LPs were blocked,
FIG. 1. Basic sequential discrete event simulation algorithm.
the simulation would be deadlocked. Ensuring synchronization and avoiding
deadlocks are the central problems in a conservative approach. Several schemes
have been proposed to alleviate this problem. In [10], the authors employ null
messages in order to avoid deadlocks and to increase the performance of the
simulation. When an event is sent on an output link, a null message bearing the same
timestamp as the event message is sent on all other output links. As is well known, it
is possible to generate an inordinate number of null messages under this scheme,
nullifying any performance gain [13]. As a result, a number of attempts to optimize
this basic scheme have appeared in the literature. For example, in [35], the authors
refrain from sending null messages until such time as the LP becomes blocked. They
refer to this approach as eager events, lazy null messages. They reported some
success in using variations of Chandy–Misra approaches to speed up logic
simulation.
In [6, 7, 31], the authors employed the following approach. In the event that a null
message is queued at an LP and a subsequent message (either null or event) arrives
on the same channel, they overwrite the (old) null message with the new message. A
single buffer is associated with each input channel at an LP to store null messages,
thereby saving space as well as the time required to perform the queueing and de-
queueing operations associated with null messages.
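A minimal sketch of this single-buffer scheme (class and method names are ours, not from the cited work):

```python
class InputChannel:
    """One input channel of an LP. Real event messages are queued, while a
    null message only ever occupies a single buffer slot: any later message
    arriving on the same channel overwrites the stale null message."""
    def __init__(self):
        self.events = []         # FIFO of real event-message timestamps
        self.null_buffer = None  # at most one pending null-message timestamp

    def receive(self, timestamp, is_null):
        if is_null:
            self.null_buffer = timestamp   # newer null overwrites the old one
        else:
            self.events.append(timestamp)
            self.null_buffer = None        # event supersedes the stale null

    def head_timestamp(self):
        """Lower bound used by the LP to decide whether processing is safe;
        None means the channel is empty and the LP must block on it."""
        if self.events:
            return self.events[0]
        return self.null_buffer
```

Only one buffer per input channel is ever devoted to null messages, which saves both space and the queueing/de-queueing work the text describes.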
2.2. Optimistic Approach
Time Warp is based on an optimistic approach and enforces the causal order
among events as follows: events are greedily simulated in timestamp order until no
event messages remain or until a message arrives in the ‘‘past’’ (a straggler). Upon
receiving a straggler, the process execution is interrupted, and a rollback action takes
place using anti-messages. Each message is given a sign; positive messages indicate
ordinary events, whereas negative messages indicate the need to retract any
corresponding event that was previously processed. Similar messages that have
different signs are called anti-messages. If a negative message is received, it and the
corresponding positive message are both annihilated. A rollback consists of three
phases: (i) restoration: the latest state (with respect to simulation time) valid before
the straggler’s timestamp replaces the current state, and successive states are
discarded from the state queue; (ii) cancellation: the negative copies of messages
which were produced at simulation times successive to the straggler’s timestamp are
sent to the proper processes, to possibly activate rollbacks there; and (iii) coasting-
forward: the effective state which is valid at the straggler’s timestamp is computed by
starting from the restored state and by elaborating those messages with a timestamp
up to the straggler's; during this phase no message is produced. Rollbacks are made
possible by means of state checkpointing. The whole state of the process is
checkpointed into the state queue according to some discipline, see [38].
To minimize the storage overhead required to perform rollbacks, and to detect the
termination of LPs, the optimistic synchronization mechanism uses a local virtual time
(LVT), and a global virtual time (GVT). The LVT represents the timestamp of the
latest processed event at an LP; whereas GVT is defined as the minimum of all the
local virtual times of all LPs, and of all the timestamps of messages in transit within
the simulation model. The GVT indicates the minimum simulation time at which a
causal violation may occur. The use of GVT computation is to commit the safe
portion of the simulation.
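As a small illustration, computing GVT from the LVTs and the in-transit message timestamps reduces to a minimum; the list-based helper below is our own simplification:

```python
def gvt(lvts, in_transit):
    """Global virtual time: the minimum over all local virtual times and
    over the timestamps of all messages still in transit. No rollback can
    ever reach below GVT, so state saved before GVT may be committed
    (and its storage reclaimed)."""
    return min(list(lvts) + list(in_transit))
```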
3. PREVIOUS AND RELATED WORK
Scheduling, load balancing and partitioning, in parallel and distributed systems in
general, and parallel simulation in particular, have long been identified as important
optimization problems. The existing literature on partitioning and mapping problems
is based on approaches like graph theoretic, queueing theoretic, mathematical
programming, numeric and non-numeric heuristics, and/or combined methods; see
[2]. In all these formulations, finding an optimal solution is NP-hard [15]
in all but very restricted cases. Thus research has focused on the development of
heuristic algorithms to find suboptimal solutions. Biocomputing/evolutionary/
natural techniques have been proposed to solve these problems elegantly by such
methods as simulated annealing, genetic algorithms, neural networks, or stochastic
processes [5].
Nabhan and Zomaya [29] proposed an optimal task scheduler. The scheduling
scheme employs a simulated annealing algorithm that minimizes a cost function
representing the expected performance of each schedule. Their proposed model is a
simple and flexible way of generating efficient formulations for computational
models. The efficiency of their scheduler is demonstrated by two case studies with
promising results.
Kling and Banerjee [25] proposed simulated evolution as an alternative to
annealing, and applied it in the context of cell placement in VLSI design. Their
technique is the mathematical analog of natural selection in biological
environments, and it performs three basic steps: evaluation, selection, and
allocation. The first step computes the ‘‘goodness’’ of the particular cell position.
In the second step, the cells are probabilistically selected for replacement according
to their goodness. Finally, the third step removes cells from their current allocation
and searches for an improved location. These three steps are repeated until no
further improvement is found. The experimental results showed that simulated
evolution is slow. As a consequence, the authors make use of hierarchical and
window techniques that reduce the running time significantly.
Tao et al. [37] studied the problem of allocating the interacting task modules of a
parallel program to heterogeneous processors in a parallel architecture. They present
three heuristics for task assignment, based on simulated annealing, tabu search, and
stochastic probe approaches, respectively. The stochastic probe approach is a
combination of the aggressive search process in the tabu search and the stochastic
search process in the simulated annealing approach.
Kernighan and Lin [22] propose a two-way partitioning algorithm with
constraints on the final subset sizes. They applied pairwise swapping and iterated
on all pairs of nodes to find the best improvement on the existing partition. Fiduccia
and Mattheyses [12] further improved this algorithm by developing a clever
implementation for each iteration to achieve a linear complexity for each iteration.
Sanchis [33] then adapted this model to multiple-way partitioning. Her
algorithm attempts to minimize the communication between processors and
keep the number of processes per processor within a specified range. The
algorithm uses a concept of levels; each level successively produces a better
cut set. A VLSI component model and a network of SUN-workstations were
employed for experimentation.
The Kernighan–Lin-based algorithms unfortunately share the common weakness
that they are often trapped by local minima when the size of the problem is very
large. One way to overcome this difficulty is to form clusters, and then condense
these clusters into single nodes prior to the execution of the Kernighan–Lin-based
algorithms. The complexity of the problem is thus dramatically reduced, which in
turn improves the performance of the partitioning algorithm [8].
Each of the preceding approaches to the partitioning problem provides good
solutions for restricted applications, but they are not suitable for parallel and
distributed simulations since the synchronization constraints exacerbate the
dependencies between the LPs in the system graph. In order to achieve the best
performance, the partitioning and the load balancing should decrease the running
time of the simulation compared to a random partitioning.
In the past, attempts have been made to tackle the partitioning and load balancing
problems in PDES. Nandy and Loucks [31] presented a static partitioning algorithm
for conservative parallel logic simulation. The algorithm attempts to minimize the
communication overhead and to uniformly distribute the execution load among the
processors. It starts with an initial random partition and then iteratively moves
processes between clusters until no improvements can be found (a local optimum).
All possible moves for each process are considered. The process which contributes to
the maximum gain is chosen. A process is moved only if it does not violate the block
size constraints. As a benchmark, they use the simulation of circuits modeled at the
gate level. A message passing multicomputer composed of eight T-800 INMOS
transputers was employed as the simulation platform. They report a 10–25%
reduction in simulation time from the simulation time of a random partition [31].
Kim and Jean [23] presented an efficient, linear-time partitioning algorithm for
parallel logic simulation, based on a linear ordering of vertices in a directed graph.
The performance results of their algorithm show a good compromise between a
high degree of concurrency, a balanced workload and a reasonable amount of
interprocessor communication. The authors showed that the optimistic parallel
simulation algorithm combined with the CPP algorithm enables powerful parallel
gate-level circuit simulators.
Jörg Keller et al. [21] presented several new serialization-free parallel data
structures which seem to have a large impact on the performance of logic simulation.
The efficiency of these data structures is based upon the use of parallel prefix
operations. The authors consider first the PTHOR algorithm for simulation of
logical circuits, which uses a conservative approach, then they show how attainable
speedup can be increased by several changes in the data structures, including the
memory management. The paper also examines the partitioning problem of the
simulated circuit among the parallel processors. Several dynamic partitioning
algorithms are described and compared with each other.
In [14], the authors presented both static and dynamic load balancing strategies.
The static partitioning scheme attempts to minimize load imbalances while
maximizing lookahead values. The static partitioning packages, Metis and Scotch,
were used to achieve these objectives. The scalability study based on a
performance analyzer has shown that it is indeed important to apply a partitioning
algorithm instead of a random strategy.
In [3], a simple partitioning scheme based upon the pairwise exchange was
developed to reduce the overhead of the rollbacks of a hybrid parallel simulation of
wireless networks. The synchronization protocol makes use of both conservative and
timewarp paradigms. The results obtained were quite encouraging using a cluster of
network workstations. Sarkar and Das [34] proposed two algorithms for dynamic
load balancing which reduce the number of rollbacks in an optimistic PDES system.
The first algorithm is based on the load transfer mechanism between LPs; while the
second algorithm, based on the principle of evolutionary strategy, migrates logical
processes between several pairs of physical processors. They have implemented both
of these algorithms on a cluster of heterogeneous workstations and studied their
performance. The experimental results indicated that the algorithm based on the
load transfer is effective when the grain size is greater than 10 ms, while the
algorithm based on the process migration yielded good performance only for grain
sizes of 20 ms or larger.
4. DISTRIBUTED SIMULATION MODEL
Our concern in this paper is to study the importance of partitioning on the
performance of a conservative distributed simulation. This conservative mechanism
is based on a distributed model of computation in which processes communicate via
messages. In this model a network of LPs is used to simulate the objects of a system,
e.g., a computer network. Explicit links connect the LPs, and messages are
forwarded between LPs over these links. In the conservative paradigm, an event
cannot be simulated by an LP before it is certain that an event with a smaller
timestamp cannot arrive. As a consequence of this blocking behavior, deadlocks
arise [9, 13]. Several solutions to this problem exist, each requiring a certain amount
of overhead. In this paper, we make use of the Chandy–Misra null-message
algorithm [10, 13].
In [10], the authors employ null messages in order to avoid deadlocks and to
increase the performance of the simulation. When an event is sent on an output link
a null message bearing the same timestamp as the event message is sent on all other
output links. As is well known, it is possible to generate an inordinate number of null
messages under this scheme, nullifying any performance gain [13].
In order to increase the efficiency of this basic scheme, we employ the following
approach. In the event that a null message is queued at an LP and a subsequent
message (either null or event) arrives on the same channel we overwrite the (old) null
message with the new message. We associate one buffer with each input channel at
an LP to store null messages, thereby saving space as well as the time required to
perform the queueing and de-queueing operations associated with null messages.
In the next section, we discuss the application of simulated annealing with an
adaptive schedule to the partitioning problem.
5. PARTITIONING ALGORITHM BASED UPON SIMULATED ANNEALING
Simulated annealing (SA), introduced by Kirkpatrick et al. [24], is a powerful
method for optimizing functions defined over complex systems. It is based on ideas
from statistical mechanics and motivated by an analogy to the behavior of physical
systems in the presence of a heat bath.
While greedy algorithms, and other simple iterative improvement techniques
accept a new configuration of lower cost and reject more costly states, SA escapes
from local minima by sometimes accepting higher cost arrangements with a
probability determined by the simulated ‘temperature’. Simulated annealing leads to
efficient heuristic algorithms having several advantages over other approaches to
solving combinatorial optimization problems [29, 39, 41]. First, it is problem-
independent. By substituting a few problem specific data structures and functions,
the SA algorithm can be applied to many combinatorial optimization problems.
Second, SA can easily handle multiple, potentially conflicting goals of a problem. It
has been successfully applied in solving combinatorial problems such as cell
placements [27], and task scheduling in real time systems [29, 30].
In this section, we show how to apply simulated annealing to partitioning
conservative simulation on multiprocessor machines. This method is derived by
mapping the complex simulation system into a physical system.
In this approach, each LP is considered as a particle moving in a space with n
distinct positions, and each state is represented as the assignment of each process
(LP) to a partition, which in turn is assigned to a processor. These LPs (or particles)
interact with each other with a force which is represented by an energy or objective
function.
In this work, all the parameters used in the partitioning algorithm are estimated.
As in [31], our approach is to develop a partitioning algorithm based upon
realistic estimates of computation and communication load. Future work will be
directed at determining these estimates, given that we are successful in reducing the
running time of the simulation with our algorithm.
The structure of the algorithm is contained in Fig. 2. In the application of
partitioning, the state is represented as the assignment of each process (LP) to a
partition, which in turn is assigned to a processor.
Starting from an initial partition, a move generates another partition. An objective
function (F ) evaluates the quality of this partition. This new partition is accepted if it
is better, or probabilistically accepted if it is worse. The probability of accepting a
new partition differing by Δ(F) is denoted as P(Δ(F)). Two probability functions
from statistical mechanics are common in simulated annealing: the ‘‘heat function’’
and the Boltzmann distribution. Kirkpatrick et al. [24] used the Boltzmann
probability factor,

P(ΔF) = min(1, e^(−Δ(F)/kT)).
FIG. 2. A simulated annealing algorithm.
This function accepts all changes of Δ(F) ≤ 0 with P(Δ(F)) = 1. It is assumed that the
Boltzmann constant k equals 1 in order to avoid inconveniently large temperature
scales. In our experiments, we used the Boltzmann distribution (see Fig. 3).
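A sketch of this acceptance rule (the function names are ours; the Boltzmann constant k defaults to 1, as assumed above):

```python
import math
import random

def boltzmann_prob(delta_f, temperature, k=1.0):
    """P(ΔF) = min(1, e^(−ΔF/kT)): improving moves (ΔF <= 0) get
    probability 1, while worsening moves are accepted with a probability
    that shrinks as ΔF grows or as the temperature T falls."""
    return min(1.0, math.exp(-delta_f / (k * temperature)))

def accept_move(delta_f, temperature, rng=random.random):
    # probabilistic accept/reject decision using the Boltzmann factor
    return rng() < boltzmann_prob(delta_f, temperature)
```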
When an equilibrium state, represented as the assignment of each process (LP) to a
partition, has been reached, the temperature is decreased. This iteration is repeated
until the algorithm meets a termination condition or until the temperature reaches its
lowest value Tfinal. The termination condition and the equilibrium condition together
are called the annealing schedule.
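The overall loop of Fig. 2 with such a schedule might be sketched as follows. Here a geometric cooling rate and a fixed move count per temperature stand in for the paper's adaptive schedule, and all function names (`objective`, `neighbor`) are illustrative placeholders:

```python
import math
import random

def simulated_annealing(initial, objective, neighbor,
                        t_start=10.0, t_final=0.01, alpha=0.9,
                        moves_per_temp=50):
    """Skeleton of the annealing loop (maximizing `objective`): at each
    temperature, propose moves until a crude equilibrium (a fixed move
    count here), then cool geometrically until T reaches t_final."""
    current, f_cur = initial, objective(initial)
    best, f_best = current, f_cur
    t = t_start
    while t > t_final:
        for _ in range(moves_per_temp):          # "equilibrium" at this T
            cand = neighbor(current)
            f_cand = objective(cand)
            delta = f_cur - f_cand               # positive if cand is worse
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current, f_cur = cand, f_cand    # accept better / lucky worse
                if f_cur > f_best:
                    best, f_best = current, f_cur
        t *= alpha                               # cooling step
    return best, f_best
```

A toy usage: maximizing f(x) = −(x − 3)² over the integers with ±1 neighbor moves converges to the global optimum x = 3.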
The following subsections describe the various components of the proposed
annealing algorithm in more detail.
5.1. Objective Function
We describe an objective function to evaluate the quality of the partitioning
solution generated by our algorithm. The function is chosen so that inter-processor
communication conflicts are minimized, processor load remains balanced, and the
probability of sending a null message between processors is minimized.
In the course of experiments [7], it became apparent that the following factors
increase the efficiency of the conservative simulation protocol:
1. Uniform distribution of the execution load among the processors (LOAD).
2. The closeness of the LPs to one another in the process graph (Diameter_avg).
LPs that are close to each other are expected to communicate more than if they are
far away from each other. Hence, assigning them to the same processor should
reduce the running time of the simulation.
3. Minimizing the inter-processor communication cost (IPC).
FIG. 3. Boltzmann acceptance function.
This is desirable because of the high cost of sending a message between processors
when compared to the cost of sending messages between processes within the same
processor.
4. Minimizing the number of links between the processors (IPL).
This will decrease the number of null messages sent between processors.
We define expressions for these parameters as well as the objective function F,
which represents the quality of the partitioning:

F = F(LOAD, Diameter_avg, IPC, IPL).
Let us denote by l_ij the average traffic between each pair of adjacent processes, n_i
the number of processes assigned to processor i, N_links the total number of links,
N_tot the total number of processes (or nodes), and K the number of processors.
We define the load at each processor (Pr_i) as follows:

Load_i = Σ_j Σ_{p_k ∈ Pr_i} l_jk (serv_k + T_send + T_rcv),
where serv_k is the amount of time p_k takes to process an event, T_send is the amount of
time it takes for an LP to send an event, and T_rcv the amount of time it takes for an
LP to receive the event. If the event is generated by an LP in processor i, then T_send
and T_rcv are negligible.
We wish for Load_i to be approximately equal to (1/K) Σ_k Load_k. We define

LOAD = K^K Π_{i=1..K} (Load_i / Load_total),

where Load_total = Σ_k Load_k and K is the number of processors. Here K^K is a
normalization factor that causes the maximum of LOAD to be 1. If the load at each
processor is equal, i.e., Load_i = Load_total / K, the quantity LOAD reaches its
maximum of 1. Therefore, we wish to maximize the quantity LOAD.
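As a quick check of this definition, a direct transcription (the input is simply a list of the K estimated processor loads):

```python
def load_balance_metric(loads):
    """LOAD = K^K * Π_i (Load_i / Load_total): equals 1 when all K
    processor loads are identical and drops toward 0 as they diverge."""
    k = len(loads)
    total = sum(loads)
    metric = float(k) ** k          # K^K normalization factor
    for li in loads:
        metric *= li / total
    return metric
```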
We also consider the quantity

Diameter_avg = (Σ_{i=1..K} Diameter_i) / K

as an average measure of distance within each cluster, where

Diameter_i = max_{j,k ∈ Processor_i} (Distance_jk)
and Distance_jk is the number of hops between LP_j and LP_k.
LPs that are close to each other are expected to communicate more than if they are
far away from each other (this is a characteristic of traffic on computer networks and
telephone systems). Hence, assigning them to the same processor should reduce the
running time of the simulation.
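A sketch of computing Diameter_avg by breadth-first search over the process graph (the function names and the adjacency-dictionary representation are ours):

```python
from collections import deque

def cluster_diameter(nodes, adj):
    """Diameter_i: the maximum hop distance in the process graph between
    any two LPs assigned to the same cluster, via BFS from each member."""
    members = set(nodes)
    best = 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:                              # standard BFS for hop counts
            u = q.popleft()
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max([best] + [d for n, d in dist.items() if n in members])
    return best

def diameter_avg(clusters, adj):
    # Diameter_avg = (sum of per-cluster diameters) / K
    return sum(cluster_diameter(c, adj) for c in clusters) / len(clusters)
```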
We define the relative inter-processor communication factor to be

IPC = (t_comm / t_calc) Σ_{i,j} l_ij Cost_ij,     (1)
where t_comm is the mean time to send one message between two processors, t_calc is
the mean time for the processing of an event message, Cost_ij is the communication
overhead to prepare the message for transmission between processes p_i and p_j, and
l_ij is the average traffic between p_i and p_j. The ratio t_comm/t_calc
reflects the relative costs of inter-node communication and computation. Through-
out our discussion, we assume that the communication cost between two processes
assigned to the same processor is negligible. We chose the relative inter-
processor expression (1) because it is more expensive to send a message between
processors than to do so within a single processor. Note that we assume that we have
estimates of l_ij for each pair of adjacent processes p_i, p_j.
In order to reduce the number of null messages, we minimize the number of links
(IPL) between each pair of processors, P_i and P_j. In [1], the authors merge the two
partitions with the largest number of interconnections into a new partition if an
upper bound on the number of LPs in a processor is not exceeded. While this
approach generates partitions with a small number of interconnections the number
of LPs per processor can be quite different. Sporrer and Bauer [36] suggest the
following ratio:

k_ij = m_ij / Σ_l m_il,

where m_ij represents the number of interconnections between Processor_i and
Processor_j, and m_ii is set to zero. In [36], better partitions were obtained by making
use of k_ij instead of m_ij. Bearing this in mind, we chose to define IPL as follows:
IPL = Σ_{i,j} k_ij.
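Transcribing this definition directly (the matrix `m` holds the link counts m_ij, with m_ii = 0; this follows the formula as stated in the text):

```python
def ipl(m):
    """IPL = Σ_ij k_ij with k_ij = m_ij / Σ_l m_il, where m[i][j] counts
    the links between processors i and j (and m[i][i] is zero)."""
    total = 0.0
    for i, row in enumerate(m):
        row_sum = sum(row)
        if row_sum == 0:
            continue                 # an isolated processor contributes 0
        for j, mij in enumerate(row):
            if i != j:
                total += mij / row_sum
    return total
```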
A good partitioning is one which meets the criteria of maximizing the LOAD and
minimizing the quantities: IPC, Diameteravg and IPL. Hence we express the basic
problem as finding a partition which maximizes the following quantity:2
F = Log( LOAD^α / ( (1 + IPC)^β (1 + IPL)^γ (1 + Diameter_avg^δ) ) ),

2 The factors in the denominator are of the form (1 + x) because of the possibility that x might approach
zero. The Log function is used to avoid inconveniently large factor scales.
where α, β, γ and δ are the relative weights of the corresponding parameters. In our
experiments, α = β = γ = δ = 1 yielded good results (see Section 7).
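Putting the pieces together, the objective can be sketched as follows (the parameter names mirror the weights above; the metric inputs are assumed to have been computed as defined earlier):

```python
import math

def objective(load_metric, ipc, ipl_value, diam_avg,
              a=1.0, b=1.0, g=1.0, d=1.0):
    """F = Log( LOAD^a / ((1+IPC)^b * (1+IPL)^g * (1+Diameter_avg^d)) ).
    The (1 + x) guards keep each denominator factor away from zero and
    the Log tames the scale; a = b = g = d = 1 worked well in the
    paper's experiments."""
    denom = (1 + ipc) ** b * (1 + ipl_value) ** g * (1 + diam_avg ** d)
    return math.log(load_metric ** a / denom)
```

A perfectly balanced partition with no inter-processor traffic (LOAD = 1, IPC = IPL = Diameter_avg = 0) attains the maximum F = 0; any imbalance or communication overhead makes F negative.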
5.2. Generating an Initial Partition
The choice of initial partitioning not only affects the annealing convergence
rate, but also the final partitioning of processes to processors. The more nearly
the initial partitioning approximates the optimal partitioning, the greater is
the probability that the true optimal will be approached. Several strategies
may be used to generate an initial partition. If a random initial partitioning
is used, the problem begins with a completely unordered partitioning. Hence,
the temperature must be very high to guarantee accepted moves and to ensure
that the state space is adequately searched. Consequently, a good initial
partitioning is needed to reduce the run time of the annealing process.
Our experiments showed that by using a grid partition as an initial solution,
the annealing can start at a lower starting temperature and thereby reduce the
run time considerably.
5.3. Move Set
A large number of moves must be generated to adequately traverse the problem
search space; therefore, the amount of time needed to generate and evaluate a move
must be minimal. In the course of our experiments we settled on the following
strategy. The generation of a new configuration proceeds in several steps. Each
process is (sequentially) selected to be part of a trial move.
Let us assume that the selected process (ps) has at least k neighbor processes
{p1, p2, ..., pk},3 where each pi, i = 1, ..., k, is assigned to a different processor
than the one containing ps. Then an exchange of ps with one of its neighboring
processes from the processor to which ps has the most links is evaluated. If the
exchange is rejected, a randomly chosen process residing on a neighboring processor
of the current processor containing the process ps is chosen.
To prevent the move-set process from ‘‘thrashing’’ or going into an infinite loop,
each move is controlled by a tabu move as in tabu search [18]. It is managed by a
mechanism that makes use of historical information about the moves made as the
simulated annealing process progresses; solutions accepted during an arbitrarily defined
number (ntabu)4 of previous moves are deemed unallowable, or tabu. This prevents
cycling of more than ntabu moves. An LP or a pair of LPs is available for
movement if it is not in the tabu list.
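The tabu-controlled move selection above might be sketched as follows; the process names, link counts, and the deque-based tabu list are illustrative assumptions, with ntabu = 7 as Glover [18] suggests:

```python
from collections import deque
import random

def propose_move(ps, neighbors, links_to_processor, tabu):
    """Return an exchange partner for process ps, skipping moves in the tabu list.

    neighbors: processes of ps living on other processors;
    links_to_processor[p]: link count from ps to the processor holding p.
    """
    # Prefer a neighbor on the processor to which ps has the most links.
    for p in sorted(neighbors, key=lambda q: -links_to_processor[q]):
        if (ps, p) not in tabu:
            tabu.append((ps, p))  # deque(maxlen=ntabu) forgets old moves automatically
            return p
    # Every preferred exchange is tabu: fall back to a random neighbor.
    return random.choice(neighbors)

tabu = deque(maxlen=7)  # ntabu = 7
partner = propose_move("ps", ["p1", "p2"], {"p1": 3, "p2": 5}, tabu)
print(partner)  # p2: ps has the most links to p2's processor
```

Because the deque is bounded by ntabu, a rejected exchange cannot be retried until ntabu other moves have been proposed, which is what prevents cycling.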
5.4. Annealing Schedule
The search for adequate cooling schedules has been addressed in many papers
during the last few years and several approaches have been proposed. In
‘‘traditional’’ simulated annealing algorithms, the temperature is controlled with a
3 k is determined empirically (k = 2 when a toroid network model is used).
4 ntabu is determined empirically (Glover [18] suggests using 7 moves).
BOUKERCHE1466
simple schedule of the form

    T_n = Next_Temp(T) = α T_{n-1},

where 0 ≤ α < 1, usually taken to be at least 0.9, with T_0 = T_initial
(T_initial = 100, T_final = 0.1). The annealing factor α controls the rate of annealing.
In the course of our experiments, in order to decrease the number of iterations and
(thus) the running time of the algorithm, we decided to use an adaptive schedule. The
motivation to derive an adaptive schedule is based on the observation that the
behavior of SA is very different at high and low temperatures. Indeed, at high
temperatures, it is the number of acceptances that dictates equilibrium, while at low
temperatures it is the number of rejected states that dictates equilibrium. Another
important factor is the number of iterations required to reach the final state. Our
experiments indicate that using an adaptive cooling schedule produces satisfactory
results.
To determine the value of T_0, we perform a sequence of random moves and
compute the quantity Δ_avg, the average change in cost over uphill moves. We should
have P = e^{-Δ_avg/T_0} ≈ 1 so that there will be a high probability of acceptance at
high temperatures. This suggests that T_0 = -Δ_avg / ln(P) is a good choice. The same
cooling schedule used in TimberWolf3.2 [33] has been chosen in our experiments.
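A sketch of this initial-temperature rule, assuming the uphill cost changes of a few random trial moves have already been collected (the sample values and the target acceptance probability P = 0.95 are illustrative):

```python
import math

def initial_temperature(uphill_deltas, p_accept=0.95):
    """T_0 = -Δ_avg / ln(P): uphill moves are then accepted with probability ≈ P."""
    delta_avg = sum(uphill_deltas) / len(uphill_deltas)
    return -delta_avg / math.log(p_accept)

# Cost increases observed over a few random trial moves (illustrative values).
t0 = initial_temperature([2.0, 3.0, 1.0])
print(t0 > 0)  # True: ln(P) < 0 for P < 1, so T_0 comes out positive
```

Choosing P close to 1 yields a large T_0, so virtually every early move is accepted; lowering P starts the anneal cooler, which is what a good initial partition permits.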
The cooling strategy is based on the following: (1) Allow a few iterations in which
virtually every new state is accepted and where T is reduced quite rapidly from
iteration to iteration. (2) Having left the high-T regime, reduce T in such a manner
that Δ(F) is approximately the same from iteration to iteration. (3) When T is
reduced below a temperature fixed a priori (T_min = 1.5), reduce T very
rapidly so as to converge to a sub-optimal solution. Table 1 shows the cooling
schedule α vs. T.
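The schedule of Table 1 can be sketched as a threshold lookup; the interpretation that each tabulated factor applies at or above the corresponding temperature is our reading of the table:

```python
# Table 1 thresholds: the factor α applied while T is at or above each tabulated value.
SCHEDULE = [(150.0, 0.95), (100.0, 0.90), (50.0, 0.85), (10.0, 0.80), (1.5, 0.10)]

def next_temp(t):
    """T_n = α · T_{n-1}, with α looked up from the current temperature."""
    for threshold, alpha in SCHEDULE:
        if t >= threshold:
            return alpha * t
    return 0.10 * t  # below T_min = 1.5: cool very rapidly

print(next_temp(100.0))  # 90.0
```

The small factor below T_min = 1.5 implements step (3) above: once the search has settled, the temperature collapses quickly and the algorithm converges.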
5.5. Equilibrium and Termination Conditions
At each temperature T, the algorithm is said to be at equilibrium if the number of
acceptances exceeds a constant β, or if the number of rejections at temperature T
exceeds γ = f(β). The value of β is decreased at successive temperatures, thereby
causing the effort needed to attain equilibrium to be approximately equal for all
temperatures. We consider the following expressions:

    β_n = a β_{n-1}, where 0 ≤ a < 1,

and

    γ_n = r β_n, where r < 100.
TABLE 1
α vs. T

    α    0.95    0.9    0.85    0.80    0.10
    T    150     100    50      10      1.5
The reason for choosing γ_n > β_n is that as the temperature is lowered, the number
of rejections increases. If the partitions computed cannot be improved over c
consecutive temperatures, the algorithm terminates. The choices of r, a and c were
determined as a result of our experiments.
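A sketch of the equilibrium test and the β update, with illustrative values for a and r (the function names are ours):

```python
def at_equilibrium(accepts, rejects, beta, r=10):
    """Equilibrium at temperature T: acceptances exceed β, or rejections exceed γ = r·β."""
    return accepts > beta or rejects > r * beta

def next_beta(beta, a=0.9):
    """β_n = a · β_{n-1}, 0 ≤ a < 1: fewer acceptances are demanded as T drops."""
    return a * beta

print(at_equilibrium(11, 0, 10))   # True: acceptance count exceeded β
print(at_equilibrium(5, 50, 10))   # False: neither threshold exceeded
```

Shrinking β while keeping γ = r·β a multiple of it matches the observation above: at low temperatures rejections dominate, so the rejection threshold must stay much larger than the acceptance threshold.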
6. SIMULATION ENVIRONMENT
The goal of our experiments was to study the impact of our partitioning
algorithm on the performance of a conservative synchronized parallel simulation
making use of null messages [10]. The experiments were conducted on an Intel
Paragon at CalTech. The Paragon is a distributed memory multicomputer,
consisting of 72 nodes, arranged in a two-dimensional mesh. The nodes use the
Intel i860 XP microprocessor and have 32 Mbytes of memory each. The A4 has a
peak performance of 75 Mflops (64-bit) per node. In Table 2 we show the basic
communication primitives provided by the communication system of the Paragon
A4. (Refer to [19] for more details on how these parameters are derived.)
The time t required to send a message of length n is fitted by least squares to the
straight line described by the performance parameter pair (b∞, n_1/2) [19]. Informally,
b∞ is the asymptotic bandwidth which characterizes long-message performance,
while n_1/2 tells us how quickly, in terms of increasing message length, this asymptotic
rate is approached, reaching half of its value when n = n_1/2. Formally,

    t = (n + n_1/2) / b∞,

hence the message startup, or latency, is

    t_0 = n_1/2 / b∞,

and the transfer rate (or bandwidth), b, for a message of length n is

    b = n / t = b∞ / (1 + n_1/2 / n).

The long- and short-message performance of the Intel Paragon is characterized by
b∞ = 23.5 Mbytes/s, n_1/2 = 40,044 bytes and t_0 = 172 ms.
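These relations can be checked numerically with the Paragon figures quoted above (a sketch; the constant and function names are ours):

```python
# Paragon parameters quoted above: b_inf in bytes/s, n_half in bytes.
B_INF = 23.5e6
N_HALF = 40044

def transfer_time(n):
    """t = (n + n_half) / b_inf."""
    return (n + N_HALF) / B_INF

def bandwidth(n):
    """b = n / t = b_inf / (1 + n_half / n)."""
    return n / transfer_time(n)

# At n = n_half the achieved rate is half the asymptotic bandwidth.
print(bandwidth(N_HALF) / B_INF)  # ~0.5
```

This makes the meaning of n_1/2 concrete: messages much shorter than n_1/2 are latency-dominated, while messages much longer approach b∞.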
TABLE 2
Basic Communication Primitives of Paragon A4

Communication type    Primitive    Description
Blocking              csend        Send a message and wait for completion
                      crecv        Receive a message and wait for completion
Non-blocking          isend        Send a message
                      msgwait      Wait until completion of communication
In our experiments we selected two realistic simulation models that mimic many
real-world problems: a distributed communication model and a traffic flow network.
7. SIMULATION EXPERIMENTS
Earlier simulation studies [7, 13, 28, 32] showed that the performance of a
simulation strategy is sensitive to the topology of the simulated network. We have
selected two workload models as benchmarks primarily because they provide a stress
test for any conservative simulation protocol as a consequence of their many cycles,
they do not contain any inherent bottlenecks, and they are real-world problems.
7.1. Distributed Communication Network Model
The distributed communication model [16] is the first benchmark used in our
experiments, as shown in Fig. 4. It models a national network consisting of four
regions which are connected through four centrally located delays. A node is
represented by a process. The final message destination is not known when a message is
created; hence, no routing algorithm is simulated. Instead, one-third of the arriving
messages are forwarded to neighboring nodes. A uniform distribution is employed to
select which neighbor receives a message. Consequently, the volume of messages
between any two nodes is inversely proportional to their hop distance. Messages may
flow between any two nodes, possibly through several paths. Nodes receive messages
at varying rates that are dependent on traffic patterns. There are numerous
deadlocks in this model, and hence it provides a stress test for any conservative
synchronization mechanism. The major sources are differing process delays and
generation rates, a large number of cycles, and a multitude of paths between any two
nodes. In fact, deadlocks occur so frequently and null messages are generated so
often that an efficient mechanism to reduce null-message generation is almost a
requirement.
Various simulation conditions were created by mimicking the daily load
fluctuation found in large communication networks operating across time zones.
FIG. 4. Distributed communication network model.
Therefore, in the pipeline region, for instance, we arranged the sub-region into
stages, and all processes in the same stage use the same normal distribution
with a standard deviation of 20%. The time required to execute a message is
significantly greater than the time to raise an event message in the next stage
(the message-passing delay). We use a pipeline sub-model with 26 processes, a toroid
sub-model with 25 processes, a fully connected sub-model with 5 processes, and a
circular sub-model with 20 processes. We chose a shifted exponential service time
distribution for the three sub-regions (circular, toroid, and fully connected),
i.e., 0.1 - log(uni()), where uni() returns a random real number uniformly distributed
between 0 and 1.
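As a sketch, the shifted exponential draw above can be written with uni() rendered by Python's random.random(); the seed and sample size are our own choices for illustration:

```python
import math
import random

random.seed(0)  # reproducible sketch

def service_time():
    """Shifted exponential: 0.1 - log(uni()), with uni() uniform on (0, 1)."""
    return 0.1 - math.log(random.random())

samples = [service_time() for _ in range(10000)]
print(min(samples) >= 0.1)  # True: the shift makes 0.1 a hard lower bound
print(abs(sum(samples) / len(samples) - 1.1) < 0.1)  # True: mean ≈ 0.1 + E[Exp(1)] = 1.1
```

The additive 0.1 acts as a minimum service time, which also provides the lookahead a conservative protocol can exploit.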
7.2. Traffic Flow Network Model
In this model, we represent the traffic network as a square mesh composed of
streets running in the horizontal and vertical directions with traffic lights at the
intersections of the streets; see Fig. 5. This model is partitioned into sub-systems,
which we refer to as grids. Each grid is assigned to an LP. Cars enter the
simulation and travel within and between the grids using message passing. To reflect
the traffic flow network model, we define a light interval to be a triple <r, g, y>, where r is
the number of clock ticks that the light is red, g is the number of clock ticks that the
light is green, and y is the number of clock ticks that the light is yellow. Figure 5
illustrates an example of a traffic network with two grids, where each grid contains
four lights. A car enters the simulation at a construction site labeled a Source Sink.
All of the boundary streets, as shown in this figure, are sources and sinks where cars
are generated according to a probability distribution. A car might travel either
north, south, west, or east with a fixed probability distribution of changing
directions at any given intersection. We also consider the contention problem at an
intersection, i.e., if a car is turning left into the path of a car going straight, then a
contention mechanism will inhibit one of the cars until the other car clears the
intersection. For a detailed description of the model, refer to Galluscio et al. [17].
FIG. 5. Traffic flow network model.
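The light-interval triple <r, g, y> above can be sketched as a phase lookup over the clock-tick cycle; the tick values in the example are illustrative, not from the paper:

```python
def light_phase(tick, r, g, y):
    """Return the phase of a light at a clock tick within its r + g + y cycle."""
    t = tick % (r + g + y)
    if t < r:
        return "red"
    if t < r + g:
        return "green"
    return "yellow"

print(light_phase(0, 30, 25, 5))   # red
print(light_phase(40, 30, 25, 5))  # green
print(light_phase(57, 30, 25, 5))  # yellow
```

An LP simulating a grid can evaluate this function for each of its lights without exchanging messages, so only cars crossing grid boundaries generate inter-LP traffic.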
In our experiments, we chose an average traffic flow of 100,000 cars. The number
of lights in the traffic network is held constant at 576 lights to control the
workload of the system. To make the simulation more interesting, we chose 30 hot
spots uniformly distributed among all the grids, where the probability of a car being
generated at a source is 0.25.
7.3. Performance Results
It is well known that network communication delays grow non-linearly with the
communication load [26]. Before reaching the ‘‘knee,’’ delays are almost at a
constant level. When the load exceeds the critical point, the communication network
becomes congested and delays increase exponentially. When this occurs, the
simulation virtually crumbles. Therefore, our simulation models were designed to
stay below this critical point. This limit was enforced by a restriction of the real/null
messages within the simulation. In the event that the system gets congested and the
generation of null messages is too high, the simulation halts and reports an ‘‘error.’’
From the experimental results, we will see that the partitioning algorithm
significantly reduces the number of null messages generated by the simulation.
Thus, the Chandy–Misra conservative simulation approaches the knee of the curve
more slowly than it would in the absence of partitioning.
Recall that the goal of our experiments is to reduce the running time of the
simulation models and the overhead of the synchronization protocol used to execute
these models. The experimental results were obtained by averaging several trial runs.
First, we present in the form of graphs the results for the execution time (in seconds)
and the speedup as a function of the number of processors employed in each
simulation model. The speedup, SP(K), achieved is calculated as the ratio of the
execution time ET1 required for a sequential simulator to perform the simulation on
one processor and the time ETK required for the parallel simulation to perform the
same simulation on K processors. That is, SP(K) = ET1/ETK. The sequential
simulator makes use of the splay tree data structure because empirical evidence
indicates that it is among the fastest event list implementations.
Next, we study the synchronization overhead via the null-message ratio (NMR),
which is defined as the number of null messages processed by the simulation using the
Chandy–Misra null-message approach divided by the number of real messages
processed.
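The two metrics can be stated directly; the timings and message counts below are illustrative, not measured values from the paper:

```python
def speedup(et1, etk):
    """SP(K) = ET1 / ETK."""
    return et1 / etk

def nmr(null_msgs, real_msgs):
    """Null-message ratio: null messages processed per real message processed."""
    return null_msgs / real_msgs

# Illustrative numbers only.
print(speedup(2000.0, 250.0))  # 8.0
print(nmr(30000, 10000))       # 3.0
```

A lower NMR means less of the parallel work is spent on synchronization traffic, which is exactly what a good partition buys.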
Let us now turn to our results.
Figures 6(a) and 6(b) portray the values for the execution time of the distributed
communication and traffic flow control simulation models obtained using the
random partitioning, the Nandy–Louck’s (NL) algorithm and the SA-partitioning
algorithm. As we can see from the curves, the SA-partitioning algorithm exhibits a
better performance for both models over the randomly partitioned and the NL
schemes. For both models, we observe an approximate 25% reduction in the
execution time of the simulation using the SA-partitioning algorithm over the
randomly partitioned one when 2 and 4 processors are used. Increasing the number
of processors from 4 to 8 results in a reduction of the run time of the simulation
model by approximately 25–35% using the SA-partitioning strategy when compared to
the randomly partitioned one.

FIG. 6. Execution time vs. number of processors. (a) Distributed Communication model; (b) Traffic
Flow Control.
We also observe that the NL-partitioning scheme exhibits better performance
when compared to the random one, but not as good as the SA-partitioning scheme.
Figures 7(a) and 7(b) portray the speedup for both the distributed communication
and traffic flow control simulation models, and for the random, SA-partitioning, and
Nandy–Louck's strategies. We observe significant speedups for both models. The
results show that careful static partitioning is important to the success of the
Chandy–Misra null-message simulation protocol [10]. The impact of partitioning increases
with the number of processors. This is due to the fact that the communication cost
increases with the number of processors. Moreover, these results show that the
simulated annealing optimization scheme was successful in providing a good near-
optimal partition of the simulated models.
FIG. 7. Speedup vs. number of processors. (a) Distributed Communication model; (b) Traffic Flow
Control.

Next, we wish to study the overhead of the conservative simulation protocol,
namely the NMR. We also wish to investigate how successful the SA-partitioning
scheme is in decreasing this overhead. Figures 8(a) and 8(b) display the overhead
NMR as a function of the number of processors employed in the simulation model.

FIG. 8. Synchronization overhead vs. number of processors. (a) Distributed Communication model; (b)
Traffic Flow Control.
We observe a significant reduction in null messages for both the NL- and SA-
partitioning schemes in both models, when compared to a random placement. Also,
the NMR increases as the number of processors increases for both simulation
models. For example, we observe in Figs. 8(a) and 8(b) that, if we confine ourselves to
fewer than 16 processors, approximately a 10% NMR reduction using SA-
partitioning is observed over the NL scheme and 20% over the random one for
both simulation models. Increasing the number of processors from 16 to 32, we observe
about a 15% reduction over the NL scheme, and about a 25% reduction over the
random partitioning strategy. These results clearly indicate the success of
simulated annealing in providing good near-optimal partitions that efficiently help to
reduce the NMR overhead. They also show that careful partitioning improves the
performance of conservative parallel simulation.
8. CONCLUSION
Biologically inspired techniques, such as genetic algorithms and neural networks, as
well as methods based upon natural phenomena such as simulated annealing, have been
successfully applied to optimization problems and a variety of applications. In this
context, this paper shows how simulated annealing contributes a new
tool for tackling challenging problems such as discrete event simulations.
In this paper, we have addressed the problem of partitioning a conservative
parallel simulation on a parallel computer making use of the simulated annealing
paradigm. The synchronization protocol makes use of Chandy–Misra null messages
[10, 13]. We have described a simulated annealing algorithm with an adaptive search
schedule to find good (sub-optimal) partitions. We have reported a set of simulation
experiments conducted to study the performance of our scheme using two
real-world models (i.e., distributed communication and traffic flow control). Our
results indicate that careful partitioning is important to improving the efficiency of
conservative parallel simulations. We obtained a significant reduction in the run time
of the simulations with the use of the simulated annealing algorithm described, as
well as a reduction in the null-message overhead, compared to the use of the random
and Nandy–Louck's (NL) schemes.
One remaining problem is the long computation time of simulated annealing in the
case of large-scale networks. Recent work by Nabhan and Zomaya [29, 30] has shown
that fast parallel simulated annealing algorithms with low communication overhead
are possible. We plan to investigate their approaches, and to parallelize our algorithm
using a simple and more efficient scheme. We also plan to investigate efficient
methods to collect the data necessary for the SA algorithm.
ACKNOWLEDGMENTS
We thank the anonymous referees for their valuable comments on an earlier version of the paper. Their
suggestions have greatly enhanced the quality of this paper.
REFERENCES
1. E. R. Barnes, An algorithm for partitioning the nodes of a graph, SIAM J. Algebraic Discrete Methods
3 (1982), 541–550.
2. S. Bokhari, ‘‘Assignment Problems in Parallel and Distributed Computing,’’ Kluwer Academic
Publishers, Boston, 1987.
3. A. Boukerche, Partitioning PCS wireless networks for parallel simulation, Internat. J. Interconnection
Networks 1 (2000), 173–193.
4. A. Boukerche, Time management in parallel simulation, in ‘‘High Performance Cluster Computing’’
(R. Buyya, Ed.), pp. 375–394, Prentice-Hall, Englewood Cliffs, NJ, 1999.
5. A. Boukerche and S. K. Das, Nature-inspired optimization algorithms for parallel simulations, in
‘‘Solutions to Parallel and Distributed Computing Problems: Lesson from Biological Sciences’’ (A.
Zomaya, F. Ercal, and S. Olariu, Eds.), pp. 87–109, Wiley & Sons, New York, 2001.
6. A. Boukerche and S. K. Das, ‘‘Dynamic Load Balancing Strategies for Parallel Simulations,’’
pp. 10–28, IEEE/ACM PADS, New York, 1997.
7. A. Boukerche and C. Tropper, ‘‘Parallel Simulation on the Hypercube Multiprocessor,’’ Distributed
Computing, pp. 181–190, Springer-Verlag, Berlin, 1995.
8. T. Bui, C. Heigham, C. Jones, and T. Leighton, Improving the performance of the Kernighan–Lin and
simulated annealing graph bisection algorithms, in ‘‘Proceedings of the 26th ACM/IEEE Design
Automation Conference,’’ pp. 775–778, 1989.
9. K. M. Chandy and J. Misra, Asynchronous distributed simulation via sequence of parallel
computations, Comm. ACM 24 (April 1981), 198–206.
10. K. M. Chandy and J. Misra, Distributed simulation: A case study in design and verification of
distributed programs, IEEE Trans. Software Eng. SE-5 (1979), 440–452.
11. A. Ferscha, Parallel and distributed simulation of discrete event systems, in ‘‘Handbook of Parallel
and Distributed Computing’’ (A. Zomaya, Ed.), McGraw–Hill, New York, 1995.
12. C. M. Fiduccia and R. M. Mattheyses, A linear-time heuristic for improving network partitions, in
‘‘Proceedings of the 19th Design Automation Conference,’’ Las Vegas, pp. 175–181, 1982.
13. R. M. Fujimoto, Parallel discrete event simulation, Comm. ACM 33 (1990), 30–53.
14. B. P. Gan, Y. H. Low, S. Turner, W. Cai, and W. J. Hsu, Load balancing for conservative simulation
on shared memory multiprocessor systems, in ‘‘Proceedings of the 14th IEEE/ACM PADS 2000,’’ pp.
139–146, 2000.
15. M. R. Garey and D. S. Johnson, ‘‘Computers and Intractability: A Guide to the Theory of NP-
Completeness,’’ W.H. Freeman and Company, New York, 1979.
16. D. Glazer and C. Tropper, On process migration and load balancing in time warp, IEEE Trans.
Parallel Distrib. Systems 4 (March 1993), 318–327.
17. A. Galluscio, Douglass, B. Malloy, and J. Turner, A comparison of two methods for advancing
time in parallel discrete event simulation, in ‘‘Proceedings of the 1995 Winter Simulation Conference,’’
pp. 650–657, 1995.
18. F. Glover, Tabu search, Part 1, ORSA J. Comput. 1 (1986), 190–206.
19. R. W. Hockney, The communication challenge for MPP: Intel Paragon and Meiko CS-2, Parallel
Comput. 20 (1994), 383–398.
20. L. Ingber, Simulated annealing: Practice versus theory, Math. Comput. Modeling 18 (1993), 29–57.
21. J. Keller et al., Scalability analysis for conservative simulation of logical circuits, in ‘‘VLSI Design,
Special Issue: Current Advances in Parallel Logic Simulation’’ (A. Boukerche, Ed.), Vol. 9,
pp. 219–236, 1999.
22. B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System
Tech. J. 49 (1970), 291–307.
23. H. Kim and J. Jean, Concurrency preserving partitioning algorithm for parallel logic simulation, in
‘‘VLSI Design, Special Issue: Current Advances in Parallel Logic Simulation’’ (A. Boukerche, Ed.),
Vol. 9, pp. 219–236, 1999.
24. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science 220
(1983), 671–680.
25. R. Klin and P. Banerjee, Optimization by simulated evolution with applications to standard cell
placement, in ‘‘Proceedings of the 27th ACM/IEEE Design Automation Conference,’’ 1990.
26. L. Kleinrock, ‘‘Queueing Systems,’’ John Wiley & Sons, New York, 1976.
27. S. A. Kravits and R. Rutenbar, Placement by simulated annealing on a multiprocessor, IEEE Trans.
Comput. Des. CAD-6 (July 1987), 534–549.
28. Y. B. Lin and E. D. Lazowska, ‘‘Conservative Parallel Simulation for Systems with no Lookahead
Prediction,’’ TR 89-07-07, Department of Computer Science and Engineering, University of
Washington, 1989.
29. T. M. Nabhan and A. Y. Zomaya, A parallel computing engine for a class of time critical processes,
IEEE Trans. Systems Man Cybernet., Part B 27 (1997).
30. T. M. Nabhan and A. Y. Zomaya, A parallel simulated annealing algorithm with low computation
overhead, IEEE Trans. Parallel Distrib. Systems 6 (1997), 1226–1253.
31. B. Nandy and W. M. Loucks, On a parallel partitioning technique for use with conservative parallel
simulation, in ‘‘Proceeding of the 7th Workshop on Parallel and Distributed Simulation,’’ pp. 43–51,
1993.
32. D. M. Nicol, Parallel discrete-event simulation of FCFS stochastic queuing networks, in ‘‘Proceedings
of the ACM SIGPLAN Symposium on Parallel Programming, Environments, Applications, and
Languages,’’ Yale University, July 1988.
33. L. A. Sanchis, Multiple-way network partitioning, IEEE Trans. Comput. 38 (January 1989),
62–81.
34. F. Sarkar and S. K. Das, Design and implementation of dynamic load balancing algorithms for
rollback reduction in optimistic PDES, in ‘‘VLSI Design, Special Issue: Current Advances in Parallel
Logic Simulation’’ (A. Boukerche, Guest Ed.), Vol. 9, pp. 271–290, 1999.
35. W. K. Su and C. L. Seitz, Variants of the Chandy–Misra–Bryant distributed discrete event
simulation algorithm, in ‘‘Proceedings of the SCS Multiconference on Distributed Simulation,’’
Vol. 21, 1989.
36. C. Sporrer and H. Bauer, Corolla partitioning for distributed logic simulation of VLSI-circuits, in
‘‘Proceedings of the SCS Multiconference on Parallel and Distributed Simulation,’’ SCS Simulation
Series, Vol. 23, pp. 85–92, 1993.
37. L. Tao, B. Narahari, and Y. C. Zhao, ‘‘Partitioning Problems in Heterogeneous Computing,’’
Workshop on Heterogeneous Processing, pp. 23–28, 1993.
38. A. C. Palaniswamy and P. Wilsey, An analytical comparison of periodic checkpointing and
incremental state saving, in ‘‘Proceedings of PADS’93,’’ pp. 127–134, 1993.
39. E. E. Witte, R. D. Chamberlain, and M. A. Franklin, Task assignment by parallel simulated
annealing, IEEE Trans. Parallel Distrib. Systems (1991).
40. S. Salleh and A. Y. Zomaya, Multiprocessor scheduling using mean-field annealing, J. Future
Generation Comput. Systems 14 (1998), 393–408.
41. A. Y. Zomaya and R. Kazman, Simulated annealing techniques, in ‘‘Handbook of Algorithms
and Theory of Computation’’ (M. J. Atallah, Ed.), Chap. 37, pp. 37.1–37.19, CRC Press, Boca Raton,
FL, 1999.
AZZEDINE BOUKERCHE is an assistant professor of computer sciences at the University of North
Texas, and the Founding Director of the Parallel Simulation and Distributed Systems Research
Laboratory (PARADISE) at UNT. Prior to this, he worked as a senior scientist in the Simulation
Sciences Division of Metron Corporation, located in San Diego. He was employed as a faculty member at the
School of Computer Science (McGill University), and he also taught at the Polytechnic of Montreal. He spent
the 1991–1992 academic year at the JPL-California Institute of Technology, where he contributed to a project
centered on the specification and verification of the software used to control interplanetary spacecraft
operated by the JPL/NASA Laboratory.
His current research interests include wireless networks, mobile computing, distributed systems,
distributed computing, distributed interactive simulation, parallel simulation, and VLSI design. Dr.
Boukerche has published several research papers in these areas. He was the recipient of the best research
paper award at IEEE/ACM PADS’97, the recipient of the National Award for Telecommunication
Software in 1999 for his work on a distributed security system for mobile phone operations, and has been
nominated for the best paper award at the IEEE/ACM PADS’99, and ACM MSWiM’2001. He was the
Program Co-Chair of the third IEEE International Workshop on Distributed Simulation and Real Time
Applications (DS-RT’99), and a Program Co-Chair of the 2nd ACM Conf. on Modeling, Analysis and
Simulation of Wireless and Mobile Systems (MSWiM’99), the General Co-Chair of the principle
Symposium on Modeling Analysis, and Simulation of Computer and Telecommunication Systems
(MASCOTS), in 1998, a General Chair of the 3rd ACM Conf. on Modeling, Analysis and Simulation of
Wireless and Mobile Systems (MSWiM’2000), and a General Chair of 4th IEEE International workshop
on Distributed Simulation and real Time Application (DS-RT’2000), a Chair and the main organizer of a
special Session on wireless and mobile computing at the IEEE HiPC’2000. and as a Tools-Chair for
MASCOTS 2001.
He served as a guest editor for several international journals: VLSI Design, the Journal of Parallel and
Distributed Computing (JPDC), ACM Wireless Networks (WINET), and ACM Mobile Networks and
Applications (MONET). Dr. Boukerche serves as a Program Co-Chair for the 35th SCS/IEEE/ACM
Annual Simulation Symposium (2002), a Program Co-Chair for the 10th IEEE/ACM International
Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2002),
and Steering Committee Chair of the IEEE DS-RT, and ACM MSWiM conferences.
He has been a Program Committee member for several international conferences, such as ICC, ANSS,
ICPP, MASCOTS, BioSP3, ICCI, MSWiM, PADS, WoWMoM, WLCN, and the IFIP Networking
Conference. He is an Associate Editor of SCS Transactions and an executive member of the IEEE Task
Force on Cluster Computing. Dr. Boukerche is a member of the IEEE and the ACM.