Journal of Parallel and Distributed Computing 62, 1454–1475 (2002), doi:10.1006/jpdc.2002.1856
An Adaptive Partitioning Algorithm for Distributed Discrete Event Simulation Systems1
Azzedine Boukerche
Parallel Simulation and Distributed Systems (PARADISE) Research Laboratory, Department of Computer
Science, University of North Texas, Denton, Texas 76203-1366
E-mail: boukerche@cs.unt.edu
Received March 29, 2000; accepted January 29, 2002
Biocomputing techniques have been proposed to solve combinatorial
problems elegantly by such methods as simulated annealing, genetic
algorithms and neural networks. In this context, we identify important
optimization problems arising in conservative distributed simulation, such as
partitioning, synchronization and communication overhead minimization. We
propose the use of a simulated annealing algorithm with an adaptive search
schedule to find good (sub-optimal) partitions. The paper discusses the
algorithm, its implementation, and reports on the performance results of
simulating several workload models on a multiprocessor machine.
The results obtained indicate clearly that a partitioning which makes use of
our simulated annealing algorithm significantly reduces the running time of a
conservative simulation and decreases the synchronization overhead of the
simulation model when compared to the Nandy–Loucks partitioning algorithm.
© 2002 Elsevier Science (USA)
1. INTRODUCTION
In recent years, there has been a growing interest in developing efficient solutions
(sequential and parallel) to hard combinatorial optimization problems, which are
based on biological, evolutionary, and/or natural processes. These approaches fall
under the realm of a general paradigm, called biocomputing, which includes such
bio-based methods as genetic algorithms, cellular automata, DNA and neural
networks; as well as methods based upon natural phenomena such as simulated
annealing [20, 41] and mean-field annealing [40]. It turns out that almost all of these
approaches are inherently parallel and distributed in nature.
Over the last two decades, simulations of complex systems have been identified
as an important area to exploit the inherent parallelism present in applica-
tion problems. To this end, a significant body of literature on parallel/distributed
1 This work is supported by UNT Faculty Research Grant and the Texas Advanced Research Program
grant TARP-003594-0092-2001.
simulation has been proposed (see [4, 11, 13]). There are two basic approaches to
parallel simulation: conservative and optimistic. While conservative synchronization
techniques rely on blocking to avoid violation of dependence constraints, optimistic
methods rely on detecting synchronization errors at run-time and then on recovery
using a rollback mechanism. In both approaches, the simulated system is modeled as
a network of logical processes (LP) which communicate only via message passing.
While solving problems in parallel, one encounters a number of optimization
problems like scheduling, partitioning and dynamic load balancing which must be
tackled efficiently if one expects to achieve significant performance gains. Due to the
NP-hard nature of these problems in general, it is highly unlikely that one can obtain
exact solutions whose running times are polynomially bounded in the size of the problem.
Therefore, research has been directed to find fast, approximate (or near-optimal)
solutions.
In this paper, we consider the problem of partitioning a conservative parallel
simulation for execution on a multi-computer. The synchronization protocol makes
use of Chandy–Misra null messages [4, 10, 13]. We propose the use of a simulated
annealing algorithm with an adaptive search schedule for generating good (sub-
optimal) partitions for conservative simulation. This paper discusses the algorithm,
its implementation and reports on the performance results of several simulation
models executed on a multiprocessor machine.
The remainder of this paper is organized as follows. Section 2 introduces the major
issues of parallel discrete event simulation. In Section 3, we review previous and
related work. Section 4 describes our distributed simulation model. Section 5 is
devoted to a description of the partitioning algorithm using the simulated annealing
paradigm, followed by a discussion of the performance results which we obtained.
The conclusion follows.
2. FUNDAMENTALS OF PARALLEL DISCRETE EVENT SIMULATION
In this section, we introduce the basic terminology and major issues pertaining to
parallel discrete event simulation (PDES) which should provide exactly the same
solution to a problem as a sequential simulation. Thus, in specifying and developing
a parallel simulator, it is important to understand the sequential nature of the
simulation.
In a discrete-event simulation, the model evolution is defined by instantaneous
events. Each event corresponds to a transition in a portion of the model state,
composed of state variables, each describing a characteristic of the model. Each event
also has a simulation time associated with it, called timestamp, which defines its
occurrence time. Each event may in turn generate new future events.
The generation of new events and the dependency of their transitions on state
variables that previous events may have updated, define a relation of causal order
(namely, a partial order) among events. Related events are said to be causally
dependent, whereas unrelated ones are called concurrent. In order to guarantee the
correctness, concurrent events may be safely processed in any order in a simulation,
whereas causally dependent events must be processed according to the causal order.
Thus, to ensure the strict chronological order, events are processed one at a time,
resulting in an (apparently) sequential program. A typical template for a sequential
simulation is given in Fig. 1.
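The loop of Fig. 1 can be sketched as follows. This is a minimal Python illustration; the `handle` callback and the dictionary-based state are our assumptions, not the paper's notation:

```python
import heapq

def simulate(initial_events, handle, end_time):
    """Sequential discrete-event loop: repeatedly pop the event with the
    smallest timestamp, update the state, and schedule any new events."""
    future = list(initial_events)          # (timestamp, event) pairs
    heapq.heapify(future)                  # the "event list", as a heap
    state = {"clock": 0.0}
    while future:
        ts, ev = heapq.heappop(future)     # earliest event first
        if ts > end_time:
            break
        state["clock"] = ts                # advance simulation clock
        for new_ts, new_ev in handle(state, ts, ev):
            heapq.heappush(future, (new_ts, new_ev))
    return state
```

Because events are always popped in timestamp order, the loop processes causally dependent events correctly, at the price of being inherently sequential.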
Only by eliminating the event list in its traditional form so as to capture the
interdependence of the process being simulated can additional parallelism be
obtained [10]. This is the objective of parallel simulation. Indeed, parallel simulation
shows a great potential in terms of exploiting the inherent parallelism of the system,
and the underlying concurrency among events to achieve execution speedup. Good
surveys of the literature may be found in [4, 11, 13].
Conceptually, a parallel simulator is composed of a set of LP which interact by
means of messages, each carrying an event and its timestamp, thus called event
messages. Each LP is responsible for managing a subset of the model state, called
local state. Each event E received by an LP represents a transition in its local state.
The events scheduled by the simulation of E are sent as event messages to
neighboring LPs to be simulated accordingly. In a simulation, events must always be
executed in increasing order of timestamps. Anomalous behavior might then result if
an event is incorrectly simulated earlier in real time and affects state variables used
by subsequent events. In the physical model this would represent a situation in which
future events could influence the present. This is referred to as a causality error. Several
synchronization protocols have been proposed to deal with this problem. These
techniques can be classified into two groups: conservative algorithms and optimistic
algorithms. While conservative synchronization techniques rely on blocking to avoid
violation of dependence constraints, optimistic methods rely on detecting synchro-
nization errors at run-time and on recovery using a rollback mechanism. Let us
briefly outline the basic principles of these two approaches.
2.1. Conservative Simulation
Conservative approaches enforce event causality by requiring that each LP
elaborates an event only if it is certain that it will not receive an earlier event.
Consequently, events are always executed in chronological order at any LP. Each
logical process, LP_i, maintains an input queue (l_ij) for each of its neighbors, LP_j, in the
network of logical processes. In the case that one or more neighboring (input) queues
are empty, LPi is blocked because an event with a smaller timestamp than the
timestamp of the waiting events might yet arrive at an empty queue. This mechanism
implies that only unblocked LPs can execute in parallel. If all the LPs were blocked,
FIG. 1. Basic sequential discrete event simulation algorithm.
the simulation would be deadlocked. Ensuring synchronization and avoiding
deadlocks are the central problems in a conservative approach. Several schemes
have been proposed to alleviate this problem. In [10], the authors employ null
messages in order to avoid deadlocks and to increase the performance of the
simulation. When an event is sent on an output link, a null message bearing the same
timestamp as the event message is sent on all other output links. As is well known, it
is possible to generate an inordinate number of null messages under this scheme,
nullifying any performance gain [13]. As a result, a number of attempts to optimize
this basic scheme have appeared in the literature. For example, in [35], the authors
refrain from sending null messages until such time as the LP becomes blocked. They
refer to this approach as eager events, lazy null messages. They reported some
success in using variations of Chandy–Misra approaches to speed up logic
simulation.
In [6, 7, 31], the authors employed the following approach. In the event that a null
message is queued at an LP and a subsequent message (either null or event) arrives
on the same channel, they overwrite the (old) null message with the new message. A
single buffer is associated with each input channel at an LP to store null messages,
thereby saving space as well as the time required to perform the queueing and de-
queueing operations associated with null messages.
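A minimal sketch of this single-buffer scheme (class and method names are ours, not from the cited work):

```python
class InputChannel:
    """One input channel of an LP. Real event messages are queued, while a
    null message only ever occupies a single buffer slot: any later message
    arriving on the same channel overwrites the stale null message."""
    def __init__(self):
        self.events = []         # FIFO of real event-message timestamps
        self.null_buffer = None  # at most one pending null-message timestamp

    def receive(self, timestamp, is_null):
        if is_null:
            self.null_buffer = timestamp   # newer null overwrites the old one
        else:
            self.events.append(timestamp)
            self.null_buffer = None        # event supersedes the stale null

    def head_timestamp(self):
        """Lower bound used by the LP to decide whether processing is safe;
        None means the channel is empty and the LP must block on it."""
        if self.events:
            return self.events[0]
        return self.null_buffer
```

Only one buffer per input channel is ever devoted to null messages, which saves both space and the queueing/de-queueing work the text describes.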
2.2. Optimistic Approach
Time Warp is based on an optimistic approach and enforces the causal order
among events as follows: events are greedily simulated in timestamp order until no
event messages remain or until a message arrives in the ‘‘past’’ (a straggler). Upon
receiving a straggler, the process execution is interrupted, and a rollback action takes
place using anti-messages. Each message is given a sign; positive messages indicate
ordinary events, whereas negative messages indicate the need to retract any
corresponding event that was previously processed. Similar messages that have
different signs are called anti-messages. If a negative message is received, it and the
corresponding positive message are both annihilated. A rollback consists of three
phases: (i) restoration: the latest state (with respect to simulation time) valid before
the straggler’s timestamp replaces the current state, and successive states are
discarded from the state queue; (ii) cancellation: the negative copies of messages
which were produced at simulation times successive to the straggler’s timestamp are
sent to the proper processes, to possibly activate rollbacks there; and (iii) coasting-
forward: the effective state which is valid at the straggler’s timestamp is computed by
starting from the restored state and by elaborating those messages with a timestamp
up to the straggler's; during this phase no message is produced. Rollbacks are made
possible by means of state checkpointing. The whole state of the process is
checkpointed into the state queue according to some discipline, see [38].
To minimize the storage overhead required to perform rollbacks, and to detect the
termination of LPs, the optimistic synchronization mechanism uses a local virtual time
(LVT), and a global virtual time (GVT). The LVT represents the timestamp of the
latest processed event at an LP; whereas GVT is defined as the minimum of all the
local virtual times of all LPs, and of all the timestamps of messages in transit within
the simulation model. The GVT indicates the minimum simulation time at which a
causal violation may occur. The use of GVT computation is to commit the safe
portion of the simulation.
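As a small illustration, computing GVT from the LVTs and the in-transit message timestamps reduces to a minimum; the list-based helper below is our own simplification:

```python
def gvt(lvts, in_transit):
    """Global virtual time: the minimum over all local virtual times and
    over the timestamps of all messages still in transit. No rollback can
    ever reach below GVT, so state saved before GVT may be committed
    (and its storage reclaimed)."""
    return min(list(lvts) + list(in_transit))
```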
3. PREVIOUS AND RELATED WORK
Scheduling, load balancing and partitioning, in parallel and distributed systems in
general, and parallel simulation in particular, have long been identified as important
optimization problems. The existing literature on partitioning and mapping problems
is based on approaches like graph theoretic, queueing theoretic, mathematical
programming, numeric and non-numeric heuristics, and/or combined methods; see
[2]. In all these formulations, finding an optimal solution is NP-hard [15]
in all but very restricted cases. Thus research has focused on the development of
heuristic algorithms to find suboptimal solutions. Biocomputing/evolutionary/
natural techniques have been proposed to solve these problems elegantly by such
methods as simulated annealing, genetic algorithms, neural networks, or stochastic
processes [5].
Nabhan and Zomaya [29] proposed an optimal task scheduler. The scheduling
scheme employs a simulated annealing algorithm that minimizes a cost function
representing the expected performance of each schedule. Their proposed model is a
simple and flexible way of generating efficient formulations for computational
models. The efficiency of their scheduler is demonstrated by two case studies with
promising results.
Kling and Banerjee [25] proposed simulated evolution as an alternative to
annealing, and applied it in the context of cell placement in VLSI design. Their
technique is the mathematical analog of natural selection in biological
environments, and it performs three basic steps: evaluation, selection, and
allocation. The first step computes the ‘‘goodness’’ of the particular cell position.
In the second step, the cells are probabilistically selected for replacement according
to their goodness. Finally, the third step removes cells from their current allocation
and searches for an improved location. These three steps are repeated until no
further improvement is found. The experimental results showed that simulated
evolution is slow. As a consequence, the authors make use of hierarchical and
window techniques that reduce the running time significantly.
Tao et al. [37] studied the problem of allocating the interacting task modules of a
parallel program to heterogeneous processors in a parallel architecture. They present
three heuristics for task assignment, based on simulated annealing, tabu search, and
stochastic probe approaches, respectively. The stochastic probe approach is a
combination of the aggressive search process in the tabu search and the stochastic
search process in the simulated annealing approach.
Kernighan and Lin [22] propose a two-way partitioning algorithm with
constraints on the final subset sizes. They applied pairwise swapping and iterated
on all pairs of nodes to find the best improvement on the existing partition. Fiduccia
and Mattheyses [12] further improved this algorithm by developing a clever
implementation for each iteration to achieve a linear complexity for each iteration.
Sanchis [33] then adapted this model to multiple-way partitioning. Her
algorithm attempts to minimize the communication between processors and
keep the number of processes per processor within a specified range. The
algorithm uses a concept of levels; each level successively produces a better
cut set. A VLSI component model and a network of SUN-workstations were
employed for experimentation.
The Kernighan–Lin-based algorithms unfortunately share the common weakness
that they are often trapped by local minima when the size of the problem is very
large. One way to overcome this difficulty is to form clusters, and then condense
these clusters into single nodes prior to the execution of the Kernighan–Lin-based
algorithms. The complexity of the problem is thus dramatically reduced, which in
turn improves the performance of the partitioning algorithm [8].
Each of the preceding approaches to the partitioning problem provides good
solutions for restricted applications, but they are not suitable for parallel and
distributed simulations since the synchronization constraints exacerbate the
dependencies between the LPs in the system graph. In order to achieve the best
performance, the partitioning and the load balancing should decrease the running
time of the simulation compared to a random partitioning.
In the past, attempts have been made to tackle the partitioning and load balancing
problems in PDES. Nandy and Loucks [31] presented a static partitioning algorithm
for conservative parallel logic simulation. The algorithm attempts to minimize the
communication overhead and to uniformly distribute the execution load among the
processors. It starts with an initial random partition and then iteratively moves
processes between clusters until no improvements can be found (a local optimum).
All possible moves for each process are considered. The process which contributes to
the maximum gain is chosen. A process is moved only if it does not violate the block
size constraints. As a benchmark, they use the simulation of circuits modeled at the
gate level. A message passing multicomputer composed of eight T-800 INMOS
transputers was employed as the simulation platform. They report a 10–25%
reduction in simulation time from the simulation time of a random partition [31].
Kim and Jean [23] presented an efficient, linear-time partitioning algorithm for
parallel logic simulation, based on a linear ordering of vertices in a directed graph.
The performance results of their algorithm show a good compromise between a
high degree of concurrency, a balanced workload and a reasonable amount of
interprocessor communication. The authors showed that the optimistic parallel
simulation algorithm combined with the CPP algorithm enables powerful parallel
gate-level circuit simulators.
Jörg Keller et al. [21] presented several new serialization-free parallel data
structures which seem to have a large impact on the performance of logic simulation.
The efficiency of these data structures is based upon the use of parallel prefix
operations. The authors consider first the PTHOR algorithm for simulation of
logical circuits, which uses a conservative approach, then they show how attainable
speedup can be increased by several changes in the data structures, including the
memory management. The paper also examines the partitioning problem of the
simulated circuit among the parallel processors. Several dynamic partitioning
algorithms are described and compared with each other.
In [14], the authors presented both static and dynamic load balancing strategies.
The static partitioning scheme attempts to minimize load imbalances while
maximizing lookahead values. The static partitioning packages, Metis and Scotch,
were used to achieve these objectives. The scalability study based on a
performance analyzer has shown that it is indeed important to apply a partitioning
algorithm instead of a random strategy.
In [3], a simple partitioning scheme based upon the pairwise exchange was
developed to reduce the overhead of the rollbacks of a hybrid parallel simulation of
wireless networks. The synchronization protocol makes use of both conservative and
timewarp paradigms. The results obtained were quite encouraging using a cluster of
network workstations. Sarkar and Das [34] proposed two algorithms for dynamic
load balancing which reduce the number of rollbacks in an optimistic PDES system.
The first algorithm is based on the load transfer mechanism between LPs; while the
second algorithm, based on the principle of evolutionary strategy, migrates logical
processes between several pairs of physical processors. They have implemented both
of these algorithms on a cluster of heterogeneous workstations and studied their
performance. The experimental results indicated that the algorithm based on the
load transfer is effective when the grain size is greater than 10 ms, while the
algorithm based on the process migration yielded good performance only for grain
sizes of 20 ms or larger.
4. DISTRIBUTED SIMULATION MODEL
Our concern in this paper is to study the importance of partitioning on the
performance of a conservative distributed simulation. This conservative mechanism
is based on a distributed model of computation in which processes communicate via
messages. In this model a network of LPs is used to simulate the objects of a system,
e.g., a computer network. Explicit links connect the LPs, and messages are
forwarded between LPs over these links. In the conservative paradigm, an event
cannot be simulated by an LP before it is certain that an event with a smaller
timestamp cannot arrive. As a consequence of this blocking behavior, deadlocks
arise [9, 13]. Several solutions to this problem exist, each requiring a certain amount
of overhead. In this paper, we make use of the Chandy–Misra null-message
algorithm [10, 13].
In [10], the authors employ null messages in order to avoid deadlocks and to
increase the performance of the simulation. When an event is sent on an output link
a null message bearing the same timestamp as the event message is sent on all other
output links. As is well known, it is possible to generate an inordinate number of null
messages under this scheme, nullifying any performance gain [13].
In order to increase the efficiency of this basic scheme, we employ the following
approach. In the event that a null message is queued at an LP and a subsequent
message (either null or event) arrives on the same channel we overwrite the (old) null
message with the new message. We associate one buffer with each input channel at
an LP to store null messages, thereby saving space as well as the time required to
perform the queueing and de-queueing operations associated with null messages.
In the next section, we discuss the application of simulated annealing with an
adaptive schedule to the partitioning problem.
5. PARTITIONING ALGORITHM BASED UPON SIMULATED ANNEALING
Simulated annealing (SA), introduced by Kirkpatrick et al. [24], is a powerful
method for optimizing functions defined over complex systems. It is based on ideas
from statistical mechanics and motivated by an analogy to the behavior of physical
systems in the presence of a heat bath.
While greedy algorithms, and other simple iterative improvement techniques
accept a new configuration of lower cost and reject more costly states, SA escapes
from local minima by sometimes accepting higher cost arrangements with a
probability determined by the simulated ‘temperature’. Simulated annealing leads to
efficient heuristic algorithms having several advantages over other approaches to
solving combinatorial optimization problems [29, 39, 41]. First, it is problem-
independent. By substituting a few problem specific data structures and functions,
the SA algorithm can be applied to many combinatorial optimization problems.
Second, SA can easily handle multiple, potentially conflicting goals of a problem. It
has been successfully applied in solving combinatorial problems such as cell
placements [27], and task scheduling in real time systems [29, 30].
In this section, we show how to apply simulated annealing to partitioning
conservative simulation on multiprocessor machines. This method is derived by
mapping the complex simulation system into a physical system.
In this approach, each LP is considered as a particle moving in a space with n
distinct positions, and each state is represented as the assignment of each process
(LP) to a partition, which in turn is assigned to a processor. These LPs (or particles)
interact with each other with a force which is represented by an energy or objective
function.
In this work, all the parameters used in the partitioning algorithm are estimated.
As in [31], our approach is to develop a partitioning algorithm based upon
realistic estimates of computation and communication load. Future work will be
directed at determining these estimates, given that we are successful in reducing the
running time of the simulation with our algorithm.
The structure of the algorithm is contained in Fig. 2. In the application of
partitioning, the state is represented as the assignment of each process (LP) to a
partition, which in turn is assigned to a processor.
Starting from an initial partition, a move generates another partition. An objective
function (F ) evaluates the quality of this partition. This new partition is accepted if it
is better, or probabilistically accepted if it is worse. The probability of accepting a
new partition differing by Δ(F) is denoted as P(Δ(F)). Two probability functions
from statistical mechanics are common in simulated annealing: the ‘‘heat function’’
and the Boltzmann distribution. Kirkpatrick et al. [24] used the Boltzmann
probability factor,

P(ΔF) = min(1, e^(−Δ(F)/kT)).
FIG. 2. A simulated annealing algorithm.
This function accepts all changes of Δ(F) ≤ 0 with P(Δ(F)) = 1. It is assumed that the
Boltzmann constant k equals 1 in order to avoid inconveniently large temperature
scales. In our experiments, we used the Boltzmann distribution (see Fig. 3).
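A sketch of this acceptance rule (the function names are ours; the Boltzmann constant k defaults to 1, as assumed above):

```python
import math
import random

def boltzmann_prob(delta_f, temperature, k=1.0):
    """P(ΔF) = min(1, e^(−ΔF/kT)): improving moves (ΔF <= 0) get
    probability 1, while worsening moves are accepted with a probability
    that shrinks as ΔF grows or as the temperature T falls."""
    return min(1.0, math.exp(-delta_f / (k * temperature)))

def accept_move(delta_f, temperature, rng=random.random):
    # probabilistic accept/reject decision using the Boltzmann factor
    return rng() < boltzmann_prob(delta_f, temperature)
```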
When an equilibrium state, represented as the assignment of each process (LP) to a
partition, has been reached, the temperature is decreased. This iteration is repeated
until the algorithm meets a termination condition or until the temperature reaches its
lowest value Tfinal. The termination condition and the equilibrium condition together
are called the annealing schedule.
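The overall loop of Fig. 2 with such a schedule might be sketched as follows. Here a geometric cooling rate and a fixed move count per temperature stand in for the paper's adaptive schedule, and all function names (`objective`, `neighbor`) are illustrative placeholders:

```python
import math
import random

def simulated_annealing(initial, objective, neighbor,
                        t_start=10.0, t_final=0.01, alpha=0.9,
                        moves_per_temp=50):
    """Skeleton of the annealing loop (maximizing `objective`): at each
    temperature, propose moves until a crude equilibrium (a fixed move
    count here), then cool geometrically until T reaches t_final."""
    current, f_cur = initial, objective(initial)
    best, f_best = current, f_cur
    t = t_start
    while t > t_final:
        for _ in range(moves_per_temp):          # "equilibrium" at this T
            cand = neighbor(current)
            f_cand = objective(cand)
            delta = f_cur - f_cand               # positive if cand is worse
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current, f_cur = cand, f_cand    # accept better / lucky worse
                if f_cur > f_best:
                    best, f_best = current, f_cur
        t *= alpha                               # cooling step
    return best, f_best
```

A toy usage: maximizing f(x) = −(x − 3)² over the integers with ±1 neighbor moves converges to the global optimum x = 3.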
The following subsections describe the various components of the proposed
annealing algorithm in more detail.
5.1. Objective Function
We describe an objective function to evaluate the quality of the partitioning
solution generated by our algorithm. The function is chosen so that inter-processor
communication conflicts are minimized, processor load remains balanced, and the
probability of sending a null message between processors is minimized.
In the course of experiments [7], it became apparent that the following factors
increase the efficiency of the conservative simulation protocol:
1. Uniform distribution of the execution load among the processors (LOAD).
2. The closeness of the LPs to one another in the process graph (Diameter_avg).
LPs that are close to each other are expected to communicate more than if they are
far away from each other. Hence, assigning them to the same processor should
reduce the running time of the simulation.
3. Minimizing the inter-processor communication cost (IPC).
FIG. 3. Boltzmann acceptance function.
This is desirable because of the high cost of sending a message between processors
when compared to the cost of sending messages between processes within the same
processor.
4. Minimizing the number of links between the processors (IPL).
This will decrease the number of null messages sent between processors.
We define expressions for these parameters as well as the objective function F,
which represents the quality of the partitioning:

F = F(LOAD, Diameter_avg, IPC, IPL).
Let us denote by l_ij the average traffic between each pair of adjacent processes, n_i
the number of processes assigned to processor i, N_links the total number of links,
N_tot the total number of processes (or nodes), and K the number of processors.
We define the load at each processor (Pr_i) as follows:

Load_i = Σ_j Σ_{p_k ∈ Pr_i} l_jk (serv_k + T_send + T_rcv),
where serv_k is the amount of time p_k takes to process an event, T_send is the amount of
time it takes for an LP to send an event, and T_rcv the amount of time it takes for an
LP to receive the event. If the event is generated by an LP in processor i, then T_send
and T_rcv are negligible.
We wish for Load_i to be approximately equal to (1/K) Σ_k Load_k. We define

LOAD = K^K Π_{i=1..K} (Load_i / Load_total),

where Load_total = Σ_k Load_k and K is the number of processors. Here K^K is a
normalization factor that causes the maximum of LOAD to be 1. If the load at each
processor is equal, i.e., Load_i = Load_total / K, the quantity LOAD reaches its
maximum of 1. Therefore, we wish to maximize the quantity LOAD.
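As a quick check of this definition, a direct transcription (the input is simply a list of the K estimated processor loads):

```python
def load_balance_metric(loads):
    """LOAD = K^K * Π_i (Load_i / Load_total): equals 1 when all K
    processor loads are identical and drops toward 0 as they diverge."""
    k = len(loads)
    total = sum(loads)
    metric = float(k) ** k          # K^K normalization factor
    for li in loads:
        metric *= li / total
    return metric
```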
We also consider the quantity

Diameter_avg = (Σ_{i=1..K} Diameter_i) / K

as an average measure of distance within each cluster, where

Diameter_i = max_{j,k ∈ Processor_i} (Distance_jk)
and Distance_jk is the number of hops between LP_j and LP_k.
LPs that are close to each other are expected to communicate more than if they are
far away from each other (this is a characteristic of traffic on computer networks and
telephone systems). Hence, assigning them to the same processor should reduce the
running time of the simulation.
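A sketch of computing Diameter_avg by breadth-first search over the process graph (the function names and the adjacency-dictionary representation are ours):

```python
from collections import deque

def cluster_diameter(nodes, adj):
    """Diameter_i: the maximum hop distance in the process graph between
    any two LPs assigned to the same cluster, via BFS from each member."""
    members = set(nodes)
    best = 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:                              # standard BFS for hop counts
            u = q.popleft()
            for v in adj.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max([best] + [d for n, d in dist.items() if n in members])
    return best

def diameter_avg(clusters, adj):
    # Diameter_avg = (sum of per-cluster diameters) / K
    return sum(cluster_diameter(c, adj) for c in clusters) / len(clusters)
```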
We define the relative inter-processor communication factor to be

IPC = (t_comm / t_calc) Σ_{i,j} l_ij Cost_ij,     (1)
where t_comm is the mean time to send one message between two processors, t_calc is
the mean time for the processing of an event message, Cost_ij is the communication
overhead to prepare the message for transmission between processes p_i and p_j, and
l_ij is the average traffic between p_i and p_j. The ratio t_comm/t_calc
reflects the relative costs of inter-node communication and computation. Through-
out our discussion, we assume that the communication cost between two processes
assigned to the same processor is negligible. We chose the relative inter-
processor expression (1) because it is more expensive to send a message between
processors than to do so within a single processor. Note that we assume that we have
estimates of l_ij for each pair of adjacent processes p_i, p_j.
In order to reduce the number of null messages, we minimize the number of links
(IPL) between each pair of processors, P_i and P_j. In [1], the authors merge the two
partitions with the largest number of interconnections into a new partition if an
upper bound on the number of LPs in a processor is not exceeded. While this
approach generates partitions with a small number of interconnections the number
of LPs per processor can be quite different. Sporrer and Bauer [36] suggest the
following ratio:

k_ij = m_ij / Σ_l m_il,

where m_ij represents the number of interconnections between Processor_i and
Processor_j, and m_ii is set to zero. In [36], better partitions were obtained by making
use of k_ij instead of m_ij. Bearing this in mind, we chose to define IPL as follows:
IPL = Σ_{i,j} k_ij.
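Transcribing this definition directly (the matrix `m` holds the link counts m_ij, with m_ii = 0; this follows the formula as stated in the text):

```python
def ipl(m):
    """IPL = Σ_ij k_ij with k_ij = m_ij / Σ_l m_il, where m[i][j] counts
    the links between processors i and j (and m[i][i] is zero)."""
    total = 0.0
    for i, row in enumerate(m):
        row_sum = sum(row)
        if row_sum == 0:
            continue                 # an isolated processor contributes 0
        for j, mij in enumerate(row):
            if i != j:
                total += mij / row_sum
    return total
```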
A good partitioning is one which meets the criteria of maximizing the LOAD and
minimizing the quantities: IPC, Diameteravg and IPL. Hence we express the basic
problem as finding a partition which maximizes the following quantity:2
F = Log( LOAD^α / ( (1 + IPC)^β (1 + IPL)^γ (1 + Diameter_avg^δ) ) ),

2 The factors in the denominator are of the form (1 + x) because of the possibility that x might approach
zero. The Log function is used to avoid inconveniently large factor scales.
where α, β, γ and δ are the relative weights of the corresponding parameters. In our
experiments, α = β = γ = δ = 1 yielded good results (see Section 7).
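Putting the pieces together, the objective can be sketched as follows (the parameter names mirror the weights above; the metric inputs are assumed to have been computed as defined earlier):

```python
import math

def objective(load_metric, ipc, ipl_value, diam_avg,
              a=1.0, b=1.0, g=1.0, d=1.0):
    """F = Log( LOAD^a / ((1+IPC)^b * (1+IPL)^g * (1+Diameter_avg^d)) ).
    The (1 + x) guards keep each denominator factor away from zero and
    the Log tames the scale; a = b = g = d = 1 worked well in the
    paper's experiments."""
    denom = (1 + ipc) ** b * (1 + ipl_value) ** g * (1 + diam_avg ** d)
    return math.log(load_metric ** a / denom)
```

A perfectly balanced partition with no inter-processor traffic (LOAD = 1, IPC = IPL = Diameter_avg = 0) attains the maximum F = 0; any imbalance or communication overhead makes F negative.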
5.2. Generating an Initial Partition
The choice of initial partitioning not only affects the annealing convergence
rate, but also the final partitioning of processes to processors. The more nearly
the initial partitioning approximates the optimal partitioning, the greater is
the probability that the true optimal will be approached. Several strategies
may be used to generate an initial partition. If a random initial partitioning
is used, the problem begins with a completely unordered partitioning. Hence,
the temperature must be very high to guarantee accepted moves and to ensure
that the state space is adequately searched. Consequently, a good initial
partitioning is needed to reduce the run time of the annealing process.
Our experiments showed that by using a grid partition as an initial solution,
the annealing can start at a lower starting temperature and thereby reduce the
run time considerably.
5.3. Move Set
A large number of moves must be generated to adequately traverse the problem
search space; therefore, the amount of time needed to generate and evaluate a move
must be minimal. In the course of our experiments we settled on the following
strategy. The generation of a new configuration proceeds in several steps. Each
process is (sequentially) selected to be part of a trial move.
Let us assume that the selected process (ps) has at least k neighbor processes
{p1, p2, ..., pk},3 where each pi, i = 1, ..., k, is assigned to a different processor
than the one containing ps. Then an exchange of ps with one of its neighboring
processes from the processor to which ps has the most links is evaluated. If the
exchange is rejected, a randomly chosen process residing on a neighboring processor
of the current processor containing the process ps is chosen.
To prevent the move-set process from ‘‘thrashing’’ or going into an infinite loop,
each move is controlled by a tabu move as in tabu search [18]. It is managed by a
mechanism that makes use of historical information about the moves made as the
simulated annealing process progresses; solutions accepted during an arbitrarily defined
number (ntabu)4 of previous moves are deemed unallowable, or tabu. This prevents
cycling of more than ntabu moves. An LP or a pair of LPs is available for
movement if it is not in the tabu list.
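The tabu-controlled move selection above might be sketched as follows; the process names, link counts, and the deque-based tabu list are illustrative assumptions, with ntabu = 7 as Glover [18] suggests:

```python
from collections import deque
import random

def propose_move(ps, neighbors, links_to_processor, tabu):
    """Return an exchange partner for process ps, skipping moves in the tabu list.

    neighbors: processes of ps living on other processors;
    links_to_processor[p]: link count from ps to the processor holding p.
    """
    # Prefer a neighbor on the processor to which ps has the most links.
    for p in sorted(neighbors, key=lambda q: -links_to_processor[q]):
        if (ps, p) not in tabu:
            tabu.append((ps, p))  # deque(maxlen=ntabu) forgets old moves automatically
            return p
    # Every preferred exchange is tabu: fall back to a random neighbor.
    return random.choice(neighbors)

tabu = deque(maxlen=7)  # ntabu = 7
partner = propose_move("ps", ["p1", "p2"], {"p1": 3, "p2": 5}, tabu)
print(partner)  # p2: ps has the most links to p2's processor
```

Because the deque is bounded by ntabu, a rejected exchange cannot be retried until ntabu other moves have been proposed, which is what prevents cycling.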
5.4. Annealing Schedule
The search for adequate cooling schedules has been addressed in many papers
during the last few years and several approaches have been proposed. In
‘‘traditional’’ simulated annealing algorithms, the temperature is controlled with a
3 k is determined empirically (k = 2 when a toroid network model is used).
4 ntabu is determined empirically (Glover [18] suggests using 7 moves).
BOUKERCHE1466
simple schedule of the form

    T_n = Next_Temp(T) = α T_{n-1},

where 0 ≤ α < 1, usually taken to be at least 0.9, with T_0 = T_initial
(T_initial = 100, T_final = 0.1). The annealing factor α controls the rate of annealing.
In the course of our experiments, in order to decrease the number of iterations and
(thus) the running time of the algorithm, we decided to use an adaptive schedule. The
motivation to derive an adaptive schedule is based on the observation that the
behavior of SA is very different at high and low temperatures. Indeed, at high
temperatures, it is the number of acceptances that dictates equilibrium, while at low
temperatures it is the number of rejected states that dictates equilibrium. Another
important factor is the number of iterations required to reach the final state. Our
experiments indicate that using an adaptive cooling schedule produces satisfactory
results.
To determine the value of T_0, we perform a sequence of random moves and
compute the quantity Δ_avg, the average change in cost over uphill moves. We should
have P = e^{-Δ_avg/T_0} ≈ 1 so that there will be a high probability of acceptance at
high temperatures. This suggests that T_0 = -Δ_avg / ln(P) is a good choice. The same
cooling schedule used in TimberWolf3.2 [33] has been chosen in our experiments.
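A sketch of this initial-temperature rule, assuming the uphill cost changes of a few random trial moves have already been collected (the sample values and the target acceptance probability P = 0.95 are illustrative):

```python
import math

def initial_temperature(uphill_deltas, p_accept=0.95):
    """T_0 = -Δ_avg / ln(P): uphill moves are then accepted with probability ≈ P."""
    delta_avg = sum(uphill_deltas) / len(uphill_deltas)
    return -delta_avg / math.log(p_accept)

# Cost increases observed over a few random trial moves (illustrative values).
t0 = initial_temperature([2.0, 3.0, 1.0])
print(t0 > 0)  # True: ln(P) < 0 for P < 1, so T_0 comes out positive
```

Choosing P close to 1 yields a large T_0, so virtually every early move is accepted; lowering P starts the anneal cooler, which is what a good initial partition permits.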
The cooling strategy is based on the following: (1) Allow a few iterations in which
virtually every new state is accepted and where T is reduced quite rapidly from
iteration to iteration. (2) Having left the high-T regime, reduce T in such a manner
that Δ(F) is approximately the same from iteration to iteration. (3) When T is
reduced below a temperature fixed a priori (T_min = 1.5), reduce T very
rapidly so as to converge to a sub-optimal solution. Table 1 shows the cooling
schedule α vs. T.
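The schedule of Table 1 can be sketched as a threshold lookup; the interpretation that each tabulated factor applies at or above the corresponding temperature is our reading of the table:

```python
# Table 1 thresholds: the factor α applied while T is at or above each tabulated value.
SCHEDULE = [(150.0, 0.95), (100.0, 0.90), (50.0, 0.85), (10.0, 0.80), (1.5, 0.10)]

def next_temp(t):
    """T_n = α · T_{n-1}, with α looked up from the current temperature."""
    for threshold, alpha in SCHEDULE:
        if t >= threshold:
            return alpha * t
    return 0.10 * t  # below T_min = 1.5: cool very rapidly

print(next_temp(100.0))  # 90.0
```

The small factor below T_min = 1.5 implements step (3) above: once the search has settled, the temperature collapses quickly and the algorithm converges.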
5.5. Equilibrium and Termination Conditions
At each temperature T, the algorithm is said to be at equilibrium if the number of
acceptances exceeds a constant β, or if the number of rejections at temperature T
exceeds γ = f(β). The value of β is decreased at successive temperatures, thereby
causing the effort needed to attain equilibrium to be approximately equal for all
temperatures. We consider the following expressions:

    β_n = a β_{n-1}, where 0 ≤ a < 1,

and

    γ_n = r β_n, where r < 100.
TABLE 1
α vs. T

    α    0.95    0.9    0.85    0.80    0.10
    T    150     100    50      10      1.5
The reason for choosing γ_n > β_n is that as the temperature is lowered, the number
of rejections increases. If the partitions computed cannot be improved over c
consecutive temperatures, the algorithm terminates. The choices of r, a and c were
determined as a result of our experiments.
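A sketch of the equilibrium test and the β update, with illustrative values for a and r (the function names are ours):

```python
def at_equilibrium(accepts, rejects, beta, r=10):
    """Equilibrium at temperature T: acceptances exceed β, or rejections exceed γ = r·β."""
    return accepts > beta or rejects > r * beta

def next_beta(beta, a=0.9):
    """β_n = a · β_{n-1}, 0 ≤ a < 1: fewer acceptances are demanded as T drops."""
    return a * beta

print(at_equilibrium(11, 0, 10))   # True: acceptance count exceeded β
print(at_equilibrium(5, 50, 10))   # False: neither threshold exceeded
```

Shrinking β while keeping γ = r·β a multiple of it matches the observation above: at low temperatures rejections dominate, so the rejection threshold must stay much larger than the acceptance threshold.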
6. SIMULATION ENVIRONMENT
The goal of our experiments was to study the impact of our partitioning
algorithm on the performance of a conservative synchronized parallel simulation
making use of null messages [10]. The experiments were conducted on an Intel
Paragon at CalTech. The Paragon is a distributed memory multicomputer,
consisting of 72 nodes, arranged in a two-dimensional mesh. The nodes use the
Intel i860 XP microprocessor and have 32 Mbytes of memory each. The A4 has a
peak performance of 75 Mflops (64-bit) per node. In Table 2 we show the basic
communication primitives provided by the communication system of the Paragon
A4. (Refer to [19] for more details on how these parameters are derived.)
The time t required to send a message of length n is fitted by least squares to the
straight line described by the performance parameter pair (b∞, n_1/2) [19]. Informally,
b∞ is the asymptotic bandwidth which characterizes long-message performance,
while n_1/2 tells us how quickly, in terms of increasing message length, this asymptotic
rate is approached, reaching half of its value when n = n_1/2. Formally,

    t = (n + n_1/2) / b∞,

hence the message startup, or latency, is

    t_0 = n_1/2 / b∞,

and the transfer rate (or bandwidth), b, for a message of length n is

    b = n / t = b∞ / (1 + n_1/2 / n).

The long- and short-message performance of the Intel Paragon is characterized by
b∞ = 23.5 Mbytes/s, n_1/2 = 40,044 bytes and t_0 = 172 ms.
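These relations can be checked numerically with the Paragon figures quoted above (a sketch; the constant and function names are ours):

```python
# Paragon parameters quoted above: b_inf in bytes/s, n_half in bytes.
B_INF = 23.5e6
N_HALF = 40044

def transfer_time(n):
    """t = (n + n_half) / b_inf."""
    return (n + N_HALF) / B_INF

def bandwidth(n):
    """b = n / t = b_inf / (1 + n_half / n)."""
    return n / transfer_time(n)

# At n = n_half the achieved rate is half the asymptotic bandwidth.
print(bandwidth(N_HALF) / B_INF)  # ~0.5
```

This makes the meaning of n_1/2 concrete: messages much shorter than n_1/2 are latency-dominated, while messages much longer approach b∞.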
TABLE 2
Basic Communication Primitives of Paragon A4

Communication type    Primitive    Description
Blocking              csend        Send a message and wait for completion
                      crecv        Receive a message and wait for completion
Non-blocking          isend        Send a message
                      msgwait      Wait until completion of communication
In our experiments we selected two realistic simulation models that mimic many
real-world problems: a distributed communication model and a traffic flow network.
7. SIMULATION EXPERIMENTS
Earlier simulation studies [7, 13, 28, 32] showed that the performance of a
simulation strategy is sensitive to the topology of the simulated network. We have
selected two workload models as benchmarks primarily because they provide a stress
test for any conservative simulation protocol as a consequence of their many cycles,
they do not contain any inherent bottlenecks, and they are real-world problems.
7.1. Distributed Communication Network Model
The distributed communication model [16] is the first benchmark used in our
experiments, as shown in Fig. 4. It models a national network consisting of four
regions which are connected through four centrally located delays. A node is
represented by a process. The final message destination is not known when a message is
created; hence, no routing algorithm is simulated. Instead, one-third of the arriving
messages are forwarded to neighboring nodes. A uniform distribution is employed to
select which neighbor receives a message. Consequently, the volume of messages
between any two nodes is inversely proportional to their hop distance. Messages may
flow between any two nodes, possibly through several paths. Nodes receive messages
at varying rates that are dependent on traffic patterns. There are numerous
deadlocks in this model, and hence it provides a stress test for any conservative
synchronization mechanism. The major sources are differing process delays and
generation rates, a large number of cycles, and a multitude of paths between any two
nodes. In fact, deadlocks occur so frequently and null messages are generated so
often that an efficient mechanism to reduce null-message generation is almost a
requirement.
Various simulation conditions were created by mimicking the daily load
fluctuation found in large communication networks operating across time zones.
FIG. 4. Distributed communication network model.
Therefore, in the pipeline region, for instance, we arranged the sub-region into
stages, and all processes in the same stage use the same normal distribution
with a standard deviation of 20%. The time required to execute a message is
significantly greater than the time to raise an event message in the next stage
(the message-passing delay). We use a pipeline sub-model with 26 processes, a toroid
sub-model with 25 processes, a fully connected sub-model with 5 processes, and a
circular sub-model with 20 processes. We chose a shifted exponential service time
distribution for the three sub-regions (circular, toroid, and fully connected),
i.e., 0.1 - log(uni()), where uni() returns a random real number uniformly distributed
between 0 and 1.
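As a sketch, the shifted exponential draw above can be written with uni() rendered by Python's random.random(); the seed and sample size are our own choices for illustration:

```python
import math
import random

random.seed(0)  # reproducible sketch

def service_time():
    """Shifted exponential: 0.1 - log(uni()), with uni() uniform on (0, 1)."""
    return 0.1 - math.log(random.random())

samples = [service_time() for _ in range(10000)]
print(min(samples) >= 0.1)  # True: the shift makes 0.1 a hard lower bound
print(abs(sum(samples) / len(samples) - 1.1) < 0.1)  # True: mean ≈ 0.1 + E[Exp(1)] = 1.1
```

The additive 0.1 acts as a minimum service time, which also provides the lookahead a conservative protocol can exploit.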
7.2. Traffic Flow Network Model
In this model, we represent the traffic network as a square mesh composed of
streets running in the horizontal and vertical directions with traffic lights at the
intersections of the streets; see Fig. 5. This model is partitioned into sub-systems,
which we refer to as grids. Each grid is assigned to an LP. Cars enter the
simulation and travel within and between the grids using message passing. To reflect
the traffic flow network model, we define a light interval to be a triple <r, g, y>, where r is
the number of clock ticks that the light is red, g is the number of clock ticks that the
light is green, and y is the number of clock ticks that the light is yellow. Figure 5
illustrates an example of a traffic network with two grids, where each grid contains
four lights. A car enters the simulation at a construction site labeled a Source Sink.
All of the boundary streets, as shown in this figure, are sources and sinks where cars
are generated according to a probability distribution. A car might travel either
north, south, west, or east with a fixed probability distribution of changing
directions at any given intersection. We also consider the contention problem at an
intersection, i.e., if a car is turning left into the path of a car going straight, then a
contention mechanism will inhibit one of the cars until the other car clears the
intersection. For a detailed description of the model, refer to Galluscio et al. [17].
FIG. 5. Traffic flow network model.
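The light-interval triple <r, g, y> above can be sketched as a phase lookup over the clock-tick cycle; the tick values in the example are illustrative, not from the paper:

```python
def light_phase(tick, r, g, y):
    """Return the phase of a light at a clock tick within its r + g + y cycle."""
    t = tick % (r + g + y)
    if t < r:
        return "red"
    if t < r + g:
        return "green"
    return "yellow"

print(light_phase(0, 30, 25, 5))   # red
print(light_phase(40, 30, 25, 5))  # green
print(light_phase(57, 30, 25, 5))  # yellow
```

An LP simulating a grid can evaluate this function for each of its lights without exchanging messages, so only cars crossing grid boundaries generate inter-LP traffic.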
In our experiments, we chose an average traffic flow of 100,000 cars. The number
of lights in the traffic network is held constant at 576 lights to control the
workload of the system. To make the simulation more interesting, we chose 30 hot
spots uniformly distributed among all the grids, where the probability of a car being
generated at a source is 0.25.
7.3. Performance Results
It is well known that network communication delays grow non-linearly with the
communication load [26]. Before reaching the ‘‘knee,’’ delays are almost at a
constant level. When the load exceeds the critical point, the communication network
becomes congested and delays increase exponentially. When this occurs, the
simulation virtually crumbles. Therefore, our simulation models were designed to
stay below this critical point. This limit was enforced by a restriction of the real/null
messages within the simulation. In the event that the system gets congested and the
generation of null messages is too high, the simulation halts and reports an ‘‘error.’’
From the experimental results, we will see that the partitioning algorithm
significantly reduces the number of null messages generated by the simulation.
Thus, the Chandy–Misra conservative simulation approaches the knee of the curve
more slowly than it would in the absence of partitioning.
Recall that the goal of our experiments is to reduce the running time of the
simulation models and the overhead of the synchronization protocol used to execute
these models. The experimental results were obtained by averaging several trial runs.
First, we present in the form of graphs the results for the execution time (in seconds)
and the speedup as a function of the number of processors employed in each
simulation model. The speedup, SP(K), achieved is calculated as the ratio of the
execution time ET1 required for a sequential simulator to perform the simulation on
one processor and the time ETK required for the parallel simulation to perform the
same simulation on K processors. That is, SP(K) = ET1/ETK. The sequential
simulator makes use of the splay tree data structure because empirical evidence
indicates that it is among the fastest event list implementations.
Next, we study the synchronization overhead via the null-message ratio (NMR),
which is defined as the number of null messages processed by the simulation using the
Chandy–Misra null-message approach divided by the number of real messages
processed.
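The two metrics can be stated directly; the timings and message counts below are illustrative, not measured values from the paper:

```python
def speedup(et1, etk):
    """SP(K) = ET1 / ETK."""
    return et1 / etk

def nmr(null_msgs, real_msgs):
    """Null-message ratio: null messages processed per real message processed."""
    return null_msgs / real_msgs

# Illustrative numbers only.
print(speedup(2000.0, 250.0))  # 8.0
print(nmr(30000, 10000))       # 3.0
```

A lower NMR means less of the parallel work is spent on synchronization traffic, which is exactly what a good partition buys.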
Let us now turn to our results.
Figures 6(a) and 6(b) portray the values for the execution time of the distributed
communication and traffic flow control simulation models obtained using the
random partitioning, the Nandy–Louck’s (NL) algorithm and the SA-partitioning
algorithm. As we can see from the curves, the SA-partitioning algorithm exhibits a
better performance for both models over the randomly partitioned and the NL
schemes. For both models, we observe an approximate 25% reduction in the
execution time of the simulation using the SA-partitioning algorithm over the
randomly partitioned one when 2 and 4 processors are used. Increasing the number
of processors from 4 to 8 results in a reduction of the run time of the simulation
model by approximately 25–35% using the SA-partitioning strategy when compared to
the randomly partitioned one.

FIG. 6. Execution time vs. number of processors. (a) Distributed Communication model; (b) Traffic
Flow Control.
We also observe that the NL-partitioning scheme exhibits better performance
when compared to the random one, but not as good as the SA-partitioning scheme.
Figures 7(a) and 7(b) portray the speedup for both the distributed communication
and traffic flow control simulation models, and for the random, SA-partitioning, and
Nandy–Louck's strategies. We observe significant speedups for both models. The
results show that careful static partitioning is important to the success of the
Chandy–Misra null-message simulation protocol [10]. The impact of partitioning increases
with the number of processors. This is due to the fact that the communication cost
increases with the number of processors. Moreover, these results show that the
simulated annealing optimization scheme was successful in providing a good near-
optimal partition of the simulated models.
FIG. 7. Speedup vs. number of processors. (a) Distributed Communication model; (b) Traffic Flow
Control.

Next, we wish to study the overhead of the conservative simulation protocol,
namely the NMR. We also wish to investigate how successful the SA-partitioning
scheme is in decreasing this overhead. Figures 8(a) and 8(b) display the overhead
NMR as a function of the number of processors employed in the simulation model.

FIG. 8. Synchronization overhead vs. number of processors. (a) Distributed Communication model; (b)
Traffic Flow Control.
We observe a significant reduction in null messages for both the NL- and SA-
partitioning schemes in both models, when compared to a random placement. Also,
the NMR increases as the number of processors increases for both simulation
models. For example, we observe in Figs. 8(a) and 8(b) that, if we confine ourselves to
fewer than 16 processors, approximately a 10% NMR reduction using SA-
partitioning is observed over the NL scheme and 20% over the random one for
both simulation models. Increasing the number of processors from 16 to 32, we observe
about a 15% reduction over the NL scheme, and about a 25% reduction over the
random partitioning strategy. These results clearly indicate the success of
simulated annealing in providing good near-optimal partitions that efficiently help to
reduce the NMR overhead. They also show that careful partitioning improves the
performance of conservative parallel simulation.
8. CONCLUSION
Biologically inspired techniques, such as genetic algorithms and neural networks, as
well as methods based upon natural phenomena such as simulated annealing, have been
successfully applied to optimization problems and a variety of applications. In this
context, this paper shows how simulated annealing contributes a new
tool for tackling challenging problems such as discrete event simulations.
In this paper, we have addressed the problem of partitioning a conservative
parallel simulation on a parallel computer making use of the simulated annealing
paradigm. The synchronization protocol makes use of Chandy–Misra null messages
[10, 13]. We have described a simulated annealing algorithm with an adaptive search
schedule to find good (sub-optimal) partitions. We have reported a set of simulation
experiments conducted to study the performance of our scheme using two
real-world models (i.e., distributed communication and traffic flow control). Our
results indicate that careful partitioning is important to improving the efficiency of
conservative parallel simulations. We obtained a significant reduction in the run time
of the simulations with the use of the simulated annealing algorithm described, as
well as a reduction in the null-message overhead, compared to the use of the random
and Nandy–Louck's (NL) schemes.
One remaining problem is the long computation time of simulated annealing in the
case of large-scale networks. Recent work by Nabhan and Zomaya [29, 30] has shown
that fast parallel simulated annealing algorithms with low communication overhead
are possible. We plan to investigate their approaches, and to parallelize our algorithm
using a simple and more efficient scheme. We also plan to investigate efficient
methods to collect the data necessary for the SA algorithm.
ACKNOWLEDGMENTS
We thank the anonymous referees for their valuable comments on an earlier version of the paper. Their
suggestions have greatly enhanced the quality of this paper.
REFERENCES
1. E. R. Barnes, An algorithm for partitioning the nodes of a graph, SIAM J. Algebraic Discrete Methods
3 (1982), 541–550.
2. S. Bokhari, ‘‘Assignment Problems in Parallel and Distributed Computing,’’ Kluwer Academic
Publishers, Boston, 1987.
3. A. Boukerche, Partitioning PCS wireless networks for parallel simulation, Internat. J. Interconnection
Networks 1 (2000), 173–193.
4. A. Boukerche, Time management in parallel simulation, in ‘‘High Performance Cluster Computing’’
(R. Buyya, Ed.), pp. 375–394, Prentice-Hall, Englewood Cliffs, NJ, 1999.
5. A. Boukerche and S. K. Das, Nature-inspired optimization algorithms for parallel simulations, in
‘‘Solutions to Parallel and Distributed Computing Problems: Lesson from Biological Sciences’’ (A.
Zomaya, F. Ercal, and S. Olariu, Eds.), pp. 87–109, Wiley & Sons, New York, 2001.
6. A. Boukerche and S. K. Das, ‘‘Dynamic Load Balancing Strategies for Parallel Simulations,’’
pp. 10–28, IEEE/ACM PADS, New York, 1997.
7. A. Boukerche and C. Tropper, ‘‘Parallel Simulation on the Hypercube Multiprocessor,’’ Distributed
Computing, pp. 181–190, Springer-Verlag, Berlin, 1995.
8. T. Bui, C. Heigham, C. Jones, and T. Leighton, Improving the performance of the Kernighan–Lin and
simulated annealing graph bisection algorithms, in ‘‘Proceedings of the 26th ACM/IEEE Design
Automation Conference,’’ pp. 775–778, 1989.
9. K. M. Chandy and J. Misra, Asynchronous distributed simulation via sequence of parallel
computations, Comm. ACM 24 (April 1981), 198–206.
10. K. M. Chandy and J. Misra, Distributed simulation: A case study in design and verification of
distributed programs, IEEE Trans. Software Eng. SE-5 (1979), 440–452.
11. A. Ferscha, Parallel and distributed simulation of discrete event systems, in ‘‘Handbook of Parallel
and Distributed Computing’’ (A. Zomaya, Ed.), McGraw–Hill, New York, 1995.
12. C. M. Fiduccia and R. M. Mattheyses, A linear-time heuristic for improving network partitions, in
‘‘Proceedings of the 19th Design Automation Conference,’’ Las Vegas, pp. 175–181, 1982.
13. R. M. Fujimoto, Parallel discrete event simulation, Comm. ACM 33 (1990), 30–53.
14. B. P. Gan, Y. H. Low, S. Turner, W. Cai, and W. J. Hsu, Load balancing for conservative simulation
on shared memory multiprocessor systems, in ‘‘Proceedings of the 14th IEEE/ACM PADS 2000,’’ pp.
139–146, 2000.
15. M. R. Garey and D. S. Johnson, ‘‘Computers and Intractability: A Guide to the Theory of NP-
Completeness,’’ W.H. Freeman and Company, New York, 1979.
16. D. Glazer and C. Tropper, On process migration and load balancing in time warp, IEEE Trans.
Parallel Distrib. Systems 4 (March 1993), 318–327.
17. A. Galluscio, Douglass, B. Malloy, and J. Turner, A comparison of two methods for advancing
time in parallel discrete event simulation, in ‘‘Proceedings of the 1995 Winter Simulation Conference,’’
pp. 650–657, 1995.
18. F. Glover, Tabu search, Part 1, ORSA J. Comput. 1 (1986), 190–206.
19. R. W. Hockney, The communication challenge for MPP: Intel Paragon and Meiko CS-2, Parallel
Comput. 20 (1994), 383–398.
20. L. Ingber, Simulated annealing: Practice versus theory, Math. Comput. Modeling 18 (1993), 29–57.
21. J. Keller et al., Scalability analysis for conservative simulation of logical circuits, in ‘‘VLSI Design,
Special Issue: Current Advances in Parallel Logic Simulation’’ (A. Boukerche, Ed.), Vol. 9,
pp. 219–236, 1999.
22. B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System
Tech. J. 49 (1970), 291–307.
23. H. Kim and J. Jean, Concurrency preserving partitioning algorithm for parallel logic simulation, in
‘‘VLSI Design, Special Issue: Current Advances in Parallel Logic Simulation’’ (A. Boukerche, Ed.),
Vol. 9, pp. 219–236, 1999.
24. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science 220
(1983), 671–680.
25. R. Klin and P. Banerjee, Optimization by simulated evolution with applications to standard cell
placement, in ‘‘Proceedings of the 27th ACM/IEEE Design Automation Conference,’’ 1990.
26. L. Kleinrock, ‘‘Queueing Systems,’’ John Wiley & Sons, New York, 1976.
27. S. A. Kravits and R. Rutenbar, Placement by simulated annealing on a multiprocessor, IEEE Trans.
Comput. Des. CAD-6 (July 1987), 534–549.
28. Y. B. Lin and E. D. Lazowska, ‘‘Conservative Parallel Simulation for Systems with no Lookahead
Prediction,’’ TR 89-07-07, Department of Computer Science and Engineering, University of
Washington, 1989.
29. T. M. Nabhan and A. Y. Zomaya, A parallel computing engine for a class of time critical processes,
IEEE Trans. Systems Man Cybernet., Part B 27 (1997).
30. T. M. Nabhan and A. Y. Zomaya, A parallel simulated annealing algorithm with low computation
overhead, IEEE Trans. Parallel Distrib. Systems 6 (1997), 1226–1253.
31. B. Nandy and W. M. Loucks, On a parallel partitioning technique for use with conservative parallel
simulation, in ‘‘Proceeding of the 7th Workshop on Parallel and Distributed Simulation,’’ pp. 43–51,
1993.
32. D. M. Nicol, Parallel discrete-event simulation of FCFS stochastic queuing networks, in ‘‘Proceedings
of the ACM SIGPLAN Symposium on Parallel Programming, Environments, Applications, and
Languages,’’ Yale University, July 1988.
33. L. A. Sanchis, Multiple-way network partitioning, IEEE Trans. Comput. 38 (January 1989),
62–81.
34. F. Sarkar and S. K. Das, Design and implementation of dynamic load balancing algorithms for
rollback reduction in optimistic PDES, in ‘‘VLSI Design, Special Issue: Current Advances in Parallel
Logic Simulation’’ (A. Boukerche, Guest Ed.), Vol. 9, pp. 271–290, 1999.
35. W. K. Su and C. L. Seitz, Variants of the Chandy–Misra–Bryant distributed discrete event
simulation algorithm, in ‘‘Proceedings of the SCS Multiconference on Distributed Simulation,’’
Vol. 21, 1989.
36. C. Sporrer and H. Bauer, Corolla partitioning for distributed logic simulation of VLSI-circuits, in
‘‘Proceedings of the SCS Multiconference on Parallel and Distributed Simulation,’’ SCS Simulation
Series, Vol. 23, pp. 85–92, 1993.
37. L. Tao, B. Narahari, and Y. C. Zhao, ‘‘Partitioning Problems in Heterogeneous Computing,’’
Workshop on Heterogeneous Processing, pp. 23–28, 1993.
38. A. C. Palaniswamy and P. Wilsey, An analytical comparison of periodic checkpointing and
incremental state saving, in ‘‘Proceedings of PADS’93,’’ pp. 127–134, 1993.
39. E. E. Witte, R. D. Chamberlain, and M. A. Franklin, Task assignment by parallel simulated
annealing, IEEE Trans. Parallel Distrib. Systems (1991).
40. S. Salleh and A. Y. Zomaya, Multiprocessor scheduling using mean-field annealing, J. Future
Generation Comput. Systems 14 (1998), 393–408.
41. A. Y. Zomaya and R. Kazman, Simulated annealing techniques, in ‘‘Handbook of Algorithms
and Theory of Computation’’ (M. J. Atallah, Ed.), Chap. 37, pp. 37.1–37.19, CRC Press, Boca Raton,
FL, 1999.
AZZEDINE BOUKERCHE is an assistant professor of computer sciences at the University of North
Texas, and the Founding Director of the Parallel Simulation and Distributed Systems Research
Laboratory (PARADISE) at UNT. Prior to this, he worked as a senior scientist in the Simulation
Sciences Division of Metron Corporation, located in San Diego. He was employed as a faculty member at the
School of Computer Science (McGill University), and he also taught at the Polytechnic of Montreal. He spent
the 1991–1992 academic year at the JPL-California Institute of Technology, where he contributed to a project
centered on the specification and verification of the software used to control interplanetary spacecraft
operated by the JPL/NASA Laboratory.
His current research interests include wireless networks, mobile computing, distributed systems,
distributed computing, distributed interactive simulation, parallel simulation, and VLSI design. Dr.
Boukerche has published several research papers in these areas. He was the recipient of the best research
paper award at IEEE/ACM PADS’97, the recipient of the National Award for Telecommunication
Software in 1999 for his work on a distributed security system for mobile phone operations, and has been
nominated for the best paper award at the IEEE/ACM PADS’99, and ACM MSWiM’2001. He was the
Program Co-Chair of the third IEEE International Workshop on Distributed Simulation and Real Time
Applications (DS-RT’99), and a Program Co-Chair of the 2nd ACM Conf. on Modeling, Analysis and
Simulation of Wireless and Mobile Systems (MSWiM’99), the General Co-Chair of the principle
Symposium on Modeling Analysis, and Simulation of Computer and Telecommunication Systems
(MASCOTS), in 1998, a General Chair of the 3rd ACM Conf. on Modeling, Analysis and Simulation of
Wireless and Mobile Systems (MSWiM’2000), and a General Chair of 4th IEEE International workshop
on Distributed Simulation and real Time Application (DS-RT’2000), a Chair and the main organizer of a
special Session on wireless and mobile computing at the IEEE HiPC’2000. and as a Tools-Chair for
MASCOTS 2001.
He served as a guest editor for several international journals: VLSI Design, the Journal of Parallel and
Distributed Computing (JPDC), ACM Wireless Networks (WINET), and ACM Mobile Networks and
Applications (MONET). Dr. Boukerche serves as a Program Co-Chair for the 35th SCS/IEEE/ACM
Annual Simulation Symposium (2002), a Program Co-Chair for the 10th IEEE/ACM International
Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2002),
and Steering Committee Chair of the IEEE DS-RT, and ACM MSWiM conferences.
He has been a Program Committee member for several international conferences, such as ICC, ANSS,
ICPP, MASCOTS, BioSP3, ICCI, MSWiM, PADS, WoWMoM, WLCN, and the IFIP Networking
Conference. He is an Associate Editor of SCS Transactions and an executive member of the IEEE Task
Force on Cluster Computing. Dr. Boukerche is a member of the IEEE and the ACM.