DYNAMIC LOAD BALANCING FOR CLUSTERED TIME WARP
by
Khalil M. El-Khatib
School of Computer Science
McGill University, Montreal
November 1996
A dissertation
submitted to the Faculty of Graduate Studies and Research
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Copyright © 1996 by Khalil M. El-Khatib
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
In this thesis, we consider the problem of dynamic load balancing for parallel discrete event simulation. We focus on the optimistic synchronization protocol, Time Warp.

A distributed load balancing algorithm was developed which makes use of the active process migration in Clustered Time Warp. Clustered Time Warp is a hybrid synchronization protocol; it uses an optimistic approach between the clusters and a sequential approach within the clusters. As opposed to the centralized algorithm developed by H. Avril for Clustered Time Warp, the presented load balancing algorithm is a distributed token-passing one.

We present two metrics for measuring the load: processor utilization and processor advance simulation rate. Different models were simulated and tested: VLSI models and queuing network models (pipeline and distributed networks). Results show that improving the performance of the system depends a great deal on the nature of the simulated model.

For the VLSI model, we also examined the effect of the dynamic load balancing algorithm on the total number of processed messages per unit time. Performance results show that by dynamically balancing the load, the throughput of the simulation was improved by more than 100%.
Résumé
In this thesis, we consider the problem of dynamic load balancing for discrete event simulation based on an optimistic synchronization principle.

A distributed algorithm based on process migration was developed for the TWR system (Time Warp avec Regroupement, i.e. Clustered Time Warp). TWR is a hybrid synchronization protocol: it uses an optimistic approach between the groups of processes and a sequential approach within the groups. In contrast to the centralized algorithm developed by H. Avril, our algorithm is distributed and based on the circulation of a token.

Different models were simulated and tested: VLSI circuits, assembly pipelines and distributed networks. The results obtained showed that dynamic load balancing considerably increased the performance of the simulation.
ACKNOWLEDGMENTS

I wish to place on record my profound sense of gratitude to my thesis advisor, Prof. Carl Tropper, for his advice, guidance and encouragement throughout all phases of this work.

I owe an especial debt of gratitude to Dr. Azzedine Boukerche for the vivid instructions and exchange of ideas. I have no doubt that this thesis is a better product due to his guidance.

I am very grateful to Hervé Avril for his helpful discussions and for help providing the code for CTW.

A general thanks to all my fellow consultants at the School of Computer Science. It is always a great pleasure working with them.

I would like to thank all the system staff, and especially Luc Boulianne, Matthew Sams, and Kent Tse for keeping biernat up and running. Further thanks go to all the people working in the administrative office at our school.

I greatly appreciate the friendship I share with Rami Dalloul and Nada Tammim Dalloul, who have been with me every step of the way.

I am grateful to Jacqueline Fai Yeung and Peter Alleyne for their patience and unfailing assistance. I will never forget their help.

Last but not least, I remain grateful to my family, whose moral support and encouragement were responsible for my sustained progress throughout the course of this work.
Contents
1 Introduction
  1.1 Parallel Discrete Event Simulation
  1.2 Parallel Logic Simulation
    1.2.1 Logic Simulation
    1.2.2 Partitioning
    1.2.3 Fine Computation Granularity
    1.2.4 Circuit Activity
2 Parallel Discrete Event Simulation
  2.1 Introduction
  2.2 Optimistic Approach
    2.2.1 Local Control Mechanism
    2.2.2 Global Control Mechanism
  2.3 Conservative Approach
  2.4 Load Balancing
    2.4.1 Static Load Balancing
    2.4.2 Dynamic Load Balancing
  2.5 Clustered Time Warp (CTW)
    2.5.1 Cluster Structure and Behavior
    2.5.2 Variations of Clustered Time Warp
    2.5.3 Load Balancing for Clustered Time Warp
3 Metrics and Algorithm
  3.1 Metrics
    3.1.1 Processor Advance Simulation Rate (PASR)
    3.1.2 Processor Utilization (PU)
    3.1.3 Combination of PU and PASR
  3.2 Algorithm
    3.2.1 Pseudo Code
  3.3 Design Issues
    3.3.1 Token period, P
    3.3.2 Selecting the Clusters (Processes)
    3.3.3 Number of Processors that can initiate migration
4 Results and Discussion
  4.1 Introduction
  4.2 Experimental Results
    4.2.1 Increase in rollbacks
  4.3 Pipeline Model
  4.4 Distributed Network Model
5 Conclusion
List of Figures
2.1 A Time Warp process (A)
2.2 Process A after message with timestamp 142 was processed
2.3 Rollback to time 135
2.4 Deadlock Situation
2.5 Cluster Structure
3.1 Processors with different PASRs
4.1 Simulation time for C2
4.2 Simulation time for C3
4.3 Peak memory consumption for C2
4.4 Peak memory consumption for C3
4.5 Throughput Graph for C2
4.6 Throughput Graph for C3
4.7 Manufacture Pipeline Model
4.8 Percentage Increase in the Number of Rollbacks
4.9 Percentage Reduction of the Simulation Time
4.10 Distributed Network Model
4.11 Percentage of Reduction in the Simulation Time in the Presence of Time-Varying Load
4.12 Percentage of Reduction in the Simulation Time in the Absence of Time-Varying Load
Chapter 1
Introduction
1.1 Parallel Discrete Event Simulation
Simulation is regarded as an indispensable tool for the design and verification of complex and large systems. There is hardly a discipline or a science that does not use simulation. Examples are queuing network simulation [Nico88], VLSI simulation [Bail94], and battlefield simulation [Reit90].
Discrete Event Simulation (DES) is a technique for the simulation of systems which can be modeled as discrete systems: the components of the system change states at discrete points in time. For example, gates in a logic simulation change state from 1 to 0, or vice versa.

Continuous Simulations (CS) are used to simulate models in which components change states in a continuous way with time; examples are weather and liquid flow simulation. Unlike discrete simulation, continuous simulation usually involves the use of differential equations to evaluate the change in the state value.
A simulated system is usually divided into entities which interact over time by scheduling events for one another. An event describes a change in the state of a certain entity or component in the model. Each event in the system is assigned a virtual time stamp which represents the time at which the event would happen in the real world. For instance, entities in a computer network simulation represent computer servers, and events represent real messages sent over the network.

Originally, simulations ran on sequential machines. The simulated system is modeled as follows: a state, which includes local variables, and an event queue, which contains the events scheduled for the system. The simulator continuously processes the event with the smallest time stamp, advances the simulation time, and schedules events for the future.
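This sequential loop can be sketched with a priority queue keyed on timestamps. The sketch below is illustrative only; the handler interface and names are our own assumptions, not code from the thesis.

```python
import heapq

def run_sequential_des(initial_events, handler, end_time):
    """Sequential discrete event simulation loop (illustrative sketch).

    initial_events: iterable of (timestamp, event) pairs.
    handler(state, timestamp, event): applies the event to the state and
    returns any newly scheduled (timestamp, event) pairs.
    """
    state = {}                        # local variables of the simulated system
    clock = 0                         # current simulation time
    heap = list(initial_events)
    heapq.heapify(heap)               # event queue ordered by timestamp
    while heap and heap[0][0] <= end_time:
        clock, event = heapq.heappop(heap)   # smallest-timestamp event first
        for ts, new_event in handler(state, clock, event):
            assert ts >= clock, "events may only be scheduled in the future"
            heapq.heappush(heap, (ts, new_event))
    return state, clock
```

The assertion captures the causality requirement that becomes the central difficulty once this loop is parallelized.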
Because of the rapid growth in the size and complexity of the models, Parallel Discrete Event Simulation (PDES) seems to be a promising tool to handle the demands of the world of simulation. In PDES, different components of the model reside on different machines. Each component has its own separate state and event queue. The main problem for PDES is how to satisfy the causality constraint: events that depend on each other have to be processed in (increasing) order of their timestamps.
Two main approaches to avoiding causality errors have been widely studied and used: the conservative approach, introduced by Chandy & Misra [Chan79], and the optimistic approach, pioneered by Jefferson [Jeff85]. The two approaches differ in the way they deal with the causality issue. In a conservative simulation, the logical process (LP) processes an event only when it knows it is safe to do so. An LP in an optimistic simulation processes events as soon as they arrive. If it later receives an event with a smaller time stamp, it rolls back and restores an old correct saved state. Before resuming the simulation, the LP cancels previously sent messages by sending anti-messages that annihilate the originals.
The focal point of this thesis is dynamic load balancing for Clustered Time Warp (CTW) [Avri95]. CTW is a hybrid approach that uses Time Warp between clusters of logical processes and a sequential mechanism inside the clusters. Although the problem of load balancing has been extensively studied over the past few years [Hac87, Wang85, Livn82, Zarg85], few results are directly related to the area of PDES. In this thesis, two metrics for dynamic load balancing are presented. A distributed dynamic load balancing algorithm is also constructed. Several associated models are simulated to evaluate the performance of the algorithm and metrics.
1.2 Parallel Logic Simulation

Logic simulation is a standard technique for the design and verification of VLSI systems before they are actually built. While the verification part can reduce the expensive prototypes and debugging time, the execution time required by simulation exceeds days and even weeks of CPU time for large circuits. Parallel logic simulation provides an alternative solution that accounts for the rapid increase in the size of circuits.
Two traditional algorithms for parallel logic simulation have been proposed in the literature: (1) compile-mode, and (2) centralized-time event-driven. In compile-mode simulation [Wang87, Wang90, Maur90], the system simulates all the elements after each time unit. Scheduling and synchronization are simplified in this approach at the expense of run time. The simplicity of this approach makes it feasible for direct implementation in hardware. While the compile-mode algorithm can result in high speed-ups for active circuits, the synchronization time is a hindrance for circuits with low activity. Event-driven simulation [Fran86, Smit87, Smit88, Soul88, Wils86, Wong86] provides an alternative solution in which only elements whose inputs have changed are simulated at each time step.
Compile-mode and centralized-time event-driven are both synchronous algorithms. Asynchronous algorithms used in parallel discrete event simulations have exhibited promise for parallel logic simulation.

Soule and Gupta [Soul92] discussed the effectiveness of the Chandy-Misra-Bryant [Chan79, Brya77] algorithm (CMB) for parallel logic simulation. For six benchmark circuits, the authors reported an effective concurrency of 42-196. Effective concurrency is defined as the amount of parallelism that is obtained when using arbitrarily many processors, in the absence of synchronization and scheduling overheads. The authors noticed that using the CMB algorithm, 50% to 80% of the total execution time is spent on deadlock resolution. To resolve this issue, the authors experimented with many variations such as look-ahead, null messages, clustering, and a demand-driven scheme. Speed-ups of 16-32 were achieved on a 64-processor machine.
Su and Seitz [Su89] studied logic simulation for null-message conservative simulation. To reduce the overhead of null messages, two variations were examined: (1) lazy null messages, and (2) element grouping¹. The authors used Intel iPSC machines as a testbed to measure the speed-ups of a 1376-gate multiplier network. Speed-ups of approximately 2 and 10 were reported using a 16-node iPSC/2 and a 128-node iPSC/1, respectively.
Time Warp techniques were also used for VLSI circuit simulation. Chung and Chung [Chun89] implemented Time Warp on the Connection Machine. The authors described the Lower Bound Rollback (LBR) technique to overcome storage problems. The LBR of an LP is equal to the local virtual time of the LP if it does not have incoming links. For an LP with incoming links, the LBR is set to the minimum of the local virtual time of the LP and of all of the LBRs of the incoming links. If the LP is part of a cycle, then the LBR is set to the GVT. Once the LBR is known, each LP can reclaim all the space used by events with timestamps up to the LBR. Two circuits from the ISCAS-85 benchmarks were simulated: a 16-bit combinational multiplier (2406 gates) and a 32-bit array multiplier (8000 gates). The authors reported that using the LBR technique can reduce the storage by up to 30% compared to the original Time Warp, depending on the simulated circuit.
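The LBR rule above can be phrased as a small fixed-point computation over the LP graph. The data layout below is our own illustration; it is not Chung and Chung's Connection Machine implementation.

```python
def compute_lbrs(lvt, incoming, in_cycle, gvt):
    """Lower Bound Rollback values for a set of LPs (illustrative sketch).

    lvt:      LP -> local virtual time.
    incoming: LP -> list of source LPs feeding it.
    in_cycle: set of LPs known to lie on a cycle (these fall back to GVT).
    gvt:      the current global virtual time.
    """
    lbr = dict(lvt)          # start from each LP's own LVT
    changed = True
    while changed:           # iterate until the values stop decreasing
        changed = False
        for lp, srcs in incoming.items():
            if lp in in_cycle:
                new = gvt                                   # cycle: LBR = GVT
            elif srcs:
                new = min(lbr[lp], min(lbr[s] for s in srcs))
            else:
                new = lvt[lp]                               # no incoming links
            if new != lbr[lp]:
                lbr[lp] = new
                changed = True
    return lbr
```

Events with timestamps up to an LP's LBR can then be reclaimed, as the passage describes.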
Briner et al. [Brin91] present another version of Time Warp for logic simulation in which the algorithm uses incremental state saving to reduce the overhead of periodically checkpointing all events on a processor. The simulation runs on a BBN GP1000 machine, and it uses a lazy cancellation scheme. Even though the authors reported an improvement in the performance of the simulation, they noticed that circuit partitioning is important in order to reduce rollbacks in the system.

Implementations of Time Warp can also be found in [Bauc91]. The simulator was implemented on a Sequent, and also on a set of Sun SPARCstation 2's. Speed-ups from 2 to 3 were reported using a 5-processor Sequent, and 5.2 to 5.7 using a network of 8 Sun SPARCstations. Simulated circuits were selected from ISCAS-89 [Brgl89].

¹Same idea as CTW, introduced in section 2.5.
1.2.1 Logic Simulation
VLSI circuits can be simulated at various levels, depending on the abstraction of the simulated elements. Each level emphasizes certain aspects of the simulated circuit. Four types of logic simulators are presented in [Horb86]:

• Switch level simulators
The switch level simulator is a special type of gate simulator which can simulate MOS circuits more precisely than the conventional gate level simulators.

• Gate level simulators
The simulated elements are logic gates such as ANDs, ORs and INVERTERs. Each element has only one output terminal.

• Functional level simulators
The simulated elements are functional primitives such as latches, flip-flops, counters and registers. The elements may have more than one output terminal.

• Register transfer level simulators
Register transfer level simulators provide the ability to define the behavior of the simulated circuits with a register transfer language.
Each circuit element (gate) is represented by a logical process (LP). LPs are connected with each other as determined by the circuit connections. Each LP has its own code that simulates the function of the corresponding circuit element.

Because of the rapid increase in the size of VLSI circuits [Mukh86, Term83], large simulation times have become a serious problem. VLSI simulation poses many challenging problems for parallel discrete event simulation. The following text describes three major problems: circuit partitioning, computation granularity and locality of activity.
1.2.2 Partitioning

The distribution of LPs on the processors of a distributed computing system has a great impact on the performance of the simulation. A parallel simulation with a bad distribution might run slower than its equivalent serial simulation. VLSI circuits are hard to partition [Cart91, Spor93, Bail94] because of the lack of information on signal activity: partitioning can only be done based on the structural information of the circuit.
A number of partitioning algorithms [Spor93, Bail94] attempt to balance the load by assigning an equal number of components to all of the processors. This technique ignores the fact that all parts of the circuit might not be active at the same time [Soul87].

Another approach to partitioning is to minimize interprocessor communication. One technique is to place frequently communicating components on the same processor [Nand92]. Carothers et al. [Caro95] showed that an increase in communication delays can significantly reduce the performance of Time Warp for simulations with a large number of LPs and fine-grained events, such as logic simulation.
1.2.3 Fine Computation Granularity
The fine computation granularity of events is a serious problem for optimistic logic simulation. The message service time at an LP is very small (e.g., evaluating the output of an AND gate), which makes message cancellation difficult in case of rollback; anti-messages cannot always annihilate previously sent messages quickly enough, possibly leading to a cascaded rollback. Moreover, the number of events is in the hundreds of thousands. A change in the state of a gate, from high to low or vice versa, will create an event which might trigger thousands of other events. The large number of events means that the sizes of the event heap, the state queue, the input queue and the output queue are very large, and any operation on these data structures becomes correspondingly expensive.
1.2.4 Circuit Activity
VLSI circuits are known to have a locality of activity property [Avri96, Bail94, Soul92], i.e. only a part of the circuit may be active at a given time, while the remaining parts are idle. This property can cause a big difference in processor utilizations: some processors might be very active while many others are idle.
Chapter 2
Parallel Discrete Event Simulation
2.1 Introduction
This chapter describes, in sections 2.2 and 2.3, the two major techniques for parallel simulation: the optimistic approach and the conservative approach. Section 2.4 includes a general description of load balancing. In section 2.5, we introduce Clustered Time Warp. Clustered Time Warp uses Time Warp synchronization between clusters and a sequential mechanism within the clusters.
2.2 Optimistic Approach
In 1983, Jefferson [Jeff83] proposed a new paradigm for synchronizing distributed systems. This new paradigm, known as virtual time, provides a flexible abstraction of real time for distributed systems.

Time Warp was introduced as an optimistic synchronization protocol which implements the virtual time paradigm. It uses run-time detection of errors caused by out-of-order execution of dependent simulator events, and recovery using a rollback mechanism.
The following subsections describe the two main components of Time Warp: the local and the global control mechanisms. The local control mechanism is responsible for process synchronization. The global control mechanism deals with space management.
2.2.1 Local Control Mechanism
In Time Warp, each message consists of six components: (1) the sender of the message, (2) the receiver, (3) the virtual send time, (4) the virtual receive time (also known as the timestamp), (5) a sign field used to indicate whether the message is an ordinary message or an antimessage, and finally (6) the event data.
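These six fields map naturally onto a small record type. The field names and the +1/−1 sign convention below are our own illustration; the thesis does not fix a concrete encoding.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TWMessage:
    sender: str        # (1) sending process
    receiver: str      # (2) receiving process
    send_time: int     # (3) virtual send time
    recv_time: int     # (4) virtual receive time (the timestamp)
    sign: int          # (5) +1 ordinary message, -1 antimessage
    data: object       # (6) event data

def antimessage(msg: TWMessage) -> TWMessage:
    """An anti-message is a duplicate of the message with a negative sign."""
    return replace(msg, sign=-msg.sign)

def annihilates(a: TWMessage, b: TWMessage) -> bool:
    """A message/anti-message pair cancel when all fields but the sign match."""
    return a.sign == -b.sign and replace(a, sign=b.sign) == b
```

The `annihilates` predicate is what the receiver's queue management applies when a message meets its anti-message.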
Each process in Time Warp has its own local virtual time (LVT), which is the timestamp of the next message waiting in the input queue. In addition, each process keeps a virtual clock that reads the local virtual time.

Each process has a single input queue in which all arriving messages are stored in increasing order of their timestamps. A process serves all of the messages residing in its input queue despite the fact that future messages might have lower timestamps.
Since the virtual clocks in the system are not synchronized, it is possible for a process to receive a message with a virtual time in the "past". Such a message is called a straggler. In order to maintain the causality constraint, the receiving process has to roll back to an earlier virtual time, cancelling all work done and sending "anti-messages" to cancel all of the messages sent after the timestamp of the straggler.

In order to roll back, a process must, from time to time, save its state in a state queue. Whenever a process receives a straggler, it uses the state queue to restore itself to a state prior to the timestamp of the straggler.

An anti-message is a duplicate of the message, with a negative sign field. Each time a process sends a message, its anti-message is retained in the sender's output queue. In the case of a rollback, all of the anti-messages with timestamps higher than that of the straggler are sent out.
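Combining the state queue and the output queue, straggler handling can be sketched as follows. The dict layout is purely illustrative; real Time Warp implementations keep richer structures.

```python
def roll_back(lp, straggler_ts):
    """Roll an LP back past a straggler (illustrative sketch).

    lp: dict with 'state_queue' as [(ts, state), ...] in increasing ts,
    'output_queue' as [(ts, antimessage), ...], and 'lvt'.
    Returns the anti-messages that must now be sent to their destinations.
    """
    # 1. restore the last state saved before the straggler's timestamp
    kept_states = [(ts, s) for ts, s in lp["state_queue"] if ts < straggler_ts]
    lp["state_queue"] = kept_states
    lp["lvt"] = kept_states[-1][0] if kept_states else 0
    # 2. release retained anti-messages with timestamps higher than the straggler's
    to_send = [a for ts, a in lp["output_queue"] if ts > straggler_ts]
    lp["output_queue"] = [(ts, a) for ts, a in lp["output_queue"]
                          if ts <= straggler_ts]
    return to_send
```

The caller then delivers the returned anti-messages, which annihilate the now-invalid messages at their receivers.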
Figures 2.1, 2.2 and 2.3 show the structure and state of a process during three consecutive snapshots. Figure 2.1 shows process A about to process the message with timestamp 142, coming from process D. After processing that message (figure 2.2), process A saves its state and changes its local virtual time to 154, which is the timestamp of the next message waiting in the input queue. At this moment, process A receives a message (straggler) with timestamp 135. Figure 2.3 shows process A after rolling back to a safe state: (1) the last state saved before timestamp 135 was restored; (2) antimessages in A's output queue that have a virtual timestamp larger than 135 have been transmitted to their destinations.
2.2.2 Global Control Mechanism
During the simulation, each process uses its memory for saving states, storing input messages and keeping copies of output messages. Since each process has a limited memory, space has to be reclaimed regularly for further use.

Global virtual time (GVT) was introduced by Jefferson [Jeff83] to help solve the problem of memory management:

The GVT at real time r is the minimum of (1) all virtual times in all virtual clocks at time r, and (2) the virtual send times of all messages that have been sent but have not yet been processed at time r.
Figure 2.1: A Time Warp process (A)

Figure 2.2: Process A after message with timestamp 142 was processed

Figure 2.3: Rollback to time 135
Once the GVT is computed, the system can get rid of all but the last saved state before the GVT, since there might be no state saved exactly at the GVT. Similarly, the events prior to the GVT in the LP input queue can all be removed; only events with timestamps smaller than the timestamp of the oldest kept state can be discarded. This process is called fossil collection.
The task of finding the GVT in a distributed environment is not trivial. In a shared-memory multiprocessor environment, the GVT can be easily computed [Fuji89]; however, transient messages in a distributed system make GVT computation a difficult problem. A transient message is a message that has been sent from a source process but has not yet arrived at the destination process.

A very simple way to find the GVT is to stop all processes, compute the GVT, and then resume the computation. The problem with this approach is the high cost in time, especially when the number of processes is large and when the GVT computation is done frequently.
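Under Jefferson's two-part definition, the stop-the-world computation reduces to a pair of minima. The sketch below assumes both sets can be collected atomically, which is exactly what is hard in a real distributed system.

```python
def gvt_stop_the_world(local_clocks, unprocessed_send_times):
    """GVT once everything is frozen (illustrative sketch): the minimum over
    (1) every process's virtual clock and (2) the virtual send time of every
    message that has been sent but not yet processed, including messages
    still in transit."""
    return min(min(local_clocks),
               min(unprocessed_send_times, default=float("inf")))
```

The `default` handles the case where no messages are outstanding, so the GVT is simply the smallest virtual clock.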
Some GVT algorithms [Jeff83, Sama85] solve the problem of transient messages by sending acknowledgment messages. However, sending acknowledgment messages results in a large increase in network traffic, and might also degrade the performance of the distributed simulation.

Lin and Lazowska [Lin89] proposed another algorithm for GVT computation, in which an acknowledgment is sent for a group of messages, rather than for each message. The authors reported a 50% decrease in network traffic compared to an optimized version of Samadi's [Sama85] algorithm proposed by Bellenot [Bell90].

As the number of processes in the simulation grows larger, sending acknowledgments becomes a serious impediment. In response to this problem, asynchronous token-passing algorithms [Bell90, CouB1, Prei89] have been introduced. Preiss's [Prei89] algorithm, presented next, computes the GVT for nodes arranged on a virtual ring. The algorithm runs in two phases: the START phase and the STOP phase.
Let [STARTi, STOPi] denote a real time interval for node i. Let MVTi denote the minimum of all timestamps of all messages either in transit from node i at time STARTi or sent during the time interval [STARTi, STOPi]. Let PVTi denote the smallest timestamp at time STOPi of all objects simulated on node i. LVTi represents the minimum of MVTi and PVTi. Then an estimate of the GVT is the minimum of all the LVTi's. The algorithm requires that all of the time intervals have at least one time point in common, during which all processors will be computing their LVTi's at the same time.
During the START phase, the START token starts at node 0 and circulates to each node on the ring. When a process receives the START token, it forwards it to its neighbor, and then sets its STARTi to the receipt time of the token.

When the START token gets back to node 0, it launches the STOP phase. Node 0 computes its LVT0, stores it inside the token, and sends the token on. When a node i receives the STOP token, it compares its LVTi to the GVT value in the token, and the minimum value is kept in the token. When the token gets back to node 0, the value stored inside the token is the smallest of all of the LVTi, which denotes the GVT. Another round is needed to pass the new GVT to all nodes.
2.3 Conservative Approach
While optimistic simulation detects and recovers from causality errors by rolling back to a state just prior to the error, a process in a conservative simulation blocks so as to avoid the possibility of any synchronization error.

The channel clock value is defined as the timestamp of the last message received along that channel; in case no message has been received along the channel, the channel clock value is set to 0. Each process repeatedly selects the input channel with the smallest clock value, and serves the message waiting in that channel. If the selected channel is empty, the process blocks, since it does not know the timestamp of the next message to arrive on that empty channel.
A deadlock can happen as a natural consequence of this blocking behavior. Figure 2.4 shows three processes in a deadlock situation. Peacock et al. [Peac79] presented a necessary and sufficient condition for deadlock to occur:

A deadlock exists if and only if there is a cycle of empty channels which all have the same channel clock value, and processes are blocked because of these channels.
A number of approaches have been proposed to solve the deadlock problem. The first approach, presented independently by Chandy & Misra [Chan79] and Bryant [Brya77], uses null messages to avoid deadlock. A null message provides a lower bound on the timestamp of the next incoming message. Each time a process consumes a message, it must send a message on each of its output channels. If the simulation does not require a regular message on a channel, a null message is sent in its place. When a process receives a null message on one of its input channels, it is guaranteed that no message will arrive with a smaller timestamp than that of the null message.

Figure 2.4: Deadlock Situation
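One step of such a conservative LP might look like the sketch below. The data layout is our own; the increment added to the consumed timestamp plays the role of the lookahead discussed later in this section.

```python
from collections import deque

def cmb_step(channels, clocks, lookahead):
    """One step of a Chandy-Misra-Bryant LP (illustrative sketch).

    channels: name -> deque of (timestamp, payload); payload None = null message.
    clocks:   name -> channel clock (timestamp of last message received).
    Returns (event_or_None, null_bound), where null_bound is the lower bound
    the LP would send on each output channel, or ('blocked', channel) if the
    smallest-clock channel is empty.
    """
    ch = min(clocks, key=clocks.get)          # channel with the smallest clock
    if not channels[ch]:
        return ("blocked", ch)                # next timestamp unknown: block
    ts, payload = channels[ch].popleft()
    clocks[ch] = ts                           # advance the channel clock
    # a null message only advances the clock; a real one is processed
    event = (ts, payload) if payload is not None else None
    return (event, ts + lookahead)            # bound for outgoing null messages
```

The 'blocked' outcome is exactly the situation from which the deadlock of Figure 2.4 can arise when it occurs around a cycle.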
The null message approach suffers from two drawbacks: the fine granularity of the null messages and the minimum time stamp increment. For the simulation of systems of simple elements, such as logic gates, the time required to evaluate the output of a gate is so small that it is comparable to the time required to process a null message. In addition, the number of regular messages in logic simulation is in the millions, and if for each message there is only one null message sent, this would double the network traffic.
To reduce the problem of network traffic introduced by the use of null messages, Misra [Misr86] suggested that a process refrain for a period of time from sending a null message, expecting that a real message will be sent over the same channel. After a predetermined period of time, a null message can be sent in case no real message is sent. Misra [Misr86] also suggested the demand-driven technique, in which a process sends a request to its predecessor for the lower bound on the time of its next output message.
Several versions of the previous algorithm have been suggested. Su and Seitz [Su89] suggested a lazy message technique, in which several consecutive null messages are replaced with one null message. Nicol [Nico88] proposed the use of a future list. Each process, making use of its input list, can estimate the lower bound on the timestamp of the next message. This estimate, defined as lookahead, depends on the service times of the messages in the input queues.
Many empirical results [Nico88, Fuji89, Fuji88] have shown that good performance of the simulation depends on a good lookahead estimate. Intuitively, a good lookahead reduces the processors' blocking time. For example, if an LP has a clock value t and a lookahead l, then this LP can predict that it will not receive any further message with a timestamp less than t + l. The LP is then safe to process all pending events with timestamp less than t + l. The problem with using lookahead is that it requires knowledge about the model, which is sometimes not possible without a pre-computation.
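The safe-processing rule above can be sketched in a few lines; the function name and the event representation (bare timestamps) are ours, not from the cited works:

```python
def safe_events(pending, clock, lookahead):
    """Return the pending event timestamps an LP may process safely.

    With local clock `clock` and lookahead `lookahead`, no future
    message can carry a timestamp below clock + lookahead, so every
    pending event stamped strictly earlier than that bound is safe.
    (Illustrative sketch only.)
    """
    bound = clock + lookahead
    return [ts for ts in sorted(pending) if ts < bound]
```

A larger lookahead directly enlarges the set of safe events, which is why a good estimate reduces blocking.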
While the previous algorithms avoid deadlocks, others detect and break them. Unlike the deadlock avoidance approach, this approach does not prohibit cycles of zero timestamp increment. The first such algorithm was presented in [Misr82]. The algorithm uses diffusion computation to detect whether a given node is part of a cycle or a knot. The algorithm uses special messages, called probes, that are sent along the outgoing edges of the nodes in the graph. The deadlock can be broken by observing that the smallest timestamped message in the entire system is always safe to process. The problem with this approach is that it is difficult to decide when such an algorithm should be employed: if the algorithm is used too frequently, then it might degrade the performance of the simulation. On the other hand, if a deadlock is not broken quickly, the performance of the simulation will also deteriorate.
Lubachevsky [Luba88] proposed another alternative conservative scheme based on the idea of bounded lag (BL). The BL algorithm uses a moving time window in which only events whose timestamp lies in the time window are eligible for processing. The BL algorithm suffers from two problems: (1) the smallest timestamp of all events in the system has to be computed periodically and sent to all processes, which adds synchronization overhead to the simulation; (2) the size of the time window is critical to achieving good performance. Finding the adequate size of the window requires knowledge about the application.
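A minimal sketch of the window test at the heart of the BL algorithm; the representation (bare timestamps, a precomputed system-wide floor) and names are ours:

```python
def eligible(events, floor_ts, window):
    """Bounded-lag eligibility: only events whose timestamp lies in
    [floor_ts, floor_ts + window) may be processed this iteration.

    `floor_ts` stands for the periodically computed smallest
    timestamp of all events in the system; `window` is the
    application-dependent lag bound. (Illustrative sketch only.)
    """
    return [ts for ts in events if floor_ts <= ts < floor_ts + window]
```

The periodic recomputation of `floor_ts` is exactly the synchronization overhead noted above, and the choice of `window` is the tuning problem.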
2.4 Load Balancing
In a distributed system, it is likely that some processors are heavily loaded while others are lightly loaded or idle. Efficient utilization of parallel computer systems requires that the job being executed be partitioned over the processors in an efficient way.
In the general partitioning problem, one is given a parallel computer system (with a specific interconnection network) and a parallel job or task composed of modules or units that communicate with each other in a specified pattern. One is required to map the modules onto the processors in a way that minimizes the total execution time.
The mapping or assignment can be categorized as being either static or dynamic. The static approach¹ assumes a priori knowledge of the execution time and communication pattern of the simulation, and the task-to-processor mapping is decided ahead of run-time. The dynamic approach, in contrast, moves jobs or modules between processors from time to time, whenever this leads to improved efficiency.
In this section, a general review of the field of load balancing is presented with emphasis on the work done in the context of PDES. The first subsection describes some approaches and solutions to the problem of static partitioning. The second subsection outlines some algorithms for dynamic load balancing.
2.4.1 Static Load Balancing
This subsection briefly reviews some of the previous work done on static load balancing. Algorithms for this paradigm rely on a priori knowledge of the computation and communication cost for each task in the system.
¹Also known as static partitioning, compile-time partitioning, and the mapping problem [Tant85, Bouk94].
In terms of graph theory, the problem can be described as follows:
Given a weighted graph G with costs on its edges, partition the nodes of G into subsets no larger than a given maximum size, so as to minimize the total cost of the cut edges and the difference in the subset weights.
Weights on nodes and edges represent processing and communication cost.
Since the problem is known to be NP-complete [Gare79], heuristics [Bokh81, Bokh87, Ni85] have been developed to obtain suitable partitions. Kernighan and Lin [Kern70] proposed a 2-way partitioning algorithm with constraints on the subset size: given a graph of 2n vertices (of equal size), the algorithm splits the graph into two subsets of n vertices each with minimum cost on all edges cut. The algorithm starts with an arbitrary partition A, B of the graph and tries to decrease the initial external cost by a series of interchanges of subsets of A and B. The external cost is defined as the sum of the connections between the two partitions. Using the 2-way partition as a tool, the authors extended the technique to perform a k-way partition on a set of kn objects.
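The external cost and a single pairwise-interchange step can be sketched as follows; the graph representation (edge weights keyed by vertex pairs) and function names are illustrative, not Kernighan and Lin's actual formulation:

```python
def external_cost(A, B, w):
    """Sum of the weights of edges crossing the (A, B) partition.
    `w` maps frozenset({u, v}) -> edge weight. (Sketch only.)"""
    return sum(c for e, c in w.items()
               if len(e & A) == 1 and len(e & B) == 1)

def best_swap(A, B, w):
    """One interchange step in the Kernighan-Lin spirit: try every
    single-vertex swap between A and B and return the pair whose
    interchange lowers the external cost the most, or None if no
    swap improves it. (Greedy sketch; KL proper uses gain tables
    and sequences of tentative swaps.)"""
    base = external_cost(A, B, w)
    best, gain = None, 0
    for a in A:
        for b in B:
            A2 = (A - {a}) | {b}
            B2 = (B - {b}) | {a}
            g = base - external_cost(A2, B2, w)
            if g > gain:
                best, gain = (a, b), g
    return best
```

Repeating `best_swap` until it returns None yields a local optimum, mirroring the series of interchanges described above.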
Bokhari [Bokh81] studied the mapping problem on a Finite Element Machine (FEM).² Starting with a random initial mapping, the heuristic algorithm proceeds by sequences of pairwise interchanges between processors till the best mapping is found. Wei and Cheng [Wei89] proposed a partitioning approach called the ratio cut. Their algorithm runs by identifying clusters in a circuit, and avoids any cut through these clusters.
Most of the previous approaches to the partitioning problem are suitable for restricted applications, but they don't quite satisfy the properties of parallel simulation because of the synchronization constraint among processors. Static partitioning for parallel simulation has recently become a focus of research.
²The Finite Element Machine is an array of processors used to solve structural problems.
Nandy and Loucks [Nand92] presented a partitioning and mapping algorithm for conservative PDES. The objective of the algorithm is to reduce the communication overhead and to evenly distribute the execution load among all of the processors. The algorithm starts with an initial random allocation of processes to clusters, and then iteratively moves processes between clusters until no further improvement can be found (a local optimum). A process is moved if its reallocation does not violate the size constraint of the processor. Using the proposed algorithm, three small circuits were partitioned, mapped, and simulated on a network of transputers. Results showed a 50% improvement in the simulation performance over random partitioning.
Recently, Sporrer and Bauer [Spor93] have presented a hierarchical partitioning for the Time Warp simulation of VLSI circuits called the "Corolla partitioning". The objective of the algorithm is to minimize the number of interconnecting signals while keeping the sizes of the partitions nearly equal. The basic idea of the algorithm is to detect strongly connected regions and use them as indivisible components, called petals, for partitioning. The algorithm uses a hierarchical method which, in the first step, creates a fine-grained clustering of the circuit with a minimum number of interconnections, regardless of the size of the partitions. In the second step, equal-sized partitions are formed at a coarse-grained level based on the connectivity matrix. To evaluate the partitioning algorithm, a comparison was done with three other partitioning algorithms [Fidu82, Hilb82, Smit87]. Six benchmark circuits were partitioned and rated with the four different algorithms. The simulation was run on a network of SunSparc 2 workstations connected via Ethernet. The achieved speedups were almost linear even for large numbers of partitions.
Another approach to static partitioning known as simulated annealing (SA) was proposed by Kirkpatrick et al. [Kirp83]. The SA algorithm draws upon an analogy with the behavior of a physical system in the presence of a heat bath. The algorithm uses an adaptive search schedule to find a good (near optimal) partition. Starting with an initial partition, the algorithm moves processes between processors until a termination or equilibrium condition is met.
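A generic simulated-annealing skeleton in the spirit of [Kirp83]; the geometric cooling schedule, parameter names, and defaults are our assumptions, not specifics from the cited work:

```python
import math
import random

def anneal(cost, move, state, t0=5.0, cooling=0.9, steps=300, seed=7):
    """Simulated-annealing skeleton: always accept an improving move,
    accept a worsening move with probability exp(-delta / T), and let
    the temperature T decay geometrically. Returns the best state
    seen. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    t = t0
    best = state
    for _ in range(steps):
        cand = move(state, rng)
        delta = cost(cand) - cost(state)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state = cand            # accept the candidate move
        if cost(state) < cost(best):
            best = state            # track the best state visited
        t *= cooling                # cool the system
    return best
```

For partitioning, `state` would be a process-to-processor mapping and `move` a random reassignment; here any cost/move pair works.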
Recent work on partitioning using simulated annealing appeared in [Bouk94]. The authors proposed an objective function to evaluate the quality of the partitioning solution generated by the algorithm. The objective function depends on inter-processor communication, the distribution of loads, and the number of null messages between processors. A set of experiments was conducted to find the impact of the partitioning algorithm on the performance of a conservative simulation using null messages. The authors used an Intel iPSC/i860 hypercube as a platform for the simulation. In the experiments, a queuing network model in the form of a torus with FCFS service discipline was used as a benchmark. Results showed a 25-35% reduction in the execution time of the simulation using the proposed algorithm over random partitioning when 2 and 4 processors are used.
2.4.2 Dynamic Load Balancing
The literature is rich in general-purpose dynamic load balancing algorithms [LuBo, Iqba86, Ande87], but only a handful of these algorithms apply to PDES.
In [Shan89], the author presents a simple dynamic load balancing scheme for conservative simulation. Three phases take place in his dynamic allocation scheme. In the first phase, the process(es) which need allocation are identified: these are the processes whose service times fall outside a predefined interval of service time. In the second phase, the process(es) which should be migrated are identified: processes identified in phase one which have predecessors or successors on different processors are considered for reallocation. The third phase identifies the processors to which the process(es) are reallocated: processors containing successors or predecessors of the selected processes will be chosen first. The algorithm was tested on two logical systems (pipelines) on an iPSC Hypercube with 4 nodes. Experiments showed that when using the load balancing algorithm, the running time of the simulation is reduced only when the system exhibits a non-uniform distribution of messages.
Boukerche and Das [Bouk97] presented a dynamic load balancing algorithm for conservative simulation using the Chandy-Misra [Chan79] null message approach. Their algorithm uses the CPU queue length as a metric that indicates the workload at each processor. In their approach, a Load Balancing Facility (LBF) is implemented in two separate phases: load balancing and process migration. In the first step of the algorithm, every processing element in the simulation sends the size of its CPU queue to a central processor, in which processors are classified according to the deviation of their respective queue lengths from the mean. The second step for the load balancer is to select the heavily overloaded processes from the heavily overloaded processors. Experiments were conducted on an Intel Paragon A4. The authors chose an N x N torus queuing network model with an FCFS service discipline as a benchmark in their experiments. Their results showed approximately a 25-30% reduction in the simulation time using dynamic load balancing over a random static partitioning when 2 processors were employed. A reduction of 40% was observed when the number of processors increases from 4 to 8. Similarly, the authors observed a 50% reduction in the run time when changing the number of processors from 8 to 16. The authors also reported a reduction in the synchronization overhead³ with dynamic load balancing: when fewer than 4 processors are used, the reduction was approximately 25-30% in the synchronization overhead. When 8 and 16 processors were used, the reduction was 10-40%.
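The centralized classification step can be sketched as follows; the tolerance parameter, the population standard deviation, and the data layout are our assumptions, not details from [Bouk97]:

```python
def classify(queue_lengths, tol=1.0):
    """Classify processors by the deviation of their CPU queue
    length from the mean. Returns (overloaded, underloaded) lists of
    processor indices; a processor is overloaded (underloaded) when
    its queue length deviates from the mean by more than tol
    standard deviations. (Illustrative sketch only.)"""
    n = len(queue_lengths)
    mean = sum(queue_lengths) / n
    sd = (sum((q - mean) ** 2 for q in queue_lengths) / n) ** 0.5
    over = [i for i, q in enumerate(queue_lengths) if q - mean > tol * sd]
    under = [i for i, q in enumerate(queue_lengths) if mean - q > tol * sd]
    return over, under
```

The second phase would then pick heavily loaded processes on the `over` processors and migrate them toward `under` processors.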
Burdorf and Marti [Burd93] deal with the problem of dynamic load balancing for the RAND Time Warp system. The system runs on a set of workstations shared with other users, which creates a large variation in the loads depending on the number of users and processes. Initially, the static balancer assigns objects to processors according to the load, which is gathered by running a presimulation. During the simulation, the balancer records the smallest simulation time of all objects on each machine. To minimize rollbacks, the dynamic load balancer seeks to minimize the variance between the objects' simulation times. The authors presented four schemes for load balancing: three of them use Local Virtual Time (LVT) as a metric, and the fourth uses the ratio of total processed messages to the total number of rollbacks. The algorithm was implemented on a simulation of colliding shapes moving in a constricted space. They report that using the average of processed messages was not feasible for their simulation, while the best results were achieved when objects which are furthest ahead are moved to machines which are furthest behind, and objects which are furthest behind are moved to machines which are furthest ahead. Results using the dynamic load balancing strategy showed a five to ten times performance improvement over simulation with only static balancing.
³Defined as the number of null messages processed by the simulation using the Chandy-Misra null-message approach divided by the number of real messages processed.
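The best-performing pairing rule, moving the object furthest ahead in virtual time to the machine furthest behind, can be sketched as follows; the data layout and names are ours, not [Burd93]'s:

```python
def rebalance_move(machine_lvt, object_lvt):
    """Pick one migration in the Burdorf/Marti spirit: the object
    that is furthest ahead in Local Virtual Time goes to the machine
    that is furthest behind, shrinking the variance of the objects'
    simulation times.

    machine_lvt: {machine: smallest LVT of any object on it}
    object_lvt:  {object: its LVT}
    Returns (object, destination machine). (Illustrative sketch.)"""
    slowest = min(machine_lvt, key=machine_lvt.get)
    ahead = max(object_lvt, key=object_lvt.get)
    return ahead, slowest
```

The symmetric move (furthest-behind object to furthest-ahead machine) follows the same pattern with min and max exchanged.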
Schlagenhaft et al. [Schl95] describe a dynamic load balancing algorithm for Time Warp VLSI circuit simulation. The algorithm consists of three components: (1) load sensor; (2) load evaluator; and (3) load adaptor. The load sensor computes the Virtual Time Progress (VTP) for each simulation process. The VTP reflects how fast a simulation process progresses in virtual time. The load sensor first calculates the Integrated Virtual Time (IVT) at the end of each simulation step. The IVT of a processor is defined as the average of all of the virtual times of the clusters residing on that processor. The VTP during a time interval [Ti, Tj] is then computed as the change in the IVT per unit real time. The load evaluator decides whether to launch process migration or not, depending on the ratio between the actual and the predicted VTP; if the ratio is big enough (migration is worthwhile), then load balancing is initiated. Process migration is then controlled by the load adaptor. The authors simulated one logical circuit (s13207 [Brgl89]) on two processors shared with other users. Corolla partitioning was used to partition the circuit into small clusters, with many of them mapped to each processor. The results showed that the algorithm reduces the lag between the VTPs of the processors, which resulted in an increase in the advance of the GVT, and hence a 24% reduction in the simulation time.
Reiher and Jefferson [Reit90] presented a dynamic load balancing algorithm for the Time Warp Operating System (TWOS). Their algorithm was tested on a battlefield simulation in which objects are created and destroyed concurrently. For this implementation, dynamic load balancing was necessary in order to ensure good assignment of the new objects to processors, and to balance the load after some objects have been destroyed. In their simulation, the life span of an object is divided into phases, which can run on different processors. Each phase is responsible for handling an object's data and variables for one interval of virtual time. The algorithm tries to balance the load using an estimate of effective work, defined as the portion of the work that will not be rolled back. The authors tested their algorithm on two simulation models: a military simulation and a two-dimensional colliding pucks simulation. Their results showed that speedups for the pucks simulation are equal to or less than the normal speedups because of the relatively even balance of pucks. Because of the inherent imbalance in the battlefield simulation, the dynamic load balancing algorithm improved the speedup by 25%.
Glazer and Tropper [Glaz93] introduced the notion of a time slice and simulation advance rate (SAR) in their work on dynamic load balancing. The purpose of the dynamic load balancer was to reduce the number of rollbacks, and hence reduce the simulation time. To this end, the SAR is computed at each processor and sent to one dedicated processor. The SAR is defined as the advance of the simulation clock divided by the CPU time needed for the advance. The load of a process, derived from its SAR, is a measure of the amount of CPU time it requires to advance its local simulation clock by one unit. The length of the time slice for a process is determined by its load: each process is given a time slice proportional to the ratio of its load to the mean of all of the loads. To evaluate the performance of the algorithm, three different simulation models were constructed, representing different classes of models: a pipeline model, a hierarchical network model and a distributed network model. Experiments were conducted on PARALLEX, an emulation of a parallel simulation on a uniprocessor machine. The authors reported a 16-33% decrease in rollbacks and a 19-37% increase in the simulation rate when only 8 processors were used. Results showed better improvement when the number of processors was increased: 42-71% and 47-49% decrease in the rollbacks when using 16 and 32 processors respectively, in addition to 30-39% and 49-65% increase in the simulation rate.
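The time-slice assignment rule can be sketched as follows; `base_slice` and the data layout are illustrative assumptions, not parameters from [Glaz93]:

```python
def time_slices(loads, base_slice=10.0):
    """Assign each process a time slice proportional to the ratio of
    its load to the mean load, where a process's load is the CPU
    time it needs to advance its local clock by one unit (the
    inverse of its SAR). (Illustrative sketch only.)"""
    mean = sum(loads.values()) / len(loads)
    return {p: base_slice * (l / mean) for p, l in loads.items()}
```

Slow processes (high load) thus receive longer slices, which evens out the rates at which local clocks advance.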
Most recently, Carothers and Fujimoto [Caro96] proposed a load distribution system for background execution of Time Warp. The system is designed to use the free cycles of a collection of heterogeneous machines to run a Time Warp simulation. Their load management policy consists of two components: the processor allocation policy and the load balancing policy. The processor allocation policy is used to determine the set of processors that can be used for a Time Warp simulation. This set is computed dynamically throughout the simulation. A processor is added to or removed from the set when an estimate of the Time Warp execution time on that processor falls below or moves above certain thresholds. As in [Glaz93], the metric used was the Processor Advance Time (PAT), which reflects the amount of real time used to make a one-unit advance in virtual time. Using the centrally collected PATs, the load balancing policy tries to distribute the load evenly over all processors. An initial version of the algorithm was implemented on a set of nine Silicon Graphics machines, one of which was dedicated to dynamic load management. The Time Warp application used in the experiments is a simulation of a personal communication services network. The authors experimented with two different parameters: time-varying external workloads, and changing the usable set of processors. In the first set of experiments, half of the processors experienced a monotonic increase in the external workload, which resulted in cluster migration onto other processors. Results showed an improvement in the simulation time of up to 45%. To see how the system reacts to a change in the usable set of processors, a "spike" workload was added to four of the eight workstations. The authors observed a small improvement in the simulator efficiency and in the reduction of the simulation time. This minor improvement was believed to be the result of the extra overhead of considering inactive processors in the computation of GVT. A delay in the GVT computation would result in a memory shortage on active processors, since memory is not reclaimed fast enough.
Avril and Tropper [Avri96] studied dynamic load balancing for Clustered Time Warp. To measure the load, the authors defined the "load of a cluster" to be the number of events which were processed by the cluster since the last load balance in the simulation. The load of a processor is then computed as the sum of the loads of all of the clusters residing on the processor. The authors describe a triggering technique based on the throughput⁴ of the system: the load balancing algorithm is triggered only when the overall increase in the throughput of the system becomes larger than the cost (in terms of throughput) of moving the clusters. The algorithm was implemented and its performance was measured using two of the largest benchmark digital circuits of the ISCAS'89 series [Brgl89]. Results showed an improvement of 40-100% in the throughput of the system when dynamic load balancing was used. The authors observed that minimizing interprocessor communication when moving clusters reduces the number of rollbacks, but does not improve the throughput.
⁴Defined as the number of non-rolled-back message events per unit time.
2.5 Clustered Time Warp (CTW)
An implementation of our algorithm for load balancing was developed for Clustered Time Warp (CTW) [Avri95]. CTW is a hybrid algorithm which makes use of Time Warp between clusters of LPs, and a sequential algorithm within the cluster.
Logic simulation has always been a challenge for Time Warp because of two primary problems: memory management [Fuji89] and unstable behavior [Luba89]. The memory management problem arises as the result of the large number of LPs in the simulation: a circuit can consist of hundreds of thousands of gates (LPs). Since each LP must regularly save its state in order to recover from a rollback, it is possible that the simulation might run out of memory. The fossil collection algorithm cannot reclaim space fast enough for the simulation to continue. Several schemes were developed to solve this problem, including the use of the cancelback protocol and the use of artificial rollback [Jeff90].
Logic simulation also exhibits unstable behavior because of low computational granularity: messages are small, and the service time is in terms of nanoseconds (bit-wise operations); messages flow very quickly from one process to the other; antimessages cannot annihilate previously sent messages quickly enough, possibly leading to cascading rollbacks [Avri96].
To reduce memory consumption and to avoid cascading rollbacks, CTW assembles many LPs into one cluster. Administration is executed at the cluster level, rather than at the LP level, which leads to less scheduling overhead. Message cancellation, executed at the level of clusters, helps to avoid cascading rollbacks. The next section describes the structure and the mechanism of CTW. The last section introduces variations of CTW.
Figure 2.5: Cluster Structure (Cluster Environment, Cluster Input Queue, Cluster Output Queue, Timezone Table)
2.5.1 Cluster Structure and Behavior
Each cluster in Clustered Time Warp is autonomous: it receives messages, processes events, saves states and messages, and rolls back in case of a straggler or anti-message. A cluster consists of one or more LPs (gates), a Cluster Environment (CE), a Timezone table, and a Cluster Output Queue (COQ). The COQ holds the output messages of the cluster and is used in case of rollback. The CE is the manager of the Timezone table and the COQ. Figure 2.5 shows the complete structure of a cluster with four LPs.
Initially, the cluster timezone consists only of the interval [0, +∞[, and is updated over time. When a cluster receives an event with a timestamp t, it determines which timezone t fits into, and divides it into two timezones, [ti, t[ and [t, ti+1[.
Each LP in the cluster keeps its Local Simulation Time (LST)⁵, as well as the time of the last event it processed (TLE). When the LP processes an event, it determines if its timezone is different from the one of the TLE; if this is the case, then the LP saves its state.
⁵Local Simulation Time is the same as local virtual time.
When an LP tries to send an event, it checks the receiving LP; if this LP resides on the same cluster, the message is inserted into the input queue of the receiving LP; if not, the message is handed to the CE, which takes care of forwarding the message to the cluster in which the receiving LP is located.
When a cluster receives a straggler with timestamp ts, the CE creates a new timezone, and rolls back all of the LPs with TLE greater than ts. After the rollback, a "coast forward" is executed at the LP level. The cluster will behave similarly when it receives an anti-message, except that it will not create a new timezone.
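The timezone bookkeeping described above can be sketched as follows; representing the table as a sorted list of zone start times is our simplification, not CTW's actual data structure:

```python
import bisect

def insert_timezone(boundaries, t):
    """Timezone-table sketch for a CTW cluster. `boundaries` is a
    sorted list of zone start times; [0] denotes the single initial
    zone [0, +inf[. An arriving event with timestamp t splits the
    zone [ti, ti+1[ that contains t into [ti, t[ and [t, ti+1[,
    i.e. t becomes a new boundary (unless it already is one).
    (Illustrative sketch only.)"""
    i = bisect.bisect_left(boundaries, t)
    if i == len(boundaries) or boundaries[i] != t:
        boundaries.insert(i, t)
    return boundaries
```

An LP then checkpoints exactly when the zone containing the event it processes differs from the zone of its TLE.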
2.5.2 Variations of Clustered Time Warp
As mentioned previously, when a cluster receives a straggler or an antimessage, it rolls back all of the LPs which have processed an event with a timestamp larger than that of the straggler or the antimessage. This also includes LPs which are not directly affected by the rollback. The first variation of CTW [Avri95] was intended to avoid unnecessary rollbacks: only affected LPs will be rolled back.
In the new strategy, when a cluster receives a straggler or an antimessage, it updates its timezone table, and forwards the event to the input queue of the receiving LP. In this way, only the receiving LP will be rolled back. This new scheme is called Local Rollback, as opposed to Clustered Rollback.
A second variation of CTW deals with checkpoint timing. Instead of saving a state when it enters a new timezone, an LP saves its state each time it receives a message from an LP located in a different cluster. Even though the new scheme decreases the number of states saved, there is an increase in the number of events an LP has to keep. These events are needed for coast forward, when an LP is rolled back to a state prior to the GVT. This new scheme was called Local Checkpointing, as opposed to Cluster Checkpointing for pure CTW.
Using the previous variations, three checkpointing algorithms were developed: Clustered Rollback Cluster Checkpointing (CRCC), Local Rollback Local Checkpointing (LRLC), and Local Rollback Cluster Checkpointing (LRCC). These algorithms exhibit a trade-off between memory and execution time. Experimental results on the performance of each algorithm are found in [Avri95].
2.5.3 Load Balancing for Clustered Time Warp
Clustered Time Warp supports process migration for dynamic load balancing [Avri96]. In the original implementation of Clustered Time Warp, a process called the pilot is dedicated to collecting load information from all of the processors; the load of each processor is piggybacked on the GVT token, and forwarded to the pilot. Depending on the distribution of the loads, the pilot decides on changing the process-to-processor mapping.
Our current implementation of load balancing uses a distributed algorithm, in which there is no dedicated processor. A token is passed around a virtual ring to all of the processors, and load balancing is done when all loads are inserted into the token.
Chapter 3
Metrics and Algorithm
This chapter provides details of the metrics and the algorithm for our dynamic load balancing scheme. The following section describes the two metrics employed. Section two describes the dynamic load balancing algorithm in detail. Section three describes the design issues associated with the development of the algorithm.
3.1 Metrics
Dynamic load balancing requires both a metric to determine the system load and a mechanism for controlling process migration. Ideally, the metric should not only be simple and fast to compute, but also effective. In this section, two such metrics are suggested: Processor Utilization (PU), and Processor Advance Simulation Rate (PASR).
3.1.1 Processor Advance Simulation Rate (PASR)
If several (simulation) processes are interconnected, a discrepancy in their respective virtual clocks can result in an increase in the number of messages arriving in the past, and cause rollbacks. When a process is rolled back from time ti to time tj, all work performed during this time period is discarded. System resources used during the corresponding real time interval could have been productively employed by other processes.
Controlling the rate at which processes advance their corresponding virtual times will minimize the difference between the virtual clocks, and as a consequence, reduce the number of rollbacks which occur in the simulation.
The virtual time of a processor is defined as the minimum virtual time of all the processes residing on that processor. A processor which has no events to process sets its virtual time to infinity.
For a system simulated in the real time interval (tstart, tend), the Processor Advance Simulation Rate (PASR) defines the rate of advance in virtual time relative to real time. Let t1 and t2 be two real time values, with t2 > t1. Define STt as the simulation time at real time t. Let ΔST denote the change in the simulation time during the time interval (t1, t2):

ΔST = ST_t2 − ST_t1.

The PASR is defined as:

PASR = ΔST / (t2 − t1).

A processor with a PASR higher than average is susceptible to being rolled back because it is ahead of other processors in virtual time. If it is slowed down, the frequency with which it is rolled back by other processors might well decrease. Figure 3.1 shows an example of two processors with different PASRs. A message m sent from P1 to P2 will force P2 to roll back to a virtual time previous to vt1, since the timestamp of m is vt1. P2 then has to cancel all the previous work done in the (t1, t2) real time interval. Hence, moving some load from processors with high PASRs to others with low PASRs should speed up the slow processors and slow down the fast ones.
Figure 3.1: Two processors with different PASRs (virtual time vs. real time)
3.1.2 Processor Utilization (PU)
A number of researchers [Hac87, Tant85, Wang85] feel that it is best to maximize the available parallelism in the system by keeping processor utilization as high as possible. For systems where no a priori estimates of load distribution are possible, only actual program execution can reveal how much work has been assigned to individual processors.
Let us define effective utilization [Reit90] as the proportion of work done by a processor which is not rolled back. Unfortunately, it is impossible for a processor to determine the effective utilization at a given point in the simulation, since it might roll back later and cancel all of the work that has been done. In [Reit90] an estimate of the effective utilization is used for load computation. Consequently, we make use of the processor utilization (PU), defined as the ratio of the processor's computation time (in seconds) between t1 and t2 to t2 − t1:

PU = computation time in (t1, t2) / (t2 − t1).

Processor utilization allows for the fact that messages in the system might be of different sizes, and might require different service times. It also accounts for the fact that two processors might advance their virtual clocks by the same value even if the computation time is different.
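The PU computation is equally direct (function and parameter names are ours):

```python
def processor_utilization(busy_seconds, t1, t2):
    """Processor Utilization: the fraction of the real-time interval
    (t1, t2) that the processor spent computing,
    PU = computation time in (t1, t2) / (t2 - t1). Requires t2 > t1."""
    return busy_seconds / (t2 - t1)
```

A processor busy for 3 of 4 elapsed seconds has a PU of 0.75, regardless of how far its virtual clock advanced in that time.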
3.1.3 Combination of PU and PASR
A combination of the two metrics was also tested in our experiments. The combination was intended to increase the utilization of the processors, while maintaining a maximum advance simulation rate and minimizing the number of rollbacks.
3.2 Algorithm
A difficulty in the load balancing of a distributed application is the absence of global information on the load of the system. We employ a (distributed) algorithm to collect the relevant information about processors' loads in our load balancing algorithm.
Our algorithm uses a token which circulates to each processor on a logical ring [Tane81]. At each processor, the PASR, the PU and the combined metric (all referred to as LOAD later) are inserted into the token.
Assume that there are n processors in the system. Initially, the token is launched by processor one (1). When it gets to processor n, data from all of the processors will be stored in the token. Processor n, called the host, will be able to identify overloaded and underloaded processors. The token will remain in processor n for a period¹ of time, after which it is launched again. After the next round, when the token reaches processor n-1, data from all of the processors will be contained in it, and hence n-1 then becomes the next host.
¹We determine this time period experimentally. See section 3.3.1 for a discussion.

At the end of each round, the host processor computes both the mean and the standard deviation of all of the LOADs. The new mean is then compared with the
mean from the previous round (stored in the token). If the new mean is larger, this indicates that the performance of the system is improving (an increase in the PU or the PASR), and the system is left intact. The host sets a count-down timer which triggers the token again after β GVT computations. If the new mean is smaller than the mean from the previous round, the host checks the standard deviation of the LOADs. If the standard deviation is found to be larger than a certain tolerance value α, then a new process-to-processor mapping is assumed to be necessary. The host matches the processor with the largest load together with the least loaded one, and sends a message to the overloaded processor with the name of the destination processor to which it should transfer some of its load. The load to be moved is equal to half of the difference in the loads between the two matched processors. The standard deviation of the loads is computed again, and another pair of processors is matched. This process is repeated until the standard deviation is less than the value α. When all migrations are completed, the host starts the timer for the next round of the token.
3.2.1 Pseudo Code
This section presents the basic functions for the dynamic load balancing algorithm. The skeleton of the algorithm is presented next in a C-format language.
function Serve_Token(token)
{
    insert processor's LOAD into the token;
    if (token contains loads from all processors) {
        Mean_LOAD = (LOAD_1 + ... + LOAD_n) / n;   /* n is the number of processors */
        if (Mean_LOAD >= token.previous_Mean_LOAD)
            /* The performance of the system is improving */
            Set_Timer();
        else {
            StD = (standard deviation of the LOADs);
            if (StD <= tolerance alpha)
                /* alpha was taken to be .25, a value determined experimentally */
                Set_Timer();
            else {
                matchProcessors();
                while (processors are still transferring processes)
                    wait();   /* Processor can proceed with other computations. */
                Set_Timer();
            }
        }
    } else
        pass the token to the successor on the virtual ring;
}

function Set_Timer()
{
    wait(beta GVT computations);   /* beta is determined experimentally (10 GVT computations) */
    Initialize(NewToken);
    Serve_Token(NewToken);
}
function matchProcessors()
{
    StD = (standard deviation of the LOADs);
    while (StD >= tolerance alpha) {
        sourceProc = processor with the highest LOAD;
        destinationProc = processor with the lowest LOAD;
        loadToMove = (LOAD(sourceProc) - LOAD(destinationProc)) / 2;
        notify sourceProc to transfer loadToMove to destinationProc;
        update LOADs of sourceProc and destinationProc and recompute StD;
    }
}
3.3 Design Issues
Designing a process migration system involves resolving a number of issues, such as which cluster to move, when the dynamic load balancing algorithm should be invoked, and what the maximum number of processors which are allowed to transfer load (at the end of each round) should be. These issues are all interrelated, and their resolution also depends on the simulated model. We provide a discussion of these issues in the following sections.
3.3.1 Token period, β
The algorithm periodically sends out a token to collect the necessary information from all of the processors, and at the end of each round the host decides whether a process migration is needed. The time interval β between two consecutive token rounds (the token period) should be long enough to allow the system to stabilize after the previous load balance, but should be short enough to prevent the system from being unbalanced for a lengthy period of time.
During the course of the experiments, a token period of 10 GVT¹ computations was employed. This value was found feasible for our models. This value is partially determined by the systems being simulated, in addition to the time it takes for a GVT computation: the longer it takes to compute the GVT, the shorter the token period should be.
¹The GVT computation was initiated every 3 seconds.
A related issue is the determination of α. Transfer of clusters occurs only if the standard deviation of the loads exceeds α. We used a value of α = .25, determined experimentally.
3.3.2 Selecting the Clusters (Processes)
There is a trade-off between achieving the goal of completely balancing the load and the communication costs associated with migrating processes, since transferring clusters takes time. When determining which cluster to move, our implementation chooses the cluster with the highest LOAD so as to minimize the number of clusters moved, assuming that the load of the cluster does not exceed the intended load to move.
3.3.3 Number of Processors that can Initiate Migration
Theoretically, the load balancing algorithm should distribute the loads on all of the processors as uniformly as possible. However, migrations from too many processors can cause the system to become unstable. In our experiments, we observed that approximately 30% of the processors can launch migration at one time without jeopardizing the stability of the system.
Chapter 4
Results and Discussion
4.1 Introduction
This chapter presents the simulated models, experiments and results using the algorithm and metrics presented in the previous chapter. The load balancing algorithm was implemented on top of Clustered Time Warp. In our experiments, we made exclusive use of the LRCC checkpointing technique, which offers an intermediate choice in the memory vs. execution time trade-off. The simulations were executed on a BBN Butterfly¹ GP1000, a shared memory multiprocessor machine. A set of models, based on VLSI circuits, an assembly pipeline and distributed communication networks, were simulated.
¹The Butterfly is an MIMD machine with 32 processor nodes. Each node has an MC68020 and an MC68881 processor with 4 megabytes of memory, and a high-speed multistage crossbar switch interconnects the processors. The clock speed of each processor is 25 MHz.
Circuit | Inputs | Outputs | Flip-Flops | Number of gates
Table 4.1: Circuits from ISCAS'89
Two digital circuits from the ISCAS'89 [Brgl89] benchmark suite (Table 4.1) were selected and simulated. The size of each of these circuits is approximately 24,000 gates. The circuits were partitioned into 200 clusters each, using string partitioning [Leve82]. Clusters were numbered arbitrarily and were mapped to processors as follows: if N is the number of processors, and K is the number of clusters (K > N), then the first K/N clusters were assigned to processor number 1, the second K/N clusters were assigned to processor number 2, and so forth.
The data which was collected includes the running time, peak memory usage, and effective throughput. The performance of the algorithm was evaluated using between 12 and 24 processors on the Butterfly.
4.2 Experimental Results
Figures 4.1 and 4.2 show the simulation time for the two circuits C2 and C3, plotted against the number of processors. Results are compared to the simulation without load balancing. The same number of input vectors was used for all of the simulations. Figure 4.1 depicts the performance of the metrics and the algorithm for C2: a 35-72% reduction in the simulation time when PU is used as a metric, 0-30% when PASR is used and 0-40% when the combined metric is used. As for C3, the results are as follows: 2-21% when PU is used as a metric, (-5)-16% when PASR is used and 0-15% when the combined metric is used.
Figure 4.1: Simulation time for C2

Figure 4.2: Simulation time for C3

We attribute the difference in the percentage of reduction in the simulation time between the two metrics to the locality of activity. When the system exhibits a high locality of activity, PU will increase on some processors. Underloaded processors can quickly process the events generated by a more heavily loaded processor, resulting in the same PASR but a different PU. This also explains why the improvements for C2 decline with an increase in the number of processors, since the activity of the circuit is spread out when using more processors.
To understand the difference in the operation of the load balancing algorithm between C2 and C3, and why dynamic load balancing might sometimes increase the running time of the simulation (employing the PASR with C3), we looked at two other elements: peak memory consumption and the effective throughput. The peak memory consumption is the average of the peak memory consumption over all of the processors. On each processor, it represents the maximum amount of memory used over the course of a run of the simulation. The effective throughput is defined as the number of non-rolled-back messages in the system per unit time.
Researchers on memory consumption for Time Warp [Jeff85, Lin91, Jeff90, Akyi92] have pointed out that there is a time penalty associated with large space consumption, and that it is possible for a simulation to run out of memory. Memory management is time consuming; a heavily loaded processor must spend considerable time on memory management. By looking at figures 4.3 and 4.4, one can see that dynamic load balancing reduced peak memory consumption for C2, but increased it for C3. Circuit C2 has a higher locality² of events than does C3; hence moving clusters and associated events from the loaded processors in C2 decreases the peak memory consumption. As for C3, the activity of the circuit is much more distributed, and hence moving clusters might create a memory problem.
²An experimental study of the activity of the same two circuits can be found in [Avri96].
Figure 4.3: Peak memory consumption for C2

Figure 4.4: Peak memory consumption for C3

Figure 4.5: Throughput Graph for C2
We also examined the effective throughput of the system. Figures 4.5 and 4.6 show the impact of the dynamic load balancing algorithm on the effective throughput of the system during the simulation of C2 and C3. The figures show a run of the simulation using the PU metric. Both figures show a noticeable increase in the effective throughput of the system. This increase is counter-balanced in C3 by an increase in the memory consumption.
4.2.1 Increase in Rollbacks
An increase in rollbacks was observed (up to 20%) when load balancing was employed. This increase was expected, since moving clusters slowed down the source and the destination processors; when these processors resume computation, they might well send messages into the "past" of processors which were not involved in moving clusters. In addition, it is sometimes the case that the virtual time of the migrated cluster is smaller than the virtual time of the receiving processor. This results from the fact that the receiving processor is lightly loaded, and is ahead in virtual time.
Figure 4.6: Throughput Graph for C3
4.3 Pipeline Model
A second model which was simulated is a manufacturing pipeline (figure 4.7) [Glugd]. The model consists of thirty processes, including two sinks and two sources, arranged in nine stages. Each process was represented by a cluster of 625 logical processes connected in a mesh topology, and clusters at the same stage were mapped to the same processor. Messages in the system flow from the sources to the sinks, following different paths. At each stage, a message is served and forwarded to the next stage, until it gets to a sink, where it leaves the system. The service time distribution is deterministic and the routing decision at each stage is governed by a uniform distribution. The pipeline model exhibits a large number of rollbacks, which are caused by messages starting at the same source, following different paths, and arriving at the same processor in a (possibly) different order from the one in which they were generated.
Figure 4.7: Manufacturing Pipeline Model

Figure 4.8: Percentage Increase in the Number of Rollbacks (Pipeline Model)

Figure 4.9: Percentage Reduction of the Simulation Time (Pipeline Model)

Figures 4.8 and 4.9 show the results from the pipeline model. When dynamic load balancing was used, the simulation showed an increase in the percentage of rollbacks, and an increase in the simulation time. Figure 4.8 shows that the number of rollbacks increased by 20% when the PU metric was used, 18% when using the PASR and 17% when a combination of the two metrics was used. The increase in the percentage of rollbacks results from the fact that moving clusters from one stage (processor) to another will cause a delay on some of the input links of the next stage (processor).
Figure 4.9 shows that the simulation time increased by 8% when using dynamic load balancing with the PU metric, and by 0.5% when using the PASR metric. When using the combination of the two metrics, the simulation time did not change. The increase in the simulation time is explained by the increase in the number of rollbacks in the system.
4.4 Distributed Network Model
The final model is a distributed communication model (figure 4.10). Two kinds of experiments were conducted on the model. In the first experiment, messages were uniformly distributed on the network. The second experiment modeled a national communication network divided into four regions. In this model, we experimented with the reaction of dynamic load balancing to a continuous change of loads on the processors. During the course of the simulation, messages were concentrated on different regions, one region at a time. For instance, at one point messages were concentrated in region 1, and regions 2, 3 and 4 were lightly loaded. After a period of time, region 2 became saturated with messages, and regions 1, 3 and 4 were lightly loaded.

Figure 4.10: Distributed Network Model

The simulation ran on 10 processors, with 7-8 nodes mapped to each processor. Interprocessor communication was minimized by mapping connected nodes to the same processor. On each node a message is served and, with a probability of 30%, is forwarded to a randomly selected neighbor. Nodes have service times governed by exponential distributions (with different means), and the choice of a neighbor to which to forward the message is governed by a uniform distribution.
Figure 4.11: Percentage of Reduction in the Simulation Time in the Presence of Time-Varying Load (Distributed Communication Model)

Figure 4.12: Percentage of Reduction in the Simulation Time in the Absence of Time-Varying Load (Distributed Communication Model)

The results in figures 4.11 and 4.12 show a difference in the speedup percentage between the two experiments. Figure 4.11 shows a 30-35% reduction in the simulation time when dynamic load balancing was employed with all metrics. This comes back to the fact that, in the presence of time-varying load, the system is locally overloaded with messages, and cluster migration improves the performance of the simulation. A reduction of only 10-20% was observed in the absence of time-varying load (figure 4.12).
Chapter 5
Conclusion
Iri tfiis tlicsis' we examined two mctncs for clynamic l o d balancing: processor uti-
lization (PU) and processor advance simulation rate (PASR). A combination of the
two mctncs was also testcd. A clistributed load balancing algorithm was devcloped.
in which a tokcn circulating on a logical ring is uscd to collect the information about
thc loads of proccssors, and inforni the processors about the new mapping. Thc algo-
rithrri runs in conjiinction with Clustered Time Warp (CTW)[Avri95]. which allows
clustcr migration.
To observe the performance of the algorithm with each of the metrics, several models were constructed: VLSI models, an assembly pipeline model and a distributed communication network model. Experiments were carried out on the BBN Butterfly GP1000, a 32 node shared memory multiprocessor. The simulation time, memory consumption and effective throughput were measured. The effective throughput is the number of non-rolled-back messages in the system per unit time.
Results for the VLSI circuits showed that dynamic load balancing changes the simulation time by between (-5) and 71%. As for the pipeline model, the simulation time increased by 0.5-8.0% when load balancing was employed. Load balancing exhibited a 10-33% improvement with the distributed network model. The PU metric performed well for all models except the pipeline model. Results also showed that the throughput of the system was improved by more than 100% for the VLSI model. The PASR and the combined metric yield the same improvement as the PU for the distributed network model, and better results for the pipeline model. Perhaps the most important conclusion of this thesis is that dynamic load balancing depends strongly on the nature of the model which is being simulated.
A number of extensions of this research suggest themselves. One is to implement the algorithm on a distributed memory machine. It is possible that some interesting aspects do not reveal themselves on a shared memory machine, such as the impact of the communication network connecting the processing elements.
Another extension relates to the use of string partitioning for the VLSI circuits examined. One might investigate the effect of different (initial) partitioning algorithms on dynamic load balancing.
Bibliography
[Akyi92] Akyildiz, I.F. and Chen, L. and Das, S.R. and others, "Performance Analysis of Time Warp with Limited Memory", Proc. 1992 ACM Sigmetrics Conf. on Measurement and Modeling of Computer Systems, pp. 213-224, May 1992.
[Ande87] Ander, E., "A Simulation of Dynamic Task Allocation in a Distributed Computer System", Proceedings of the 1987 Winter Simulation Conference, pp. 768-776, 1987.
[Avri95] Avril, Herve and Tropper, Carl, "Clustered Time Warp and Logic Simulation", Proc. of the 9th Workshop On Parallel and Distributed Simulation, pp. 112-119, June 1995.
[Avri96a] Avril, Herve and Tropper, Carl, "The Dynamic Load Balancing of Clustered Time Warp for Logic Simulation", Proc. of the 10th Workshop On Parallel and Distributed Simulation, pp. 20-27, May 1996.
[Avri96] Avril, Herve, "Clustered Time Warp and Logic Simulation", PhD Thesis, School of Computer Science, McGill University, Montreal, Canada, 1996.
[Bail92] Bailey, M.L., "How Circuit Size Affects Parallelism", IEEE Trans. Comput. Aided Des. Integr. Circ. Syst., Vol. 12, pp. 1903-1912, 1992.
[Bail94] Bailey, M.L. and Briner, J.V. and Chamberlain, R.D., "Parallel Logic Simulation of VLSI Systems", ACM Computing Surveys, Vol. 26, No. 3, pp. 255-295, September 1994.
[Baue91] Bauer, H. and Sporrer, C. and Krodel, T.H., "On Distributed Logic Simulation Using Time-Warp", Proc. of the International Conference on Very Large Scale Integration (VLSI), pp. 127-136, 1991.
[Bell90] Bellenot, S., "Global Virtual Time Algorithms", Distributed Simulation, pp. 122-127, 1990.
[Bokh81] Bokhari, Shahid H., "On the Mapping Problem", IEEE Transactions on Computers, Vol. 30, No. 3, pp. 207-214, March 1981.
[Bokh87] Bokhari, Shahid H., Assignment Problems in Parallel and Distributed Computing, Kluwer Academic Publishers, Boston, 1987.
[Bouk94] Boukerche, Azzedine and Tropper, Carl, "A Static Partitioning and Mapping Algorithm for Conservative Parallel Simulations", Proc. of the 8th Workshop on Parallel and Distributed Simulation, pp. 164-172, 1994.
[Bouk97] Boukerche, Azzedine and Das, Sajal, "Load Balancing for Conservative Parallel Simulation", MASCOTS, 1997.
[Brgl89] Brglez, Franc and Bryan, David and Kozminski, Krzysztof, "Combinational Profiles of Sequential Benchmark Circuits", Proceedings IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1929-1934, 1989.
[Brin91a] Briner, J.V., "Fast Parallel Simulation of Digital Systems", Proc. of the 5th Workshop On Parallel and Distributed Simulation, pp. 71-77, 1991.
[Brin91b] Briner, J.V. and Ellis, J.L. and Kedem, G., "Breaking the Barrier of Parallel Simulation of Digital Systems", Proc. of the 26th ACM/IEEE Design Automation Conference, pp. 223-226, 1991.
[Brya77] Bryant, R.E., "Simulation of Packet Communication Architecture Computer Systems", Tech. Rep. MIT-LCS-TR-188, Massachusetts Institute of Technology, 1977.
[Brya81] Bryant, R.E., "MOSSIM: A Switch-Level Simulator for MOS-LSI", Proceedings of the 18th Design Automation Conference, 1981.
[Burd93] Burdorf, C. and Marti, J., "Load Balancing for Time Warp on Multi-User Workstations", The Computer Journal, Vol. 36, No. 2, pp. 168-176, 1993.
[Caro95] Carothers, Christopher D. and Fujimoto, Richard M. and England, Paul, "Effect of Communication Overhead on Time Warp Performance: An Experimental Study", Proc. of the 8th Workshop On Parallel and Distributed Simulation, pp. 118-125, 1995.
[Caro96] Carothers, Christopher D. and Fujimoto, Richard M., "Background Execution of Time Warp Programs", Proc. of the 9th Workshop On Parallel and Distributed Simulation, pp. 12-19, May 1996.
[Cart91] Carter, H. and Vemuri, R. and Wilsey, P.A. and others, "High Speed Acceleration of VHDL Simulation, Synthesis, and ATPG: Overview of the QUEST Project", Spring 1991 VHDL Users' Group, pp. 85-90, April 1991.
[Cham94] Chamberlain, R.D. and Henderson, C.D., "Evaluating the Use of Pre-Simulation of VLSI Circuits", Proc. of the 8th Workshop On Parallel and Distributed Simulation, pp. 139-146, July 1994.
[Chan79] Chandy, K. and Misra, J., "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs", IEEE Transactions on Software Engineering, September 1979.
[Chun89] Chung, M.J. and Chung, Y., "Data Parallel Simulation Using Time-Warp on the Connection Machine", Proc. of the 26th ACM/IEEE Design Automation Conference, pp. 98-103, 1989.
[Conc91] Concepcion, A.I. and Kelly, S.G., "Computing Global Virtual Time Using the Multi-Level Token Passing Algorithm", Proc. of the 5th Workshop On Parallel and Distributed Simulation, pp. 63-68, 1991.
[Elma86] Elmagarmid, A.K., "A Survey of Distributed Deadlock Detection Algorithms", ACM SIGMOD Record, Vol. 15, No. 3, pp. 37-45, September 1986.
[Fidu82] Fiduccia, C.M. and Mattheyses, R.M., "A Linear-Time Heuristic for Improving Network Partitions", Proceedings of the 19th ACM/IEEE Design Automation Conference DAC 82, pp. 175-181, 1982.
[Fran86] Frank, E., "Exploiting Parallelism in a Switch-Level Simulation Machine", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 20-26, 1986.
[Fuji88] Fujimoto, Richard M., "Lookahead in Parallel Discrete Event Simulation", Proceedings of the International Conference on Parallel Processing, pp. 34-41, 1988.
[Fuji89] Fujimoto, Richard M., "Parallel Discrete Event Simulation", Proceedings of the 1989 Winter Simulation Conference, pp. 19-28, 1989.
[Fuji89b] Fujimoto, R.M., "Time Warp on a Shared Memory Multiprocessor", Tech. Rep. TR-UUCS-88-021a, Computer Science Department, University of Utah, January 1989.
[Gare79] Garey, M.R. and Johnson, D.S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Company, New York, 1979.
[Glaz93] Glazer, David W. and Tropper, Carl, "On Process Migration and Load Balancing in Time Warp", IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 3, pp. 318-327, March 1993.
[Gros88] Groselj, B. and Tropper, Carl, "The Time-of-Next-Event Algorithm", Proceedings of the 1988 Distributed Simulation Conference, SCS Simulation Series, Vol. 19, No. 3, pp. 25-29, February 1988.
[Gros91] Groselj, B. and Tropper, Carl, "The Distributed Simulation of Clustered Processes", Distributed Computing, Vol. 4, pp. 111-121, 1991.
[Hac86] Hac, A. and Johnson, T.J., "A Study of Dynamic Load Balancing in a Distributed System", ACM Symposium on Communications, Architectures and Protocols, pp. 348-356, 1986.
[Hac87] Hac, A. and Jin, X., "Dynamic Load Balancing in a Distributed System Using a Sender-Initiated Algorithm", Proceedings of the IEEE-CS and ACM SIGARCH Workshop on Instrumentation for Distributed Computing Systems, pp. 62-65, 1987.
[Hilb82] Hilberg, W., Grundprobleme der Mikroelektronik, Oldenbourg Verlag, München, 1982.
[Horb86] Horbst, E., Logic Design and Simulation, Elsevier Science Publishers B.V. (North Holland), 1986.
[Iqba86] Iqbal, A.M. and Saltz, J.H. and Bokhari, S.H., "A Comparative Analysis of Static and Dynamic Load Balancing Strategies", Proceedings of the 1986 International Conference on Parallel Processing, pp. 1040-1047, 1986.
[Jeff83] Jefferson, D. and Sowizral, H., "Fast Concurrent Simulation Using the Time Warp Mechanism, Part II: Global Control", Tech. Rep. TR-83-204, Rand Corporation, August 1983.
[Jeff85] Jefferson, D.R., "Virtual Time", ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, pp. 404-425, July 1985.
[Jeff90] Jefferson, D.R., "Virtual Time II: The Cancelback Protocol for Storage Management in Distributed Simulation", Proc. of the 9th Ann. ACM Symp. on Principles of Distributed Computing, pp. 75-90, August 1990.
[Kern70] Kernighan, B.W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970.
[Kirk83] Kirkpatrick, S. and Gelatt, C.D. and Vecchi, M.P., "Optimization by Simulated Annealing", Science, Vol. 220, No. 4598, May 1983.
[Leve82] Levendel, Y.H. and Menon, P.R. and Soffa, S.H., "Special Purpose Computer for Logic Simulation Using Distributed Processing", Bell System Technical Journal, Vol. 61, No. 10, pp. 2873-2909, December 1982.
[Lin89] Lin, Yi-Bing and Lazowska, Edward D., "Determining the Global Virtual Time in a Distributed Simulation", Tech. Rep. TR-90-01-02, Department of Computer Science and Engineering, University of Washington, Seattle, WA, December 1989.
[Lin91] Lin, Y.B. and Preiss, B.R., "Optimal Memory Management for Time Warp Parallel Simulation", ACM Transactions on Modeling and Computer Simulation, Vol. 1, No. 4, pp. 283-307, October 1991.
[Liu90] Liu, L.Z. and Tropper, Carl, "Local Deadlock Detection in Distributed Simulation", Proceedings of the 1990 Distributed Simulation Conference, SCS Simulation Series, Vol. 22, No. 1, pp. 137-143, 1990.
[Livn82] Livney, M. and Melman, M., "Load Balancing in Homogeneous Broadcast Distributed Systems", Proc. of the ACM Computer Network Performance Symposium, pp. 47-55, 1982.
[Lu86] Lu, H. and Carey, M.J., "Load-Balancing Task Allocation in Locally Distributed Computer Systems", Proceedings of the 1986 International Conference on Parallel Processing, pp. 1037-1039, 1986.
[Luba88a] Lubachevsky, Boris D., "Simulating Colliding Rigid Disks in Parallel Using Bounded Lag Without Time Warp", Distributed Simulation, SCS Simulation Series, Vol. 22, No. 1, pp. 194-202, 1988.
[Luba88b] Lubachevsky, B., "Bounded Lag Distributed Discrete Event Simulation", Proceedings of the 1988 Distributed Simulation Conference, SCS Simulation Series, Vol. 19, No. 3, pp. 183-191, February 1988.
[Luba89] Lubachevsky, Boris D. and Schwartz, A. and Weiss, A., "Rollback Sometimes Works... if Filtered", Proceedings of the 1989 Winter Simulation Conference, pp. 630-639, December 1989.
[Maur90] Maurer, P.M. and Wang, Z., "Techniques for Unit-Delay Compiled Simulation", Proc. of the 27th ACM/IEEE Design Automation Conference, pp. 480-484, 1990.
[Misr82] Misra, J. and Chandy, K.M., "A Distributed Graph Algorithm: Knot Detection", ACM Transactions on Programming Languages and Systems, Vol. 4, pp. 678-686, October 1982.
[Misr86] Misra, J., "Distributed Discrete Event Simulation", Computing Surveys, Vol. 18, No. 1, pp. 39-65, March 1986.
[Mukh86] Mukherjee, A., Introduction to nMOS and CMOS VLSI Systems Design, Prentice Hall International Editions, 1986.
[Nand92] Nandy, Biswajit and Loucks, Wayne M., "An Algorithm for Partitioning and Mapping Conservative Parallel Simulation onto Multicomputers", PADS'92, pp. 139-146, 1992.
[Nata86] Natarajan, N., "A Distributed Scheme for Detecting Communication Deadlocks", IEEE Transactions on Software Engineering, Vol. SE-12, No. 4, pp. 531-537, April 1986.
[Ni85] Ni, L. and et al., "A Distributed Drafting Algorithm for Load Balancing", IEEE Trans. Soft. Eng., Vol. 11, No. 10, 1985.
[Nico88] Nicol, D.M., "Parallel Discrete Event Simulation of FCFS Stochastic Queuing Networks", Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Environments, Applications, and Languages, pp. 124-137, July 1988.
[Peac79] Peacock, J.K. and Wong, J.W. and Manning, E.G., "Distributed Simulation Using a Network of Processors", Computer Networks, Vol. 3, No. 1, pp. 44-56, February 1979.
[Prei89] Preiss, B.R., "The Yaddes Distributed Discrete Event Simulation Specification Language and Execution Environment", Proceedings of the Multiconference on Distributed Simulation, pp. 139-144, 1989.
[Reit90] Reiher, Peter L. and Jefferson, David, "Virtual Time Based Dynamic Load Management in the Time Warp Operating System", 4th Workshop on Parallel and Distributed Simulation, pp. 103-111, 1990.
[Sama85] Samadi, B., "Distributed Simulation, Algorithms and Performance Analysis", PhD Thesis, Computer Science Department, University of California, Los Angeles, 1985.
[Schl95] Schlagenhaft, Rolf and others, "Dynamic Load Balancing of a Multi-Cluster Simulator on a Network of Workstations", 9th Workshop on Parallel and Distributed Simulation, pp. 175-180, 1995.
[Shan89] Shanker, M. and et al., "Adaptive Distribution of Model Components Via Congestion", Proceedings of the 1989 Winter Simulation Conference, 1989.
[Smit86] Smith, R.J., "Fundamentals of Parallel Logic Simulation", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 2-12, 1986.
[Smit87a] Smith, R. and Smith, J. and Smith, K., "Faster Architecture Simulation Through Parallelism", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 189-194, 1987.
[Smit87b] Smith, Steven P. and Underwood, Bill and Ray Mercer, M., "An Analysis of Several Approaches to Circuit Partitioning for Parallel Logic Simulation", Proceedings IEEE International Conference on Computer Design ICCD 87, pp. 664-667, 1987.
[Soul87] Soule, L. and Blank, T., "Statistics for Parallelism and Abstraction Levels in Digital Simulation", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 588-591, 1987.
[Soul88] Soule, L. and Blank, T., "Parallel Logic Simulation on General Purpose Machines", Proc. of the 25th ACM/IEEE Design Automation Conference, pp. 166-171, 1988.
[Soul92] Soule, L. and Gupta, A., "An Evaluation of the Chandy-Misra-Bryant Algorithm for Digital Logic Simulation of VLSI Circuits", Proc. of the 6th Workshop On Parallel and Distributed Simulation, pp. 129-138, 1992.
[Spor93] Sporrer, Christian and Bauer, Herbert, "Corolla Partitioning for Distributed Logic Simulation of VLSI Circuits", PADS'93, Vol. 23, No. 1, pp. 85-92, 1993.
[Su89] Su, W.K. and Seitz, C.L., "Variants of the Chandy-Misra-Bryant Distributed Discrete-Event Simulation Algorithm", Distributed Simulation, SCS Simulation Series, pp. 38-43, 1989.
[Tane81] Tanenbaum, A., Computer Networks, Prentice Hall, Englewood Cliffs, NJ, 1981 (second edition 1988).
[Tant85] Tantawi, A.N. and Towsley, D., "Optimal Static Load Balancing in Distributed Computer Systems", Journal of the ACM, Vol. 32, No. 3, pp. 445-465, 1985.
[Term83] Terman, C.J., "Simulation Tools for Digital Design", PhD dissertation, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1983.
[Vlad80] Vladimirescu, A. and Liu, S., The Simulation of MOS Integrated Circuits Using SPICE, Memo UCB/ERL M80/7, University of California, Berkeley, 1980.
[Wang87] Wang, L.T. and Hoover, N. and Porter, E. and Zasio, J., "SSIM: A Software Levelized Compiled-Code Simulator", Proc. of the 24th ACM/IEEE Design Automation Conference, pp. 2-8, 1987.
[Wang85] Wang, Y.T. and Morris, R.J.T., "Load Sharing in Distributed Systems", IEEE Transactions on Computers, pp. 204-217, 1985.
[Wang90] Wang, Z. and Maurer, P.M., "LECSIM: A Levelized Event Driven Compiled Logic Simulator", Proc. of the 27th ACM/IEEE Design Automation Conference, pp. 491-496, 1990.
[Wei89] Wei, Yen-Chuen and Cheng, Chung-Kuan, "Towards Efficient Hierarchical Designs by Ratio Cut Partitioning", IEEE, pp. 298-301, 1989.
[Wils86] Wilson, A., "Parallelization of an Event Driven Simulator on the Encore Multimax", Tech. Rep. ETR 86-005, Encore Computer, 1986.
[Wong86] Wong, F., "Statistics on Logic Simulation", Proc. of the 23rd ACM/IEEE Design Automation Conference, pp. 13-19, 1986.
[Zarg85] Zargham, M. and Pircell, R., "A Protocol for Load Balancing CSMA Networks", IEEE Trans. Parallel and Distributed Systems, 1985.