Simulation Practice and Theory 5 (1997) 83-99

Model partitioning and the performance of distributed timewarp simulation of logic circuits

J. Cloutier *, E. Cerny, F. Guertin

Dép. d'Informatique et de Recherche Opérationnelle, Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, H3C 3J7, Canada

* Corresponding author. Email: [email protected].

Received 24 September 1994; revised 22 November 1995

Abstract

Simulation of complex digital electronic systems requires powerful machines and algorithms. Distributed simulation could improve both the execution time and the availability of a large distributed memory for complex models. Model partitioning onto the available processors has a major impact on simulation efficiency. We report on how various partitioning algorithms affect timewarp-based distributed simulation of combinational and synchronous sequential logic circuits, and try to determine the relationship between circuit parameters (the number of gates, topological levels and the degree of activity in the circuit) and the structure of the partition having the fastest simulation on a heterogeneous network of Sun workstations.

Keywords: Timewarp; Distributed simulation; Logic circuits; Partitioning; Performance

1. Introduction

The increased attention to quality and the growing complexity of electronic circuits translate into more extensive simulation runs. It becomes difficult to verify complete systems at the gate level for the following reasons:

(a) The simulation times become too long.
(b) The required physical memory is too large. Page faulting deteriorates performance, since simulation has poor locality properties.
(c) Hardware accelerators are costly and not as effective with the mixed levels of model abstraction (e.g., as found in VHDL) that are also needed to reduce the simulation time.

Yet, high-speed networks of workstations and low-cost multi-processors are available and could help solve problems (a) and (b) using distributed simulation.



There are three distributed simulation techniques: synchronous (global synchronization on every event), asynchronous conservative [4], and asynchronous optimistic (timewarp) [11]. The asynchronous techniques are useful for circuit models with real gate delays. Good model partitioning is one of the most important factors affecting performance [3,5,7,13,14].

For our study of the effects of partitioning on simulation performance, we selected the timewarp simulation technique, because it easily supports inertial delays, and there is less need for "inside" knowledge of model semantics than in the conservative approach, which requires deadlock detection or prevention. Rather than adapting our old timewarp simulator [6], we used the commercial general-purpose object-oriented timewarp simulator Sim++ from JADE Simulations International Corp. [9], running on a network of Sun workstations. The comparison of model partitioning is valid for any shared-bus (e.g., Ethernet) distributed system.

For the experiments we used gate-level models of the ISCAS'85 and ISCAS'89 testability benchmark circuits [1,2]; we argue that the results on partitioning can be extended to larger circuits. In this paper we report on the experimental results from simulating three distinct combinational and four synchronous sequential circuits from these benchmarks. We try to determine the relationship between the circuit topology and the structure of the partition that yields the best simulation performance. In initial experiments, we determined that the best partitions are obtained by the Level [13] and by the Mincut [8] algorithms. Our improved Level algorithm considers the circuit topology, the number of gates, their evaluation load (expected evaluation frequency), and the power of the processors.

The paper is organized as follows: in Section 2 we describe the simulation environment, and in Section 3 the benchmark models. In Section 4 we outline the partitioning algorithms, the initial simulation experiments and our strategy for improving partitions. We present experimental results in Section 5 and conclude the paper in Section 6.

2. The simulation environment

2.1. The simulator

The JADE simulator is based on Virtual Time [11]. The simulated system is described using a collection of independent entities which are statically created during the initialization phase of the simulation. Entities synchronize and exchange information using time-stamped events (messages) that include the scheduled and received times, as determined by the originating entity. An event can be canceled by the originating entity at any time less than the receive time of the event. Each entity has a local view of the simulation time. Entities advance in time by awaiting (future) incoming events and/or by suspending their operation for some time interval. In the latter case, execution can also be resumed when an event satisfying a user-specified condition is received. The state of entities is saved by the timewarp executive in case an entity has to roll back in time due to the reception of a straggler event. Global virtual time is periodically updated to release memory. For logic circuit simulation, a layer was placed on top of this general-purpose system to facilitate modeling of signals and their updates, and to group any number of gates, flip-flops, input generators and output observers into one entity. There is one entity executing on each processor, and the composition of the entity is determined from the circuit parameters and the relative processor power using a partitioning algorithm. Each processor thus runs a sequential simulator that sits on top of the timewarp executive, which is responsible for providing a coherent view of the state of the circuit to the simulator.
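The paper builds on the commercial Sim++ executive, whose API is not reproduced here; the following Python sketch is therefore purely illustrative (all names such as Event, Entity and evaluate are hypothetical) and only shows the kind of bookkeeping described above: time-stamped events carrying send and receive times, per-entity state saving, and a rollback to a saved state when a straggler arrives.

```python
# Illustrative sketch only; this is NOT the Sim++ API.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass(order=True)
class Event:
    recv_time: int                                 # scheduled (receive) time at the destination
    send_time: int = field(compare=False)          # local time of the originating entity
    payload: Any = field(compare=False, default=None)

class Entity:
    """One timewarp entity: processes events in receive-time order,
    saves its state, and rolls back when a straggler arrives."""
    def __init__(self, initial_state):
        self.local_time = 0
        self.state = initial_state
        self.processed: List[Event] = []           # events already handled
        self.saved_states = [(0, initial_state)]   # (time, state) checkpoints

    def receive(self, ev: Event):
        if ev.recv_time < self.local_time:         # straggler: roll back first
            self.rollback(ev.recv_time)
        self.local_time = ev.recv_time
        self.state = self.evaluate(self.state, ev) # model-specific evaluation
        self.processed.append(ev)
        self.saved_states.append((self.local_time, self.state))

    def rollback(self, time):
        # Restore the latest checkpoint not later than `time`; a real
        # executive would also send anti-messages for the undone sends.
        while self.saved_states and self.saved_states[-1][0] > time:
            self.saved_states.pop()
        self.local_time, self.state = self.saved_states[-1]
        self.processed = [e for e in self.processed if e.recv_time <= time]

    def evaluate(self, state, ev):
        return state                               # placeholder for the gate-level layer
```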

2.2. The network of workstations

The simulations were run on an Ethernet network of six Sun workstations: two SPARC IPCs (the slow group), two SPARC IIs (the medium group) and two SPARC 10s (the fast group). We estimated their relative computing power as 100% for the fast, 64% for the medium and 43% for the slow group, using benchmarks that measured their processing and communication power. Those benchmarks showed that gate-level simulations send a large number of small messages on the shared bus, so the achievable communication rate is determined by the processing power of the workstations rather than by the bus bandwidth.

3. The benchmark circuits

The circuits are from the ISCAS'85 and ISCAS'89 testability benchmark suites [1,2]. We assumed equal delays on all gates. The input vectors were formed by synchronous, independent pseudo-random bit streams provided by single-bit generators on each primary input. The vectors were separated by the same, sufficiently large delay so that the outputs could stabilize before the arrival of the next input vector. For sequential circuits, the clock signal was generated internally by each flip-flop, with the same period as the inputs. Besides collecting event counts from each gate at the end of each simulation (100 or 200 vectors, depending on the circuit size), no signal observation was done, so as to reduce I/O effects that cannot be controlled.
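As a concrete illustration of the stimulus generation described above, the following sketch (the Python form and the names are ours, not taken from the paper) produces one independent pseudo-random bit stream per primary input, with all inputs updated at a fixed period that is assumed to exceed the settling time of the circuit.

```python
import random

def input_schedule(primary_inputs, n_vectors, period, seed=0):
    """Yield (time, input_name, bit) events: each primary input gets an
    independent pseudo-random bit stream; a new vector is applied every
    `period` time units so the outputs can stabilize before the next one
    (assumption: `period` exceeds the critical-path delay of the circuit)."""
    rng = random.Random(seed)
    for k in range(n_vectors):
        t = k * period
        for name in primary_inputs:
            yield (t, name, rng.randint(0, 1))

# Example: 100 vectors on three inputs, one vector every 1000 delay units.
events = list(input_schedule(["a", "b", "c"], n_vectors=100, period=1000))
```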

Although model partitioning is very important, not much is known about the effects of various partitioning techniques on the performance of timewarp simulation. The main effects are the following:

- Communication overhead vs. computation load. If many messages must be exchanged relative to the computation load, the simulation will be slower, especially on a network with high communication costs.

- Load balancing. The computation load must be evenly distributed over the processors, taking into account their relative power. An imbalance can cause some processes to progress far into the future and result in rollbacks. This may not be a problem if the processor is otherwise idle, but message cancellation and state restoration can still be costly.

- Look-ahead. In an ideal situation, an entity with a local simulation time t_L should receive messages only with a scheduled time t_s such that t_s > t_L. In this way the entity can proceed smoothly forward in time without rollbacks, achieving a high degree of parallelism.

The circuit parameters we used in partitioning are the circuit topology (the netlist), the gate delays, the relative number of evaluations of the model of each circuit element, and the relative complexity of the element model's evaluation. To model these parameters for partitioning, we represent a logic circuit by a weighted directed acyclic hypergraph H = (V, E, W, C), where V is the set of nodes (gates, flip-flops, and input and output ports), E ⊆ V × 2^V is the set of directed hyperedges (circuit nets), W : V → ℝ is the node evaluation load function (e.g., the average frequency of evaluation of a gate), and C : E → ℝ is the communication load function of the hyperedges (e.g., the average frequency of events on a net). A hyperedge is a tuple (v, D), where v ∈ V is the source node and D ⊆ V is the set of destination nodes. Circuit partitioning is thus transformed into a weighted directed hypergraph partitioning problem.
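A minimal data-structure sketch of the weighted directed hypergraph H = (V, E, W, C) defined above; the Python representation and the tiny example circuit are ours, and the weights shown are arbitrary placeholders.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# One hyperedge per net: (source node, set of destination nodes).
Hyperedge = Tuple[str, FrozenSet[str]]

@dataclass
class Hypergraph:
    """H = (V, E, W, C): nodes, directed hyperedges, node evaluation
    loads W and hyperedge communication loads C."""
    V: set                      # gates, flip-flops, input/output ports
    E: List[Hyperedge]          # circuit nets
    W: Dict[str, float]         # average evaluation frequency per node
    C: Dict[Hyperedge, float]   # average event frequency per net

# A two-gate example (all weights are placeholders):
g = Hypergraph(
    V={"in1", "in2", "and1", "inv1", "out1"},
    E=[("in1", frozenset({"and1"})), ("in2", frozenset({"and1"})),
       ("and1", frozenset({"inv1"})), ("inv1", frozenset({"out1"}))],
    W={"in1": 1.0, "in2": 1.0, "and1": 0.7, "inv1": 0.5, "out1": 0.5},
    C={("in1", frozenset({"and1"})): 1.0,
       ("in2", frozenset({"and1"})): 1.0,
       ("and1", frozenset({"inv1"})): 0.7,
       ("inv1", frozenset({"out1"})): 0.5},
)
```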

4. Strategy for balancing partitions

Initially, we implemented the following known partitioning techniques [13]: Natural, Level, Chains, Fanout Cones, Fanin Cones and Random. They all share the same basic algorithm:

1. A total ordering of the gates is found using one of the above heuristics, and
2. the computation load is distributed to the processors (i.e., the circuit elements are placed into the corresponding entities) using a greedy algorithm that assigns circuit elements to an entity until an upper limit is reached. This limit is determined from the performance of the processor on which the entity is executing; a sketch of this greedy assignment is given after this list.
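The greedy assignment of step 2 might look as follows. This is a sketch under the assumption that the per-block limits are simply proportional to the relative processor powers reported in Section 2.2; the function name and variable names are hypothetical.

```python
def greedy_partition(order, weights, processor_powers):
    """Step 2 of the basic algorithm: walk the ordered gates and fill one
    block per processor until that block's share of the total load
    (proportional to the processor's relative power) is reached."""
    total = sum(weights[v] for v in order)
    total_power = sum(processor_powers)
    limits = [total * p / total_power for p in processor_powers]

    blocks = [[] for _ in processor_powers]
    loads = [0.0] * len(processor_powers)
    b = 0
    for v in order:                       # `order` comes from step 1 (Natural, Level, ...)
        if loads[b] >= limits[b] and b < len(blocks) - 1:
            b += 1                        # current block is full: pass to the next one
        blocks[b].append(v)
        loads[b] += weights[v]
    return blocks

# Example with the relative powers used in the paper (two processors per group):
powers = [1.0, 1.0, 0.64, 0.64, 0.43, 0.43]
# blocks = greedy_partition(order, W, powers)   # `order` from step 1, W from Section 3
```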

We also adapted the Mincut algorithm [8] for weighted hypergraphs [12]. While the previous algorithms aim at load balancing, Mincut minimizes the communication load of the simulation. Initial experiments [10] determined that the best partitions are produced by the Level and Mincut algorithms. Hence, we subsequently concentrated only on these two methods, which are described in more detail in the following paragraphs.

4.1. The Level algorithm [13]

The method first assigns a topological level (from the primary inputs) to each node of the acyclic hypergraph. The hypergraph is acyclic for combinational circuits by definition; for synchronous sequential circuits, however, there may be cycles, so the method cannot be applied directly. We experimentally determined [10] that the best way to eliminate cycles (for partitioning only) is to remove the hyperedges having a flip-flop as their source. Since by definition there is at least one flip-flop in each cycle in such circuits, the hypergraph becomes acyclic. Then, nodes with the same topological level are assigned to a block, starting with the lowest level. The assignment of nodes is done in a greedy manner, placing gates into the first block until its maximum size is reached, then passing to the second one, and so on. Note that the original algorithm [13] used W(v) = 1 for all v ∈ V, and the communication load C was not considered at all. There is a one-to-one mapping between the blocks and the processing elements.
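A sketch of the level assignment described above, using the hypergraph representation of Section 3: hyperedges sourced at flip-flops are ignored so that sequential circuits become acyclic for partitioning, and every node receives a level, here taken as the longest-path distance from the primary inputs (a common definition; the paper does not spell out the exact formula, and the helper names are ours).

```python
from collections import defaultdict, deque

def topological_levels(nodes, hyperedges, is_flip_flop):
    """Assign a level (distance from the primary inputs) to every node,
    ignoring hyperedges sourced at flip-flops so that sequential circuits
    become acyclic for partitioning purposes."""
    succs = defaultdict(set)
    indeg = defaultdict(int)
    for src, dests in hyperedges:
        if is_flip_flop(src):
            continue                      # break every cycle (each cycle contains a flip-flop)
        for d in dests:
            if d not in succs[src]:
                succs[src].add(d)
                indeg[d] += 1

    level = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)   # primary inputs, flip-flops, ...
    while queue:
        v = queue.popleft()
        for d in succs[v]:
            level[d] = max(level[d], level[v] + 1)
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return level

# Level ordering for step 1 of the basic algorithm:
# order = sorted(nodes, key=lambda v: levels[v])
```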

4.2. The Mincut algorithm

The usual Mincut heuristic algorithm partitions a hypergraph using the number of nets crossing between the partition blocks (called the cut size) as the primary minimization goal, and the balance of the block sizes as a secondary goal. We first generate a random initial partition. The algorithm of [8] first solves the cut minimization while limiting the imbalance of the partition to less than 33%. It then iteratively modifies the partition by moving vertices from one block to another until no reduction in the cut can be achieved. We improved the speed of the algorithm for sparse graphs [12] to O(max(|V|, |E|) log |V|), and we take into account the node weights W(v) (v ∈ V) and the communication loads C(e) (e ∈ E). The node weight and the communication load represent, respectively, the average number of times a gate is evaluated and the average number of events on a wire when simulating one input vector (each input of the circuit is assigned a 0 or a 1 with equal probability). The primary minimization goal thus is the weighted cut size, defined as the sum of the communication loads of the hyperedges connecting vertices in different blocks.
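The weighted cut size used as the primary objective can be computed as in the following sketch, assuming the hyperedge representation from the earlier sketch and a dict mapping each node to its block index (both assumptions are ours).

```python
def weighted_cut_size(hyperedges, C, block_of):
    """Sum of the communication loads C(e) over hyperedges whose source
    and destinations do not all lie in the same block."""
    cut = 0.0
    for src, dests in hyperedges:
        blocks = {block_of[src]} | {block_of[d] for d in dests}
        if len(blocks) > 1:               # the net crosses a block boundary
            cut += C[(src, dests)]
    return cut
```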

The results in [14] suggest that Mincut partitions are good for sequential circuits and that the number of gates in a block is a good approximation of the evaluation load. Others [5] used presimulation runs to determine the evaluation load. As discussed in the following section, our experiments also indicate that estimates of gate evaluation rates obtained from presimulations improve the partitions (i.e., achieve better simulation times) in the case of pseudo-random input sequences.

5. Experimental results

We present here the results of our initial experiments, followed by the improved partitioning strategy and the final experimental results. A sequential simulation may be faster or slower than a multi-processor simulation, depending on the relationship between the amount of computation and the communication requirements of a simulation. We ran experiments in which the evaluation time of every gate was artificially increased (by adding an empty loop). By doing so, we increased the processing time while the communication time remained almost unchanged. Under these circumstances, the observed impact of our model partitioning on the simulation performance remained the same as presented in this paper.

88 J. Cloutier et al. 1 Simulation Practice and Theory 5 (1997) 83-99

5.1. Combinational circuits

Table 1 describes the benchmarks used. Table 2 summarizes the results of the initial experiments using the original Level and Mincut algorithms, in which we set W(v) = 1 for all v ∈ V and C(e) = 1 for all e ∈ E, and where the number of gates assigned to a processor (a partition block) is directly proportional to the power of the processor. These results suggest that Level partitioning is better than Mincut. This may always be the case for combinational circuits, since, unlike Mincut, Level partitioning introduces no directed loops between the blocks. This produces a pipeline-like structure over the partition, such that with proper load balancing it can achieve near-optimal look-ahead conditions and maximal parallelism. Proper load distribution is as essential as in pipelines in computer architecture. We can also see in Table 2 that the cut size is much larger for the Level than for the Mincut partitions, yet the simulation times are shorter. This seems to indicate that for combinational circuits a well-balanced pipeline-like partition leads to good performance, even if the communication load is higher. The cut value is the cut size as defined in Section 4.2.

Since load balancing is so important, we did additional experiments on the same benchmarks to determine the optimal load and to relate it to the circuit characteristics. The load was varied as follows:

- the same number of gates was assigned to each computer in a performance group, and
- the limit on the number of gates in a group was derived from the relative power of the group's processors.

The experimental results are summarized in Fig. 1.

Table 1

Characteristics of combinational circuits

Circuit   Inputs   Outputs   Nets   Gates
c1908       33       25       913     880
c3540       50       22      1719    1669
c7552      207      101      3719    3512

Table 2

Initial experiments

Circuit   Mincut                           Level
          Simulation time (sec.)   Cut     Simulation time (sec.)   Cut
c1908     115                      134      79                       319
c3540     173                      402      95                       626
c7552     488                      582     277                      1642


Fig. 1. Simulation results for combinational circuits: simulation time vs. load factor for c1908, c3540 and c7552.

The load is expressed as a factor relative to the fast processor group: the ratio of the power of the medium group to the fast group is about the same as the ratio of the slow group to the medium group. We use this to compress the load distribution information into one number, as indicated on the load axis in Fig. 1. A load factor of 100% means that each group is assigned the same total node weight (since we used W(v) = 1 here, this corresponds to the number of gates). Now let x% be a load factor; then x% of the fast group's load is assigned to the medium group, and again x% of the medium group's load is assigned to the slow group, i.e., x²% of the fast group's load.
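The load-factor convention can be made concrete with a small calculation; this sketch (our own) splits a total node weight over the three groups for a given factor x, and reproduces the 1.69 figure quoted below for a 130% load (1.3² = 1.69).

```python
def group_loads(total_weight, load_factor):
    """Split the total node weight over the fast, medium and slow groups
    according to the single load-factor convention used in the text:
    the medium group gets load_factor times the fast group's load, and
    the slow group gets load_factor times the medium group's load."""
    x = load_factor                     # e.g. 1.3 for a 130% load factor
    fast = total_weight / (1 + x + x * x)
    return {"fast": fast, "medium": x * fast, "slow": x * x * fast}

# At 130%, the slow group carries 1.3**2 = 1.69 times the fast group's load.
print(group_loads(1000.0, 1.3))
```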

There are two curves in each graph of Fig. 1: one corresponds to an increasing order of the groups and the other to a decreasing order. In the increasing (decreasing) order, the slow (fast) group is assigned to the low-level partitions, i.e., near the circuit inputs, followed by the medium group, and finally the fast (slow) group is assigned to the high-level partitions, i.e., near the circuit outputs. We can see that the minimal decreasing-order simulation time is always greater than the minimal increasing-order simulation time. An explanation is as follows: due to the convergence of paths of unequal length at the gates in the higher topological levels (near the end of the "pipeline"), rollbacks are more frequent there than in the lower levels (the first level has no rollbacks). In the decreasing order, the slower processors are assigned to the higher-level partitions, where there is more overhead than in the other partitions. Since this is the slowest group, it incurs a larger time penalty and the simulation progresses at a slower rate. Decreasing the load of the slow group at the end of the pipeline is not enough to compensate for the resulting increased load (and rollbacks) in the other groups.

Furthermore, the increasing order achieves minimal simulation time when a higher load (number of gates) is assigned to the slower processors (placed at the beginning of the pipeline). For example, the smallest execution time for c7552 is achieved with a load of 130%, meaning that 1.69 times more gates were assigned to the slow processors than to the fast ones. The possible explanations are the following: first, the lower-level partitions have few if any rollbacks, so the simulation overhead is diminished; second, the gates in the lower levels have smaller evaluation loads, i.e., they are evaluated fewer times than the gates in the higher levels where paths of different lengths reconverge. We confirmed this with another experiment in which we counted the number of gate activations. Consequently, more gates should be placed on the slower processors assigned to the lower-level partitions than on the fast processors assigned to the highest-level partitions, and the evaluation load of each gate should be accounted for.

Since computing the exact evaluation load of a gate (or net) is an NP-complete problem, approximations of the real evaluation load of the gates and of the communication load of the nets must be used. We used presimulation runs to obtain such estimates for each circuit. The evaluation load is approximated by the average number of evaluations of a gate per input vector, while the communication load is approximated by the average number of transitions on each net per input vector.
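A sketch of how such presimulation counts could be turned into the weights of Section 3 (the counter names are hypothetical): W(v) becomes the average number of evaluations of gate v per input vector and C(e) the average number of transitions on net e per input vector.

```python
def estimate_weights(eval_counts, transition_counts, n_vectors):
    """Approximate W(v) by the average number of evaluations of gate v per
    input vector, and C(e) by the average number of transitions on net e
    per input vector, from presimulation counters."""
    W = {v: c / n_vectors for v, c in eval_counts.items()}
    C = {e: c / n_vectors for e, c in transition_counts.items()}
    return W, C
```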

The Mincut [12] and Level algorithms were reapplied using these weights, while taking into account the relative processor power (by definition, Level partitioning cannot consider the communication load C). We show in Table 3 the average simulation times from 10 simulations and the cut size of the partitions for each benchmark. We also indicate the effective number of messages sent on the network (i.e., not counting those that were canceled by the timewarp mechanism); this number can be approximated by multiplying the number of input vectors by the weighted cut size of the partition. For comparison, we also indicate the minimal simulation time from the increasing-order Level partition of Fig. 1.

Table 3

Simulation for different partitioning techniques

Circuit   Mincut                                    Level                                     Sim. time ratio   Minimal "increasing"
          Sim. time (sec.)   Cut size (#Messages)   Sim. time (sec.)   Cut size (#Messages)   (Mincut/Level)    Sim. time (sec.)   Cut size (#Messages)
c1908     117                199 (23 K)              43                 379 (48 K)            2.7                39                311 (37 K)
c3540     169                474 (47 K)              47                 684 (79 K)            3.6                45                498 (61 K)
c7552     303                686 (90 K)             197                1766 (247 K)           1.5               148               1630 (236 K)


We can observe the following:
(a) The simulation time seems unrelated to the cut size.
(b) When activations from presimulations are used, the simulation times of both the Level and the Mincut partitions are improved by 50% compared to the case (Table 2) where only the number of gates is used as the measure of the evaluation load of a block.
(c) The best performance is obtained using Level partitions, and it is at most 33% worse than the best performance of the increasing order shown in Fig. 1.
(d) As anticipated, the cut sizes and the number of messages sent are higher in the Level partitions than in the Mincut partitions.

5.2. Sequential circuits

Table 4 shows the characteristics of the four sequential circuits [2] used in the experiments. The experimental results are summarized in Fig. 2. Each graph contains three curves corresponding respectively to Mincut partitions, Level decreasing order and Level increasing order partitions. The indicated values are averaged over 3 simulation runs for each partition. We can see that:

(1) The decreasing-order Level partitions have smaller minimal execution times than the Mincut partitions, except for circuit s13207. We cannot explain this comprehensively or quantitatively on the basis of only four circuits. Fig. 3 shows the distribution of the number of gate activations over the levels of each circuit. For s13207, the activations are concentrated mainly on the first levels, while for the other circuits they are more evenly distributed. The impact of these distributions on the partitions is shown in Tables 5 and 6, which give the percentage of gates per partition for 50% and 70% load factors. In both cases a lower percentage of gates is assigned to the first partition for s13207 than for the other three circuits. It would seem that for this circuit, where the high concentration of messages is in the low topological levels, the communication cost is dominant and the Mincut partition wins over the pipeline structure of the Level partition. However, further studies are needed to fully explain this behavior. For completeness and comparison with the activity distribution of Fig. 3, Fig. 4 shows the distribution of gates across the topological levels of the benchmarks.

(2) Decreasing-order Level partitions always produce smaller minimal execution times than those of the increasing order. This seems to contradict the results for the combinational circuits.

Table 4

Characteristics of sequential circuits

Circuit   Inputs   Outputs   Flip-flops   Gates
s5378       35       49         179        2719
s9234       19       22         228        5597
s13207      31      121         669        7951
s15850      14       87         591        9172


Fig. 2. Simulation results for sequential circuits: simulation time vs. load factor for s5378, s9234, s13207 and s15850 (Mincut, Level decreasing-order and Level increasing-order partitions).

Yet, the same arguments can be used to favor the decreasing order here. While rollbacks occur more frequently in the higher levels in combinational circuits, in sequential circuits they are more frequent in the lower levels. This is because flip-flops are placed in the higher-level partitions by our Level algorithm. The flip-flops send messages to the gates in the lower levels, and these can generate many rollbacks there. Consequently, it becomes preferable to assign the lower-level partitions to the faster processors, i.e., the decreasing order.

Table 7 summarizes the best results for the Mincut and the Level decreasing-order partitions. In the Level results, the minimal times occur at load factors between 50% and 70%, which is useful for a quick determination of a good load factor. Furthermore, the "real" power of our processors is best approximated by a load factor of 46%, which is close to the load factors of the best experimental Level partitions. Finally, we can observe that:

(1) The simulation time again seems unrelated to the cut size.
(2) While it is possible to predict the best load factor for the Level partitions, this is not the case for Mincut, because the latter does not take into account the direction of the hyperedges; it is thus difficult to predict which processor will have the larger number of rollbacks.


Fig. 3. Number of gate activations on each level of the sequential benchmarks (s5378, s9234, s13207, s15850).

Table 5

Percentage of gates per partition for a 50% load factor

Partition   All circuits,        s5378, with       s9234, with       s13207, with      s15850, with
number      no activations (%)   activations (%)   activations (%)   activations (%)   activations (%)
0           28.6                 40.3              44.1              33.5              31.7
1           28.6                 27.9              20.2              24.1              24.1
2           14.3                 11.5               9.4              14.2               9.1
3           14.3                  9.8               8.6              13.6              16.2
4            7.1                  5.1              11.7               8.6               8.1
5            7.1                  5.4               6.0               6.0               4.9


Table 6

Percentage of gates per partition for a 70% load factor

Partition   All circuits,        s5378, with       s9234, with       s13207, with      s15850, with
number      no activations (%)   activations (%)   activations (%)   activations (%)   activations (%)
0           22.9                 31.7              38.8              27.1              32.9
1           22.9                 24.4              18.5              21.4              20.5
2           15.9                 16.1              10.5              14.7              11.2
3           16.0                 12.2               9.4              12.4              13.2
4           11.2                  7.2               9.4              14.4              13.6
5           11.2                  8.5              13.4              10.0               8.6

We illustrate these observations in Fig. 5 which shows the relationship between the simulation time and the cut size for different load factors. Even though the cut sizes of the Level partitions are larger than those of the Mincut partitions, the former are faster (except for one circuit as explained earlier). Also, there is little correlation between the cut size and the simulation time for Level partitions, and a strong correlation, as anticipated, for Mincut partitions.

Finally, to complete the experiments and to confirm the results from the combinational circuits, we obtained estimates of gate activations from presimulations and used them as vertex weights W(v) when applying the Level partitioning algorithms. The results are summarized in Fig. 6, which shows the simulation times as a function of the load factor for the Level increasing and decreasing orders with both W(v) = 1 and W(v) = activation estimate. The best of these times are compared in Table 8. Clearly, the best times are again obtained when activation estimates are used. Note that this holds for pseudo-random input sequences; it may not be a valid conclusion when specific design-verification input sequences are used in the simulation, since these may be much longer and different in nature from the sequences used in the presimulations.

6. Conclusions

We studied the effects of partitioning algorithms on the timewarp-based distributed simulation of combinational and sequential circuits under pseudo-random input sequences applied with a fixed period, and with equal gate delays. The results obtained are valid for any shared-bus distributed system. We determined the following:

For combinational and sequential circuits:
- The number of gates gives a poor indication of the computation load of a partition.
- The topology of the partition (i.e., the pipeline) is usually more important than the cut size.
- The general simulation behavior of the different circuits seems to be independent of the circuit sizes. We thus believe that our conclusions apply to much larger circuits.


Fig. 4. Number of gates per level in the sequential benchmarks (s5378, s9234, s13207, s15850).


For combinational circuits only:

- Faster simulations are achieved using Level partitioning in which the gates in the low topological levels (near the inputs) are assigned to the slow processors and the higher-level gates (near the outputs) to the fast processors.

For sequential circuits only:
- The structure and the dynamic behavior of a circuit seem to determine which partitioning strategy (Mincut or Level) gives the best results; further analysis is still required, however.
- Faster simulations can be achieved using Level partitioning in which the gates in the low topological levels of the circuit (near the inputs) are assigned to the fast processors and the higher-level gates (near the outputs and the flip-flops) to the slow processors.


Table 7

Summary of best simulation times for sequential circuits

Circuit   Best Mincut                            Best Level (decreasing)                 Ratio of times
          Sim. time (sec.)   Cut     Load (%)    Sim. time (sec.)   Cut     Load (%)     (Mincut/Level)
s5378     446                1420     60          320               1532     50          1.39
s9234     429                 923    110          376               2266     60          1.14
s13207    758                1414    100         1090               3693     70          0.69
s15850    431                1785     90          330               3541     70          1.30

Fig. 5. Relationship between simulation times and cut sizes: Level and Mincut elapsed times and cut sizes vs. load factor for s5378, s9234, s13207 and s15850.


Fig. 6. Simulation results for sequential circuits with "activation-partitioning": simulation time vs. load factor for s5378, s9234, s13207 and s15850.



Table 8

Comparison of the best Level-decreasing simulation times

Circuit   Best Level-decreasing                      Best Level-decreasing                      Ratio of times
          (number-of-gates partition)                (activation partition)                     (gates/activation)
          Sim. time (sec.)   Cut     Load (%)        Sim. time (sec.)   Cut     Load (%)
s5378      320               1532     50              244               1262     30             1.31
s9234      376               2266     60              260               2195     50             1.45
s13207    1090               3693     70              600               3737     50             1.81
s15850     330               3541     70              273               3414     70             1.20

To further improve the partitioning algorithms, we have developed a limited simulated annealing optimization algorithm that takes the current best partition as the starting point and uses the actual simulation time as the cost function. Some good partitions explored by the simulated annealing algorithm had small feedback loops between neighboring blocks, and these partitions were less susceptible to disturbances in the workstation network. Such loops may have a stabilizing effect on the simulation, without unduly increasing the simulation time.
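The limited simulated-annealing refinement mentioned above could be organized roughly as in the following sketch; run_simulation (one real distributed run used as the cost function) and random_move (e.g., moving a single gate to another block) are hypothetical stand-ins, and the schedule parameters are arbitrary.

```python
import math
import random

def anneal_partition(initial_blocks, run_simulation, random_move,
                     t0=1.0, cooling=0.9, steps_per_temp=5, n_temps=10):
    """Limited simulated annealing: start from the current best partition
    and use the actual simulation time as the cost function."""
    best = current = initial_blocks
    best_cost = current_cost = run_simulation(current)
    t = t0
    for _ in range(n_temps):
        for _ in range(steps_per_temp):
            candidate = random_move(current)      # e.g. move one gate to another block
            cost = run_simulation(candidate)      # expensive: one real distributed run
            # Accept improvements, and occasionally accept worse partitions.
            if cost < current_cost or random.random() < math.exp((current_cost - cost) / t):
                current, current_cost = candidate, cost
                if cost < best_cost:
                    best, best_cost = candidate, cost
        t *= cooling
    return best, best_cost
```

The number of moves is kept small ("limited") because each cost evaluation is a full distributed simulation run.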

Acknowledgements

We express our gratitude to Pascal Abessolo N’guema for collecting the simulation data. This work was partially funded by NSERC Canada grants STR0117759 and OGPIN-007.

References

[1] F. Brglez and H. Fujiwara, A neutral netlist of 10 combinational benchmark circuits and a target translator in Fortran, in: Proceedings of the IEEE International Symposium on Circuits and Systems (1985) 104-111.
[2] F. Brglez, D. Bryan and K. Kozminski, Combinational profiles of sequential benchmark circuits, in: Proceedings of the IEEE International Symposium on Circuits and Systems (1989) 75-81.
[3] J.V. Briner, J.L. Ellis and G. Kedem, Taking advantage of optimal on-chip parallelism for parallel discrete event simulation, in: Proceedings of the IEEE International Conference on Computer-Aided Design (1988) 312-315.
[4] K.M. Chandy and J. Misra, Asynchronous distributed simulation via a sequence of parallel computations, Comm. ACM 24 (4) (1981) 198-206.
[5] Y. Chi and W.M. Loucks, Partitioning strategies to improve the performance of parallel time warp simulation, in: Proceedings of the 1992 Workshop on Partitioning Logic Circuits for Parallel Simulation, Ottawa (1992) 8.1-8.11.
[6] J. Cloutier, M. Bourgault, C. Roy and E. Cerny, Performance vs flexibility in a hierarchical multi-level VLSI simulator, in: Proceedings of the Conference on AI and Simulation, Society for Computer Simulation (1987) 92-97.
[7] M. Davoren, A structural approach to the mapping problem in parallel discrete event simulations, Ph.D. Thesis, University of Edinburgh, Scotland, 1989.
[8] C.M. Fiduccia and R.M. Mattheyses, A linear time heuristic for improving network partitions, in: Proceedings of the 19th ACM/IEEE Design Automation Conference (1982) 175-181.
[9] JADE Simulations International Corp., Sim++: A discrete-event simulation language, Technical Report, 1989.
[10] P. Girodias, Partitionnement pour la simulation parallèle en temps virtuel de circuits combinatoires, Master's Thesis, Université de Montréal, Canada, 1992.
[11] D.R. Jefferson, Virtual time, ACM Trans. Programming Languages and Systems 7 (3) (1985) 404-425.
[12] M. Meknassi, M.E. Aboulhamid and E. Cerny, An efficient partitioning algorithm for large graphs, in: Proceedings of CCVLSI-92, Halifax, Canada (1992) 93-100.
[13] S.P. Smith, B. Underwood and M.R. Mercer, An analysis of several approaches to circuit partitioning for parallel logic simulation, in: Proceedings of the IEEE International Conference on Computer Design (1987) 664-667.
[14] C. Sporrer and H. Bauer, Partitioning VLSI-circuits for distributed logic simulation, in: Proceedings of the European Simulation Multiconference, York, UK (1992) 409-413.