
Parallel Shared-Memory State-Space Exploration in Stochastic Modeling

Susann C. Allmaier, Graham Horton

Computer Science Department III, University of Erlangen-Nürnberg, Martensstr. 3, 91058 Erlangen, Germany. Email: {snallmai|graham}@informatik.uni-erlangen.de

Abstract. Stochastic modeling forms the basis for analysis in many areas, including biological and economic systems, as well as the performance and reliability modeling of computers and communication networks. One common approach is the state-space-based technique, which, starting from a high-level model, uses depth-first search to generate both a description of every possible state of the model and the dynamics of the transitions between them. However, these state spaces, besides being very irregular in structure, are subject to a combinatorial explosion and can thus become extremely large. In the interest, therefore, of utilizing both the large memory capacity and the greater computational performance of modern multiprocessors, we are interested in implementing parallel algorithms for the generation and solution of these problems. In this paper we describe the techniques we use to generate the state space of a stochastic Petri-net model using shared-memory multiprocessors. We describe some of the problems encountered and our solutions, in particular the use of modified B-trees as a data structure for the parallel search process. We present results obtained from experiments on two different shared-memory machines.

1 Introduction

Stochastic modeling is an important technique for the performance and reliability analysis of computer and communication systems. By performing an analysis of an appropriate abstract model, useful information can be gained about the behavior of the system under consideration. Particularly for the validation of a system concept at an early design stage, values for expected performance and reliability can be obtained. Typical quantities of interest in computer performance might be the average job throughput of a server or the probability of buffer overflow of a network node. In reliability analysis, probabilities for critical system states such as failures may be computed. As a result of such analyses, design parameters such as protocol algorithms, degrees of redundancy and component bandwidths may be optimized. Thus the ability to perform these analyses quickly and efficiently is of great importance [2].

One important class of techniques for stochastic modeling, beside analytical and discrete-event simulation approaches, is state-space analysis. Here, a high-level model such as a queuing network or stochastic Petri net is created, from which the entire state-space graph is generated, in which there is one node for

each possible state which the model can assume. The states are linked by arcs which describe the timing characteristics of each state change. The state space is thus described by an annotated directed graph.

Using the simplest and most common assumption on the transitions, namely that the time spent by the system in each state is exponentially distributed, the stochastic process described by the model is a Markov chain. In this case the directed graph of the state space represents a matrix, and the transient and steady-state analyses are performed by solving a corresponding system of ordinary differential equations and a linear system of equations, respectively.
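For concreteness (standard Markov-chain notation, not taken from this paper): writing pi(t) for the state-probability vector and Q for the generator matrix encoded by the annotated graph, the two computations just mentioned amount to

    \frac{d\pi(t)}{dt} = \pi(t)\,Q, \qquad \pi(0) = \pi_0          (transient analysis)

    \pi\,Q = 0, \qquad \sum_i \pi_i = 1                             (steady-state analysis)

Transient analysis integrates the first system over time, while steady-state analysis solves the second, singular, linear system subject to the normalization condition.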

Owing to the combinatorial nature of the problem, the state spaces arising in practical problems can be extremely large. The memory and computing requirements for the resulting systems of equations grow correspondingly. It is the size of the state space that is the major limiting factor in the application of these modeling techniques to practical problems. This motivates the investigation of parallel computing for this type of stochastic analysis.

One well-known technique for describing complex stochastic systems in a compact way is Generalized Stochastic Petri Nets (GSPNs) [7, 8]. Petri nets allow a graphical and easily understandable representation of the behavior of a complex system, including timing and conditional behavior as well as the forking and synchronization of processes.

We are interested in using shared-memory multiprocessors for the analysis of GSPN models. Such machines are becoming more widespread, both as mainframe supercomputers and as high-end workstations and servers. The shared-memory programming model is more general than the message-passing paradigm, allowing, for example, concurrent access to shared data structures. On the other hand, the model introduces other difficulties, such as contention for this access, which requires advanced synchronization and locking methods. These will be the subject of this paper. We consider implementations on two different shared-memory multiprocessors: a Convex Exemplar SPP mainframe supercomputer using the proprietary CPS thread programming environment, and a Sun Enterprise multiprocessor server using POSIX threads.

The significance of this work lies in the extension of the ability to model with GSPNs to shared-memory multiprocessors. To our knowledge, no work has been published until now concerning parallel shared-memory state-space generation for stochastic models. An approach for distributed-memory machines was published in [4]. The results of this work should provide faster state-space generation and, in conjunction with parallel numerical algorithms, overall acceleration of the analysis process. In particular, we will be able to better utilize the larger main memories of modern shared-memory multiprocessors. This will also enable the analysis of models whose size has prevented their computation on standard workstations.

In the following section we describe state-space generation for Petri nets. In Section 3 we describe the parallelization issues and the solution techniques we used. Section 4 contains results from the parallel programs, and in Section 5 we give a short conclusion.

[Figure 1 shows the net; its places are UP, CHECKPOINT and DOWN, and its timed transitions are RUN, WRITE_CP, FAIL and REBOOT.]

Fig. 1. Example GSPN Model of a Computer System with Failures and Checkpoints.

2 State-Space Generation for GSPNs

In this section we briefly describe stochastic Petri nets and the automatic generation of their underlying state spaces.

One of the most widely used high-level paradigms for stochastic modeling is Generalized Stochastic Petri Nets (GSPNs) [7, 8]. These are an extension of standard Petri nets that allows stochastically timed transitions between individual states. They have the advantages of being easy to understand and having a natural graphical representation, while at the same time possessing many useful modeling features, including sequence, fork, join and synchronization of processes. The state space, or reachability graph, of a GSPN is a semi-Markov process, from which states with zero time are eliminated to create a Markov chain. The Markov chain can be solved numerically, yielding probability values which can be combined to provide useful information about the net.

A GSPN is a directed bipartite graph with nodes called places, represented by circles, and transitions, represented by rectangles. Places may contain tokens, which are drawn as small black circles. In a GSPN, two types of transitions are defined: immediate transitions and timed transitions. For simplicity, we will not consider immediate transitions any further in this paper. If there is a token in each place that is connected to a transition by an input arc, then the transition is said to be enabled and may fire after a certain delay, causing all these tokens to be destroyed and creating one token in each place to which the transition is connected by an output arc. The state of the Petri net is described by its marking, an integer vector containing the number of tokens in each place. The first marking is commonly known as the initial marking. A timed transition that is enabled will fire after a random amount of time that is exponentially distributed with a certain rate.
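To make the enabling and firing rules concrete, the following C fragment sketches one possible representation; marking_t, transition_t and the fixed bound MAX_PLACES are illustrative choices for this description, not part of any GSPN tool.

#include <string.h>

#define MAX_PLACES 16

/* A marking is an integer vector: tokens per place. */
typedef struct {
    int tokens[MAX_PLACES];
} marking_t;

/* A transition lists the tokens it consumes and produces per place. */
typedef struct {
    int in[MAX_PLACES];    /* input-arc multiplicities  */
    int out[MAX_PLACES];   /* output-arc multiplicities */
    double rate;           /* exponential firing rate   */
} transition_t;

/* A transition is enabled if every input place holds enough tokens. */
static int is_enabled(const marking_t *m, const transition_t *t, int nplaces)
{
    for (int p = 0; p < nplaces; p++)
        if (m->tokens[p] < t->in[p])
            return 0;
    return 1;
}

/* Firing destroys the input tokens and creates the output tokens,
 * yielding the successor marking.                                  */
static void fire(const marking_t *m, const transition_t *t, int nplaces,
                 marking_t *successor)
{
    memcpy(successor, m, sizeof(*successor));
    for (int p = 0; p < nplaces; p++)
        successor->tokens[p] += t->out[p] - t->in[p];
}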

Figure 1 shows a small example GSPN that models a group of computers, each of which may either be running (UP), writing a checkpoint (CHECKPOINT), or failed and rebooting from the last checkpoint (DOWN). The changes between these states are represented by timed transitions with appropriate exponentially distributed rates. In this case, we have modeled two computers (by inserting two tokens into the net, initially in place UP).

[Figure 2 shows the six markings (UP,UP), (CHECKPOINT,UP), (UP,DOWN), (CHECKPOINT,CHECKPOINT), (CHECKPOINT,DOWN) and (DOWN,DOWN), connected by arcs for the transition firings.]

Fig. 2. State Space for Example GSPN.

Note that we could model any number of computers by simply adding tokens accordingly, assuming that the transition rates are independent of the states of the other machines.

Figure 2 shows the state space, or reachability graph, of this GSPN with two tokens. For readability, we have omitted the rates that are attached to the arcs and have used a textual description of the marking vector. Each state corresponds to one possible marking of the net, and each arc to the firing of a transition in that marking. Owing to the simplicity of this particular Petri net, the state space has a regular triangular structure. In general, however, the reachability graph is highly irregular.

In this example, the state-space graph has six nodes. It is easy to see that adding tokens to the net will lead to a rapid increase in the size of the graph. In the general case, the size grows as t^p, where t is the initial number of tokens and p is the number of places in the Petri net. It is for this reason that Petri nets of even moderate complexity may have state spaces whose storage requirements exceed the capacities of all but the largest of computers. In addition, the computation time for the solution of the underlying equations grows accordingly. It is, of course, for these reasons that we are interested in parallel computation.
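As a back-of-the-envelope check (not from the paper): for this particular net every distribution of the t identical tokens over its p = 3 places is reachable, so the number of markings is

    \binom{t+p-1}{p-1} = \binom{2+3-1}{3-1} = \binom{4}{2} = 6,

matching the six nodes of Figure 2; with t = 10 tokens the same net already has \binom{12}{2} = 66 states.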

Figure 3 shows the sequential state-space generation algorithm in pseudo-code form. It utilizes a stack S and a data structure D, which is used to quickly determine whether or not a newly detected marking has previously been discovered. D is typically chosen to be either a hash table or a tree. One of the contributions of this work is the use of a modified B-tree to allow rapid search whilst at the same time minimizing access conflicts. The algorithm performs a depth-first search of the entire state space by popping a state from the stack, generating all possible successor states by firing each enabled transition in the Petri net, and pushing each thus-created new marking back onto the stack if it is one that has not already been generated. Replacing the stack by a FIFO memory would result in a breadth-first search strategy.

 1  procedure generate_state_space
 2      input: initial marking m0
 3      reachability graph R = {}; search data structure D = {}; stack S = {}
 4  begin
 5      add state m0 to R; insert m0 into D; push m0 onto S
 6      while (S != {})
 7          mi = pop(S)
 8          for each successor marking mj to mi
 9              if (mj not in D)
10                  add state mj to R
11                  insert mj into D
12                  push mj onto S
13              endif
14              add arc mi -> mj to R
15          endfor
16      endwhile
17  end generate_state_space

Fig. 3. Sequential State-Space Generation Algorithm.

3 Parallelization Issues

The state-space generation algorithm is similar to other state-space enumeration algorithms, such as the branch-and-bound methods used for the solution of combinatorial problems such as computer chess and the traveling salesman problem. Consequently, it presents similar difficulties in parallelization, namely:

- The size of the state space is unknown a priori. For many Petri nets it cannot be estimated even to within an order of magnitude in advance.

- All processors must be able to quickly determine whether a newly generated state has already been found (possibly by another processor) to guarantee the uniqueness of the states. This implies either a mapping function of states to processors or an efficiently implemented, globally accessible search data structure.

- The state space grows dynamically in an unpredictable fashion, making the problem of load balancing especially difficult.

However, there are also two significant differences from a branch-and-bound algorithm:

- The result of the algorithm is not a single value, such as the minimum path length in the traveling salesman problem or a position evaluation in a game of strategy, but the entire state space itself.

- No cutoff is performed, i.e. the entire state space must be generated.

Our parallelization approach lets different parts of the reachability graph be generated simultaneously. This can be done by processing the main loop of algorithm generate_state_space (Lines 6-16) concurrently and independently on all threads, implying simultaneous read and write access to the global data structures R, D and S. Thus the main problem is maintaining data consistency of the three dynamically changing shared data structures in an efficient way. Our approach applies two different methods to solve this problem: S is partitioned over the different threads, employing a shared stack for load-balancing reasons only, whereas the design of D and R limits accesses to a controlled part of the data structure, which can be locked separately.
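A minimal sketch of this thread structure, assuming POSIX threads; every name below (worker, get_work, put_work, successors, btree_lookup_insert, btree_release, and the marking_t type from the sketch in Section 2) is an illustrative placeholder, not taken from the authors' implementation.

#include <pthread.h>
#include <stdbool.h>

#define MAX_SUCC 64     /* assumed bound on enabled transitions per marking */

/* Hypothetical shared structures and helpers. */
struct rgraph;  struct btree;  struct stacks;
extern struct rgraph R;   /* reachability graph                     */
extern struct btree  D;   /* search data structure (B-tree)         */
extern struct stacks S;   /* shared stack plus per-thread stacks    */

extern bool get_work(struct stacks *st, marking_t *m);        /* pop, or detect termination      */
extern void put_work(struct stacks *st, const marking_t *m);  /* push newly found work           */
extern int  successors(const marking_t *m, marking_t *out);   /* fire all enabled transitions    */
extern bool btree_lookup_insert(struct btree *d, const marking_t *m); /* true if new; node stays locked */
extern void btree_release(struct btree *d, const marking_t *m);       /* unlock that node               */
extern void rgraph_add_state(struct rgraph *r, const marking_t *m);
extern void rgraph_add_arc(struct rgraph *r, const marking_t *from, const marking_t *to);

/* One worker per thread: the main loop of generate_state_space
 * (Lines 6-16), executed concurrently by all threads.            */
void *worker(void *arg)
{
    (void)arg;
    marking_t mi, succ[MAX_SUCC];

    while (get_work(&S, &mi)) {
        int n = successors(&mi, succ);
        for (int k = 0; k < n; k++) {
            /* The B-tree node holding (or receiving) succ[k] is locked inside
             * btree_lookup_insert and stays locked until the arc has been
             * recorded, so R itself needs no lock (Section 3.1).             */
            bool is_new = btree_lookup_insert(&D, &succ[k]);
            if (is_new) {
                rgraph_add_state(&R, &succ[k]);
                put_work(&S, &succ[k]);
            }
            rgraph_add_arc(&R, &mi, &succ[k]);
            btree_release(&D, &succ[k]);
        }
    }
    return NULL;
}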

With respect to control flow there is not much to say: threads are spawned before entering the main loop of algorithm generate_state_space, and termination is tested only rarely, namely when an empty shared stack is encountered (this can easily be done by setting and testing termination flags associated with the threads under mutual exclusion). Thus we can concentrate on the crucial and interesting part of the problem: the organization of the global data structures and the locking mechanisms that we have designed.

3.1 Synchronization

Synchronization is done by protecting portions of the global shared data with mutex variables providing mutually exclusive access.

Because arcs are linked to the data structures of their destination states (mj in Figure 3) rather than their source states (mi in Figure 3), synchronization for manipulating the reachability graph may be restricted to data structure D: marking mj is locked implicitly when looking for it in D, by locking the corresponding data in D and holding the lock until Line 14 has been processed. No barriers are needed in the course of the generation algorithm.

Synchronization within the Search Data Structure. We first considered designing the search data structure D as a hash table, as some (sequential) GSPN tools do, because concurrent access to hash tables can be synchronized very easily by locking each entry of the hash table before accessing it. But there are many unsolved problems in using hash tables in this context. As mentioned earlier, neither the size of the state space is known in advance (making it impossible to estimate the size of the hash table a priori) nor is its shape, which means that the hash function mapping search keys onto hash-table entries would be totally heuristic.

For these reasons, we decided to use a balanced search tree for retrieving already generated states. This guarantees retrieval times that grow only logarithmically with the number of states generated.

The main synchronization problem in search trees is rebalancing: a non-balanced tree could be traversed by the threads concurrently by just locking the tree node they encounter, unlocking it when progressing to the next one, and,


if the search is unsuccessful, inserting a new state as a leaf without touching the upper part of the tree. Rebalancing, which is obviously obligatory for efficiency reasons with large state spaces, means that inserting a new state causes a global change to the tree structure. To allow concurrency, the portion of the tree that can be affected by the rebalance procedure must be anticipated and restricted to as small an area as possible, since this part has to be locked until rebalancing is complete. A state is looked up in D for each arc of R (Line 9 in Figure 3), which will generate a lot of contention if no special precautions are taken.

We found an efficient way to maintain the balance of the tree while allowing concurrent access through the use of B-trees [3, 5]: these are by definition automatically balanced, whereby the part of the tree that is affected by rebalancing is restricted in a suitable way.

Synchronization Schemes on B-trees. A B-tree node may contain more than one search key, which in this context is a GSPN marking. The B-tree is said to be of order β if the maximum number of keys contained in one B-tree node is 2β. A node containing 2β keys is called full. The search keys of one node are ordered smallest key first. Each key can have a left child, which is a B-tree node containing only smaller keys. The last (largest) key in a node may additionally have a right-child node with larger keys.
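One possible C layout for such a node, with one mutex per node as used below; BETA, the marking_t key type from the earlier sketch, and all field names are illustrative assumptions.

#include <pthread.h>

#define BETA 4                            /* B-tree order: a node with 2*BETA keys is full */

typedef struct bt_node {
    pthread_mutex_t lock;                 /* one mutex variable per node                   */
    int             nkeys;                /* current number of keys in this node           */
    marking_t      *key[2 * BETA + 1];    /* keys (GSPN markings), smallest first          */
    struct bt_node *child[2 * BETA + 1];  /* child[i]: subtree with keys smaller than key[i] */
    struct bt_node *right;                /* subtree with keys larger than the last key    */
} bt_node_t;

/* A full node triggers splitting-in-advance on the way down (see below). */
static int bt_full(const bt_node_t *n) { return n->nkeys >= 2 * BETA; }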

Searching is performed in the usual manner, comparing the current key in the tree with the one that is being looked for and moving down the tree according to the results of the comparisons. New keys are always inserted into a leaf node. Insertion into a full node causes the node to split into two parts, promoting one key up to the parent node, which may lead to the splitting of the parent node again, and so on recursively. Splitting might therefore propagate up to the root. Note that the tree is automatically balanced, because the tree height can only increase when the root is split.

The entity that can be locked is the B-tree node containing several keys. Several methods which require one or more mutex variables per node are known [3]. We have observed that using more than one mutex variable causes an unjustifiable overhead, since the operations on each state consume little time. The easiest way to avoid data inconsistencies is to lock each node encountered on the way down the tree during the search. Since non-full nodes serve as barriers for the back propagation of splittings, all locks in the upper portion of the tree can be released when a non-full node is encountered [3]. However, using this approach, each thread may hold several locks simultaneously. Moreover, the locked nodes are often located in the upper part of the tree, where they are most likely to cause a bottleneck. Therefore, and since we do not even know a priori whether an insertion will actually take place, we have developed another method, adapted from [6], that we call splitting-in-advance.

Our B-tree nodes are allowed to contain at most 2β + 1 keys. On the way down the B-tree, each full node is split immediately, regardless of whether an insertion will take place or not. In this way back propagation does not occur, since parent nodes can never be full. Therefore a thread holds at most one lock at a time.

[Figure 4 shows four steps of such an insertion; the locked node in each step is shaded.]

Fig. 4. Splitting-in-Advance when Inserting a Key into a B-tree.

The lock moves down the tree as the search proceeds. For this reason, access conflicts between threads are kept to a small number, thus allowing highly concurrent processing of the search tree. Figure 4 shows the insertion of a key into a B-tree using splitting-in-advance; locked nodes are shown as shaded boxes. As the root node is full, it is split in advance in Step 1. The lock can be released immediately after the appropriate child node has been locked (Step 2). The leaf node encountered there is again split in advance, releasing the lock of its parent (Step 3). The new key can then be inserted in Step 4 without any danger of back propagation.
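A sketch of this descent in C, building on the node layout above; the helper routines are hypothetical and only their assumed behavior is documented in the comments. Such a routine would form the core of a lookup-or-insert operation like btree_lookup_insert in the earlier worker sketch; in this sketch the thread owns the lock of the node it is currently inspecting and, briefly during the hand-over, that of its parent.

/* Assumed helpers (not shown):
 *   bt_find(n, m, &pos)        1 if m is a key of n, else 0 with *pos = child slot to follow
 *   bt_child(n, pos)           child pointer at that slot, or NULL in a leaf
 *   bt_key_equal / bt_key_less_than(n, pos, m)   compare m with the key at slot pos of n
 *   split_root_in_place(r)     split a full root; r keeps its address and its lock
 *   split_child(n, pos, c)     split full child c, moving its middle key up into n at slot pos
 */
extern int        bt_find(bt_node_t *n, const marking_t *m, int *pos);
extern bt_node_t *bt_child(bt_node_t *n, int pos);
extern int        bt_key_equal(bt_node_t *n, int pos, const marking_t *m);
extern int        bt_key_less_than(bt_node_t *n, int pos, const marking_t *m);
extern void       split_root_in_place(bt_node_t *root);
extern void       split_child(bt_node_t *parent, int pos, bt_node_t *child);

/* Descend from the root to the node where marking m belongs, splitting every
 * full node in advance so that a split can never propagate upwards. On return
 * the thread holds exactly one lock: that of the returned node, which either
 * already contains m or is the leaf into which m is to be inserted.           */
bt_node_t *bt_descend(bt_node_t *root, const marking_t *m)
{
    pthread_mutex_lock(&root->lock);
    if (bt_full(root))
        split_root_in_place(root);

    bt_node_t *node = root;
    for (;;) {
        int pos;
        if (bt_find(node, m, &pos))               /* m already present in this node */
            return node;

        bt_node_t *child = bt_child(node, pos);
        if (child == NULL)                        /* leaf reached: caller inserts m */
            return node;

        pthread_mutex_lock(&child->lock);
        if (bt_full(child)) {
            split_child(node, pos, child);        /* parent has room: it is not full */
            if (bt_key_equal(node, pos, m)) {     /* m is exactly the promoted key   */
                pthread_mutex_unlock(&child->lock);
                return node;
            }
            if (bt_key_less_than(node, pos, m)) { /* m belongs in the new right half */
                pthread_mutex_unlock(&child->lock);
                child = bt_child(node, pos + 1);
                pthread_mutex_lock(&child->lock);
            }
        }
        pthread_mutex_unlock(&node->lock);        /* hand the lock down the tree */
        node = child;
    }
}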

Using efficient storage methods, the organization of the data is similar to that of binary trees, whereby B-tree states consume at most one more byte per state than binary trees [1].

Synchronization on the Stack. The shared stack, which stores the as yet unprocessed markings, is the pool from which the threads get their work. In this sense the shared stack implicitly does the load balancing and therefore cannot be omitted. Since it has to be accessed under mutual exclusion, a single shared stack would be a considerable bottleneck, forcing every new marking to be deposited there regardless of whether all the threads are provided with work anyway.

Therefore we additionally assign a private stack to each thread. Each thread mainly uses its private stack, only pushing a new marking onto the shared stack if the latter's depth drops below the number of threads N. A thread only pops markings from the shared stack if its private stack is empty. In this manner load imbalance is avoided: a thread whose private stack has run empty, because it has generated no new successor markings locally, is likely to find a marking on the shared stack, provided the termination criterion has not been reached.
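In C, the scheme might look roughly as follows; thread-local storage via GCC's __thread, the opaque mstack_t type and its push/pop helpers are assumptions of this sketch, and the unlocked read of the depth counter is justified by the considerations that follow.

#include <pthread.h>
#include <stdbool.h>

extern int N;                              /* number of worker threads                */

typedef struct mstack mstack_t;            /* opaque stack of markings (placeholder)  */
extern void push(mstack_t *s, const marking_t *m);
extern bool pop(mstack_t *s, marking_t *m);        /* false if the stack is empty     */

struct stacks {
    mstack_t       *shared;                /* global pool, protected by `lock`        */
    volatile int    shared_depth;          /* written under the lock, read without it */
    pthread_mutex_t lock;
};

static __thread mstack_t *private_stack;   /* one private stack per thread            */

/* Keep new work local unless the shared pool is running low. */
void put_work(struct stacks *st, const marking_t *m)
{
    if (st->shared_depth < N) {            /* unlocked read of the depth */
        pthread_mutex_lock(&st->lock);
        push(st->shared, m);
        st->shared_depth++;
        pthread_mutex_unlock(&st->lock);
    } else {
        push(private_stack, m);
    }
}

/* Prefer private work; fall back to the shared stack only when empty.
 * Returning false means both stacks were empty: the caller then runs
 * the (omitted) termination test before retrying or exiting.           */
bool get_work(struct stacks *st, marking_t *m)
{
    if (pop(private_stack, m))
        return true;
    pthread_mutex_lock(&st->lock);
    bool found = pop(st->shared, m);
    if (found)
        st->shared_depth--;
    pthread_mutex_unlock(&st->lock);
    return found;
}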

The shared stack has to be protected by a mutex variable. The variable containing the stack depth, however, may be read and written with atomic operations, thus avoiding any locking when reading it. This is due to the following considerations:

- The variable representing the depth of the shared stack can be stored in one byte, because its value is always smaller than two times the number of threads: if the stack depth is N - 1 when all the threads are reading it, every thread will push a marking there, leading to a stack depth of 2N - 1 >= N, which causes the threads to use their private stacks again.

- The number of threads can be restricted to N <= 128 = 2^7, so that 2N - 1 can be represented in one byte; this is no loss of generality, since the state-space generation of GSPNs is not a massively parallel application.

3.2 Implementation Issues

Synchronization. Since we have to assign a mutex variable to each B-tree node in the growing search structure, our algorithm relies on the number of mutex variables being limited only by memory size.

Our algorithm can be adapted to the overhead that lock and unlock operations cause on a given machine by increasing or decreasing the order β of the B-tree: an increase in β saves mutex variables and locking operations. On the other hand, a bigger β increases both the overall search time (since each B-tree node is organized as a linear list) and the search time within one node, which also increases the time one node stays locked. Measurements in Section 4 will show that the savings in the number of locking operations are limited and that small values of β therefore lead to better performance.

Waiting at Mutex Variables. Measurements showed that for our algorithm, which locks small portions of data very frequently, it is very important that the threads perform a spin wait for locked mutex variables rather than becoming idle and descheduled. The latter method may lead to time-consuming system calls and voluntary context switches in some implementations of the thread libraries, where threads can be regarded as lightweight processes that are visible to the operating system (e.g. Sun Solaris POSIX threads). Unfortunately, busy waiting makes a tool-based analysis of the waiting time for mutex variables difficult, since idle time is converted to CPU time and becomes transparent to the analysis process.
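Where the thread library does not offer spinning mutexes directly, the effect can be approximated with a trylock loop; a minimal sketch using standard pthreads calls (the function name is illustrative):

#include <pthread.h>

/* Acquire a mutex by spinning instead of blocking: the thread keeps
 * retrying in user space and is never descheduled while it waits.
 * Sensible only because the critical sections here are very short.   */
static void spin_lock_mutex(pthread_mutex_t *m)
{
    while (pthread_mutex_trylock(m) != 0)
        ;   /* busy wait */
}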

Memory Management. Our first implementation of the parallel state-space generation algorithm did not gain any speedups at all. This was due to the fact that dynamic memory allocation via the malloc() library function is implemented as a mutually exclusive procedure. Therefore, two threads that each allocate one item of data supposedly in parallel always need more time than one thread would need to allocate both items.

Since our sparse storage techniques make intensive use of dynamic allocation, we had to implement our own memory management on top of the library functions: each thread reserves large chunks of private memory and uses special allocate() and free() functions for individual objects. In this way, only a small amount of memory allocation needs to be done under mutual exclusion.
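A minimal per-thread allocator in this spirit, reduced to a bump allocator: chunks are thread-private, so only the rare chunk refill goes through malloc(); the per-object free() mentioned above is omitted here for brevity, and all names and sizes are illustrative.

#include <stdlib.h>

#define CHUNK_SIZE (1 << 20)               /* 1 MB of payload grabbed per refill      */

typedef struct chunk {
    struct chunk *next;                    /* chain of chunks owned by this thread    */
    size_t        used;                    /* bytes of data[] already handed out      */
    char          data[CHUNK_SIZE];
} chunk_t;

static __thread chunk_t *current_chunk;    /* each thread allocates from its own chunk */

/* Allocate a small object from the thread's current chunk; only when the
 * chunk is exhausted does the thread enter the mutually exclusive malloc(). */
void *pool_alloc(size_t size)
{
    size = (size + 7) & ~(size_t)7;        /* keep 8-byte alignment */
    chunk_t *c = current_chunk;
    if (c == NULL || c->used + size > CHUNK_SIZE) {
        c = malloc(sizeof(chunk_t));       /* the only locked library call */
        if (c == NULL)
            return NULL;
        c->next = current_chunk;
        c->used = 0;
        current_chunk = c;
    }
    void *p = c->data + c->used;
    c->used += size;
    return p;
}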

4 Experimental Results

We implemented our algorithms on two different shared-memory multiprocessors:

- A Convex Exemplar SPP multiprocessor with Hewlett-Packard PA-RISC processors. It is a UMA (uniform memory access) machine within each hypernode subcomplex of 8 processors, whereby memory is accessed via a crossbar switch. Larger configurations are of NUMA (non-uniform memory access) type, as memory accesses to other hypernodes go via the so-called CTI ring. We did our measurements on a configuration with one hypernode to be able to better compare with the second machine.

- A Sun Enterprise server with UltraSPARC processors, which can be regarded as UMA, since memory is always accessed via a crossbar switch and a bus system.

Our experiments use a representative GSPN adapted from [7]. It models a multiprocessor system with failures and repairs. The size of its state space can be scaled by initializing the GSPN with more or fewer tokens, which represent the processors of the system. We generated two state spaces, a smaller one referred to as Size S and a larger one referred to as Size M.

Figure 5 shows the overall computation times needed for the reachability-graph generation as a function of the number of processors, measured on the Convex SPP and on the Sun Enterprise for the GSPN of Size M. Figure 6 shows the corresponding speedup values and, additionally, the speedups for the smaller model (Size S). In the monoprocessor versions used for these measurements, all parallelization overhead was removed and the B-tree order was set to its optimal value. The figures show the efficiency of our algorithms, especially of the applied synchronization strategies: for both architectures the speedup is linear. Figure 6 shows that these speedups are nearly model-independent.

Figure 7 shows the dependence of the computation times of the state-space generation on the B-tree order β, for Size M, measured on the Sun Enterprise.

Table 1 gives the total number of locking operations and the number of mutex variables used for different B-tree orders β, for a parallel run with 8 processors. It can be seen that the number of locking operations decreases only moderately, whereas the total number of mutex variables (which is also the number of B-tree nodes) decreases by a much larger factor, when β is increased from 1 to 32. This is due to the fact that for each arc in the state space at least one locking operation has to be performed, and that the number of locking operations per arc depends only on how deep the search moves down the B-tree (see Section 3.1). Thus it becomes clear that a small value of β leads to the best performance (compare Section 3.2).

The number of shared-stack accesses turned out to be negligible when local stacks are used: only a small number of markings was ever popped from the shared stack during all our experimental runs.

[Figure 5 plots the computation time (sec) against the number of processors (1-8) for Size M on the Convex and the Sun.]

Fig. 5. Computation Times, Convex and Sun, Size M.

[Figure 6 plots the speedup against the number of processors (1-8) for Sizes M and S on the Convex and the Sun.]

Fig. 6. Speedups, Convex and Sun, Sizes M and S.

[Figure 7 plots the computation time (sec) against the number of processors for B-tree orders 1, 2, 4, 8, 16 and 32 on the Sun, Size M.]

Fig. 7. Computation Times for Various Values of β, Sun, Size M.

[Figure 8 plots the computation time (sec) against the number of processors with and without local stacks on the Convex and the Sun, Size M.]

Fig. 8. Computation Times with and without Local Stacks, Convex and Sun, Size M.

B-tree order β    Number of Locking Operations    Number of Mutex Variables
 1                ...                             ...
 2                ...                             ...
 4                ...                             ...
 8                ...                             ...
16                ...                             ...
32                ...                             ...

Table 1. Synchronization Statistics for Size M.

Figure 8 compares the computation times with and without the use of local stacks on both multiprocessor architectures for Size M.

5 Conclusion and Further Work

We presented a parallel algorithm for generating the state space of GSPNs, which is mainly based on the use of modified B-trees as a parallel search data structure. Measurements showed good linear speedups on different architectures.

Each B-tree node could be organized as a balanced tree rather than a linear list. However, the measured reduction in the number of locking operations when β is increased (Table 1) leads us to expect only moderate performance improvements from this.

On the other hand, the maintenance of several B-trees rather than one seems to be a promising improvement in the organization of the search data structure: conflicts at the root node could be avoided, thus allowing a higher degree of parallelization.

Acknowledgments. We wish to thank Stefan Dalibor and Stefan Turowski of the University of Erlangen-Nürnberg for their helpful suggestions and for their assistance in the experimental work.

References

[1] S. Allmaier, M. Kowarschik, and G. Horton. State space construction and steady-state solution of GSPNs on a shared-memory multiprocessor. In Proc. IEEE Int. Workshop on Petri Nets and Performance Models (PNPM '97), St. Malo, France, 1997. IEEE Comp. Soc. Press. To appear.

[2] G. Balbo. On the success of stochastic Petri nets. In Proc. IEEE Int. Workshop on Petri Nets and Performance Models (PNPM '95), Durham, NC, 1995. IEEE Comp. Soc. Press.

[3] R. Bayer and M. Schkolnick. Concurrency of operations on B-trees. Acta Informatica, 9:1-21, 1977.

[4] G. Ciardo, J. Gluckman, and D. Nicol. Distributed state-space generation of discrete-state stochastic models. Technical report, ICASE, NASA Langley Research Center, Hampton, VA, 1995.

[5] D. Comer. The ubiquitous B-tree. Computing Surveys, 11(2):121-137, 1979.

[6] L. J. Guibas and R. Sedgewick. A dichromatic framework for balanced trees. In Proc. 19th Symp. on Foundations of Computer Science, pages 8-21, 1978.

[7] M. Ajmone Marsan, G. Balbo, and G. Conte. Performance Models of Multiprocessor Systems. MIT Press, 1986.

[8] M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modelling with Generalized Stochastic Petri Nets. Wiley Series in Parallel Computing, 1995.