
The International Journal of Time-Critical Computing Systems, 18, 275–288 (2000). © 2000 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

WCET Analysis of Superscalar Processors Using Simulation With Coloured Petri Nets

FRANK BURNS, Department of Computing Science, University of Newcastle upon Tyne, UK

ALBERT KOELMANS, Department of Computing Science, University of Newcastle upon Tyne, UK

ALEXANDRE YAKOVLEV, Department of Computing Science, University of Newcastle upon Tyne, UK

Abstract. Determining a tight WCET of a block of code to be executed on a modern superscalar processor architecture is becoming ever more difficult due to the dynamic behaviour exhibited by current processors, which include dynamic scheduling features such as speculative and out-of-order execution in the context of multiple execution units with deep pipelines. We describe the use of Coloured Petri Nets (CP-nets) in a simulation-based approach to this problem. A complex model of a generic processor architecture is described, with emphasis on the modelling strategy for obtaining the WCET and an analysis of the results.

Keywords: WCET, superscalar processors, Coloured Petri Nets, modelling

1. Introduction

New processor architectures are becoming ever more complex, in order to increase their speed. Architectural features include deep pipelines, out-of-order and parallel execution of instructions, and the use of multi-level caches. This makes it difficult to obtain accurate timing estimations of programs that run on these processors, since the execution time may vary according to the sequence of instructions and the data being processed. This, in turn, makes it difficult to obtain a tight WCET for blocks of code. It is clear that processors will continue to become more complex still in the future, further exacerbating this problem.

Work on WCET analysis has traditionally been carried out at different levels of abstraction. At the highest level, the use of closed formulae (Zhang, Burns, and Nicholson, 1993; Puschner and Koza, 1989; Shaw, 1989; Park, 1993) or algebraic methods (Choi, Lee, and Kong, 1994) allows relatively large programs to be modelled, but until recently has only been applied to relatively simple pipelined processor architectures (Narasimham and Nilsen, 1994). Some recent work has been done on superscalar modelling of WCET (Lim et al., 1998; Schneider and Ferdinand, 1999), but this has been restricted to a fairly simple model of in-order multiple issue. More accurate is the use of numerical methods such as integer linear programming, ILP (Li, Malik, and Wolfe, 1995), but such methods carry a big penalty as the size of the program increases. Finally, at the lowest level we have simulation methods (Healy, Whalley, and Harmon, 1995), which have been applied to both pipelines and caching effects. These methods are potentially the most accurate, but can only be used on small sections of code, because of the very time-consuming nature of simulation. Simulation methods are therefore most useful in combination with a higher level analysis tool. Our work applies a simulation-based WCET analysis to a model of a complicated superscalar processor which includes speculation as well as out-of-order issue, execution and completion. The model does not take into account caching effects. We describe a complex processor model and analyse its worst case timing behaviour, based on the use of CP-nets (Jensen, 1997). Previous work in the Petri Net area, e.g. (Razouk, 1987), has mainly focused on the use of ordinary Petri Nets. CP-nets can be simulated using the Design/CPN tool (Christensen, Jorgensen, and Kristensen, 1997). The model is not an accurate model of an actual processor, but combines the most important features of existing processors, such as the PowerPC.

A CP-net model, like an ordinary Petri Net model, consists of a collection of places, transitions, and arcs between these places and transitions. The model contains tokens that flow around the model and are stored in the places. CP-nets allow complex data types to be attached to the tokens. In our processor model, each instruction (implemented as a token) is such a complex data object. It consists of attributes reflecting not only its type and operands, but also important information about dependencies, targets, branching etc. All of these aspects affect the temporal profile of the instruction in the overall instruction flow. The flow of the instructions is determined by so-called guards, which are conditions, attached to the arcs of the model, that determine whether a transition is allowed to fire. In the Design/CPN system, guards are specified in the ML programming language (Appel and MacQueen, 1991). These guards therefore determine the dynamic behaviour of the model. The collective set of guards is responsible for implementing the dynamic behaviour of the processor being modelled.
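The role of a guard can be illustrated outside Design/CPN with a minimal Python sketch. It is not the model used in this paper: the classes `Place` and `Transition`, the token field `resolved`, and the single-input, single-output transition are simplifying assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Place:
    """A place stores tokens; here a token is a plain dictionary."""
    name: str
    tokens: List[dict] = field(default_factory=list)


@dataclass
class Transition:
    """Moves one token from `source` to `target` when the guard allows it."""
    name: str
    source: Place
    target: Place
    guard: Callable[[dict], bool]

    def fire(self) -> bool:
        # Only tokens that satisfy the guard enable the transition.
        candidates = [tok for tok in self.source.tokens if self.guard(tok)]
        if not candidates:
            return False
        tok = candidates[0]
        self.source.tokens.remove(tok)
        self.target.tokens.append(tok)
        return True


# Hypothetical tokens: 'resolved' stands in for "all dependencies satisfied".
window = Place("CentralWindow", [
    {"no": 1, "instr": "INT", "resolved": True},
    {"no": 2, "instr": "INT", "resolved": False},
])
executed = Place("Executed")
issue_int = Transition(
    "INT", window, executed,
    guard=lambda tok: tok["instr"] == "INT" and tok["resolved"],
)

issue_int.fire()  # instruction 1 moves to 'Executed'
issue_int.fire()  # no effect: the guard blocks instruction 2
```

Changing the guard changes which tokens may move, which is the mechanism the later sections rely on to steer the simulation towards worst case behaviour.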

Simulation techniques are often used to obtain an average behaviour profile of the model being investigated, due to the fact that there is an element of non-determinism as a result of data dependencies. In this work we combine simulation with analysis in order to avoid this problem. By carefully implementing the guards in such a way that worst case assumptions are always made about the execution of the program, the model can be made to execute in worst case fashion, and a highly accurate WCET is thus obtained.

The paper is organised as follows. We briefly introduce superscalar processors in section 2. Section 3 discusses the main features of the model. In section 4 the use of the model for WCET analysis is described, showing the inadequacies of straightforward simulation, followed by a solution. Section 5 presents the conclusions.

2. Superscalar Processors

Superscalar processors (Johnson, 1991) attempt to execute several instructions in parallel, known as Instruction Level Parallelism (Arvind and Rebello, 1994). The processor performs the following steps: it fetches a number of instructions from memory, determines how many of these instructions can be executed in parallel, assigns an appropriate functional unit for each instruction, issues all scheduled instructions, and writes back the results of the execution to registers and memory. This requires multiple functional hardware units, called resources, say for the execution of integer and floating point operations; many processors contain multiple copies of the same unit, so that several instructions of the same type can be executed in parallel. This is called multiple issue. Many superscalar processors attempt to execute instructions out of order in order to improve performance. Some processors, such as the Pentium, have restrictions that prevent some instruction types from being executed in a particular pipeline. The model described here does not have these kinds of restrictions.

The number of instructions that can be issued in parallel or out of order varies depending on the number of resources available and on the dependencies between instructions. Resource conflicts and instruction dependencies can give rise to hazards that prevent the next instructions from executing. This has an adverse effect on the WCET. Resource conflicts lead to a structural hazard. These need to be detected and the affected instructions need to be stalled, or, alternatively, a suitable schedule needs to be found that will prevent stalling. Instruction dependencies lead to a data hazard. These arise when an instruction depends on the results of a previous instruction, so that its operands are not yet available. For example, when an instruction depends on the value of a register that is written to in a previous instruction, this is called a read-after-write hazard. Whenever an instruction overwrites a value used in the previous one, this is called a write-after-read hazard. If consecutive instructions both write to the same target register, this is called a write-after-write hazard. In all these cases, the instructions should be executed in order, not out of order or in parallel. Johnson (1991) discusses these topics in depth.
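The three data hazards can be expressed compactly in terms of the registers each instruction reads and writes. The following Python sketch is illustrative only; the dictionary layout with `reads` and `writes` sets and the function name `classify_hazards` are assumptions, not part of the processor model.

```python
def classify_hazards(earlier, later):
    """Return the data hazards that `later` has with respect to `earlier`.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    hazards = set()
    if earlier["writes"] & later["reads"]:
        hazards.add("read-after-write")
    if earlier["reads"] & later["writes"]:
        hazards.add("write-after-read")
    if earlier["writes"] & later["writes"]:
        hazards.add("write-after-write")
    return hazards


# Hypothetical three-instruction sequence.
i1 = {"reads": {"R2"}, "writes": {"R1"}}  # R1 := f(R2)
i2 = {"reads": {"R1"}, "writes": {"R2"}}  # R2 := g(R1)
i3 = {"reads": set(), "writes": {"R1"}}   # R1 := constant

print(classify_hazards(i1, i2))  # read-after-write and write-after-read
print(classify_hazards(i1, i3))  # write-after-write
```

An empty result means the pair has no data hazard; whether the two instructions can then be issued together still depends on the resources available.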

3. Model of the Superscalar Processor

This section describes the model of the superscalar processor. The Design/CPN tool and its functionality are described in detail in Jensen (1997), where more detailed references to Coloured Petri Net terms can be found.

Processor instruction types are specified with predefined identifiers using an enumerated color set as follows:

color Instr = with INT | FPADD | MUL | DIV | BRA | NOOP;

Here each identifier corresponds to an instruction type. For example, INT and FPADD are used to define integer and floating point operations, respectively; BRA defines branch instructions, and NOOP defines no-operation instructions. It is then possible to define a record with appropriate fields to model an instruction:

color Value = record no:Line * instr:Instr * d:Dep * d':Dep * t:Target timed;

This definition assumes that the instruction has dependencies d and d', one for each possible register dependency, and a destination t, which links those instructions sharing the same target register. Instruction dependencies are linked to the numbers of previously executed instructions; for example the following instruction, number 3, has dependencies on instructions 1 and 2 and the same target as 1:

{no=3,instr=INT,d=1,d'=2,t=1}


Figure 1. CP-net model of a superscalar processor.

The simulation uses these dependencies to control the dynamic behaviour of the model.

The top level view of the CP-net model is shown in Figure 1. The main flow of tokens takes place from the top to the bottom. The model shows the main stages of a processor pipeline: instruction fetch, instruction decode, instruction execution, and the writeback of the execution results to memory.

Instructions are loaded into the model via input place In. The initial marking for this place is shown in the box to the right, where an initial set of instructions is specified. The format of these instructions is such that all dependencies are already specified. It would be too difficult to decode a binary object file within the model. In practice, this is not a serious restriction, since the code can readily be preprocessed outside the Design/CPN environment.

Below the input place is a Fetch transition which is used for fetching instructions into the processor. A place PC is used for holding the program counter token. A queue is used to hold and process fetched instructions.

When the Fetch transition occurs, instructions are loaded into the Instruction Queue. They are then Decoded and moved to the Central Window, where instructions are stored until the processor decides in which order they will be Executed.

The place Stall1 is used to control the fetching of new instructions, for example, to make sure that no buffer overflow can take place. Fetching of instructions is also affected by the execution of incorrectly predicted branches during the Execute stage, so the Stall1 place is also controlled from there.

The place called Instr Control is used to control the number of instructions that the model is allowed to process at the execution stage. It can hold a maximum of four tokens. The flow of tokens will stall if the model attempts to store more than four instructions. This in turn stops the Central Window from overflowing.

The Execute stage is modelled by using a set of transitions to represent the individual execution units DIV, MUL, FPADD, INT. Instruction execution occurs at one of the many execution transitions. The symbol H means that the transition is hierarchical. For example, FPADD is a substitution transition with a pipeline containing four stages. The Execute stage is followed by transitions Memory and Writeback, where the results of the instructions are written back to the memory and the registers. Places are used between transitions for storage, queues, etc.

The place called Stall3 is used to control data hazards. These kinds of hazards occur during the writeback operation of the processor, and are therefore controlled from this part of the model.

On the top left of the diagram is the branch prediction unit BPU. It is here that branch speculation is performed. This unit gets its information from the execute stage, when all appropriate data is finally available, so that the decision can be made whether to branch or not. The unit stores the result of the branch, and makes the result available when the same branch is encountered later.
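The behaviour described for the BPU, storing the outcome of a branch and offering it again when the same branch recurs, resembles a last-outcome predictor. The Python sketch below illustrates that idea only; the class name, the use of the instruction number as the key, and the default prediction for a branch not seen before are assumptions rather than details of the CP-net model.

```python
class LastOutcomePredictor:
    """Remembers the most recent outcome of each branch instruction."""

    def __init__(self, default_taken=False):
        self.history = {}                   # branch number -> last outcome
        self.default_taken = default_taken  # assumed guess for unseen branches

    def predict(self, branch_no):
        return self.history.get(branch_no, self.default_taken)

    def update(self, branch_no, taken):
        # Called from the execute stage once the real outcome is known.
        self.history[branch_no] = taken


bpu = LastOutcomePredictor()
first_guess = bpu.predict(12)   # unseen branch: falls back to the default
bpu.update(12, taken=True)      # execute stage reports the actual outcome
second_guess = bpu.predict(12)  # the stored outcome is now returned
```

A misprediction is the case where the value returned by `predict` differs from the outcome later passed to `update`; in the model this is what feeds back into the Stall1 place.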

To model processor timing we use the global clock feature of the simulator, as described in Jensen (1997). Specific timing values for a particular processor have to be derived from processor data sheets and similar sources of data.

A very detailed description of the overall model, including its hierarchical structure, falls outside the scope of this paper. The interested reader is referred to Burns, Koelmans, and Yakovlev (1999).

3.1. Multiple and Out of Order Issue

In the multiple issue model four fixed length instructions are fetched at a time as a block. This is achieved by defining a block of instructions as a record:

1`{blockno=1,
   b1={no=1,instr=INT,d=0,d'=0,t=0},
   b2={no=2,instr=INT,d=1,d'=0,t=0},
   b3={no=3,instr=FPADD,d=0,d'=2,t=0},
   b4={no=4,instr=FPADD,d=0,d'=0,t=0}}


Figure 2. Multiple, out of order issue.

Fetched instructions are passed into a queue (see Figure 2). Instructions from the queue are considered for decode and are replenished by a new block of instructions when they have been decoded. Decoded instructions are passed into a buffer. The buffer, called an instruction window or central window, CW, is placed between the instruction decoder and the functional units.

To determine the number of instructions to be issued in parallel, certain dependency tests have to be made. This is done by using guards and code segments at transitions, denoted by the symbol C at the transitions Decode and INT. These are used to test the dependencies between instructions before allowing instructions to be issued in parallel to the functional units.

For example, transition guards at execute transition INT are used to test for writeback of results before an instruction is allowed to issue from the central window.

To issue out of order implies using buffers or windows in which to store instructions waiting to execute. Tokens collect in the central window and represent not yet issued instructions. Dependent instructions are held in the window until dependencies are resolved. Independent instructions are issued randomly from the window. To control its size, an instruction control size node, Instr Control, is used.

Another way of implementing the instruction window is to distribute individual buffers called reservation stations to each of the functional units, buffering instructions destined for a particular functional unit at the input of that functional unit. Multiple windows or reservation stations can be modelled in Petri nets by using a number of buffer places, one for each reservation station.

Out of order writeback and completion are modelled. However, the model presented here does not model a reorder buffer. Details of the model of speculation and recovery can be found in Burns, Koelmans, and Yakovlev (1999).


Figure 3. Block diagram of WCET analysis.

4. WCET Analysis

Once the Design/CPN model of the superscalar processor is defined it is used in the WCET analysis process. The overall process of WCET analysis is shown in Figure 3. The processor architecture is first modelled as a set of hierarchical CP-nets utilizing a library of Design/CPN functions which are compiled together with the model. In order to apply WCET analysis to a series of instructions, the code to be analysed must first be preprocessed outside the model. Initially the C source code is compiled and assembly output is generated from it. This is preprocessed to generate a token list (CPN input) which contains information about dependencies, targets etc:

movl $4, R1   ↦  {no=1,instr=INT,d=0,d'=0,t=1}
movl $8, R2   ↦  {no=2,instr=INT,d=0,d'=0,t=2}
addl $10,R1   ↦  {no=3,instr=INT,d=1,d'=0,t=1}
subl $10,R2   ↦  {no=4,instr=INT,d=2,d'=0,t=2}

At this stage, data independent information is extracted from the code about branching, loop limits etc. Some manual input is required to determine loop bounds to be used later in the WCET analysis process.
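The mapping from assembly to tokens can be sketched in Python as follows. This is a guess at a preprocessing rule that happens to reproduce the four tokens listed above, not the preprocessor actually used: the function `to_tokens`, the tuple layout of `program`, and the rules "d and d' name the most recent writer of each source register" and "t names the first instruction that wrote the target register" are all assumptions.

```python
def to_tokens(program):
    """Turn (instr_type, source_registers, target_register) tuples into tokens."""
    last_writer = {}   # register -> number of the most recent instruction writing it
    first_writer = {}  # register -> number of the first instruction writing it
    tokens = []
    for no, (instr, sources, target) in enumerate(program, start=1):
        # At most two register dependencies; 0 means "no dependency".
        deps = [last_writer.get(reg, 0) for reg in sources][:2]
        deps += [0] * (2 - len(deps))
        # Instructions sharing a target register are linked to its first writer.
        t = first_writer.setdefault(target, no)
        last_writer[target] = no
        tokens.append({"no": no, "instr": instr, "d": deps[0], "d'": deps[1], "t": t})
    return tokens


# The four-instruction example, with immediates omitted from the source lists.
program = [
    ("INT", [], "R1"),      # movl $4, R1
    ("INT", [], "R2"),      # movl $8, R2
    ("INT", ["R1"], "R1"),  # addl $10,R1
    ("INT", ["R2"], "R2"),  # subl $10,R2
]
tokens = to_tokens(program)
for tok in tokens:
    print(tok)
```

Under these assumptions the third token comes out with d=1 and t=1, because R1 was last (and first) written by instruction 1, matching the listing above.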

Figure 4. Example instruction timing diagram.

Once the preprocessing is done, the instructions are entered into the CP-net model at the input place and compiled using the Design/CPN compiler. Simulation then proceeds together with WCET analysis until a WCET value can be established. During this process full simulation of the code is not necessary, as limited processing of loops is used as part of the overall technique to cut down the analysis time. Simulation time is cut down for those loops that are highly repetitive and where each loop execution time is observed not to change during execution of the main loop body. The initial and last loop iterations, which are subject to speculation errors, are accounted for separately. The loop bounds derived from the preprocessing stage are applied later when calculating the full loop WCET.
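One plausible reading of this loop shortcut is that only the first, one steady-state, and the last iteration need to be simulated, after which the loop bound scales the steady-state time. The Python sketch below shows that arithmetic; the function name `loop_wcet`, the formula, and the numbers in the example are illustrative assumptions, not values or code from our analysis.

```python
def loop_wcet(first_iter, steady_iter, last_iter, bound):
    """Estimate the WCET of a loop from three simulated iteration times.

    first_iter and last_iter are the separately simulated first and last
    iterations; steady_iter is the unchanging time of an iteration in between.
    """
    if bound < 2:
        raise ValueError("this sketch assumes at least two iterations")
    return first_iter + (bound - 2) * steady_iter + last_iter


# Made-up timings in clock units for a loop bounded at 100 iterations.
estimate = loop_wcet(first_iter=14, steady_iter=10, last_iter=12, bound=100)
print(estimate)  # 14 + 98 * 10 + 12 = 1006
```

The saving comes from simulating three iterations instead of one hundred; the estimate is only as good as the observation that the intermediate iterations really do take a constant time.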

4.1. Straightforward Simulation

After the processor model is simulated the results are analysed. A huge text file is generated by the Design/CPN tool which, amongst other things, contains timing data. This shows the time of execution of transitions and their outputs. These are extracted and then displayed using a graphical interface that we have developed. Figure 4 shows a typical graphical output for a simulation from the tool. From left to right instructions are shown on the time axis; at the left side, the processor units are displayed. Instructions are first fetched as blocks into the queue. Subsequently they can be seen to execute in the functional units before being allowed to write back out of order.

Code generated for the JPEG compression algorithm has been used as input to the model. Analysis of the results has shown that with straightforward simulation there is a variation in the measurements, which ranges approximately from 22000 to 26000 clock units. This represents a variation of about 20%. This is due to the non-determinism and contention inherent in the model. In a non-deterministic model, independent parallel instructions which are in conflict are issued randomly, so a given simulation run may not exercise the worst case ordering; straightforward simulation is therefore inadequate for the purpose of finding a close WCET. Our aim is, of course, to force the simulation to work in such a way that the worst case estimate is obtained in all cases, by making sure that the simulator works at the top range of this variation. The following sections explain how this is achieved by modifying the behaviour of the CP-net.


Figure 5. Contention between instructions.

4.2. Non-Determinism and Contention

In order to attain a close approximation for the WCET, contention must be accounted for. It occurs when hazards are present. For example, consider the set of instructions stored in the Central Window in Figure 5 (see also Figure 6 at time 30). The box in Figure 5 shows the contents of the central window: one FPADD instruction and three INT instructions, in which dependencies for the FPADD and two of the INT instructions are resolved. INT instruction 7 will be resolved after a delay of 10, because it has a dependency on instruction 2. The FPADD instruction can be issued straight away, but only one INT instruction may be executed at a time, as there is only one integer functional unit, and therefore only one can be issued. Four sequences of the INT instructions are therefore possible, one in-order, [5,6,7], and three out-of-order, [5,7,6], [6,5,7], and [6,7,5].
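The four sequences can be reproduced by enumerating every order in which a single integer unit could issue the three INT instructions. The Python sketch below does this under two assumptions that are not stated explicitly above: the integer unit takes 10 time units per instruction, and it accepts the next instruction only when the previous one has finished. The function `issue_orders` and the `ready_at` table are hypothetical.

```python
def issue_orders(ready_at, fu_time, start):
    """Enumerate all issue orders for one functional unit.

    ready_at maps instruction number -> time its dependencies are resolved.
    """
    orders = []

    def extend(order, remaining, now):
        if not remaining:
            orders.append(order)
            return
        ready = [i for i in sorted(remaining) if ready_at[i] <= now]
        if not ready:
            # The unit idles until the earliest remaining instruction is ready.
            extend(order, remaining, min(ready_at[i] for i in remaining))
            return
        for i in ready:
            extend(order + [i], remaining - {i}, now + fu_time)

    extend([], set(ready_at), start)
    return orders


# Instructions 5 and 6 are ready at time 30; instruction 7 after a delay of 10.
ready_at = {5: 30, 6: 30, 7: 40}
orders = issue_orders(ready_at, fu_time=10, start=30)
print(orders)  # [[5, 6, 7], [5, 7, 6], [6, 5, 7], [6, 7, 5]]
```

Instruction 7 never appears first because it is not yet resolved at time 30, which is why only four of the six permutations occur.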

4.3. Effect of Contention on WCET

Because contention affects the order of issue, the delays of the paths which cover the conflicting instructions will also be affected by the order of issue. The critical path may also be varied depending on the order in which the instructions are issued, ultimately affecting the WCET.

We now show that the effects of contention on the WCET can be resolved by taking into account the time over which the contention occurs. Referring to Figure 5, one of the sequences [5,6,7], [5,7,6], [6,5,7], or [6,7,5] will lead to the longest path delay from the set of possible delays. We refer to this as the worst case path delay (WCPD). It is defined as follows:

Definition 1. The Worst Case Path Delay WCPD for a set of paths A is the path delay in which the ordering of the contending instructions results in the longest execution time.


Figure 6. The effect of contention on execution times.

The maximum contention time MCT is the range of time over which the contending instructions conflict. Each instruction has an MCT over which other instructions compete with it for issue. It is defined as follows:

Definition 2. The Maximum Contention Time or Delay $MCT_i^t$ for an instruction $i$ of type $t$ is

$$MCT_i^t = \sum_{j=0}^{TN} \frac{NC_{ij} \times FU_t}{N_t}$$

where

$NC_{ij}$ = 1 if instruction $j$ contends with $i$, otherwise 0

$FU_t$ = execution time of the functional unit of type $t$

$TN$ = all instructions entering central window with instruction $i$

$N_t$ = number of resources of type $t$

Figure 6 shows two graphs of the overall sequence of instructions; the WCPD is shown in part (a). As can be seen, the path delay may vary depending on the different cases of issue. For sequences [6,5,7] and [6,7,5], when instruction 6 is issued first, the path delay to instruction 9 will be a minimum, shown in part (b), and for sequence [5,7,6], when instruction 6 is issued last, the path delay to instruction 9 will be a maximum, shown in part (a), resulting in the WCPD. The maximum contention time or delay caused by contention between the instructions from time 30 is 20.
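The value of 20 can be checked against Definition 2 with a few lines of Python. The sketch assumes an integer unit execution time of 10 and a single integer unit, which is consistent with the figures quoted but not stated as such; the function name and the `contends` table (including the label used for the FPADD instruction) are hypothetical.

```python
def max_contention_time(contends, fu_time, n_units):
    """Definition 2: sum over j of NC_ij * FU_t / N_t.

    contends maps each other instruction j to NC_ij (1 if it contends, else 0).
    """
    return sum(nc * fu_time / n_units for nc in contends.values())


# Seen from one INT instruction: the two other INT instructions contend with
# it, the FPADD instruction uses a different functional unit and does not.
contends = {"INT_a": 1, "INT_b": 1, "FPADD": 0}
mct = max_contention_time(contends, fu_time=10, n_units=1)
print(mct)  # 20.0
```

With two integer units instead of one the same call returns 10.0, which shows why adding resources narrows the contention range.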

Because of the effects of contention, to calculate an accurate WCET, the maximum contention delay must be taken into account for all contending instructions in the executed code. In the model, the WCET is the WCPD for the simulated code. The WCPD is calculable for a set of paths A in which contention occurs by adding the MCT to the set of path delays for A, called PD(A), when there are no resource restrictions, i.e., the maximum parallel case. This leads to the following lemma:

Lemma 1. The WCPD for a set of paths A which contain the set of contending instructions C is found by adding the maximum contention delays MCT for the contending instructions C to PD(A) before the resource restrictions are applied.

Proof: The WCPD of A before C occurs = $PD(A_1) = PD(A_2) = \ldots = PD(A_n)$, because all contending instructions are ready to fire at the same time. The WCPD of A after C occurs = $PD(A_1) + MCT_1 = PD(A_2) + MCT_2 = \ldots = PD(A_n) + MCT_n$.

Here all paths $A_i \in PD(A)$ are obtained from the reachability graph for the timed CP-net. This principle can be applied recursively to cover all paths to find the WCET using Lemma 2.

Lemma 2. Given the addition of all maximum contention delays MCT for all occurring contending instructions in all paths in the executed code, the WCPD will be the longest resulting path delay and will be the one which gives the WCET.

Although the MCT can be applied theoretically, there is a problem with applying it in practice to the simulation model. The Design/CPN simulator works sequentially, by using a reference time which is added to each token after each transition is fired before being updated itself. The MCT cannot be added to each contending instruction as it is issued, because the simulation time is updated for each contending instruction after it is issued, leading to a variation in time for each instruction.

In practice, an approach is taken which takes into account the current simulation time. Avariable instruction contention delay ICT must be added to each instruction at simulationtime after it is issued, which takes into account the sequential simulation time updates.

The ICT is defined as follows:

Definition 3. The Instruction Contention Time or Delay $ICT_i^t$ for an instruction $i$ of type $t$ is

$$ICT_i^t = \sum_{j=0}^{CN} \frac{NC_{ij} \times FU_t}{N_t}$$

where

$NC_{ij}$ = 1 if instruction $j$ is of type $t$ and contends with $i$, otherwise 0

$FU_t$ = execution time of functional unit of type $t$

$CN$ = number of instructions in central window at $ST$ when $i$ is issued

$ST$ = sequential simulation time

$N_t$ = number of resources of type $t$


The WCPD is calculable for a set of paths A in which contention occurs by adding the ICT value for each contending instruction. This is shown in the following theorem:

Theorem 1. C is a set of contending instructions to be issued. The WCPD for the set of paths A which contain C can be found after issuing and executing instruction $i \in C$ by adding together the simulation time before $i$ is issued, $ICT_i^t$, and the execution time of the functional unit on which $i$ is executed.

Proof: The WCPD after $i$ is issued = $ST$ (before $i$ is issued) + $FU_t$ + $ICT_i^t$ = $ST$ (before all contending instructions $\in C$ are issued) + $FU_t$ + $MCT$ = WCPD after all contending instructions $\in C$ are issued.
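The equality in the proof can be checked numerically for the three contending INT instructions of Figure 5. The Python sketch below assumes a single functional unit with execution time 10, and assumes that the sequential simulation time advances by one execution time each time a contending instruction is issued; both are illustrative assumptions about the simulator, and `completion_times` is a hypothetical name.

```python
def completion_times(st_start, fu_time, n_contending):
    """ST + FU_t + ICT for each contending instruction on a single unit."""
    times = []
    st = st_start
    for issued in range(n_contending):
        # Instructions still waiting in the window when this one is issued.
        still_contending = n_contending - 1 - issued
        ict = still_contending * fu_time   # Definition 3 with N_t = 1
        times.append(st + fu_time + ict)
        st += fu_time  # assumed clock update after each issue
    return times


times = completion_times(st_start=30, fu_time=10, n_contending=3)
mct = (3 - 1) * 10  # Definition 2 for the same three instructions
print(times, 30 + 10 + mct)
```

Every entry equals the start time plus the unit delay plus the MCT: the shrinking ICT exactly compensates for the growing simulation time, which is the content of Theorem 1 for this case.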

As the simulation time ST varies, a variable instruction contention time ICT is added to each contending instruction after it is issued, depending on the number that it was contending with when it was issued. Design/CPN allows functional control of time for transitions and arcs. We can therefore add the additional delay in the model on the outgoing arc of the respective functional unit using the arc expression execute @+ ICT_i as shown in Figure 1. The following ML functions, which are invoked when the functional unit transitions are fired, are used to check the contention for instructions, and are also used to work out the variable instruction contention times.

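(* 1 if k has type i and both of its dependencies are resolved, else 0 *)
fun chkd (k: Value, i: Instr, d: Dep) =
  if #instr k = i andalso
     exists (isn (#d k)) (!wbl) andalso
     exists (isn (#d' k)) (!wbl)
  then 1 else 0;

(* contention delay: contending count div NOFU, times the unit delay *)
fun ct (NOFU: Count, k: Value) =
  (count (chkd, #instr k, #d k, !clist) div NOFU) * (!FUdelay);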

The function ct counts the total number of instructions that are contending. The function chkd checks contention between instructions by ensuring that the instruction dependencies #d and #d' have been resolved. A similar method can be applied which adds the minimum contention time to each instruction in the model to find the range over which the contention exists. This is possible by subtracting contention times at the output of the functional units. ML functions are used to process each instruction and alter their times accordingly.
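For readers unfamiliar with ML, the two functions can be paraphrased in Python. This is a loose analogue rather than a translation: the explicit `window` and `resolved` arguments stand in for the references `clist` and `wbl`, whose exact contents are not described here, and the convention that dependency 0 ("none") is always in `resolved` is an assumption. All instruction numbers in the example are made up.

```python
def chkd(k, instr_type, resolved):
    """1 if k has the given type and both of its dependencies are resolved."""
    same_type = k["instr"] == instr_type
    return 1 if same_type and k["d"] in resolved and k["d'"] in resolved else 0


def ct(n_units, k, window, resolved, fu_delay):
    """Contention delay for k: contending count div units, times unit delay."""
    contending = sum(chkd(other, k["instr"], resolved) for other in window)
    return (contending // n_units) * fu_delay


# Instruction being issued, and the instructions still waiting in the window.
k = {"no": 5, "instr": "INT", "d": 0, "d'": 0}
window = [
    {"no": 6, "instr": "INT", "d": 1, "d'": 0},
    {"no": 7, "instr": "INT", "d": 2, "d'": 0},
    {"no": 8, "instr": "FPADD", "d": 0, "d'": 0},
]

delay_all_resolved = ct(1, k, window, resolved={0, 1, 2}, fu_delay=10)
delay_one_pending = ct(1, k, window, resolved={0, 1}, fu_delay=10)
print(delay_all_resolved, delay_one_pending)  # 20 10
```

The integer division mirrors the `div` in the ML version, so with two units and two contenders the delay is a single unit delay.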

4.4. Simulation Results

The simulation with the new additions was applied to the JPEG algorithm using two modes. The first one finds the effect of maximum contention in the model, and the second one the minimum. The number of functional units was varied for each mode and the simulation was carried out in two phases: one with the model set to allow instructions to issue in-order and the other set to allow them to issue out of order. A set of data obtained from simulation was analysed for the effects of minimum and maximum contention and average values were calculated for all WCET (min-max) values. The following results were obtained for in-order issue:

FU     WCET (min-max)      %range   error    WCET (real)
1FU    267545 - 267545     0%       ±1%      270730

The following results were obtained for out of order issue:

FU     WCET (min-max)      %range   error    WCET (real)
1FU    221096 - 255420     19%      ±1.5%    259260
2FU    199956 - 217540     11%      ±1%      219460
3FU    194776 - 207360     7%       ±0.5%    208000

The percentage range in the table shows the maximum range of contention for differing numbers of functional units, taken from the least WCET(min) and maximum WCET(max) values during simulation. The error column shows the range of errors within the WCET (min-max) average values, calculated from an analysis of all data sets. These point to a slight underestimation of the WCET during simulation. The real calculated measures, derived from an analysis of all the data, are shown in the final column. The errors are due to the non-determinism that is left in the model, arising from arbitration occurring mainly during branching and speculation. The in-order results show that for application of both modes the figures for WCET(min) and WCET(max) are the same. For the out-of-order results, the variation in WCET(min) and WCET(max) decreases as the number of functional units increases.
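One way to read the underestimation is to compare the WCET(max) figure in each row with the corresponding WCET (real) value. The Python sketch below performs that comparison on the tabulated numbers; it is an interpretation offered for illustration, not necessarily the calculation behind the error column, and the names `results` and `underestimation_pct` are hypothetical.

```python
# (WCET min, WCET max, WCET real) copied from the two tables above.
results = {
    "in-order, 1FU": (267545, 267545, 270730),
    "out-of-order, 1FU": (221096, 255420, 259260),
    "out-of-order, 2FU": (199956, 217540, 219460),
    "out-of-order, 3FU": (194776, 207360, 208000),
}


def underestimation_pct(wcet_max, wcet_real):
    """Percentage by which the simulated maximum falls short of the real value."""
    return 100.0 * (wcet_real - wcet_max) / wcet_real


gaps = {
    name: underestimation_pct(wcet_max, wcet_real)
    for name, (_wcet_min, wcet_max, wcet_real) in results.items()
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f}%")
```

The gaps come out at roughly 1.2%, 1.5%, 0.9% and 0.3%, the same order of magnitude as the tabulated errors, and they shrink as functional units are added in the out-of-order case.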

5. Conclusions

CP-nets and Design/CPN provide the designer of a real-time system with both qualitative analysis of reachable states and analysis of timing properties, such as worst case execution time for a block of instructions. We have described the main aspects of our approach in calculating the WCET of a superscalar processor with CP-nets, and some results. The technique has been applied to basic models of superscalar processors which cover important features such as pipelining, multiple issue and out of order issue which are applicable to processors such as the PowerPC. Benefits have been derived from the high level modelling capabilities. Typical speed of simulation is about 6000 instructions per minute on a 450 MHz PC.

Results have shown that straightforward simulation cannot be applied to find an accurate WCET measurement due to non-determinism. This is due mainly to contention over resources. By taking the contention into account in the timing model we have managed to eliminate most of the error in the simulation, leading to a close approximation for the WCET. The technique has been tested with results which rate favourably against other techniques. We have managed to narrow down the margin of error in the WCET measurement to a small percentage. A major advantage is that the technique is not limited to small programs but can be applied to larger programs. Further work will be centered on studying the effects of speculation and analysing margins of error in the WCET measurements.


Acknowledgments

The authors would like to thank the editors, Dr. Puschner and Professor Burns, and the anonymous referees for valuable comments on earlier drafts of the paper. This work is supported by EPSRC grant GR/L28098 (TIMBRE).

References

Appel, A. W., and MacQueen, D. B. 1991. Standard ML of New Jersey. In Third Int. Symp. on Programming Languages Implementation and Logic Programming, Maluszynski, J., and Wirsing, M., eds. Lecture Notes in Computer Science, Volume 528, Springer.

Arvind, D. K., and Rebello, V. E. F. 1994. Instruction-level parallelism in asynchronous processor architectures. In Proceedings of the Third Int. Workshop on Algorithms and Parallel VLSI Architectures, Moonen, M., and Catthoor, F., eds. Elsevier Science Publishers, Leuven, Belgium, pp. 203–215.

Burns, F. P., Koelmans, A. M., and Yakovlev, A. V. 1999. Analysing superscalar processor architectures with coloured Petri nets. Int. J. Software Tools for Technology Transfer 2: 182–191.

Choi, J., Lee, J., and Kong, I. 1994. Timing Analysis of Superscalar Programs using ACSR. Technical report, Dept. of Computer and Information Science, University of Pennsylvania.

Christensen, S., Jorgensen, J. B., and Kristensen, L. M. 1997. Design/CPN - A computer tool for coloured Petri nets. In Proceedings of Tacas '97, Brinksma, E., ed. Lecture Notes in Computer Science, Volume 1217, Springer, pp. 209–223.

Diep, T. A. 1995. A visualization based microarchitecture workbench. Carnegie Mellon University, PhD Thesis.

Healy, C., Whalley, D. B., and Harmon, M. G. 1995. Integrating the timing analysis of pipelining and instruction caching. Proc. 16th Conf. Real-Time Systems Symposium.

Jensen, K. 1997. Coloured Petri nets. Basic concepts, analysis methods and practical use. Volume 1, Basic concepts. EATCS Monographs in Theoretical Computer Science, Springer-Verlag, ISBN 3-540-58276-2.

Johnson, M. 1991. Superscalar Microprocessor Design. Prentice Hall.

Li, Y. S., Malik, S., and Wolfe, A. 1995. Performance estimation of embedded software with instruction cache modelling. Proc. Int. Conf. Computer Aided Design.

Lim, S. S., Bae, Y. H., Jang, G. T., Rhee, B. D., Min, S. L., Park, C. Y., Shin, H., Park, K., and Kim, C. S. 1994. An accurate worst case timing analysis technique for RISC processors. Proc. 15th IEEE Real-Time Systems Symp.: 97–108.

Lim, S. S., Han, J. H., Kim, J., and Min, S. L. 1998. A worst case timing analysis technique for multiple-issue machines. Proc. 19th Conf. Real-Time Systems Symposium.

Narasimham, K., and Nilsen, K. D. 1994. Portable execution time analysis for RISC processors. Languages, Compilers, and Tools for Real-Time Systems.

Park, C. Y. 1993. Predicting program execution times by analyzing static and dynamic program paths. J. Real Time Systems 5(1): 31–61.

Puschner, P., and Koza, C. 1989. Calculating the maximum execution time of real-time programs. J. Real Time Systems 1(2): 159–176.

Razouk, R. R. 1987. The use of Petri nets for modeling pipelined processors. Technical Report 87–29, University of California, Department of Information and Computer Science.

Shaw, A. C. 1989. Reasoning about time in higher-level language software. IEEE Trans. Software Engineering 15(7): 875–889.

Schneider, J., and Ferdinand, C. 1999. Pipeline behaviour prediction for superscalar processors by abstract interpretation. Languages, Compilers, and Tools for Embedded Systems.

Zhang, N. N., Burns, A., and Nicholson, M. 1993. Pipelined processors and worst case execution times. J. Real Time Systems 5(4).
