An Information Taxonomy for Discrete Event Simulations
by
Theresa Marie Kiyoko Roeder
B.S. (Case Western Reserve University) 1997
M.S. (Case Western Reserve University) 1999
M.S. (University of California, Berkeley) 2001
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy in
Engineering-Industrial Engineering and Operations Research
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Lee W. Schruben, Chair
Professor J. George Shanthikumar
Professor Hyun-soo Ahn
Professor David R. Brillinger
Spring 2004
An Information Taxonomy for Discrete Event Simulations
Copyright 2004
by
Theresa Marie Kiyoko Roeder
Abstract
An Information Taxonomy for Discrete Event Simulations
by
Theresa Marie Kiyoko Roeder
Doctor of Philosophy in Engineering-Industrial Engineering and Operations Research
University of California, Berkeley
Professor Lee W. Schruben, Chair
Discrete event simulation models are a popular means of decision support because of
their ability to model complex systems with relative ease. This ease, however, can lead
to the inclusion of more detail in the model than is necessary, which, in turn, can lead to
increased development time and a greater number of errors in the model. In this
dissertation, we introduce an information taxonomy to aid in identifying the information
contained in a system.
The taxonomy has a twofold advantage. First, it assists the modeler in organizing
the information on the system, which facilitates model development. Second, it helps the
modeler to identify the information needed to model system characteristics, and to obtain
the desired output. If the user is constrained in the information that can be included in the
model, the taxonomy can help determine whether the model can meet requirements. If
not, more information must be included, or approximations must be developed to
overcome the lack of information.
In this dissertation, we develop two approximations for simulating queueing
systems. The first approximates dedication constraints, where lack of job
information precludes modeling system behavior. This approximation provides upper
and lower bounds on system performance measures. The second approximation is used
to estimate job waiting time distributions, where lack of job information precludes
obtaining the desired output statistics. We show that the approximation converges to the
job-driven estimate of the waiting time distribution for certain systems.
To my family
TABLE OF CONTENTS
1. Introduction........................................................................................1
2. Current Simulation Taxonomy and World Views ............................4
2.1. Resource-Driven versus Job-Driven Simulations ............................................ 4
   2.1.1. Description ................................................................................................ 4
   2.1.2. Comparison ............................................................................................... 6
2.2. Simulation World Views .................................................................................... 7
   2.2.1. Process Interaction .................................................................................... 8
   2.2.2. Activity Scanning ..................................................................................... 10
   2.2.3. Event Scheduling ..................................................................................... 12
   2.2.4. Comparison of Process Interaction and Event Scheduling ..................... 15
2.3. Some Recent Experiences: A Semiconductor Fabrication Plant Model ......... 16
   2.3.1. System Description .................................................................................. 17
   2.3.2. Results ..................................................................................................... 18
2.4. Problems with a Taxonomy Based on the Simulation World Views ........... 21
2.5. Problems with a Taxonomy Based on the Resource-Driven and Job-Driven Paradigms ......................................................................................................... 23
3. Information Taxonomy....................................................................25
3.1. Types of Information ....................................................................................... 26
   3.1.1. General or Entity-Specific ...................................................................... 26
   3.1.2. Subscripted: Resource or Job ................................................................. 28
   3.1.3. Subscripted: Local or Global .................................................................. 29
   3.1.4. Modeling, Statistics, or Both ................................................................... 30
   3.1.5. Static or Dynamic ................................................................................... 31
3.2. Classification ................................................................................................... 31
   3.2.1. Diagram ................................................................................................... 31
   3.2.2. Example: Wafer Fab ............................................................................... 33
   3.2.3. Example: G/G/s Priority Queue with Two Job Types ............................. 39
3.3. Complexity Analysis ........................................................................................ 41
   3.3.1. Memory (Storage) Requirements ............................................................ 43
   3.3.2. Processor Requirements: Simulation Engine ......................................... 45
   3.3.3. Processor Requirements: Data Manipulation ........................................ 46
3.4. Relationship to Current Taxonomies and Formalisms ................................... 46
   3.4.1. Classical World Views ............................................................................ 48
   3.4.2. Resource-Driven and Job-Driven Paradigms ........................................ 50
   3.4.3. Entity-Attribute-Set ................................................................................. 50
   3.4.4. Discrete Event System Specification (DEVS) ......................................... 51
   3.4.5. Conical Methodology .............................................................................. 53
   3.4.6. Object-Oriented Modeling ...................................................................... 54
4. Implications......................................................................................55
4.1. General Implications on Modeling ................................................................. 55
4.2. Implications on Modeling: Service Disciplines .............................................. 56
   4.2.1. Clarification of Terminology .................................................................. 58
   4.2.2. Goals and Assumptions .......................................................................... 59
   4.2.3. Service Discipline Taxonomy ................................................................. 60
      4.2.3.1. General Information ........................................................................ 61
      4.2.3.2. Resource Information ...................................................................... 62
      4.2.3.3. Local Job Information ..................................................................... 62
      4.2.3.4. Global Job Information ................................................................... 63
   4.2.4. Evaluation ............................................................................................... 64
4.3. Desired Approximation Characteristics ......................................................... 65
   4.3.1. Approximation Accuracy ........................................................................ 65
   4.3.2. Computational Requirements ................................................................. 67
   4.3.3. Error Measures ...................................................................................... 68
      4.3.3.1. Existing Measures ........................................................................... 68
      4.3.3.2. Example: Graphical Approach to Measuring Approximation Error ... 70
4.4. Example: Approximating Dedication Constraints .......................................... 72
   4.4.1. Problem Statement .................................................................................. 73
   4.4.2. System Behavior from the Job Perspective ............................................ 76
   4.4.3. System Behavior from the Resource Perspective ................................... 77
   4.4.4. Approximation Implementation .............................................................. 80
   4.4.5. Results .................................................................................................... 80
   4.4.6. Possible Improvements and Future Work .............................................. 86
4.5. Example: Approximating Waiting Time Distributions ................................... 87
   4.5.1. Basic Approach ...................................................................................... 88
   4.5.2. Eliminating Individual Job Information ................................................. 93
      4.5.2.1. Lower Time Slice Counter Method ................................................. 93
      4.5.2.2. Upper Time Slice Counter Method ................................................. 99
   4.5.3. Observations ......................................................................................... 100
   4.5.4. Time Slice Counter Method .................................................................. 102
      4.5.4.1. Further Observations .................................................................... 102
      4.5.4.2. The Time Slice Counter Method Algorithm .................................. 104
   4.5.5. Experimentation: Single-Server System .............................................. 106
   4.5.6. Discussion ............................................................................................. 116
      4.5.6.1. Single-Stage Systems .................................................................... 116
      4.5.6.2. Other Service Disciplines ............................................................. 118
      4.5.6.3. Summary ....................................................................................... 119
4.5.7. Future Work .............................................................................................. 120
5. Conclusions ................................................................................... 122
6. Bibliography.................................................................................. 125
Appendix A. Definitions and Notation...................................................... 131
A.1 Definitions and Abbreviations....................................................................... 131
A.2 Notation ........................................................................................................... 133
   A.2.1 General Notation and Notational Conventions ...................................... 133
   A.2.2 Information Taxonomy ............................................................................ 135
   A.2.3 Approximation Algorithm Characteristics ............................................. 136
   A.2.4 Estimating Dedication Constraints ........................................................ 136
   A.2.5 Approximating Waiting Time Distributions ........................................... 137
Appendix B. Example: Generating Correct Departure Process without Job Information ................................................................................................ 140
B.1 Problem Statement......................................................................................... 140
B.2 Solution Characteristics and Error Measures............................................. 141
B.3 Solution Approach: Discretized-Time Queues ............................................ 144
Appendix C. Dedication Constraints......................................................... 146
C.1 Examples of Systems with Dedication ............................................................ 146
   C.1.1 Wafer Production ................................................................................... 146
   C.1.2 Health Care ............................................................................................ 147
   C.1.3 Web Servers ............................................................................................ 148
   C.1.4 Other ...................................................................................................... 149
C.2 Enhanced Approximation.............................................................................. 149
Appendix D. Approximating Waiting Time Distributions ....................... 151
D.1 Possible Transitions Between Curve Orderings.......................................... 151
D.2 Classification of Uncertain Jobs .................................................................... 152
   D.2.1 Ignore Uncertain Jobs ........................................................................... 153
   D.2.2 Always Classify as (Not) Delayed .......................................................... 153
   D.2.3 Randomly Classify Jobs Using a Fixed Probability .............................. 155
   D.2.4 Randomly Classify Jobs Based on the Current Time Slice Location .... 155
   D.2.5 Using Expectations ................................................................................ 156
   D.2.6 Deterministically Classify Jobs Depending on Current Location ......... 157
   D.2.7 Hidden Markov Models ......................................................................... 159
   D.2.8 Comparison of Uncertain Job Classification ........................................ 161
D.3 Implementation Details ................................................................................... 164
   D.3.1 Classifying Jobs Based on Counters ..................................................... 164
   D.3.2 Jobs Finding an Idle Server .................................................................. 164
   D.3.3 Updating Counters ................................................................................ 165
   D.3.4 Initialization Bias .................................................................................. 166
   D.3.5 Selecting Parameter Values .................................................................. 167
D.4 Multiple-Server Tandem Queueing Systems ................................................... 168
   D.4.1 Experimentation ..................................................................................... 168
   D.4.2 Discussion .............................................................................................. 175
D.5 Extensions ....................................................................................................... 177
   D.5.1 Run-Time Error Estimation ................................................................... 177
   D.5.2 Dynamic Delay Values ........................................................................... 182
   D.5.3 Dynamic Time Slices ............................................................................. 184
   D.5.4 Using More Information: Expected Order Statistics ............................. 185
   D.5.5 Using More Information: LCFS Service Discipline .............................. 186
LIST OF FIGURES

Figure 1. Traditional GPSS block diagram for a G/G/1 queue ........................................ 9
Figure 2. Petri Net (with initial markings) for a G/G/1 queue....................................... 11
Figure 3. Basic event graph component ......................................................................... 13
Figure 4. Event graph for an n-stage G/G/· queueing network...................................... 13
Figure 5. Two-event event graph for an n-stage G/G/· queueing network..................... 14
Figure 6. Information taxonomy ..................................................................................... 32
Figure 7. Event graph for an n-stage queueing network, with dedication constraints modeled using the approximation................................................................... 80
Figure 8. Q_g^tot(t) for the system described above ...................................................... 82
Figure 9. Q_g^tot(t) for the system in Figure 8, with short revisit times ........................ 85
Figure 10. A_approx(t*) and A_exact(t*) as proportions of the total number of jobs processed .... 85
Figure 11. V_approx(t*) and V_exact(t*) as proportions of the total number of jobs processed .... 86
Figure 12. Event graph model for estimating P{delay ≤ γ} ........................................... 90
Figure 13. Sample realization of START and DELAY event occurrences........................ 92
Figure 14. System realization from Figure 13 with approximated DELAY curve ........... 97
Figure 15. P{wait ≤ γ} for various values of γ and several time slice sizes ..................... 99
Figure 16. P{wait ≤ γ} for various values of γ and several time slice sizes using both the Lower and Upper Time Slice Counter Methods ........................................... 100
Figure 17. Accuracy of job classification using the Lower and Upper Time Slice Counter Methods......................................................................................................... 103
Figure 18. Histogram of service times generated as Beta(0.5,2) with mean 1 .............. 109
Figure 19. Approximated probabilities using the Time Slice Counter Method for a lightly-loaded M/M/1 system......................................................................... 111
Figure 20. Approximated probabilities using the Time Slice Counter Method for a heavily-loaded M/M/1 system ....................................................................... 112
Figure 21. Errors for M/M/1 Time Slice Counter Method trials.................................... 113
Figure 22. Errors for U/U/1 Time Slice Counter Method trials..................................... 114
Figure 23. Errors for B/B/1 Time Slice Counter Method trials...................................... 115
Figure 24. Regression residuals versus fitted values ..................................................... 118
Figure 25. Sample queue for 5 parts that visit the same server 3 times......................... 144
Figure 26. Sample discretized queue for the queue given in Figure 25 ......................... 144
Figure 27. Possible transitions between orderings within a time slice .......................... 152
Figure 28. Location of current time in time slice ........................................................... 154
Figure 29. Probabilities integrated over ........................................................................ 158
Figure 30. Area lost by using a point other than ξ = (ξ_m + ξ_i)/2 as a cutoff ............ 159
Figure 31. Mapping of hidden states to observable outputs of the states ...................... 160
Figure 32. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/U/1 implementations ..................................................... 162
Figure 33. Difference in estimation errors for the random and deterministic Time Slice Counter Method U/U/1 implementations...................................................... 163
Figure 34. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/B/1 implementations...................................................... 163
Figure 35. Illustration of relationships of times for updating time slices ...................... 166
Figure 36. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and service times................................................... 171
Figure 37. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Beta service times .......................................... 172
Figure 38. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Uniform service times .................................... 172
Figure 39. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times................................................... 173
Figure 40. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Beta service times .......................................... 173
Figure 41. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Uniform service times .................................... 174
Figure 42. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times, and a LCFS service protocol... 175
Figure 43. Example of job tagging ................................................................................. 180
Figure 44. Time slice counts after doubling the number of time slices .......................... 184
LIST OF TABLES

Table 1. Summary of terminology used by different applications ................................... 5
Table 2. Disadvantages of the job-driven and resource-driven paradigms ................... 7
Table 3. Disadvantages of the implementation approaches ......................................... 16
Table 4. Simulation run lengths for the fab model........................................................ 19
Table 5. Possible orderings of the curves ................................................................... 102
Table 6. Lower and Upper Time Slice Counter Method Classifications .................... 103
Table 7. Formulae for C_D^2 in different systems ........................................................ 107
Table 8. Distributions and their characteristics ......................................................... 108
Table 9. Systems in increasing order of C_D^2 ............................................................ 109
Table 10. Factors included in multiple regression ....................................................... 116
Table 11. Multiple linear regression model of Time Slice Counter Method errors...... 117
Table 12. Example values to illustrate differences in error measures.......................... 143
Table 13. Experiments for multiple-server tandem queueing systems.......................... 170
Table 14. Possible queue sizes and their interpretations.............................................. 186
ACKNOWLEDGEMENTS

I would like to thank my advisor, Lee W. Schruben, for his energy, creativity, and sound
advice. You always kept me on my toes. I would like to thank my committee members,
George Shanthikumar, Hyun-soo Ahn, and David Brillinger, for their guidance and
unfailing willingness to help me. Thank you also for the somewhat unexpected lessons
about Canadian hockey gear manufacturers.
I could always count on my families across two continents for support. I would
especially like to thank my parents, Karen and Lothar, and my brother, Paul. All of my
friends have been very supportive, but I must explicitly thank Laura and Beshara
Elmufdi, Max Robert, Deepak Rajan, Nirmal Govind, and Krista Damer for always being
there for me. My fellow “BSG” members Debbie Pederson and Wai Kin Chan were an
invaluable sounding board for ideas and had no small part in the development of this
dissertation. Thank you for reminding me how to do integrals.
Finally, I would not have been able to do this without my husband, Mike Cating.
Thank you for everything.
1. INTRODUCTION

Computer simulation is one of the most popular Operations Research/Management
Science tools used for decision support. Because its focus is sampling the behavior of a
system rather than finding the exact solution to a problem, it lends itself to the analysis of
complex systems that would be difficult to represent using analytical mathematical
modeling techniques. It is possible to represent non-deterministic behavior over time
using technology no more advanced than personal computers (PCs). However, because
of the relative ease of modeling a wide variety of situations (especially compared to more
traditional mathematical modeling techniques such as linear programming), there is the
temptation to include everything in the model, and the simulation models developed can
be too complex for their intended purpose. On p. 12 of (Seila et al. 2003), the authors
note:
Model building is more art than science and requires much practice to master. It is important to develop a model that is as simple as possible while also being highly credible and having a good chance of being found valid for the anticipated decisions. Most experienced modelers agree that it is better to start with a simple model and add details later if necessary than to start with an excessively complex and realistic model.
Excessive detail is detrimental to model efficacy in several ways. First, it can take longer
to develop the model, as the details must be modeled accurately. Second, modeling
system components that are not necessary can induce errors, reducing the accuracy of the
simulation. Finally, the additional information that must be stored and processed
can make simulation runtimes very long. (The modeling
approach taken also has a large impact on the runtimes. We will discuss this further in
later sections.) For example, a transportation network model with approximately 100
nodes took 18 hours to simulate 10 days (Sturm 2002).
To combat this problem of excessive runtimes, parallel (several replications run at
the same time on several processors) and distributed (single replication run on several
processors) computing algorithms are being developed to devote more processing
resources to the simulations. See, for example, (Jefferson 1983; Witten et al. 1983;
Wonnacott 1996). We take a different, complementary, approach here.
Rather than developing means of devoting more computing power to a model, our
approach focuses on the requirements of the model: by paring away unneeded information and
including only what is necessary to model the system and obtain the desired output
statistics, we can reduce runtimes. Faster simulations can provide more information for
the same amount of resources.
While all the above techniques can be used in any kind of simulation model, we
focus our efforts here on discrete-event systems. These are simulation models where the
system state changes at discrete points in time. In contrast, continuous-time simulations
have state variables changing value throughout time (Law and Kelton 2000). An
alternate definition of a discrete-event system is one whose rules or laws of behavior
change at discrete points in time. For example, we may have a system whose state is
changing constantly up to a threshold, when a different set of rules applies.
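This definition can be made concrete with a minimal event-scheduling loop. The sketch below is hypothetical Python (the function and variable names are illustrative, not drawn from any simulation package discussed later): the clock jumps from one scheduled event to the next, and the state (queue length, server status) changes only at those discrete points.

```python
import heapq

def simulate(arrivals, service_time, horizon):
    """Minimal event-scheduling loop for a single-server queue.

    The clock jumps between scheduled events; between events the
    state (queue length, server status) does not change at all.
    """
    future = []                          # future event list: (time, kind)
    for t in arrivals:
        heapq.heappush(future, (t, "arrive"))
    queue, busy = 0, False               # the complete system state
    log = []                             # (clock, queue, busy) after each event
    while future:
        clock, kind = heapq.heappop(future)
        if clock > horizon:
            break
        if kind == "arrive":
            if busy:
                queue += 1               # server taken: job must wait
            else:
                busy = True              # seize the server immediately
                heapq.heappush(future, (clock + service_time, "depart"))
        else:                            # a service just finished
            if queue > 0:
                queue -= 1               # next waiting job starts service
                heapq.heappush(future, (clock + service_time, "depart"))
            else:
                busy = False             # server goes idle
        log.append((clock, queue, busy))
    return log
```

For instance, `simulate([0, 1, 2], 1.5, 10)` produces six state changes, the last at time 4.5 when the server goes idle; no state variable changes between those event times.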
In Chapter 2, we discuss different approaches to simulation modeling, and the
effects these approaches can have on execution speed and simulation accuracy. These are
known as the classical simulation world views. We also review the resource-driven and
job-driven paradigms proposed in the literature. Chapter 3 introduces a new taxonomy
for the information contained in a system, with special focus on simulation models. We
provide a discussion of the information needed to model different types of service
disciplines. Lack of information can cause aspects of the system to be modeled
incorrectly, or desired output statistics to be unavailable. (Approximations may be used
to overcome these difficulties if the required information is unavailable.)
In Chapter 4, we discuss the implications of the taxonomy. We propose a
taxonomy of service disciplines in Section 4.2. Desired properties of estimation
algorithms, and ways to evaluate their effectiveness, are outlined in
Section 4.3. In Sections 4.4 and 4.5, we describe two approximation methods that allow
us to overcome lack of information and obtain accurate results. Definitions, notation, and
acronyms used throughout this paper are summarized in Appendix A. Further examples
and implementation details are given in Appendices B through D.
2. CURRENT SIMULATION TAXONOMY AND WORLD VIEWS

The research in this dissertation is motivated by work done using a simulation approach
that has come to be known as a resource-driven (RD) simulation. Here, the focus is on
the resident entities in the simulation. Most discrete event simulations in use are known
as job-driven (JD) simulations, where the focus is on the transient entities that pass
through the system (Schruben and Schruben 2001). These two approaches are discussed
in Section 2.1. They are different from the classical simulation world views, the topic of
Section 2.2.
The discussion in this section motivates the taxonomy in Chapter 3. It
summarizes work done in (Roeder et al. 2002; Schruben and Roeder 2003), and outlines
the current state of research in this area. As the literature does not provide much
guidance, the concepts of “information” and “amount of information” are vague in this
section. They will be quantified in Chapter 3. Resource-driven and job-driven
simulations have respectively been referred to as “active server” and “active transaction”
approaches to modeling (Henriksen 1981).
2.1. Resource-Driven versus Job-Driven Simulations
2.1.1. Description
In building and analyzing simulation models, it is useful to classify system entities as
resident or transient. Resident entities (resources) remain part of the system for long
intervals of time, whereas transient entities (jobs) enter into and depart from the system
with relative frequency. In a factory, a resident entity might be a machine; a transient
entity might be a part. Sometimes, it is not clear whether an entity is resident or transient.
The decision depends on the level of detail desired and objectives of the simulation study;
a factory worker might be regarded as a transient entity in one model and a resident entity
in another. Table 1 summarizes the terminology used by different applications in
referring to resources and jobs.
Table 1. Summary of terminology used by different applications

   Technology        Resources                Jobs
   ------------------------------------------------------------------
   Queueing Theory   Servers                  Customers
   Petri Nets        Tokens                   Tokens
   Event Graphs      Resident Entities        Transient Entities
   Arena             Resources                Entities
   Simscript         Permanent Entities       Temporary Entities
   GPSS              Facilities and Storages  Transactions
In describing the dynamic behavior of a system, it is often useful to focus on the cycles of
the resident entities. For example, one might describe the busy-idle cycles of machines
or workers. On the other hand, the focus may be on the paths along which transient
entities flow as they pass through the system (e.g., parts moving through a factory).
Transient entity system descriptions tend to be more detailed than resident entity
descriptions. Typically, a mixture of both viewpoints is used, but one or the other
predominates. (For example, jobs may be traced only at certain points in the system, not
throughout the system.)
In (Schruben and Roeder 2003), the authors propose using these approaches as a
way of classifying simulation models at a high level. It is different from the conventional
taxonomy based on simulation world views described in Section 2.2; the world views are
ways of implementing resource- or job-driven simulations.
2.1.2. Comparison

In highly-congested systems, where there are relatively few resident entities and a great
many transient entities, it is usually more efficient in terms of memory and computational
requirements to study the cycles of the resident entities than to track each transient entity.
These systems include semiconductor factories (“fabs”) with thousands of wafers,
communication systems with millions of messages, and transportation systems with tens
of thousands of vehicles. In simulating such systems, the cycles of resident entities might
be described by the values of only a few variables (e.g., the number of available servers
and the queue size), while the flow of transient entities might require many variables
(representing each job). This is a major disadvantage of job-driven simulations: as the
simulated system becomes congested, the simulation’s memory footprint grows and its
execution slows; in extreme cases, the simulation may consume all system resources and
cause the operating system to fail.
Very large and highly-congested queueing networks (with any number of jobs of
any number of types at any countable number of discrete processing stages) can be
modeled as resource-driven simulations with a relatively small set of integers. The memory footprint
of the simulation is proportional to the number of resources in the system, not the number
of jobs. As a consequence, the simulation is almost insensitive to system congestion; the
number of resources, and therefore the memory requirements, does not change with
congestion. (There may be a computational impact if the number of jobs arriving during
a time period increases because more arrival events take place during that time.)
However, using only integer counts reduces modeling ability and the number of available
output statistics.
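The count-based approach can be made concrete in a short sketch (ours, not the dissertation's code): a G/G/1 queue simulated in next-event fashion whose entire state is the two integers q (jobs waiting) and r (free servers).

```python
import heapq

def gg1_resource_driven(next_ta, next_ts, horizon):
    """G/G/1 queue simulated with only two integers of state:
    q = jobs waiting, r = free servers. next_ta and next_ts are
    zero-argument samplers for interarrival and service times."""
    q, r = 0, 1
    now, last, area_q = 0.0, 0.0, 0.0
    fel = [(next_ta(), 'enter')]           # future event list
    while fel and fel[0][0] <= horizon:
        now, ev = heapq.heappop(fel)
        area_q += q * (now - last)         # accumulate time-weighted queue
        last = now
        if ev == 'enter':
            q += 1
            heapq.heappush(fel, (now + next_ta(), 'enter'))
        else:                              # 'finish'
            r += 1
        if q > 0 and r > 0:                # start service when possible
            q -= 1; r -= 1
            heapq.heappush(fel, (now + next_ts(), 'finish'))
    return area_q / last if last > 0 else 0.0

# Deterministic arrivals every 1.0 with 0.8 service: the queue never builds.
print(gg1_resource_driven(lambda: 1.0, lambda: 0.8, 100.0))  # -> 0.0
```

No job records exist anywhere in this model; the memory footprint is constant regardless of how congested the queue becomes.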
On the other hand, systems where there are only a few transient entities and many
resident entities (a construction project or an airline maintenance facility) may be
efficiently studied by examining the flow of transient entities. The advantage of using a
job-driven model over a resource-driven model is that the detailed experiences of
individual jobs can be tracked easily. This, in turn, allows detailed modeling and output.
The advantages and disadvantages of the two approaches are summarized in Table 2.
Table 2. Advantages and disadvantages of the job-driven and resource-driven paradigms

  Job-Driven
    Disadvantages: large memory footprint for congested systems;
      simulation execution slows as system size increases.
    Advantages: detailed modeling possible; detailed output statistics
      available; easy animation.
  Resource-Driven
    Disadvantages: insufficient information to model certain system
      behaviors a priori; available output limited.
    Advantages: memory requirements insensitive to system congestion;
      simulation execution speed insensitive to system congestion.
2.2. Simulation World Views
A conventional taxonomy for simulation models classifies them according to three
primary simulation world views (“Weltanschauungen”). See, for example,
(Derrick et al. 1989; Derrick 1992; Carson 1993; Page 1994). (Overstreet 1987)
proposes an informal graphical method for translating models between world views. The
terminology of world views was brought into the mainstream by (Kiviat 1969).
However, in that work, they are not used as a means of classifying models. Rather, they
are used to describe the approaches different simulation programming languages (SPLs)
take in the implementation of simulation models. Kiviat’s original interpretation of
world views is taken in (Schruben and Roeder 2003). In the interim, the distinction
between the implementation and the model has become blurred, and the world views
have been used as the means of classification for simulation models. For this reason, we
present the three world views and highlight their advantages and disadvantages in the
following sections.
2.2.1. Process Interaction
The most prevalent traditional world view is the process interaction world view. A
process can be defined as “a set of events that are associated with a system behavior
description” (Kiviat 1969). For example, a job’s progression through the system is a
process. Each job represents a different process. Following (Nance 1981), we define an
event as “a change in object state, occurring at an instant.”
The name “process interaction” derives from the focus on modeling how different
processes in the system interact with each other. For example, loading and unloading
parts into/from a machine and repairing a machine when it fails are separate processes.
In this case, the production process needs the machine and an operator, either of which
may be busy with a repair process. The processes are interacting through their shared
resources. The simulation progresses by activating and deactivating processes, shifting
control between them.
Virtually all commercial simulation packages use a process interaction-based
approach to implement job-driven simulations. For a formal definition of this world
view, see (Cota and Sargent 1992).
The jobs are the active system entities that “seize” available system resources as
needed. This approach typically requires that every step in the processing flow path of
every job in the system be explicitly represented. Records of all jobs in the system are
created and maintained. The jobs move through their processing steps (often represented
by a block flow diagram), seizing and releasing system resources as needed.
Figure 1 shows the GPSS blocks (Schriber 1991) for a single-server queueing
system. Jobs are GENerated with a certain interarrival time TA. They then join the
QUEUE for resource 1. If resource 1 is available, they SEIZE the resource, and
DEPART the queue (computing waiting time statistics). After a service time TS, the job
RELEASEs the resource and is TERMinated.
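The flow above can be mimicked by a toy process-interaction executive in which each job is a coroutine that yields seize/delay/release requests (a sketch of the world view; the engine and its yield protocol are our illustration, not GPSS itself):

```python
import heapq

class Resource:
    """A resident entity with a fixed number of interchangeable units."""
    def __init__(self, capacity=1):
        self.free = capacity
        self.waiting = []                # processes blocked on a 'seize'

class Engine:
    """Toy process-interaction executive. A process is a generator that
    yields ('delay', t), ('seize', res) or ('release', res) requests."""
    def __init__(self):
        self.now, self._seq, self._fel = 0.0, 0, []

    def start(self, proc, delay=0.0):
        heapq.heappush(self._fel, (self.now + delay, self._seq, proc))
        self._seq += 1

    def _resume(self, proc):
        try:
            while True:
                verb, arg = next(proc)
                if verb == 'delay':
                    self.start(proc, arg)         # reactivate later
                    return
                elif verb == 'seize':
                    if arg.free:
                        arg.free -= 1             # unit granted; keep running
                    else:
                        arg.waiting.append(proc)  # passivate until released
                        return
                elif verb == 'release':
                    if arg.waiting:               # hand the unit straight to
                        self._resume(arg.waiting.pop(0))   # the first waiter
                    else:
                        arg.free += 1
        except StopIteration:
            pass                                  # process TERMinates

    def run(self):
        while self._fel:
            self.now, _, proc = heapq.heappop(self._fel)
            self._resume(proc)

def job(eng, server, service_time, departures):
    yield ('seize', server)          # QUEUE / SEIZE
    yield ('delay', service_time)    # service time TS
    yield ('release', server)        # RELEASE
    departures.append(eng.now)       # TERMinate, recording departure time

departures = []
eng, server = Engine(), Resource(capacity=1)
for t in (0.0, 1.0, 2.0):            # deterministic arrivals, 5.0 service
    eng.start(job(eng, server, 5.0, departures), delay=t)
eng.run()
print(departures)                    # -> [5.0, 10.0, 15.0]
```

Note that the simulation logic reads as the life history of a single job; the executive, not the modeler, manages the interleaving of the competing processes.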
Figure 1. Traditional GPSS block diagram for a G/G/1 queue
Process interaction models are inherently job-driven. They are convenient when
predicting system performance and fast simulation execution speed are not as important
as detailed system animation. A major advantage to using this approach is that there
often is a direct mapping from simulation logic to animation – the code focuses on
describing the things in the system that move. Another advantage is that modeling a
system using processes is often simpler than focusing on state-changing events (see
Section 2.2.3 for a discussion of the event scheduling world view). Less abstraction is
required to formulate the model when the focus is on processes, as they are usually the
most natural way to think about a system; this allows users with little training and
experience to build fairly sophisticated models.
2.2.2. Activity Scanning
The second classical world view is the activity scanning world view. Much as the
process interaction world view is closely related to job-driven simulations, activity
scanning is often used to develop a purely resource-driven model (in the sense of models
where only counts of resources and jobs are available). The name “activity scanning”
derives from the way the simulation model progresses: After each transaction, the whole
model is scanned to find whether any new activity can start or finish.
Stochastic Timed Petri Nets (STPNs) are a popular graphical implementation of
activity scanning that can be useful in simulating certain types of resident entity cycles
(Törn 1981; Haas 2002). They are an extension of Petri Nets (Petri 1962). The bars in a
STPN represent transitions, which may be instantaneous or have time delays. The circles
are places containing tokens; the token counts represent the system state. STPNs are a
special case of the event graphs discussed in Section 2.2.3. Their usefulness is limited to
a subset of the models easily simulated by event graphs. See (Schruben 2003) for an
algorithm to convert any Petri Net to a more compact event graph. (There is an
interesting confusion of terminology here due to the term “event graph” having been also
used to denote a very special class of a Petri Nets – a construct completely different from
the one referred to above (Commoner et al. 1971).)
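The token game that drives such nets takes only a few lines; the sketch below (ours) models a single-server queue, with a transition enabled only when its input places hold enough tokens:

```python
def enabled(marking, pre):
    """A transition may fire only if every input place holds enough tokens."""
    return all(marking[p] >= n for p, n in pre.items())

def fire(marking, pre, post):
    """Fire a transition: consume tokens from the input places and
    deposit tokens into the output places."""
    m = dict(marking)
    for p, n in pre.items():
        m[p] -= n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m

# A single-server queue: 'start' needs one waiting job and the idle server.
start_pre, start_post = {'queue': 1, 'idle': 1}, {'busy': 1}
m = {'queue': 2, 'idle': 1, 'busy': 0}
m = fire(m, start_pre, start_post)           # one job enters service
assert m == {'queue': 1, 'idle': 0, 'busy': 1}
assert not enabled(m, start_pre)             # the second job must wait
```

The marking (the token counts) is exactly the integer state of a resource-driven model; a timed net additionally attaches delays to the transitions.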
Figure 2 shows a Petri Net for the G/G/1 queueing system shown in Figure 1.
Figure 2. Petri Net (with initial markings) for a G/G/1 queue
An advantage of activity scanning models is that they can be used to quickly show the
relationships between resources in a system, and how the resource cycles relate to one
another (Kiviat 1969; Fishman 1973). A disadvantage is that they grow very quickly as
the model size increases because of lack of parameterization. This can slow the
execution when the entire model must repeatedly be scanned to find the next activity.
The Three-Phase Method is a modification to the activity scanning paradigm. It
reduces the number of activities that must be scanned by classifying activities as time-
bound (“B” events) and conditional (“C” events). An example of a “B” event is a finish
service, which is known to occur at a fixed time after a start service. Start service events,
on the other hand, require that conditions be met before they can begin service (available
job and resource); they are “C” events. In the Three-Phase Method, the simulation clock
time is advanced to the time of the next “B” event(s). After it is executed, all “C” events
are scanned to see if their conditions are now met. The Three-Phase Method was
introduced in (Tocher 1963).
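As an illustration (ours, with deterministic times for brevity), the three phases for a single-server queue might be sketched as follows:

```python
import heapq
from itertools import count

def three_phase(horizon, ta=1.0, ts=0.8):
    """Single-server queue via the Three-Phase Method (a sketch with
    deterministic inter-arrival time ta and service time ts)."""
    state = {'q': 0, 'r': 1, 'done': 0}
    cal, seq = [], count()                 # calendar of time-bound "B" events

    def bind(t, act):                      # schedule a "B" event
        heapq.heappush(cal, (t, next(seq), act))

    def arrive(now):                       # B: arrival is bound in time
        state['q'] += 1
        if now + ta <= horizon:            # stop generating past the horizon
            bind(now + ta, arrive)

    def finish(now):                       # B: end-of-service is bound in time
        state['r'] += 1
        state['done'] += 1

    def try_start(now):                    # C: start-of-service is conditional
        if state['q'] > 0 and state['r'] > 0:
            state['q'] -= 1; state['r'] -= 1
            bind(now + ts, finish)
            return True
        return False

    bind(0.0, arrive)
    while cal:
        now = cal[0][0]                    # Phase A: advance clock to next B
        while cal and cal[0][0] == now:    # Phase B: execute all due B events
            heapq.heappop(cal)[2](now)
        while try_start(now):              # Phase C: rescan conditional events
            pass
    return state['done']

print(three_phase(2.0))   # arrivals at t = 0, 1, 2 all complete -> 3
```

Only the conditional "C" events are rescanned after each clock advance; the time-bound "B" events sit on the calendar, which is the efficiency gain over a pure activity scan.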
2.2.3. Event Scheduling
The final classical world view is the event scheduling world view (Kiviat 1969). The
three elements of a discrete event system model are the state variables, the events that
change the values of these state variables, and the relationships between the events. The
event scheduling world view focuses on these events. The simulation proceeds by
executing the next event, which, in turn, may schedule further events or cancel events
that have already been scheduled.
Event graphs are a graphical way of implementing an event scheduling-based
simulation (Schruben 1983). They are related to state-transition diagrams in queueing
theory (Ross 1997). In an event graph, events are represented as vertices and the
relationships between events are represented as directed edges connecting pairs of event
vertices. In contrast, the vertices in a state-transition diagram represent the state, and the
edges represent possible transitions between states. This can make the diagram infinitely
large (if the state-space is infinite). Since the vertices represent changes in state, not the
states themselves, event graphs are unlikely to suffer this problem, especially when the
model can be parameterized.
Time elapsed between the occurrences of events is represented on the edges of an
event graph. Figure 3 shows its basic structure. It states that
If condition (i) is true at the instant event A occurs, then event B will immediately be
scheduled to occur t time units in the future with variables k assigned the values j.
Figure 3. Basic event graph component
An event graph for a (resource-driven version of a) multiple-server n-stage queueing
network is shown in Figure 4. Qi and Ri are the current queue size and the number of
available servers at stage i, respectively. DONE is a Boolean variable indicating whether
the job has completed its final processing stage.
Figure 4. Event graph for an n-stage G/G/· queueing network
In contrast to process interaction-based models, jobs do not actively seize the resources.
There is no explicit differentiation between resource and job entities.
Figure 5 shows the event graph for the model in Figure 4 without the START
event. The two event graphs have the same behavior. In Figure 5, Qi is the total number
of jobs in queue and in service at stage i. Ri is the total number of servers at i.
Figure 5. Two-event event graph for an n-stage G/G/· queueing network
Unlike in process interaction and activity scanning models, event scheduling can be used
to implement both job-driven and resource-driven simulations in a natural way. For
resource-driven simulations, rather than maintaining a record of every job in the system,
only integer counts of the numbers of jobs of particular types and different stages of
processing or in different states are necessary. The system’s state is described by the
availability of resources (also integers) and these job counts. Thus, all the state variables
in a resource-driven simulation are non-negative integers. The state changes for each
event are difference equations that increase or decrease one or more state variables by
integer amounts. This is illustrated in Figures 4 and 5. For a discussion of resource
graphs, a derivation of event graphs designed specifically for resource-driven
simulations, see (Hyden et al. 2001).
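The difference equations of Figure 4 can be sketched directly (our illustration; times are deterministic, and the START event re-checks its condition as a guard against simultaneous schedulings):

```python
import heapq
from itertools import count

def n_stage_network(n, servers, ta, ts, horizon):
    """Resource-driven n-stage queueing network in the spirit of Figure 4.
    The entire state is 2n non-negative integers: Q[i] and R[i]."""
    Q = [0] * (n + 1)                # jobs in queue at stages 1..n
    R = [servers] * (n + 1)          # available servers at stages 1..n
    done = 0
    fel, seq = [], count()           # future event list

    def sched(t, ev, i):
        heapq.heappush(fel, (t, next(seq), ev, i))

    sched(0.0, 'enter', 1)
    while fel and fel[0][0] <= horizon:
        now, _, ev, i = heapq.heappop(fel)
        if ev == 'enter':
            Q[i] += 1
            if i == 1:
                sched(now + ta, 'enter', 1)     # self-scheduling arrivals
            if R[i] > 0:
                sched(now, 'start', i)
        elif ev == 'start':
            if Q[i] > 0 and R[i] > 0:           # re-check (tie-breaking guard)
                Q[i] -= 1; R[i] -= 1
                sched(now + ts, 'finish', i)
        else:                                   # 'finish'
            R[i] += 1
            if Q[i] > 0:
                sched(now, 'start', i)
            if i == n:
                done += 1
            else:
                sched(now, 'enter', i + 1)      # pass the job downstream
    return done

print(n_stage_network(2, 1, 1.0, 0.8, 10.0))    # -> 9
```

Each event body is a set of integer difference equations plus conditional schedulings, exactly mirroring the state changes and edges of the event graph.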
When using event scheduling to implement a job-driven simulation, the integer
counts of the jobs at a given step are replaced by lists of these jobs. As in the process
interaction approach, the memory footprint of the simulation will increase with system
congestion.
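The contrast can be seen in the state a single stage must keep (the attribute names here are illustrative, not the fab model's):

```python
from collections import deque

# Resource-driven state for one stage: two integers suffice, and they
# do not grow with congestion.
q, r = 2, 1

# Job-driven state for the same stage: one record per job, so memory
# grows with congestion -- but per-job statistics become available.
now = 45.0
queue = deque([{'id': 17, 'arrival': 42.0},
               {'id': 18, 'arrival': 44.5}])
waits = [now - job['arrival'] for job in queue]
print(waits)             # [3.0, 0.5] -- individual waits, not just counts
assert q == len(queue)   # the count is all the resource-driven model keeps
```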
Event graphs have been shown to be able to model Turing Machines
(Savage et al. 2004). That is, event graphs can model anything that is computable.
2.2.4. Comparison of Process Interaction and Event Scheduling
Traditionally, the terms “job-driven simulation” and “process interaction world view”
have been used almost interchangeably because process interaction-based software is
inherently job-driven. Nonetheless, in (Schruben and Roeder 2003), we show this
paradigm can also be used to develop logic for resource-driven simulations. An early
example of this can be found in (Schriber 1991). Here, the workers processing a large
number of parts are introduced as “transactions” that cycle through states of being busy
and idle.
Some of the disadvantages traditionally associated with job-driven simulations,
see (Roeder et al. 2002), are actually phenomena of the process interaction world view,
not of the job-driven approach itself. A job-driven model implemented using an event
scheduling paradigm does not suffer these difficulties. Table 2 lists the
disadvantages of the job-driven and resource-driven paradigms themselves, while Table 3
below shows the difficulties associated with the actual implementations. The two tables
show that the event scheduling implementation of a resource-driven simulation is
straightforward, though the resource-driven paradigm itself has problems.
The problems of modularity and encapsulation (or lack thereof) for process
interaction-based simulations are addressed in (Cota and Sargent 1992). They are not
listed in Table 3 since the authors propose a modification of the world view to address
these issues. The modification redefines the control state of a process to allow the
process to end before its “time left in state” has run out if the conditions for reactivation
are met. This facilitates encapsulation by allowing processes to be ended prematurely
without the need for other processes to do the cancellation.
Table 3. Disadvantages of the implementation approaches

  Job-Driven
    Process Interaction: failure modeling inaccurate; deadlock possible.
    Activity Scanning (STPN): N/A without many customizations.
    Event Scheduling: many events scheduled in congested systems
      (slows simulation).
  Resource-Driven
    Process Interaction: must “trick” the software into performing the
      desired behavior.
    Activity Scanning (STPN): model size increases dramatically with
      system size; many extensions required to enable modeling.
    Event Scheduling: (no major difficulties).
An expansive evaluation of different modeling and implementation approaches is given
in Table 4.8 of (Page 1994).
2.3. Some Recent Experiences: A Semiconductor Fabrication Plant Model
Current simulation models in the semiconductor industry have prohibitively long run
times, preventing the simulations from being used to their full potential. The goal of the
research project described here was to create a resource-driven simulation of a
semiconductor wafer fabrication plant (“fab”) to see whether there would be an
increase in execution speed, and whether there were any disadvantages to this approach.
Increased speed in semiconductor simulations is a great concern in the industry
(Brown et al. 1997).
We briefly describe the system modeled, and then discuss the results of the study;
these will motivate the work done in the next sections. For a more detailed discussion of
the modeling differences, see (Roeder et al. 2002). For confidentiality reasons, exact
system parameters cannot be given.
2.3.1. System Description
The fab produces more than 5 part types using over 80 different tool types. Each tool
type has a varying number of (functionally identical) tools. The tools include serial
processing, batching, and stepper tools. Batch tools are tools such as furnaces where
several lots are processed at one time. Stepper tools are used for the photolithography
steps of the manufacturing process, and are frequently the bottleneck in the system. Each
stepper costs several million dollars.
Tool types are subject to a set of preventive maintenances (PMs) and failures that
occur according to certain probability distributions. Test and rework wafers visit the
tools periodically. All tools require load and unload times, and stepper tools also have
setup-dependent setup times. Wafer lots are queued based on critical ratio ranking, and
the first lot in queue is selected to be processed next. Critical ratio (CR) ranking orders
jobs based on their expected due dates and remaining processing times. (We define CR
more formally in Section 4.2.3.4.) Proprietary processing rules are used in processing
parts on the stepper tools. To simplify the model, operators and generic tooling (e.g.,
masks and reticles) were not modeled.
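A common textbook form of the critical ratio (the dissertation's precise definition appears in Section 4.2.3.4; this sketch uses the standard ratio) divides the time remaining until the due date by the remaining processing time and serves the smallest ratio first:

```python
def critical_ratio(due_date, now, remaining_work):
    """Textbook critical ratio: slack per unit of remaining processing.
    Values below 1 indicate a lot that is already behind schedule."""
    return (due_date - now) / remaining_work

# (lot id, due date, remaining processing time) -- illustrative numbers
lots = [('A', 10.0, 4.0), ('B', 8.0, 5.0), ('C', 12.0, 2.0)]
now = 2.0
ranked = sorted(lots, key=lambda lot: critical_ratio(lot[1], now, lot[2]))
print([lot[0] for lot in ranked])   # ['B', 'A', 'C']: B is most urgent
```

Note that evaluating the ratio requires per-lot attributes (due date and remaining work), which foreshadows the difficulty of supporting this discipline in a purely count-based model.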
Lots visit tools repeatedly, and visit subroutes with certain frequencies (e.g.,
every 10th job to reach Step 4 executes it; all others go directly to Step 5). The basic
route followed by the parts consists of over 500 processing steps.
The job-driven simulation used by Intel is coded in AutoSched AP (ASAP), a
commercial software package by Brooks/AutoSimulations, Inc., (AutoSimulations 1999).
It is heavily used in the semiconductor industry. Data and options are specified in
spreadsheet files and are processed by the software. Intel engineers code customized
functionality in C++. ASAP is a process interaction-based software package.
The resource-driven simulation developed at UC Berkeley was implemented
using the software package SIGMA for the underlying simulation engine and to generate
C source code (Schruben and Schruben 2001). Additional coding was done in C to
mimic the functionality of the job-driven simulation. The resulting simulation is an
executable file. The ASAP input files were imported into a Microsoft Access database,
which, along with Microsoft Excel, was used to create plain text input files. SIGMA uses
an event scheduling algorithm to model system dynamics.
In order to be able to compare the output of the two simulations accurately, the
resource-driven model simulates the job-driven model, not the fab itself. This is done to
prevent differences in output caused by different approaches to modeling the same
aspects of the fab.
2.3.2. Results
In undertaking the study, the expectation was that the resource-driven simulation would
be somewhat faster than the job-driven simulation currently in use. The results
confirmed this intuition. The simulation runtimes for 2 years of simulated time are
shown in Table 4.¹ The resource-driven model was more than two orders of magnitude
faster than the job-driven model when failures were included in the model. Failures
increase the runtime because they are modeled as jobs, artificially increasing system
congestion.
Table 4. Simulation run lengths for the fab model

                         Job-Driven    Resource-Driven
  With PMs/Failures      ≈ 3 hours     < 10 minutes (≈ 352 seconds)
  Without PMs/Failures   ≈ 1.5 hours   < 10 minutes (≈ 300 seconds)
While the resource-driven event-scheduling model is faster, the job-driven process-
interaction model has an advantage when it comes to modeling certain aspects of the
system, and to the available statistics. The majority of system features were modeled
accurately in the resource-driven model, and the available output for these parts of the
system was not statistically significantly different from that of the job-driven simulation.
The resource-driven simulation was able to achieve these accurate results much more
quickly than the job-driven simulation. There are, however, important components of the
system that were difficult to model in a resource-driven simulation.
These components are those that require knowledge of specific job attributes. One
such example is dedication constraints. Dedication constraints are rules typically used
with the photolithography (stepper) tools to ensure that successive layers of etchings on
wafers are the same depth: At each visit to the stepper tool, wafers are exposed to
ultraviolet light that is used to create the circuitry of the chips. Because the dimensions
¹ The model was run on a Dell Dimension Pentium 4 1.7 GHz PC with 512 MB of RAM.
of the chips are so small, differences in the wavelengths of the light emitted by the
individual stepper tools have an impact on the quality of the resulting chip (Woods 1998).
Dedication constraints require wafers to be processed by the same tool i on
different, though perhaps not all, visits to a tool group. A less restrictive version of this
rule, partial dedication, is that the wafer must be processed by one of a subset of the tools
in the group, not necessarily by tool i. Modeling dedication constraints in a
straightforward manner requires more information than can be obtained from merely
looking at job and resource counts.
Stepper tools are among the most expensive resources in the fab and are typically
the bottleneck, so it is important to model this constraint. Because not all wafers that
arrive at the tool group are processed immediately, even if there are idle tools, the work-
in-progress at that tool group is greater than if the constraints were not in place. We will
return to dedication constraints in Section 4.4.
In resource-driven models, queueing at resources has the twofold problem of
limited available output statistics, and of being difficult to model accurately in some
cases. While we may be interested in knowing the distribution of the waiting times of
jobs at the resource, we are typically limited to averages in resource-driven simulations.
The queue sizes are known at any given time, and the average queue size can be used to
calculate the average waiting time using Little’s Law (Little 1961).
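As an illustration of the calculation (not fab data), the time-average queue length can be read off a piecewise-constant sample path and converted to an average wait via Little's Law:

```python
def time_average(path, horizon):
    """Time-average of a piecewise-constant queue-length sample path,
    given as (time, new_value) change points starting at time 0."""
    area = 0.0
    t_prev, q_prev = path[0]
    for t, q in path[1:]:
        area += q_prev * (t - t_prev)
        t_prev, q_prev = t, q
    area += q_prev * (horizon - t_prev)   # final segment to the horizon
    return area / horizon

# Three arrivals over 8 time units; queue length changes at these instants.
L = time_average([(0.0, 0), (1.0, 1), (2.5, 0), (3.0, 2), (5.0, 0)], 8.0)
lam = 3 / 8.0                 # observed arrival rate
W = L / lam                   # Little's Law: L = lam * W
print(L, W)                   # 0.6875 and about 1.833
```

Only aggregates enter this calculation; the distribution of individual waiting times cannot be recovered from it, which is exactly the limitation noted above.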
Queueing disciplines that are difficult to model require specific job information,
often quantities such as how long a job has been in the system (global delay information),
or how long the job has been at the resource (local delay information). We will discuss
queueing disciplines more in Chapter 4 after proposing an information-based taxonomy
for simulation models.
2.4. Problems with a Taxonomy Based on the Simulation World Views
In (Schruben and Roeder 2003), the authors claim the simulation world views are better
treated as ways of implementing, not classifying, simulation models. This coincides with
their historical development, see (Kiviat 1969; Fishman 1973). Using them as a
taxonomy is problematic as they provide neither an exclusive nor an exhaustive means of
classification.
For example, there is periodic confusion on the categorization of certain models;
Petri Nets have been described both as activity scanning (Miller et al. 2004) and process
interaction (Seila et al. 2003). A proper taxonomy should classify models as one or the
other.
The world views are also unable to unambiguously account for resource-driven
and job-driven simulations: Both RD and JD simulations can be implemented as either
process interaction or event scheduling models. The RD/JD framework should not be
ignored, because it takes a higher-level view of models by focusing on the entities of
interest rather than on the actual implementation. It helps decide whether the emphasis of
a model should lie with the resources, with jobs, or with both; this is useful in developing
the model itself because it provides clarity.
The process interaction world view was developed as a mixture of the event
scheduling and activity scanning world views: “Conceptually, the process interaction
approach combines the sophisticated event scheduling feature of the event scheduling
approach with the concise modeling power of the activity scanning approach”
(Fishman 1973). This mixture of activity scanning and event scheduling is different from
the mixture found in the Three Phase Method, which explicitly adds event scheduling
elements to the activity scanning to reduce the number of activities that must be scanned
at each transaction. The activity scanning world view itself has been described either in
terms of activities that are started and ended by events (Kiviat 1969), or defined as
activities independently of events (Fishman 1973). The ability to overlap and mix world
views is necessary to adequately define the behavior of SPLs, but is not desirable in a
proper taxonomy, where there should be clear distinctions.
The simulation world views were originally put forth to describe the ways
simulation programming languages approach model implementation. Computer
programming and data structures have made great progress since the 1960s, and there are
many variations and refinements in implementations of simulation models. The Three
Phase Method is an early example of a refinement to the activity scanning approach to
reduce computational requirements. As methods of model implementation, the world
views still provide a useful high-level differentiation between SPLs. However, a modeler
should be aware that they are only rough means of classifying SPLs, and that there may
be many deviations in the details. For a detailed discussion of formalisms and
frameworks for simulation models, see (Page 1994).
2.5. Problems with a Taxonomy Based on the Resource-Driven and Job-Driven Paradigms
While intended to be an improvement on the simulation world views as a taxonomy, the
resource-driven and job-driven paradigms do not satisfactorily capture the behaviors of
systems and models of systems. For example, resource-driven simulations have been
defined as simulations where only integer counts of the jobs and available resources are
maintained (Schruben 2000). This is not the case in the simulation described in
Section 2.3; there, detailed information such as utilization or time since last failure is
maintained for all resources in the system. On the other hand, a job-driven simulation
has been defined as one where all jobs in the system are traced, with the implication that
no information on the resources is kept. This is also not true, as JD models often contain
all possible information in the system. In a sense, the name “job-driven” is a misnomer.
A large part of the problem with using the RD and JD paradigms as a taxonomy is
that they are ill-defined. While the model from Section 2.3 does not fit the definition
cited earlier from the literature, it is generally accepted as a resource-driven model. A
defining characteristic appears to be the fact that job information is not traced.
Classifying models based on job tracing is an option, but can lead to difficulties with so-
called mixed models, where jobs are traced in certain parts of the system but not in
others. It also does not allow differentiation between the RD model in Figure 4 and the
fab model from Section 2.3.
While it is not a requirement of a taxonomy, the RD/JD framework is not helpful
in deciding what aspects of a system can and cannot be modeled using a resource-driven
or mixed model (e.g., an RD simulation can model a FCFS service discipline with one
job type, but not two). The information taxonomy presented next stresses the information
needed to model desired system characteristics, or to obtain desired output statistics.
3. INFORMATION TAXONOMY
We have outlined problems associated with current taxonomies of simulation models,
both for the classical world views and the resource-driven/job-driven taxonomy proposed
in (Schruben and Roeder 2003). While much work continues to be done trying to use the
world views to capture models, we feel the fundamental problem is that the world views
are neither an exclusive nor an exhaustive classification because they try to characterize
implementations, not models.
The taxonomy proposed in this section draws heavily from the literature, but
focuses on the information contained in the system model. (Merriam-Webster 1993)
defines information as “facts, data.” The taxonomy allows the modeler to determine
what information is (data are) needed to model specific aspects of the system and,
conversely, to state what can and cannot be modeled given the informational constraints
on the model. The taxonomy will clarify why certain types of models are subject to
certain problems (e.g., inaccurate failure modeling in process interaction-based models);
and why certain statistics are not available for some types of problems (e.g., FCFS
waiting time distributions for a “resource-driven” model). Section 3.4 relates the
taxonomy to existing formalisms.
It is important to note that the taxonomy focuses on the model itself, not the
implementation of the model (though implementation can be aided through the use of the
taxonomy). Implementation can use a SPL based on any of the world views, or newer
methods such as object-oriented or web-based technologies (Healy and Kilgore 1997).
In addition to aiding in modeling decisions such as data structure design, the
taxonomy assists in the complexity analysis of the memory requirements of different
models. Computational requirements are not addressed directly, as they are very
dependent on the implementation; however, information (and memory) requirements may
give a rough estimate of the computational requirements, regardless of implementation.
For example, a queueing discipline based on a job’s total time in system requires ordered
insertions into a job queue, regardless of the chosen implementation.
3.1. Types of Information
The following subsections provide different high-level ways of classifying information.
While the classifications in any subsection are exclusive and exhaustive, the subsections
themselves are not mutually exclusive. We give examples of the different types of
information in each subsection. Section 3.2 combines the classifications from this section
into a single taxonomy. We then give examples of the information from the perspective
of the taxonomy as a whole.
In some cases, the classification of information depends on the system to be
studied, and on the purposes of the study. We give examples of different classifications
of the same information, depending on the situation.
3.1.1. General or Entity-Specific
At the highest level, we can classify information as being general information about the
system, or information about specific (resident or transient) entities in the system.
Examples of general information are the number of resources or the number of jobs
waiting in queue. Examples of entity-specific information are the last failure time for
each resource, or the waiting times for each job. We refer to the information about the
system itself as “non-subscripted,” while the information about entities is “subscripted”
(because that is how such information is generally represented mathematically). The
subscripted information is relevant if we are differentiating between the different entities.
If we are not interested in the waiting times of individual jobs, we may not need to
explicitly differentiate between jobs j and j’.
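The distinction maps naturally onto how the information would be stored; a minimal sketch (the variable names are ours, not from the model):

```python
# Non-subscripted (general) information: one value describes the system.
num_resources = 3          # number of resources in the group
jobs_in_queue = 5          # current queue count

# Subscripted (entity-specific) information: one value per entity.
last_failure_time = {"machine_1": 12.4, "machine_2": 30.1, "machine_3": 8.7}
waiting_time = {"job_17": 2.5, "job_18": 0.0}

# Memory grows with the number of entities only for subscripted information.
per_entity_records = len(last_failure_time) + len(waiting_time)
```

Dropping the subscripted dictionaries leaves the model able to run, but unable to answer entity-level questions such as "what was job 17's waiting time?"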
If we do not differentiate between resources, then the processing rate at a resource
group is general information about the system, i.e., non-subscripted. Different
classifications are possible if there are different processing rates for the k resources in the
same group. (k is system information.) We can either treat them as separate resource
groups, in which case the processing rates would be general information; or differentiate
between the k resources in one group, in which case the processing rates would be
subscripted information. The second approach is preferable if other resource information
is needed (e.g., utilizations), or to simplify job routing.
The subscripts referred to here should not be confused with subscripts introduced
to facilitate modeling. For example, a queueing network simulation that uses only counts
of the number of jobs and available resources at each stage may refer to jobs going from
station i to station i+1. The only information used in the model, however, is general
information.
3.1.2. Subscripted: Resource or Job

Subscripted (entity-related) information may pertain either to resources or to jobs.
Resources are resident entities that remain in the system for extended periods of time,
possibly for the entire duration of the simulation. Jobs are transient, and are more likely
to enter and leave the system.
In some cases, there may be ambiguity. For example, a worker is a resource, but
leaves the system to go on break or at the end of a shift. If the purpose of the simulation
is worker scheduling, it may make sense to treat the worker as a job. If the purpose is to
model the system as a whole, including the flow of jobs, the worker can be considered a
resource. (The same worker will be going on and off shift throughout the simulation,
while jobs do not usually re-enter the system once they have left.)
Another example of ambiguity is the case of a machine that requires servicing by
another resource, say, a worker. Although the machine is in the system throughout the
simulation, it becomes a job from the worker’s perspective. If the focus of the study is
maintenance scheduling, the machine could be treated as a job. If the study is also
explicitly modeling job flows, the machine more naturally fits in the role of a resource, as
it does not make sense to have a job processing another job. Since the machine is also
serviced by another resource, it can be seen in the context of a hierarchy of resources
(Schruben and Schruben 2001).
It is possible to classify entities as resident or transient in any simulation model.
As a general rule, entities that remain or reappear in the system throughout the study can
be considered resources. Those entities that will ultimately leave the system (even if
their sojourn is long) are jobs.
3.1.3. Subscripted: Local or Global

Entity information can be either local or global. When classifying information as local or
global, we are usually referring to temporal location: for example, a job's waiting time
at the current resource (locally temporal) or its total time in system (globally temporal).
Some literature also refers to spatially local or global information (Baker 1998). In our
context, globally spatial information is information that requires knowledge of the system
as a whole (general and entity information), while locally spatial information is
information (not necessarily delay-based) that is restricted to the current location or
resource (entity information).
Most resource information is global because resources are more likely to be
stationary entities (e.g., machines). There are instances where differentiating between
local and global resource information is necessary. For example, a worker servicing
machines may not be allowed to spend more than x time units at a given machine. The
time already spent at the machine can be considered local information for the worker.
The differentiation between local and global is more intuitive for jobs because
jobs are more likely to move through the system and have different experiences at
different stages. The experiences at different stages do not have to be dependent on one
another. In a queueing network, a job spends different amounts of time at the different
stations. These times are local information, and may be discarded when the job has
finished at a station.
3.1.4. Modeling, Statistics, or Both

Information may be used to model the system behavior, to estimate system statistics, or to
do both. For example, a job’s routing is used to model the system, but is not used
directly for output statistics. The cumulative area under the queue versus time curve for a
resource group is used for output statistics, to calculate the (time) average queue size.
The utilization of a resource may be used both to model system behavior (by assigning a
new job to the resource that has been utilized least) and to calculate statistics (to report
average utilization).
It is important to differentiate between these uses of the information. To merely simulate
a G/G/s queueing system, only one integer variable is needed: the number of jobs in the
system (see Figure 5). (System parameters are also required, i.e., the interarrival and
service rates, and the number of servers s.) If we wish to collect statistics
on the queue behavior, we need to introduce more variables, for example the largest
queue size, or the number of waiting job-hours. More desired output leads to larger
memory and computational requirements.
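The point can be illustrated with a minimal event-scheduling sketch of such a system. Exponential interarrival and service times stand in for the general G/G/s distributions, and all names are ours; the integer `jobs` is the only modeling variable, while `area` and `max_queue` exist purely to produce output statistics:

```python
import heapq
import random

def simulate_ggs(s, arrival_rate, service_rate, run_length, seed=0, collect_stats=False):
    """Minimal G/G/s sketch: `jobs` suffices for modeling; statistics need more."""
    rng = random.Random(seed)
    jobs = 0                      # the single integer modeling variable
    events = [(rng.expovariate(arrival_rate), "arrive")]
    area = 0.0                    # statistics only: area under queue-vs-time curve
    max_queue = 0                 # statistics only: largest observed queue
    last_t = 0.0
    while events:
        t, kind = heapq.heappop(events)
        if t > run_length:
            break
        if collect_stats:
            area += max(jobs - s, 0) * (t - last_t)
            last_t = t
        if kind == "arrive":
            jobs += 1
            if jobs <= s:         # an idle server starts service immediately
                heapq.heappush(events, (t + rng.expovariate(service_rate), "depart"))
            heapq.heappush(events, (t + rng.expovariate(arrival_rate), "arrive"))
        else:
            jobs -= 1
            if jobs >= s:         # a waiting job enters service
                heapq.heappush(events, (t + rng.expovariate(service_rate), "depart"))
        if collect_stats:
            max_queue = max(max_queue, max(jobs - s, 0))
    avg_queue = area / run_length if collect_stats else None
    return jobs, avg_queue, max_queue
```

With `collect_stats=False`, the state reduces to `jobs` plus the event list; each statistics variable added is extra memory and extra work per event.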
Similarly, the more detailed the system behavior, the more memory and
computational effort necessary. For example, simulating dedication constraints in
semiconductor fabs requires knowing which machine processed each wafer. If dedication
constraints are not an important element of the system, the job information need not be
stored (for this purpose).
This categorization is related to the classification of attributes used for control,
measurement, or both in (Gahagan and Herrmann 2001).
3.1.5. Static or Dynamic

Any information in a system is either static or dynamic. Static information does not
change over time, for example the number of resources in a particular resource group
(unless explicitly modeling resource acquisition/relinquishment). Static information is
usually an input parameter to the simulation.
Dynamic information can change value over time; an example is the number of
available resources at a resource group. Dynamic information is used during the
simulation run and reported as output statistics at the end.
An example where there may be ambiguity is the intensity function of a nonhomogeneous
Poisson process. In this case, we have two types of information: the functional
specification of the intensity is an input parameter (static), while the instantaneous
intensity is dynamic.
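A small sketch of the distinction, with an illustrative (made-up) daily-cycle intensity function:

```python
import math

# Static input parameter: the functional specification of the intensity.
def intensity(t):
    """Illustrative daily-cycle intensity for a nonhomogeneous Poisson process."""
    return 5.0 + 3.0 * math.sin(2 * math.pi * t / 24.0)

# Dynamic information: the instantaneous intensity at the current clock time.
clock = 6.0                       # current simulation time (hours)
current_rate = intensity(clock)   # changes as the simulation clock advances
```

The function object never changes during the run (static), but `current_rate` must be re-evaluated whenever the clock moves (dynamic).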
3.2. Classification
3.2.1. Diagram
Figure 6 shows a diagram of the information taxonomy we propose, incorporating the
classifications described above. The regions are not drawn to scale.
Figure 6: Information taxonomy
We define the following regions, and give examples of each in the next section.
I. Non-subscripted
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
      ii. Modeling and statistics
      iii. Statistics
II. Subscripted: Resources
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
         (a) Local
         (b) Global
      ii. Modeling and statistics
         (a) Local
         (b) Global
      iii. Statistics
         (a) Local
         (b) Global
III. Subscripted: Jobs
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
         (a) Local
         (b) Global
      ii. Modeling and statistics
         (a) Local
         (b) Global
      iii. Statistics
         (a) Local
         (b) Global
3.2.2. Example: Wafer Fab

In this section, we illustrate the information taxonomy using the semiconductor
simulation described in Section 2.3. Where appropriate, we indicate the impact the lack
of information had in the resource-driven event scheduling implementation of the model.
We also show where the process interaction model misclassifies information, leading to
unnecessary difficulties in model implementation. All resource information is global
because the resources in this model are stationary.
I.1.i.: Non-subscripted static information for modeling
• Number of job types
• Number of resource groups
• Wafer release times for each job type
• Queueing discipline at each resource group; queueing disciplines and other
operational rules are typically modeler- or system-defined, not characteristics of the
resources themselves.
• Service distributions; the distributions are modeler- or system-defined, not
characteristics of the resources themselves. In some cases, the true service
distributions are not even known.
• Processing rates; if the processing rates are the same for all resources in a group, this
information is general system information. If there are differences, it becomes
resource information (II.1.i).
• Job routings and processing times; this quantity can also be defined as a job
characteristic if desired. Since the routing is deterministic in this case, it makes more
sense as a general system parameter, rather than as a job attribute. In the case of
stochastic routing, the routing probabilities should be treated as system parameters,
while the actual routings themselves are job-specific.
• Preventive maintenance and failure schedule for each resource group.
• Batch sizes for each resource group. If several job types can be batched together, the
admissible combinations are also in this category.
I.1.ii.: Non-subscripted static information for modeling and statistics
• Simulation run length
• Number of resources in each group; used to assign jobs among the available idle
servers (with probability idle servers/total servers), and also to calculate resource
group statistics such as average utilization.
• Load/unload, setup times for resource groups
I.2.i.: Non-subscripted dynamic information for modeling
Bottleneck information is an example of the type of information in this category, though
it was not used in the fab simulation.
I.2.ii.: Non-subscripted dynamic information for modeling and statistics
• Current simulation clock time
• Number of available resources in each group; used to assign jobs among available
idle servers, and to calculate resource group statistics.
• Number of jobs in queue at each resource group; because this simulation does not
require much information about the queue sizes, it is sufficient to treat the queue
counts as global information. If more elaborate queue information were desired, the
queues themselves may be treated as resources.
I.2.iii.: Non-subscripted dynamic information for statistics
• Cumulative area under queue versus time curve; the sole purpose of these variables is
bookkeeping, to determine the average queue sizes during the simulation run.
II.1.i.: Resource static information for modeling
While there was no static resource modeling information here, an example is resource
qualifications.
II.1.ii.: Resource static information for modeling and statistics
This model contained no static resource information used for both modeling and
statistics.
II.2.i.a: Resource local dynamic information for modeling
While there was no local dynamic resource information in the model, an example is a
worker’s time at current location.
II.2.i.b: Resource global dynamic information for modeling
• Resource status
II.2.ii.a: Resource local dynamic information for modeling and statistics
While there was no local dynamic resource information for modeling and statistics in this
model, an example is the time a worker arrived at a machine group to perform
maintenance.
II.2.ii.b: Resource global dynamic information for modeling and statistics
• Dedicated queue size; in the case of resource dedication, each resource (in each
group) will have its own queue of jobs that have previously been processed by it. If
jobs are assigned to specific resources on arrival at the group, they would also be
included in this queue. The assignment may occur based on the number of jobs
already in the resource’s dedicated queue.
• Time to next preventive maintenance/failure for each type of PM/failure. In the
process interaction model, PMs and failures are not treated as attributes of a resource;
rather, they are seen either as general or as job information. As a consequence, PMs
and failures are modeled as high-priority jobs that make the resources “unavailable”
at the appropriate times.
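The dedicated-queue assignment rule mentioned above might be sketched as follows; the group structure and all names are hypothetical:

```python
# Hypothetical sketch: assign an arriving job to the resource (within a group)
# whose dedicated queue is currently shortest; ties broken by dictionary order.
def assign_to_dedicated_queue(dedicated_queues, job):
    resource = min(dedicated_queues, key=lambda r: len(dedicated_queues[r]))
    dedicated_queues[resource].append(job)
    return resource

queues = {"tool_a": ["j1", "j2"], "tool_b": ["j3"]}
chosen = assign_to_dedicated_queue(queues, "j4")
```

Note that this rule requires region II.2.ii.b information (one queue per resource); a model keeping only the aggregate queue count per group cannot express it.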
II.2.iii.a: Resource local dynamic information for statistics
While there is no local dynamic resource information for statistics in this model, an
example is the cumulative time a worker spends at different stations.
II.2.iii.b: Resource global dynamic information for statistics
• Cumulative time in PM/failure
• Cumulative time spent processing, idle, loading, unloading etc.
III.1.i.: Job static information for modeling
While there was no specific job-related static information used only for modeling here,
examples include due dates; travel rates for transitions between resources that do not
require a resource such as a conveyor belt (e.g. walking rate for amusement park
visitors); or job-specific routing for probabilistic routing. The due date could also be
used for book-keeping purposes, for example, if jobs are serviced based on earliest due
date.
III.1.ii.: Job static information for modeling and statistics
• Job type; if jobs are traced explicitly, the job type is used for routing purposes
(modeling), and also to gather statistics on jobs based on type. In a resource-driven
model, this information is known only implicitly in the events scheduled for a job
(e.g., start service on a job of type 3).
III.2.i.a: Job local dynamic information for modeling
• Position in queue; a job’s position in queue is local information, as it is no longer
relevant once the job has left the current queue. This is a job attribute, not a resource
or queue attribute. It is not known in a resource-driven simulation.
III.2.i.b: Job global dynamic information for modeling
None of the following are known in a resource-driven simulation.
• Remaining processing time; this is often used in determining job ordering in queues.
• Current location in system
• Resource ID used for processing; this information is used for modeling dedication
constraints. Since it is not available in the resource-driven implementation of the fab
simulation, dedication cannot be modeled.
III.2.ii.a: Job local dynamic information for modeling and statistics
• Time joined current queue; this information can be used for modeling, e.g., the FCFS
discipline will service the job with the earliest time. It is also used to gather waiting
time statistics for the job. This information is not known in a resource-driven
simulation.
III.2.ii.b: Job global dynamic information for modeling and statistics
• Time entered the system; this information can be used both for job ordering in queues
and for calculating cycle time statistics. This information is not known in a resource-
driven simulation.
III.2.iii.a: Job local dynamic information for statistics
• Waiting time. This information is not known in a resource-driven simulation.
III.2.iii.b: Job global dynamic information for statistics
• Cycle time. This information is not known in a resource-driven simulation.
• Though not used in this simulation, the job’s classification or grade (showing defect
levels) at the end of processing is also information that can be used for statistical
purposes. This information is not known in a resource-driven simulation.
3.2.3. Example: G/G/s Priority Queue with Two Job Types

In this section, we illustrate the information taxonomy with another example, a priority
queue with two job types and FCFS queueing within classes. Since it is a single-stage
system, there is no difference between local and global information. We assume detailed
output is desired.
I.1.i.: Non-subscripted static information for modeling
• Number of job types (2)
• Priority of each job type
• Arrival and service distributions for both job types
• Processing rates if s servers are functionally identical
I.1.ii.: Non-subscripted static information for modeling and statistics
• Simulation run length
• Number of resources (s)
I.2.ii.: Non-subscripted dynamic information for modeling and statistics
• Current simulation clock time
• Number of available resources
I.2.iii.: Non-subscripted dynamic information for statistics
• Cumulative area under queue versus time curve
• Cumulative area under queue versus time curves for the two job types
II.1.i.: Resource static information for modeling
• Processing rates if there are differences between the s servers
II.1.ii.: Resource static information for modeling and statistics
There is no static resource information for modeling and statistics.
II.2.i.: Resource dynamic information for modeling
• Resource status
II.2.ii.: Resource dynamic information for modeling and statistics
• Resource utilization
II.2.iii.: Resource dynamic information for statistics
• Cumulative time spent processing, idle, etc.
III.1.i.: Job static information for modeling
Not applicable in this example.
III.1.ii.: Job static information for modeling and statistics
• Job type
III.2.i.a: Job local dynamic information for modeling
• Position in queue
III.2.ii.a: Job local dynamic information for modeling and statistics
• Time joined current queue; within priority classes, there should be some discipline for
deciding which job to process next, e.g., FCFS.
III.2.iii.a: Job local dynamic information for statistics
• Waiting time
• Cycle time
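The modeling information listed here, job type (priority) and time joined queue, is exactly what is needed to order such a queue. A sketch, assuming lower numbers mean higher priority; job names and times are illustrative:

```python
import heapq

# Job ordering for a two-class priority queue, FCFS within classes.
# Each entry is (priority, arrival_time, job_id); the heap orders by
# priority first, then by arrival time (FCFS tie-break within a class).
queue = []
arrivals = [("j1", 2, 0.5), ("j2", 1, 1.0), ("j3", 1, 0.2), ("j4", 2, 0.1)]
for job_id, priority, arrival_time in arrivals:
    heapq.heappush(queue, (priority, arrival_time, job_id))

service_order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
```

Class-1 jobs are served in arrival order before any class-2 job, so `service_order` is `["j3", "j2", "j4", "j1"]`; without the "time joined queue" information, FCFS within classes could not be enforced.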
3.3. Complexity Analysis

One application of the information taxonomy presented here is that it facilitates the
complexity analysis of simulation models. The memory complexity can be derived from
the information needed to model the system and to find the desired output statistics.
Some computational complexity can be obtained from these requirements, though the
complexity depends heavily on the model implementation, which is not dictated by the
taxonomy. We use big-O notation (Landau 1909) to express the complexities.
We define the following quantities:
$\{\mathcal{G}(t) : 0 \le t \le t^*\}$ is the set of general system information; when the set does not change
over time, we omit the time index for simplicity, $\mathcal{G}(t) = \mathcal{G}\ \forall t$.

$\{\mathcal{R}(t) : 0 \le t \le t^*\}$ is the set of resources; when the set does not change over time, we omit
the time index for simplicity, $\mathcal{R}(t) = \mathcal{R}\ \forall t$.

$\{\mathcal{J}(t) : 0 \le t \le t^*\}$ is the set of jobs; when the set does not change over time, we omit the
time index for simplicity, $\mathcal{J}(t) = \mathcal{J}\ \forall t$.

$\{2^{\mathcal{R}(t)} : 0 \le t \le t^*\}$ is the power set of $\mathcal{R}(t)$; that is, the set of all subsets of $\mathcal{R}(t)$.

$\{2^{\mathcal{J}(t)} : 0 \le t \le t^*\}$ is the power set of $\mathcal{J}(t)$; that is, the set of all subsets of $\mathcal{J}(t)$.

$\{\mathcal{R}'(t) : 0 \le t \le t^*\}$ is an element of the power set of $\mathcal{R}(t)$, $\mathcal{R}'(t) \in 2^{\mathcal{R}(t)}$;
that is, a subset of all resources.

$\{\mathcal{J}'(t) : 0 \le t \le t^*\}$ is an element of the power set of $\mathcal{J}(t)$, $\mathcal{J}'(t) \in 2^{\mathcal{J}(t)}$;
that is, a subset of all jobs.

$J_{\max}$ is the maximum value of $|\mathcal{J}(t)|$, $J_{\max} \ge |\mathcal{J}(t)|\ \forall t$.

$\mathcal{M}$ is the set of modeling behaviors or rules,
$\mathcal{M} = \mathcal{M}_{\mathcal{G}} \cup \mathcal{M}_{\mathcal{R},\mathrm{local}} \cup \mathcal{M}_{\mathcal{R},\mathrm{global}} \cup \mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}}$; the set of system aspects that
are to be modeled, for example resource dedication or FCFS queueing. Subscripts
“local” and “global” refer to local and global behaviors.

$\mathcal{O}$ is the set of output statistics,
$\mathcal{O} = \mathcal{O}_{\mathcal{G}} \cup \mathcal{O}_{\mathcal{R},\mathrm{local}} \cup \mathcal{O}_{\mathcal{R},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}$; the set of
desired output statistics, for example, job waiting time distributions.
3.3.1. Memory (Storage) Requirements

We do not consider the memory required to store events on the future events list. This
quantity depends on the implementation chosen. Simulation engine processor
requirements are discussed in the next section.
The amount of general information typically used during a simulation run is
$O\big([\mathcal{M}_{\mathcal{G}} \cup \mathcal{O}_{\mathcal{G}}] \times \mathcal{G}\big)$. Since the amount of system information does not change during
the simulation, this quantity is fixed. For example, if we simulate an M/M/1 FCFS
system and desire only average queue statistics, $\mathcal{M}_{\mathcal{G}} \times \mathcal{G}$ contains the interarrival and
service rates, as well as the server status (two real-valued and one integer-valued variable).
The simulation run length and current time are also in this category (two real variables).
$[\mathcal{M}_{\mathcal{G}} \cap \mathcal{O}_{\mathcal{G}}] \times \mathcal{G}$ contains the current queue size (one integer variable), and $\mathcal{O}_{\mathcal{G}} \times \mathcal{G}$ is the
cumulative area under the queue versus time curve (one real variable).
If we trace resources, the amount of resource-related information is
$O\big([\mathcal{M}_{\mathcal{R},\mathrm{local}} \cup \mathcal{M}_{\mathcal{R},\mathrm{global}} \cup \mathcal{O}_{\mathcal{R},\mathrm{local}} \cup \mathcal{O}_{\mathcal{R},\mathrm{global}}] \times \mathcal{R}\big)$. (To simplify notation, we will
aggregate local and global resource information in this section, since there are a limited
number of cases where there is explicit local and global information for resources.) Since
the number of resources does not usually change after the simulation has begun, the
amount of information stored here is also a fixed quantity. The amount of information
depends on the number of resources in the system, $|\mathcal{R}|$.
In some cases, we may be interested only in tracing a subset of the resources, e.g.,
photolithography tools in a wafer fab. In this case, the amount of information is
$O\big([\mathcal{M}_{\mathcal{R}} \cup \mathcal{O}_{\mathcal{R}}] \times \mathcal{R}'\big)$. For example, let there be two machine groups, with $k_1$ and $k_2$
machines, respectively. Each machine has a different error rate (qualification), and we
wish to find the utilization and the average percentages of time loading/unloading for
each machine. The error rate information (which may be used to assign jobs to
machines) is in $\mathcal{M}_{\mathcal{R}} \times \mathcal{R}$ ($k_1 + k_2$ real variables). The cumulative times busy (utilization)
and loading/unloading are not used for modeling, and are information in $\mathcal{O}_{\mathcal{R}} \times \mathcal{R}$
($3(k_1 + k_2)$ real variables). If we only want the output information for group 2, $\mathcal{R}' = \{2\}$,
the amount of output information is $\mathcal{O}'_{\mathcal{R}} \times \mathcal{R}'$, i.e., $3k_2$ real variables.
If we trace jobs, we must differentiate between local and global information.
Overall, the information required is $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big)$.
Transaction tagging (Schruben and Yücesan 1988) will track the information for only a
subset of jobs, and will require $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}'\big)$
information.

Memory requirements can also be reduced by maintaining only global
information for all jobs. This requires $O\big([\mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big)$ information. Only
tracing jobs at certain resources (e.g., to find waiting time distributions at the bottleneck)
uses $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}}] \times \mathcal{J}'\big)$ information, where $\mathcal{J}'$ may be a function of $\mathcal{R}'$.

The advantage of storing only local job information is that
$\big|[\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}}] \times \mathcal{J}'\big| \le \big|[\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big|$
when $\mathcal{J}' \subseteq \mathcal{J}$. The
difference can be significant. In a fab, $|\mathcal{J}(t)|$ can be several thousand wafers, while
$|\mathcal{J}'(t)|$ may be only several hundred.
If job data are stored in static arrays of size $N$, $N$ must be large enough to ensure
the simulation will not run out of space. That is, $N \ge J_{\max}$, where $J_{\max}$ is a random
variable. This can be a very inefficient use of memory. If the data are stored in dynamic
ranked or ordered data structures, less memory may be required, $O(|\mathcal{J}(t)|)$ rather than
$O(J_{\max})$. However, more computational effort will be needed to create and delete list
elements, as well as for index-based lookup. This is addressed in Section 3.3.3.
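A sketch of the two storage strategies (the sizes and record layout are illustrative):

```python
from collections import deque

# Static allocation: N must exceed the (random) peak job count J_max,
# so memory is O(J_max) regardless of the current load.
J_MAX_BOUND = 10_000
static_store = [None] * J_MAX_BOUND

# Dynamic allocation: memory is O(|J(t)|), growing and shrinking with the
# jobs actually in the system, at the cost of allocation work per job.
dynamic_store = deque()
for job in range(250):                    # 250 jobs currently in system
    dynamic_store.append({"id": job, "entered": 0.0})
```

Here the static array reserves 10,000 slots to hold 250 records, while the deque holds exactly as many records as there are jobs.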
3.3.2. Processor Requirements: Simulation Engine

The requirements of the simulation engine will depend heavily on the implementation
chosen. For example, some activity scanning model implementations involve repeated
scanning of all activities, which is less efficient than scheduling events if the number of
activities is large.
At a low level, all three world views place “events” to occur in the future on an
events list. An important note is that, even if job information is not stored, at least one
event will be scheduled for each job. In the G/G/s system in Figure 5, each job will
arrive and complete service, so there are two events per job. If the system is congested,
there will be many events scheduled to occur in a short amount of time. This can slow
the simulation, so the effects of congestion are felt even if job information is not being
stored explicitly. The maximum number of events on the events list at any point in time
in this example is limited to s+1 (s finishes and one arrival), but if arrivals are happening
quickly, many events are added to and removed from the list in a short amount of time.
3.3.3. Processor Requirements: Data Manipulation

The amount of general system information is fixed, so data manipulation for it only
involves changing values. This can be done in O(1) time.
The number of jobs varies during the run for non-closed systems, so manipulation
is required if jobs are ordered. Specifically, it has been shown that insertions into an
ordered list are done in $O(|\mathcal{J}(t)|)$ time. Alternately, a list can be sorted in
$O(|\mathcal{J}(t)| \cdot \log_2 |\mathcal{J}(t)|)$ time (Williams 1964). This is the case whether job information is stored
in statically- or dynamically-sized data structures. If the size of the data structure
changes as $\mathcal{J}(t)$ changes, additional computation is required to add and delete
elements, and for pointer manipulations to sort items.
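A sketch of the two approaches for a queue ordered by, say, remaining processing time (the values are illustrative):

```python
import bisect

# Maintaining an ordered job queue: a single ordered insertion shifts up to
# |J(t)| elements, i.e., O(|J(t)|) time per insertion.
remaining_times = [1.5, 2.0, 4.0, 7.5]        # kept sorted at all times
bisect.insort(remaining_times, 3.2)           # one O(n) ordered insertion

# Alternately, an unordered list can be fully re-sorted when needed,
# which costs O(|J(t)| log |J(t)|) per sort.
unordered = [7.5, 1.5, 4.0, 2.0, 3.2]
unordered.sort()
```

Which approach is cheaper depends on how often jobs arrive relative to how often the ordering is consulted; both produce the same ordered queue.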
3.4. Relationship to Current Taxonomies and Formalisms

We have proposed a taxonomy of simulation models based on the information required to
accurately represent the system, and to obtain performance estimates. This taxonomy is
independent of the implementation approach. It differs from the formalisms described in
the literature by focusing on the levels of information detail needed to model the system.
Less focus is placed on classifying the information based on its type (descriptive, informational,
integer, real, etc.), or on the relationship between pieces of information and model
implementation; the taxonomy is a tool for model development. It may suggest a means
of implementation, but the implementation is not the focus.
The formalisms in the literature seek to integrate and synthesize the system
model, while the information taxonomy decouples the information in the model and
classifies each piece independently. Many of the existing methods decompose models
into objects and subobjects and then try to integrate them. The taxonomy does not force
the information into a specific form (i.e., an object or entity), but clarifies what is
contained in the system. This clarity serves the purpose of including only the necessary
information in a simulation model.
In (Nance et al. 1999), the authors argue that some redundancy in model
specification can be beneficial, e.g., for model extensions and reusability. We do not
dispute this statement, merely assert that redundancies in a model should be put there
knowingly, not inadvertently.
One of the purposes in undertaking this classification is to make explicit the
concept that often much less information is required than is included in the model.
Conversely, we use the classification to show why certain types of models are unable to
model certain things. Many of the ideas we propose are not new, but we frame them in a
different context. In this section, we compare the taxonomy to existing formalisms.
3.4.1. Classical World Views

As stated in Section 2.4, the classical world views show how SPLs approach model
implementation. Because of the many variations possible in implementation, a taxonomy
based on implementation is not able to provide a robust framework. The information
taxonomy can be used to classify the information used by both the classical world views,
and by other SPLs.
STPNs use exclusively general system modeling information (region I in the
taxonomy). Because of this, the ability to provide job statistics (region III) is extremely
limited.
To overcome this limitation, Haas (2002) develops a method to determine delay
distributions for certain model types by adding information in region III.2.iii (dynamic
job information for output statistics). Specifically, define $\{V_n : n \ge 0\}$ as the sequence of
“start vectors” recording the starts of delay intervals for all ongoing delays after the $n$th
STPN marking change. The start vectors are updated by inserting, deleting, or reordering
elements depending on the current state and the transition(s) that fired at the $n$th marking
change. The vector sequence $\{W_n : n \ge 0\}$ contains the indices of the marking changes that caused
each delay to be added to the start vector. It is used to sort the delay times in order of
increasing start times if overtaking occurs. $\{W_n : n \ge 0\}$ is updated in the same way as
$\{V_n : n \ge 0\}$.
For example, to find job cycle times for a G/G/1 queueing system, the current
time is inserted in the leftmost position of the start vector $V_n$ if the transition firing
corresponds to a job arrival. If the transition firing corresponds to a job finishing service,
the rightmost element of $V_n$ is removed and the job’s cycle time is calculated (by
subtracting the time just removed from the start vector from the current time). No other
transitions cause changes to $V_n$ in this example. Since there is no overtaking in a
G/G/1 queue, $W_n$ is not necessary.
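The update rule can be sketched directly; we use a deque for the start vector, and the transition labels are ours:

```python
from collections import deque

# Start-vector bookkeeping for G/G/1 cycle times: on an arrival, insert the
# current time at the left of V; on a service completion, remove the
# rightmost (oldest) element and record a cycle time. No overtaking occurs
# in a G/G/1 queue, so the index vector W is not needed.
def process_marking_change(V, cycle_times, transition, now):
    if transition == "arrival":
        V.appendleft(now)
    elif transition == "service_end":
        start = V.pop()               # oldest ongoing delay finishes first
        cycle_times.append(now - start)

V, cycle_times = deque(), []
for transition, now in [("arrival", 0.0), ("arrival", 1.0),
                        ("service_end", 2.5), ("service_end", 4.0)]:
    process_marking_change(V, cycle_times, transition, now)
```

After the four marking changes, the recorded cycle times are 2.5 and 3.0; note that the modeling state of the STPN itself is untouched, since this information serves output statistics only.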
The extension is not available for all types of models because the information
added to the model is output information only. It is not used for modeling; the
information used for modeling continues to draw from region I only.
Process interaction models use information from all areas, but suffer from a lack
of clear classification of information. For example, failures must be modeled as dummy
(high-priority) jobs because failure information is not properly classified as resource
information. The only other way to have failures affect resource behavior is to model
them as jobs.
The event scheduling world view per se does not limit itself to any specific
regions, nor does it misclassify information. At the lowest level, all SPLs schedule
events (whether to start and end activities or to advance a process). In that way, event
scheduling SPLs allow the lowest-level control of a model, and do not force the modeler
into unnatural programming situations.
3.4.2. Resource-Driven and Job-Driven Paradigms

The taxonomy allows us to formally define the previously vague notions of “resource-
driven” and “job-driven.” Specifically, a resource-driven model contains information
from at most regions I and II. A job-driven model contains information from all three
regions. It is possible for a resource-driven model to contain only information in
region I, e.g., the model in Figure 4. In contrast, the resource-driven fab model from
Section 2.3 falls in regions I and II.
The limitations of RD models are due to the lack of information in region III. For
example, lack of modeling information can lead to inability to model dedication
constraints, while the lack of statistics information leads to the inability to report waiting
time distributions. (We show how these problems can be addressed in Sections 4.3
and 4.4.)
The RD/JD paradigm takes a higher-level view of simulation models than the
world views by shifting away from the implementation and identifying the type of
information that is the focus of the model: resource or job information. The information
taxonomy presented here takes this classification a step further by formalizing the
classification categories and allowing a more strict sorting of information.
3.4.3. Entity-Attribute-Set

In the Entity-Attribute-Set approach to modeling, the system is represented as a
collection of entities, which may have attributes. Entities can be attributes of other
entities, and the system itself is also considered an entity. Entities can be grouped in sets.
The SPL SIMSCRIPT takes this approach to modeling (Markowitz 1979); the modeler
defines events that change entity attribute values.
In contrast to the information taxonomy presented here, the Entity-Attribute-Set
approach describes the way SIMSCRIPT implements a simulation model. Because of
this, the modeler is forced to classify everything as an entity, and to define the
relationships between entities. The information taxonomy encourages the most intuitive classification of information, whether as an entity (resource or job), as the attribute of a resource or job, or as independent system information, which need not be entity-related.
The two approaches are similar in that information can be defined as relating to (e.g., as an attribute of) a resource or job, and in that types (sets) of resources and jobs can be defined. However, if we are differentiating between types of entities, as opposed to between the entities themselves, it is likely that we are dealing with general system information, not information from regions II or III. If explicitly differentiating between entities, the commonalities of the entities are of less interest than the differences (e.g., different levels of utilization).
3.4.4. Discrete Event System Specification (DEVS)

The DEVS formalism takes a systems-theoretic approach to modeling (Zeigler 1976).
From (Zeigler 2003), define the following variables:
X      set of input places
X^b    collection of bags over X (sets with possibly repeated elements); i.e., the set of
       possible outside events
S      set of states
Y      set of output places
e      elapsed time since the last transition
t_a    time advance function, t_a : S → [0, ∞]
Q      total state set, Q = {(s, e) | s ∈ S, 0 ≤ e ≤ t_a(s)}
δ_int  internal transition function, δ_int : S → S; specifies the transition that occurs from
       state s if no external events occur and e is allowed to advance to t_a(s)
δ_ext  external transition function, δ_ext : Q × X^b → S; the transition that occurs from
       state s if external event x occurs after time e is given by δ_ext((s, e), x)
δ_con  confluent transition function, δ_con : Q × X^b → S
λ      output function, λ : S → Y^b

Then a DEVS model is specified by M = {X, S, Y, δ_int, δ_ext, δ_con, λ, t_a}.
The system can be broken down into submodels, which are coupled to form the
complete model. Each atomic component has input and output ports, and functions as a
black box when inserted into a larger model. DEVS is concerned with specifying the basic models and their relationships to each other. Reusability of submodels is an important aspect of this approach.
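As an illustration of the formalism (a hypothetical sketch of our own, not taken from Zeigler; all names and the constant service time are illustrative), a DEVS atomic model for a single server might be encoded as follows:

```python
import math

class Processor:
    """Sketch of a DEVS atomic model: a single server holding a job count.

    State s is (phase, n); inputs X are job arrivals; outputs Y are departures.
    """
    SERVICE_TIME = 2.0  # illustrative constant service time

    def __init__(self):
        self.phase, self.n = "idle", 0  # state s in S

    def ta(self):
        """Time advance t_a(s): time until the next internal transition."""
        return self.SERVICE_TIME if self.phase == "busy" else math.inf

    def delta_int(self):
        """Internal transition delta_int: a service completes."""
        self.n -= 1
        self.phase = "busy" if self.n > 0 else "idle"

    def delta_ext(self, e, xb):
        """External transition delta_ext((s, e), x^b): a bag of jobs arrives."""
        self.n += len(xb)
        self.phase = "busy"

    def out(self):
        """Output function lambda(s): emitted just before delta_int fires."""
        return ["job_done"]

# Minimal hand-driven trace (no coupled-model simulator):
p = Processor()
p.delta_ext(0.0, ["job1", "job2"])   # two arrivals at time 0
assert p.ta() == 2.0                 # busy, next completion in 2 time units
p.delta_int()                        # first job done
p.delta_int()                        # second job done
assert p.phase == "idle" and p.ta() == math.inf
```

A coupled model would connect the output port of one such component to the input port of another; the component itself behaves as a black box.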
The information taxonomy does not attempt a specification of the whole model.
It can assist in organizing the information used to identify X, S, and Y. The relationships
between states (as defined by the δ functions) or the actual output mapping λ are not
interesting in the context of the taxonomy.
3.4.5. Conical Methodology

The Conical Methodology is "an approach which embodies top-down definition of a
simulation model coupled with bottom-up specification of that model" (Nance 1979).
The two phases may be done iteratively and simultaneously. In the model definition
phase, the model is decomposed into objects and subobjects, similar to the approaches
taken in the Entity-Attribute-Set approach and DEVS (it draws from both). In this phase,
object attributes are also typed.
The specification phase uses the Condition Specification from (Overstreet 1982)
to describe model behavior by working up through the hierarchical list of objects and
attributes. There are three types of conditions:
1. interface specification: defines model inputs and outputs
2. model dynamic specification
   a. objects: defines attributes, as well as their relationships to the objects in the model
   b. action clusters: defines conditions and the set of actions to be taken when the conditions are met
3. report specification: defines output data, and how they are to be computed
The information taxonomy is comparable to conditions 1, 2a, and 3 in the Condition
Specification. Unlike condition 2b, it does not define action clusters or other model
dynamics. The information taxonomy does not force the classification into objects or
attributes of objects. Its classification works with the possibly independent pieces of
information contained in the model, but not with the relationship between these pieces of
information.
3.4.6. Object-Oriented Modeling

Object-oriented modeling has become more common with the popularity of languages
like C++ and Java (Healy and Kilgore 1997). In object-oriented
programming, the system is decomposed into objects; objects are encapsulated, which
means that information is assigned strictly to one object. There may be an overarching
system object that calls functions on the other objects to change their attribute values.
The information taxonomy is related to object-oriented modeling in that objects
may be a natural way of implementing a model that has been classified using the
taxonomy. For example, the system itself is an object with associated information. If
maintaining individual resource information, each resource is an instantiation of a
resource class.
The main difference between object-oriented modeling and the taxonomy is that
the former is a means of implementing models; the latter can serve as a tool to aid the
development of an object-oriented model by organizing the information, but need not
lead to an object-oriented model.
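A minimal sketch of this correspondence (class and attribute names are our own, purely illustrative): system-level (region I) information lives in a system object, while per-resource (region II) information lives in instances of a resource class.

```python
class Resource:
    """Region II information: attributes of an individual resource."""
    def __init__(self, name):
        self.name = name
        self.busy_time = 0.0   # per-resource statistic

class SystemModel:
    """Region I information: general, system-level information."""
    def __init__(self, n_resources):
        self.clock = 0.0
        self.n_in_queue = 0    # general system information
        self.resources = [Resource(f"R{i}") for i in range(n_resources)]

    def record_service(self, resource, duration):
        # The overarching system object calls methods on its resources,
        # changing their attribute values (encapsulation).
        resource.busy_time += duration

model = SystemModel(2)
model.record_service(model.resources[0], 5.0)
assert model.resources[0].busy_time == 5.0
assert model.resources[1].busy_time == 0.0
```

The taxonomy only organizes the information; whether it is then implemented with classes like these is a separate design decision.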
4. IMPLICATIONS
4.1. General Implications on Modeling

Our research has focused on making simulation models more efficient. On the one hand,
we have looked at the impact modeling methodologies themselves (world views) have on
simulation execution times. On the other hand, we have developed an information
taxonomy to assist in modeling. It focuses on the information needed for a simulation
model to accurately represent the system in terms of the modeling objective.
The taxonomy provides clarity on why certain models or modeling approaches are
insufficient in some cases. For example, a STPN cannot find waiting time distributions
for a LCFS G/G/s queueing system because it does not have access to the required job
information to model job ordering in queue; it has information from region I, not
region III.
The taxonomy further tells the user whether a given modeling approach will be
able to model the desired system behavior. For example, if the user wishes to determine
the waiting time distribution for a LCFS G/G/s queueing system using a STPN, the
taxonomy can tell the user that this is not possible, because this requires information from
region III.2.ii.a, which is not available in the STPN. This can save the user time because
(s)he need not look for an answer (i.e., a STPN) that will give the desired statistics. Such
an answer does not exist. Rather, the user can focus on an alternate implementation, or
on developing approximations or extensions to STPNs.
4.2. Implications on Modeling: Service Disciplines

In this section, we show how the information taxonomy can be applied to
queueing/service disciplines. There is some literature classifying service disciplines, but
the focus of the classifications has not been on model implementation requirements.
(Jackson 1957) distinguishes between static and dynamic rules; job priorities do
not change over time with static rules (e.g., priority classes with random selection within
classes). With dynamic rules, the priorities may change (e.g., critical ratio ranking).
(Conway and Maxwell 1962) classify jobs as using only local job information, or using
global information about any aspect of the system. These two approaches are combined
in (Moore and Wilson 1967) to form a two-dimensional classification of static/dynamic and local/global.
(Panwalkar and Iskander 1977) survey over 100 scheduling rules used in the
literature. They propose a broad classification of rules into Simple Priority Rules,
Heuristic Scheduling Rules, and Other Rules. The first category contains rules that use
job information, as well as rules that use queue sizes to assign jobs to servers and rules
that do not use any specific information (e.g., random assignment). Subclassification is
“based on information related to (i) processing times, (ii) due dates, (iii) number of
operations, (iv) costs, (v) setup times, (vi) arrival times (and random), (vii) slack (based
on processing and due dates), (viii) machines (machine-oriented rules), and (ix)
miscellaneous information.” Rules can be combined directly, or combined using
different weights. Heuristic rules use more complicated logic that often weighs different
scheduling alternatives before selecting the most appropriate one. Heuristic rules may
incorporate Simple Priority Rules. The third category of rules contains shop-specific
rules or other rules that do not fall in the other two groups.
While this classification does use information to classify service disciplines, there
is no differentiation between the types of information used. For example, processing
times and arrival times (at the machine) use only local job information, while due dates
use global job information. Machine-oriented rules do not use any job information, but
are placed in the same category. Heuristic rules also use information, but this
information is not used to explicitly categorize the rules. Rather, the classification of
“heuristic” is based on the types of decisions made to schedule jobs.
(Yoshida and Touzaki 1999) propose a quantitative measure to compare service
disciplines in a manufacturing environment. It evaluates the “closeness” of service
disciplines based on performance measures for the problem. The authors do not attempt
a general classification of service disciplines, especially in terms of the information
required to simulate or model them.
(Gahagan and Herrmann 2001) propose a queue controller that specifies the entity attribute by which the queue is sorted. There is no differentiation
between the type of information used (e.g., local or global job information). The queue
controller is able to model eight of the eleven most common dispatching rules given in
(Vollmann et al. 1988). The three it is unable to model are critical ratio, slack time per
operation, and “next queue.” Critical ratio is discussed more in Section 4.2.3.4. The
reason it cannot be implemented in this case is that the queue controller sorts entities
based on one attribute. Critical ratio requires two attributes (due date and remaining
processing time) to calculate the sorting criterion. Slack time per operation similarly
requires more than one attribute (due date, remaining processing time, number of
remaining operations). “Next queue” compares the queue sizes of the machine groups
jobs will visit next. The next job served is the one that will be joining the shortest queue.
This rule requires dynamic information about other parts of the system, and the jobs in
queue would have to be reordered repeatedly as queue sizes elsewhere change.
4.2.1. Clarification of Terminology

Service disciplines refer to the rules a server uses to decide which of the waiting jobs to
process next (Gross and Harris 1998). We are not considering the assignment of an
available server to a newly-arrived job. That is, we are interested in the job selection that
occurs on the FINISH-START edge in Figure 4, not the server selection on the ENTER-
START edge.
There is some ambiguity in the literature about the difference between a “service
discipline” and a “dispatching rule.” (Baker 1998) provides a helpful discussion
highlighting that the two are the same, though the term “dispatching rule” is typically
used in a scheduling context. There, we need to differentiate between dispatching and
scheduling. Dispatching uses rules like the ones we are discussing here, where servers
determine in real-time which job to process next. Scheduling determines in advance which jobs will be processed when, and by which server. It has some objective, for example,
to minimize the total waiting time experienced by all jobs.
Some of the service protocols we address here are service protocols only in a
loose sense. For example, the critical ratio ranking (Section 2.3.1) is not directly used by
a server to decide which job to process next. Rather, it is used to sort the server’s queue
when a job arrives. The server uses a FCFS protocol to pick its next job; the first job in
queue is the first to be served, though because of the critical ratio ranking, it may not
have been the first job to join the queue. We consider it a service protocol because we
could model this ordering as leaving jobs unsorted when they arrive, and having the
server choose the job with the smallest critical ratio to process next. The same applies to
other rules which sort jobs in the queue.
4.2.2. Goals and Assumptions

Service disciplines are an important component of system models, and may even be the
focus of the study. They can have an impact on queue sizes and, relatedly, job waiting
and cycle times.
Server utilization is unlikely to be affected by the service discipline, since we
assume the servers will eventually serve all jobs, unless the system is unstable (λ ≥ s⋅ν).
The service protocol may affect the number of jobs processed if it affects the
waiting/cycle times, which, in turn, cause jobs to balk or have to involuntarily leave the
queue. For example, parts may have to be scrapped if it takes too long to process them.
This will affect the server utilization. In these cases, however, we would be more
concerned with the number of balked jobs than the server utilization.
Correct modeling of service disciplines may be important not just for output
statistics (subregions iii), but also for modeling. For example, the order in which jobs
leave the server may be important. We discuss this problem and how to solve it without
using information from region III in Appendix B.
Job preemption (Buzacott and Shanthikumar 1993) can be modeled using a small
amount of additional general information; we must know what should be done with the
preempted job, and possibly store the remaining service time.
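A hypothetical sketch of that extra bookkeeping (the function and policy names are our own): for a preempt-resume policy the remaining service time must be stored; for a preempt-repeat policy no job-specific time is needed.

```python
def preempt(clock, start_time, service_time, policy="resume"):
    """Return the service time to store for a job preempted at `clock`.

    resume: keep the remaining service time (general bookkeeping only);
    repeat: service restarts from scratch, so the original time is reused.
    """
    if policy == "resume":
        remaining = service_time - (clock - start_time)
        return max(remaining, 0.0)   # clamp in case service already finished
    return service_time

assert preempt(7.0, 4.0, 10.0) == 7.0            # 3 of 10 units completed
assert preempt(7.0, 4.0, 10.0, "repeat") == 10.0
```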
4.2.3. Service Discipline Taxonomy

The information required to model a service discipline depends on the desired output
statistics. For example, if we only need the average queue size in a G/G/s system, the
number of jobs in queue and the number of available resources are sufficient to model the
system for both FCFS and LCFS disciplines. For now, we assume that job waiting and
cycle times are desired; that is, we want output in regions III.2.ii.a and b (or III.2.iii.a and
III.2.iii.b). We classify the service disciplines based on the types of modeling
information required (subregions 1.i and 2.i in all three main regions).
If information from several regions is used, the service discipline is classified in
the region whose memory requirements tend to dominate. For example, a discipline that
requires general and local job information is classified under local job information. If
local and global job information is needed, the discipline is classified under global
information.
4.2.3.1. General Information

Disciplines that require only general information (subregions 1.i and 2.i of region I) are
ones that do not prioritize jobs based on the time they have spent in the system, or their
arrival time at the current queue. Examples of such disciplines are
• Random selection
• “Mob Rule”: with multiple job types, next job is selected from the type that has
the largest number waiting (used as a substitute for FCFS in the RD fab model
from Section 2.3). Job “types” may be different classes of jobs, or may be the
same class of job at different processing stages (for re-entrant systems).
• Priority: certain job types have priority over others; examples include push and
pull disciplines based on processing stage. These are also known as the first-
buffer-first-served (earliest processing stage) or last-buffer-first-served (latest
processing stage) disciplines (Govil and Fu 1999).
• Batching: a certain minimum number of jobs is required before service can begin;
between (min batch size) and (max batch size) jobs are serviced at the same time
In all these examples, there cannot be delay-based queueing within groups. For example,
it is not possible to model a two-class priority server with FCFS service within a priority
class using only general information if waiting and cycle times are the desired output.
Among possible service disciplines, the ones in this category require the least
amount of information. The information used is already contained in the model and is
needed for basic model functionality.
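To make this concrete, the "Mob Rule" selection above can be implemented from counts alone (a hypothetical sketch; only the number waiting per job type, region I information, is consulted):

```python
def mob_rule(counts):
    """Select the job type with the largest number waiting.

    `counts` maps job type -> number in queue. No per-job (region III)
    information such as arrival times or time in system is used.
    """
    waiting = {t: n for t, n in counts.items() if n > 0}
    if not waiting:
        return None
    return max(waiting, key=waiting.get)

assert mob_rule({"A": 2, "B": 5, "C": 0}) == "B"
assert mob_rule({"A": 0, "B": 0}) is None
```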
4.2.3.2. Resource Information

Service disciplines that require resource information (subregions 1.i and 2.i of region II)
are much less common than those that use job information; resource information is more
likely to be used in selecting a server to process a job, rather than in selecting a job to
begin service.
An example of a job selection rule that uses resource and general information is
one where jobs are put in queue based on resource characteristics they need (e.g., a
resource qualified to perform a specific task). We are not differentiating between jobs, so
the job counts are general information. When a resource becomes available, it selects at random one of the jobs that need its qualifications if such jobs are present, and otherwise selects any waiting job at random. In this example, job-specific statistics would not be
available.
4.2.3.3. Local Job Information

Service disciplines that require local job information (subregions 1.i and 2.i.a of
region III) use information about the job’s current situation. Examples include:
• FCFS
• LCFS
• Shortest processing time: the job with the shortest processing time is served first.
This is common in scheduling problems (Pinedo 2002). It is possible to model
this using only general information if the service times are generated not at job
arrival, but rather as order statistics at the start of service. The ability to do this
may be situation-dependent.
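One situation where this trick clearly works is exponential service: the minimum of k i.i.d. Exp(μ) draws is itself Exp(kμ), so the first order statistic can be sampled directly at the start of service, without tagging each job with a time at arrival. The sketch below (our own illustration of the memoryless case, not a general proof) verifies the distributional equivalence by simulation:

```python
import random

random.seed(42)
MU = 1.0       # service rate; exponential (memoryless) case assumed
K = 4          # jobs in queue when the server becomes free
N = 200_000    # replications

# Generate all K times at arrival and serve the shortest first...
min_of_k = [min(random.expovariate(MU) for _ in range(K)) for _ in range(N)]
# ...versus sampling the first order statistic directly at the start
# of service: min of K Exp(MU) draws is Exp(K * MU).
direct = [random.expovariate(K * MU) for _ in range(N)]

m1 = sum(min_of_k) / N
m2 = sum(direct) / N
# Both sample means should be near the theoretical 1 / (K * MU) = 0.25.
assert abs(m1 - 1 / (K * MU)) < 0.01 and abs(m2 - 1 / (K * MU)) < 0.01
```

For non-memoryless service distributions the remaining jobs' times are no longer independent after the first draw, which is one reason the ability to do this is situation-dependent.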
4.2.3.4. Global Job Information

Service disciplines that require global job information (subregions 1.i and 2.i.b of
region III) use information about the job’s experience in the system as a whole. They
may also use general or local job information to select among jobs that have the same
global evaluation criteria. Examples include:
• Earliest due date: the job whose due date is closest is served first; if due dates are
past, the job that is most tardy is selected.
• Dedication: jobs queue for the specific resource by which they were previously
serviced, or to which they have been assigned.
• Critical ratio: see below
Critical ratio ranking (see Rose 2002) is a common discipline used in the semiconductor
industry. A survey of additional dispatching rules used in semiconductor fabs can be found in
(Atherton and Atherton 1995). While there are slight implementation differences in some
software packages, the critical ratio (CR) is fundamentally defined as the ratio of the
job's tardiness over its remaining processing time,

    CR = 1 + (due date − current time) / (total remaining processing time).    (1)
The job-driven implementation of the fab model in Section 2.3 implements a variation of
the critical ratio (Fischbein 2002):

    expected time in system − actual time in system.    (2)

While (2) does not require the specification of a due date, it does still require global job
information (actual time in system).
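The two priority indices can be sketched as simple functions (the helper names are hypothetical; the formulas are (1) and (2) above):

```python
def critical_ratio(due_date, now, remaining_processing_time):
    """Critical ratio as in (1): 1 plus the due-date margin over remaining work.

    Requires global job information: the due date and remaining processing
    time of each individual job (region III)."""
    return 1 + (due_date - now) / remaining_processing_time

def cr_variant(expected_time_in_system, actual_time_in_system):
    """Variation (2): needs no due date, but still uses the job's actual
    time in system, which is global job information."""
    return expected_time_in_system - actual_time_in_system

# A job due at t=20, evaluated at t=10, with 5 time units of work left:
assert critical_ratio(20, 10, 5) == 3.0
# A job that has been in the system 2 units longer than expected:
assert cr_variant(8.0, 10.0) == -2.0
```

Under either index, the waiting job with the smallest value would typically be selected next.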
4.2.4. Evaluation

The ability to classify service disciplines based on the information required to model
them is an extremely useful tool. The authors have spent much time trying to model
disciplines in a “resource-driven” simulation that simply cannot be modeled (e.g., critical
ratio).
The service discipline taxonomy above assumes detailed job statistics are required
from the simulation. If this is not the case, it may be possible to reduce the modeling
information needed. For example, average queue sizes (and job waiting times) for
(priority) FCFS and LCFS can be found using only general system information. The
same is true for shortest processing time scheduling if service times are not generated on
job arrival. Whether the information needed can be reduced depends on the actual
discipline used, and on the desired output statistics.
4.3. Desired Approximation Characteristics

In some cases, we may not want or be able to include the necessary information in our
simulation to model the system or obtain the desired output statistics. In these cases,
approximations are required to substitute for the unavailable information. In this section,
we discuss the criteria for comparing different approximation algorithms.
Two aspects of approximation algorithms should be considered when evaluating
their usefulness. The first is the approximation accuracy, discussed in Section 4.3.1; the second is the computational effort, discussed in Section 4.3.2. Section 4.3.3 gives an
example of a single measure that can incorporate both aspects.
4.3.1. Approximation Accuracy

If an approximation is to be of value, it must return performance measures that are near
the actual performance measures, or provide reliable bounds on the measures. This is
true for approximations in simulation models, optimization algorithms (e.g., integer
programming techniques), or in any other application. The definition of “near” is
application-dependent.
An approximation algorithm whose behavior is provable is preferable to one that
appears to work empirically, but cannot be formally shown to be accurate. Qualities of
the algorithm that can be proven include the following. They are listed in roughly
increasing order of desirability. That is, reliability is the first concern, followed by
bounding behavior, etc.
• Reliability: The algorithm should give the same results each time it is run. In a
deterministic setting, this means identical results should be obtained for the same
problem and the same algorithm parameter settings. In a stochastic environment, the
results across runs should be statistically identical; that is, any differences in output
are due only to randomness in the system. An algorithm that returns inconsistent
estimates across runs is an unreliable source of information on the underlying
process.
• Bounding behavior: Knowing the algorithm will always stochastically dominate (or
be dominated by) the true value is helpful as the user will know the (unknown)
system values are at least or at most as great as the ones returned by the
approximation. For example, the solution to an integer program will never be better
than its linear programming relaxation. Approximations may provide lower or upper
bounds, or both. Tightness of bounds is desirable.
• (Rate of) Convergence: Convergence is important for algorithms in general, and for
approximation algorithms in particular. Fast convergence is even more important, since
we hope to gain a computational advantage by sacrificing the accuracy of results.
Convergence is discussed in (Powell 1981).
• Percent error: If an approximation can be shown to always be within a certain
percentage of the system values or optimal solution, its usefulness will be increased.
Often, the percentage difference is related to the convergence rate of the algorithm,
see for example (Fleischer 2004). This characteristic is the strongest listed here, and
it is usually the most desired. It is also often difficult to attain, especially in the
context of simulation models. Percent error is frequently specified in requirements,
and is preferable to absolute error because it removes the unit and scale components
of the reported error.
The qualities listed above are all quantifiable and measurable. For example, the
reliability of the algorithm output can be tested using standard statistical hypothesis
testing. For the bounding behavior, we can specify the type of dominance the
approximation has. More discussion of approximation algorithms, and examples of
dominance behavior, can be found in (Axelsson and Marinova 1999;
Gutin and Yeo 2002; Gutin et al. 2003). An excellent reference for stochastic orderings
can be found in (Shaked and Shanthikumar 1994).
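For example, reliability across two batches of runs can be checked with a standard two-sample test. The sketch below hand-rolls a Welch t statistic with the standard library (in practice a library routine such as a packaged t-test would be used; the data are invented for illustration):

```python
import statistics, math

def welch_t(a, b):
    """Welch two-sample t statistic for comparing the mean outputs of two
    batches of runs of the same stochastic algorithm."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

run1 = [10.2, 9.8, 10.1, 10.0, 9.9]
run2 = [10.0, 10.3, 9.7, 10.1, 9.9]
# |t| well below ~2 gives no evidence that the two batches differ by
# more than the randomness in the system, i.e., no sign of unreliability.
assert abs(welch_t(run1, run2)) < 2.0
```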
These requirements should not be considered independently of the computational
requirements discussed below. For example, an algorithm that provides tight, reliable
bounds with little computation may be preferable to one that is within x% of the optimal
solution but that requires y percent more computational effort. Reliable, provable bounds
on the solution are very valuable information.
4.3.2. Computational Requirements

While the accuracy is important, there is another aspect of algorithms that is more
significant for approximations than it is for conventional algorithms: the computational
requirements. The amount of memory and processor time required is important for any
algorithm, but with an approximation, we are typically consciously giving up accuracy in
an effort to gain speed and reduce the system (usually computer system) requirements to
execute the algorithm. If the approximation is relatively accurate but requires more time
and resources to run than the exact method it is trying to replace, there is no benefit to
using the approximation.
As with the quality of solution characteristics, the computational requirements of
an algorithm are measurable. Both the amount of memory and the processor time
required can either be found theoretically, or can be estimated by looking at computer
statistics while running the algorithm.
4.3.3. Error Measures

Even when we can measure the accuracy of the algorithm and its computational
requirements explicitly, we must consider at least three numbers when comparing
alternate approaches to the same problem. The question of how to weight these different
aspects in the comparison is a challenge: Is an error of x% acceptable if we have reduced
our memory requirements while increasing the execution time?
4.3.3.1. Existing Measures

We have found limited literature on a single error measure. (L'Ecuyer 1994) defines
C(X) as the expected value of the CPU time required to compute a realization of the
random variable X, and MSE[X] as the mean squared error of estimating X. Then the
efficiency of X, not to be confused with statistical efficiency (Fisher 1922), is

    Eff(X) = 1 / (MSE[X] ⋅ C(X)).    (3)
A similar efficiency measure is given in (Fox and Glynn 1990); the authors focus on the
amount of variance reduction obtained through conditioning versus the computational
effort required to obtain the conditioned values.
If X and Y are unbiased random variables that are used to estimate the same
performance measure of our system, then X is preferable to Y if Eff(X) > Eff(Y).
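For instance (a hypothetical sketch of the criterion in (3), with invented numbers): an estimator with four times the MSE can still be preferable if it is more than four times cheaper to compute.

```python
def efficiency(mse, cpu_time):
    """Eff(X) = 1 / (MSE[X] * C(X)), as in (3)."""
    return 1.0 / (mse * cpu_time)

# Estimator X: accurate but slow; estimator Y: coarser but much faster.
eff_x = efficiency(mse=0.01, cpu_time=100.0)   # = 1.0
eff_y = efficiency(mse=0.04, cpu_time=10.0)    # = 2.5
assert eff_y > eff_x   # Y is preferable under this criterion
```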
There are several problems with this measure. The first is that we wish to take
into account not only CPU time, but also the amount of memory required for the solution.
The second is that this efficiency measure is not normalized, and it is not clear how to
interpret a quantity like "0.3 per m²⋅sec⋅MB." Clearly, efficiencies closer to zero are less desirable, but one can change the value of the efficiency simply by changing the units of MSE[X].
We would like an efficiency measure that is not dependent on the units used to
measure it. In addition, we would like it to be bounded like the correlation coefficient (a
normalized covariance); this would allow more meaningful comparisons to be drawn.
(Schmeiser and Yeh 2002) discuss a single criterion for evaluating confidence
interval procedures. It compares the results from any “real-world” confidence interval to
the “ideal” confidence interval. We take a similar approach in the error measure
discussed next.
4.3.3.2. Example: Graphical Approach to Measuring Approximation Error

One approach for including all three aspects in a single measure treats them as
coordinates in three-dimensional space. Each component should be expressed as a
percentage of the method that would give the exact (or optimal) solution, which
eliminates the measure’s dependence on units and the need to provide interpretation of
units. For example, if the exact solution is $10 and the approximation solution is $15, the
approximation coordinate will be $15/$10 = 150%. If the approximation were $5, the
coordinate would be $5/$10 = 50%. The baseline algorithm is located at (1,1,1). The
baseline algorithm can be any algorithm that represents the status quo, or that gives the
optimal or most accurate solution.
The three dimensions for this measure are memory, error, and CPU seconds.
Depending on the application, different metric spaces can be constructed from this to
measure the distance from the “origin” (1,1,1), for example a normed linear space
(Powell 1981). In general, the memory and CPU requirement coordinates cannot be
negative, though the approximation error can be. The coordinates are not bounded above
in any of the three dimensions.
If several algorithms are plotted together, similarities between the algorithms may
be detected based on clustering behavior. For example, some algorithms may be
extremely accurate but require a large amount of CPU power and little memory, while a different class of algorithms is extremely accurate but requires a lot of memory and little
CPU. If x1 and x2 are large positive numbers, the first set of algorithms would be
clustered around (1,1,x1) and the second around (x2,1,1).
Another interesting idea is that we can specify regions similar to the efficient
frontier in the measure’s space (Brealey and Myers 1996). For example, we can define
application-specific regions where the magnitudes of CPU, memory, or error coordinates
are unacceptably large. We can also define “breakeven” points where the cost of CPU
power and memory, or CPU/memory and error are equivalent.
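The coordinates and a distance from the baseline point (1,1,1) can be sketched as follows (a hypothetical illustration with invented numbers; the Euclidean norm is only one of the many possible distance measures):

```python
import math

def coords(approx, baseline):
    """Express (memory, error, CPU) as fractions of the baseline method."""
    return tuple(a / b for a, b in zip(approx, baseline))

def distance_from_baseline(c):
    """Euclidean distance from the baseline point (1, 1, 1)."""
    return math.sqrt(sum((x - 1.0) ** 2 for x in c))

# Baseline (exact) method: 200 MB, exact answer 10.0, 60 CPU seconds.
# Approximation: 50 MB, answer 15.0, 6 CPU seconds.
c = coords((50, 15.0, 6), (200, 10.0, 60))
assert c == (0.25, 1.5, 0.1)          # quarter the memory, 50% error, a tenth the CPU
assert round(distance_from_baseline(c), 3) == 1.274
```

Application-specific "unacceptable" regions or breakeven points would then be expressed in this same normalized coordinate space.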
While this measure integrates all three algorithm components into a single value and allows flexibility in the distance measure used to compare algorithms, it has several drawbacks. The first is that it may not be bounded, depending on the distance
measure chosen (e.g., a rectilinear distance measure in a normed linear space). Examples
of distance measures can be found in (Dillon and Goldstein 1984). If we are comparing
algorithms to each other, it may be sufficient to be able to compare the value of their
error measures, even if the measures are not bounded.
Another problem is that one must know or at least estimate the approximation
error. (The relative CPU and memory requirements are more easily determined since
they can be found theoretically by analyzing the algorithm, or measured when running
the algorithm.) The error measure also does not capture bounding behavior well. That is, if an algorithm gives reliable upper and lower bounds, it is not clear how to incorporate these bounds into the accuracy component. A possibility is to use the distance between the bounds in the calculation instead of the approximation error:

    (upper − lower) / (2 ⋅ exact).    (4)
Finally, one must know the baseline quantities to which the approximation is being compared. This may not always be possible, especially in the case of simulation models, where the entire purpose of the model is to find these baseline quantities.
Despite its disadvantages, this graphical approach to comparing approximation
algorithms is appealing. We will investigate it more in future work.
4.4. Example: Approximating Dedication Constraints

In some cases, we are unable to model system behavior correctly because of lack of
information. One such example is dedication constraints. As defined in Section 2.3.2,
dedication constraints (in our context) are requirements that jobs be serviced by the same
machine on successive visits to the machine group. To model the constraints accurately,
we need information from region III.2.i.b. We give a formal definition of dedication
constraints in Section 4.4.1.
Resource dedication is used in an attempt to improve the quality of service (e.g. in
using primary care physicians with HMOs), or the quality of the final product (e.g. “bin
yield” in wafer manufacturing, the percentage of wafers that meet specifications).
Dedication also has an impact on system dynamics and delays in the system: With
dedication, it is possible that a server may remain idle even though there are jobs waiting
at its server group. A job may wait longer than it would have if dedication were not
being used. Ignoring dedication when modeling the system will lead to biased queue
sizes and server utilizations. Queue sizes will be underestimated, while utilizations may
be overestimated. (Underestimating queue sizes is a more realistic and significant
problem. For congested systems, there will not be much forced idle time for the servers.)
There is literature discussing dedication, as well as its impact on operations. See
(Jensen et al. 1996; Shafer and Charnes 1997; Woods 1998; Rohan 1999;
Akcalt et al. 2001), for example. The literature we found assumes that all required
information is available for modeling.
In this section, we present a method that bounds performance statistics for a
system with dedication without using job information (region III). In Section 4.4.1, we
define notation and terminology. Sections 4.4.2 through 4.4.4 motivate and introduce the
approximation. Section 4.4.5 presents computational results, and Section 4.4.6 proposes
improvements and outlines future work. Examples of systems with dedication constraints
are given in Appendix C.1.
In addition to the dedication constraints referred to above, there are other types of
dedication. With tool dedication, certain tools are dedicated to very specific tasks
although they may be able to perform other tasks as well. This is often done to reduce
the number of setups or the amount of cleaning required for tools (Rohan 1999).
4.4.1. Problem Statement
Consider an open queueing network in which jobs of j* types are processed at g* server
groups. Each server group g is composed of $s_g^*$ functionally identical servers. Each job
type is characterized by known service-time distributions, priorities, and routes. The r*
route steps may be deterministic or stochastic and state-dependent, and routes are re-entrant
(jobs may visit the same server group more than once). One or more route steps have a
dedication constraint, with each job required to visit the same server, not just the same
server group, from a previous step. The set $\{j^*, g^*, r^*, s_g^*, t^*\}$ and the associated
information are in region I of the taxonomy. The network has a set of associated
performance measures θ.
Define the following processes (classification regions are given in parentheses):
$\mathbf{Q}(t) = \{Q_{rg}(t) : 1 \le r \le r^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, is the number of jobs at each route
step r waiting in queue for each server group g at time t (region I.2.ii).
$\mathbf{S}(t) = \{S_{ig}(t) : 1 \le i \le s_g^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, is the status of server i in group g at
time t, with $S_{ig}(t) \in \{\text{busy}, \text{idle}\}$ (region II.2.ii.b). The total number of available
servers in group g at time t is $S_{\cdot g}(t) = \sum_{i=1}^{s_g^*} I\left(S_{ig}(t) = \text{idle}\right)$, where $I(\cdot)$ is the
indicator function. $S_{\cdot g}(t)$ is in region I.2.ii.
In a resource-driven simulation, or to a casual observer of jobs in a system,
$\{(\mathbf{Q}(t), \mathbf{S}(t)) : 0 \le t \le t^*\}$ are observable.
Additionally, let
$\bar{\mathbf{Q}}(t) = \{\bar{Q}_{irg}(t) : 1 \le i \le s_g^*, 1 \le r \le r^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, be the number of jobs at route
step r waiting specifically for server i of group g at time t (region II.2.ii.b). The
total number of jobs waiting at group g, either for a specific server or for any of
the servers, is $Q_g^{tot}(t) = \sum_{r=1}^{r^*}\left(Q_{rg}(t) + \sum_{i=1}^{s_g^*} \bar{Q}_{irg}(t)\right)$ (region I.2.ii); and let
$\mathbf{D}(t) = \{D_{ijg}(t) : 1 \le i \le s_g^*, 1 \le j \le j^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, be the number of jobs of
type j that server i of group g has processed that may return to i in the future
(region II.2.ii.b).
In order to find $\{\bar{\mathbf{Q}}(t)\}$ with dedication constraints, we need the following information:
$\mathbf{J} = \{J_{gk} : 1 \le g \le g^*, 1 \le k \le N\}$ is, for each job k, the resource in group g that k has been
dedicated to. If no dedication is present at g, the array entry can be left blank
(region III.2.i.b).
If complete information on the system were available (if $\{\mathbf{J}\}$ were known), we would
know the stochastic process $\{(\bar{\mathbf{Q}}(t), \mathbf{D}(t)) : 0 \le t \le t^*\}$. If $\{\bar{\mathbf{Q}}(t)\}$ is unknown, the jobs
that would be counted there are added to $\{\mathbf{Q}(t)\}$.
We wish to use the observed stochastic process $\{(\mathbf{Q}(t), \mathbf{S}(t), \mathbf{D}(t)) : 0 \le t \le t^*\}$ to
mimic the behavior the system would exhibit if $\{\mathbf{J}\}$, and therefore $\{\bar{\mathbf{Q}}(t) : 0 \le t \le t^*\}$, were
observable.
4.4.2. System Behavior from the Job Perspective
To develop bounds for the system performance measures using a model where $\{\mathbf{J}\}$ (and
therefore $\{\bar{\mathbf{Q}}(t)\}$) is unobservable, we consider the system behavior from the jobs'
perspectives.
When dedication is not present, the typical logic for arriving jobs is that they are
immediately processed if there are available (idle) resources, and must wait if there are
none. When dedication is present, jobs will not always be processed when they arrive
(for a revisit) and resources are idle: It is possible that the available resources do not
include the ones that processed the job originally. We can approximate the probability
that this occurs by the following:
$$P\{\text{can start} \mid \text{job arrives}, S_{\cdot g}(t) > 0\} = P\{\text{required resource available}\} \approx \frac{\text{idle resources}}{\text{total resources}} = \frac{S_{\cdot g}(t)}{s_g^*} \qquad (5)$$
When a job arrives at server group g, subject to dedication constraints, at time t and finds
idle resources, the job does not begin processing unconditionally. Rather, it begins
processing with probability $S_{\cdot g}(t)/s_g^*$, which uses only existing information from region I.
This is an approximation because the probability of the required resource being
available is not the same as the proportion of idle resources. The approximation ignores
the memory that is implicit in the system, and also assumes that arriving jobs were
equally likely to have been assigned to any of the resources.
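The uniform-assignment assumption can be illustrated with a short Monte Carlo sketch (ours; the function name and counts are hypothetical): when the dedicated server is drawn uniformly from the group, independently of which servers are idle, the right-hand side of Eq. (5) is exact, so the approximation error comes entirely from the system memory that it ignores.

```python
import random

def required_available_prob(n_idle, s, n_trials=200_000, seed=0):
    """Estimate P{required resource available} when the dedicated server is
    drawn uniformly from the s servers, independently of the idle set."""
    rng = random.Random(seed)
    idle = set(range(n_idle))  # servers 0..n_idle-1 are the idle ones
    hits = sum(rng.randrange(s) in idle for _ in range(n_trials))
    return hits / n_trials

# Under these assumptions, the estimate matches Eq. (5)'s idle/total ratio.
print(required_available_prob(2, 5))  # close to 2/5
```

Any dependence between a job's dedicated server and the current idle set (the "memory" discussed above) breaks this equality, which is exactly where the approximation loses accuracy.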
We expect the job wait times in the approximation to be greater than they would
be in a system where dedication is modeled explicitly. That is, this approximation is
dominated by the exact model. For example, consider the case where times between
visits to g are short, and the system is lightly loaded. In this case, the probability that
server i, having just completed processing job k, is idle when k returns is high; this is
because there are not many other jobs in the system, and i completed processing k a short
time ago. The probability does not take this information (“memory of the system”) into
account, and may require k to wait.
Let $\{A_g(t) : t \ge 0\}$ be the number of times through time t an arriving job is forced
to wait at g although there is an idle server. If $A_g^{approx/exact}(t)$ is the number of times the
approximation/exact algorithm forces arriving jobs to wait at g by time t, we conjecture
without proof that $A_g^{approx}(t^*) > A_g^{exact}(t^*)$.
4.4.3. System Behavior from the Resource Perspective
This section examines the system behavior from the perspective of resource g, subject to
dedication constraints. When dedication is not present, a resource that becomes available
(finishes processing a job or is put back online after maintenance) immediately begins
processing a waiting job. When dedication is present, the resource may be forced to
remain idle even though there are jobs in queue. That is, server i in group g becomes
available at time t, finds $Q_g^{tot}(t) > 0$, but cannot begin processing. We can approximate
the probability that this occurs as follows:
$$\begin{aligned}
P\{\text{can begin processing} \mid i \text{ now available}, Q_g^{tot}(t) > 0\}
&= P\{\text{at least one job waiting for } i\} \\
&= 1 - P\{\text{no job waiting for } i\} \\
&\approx 1 - \prod_{k=1}^{Q_g^{tot}(t)} P\{\text{job } k \text{ not waiting for } i\} \\
&\approx 1 - \prod_{k=1}^{Q_g^{tot}(t)} \left(1 - \frac{1}{s_g^*}\right) = 1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)}
\end{aligned} \qquad (6)$$
In this case, when a server in group g becomes available at time t and finds jobs waiting,
it does not begin processing immediately. Rather, it begins processing with probability
$1 - \left(1 - 1/s_g^*\right)^{Q_g^{tot}(t)}$, which uses only existing information from region I.
If the number of jobs in queue goes to zero, the probability will go to 0:
$$1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)} \xrightarrow{\;Q_g^{tot}(t) \to 0\;} 1 - 1 = 0 \qquad (7)$$
If the number of jobs in queue gets very large, the probability of starting will tend to 1:
$$1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)} \xrightarrow{\;Q_g^{tot}(t) \to \infty\;} 1 - 0 = 1 \qquad (8)$$
This is an approximation because it assumes that all jobs waiting in queue are
independent of one another, which may not be the case. For example, if the servers have to be set
up to process a certain job type, they may process several jobs of that type before
switching to a different setup. The m jobs of that type that were processed
sequentially will then all return to the queue at roughly the same time, and their returns will be dependent.
Furthermore, the approximation assumes, as in Section 4.4.2, that the probability of
waiting for server i is $1/s_g^*$, which may not be true.
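The start probability in Eq. (6), and its limiting behavior in Eqs. (7) and (8), can be checked directly with a small sketch (the function name is ours):

```python
def p_begin_processing(q_tot, s):
    """Eq. (6): probability that an available server in a group of s servers
    can begin processing when q_tot jobs are waiting, assuming each job waits
    for any particular server with probability 1/s, independently."""
    return 1.0 - (1.0 - 1.0 / s) ** q_tot

# Eq. (7): the probability vanishes with the queue ...
print(p_begin_processing(0, 5))     # 0.0
# Eq. (8): ... and tends to one as the queue grows
print(p_begin_processing(1000, 5))  # approximately 1.0
```

The probability is also strictly increasing in the queue length, so a longer queue can only make forced idleness less likely under the approximation.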
It is not clear a priori whether this approximation will dominate or be dominated
by the exact algorithm. We can construct scenarios to argue either way. Consider the
example above where several jobs of the same type are processed sequentially because of
setup costs. In this case, we will underestimate the probability of being able to start,
since we are ignoring the fact that m of the jobs are waiting for i. Instead, we assign each
of them an equally small probability of waiting for this resource.
On the other hand, if server i has just been down for repair or maintenance for a
long time, and times between revisits are short, it is unlikely that there will be any jobs
waiting for i in queue (because it has not been processing any and sending them out into
the system). The probability ignores this fact and assumes that each job is equally likely
to be waiting for i. Both extensions require more information from region II. The current
approximation uses only existing information.
Let $\{V_g(t) : t \ge 0\}$ be the number of times by time t a server in group g was forced
to remain idle, even though there were jobs waiting, when it becomes available. If
$V_g^{approx/exact}(t)$ is the number of times by t the approximation/exact algorithm forces a
server at g to remain idle when there are jobs waiting, we cannot say a priori whether
$V_g^{approx}(t^*) > V_g^{exact}(t^*)$ or $V_g^{approx}(t^*) < V_g^{exact}(t^*)$.
4.4.4. Approximation Implementation
Figure 4 shows the event graph model for an n-stage queueing network. Figure 7 shows
the event graph after the approximation probabilities have been added. If stage i is not
subject to dedication constraints, the probabilities are omitted. RND is a function
returning a pseudorandom Uniform(0,1) number.
Figure 7. Event graph for an n-stage queueing network, with dedication constraints modeled using the approximation
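The gating logic can be sketched in a minimal event-scheduling loop. The sketch below is ours, not the dissertation's code: it covers a single server group with Poisson arrivals and exponential service (no routing or revisits, unlike the n-stage networks studied here), applying the Eq. (5) gate at arrivals and the Eq. (6) gate at service completions; all names and parameter values are assumptions.

```python
import heapq
import random

def simulate_dedication_approx(lam, mu, s, horizon, seed=1):
    """Single server group with the probabilistic dedication gates of
    Eqs. (5) and (6) in place of job-level dedication information."""
    rng = random.Random(seed)
    q = 0            # jobs waiting (region I information only)
    r = s            # idle servers
    a_count = 0      # arrivals forced to wait despite an idle server
    v_count = 0      # servers forced to stay idle despite waiting jobs
    area, last = 0.0, 0.0
    fel = [(rng.expovariate(lam), 'ENTER')]  # future events list
    while fel:
        t, ev = heapq.heappop(fel)
        if t > horizon:
            break
        area += q * (t - last)   # time-average queue-size accumulator
        last = t
        if ev == 'ENTER':
            q += 1
            heapq.heappush(fel, (t + rng.expovariate(lam), 'ENTER'))
            if r > 0:
                if rng.random() <= r / s:                   # Eq. (5) gate
                    q -= 1; r -= 1
                    heapq.heappush(fel, (t + rng.expovariate(mu), 'FINISH'))
                else:
                    a_count += 1                            # forced wait
        else:  # FINISH
            r += 1
            if q > 0:
                if rng.random() <= 1 - (1 - 1 / s) ** q:    # Eq. (6) gate
                    q -= 1; r -= 1
                    heapq.heappush(fel, (t + rng.expovariate(mu), 'FINISH'))
                else:
                    v_count += 1                            # forced idleness
    return area / last, a_count, v_count

avg_q, forced_waits, forced_idles = simulate_dedication_approx(0.5, 1.0, 1, 50_000)
print(forced_waits, forced_idles)  # 0 0  (s = 1: both gates always pass)
```

With s = 1 both gates pass with probability one, so the sketch reduces to an ordinary M/M/1 queue; with s > 1 the gates occasionally force arriving jobs to wait (a_count) or servers to idle (v_count), mimicking dedication without any region III information.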
4.4.5. Results
To test our model, we constructed a simulation of a deterministic re-entrant queueing
network with one job type and dedication constraints on one of the server groups. This
group is modeled explicitly, while the remaining network is modeled as a black box; it is
more interesting to know how long jobs were away from the server group, and less
interesting to know what they were doing while they were gone. Parameters varied in the
simulation are:
$s_g^*$  number of servers at the dedication group g
λ  arrival rate to the system
ν  service rate at group g (may differ for the different visits to g)
τ  "revisit rate": 1/(average amount of time between visits to g)
m  number of revisits to g by each job
d  service discipline at g
Our performance measures are the average queue size (total) at g, and $A_g(t^*)$ and $V_g(t^*)$.
Figure 8 shows the average total queue size when m = 2 (total visits to g by each
job is 3), d is a pull discipline (jobs at visit 3 get processed before those at visit 2 before
those at visit 1), and the time between revisits is “long” (τ = 0.5⋅λ).
Each of the points in Figure 8 is the average of 5 pairs of antithetic simulation
runs. The number of servers is fixed at 5, and the service rate is varied to provide an
even sampling across utilizations. The arrival and service distributions are Exponential.
The x-axis is resource utilization, which is not one of the system parameters. It is an
output from the simulation, and is used as a surrogate for the number of revisits, the
service rate, etc. We are using this surrogate measure of congestion because of the large
number of input parameters. The effective traffic intensity outlined in
(Fendick and Whitt 1998) is a possible independent measure we will investigate in the
future.
In Figure 8, the full dedication case (line with triangles, “Ded.” in the legend) is
bounded below by the no dedication case (line with diamonds, “No Ded.” in the legend),
as previously observed. (This corresponds to the model in Figure 4.) It is bounded above
by the approximation we have constructed (line with circles, “Prob.” in the legend). If
these results hold up, we would be able to run two fast, straightforward simulations and
obtain bounds on the behavior of the (slightly) slower and more detailed simulation.
Figure 8. $Q_g^{tot}(t)$ for the system described above
Figure 9 shows the queue size for the same system where the time between revisits is
short, τ = 2⋅λ. The same type of bounding behavior as in Figure 8 is evident. The
distances between the three curves remain relatively constant, whereas the bounds in
Figure 8 become tighter as the utilization (system congestion) increases. This follows the
intuition we formed earlier that the approximation does not maintain the system memory
that is more clearly defined if the times between revisits are short.
Figure 10 shows $A_g^{approx}(t^*)$ and $A_g^{exact}(t^*)$ as proportions of the total number of
jobs processed for both long and short times between revisits (solid and dashed lines,
respectively). When revisit times are long, $A_g^{approx}(t^*)$ is very close to $A_g^{exact}(t^*)$,
especially as the utilization increases. With long revisit times, and especially when the
servers are very busy, server i's status is independent of a job's arrival to or departure
from g. When the revisit times are short, there is a large discrepancy, especially for more
lightly-loaded systems. This confirms the arguments given in Sections 4.4.2 and 4.4.3.
When the system is heavily loaded, this effect is dampened, as i is likely to be able to
start working on another job right away, or at least before the job it just completed
returns.
We note that $A_g^{approx}(t^*) > A_g^{exact}(t^*)$, confirming the intuition in Section 4.4.2 that
the estimate would be conservative.
Figure 11 shows $V_g^{approx}(t^*)$ and $V_g^{exact}(t^*)$ as proportions of the total number of
jobs processed. Again, the approximation appears very close when the system memory
has had the chance to erase itself, that is, when the time between revisits is long, or the
time is short but the system is congested. This reinforces our earlier intuition. The
differences between $V_g^{approx}(t^*)$ and $V_g^{exact}(t^*)$ are smaller than those between $A_g^{approx}(t^*)$
and $A_g^{exact}(t^*)$, which is consistent with our a priori difficulty in determining whether the
approximation would be conservative.
The values for both $A_g^{approx}(t^*)$ and $V_g^{approx}(t^*)$ do not depend on the revisit rate τ,
though τ obviously does have a large impact on the values of $A_g^{exact}(t^*)$ and $V_g^{exact}(t^*)$.
This illustrates that we are not taking system information other than the queue size and
server status into account. Better approximations may attempt to include more
information.
We now give two conclusions we draw from our experimentation. The first we
state as a lemma and prove. The second is a hypothesis based on our intuition and
experimental results and is presented without proof.
Lemma 1 The queue size at a resource subject to dedication constraints is bounded
below by the queue size of the resource group in the same model, but without
dedication constraints.
Proof: The systems with and without dedication are identical except in the servers that
are allowed to process job k. If dedication is present, only server i' may serve k,
while there are no restrictions when dedication is not enforced. Define $\Phi_i(t)$ as
the next time on or after time t that server i will become available to process jobs.
If job k arrives at time t and there are no dedication constraints, k's wait is
$\min_i\{\Phi_i(t)\} - t$. With dedication, k's wait is $\Phi_{i'}(t) - t$. Because
$\min_i\{\Phi_i(t)\} \le \Phi_{i'}(t)$, k's wait with dedication constraints cannot be shorter than
in the unconstrained system. ■
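The comparison in the proof can be made concrete with a toy example (the availability times below are hypothetical):

```python
# Next-availability times Phi_i(t) for three servers, at time t = 10.0
t = 10.0
phi = {1: 12.0, 2: 10.5, 3: 17.0}

wait_unconstrained = min(phi.values()) - t   # min_i {Phi_i(t)} - t = 0.5
# Whichever server i' the job is dedicated to, its wait can only grow:
for p in phi.values():
    assert wait_unconstrained <= p - t       # Lemma 1's server-by-server bound
print(wait_unconstrained)  # 0.5
```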
Figure 9. $Q_g^{tot}(t)$ for the system in Figure 8, with short revisit times
Figure 10. $A^{approx}(t^*)$ and $A^{exact}(t^*)$ as proportions of the total number of jobs processed
Figure 11. $V^{approx}(t^*)$ and $V^{exact}(t^*)$ as proportions of the total number of jobs processed
Claim 1 The proposed approximation results in $A_g^{approx}(t^*) \ge A_g^{exact}(t^*)$ and
$V_g^{approx}(t^*) \ge V_g^{exact}(t^*)$.
4.4.6. Possible Improvements and Future Work
In the near future, we will
• perform more runs and calculate confidence intervals or other measures of
precision to rule out the possibility that our conclusions are due to noise;
• improve the approximations by taking into account things like repairs, sequential
processing of identical parts, or the number of parts of each type each server has
processed (if the server has processed none of this part type, the probability that
the part is waiting for this server is zero); that is, add more information to the
model and approximation;
• test the approximation given in Appendix C.2.
On a longer time horizon, we would like to
• prove that our approximations provide upper and lower bounds on the true
system; and
• quantify in which situations the approximations will provide good bounds.
4.5. Example: Approximating Waiting Time Distributions
Besides preventing accurate modeling of system behavior, lack of information can hurt a
simulation study by precluding the gathering of desired output statistics. For example,
job waiting time distributions may be required without the associated access to
information in region III of the information taxonomy.
In this section, we introduce the Time Slice Counter Method, which can be used
to find delay distributions for first-come-first-serve (FCFS) terminating systems. The
advantages of the Time Slice Counter Method are that it does not require information
from region III, and that it appears to work better for congested and variable systems than
it does for idle systems. It adds a fixed amount of information from region I to the basic
resource-driven model.
We have found little literature trying to estimate the entire distribution without
tracing individual jobs. A recent work that estimates quantiles of the distribution without
maintaining the relevant statistics throughout the whole simulation (e.g., waiting time at
an individual station in a queueing network during the simulation of the whole network)
is presented in (McNeill et al. 2003). The waiting times do need to be stored while the
job is at the station, i.e., this method uses information from III.2.iii.a. The quantiles are
estimated using a Cornish-Fisher expansion by tracking the higher moments of the
statistic in question during the run. The moments are used after the run is completed to
calculate quantiles. A big advantage is that any quantile can be calculated without having
to rerun the simulation.
This approach differs from the Time Slice Counter Method in two ways. First, it
uses information from III.2.iii.a rather than from I.2.iii. Second, the user specifies which
quantiles are of interest. The Time Slice Counter Method presented below looks at the
probability of waiting at most γ time units, for the γ’s of interest. The former specifies a
quantile and determines γ, while the latter specifies γ and tries to determine the quantile.
Both are valuable in different contexts. In the future, integration of the two may provide
useful results.
4.5.1. Basic Approach
If we are interested in the probability a job in a first-come-first-serve (FCFS) terminating
queueing system waits at most γ time units, we need not explicitly track each job.
Rather, we can use the event graph model shown in Figure 12. This is the general event
graph model for a G/G/s queueing system (like the one in Figure 4) with an additional
event vertex (Schruben and Schruben 2001). Define the following variables (all of which
are in region I):
$\{N_A(t) : 0 \le t \le t^*\}$  number of arrivals by time t
$\{N_S(t) : 0 \le t \le t^*\}$  number of starts by time t
$\{N_{D_\gamma}(t) : 0 \le t \le t^*\}$  number of jobs delayed γ time units by time t; $N_{D_\gamma}(t) = N_A(t - \gamma)$
for t ≥ γ; we omit the delay subscript when its value is obvious
$\{W_\gamma(t) : 0 \le t \le t^*\}$  number of jobs whose wait was less than γ time units by time t,
$W_\gamma(t) \le N_{D_\gamma}(t)$; we omit the delay subscript when its value is obvious
$\hat{F}_W(\gamma)$  current estimate of $P\{delay \le \gamma\}$; denoted as PROB in event graphs
$F_W(\gamma)$  true value of $P\{delay \le \gamma\}$; estimated using $\hat{F}_W(\gamma)$
When a job enters the system, a DELAY event is scheduled to occur after γ time units.
When a DELAY event occurs, counter ND is incremented. When a START event occurs,
counter NS is incremented. If more DELAY events than START events have occurred,
the current job’s waiting time was greater than γ. The estimate of the probability a job
will wait at most γ time units is
$$\hat{F}_W(\gamma) = \hat{P}\{delay \le \gamma\} = \frac{\text{number of jobs delayed at most } \gamma}{\text{total number of jobs}} = \frac{W_\gamma(t^*)}{N_{D_\gamma}(t^*)} \qquad (9)$$
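As a sanity check, the counter logic of Figure 12 can be sketched for an M/M/1 FCFS queue, for which the analytic waiting-time distribution $P\{W \le \gamma\} = 1 - \rho e^{-(\mu - \lambda)\gamma}$ is known. The sketch below is ours: it generates START times with the Lindley recursion rather than a full event loop, then walks the merged START/DELAY event stream using nothing but the three counters; all names and parameter values are assumptions.

```python
import random

def counter_method_mm1(lam, mu, gamma, n_jobs, seed=0):
    """Estimate P{delay <= gamma} for an M/M/1 FCFS queue using only the
    N_S, N_D, and W counters of the event graph in Figure 12."""
    rng = random.Random(seed)
    t, finish_prev = 0.0, 0.0
    events = []  # (time, kind): kind 0 = START, 1 = DELAY; ties favor START
    for _ in range(n_jobs):
        t += rng.expovariate(lam)                 # arrival (ENTER) time
        start = max(t, finish_prev)               # Lindley recursion
        finish_prev = start + rng.expovariate(mu)
        events.append((start, 0))                 # START event
        events.append((t + gamma, 1))             # DELAY event
    events.sort()
    ns = nd = w = 0
    for _, kind in events:
        if kind == 0:
            ns += 1
            if nd < ns:       # the job now starting was not yet "overdue"
                w += 1
        else:
            nd += 1
    return w / n_jobs         # PROB = W / N_D once all DELAYs have occurred

est = counter_method_mm1(lam=2/3, mu=1.0, gamma=2.0, n_jobs=20_000)
print(est)  # near 1 - (2/3) * exp(-(1/3) * 2) ≈ 0.658
```

Because the same seed fixes the sample path, re-running with a larger γ can only classify more jobs as having waited at most γ, so the estimate is monotone in γ as the distribution function requires.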
Lemma 2 $\hat{F}_W(\gamma)$ is an unbiased estimator for $F_W(\gamma)$.
Proof: The proof rests on the use of indicator random variables in the estimate.
$$\begin{aligned}
E\left[\hat{F}_W(\gamma)\right] &= E\left[\frac{W_\gamma(t^*)}{N_{D_\gamma}(t^*)}\right]
= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} I\left(\text{job } k\text{'s delay} \le \gamma\right)}{N_{D_\gamma}(t^*)}\right] \\
&= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} P\left\{\text{job } k\text{'s delay} \le \gamma\right\}}{N_{D_\gamma}(t^*)}\right]
= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} F_W(\gamma)}{N_{D_\gamma}(t^*)}\right] = F_W(\gamma)
\end{aligned}$$
■
Figure 12. Event graph model for estimating $P\{delay \le \gamma\}$. (State changes: ENTER {Q = Q + 1}; START, reached when (Q > 0) and (R > 0), {Q = Q − 1, R = R − 1, NS = NS + 1, W = W + (ND < NS), PROB = W/ND}; FINISH {R = R + 1}; DELAY, scheduled γ time units after ENTER, {ND = ND + 1}.)
A realization of this system is given in Figure 13. The dashed line represents the number
of DELAY events that have occurred at time t, $\{N_D(t)\}$, while the solid line is the
number of START events that have occurred, $\{N_S(t)\}$. At time $t_1$, the START line lies
above the DELAY line, $N_S(t_1) > N_D(t_1)$; this means that more START events have
occurred by $t_1$, and that the last job that began service was not delayed more than γ time
units. At time $t_2$, the DELAY line lies above the START line, $N_D(t_2) > N_S(t_2)$,
indicating that more DELAY than START events have occurred; the next job to begin
service will have waited more than γ time units.
The advantage of this approach is that we do not need to store any information.
We are able to classify jobs based on event counts using the fact that the number of
events by time t is at least n if and only if the nth event occurred by time t:
$N_E(t) \ge n \Leftrightarrow E_n \le t$.
Switching our perspective to looking at horizontal "lines" from the event counts
on the y-axis rather than vertical "time slices" from the x-axis, we can read off the status
of each job. The first START event occurs before the first DELAY event ($S_1 < D_1$), so
the first job was serviced before it was delayed γ time units. The same holds for the
second, third, and fourth START and DELAY events and their corresponding jobs. For
the fifth and sixth jobs, the DELAY events occur before the START events, and these
two jobs had to wait more than γ time units. Finally, the seventh job was processed
before it became overdue. Our estimate of $P\{delay \le \gamma\}$ is $\hat{F}_W(\gamma) = 5/7$.
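This horizontal reading can be sketched with a hypothetical merged event sequence consistent with the description of Figure 13 (the specific ordering below is ours, not the figure's exact data):

```python
# Merged timeline of START (S) and DELAY (D) events, in time order
timeline = ['S', 'D', 'S', 'D', 'S', 'S', 'D', 'D', 'D', 'D', 'S', 'S', 'S', 'D']

# Pair the k-th START with the k-th DELAY: job k waited at most gamma
# time units iff its START precedes (or ties with) its DELAY.
s_pos = [i for i, e in enumerate(timeline) if e == 'S']
d_pos = [i for i, e in enumerate(timeline) if e == 'D']
w = sum(s <= d for s, d in zip(s_pos, d_pos))
print(w, '/', len(s_pos))  # 5 / 7
```

Jobs 5 and 6 are the two whose DELAY events precede their START events, matching the reading above.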
Figure 13. Sample realization of START and DELAY event occurrences
To estimate the distribution of the waiting times, $F_W(\cdot)$, the user schedules multiple
DELAY events for each job, each with a different value of γ.
The methods presented in this section are of interest because they allow the
estimation of the probability during the simulation run, not as a post-processing step.
Tracking waiting times and post-processing is an inefficient way of finding waiting time
distributions. It requires additional computational effort as described in Section 3.3.3.
Although we avoid the post-processing, we are still maintaining a record for each
job on the future events list (FEL). Since scheduling events is usually the most expensive
operation in computer simulation programs, we have created more work by scheduling
the DELAY event for every job in the system while losing the complete information
available to us by tracking all jobs. We have found the exact solution by adding
computational effort as described in Section 3.3.2.
4.5.2. Eliminating Individual Job Information
4.5.2.1. Lower Time Slice Counter Method
We now introduce the Lower Time Slice Counter Method. With it, we do not trace
individual jobs by scheduling DELAY events, but track an approximation of the number
of DELAY events that have occurred by a certain time. (Section 4.5.2.2 introduces a
slight modification, the Upper Time Slice Counter Method.) No DELAY events are
scheduled. Define the following variables, all of which are in region I of the taxonomy:
ξ  time slice size
Ξ  number of time slices used for the simulation; $\Xi = \lceil t^*/\xi \rceil$
$\{A_k : 1 \le k \le N\}$  time of the kth job's arrival
$\{S_k : 1 \le k \le N\}$  time of the kth job's service start
$\{D_{\gamma,k} : 1 \le k \le N\}$  time of the kth job's γ-time-unit delay
$\{\Delta_{\gamma,i} : 1 \le i \le \Xi\}$  number of γ-time-unit DELAY events by time interval i
$\{N^L_{\gamma,\xi}(t) : 0 \le t \le t^*\}$  number of approximated γ-DELAY events from the Lower Time
Slice Counter Method that have occurred by time t, for time slice size ξ
$\{N^U_{\gamma,\xi}(t) : 0 \le t \le t^*\}$  number of approximated γ-DELAY events from the Upper Time
Slice Counter Method that have occurred by time t, for time slice size ξ
$\underline{\tau}(t,\gamma,\xi)$  floor index of the time slice that a DELAY event scheduled at time t for γ time
units in the future with time slice size ξ falls into; $\underline{\tau}(t,\gamma,\xi) = \lfloor (t+\gamma)/\xi \rfloor$
$\overline{\tau}(t,\gamma,\xi)$  ceiling index of the time slice that a DELAY event scheduled at time t for γ
time units in the future with time slice size ξ falls into;
$\overline{\tau}(t,\gamma,\xi) = \lceil (t+\gamma)/\xi \rceil = \underline{\tau}(t,\gamma,\xi) + 1$
For all variables, we omit the subscripts for the delay γ and time slice size ξ when their
meaning is obvious from context.
The Lower Time Slice Counter Method divides the simulation run duration $t^*$ into
time slices of length ξ. When a job enters the system at time t, it will have waited at least
γ time units at time $t + \gamma$. We say this event falls into time slice $\underline{\tau}(t,\gamma,\xi)$. The DELAY
event is known to occur sometime in the time interval $[\underline{\tau}(t,\gamma,\xi)\cdot\xi,\ \overline{\tau}(t,\gamma,\xi)\cdot\xi)$. Since we
are tracking a cumulative count of the number of DELAY events, the DELAY counters
for time slices $\underline{\tau}(t,\gamma,\xi)$, $\underline{\tau}(t,\gamma,\xi) + 1$, …, Ξ will be incremented by one when a job arrives.
$\Delta_{\underline{\tau}(t,\gamma,\xi)}$ is the counter for time slice $\underline{\tau}(t,\gamma,\xi)$.
When a START event occurs at time t', it compares $N_S(t')$ to $\Delta_{\underline{\tau}(t',0,\xi)}$, and
increments $W_\gamma(t')$ as before, i.e., if $N_S(t') > \Delta_{\underline{\tau}(t',0,\xi)}$. We do not have a START counter
for the time slices, as the exact time of a START event is known.
A necessary but not sufficient condition for misclassification is that the START
event happens in the same time slice as the DELAY event, $\underline{\tau}(D_k,0,\xi)\cdot\xi \le S_k < D_k$.
Conversely, a sufficient condition for correct classification is that the two events happen
in different time slices. We state this as Lemma 3:
Lemma 3 A sufficient condition for correct classification of job k using the Lower Time
Slice Counter Method is that the kth START event does not occur in the same time
slice as the kth DELAY event. A necessary but not sufficient condition for
misclassification is that the events occur in the same time slice.
Proof: The kth START event occurs at time $S_k$, in time slice $\underline{\tau}(S_k,0,\xi)$.
The (unobserved) kth DELAY event occurs at time $D_k$, in time slice $\underline{\tau}(D_k,0,\xi)$.
The kth job is not delayed if $S_k \le D_k$. There are three possible relationships
between $\underline{\tau}(S_k,0,\xi)$ and $\underline{\tau}(D_k,0,\xi)$:
1. $\underline{\tau}(S_k,0,\xi) < \underline{\tau}(D_k,0,\xi)$:
This implies $S_k < D_k$, since $\underline{\tau}(S_k,0,\xi)\cdot\xi \le S_k < \underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k$; job k will be
correctly classified as not delayed.
2. $\underline{\tau}(S_k,0,\xi) > \underline{\tau}(D_k,0,\xi)$:
This implies $S_k > D_k$, since $\underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k < \underline{\tau}(S_k,0,\xi)\cdot\xi \le S_k$; job k will be
correctly classified as delayed.
3. $\underline{\tau}(S_k,0,\xi) = \underline{\tau}(D_k,0,\xi)$: No conclusion can be drawn about the relationship
between $S_k$ and $D_k$. It is possible that $\underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k < S_k < \overline{\tau}(D_k,0,\xi)\cdot\xi$ or
that $\underline{\tau}(D_k,0,\xi)\cdot\xi \le S_k \le D_k < \overline{\tau}(D_k,0,\xi)\cdot\xi$. In the first case, a correct
classification will be made, in the second an incorrect one. ■
The smaller the time slice sizes are, the more accurate the estimate will be. As time slice
sizes get larger, we underestimate $F_W(\gamma)$, thinking more jobs are overdue than is really
the case.
Figure 14 shows the same system realization as Figure 13, but includes the
approximated DELAY curve. The time slice size is ξ time units. The approximated
DELAY is bolded.
The number of approximated DELAYs is at least as large as the number of actual
DELAYs at any time t. In three instances (jobs 2, 3, and 7), this leads us to falsely
believe the jobs waited longer than γ time units; these are the shaded areas in Figure 14.
The number of shaded rectangles tells us how often we overestimate the wait. The length
of the rectangles gives an indication of the approximation's sensitivity to the arrival times:
Any arrival that occurred γ time units before any of the points in the
shaded area will be misclassified.
The approximation will not underestimate the probability of forcing a job to wait
too long. If ξ is small, the approximation will be accurate, as
$\lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0$.
If ξ is large, the approximation will be too conservative, as
$\lim_{\xi \to t^*}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = D_{\gamma,k}$,
and we will conclude that all jobs were delayed for more than γ time units. We prove the
algorithm's accuracy in Theorem 1.
Figure 14. System realization from Figure 13 with approximated DELAY curve
Theorem 1 In a terminating single-station queueing system with a FCFS service
discipline, as the time slice size ξ goes to zero, for any sample path, the estimate
$\hat{F}_W(\gamma)$ from the Lower Time Slice Counter Method will converge to the job-
driven estimate of $F_W(\gamma)$ for any γ ≥ 0: $\lim_{\xi \to 0}\left(\hat{F}^{LTSCM}_W(\gamma) - \hat{F}^{JD}_W(\gamma)\right) = 0$.
Proof: We must show that the Lower Time Slice Counter Method will give the same
result as the model in Section 4.5.1. This is equivalent to showing that
$\lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0$ for any k and for any γ.
$$0 \le D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi = D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi < \xi$$
is true because $D_{\gamma,k}$ cannot occur before $\underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi$, its component that is an
integer multiple of ξ. Further, the distance
between $D_{\gamma,k}$ and $\underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi$ can be at most ξ. Taking limits on the left- and
right-hand sides gives
$$\begin{aligned}
\lim_{\xi \to 0} 0 \le \lim_{\xi \to 0}\left(D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi\right) < \lim_{\xi \to 0} \xi
&\Leftrightarrow 0 \le \lim_{\xi \to 0}\left(D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi\right) \le 0 \\
&\Leftrightarrow \lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0 \\
&\Leftrightarrow \lim_{\xi \to 0}\left(\hat{F}^{LTSCM}_W(\gamma) - \hat{F}^{JD}_W(\gamma)\right) = 0
\end{aligned}$$
which completes the proof. ■
We note that as time slice size ξ goes to zero, the number of time slices increases. At
some point, more memory may be required for the time slice counters than would be needed to
maintain job information. This is especially true for lightly-loaded systems.
Figure 15 shows results for 10,000 simulated jobs in an M/M/1 queueing system
with an interarrival rate of 0.667 and a service rate of 1. We show the results for various
values of γ and three time slice sizes. The solid bold line is the analytic solution. The
time slice approximations become more accurate as the time slice sizes decrease. For
smaller values of γ, the approximation can be significantly inaccurate, especially for
larger time slices. This confirms the intuition that the estimate of the probability goes to
zero as the time slice size increases. (Not only is the absolute time slice size larger, but
its size relative to the smaller delays is larger than it is to the longer delays.)
The inaccuracy is not observed for larger γ values as the time slice sizes have not
increased sufficiently to cause the START and DELAY events to happen in the same
time slice.
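For reference, the analytic curve in Figure 15 can be reproduced from the standard stationary M/M/1 FCFS waiting-time distribution, F_W(γ) = 1 − ρ·e^{−μ(1−ρ)γ}. The function below is our sketch of that textbook formula, not code from the original experiments.

```python
import math

def mm1_wait_cdf(gamma, lam, mu):
    """Stationary M/M/1 FCFS waiting-time distribution:
    P{wait <= gamma} = 1 - rho * exp(-mu * (1 - rho) * gamma)."""
    rho = lam / mu
    return 1.0 - rho * math.exp(-mu * (1.0 - rho) * gamma)

# For the system of Figure 15 (arrival rate 0.667, service rate 1):
# F_W(0) = 1 - rho, the probability of not waiting at all.
```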
Figure 15. P{wait ≤ γ} for various values of γ and several time slice sizes
4.5.2.2. Upper Time Slice Counter Method

A simple change to the Lower Time Slice Counter Method results in the Upper Time
Slice Counter Method: we increment the counters for time slices t(γ,ξ) = ⌈(t+γ)/ξ⌉,
⌈(t+γ)/ξ⌉ + 1, ..., Ξ rather than the time slices starting at t(γ,ξ) = ⌊(t+γ)/ξ⌋. In other
words, we update the counter at the end of the time slice in which the DELAY event
occurs, not at the beginning. This will result in an overestimation of F_W(γ). All results
presented for the Lower Time Slice Counter Method follow analogously for the Upper
Time Slice Counter Method, and will not be presented again here.
Figure 16 shows experimental results for the same system as in Figure 15 using
both the Lower and Upper Time Slice Counter Methods. The Upper Time Slice Counter
Method (“ceiling” in the legend) estimates lie above the true probabilities (bold line).
For other system parameters (arrival and service rates), the difference in errors
between the Lower and Upper Time Slice Counter Methods is not as pronounced as here.
The only generalization we can make is that the approximations become more accurate as
the time slice size decreases, and/or as we get closer to the right tail of the distribution.
Figure 16. P{wait ≤ γ} for various values of γ and several time slice sizes using both the Lower and Upper Time Slice Counter Methods
4.5.3. Observations

In this section, we make general observations about the two Time Slice Counter Methods
which will motivate the Time Slice Counter Method discussed in Section 4.5.4.
While so far we have looked at our step functions as functions of time, insights
are more easily gained by examining the step functions from the jobs’ perspectives.
In the original model shown in Figure 12 and Figure 13, job k can easily be
classified as delayed or not delayed by looking at the relative positions of the DELAY
and START curves between counts k and k+1. If the DELAY curve is to the left of the
START curve, job k has been delayed. If the curve is to the right, the job has not been
delayed. If the curves are on top of each other, the DELAY and START events happened
at the same time, and the job can be considered delayed.
In both the Lower and Upper Time Slice Counter Methods, rather than comparing
the relative positions of the START and DELAY curves, we compare the positions of the
START and approximated DELAY curves. We do not observe the actual DELAY curve.
In the Lower Time Slice Counter Method, we may incur a false positive, a type I error, by
misclassifying jobs as being delayed even though they are not. In the Upper Time Slice
Counter Method, we may incur a false negative, a type II error, by misclassifying jobs as
not delayed even though they are.
The problems with the Lower and Upper Time Slice Counter Methods are that we
do not know where the actual DELAY curve lies relative to the START curve, and that
we do not always know whether we have properly classified a job. If we use both
methods together, we are able to increase the proportion of jobs we can classify with
certainty. This is the Time Slice Counter Method, discussed next.
4.5.4. Time Slice Counter Method
4.5.4.1. Further Observations

At time t, N_S(t) is the START curve, N_D^γ(t) the DELAY curve, N_L^{γ,ξ}(t) the
approximated DELAY curve resulting from the Lower Time Slice Counter Method with
time slice size ξ, and N_U^{γ,ξ}(t) the approximated DELAY curve resulting from the Upper
Time Slice Counter Method with time slice size ξ and for delay γ. For the sake of
simplicity, we denote them as S, D, L, and U, respectively, in the following discussion.
The relative positions of L, D, and U are fixed, by definition. The only variability
comes from the location of S. The four possible configurations of the curves are given in
Table 5.
Table 5. Possible orderings of the curves

1   S L D U
2   L S D U
3   L D S U
4   L D U S
Table 6 shows the possible orderings of the curves, the correct classification of jobs with
these orderings, and how the Lower and Upper Time Slice Counter Methods perform.
Both methods are correct for three of the four possible states, and they are not both wrong
at the same time. Further, if both classify a job the same way, the classification is
correct. This corresponds to the instances where the START curve lies in a different time
slice than the actual DELAY curve does, and is illustrated in Figure 17.
Table 6. Lower and Upper Time Slice Counter Method Classifications

state   correct classification   LTSCM         UTSCM         Lower Counter correct?   Upper Counter correct?
SLDU    Not delayed              Not delayed   Not delayed   Yes                      Yes
LSDU    Not delayed              Delayed       Not delayed   No                       Yes
LDSU    Delayed                  Delayed       Not delayed   Yes                      No
LDUS    Delayed                  Delayed       Delayed       Yes                      Yes
Figure 17. Accuracy of job classification using the Lower and Upper Time Slice Counter Methods
The checkmarks indicate the method properly classified a job, and the X’s indicate an
error was made. The times when both methods have checkmarks are those times when
the solid bold line (START) falls outside the solid unbolded (Lower Delay Counter) and
dotted (Upper Delay Counter) lines. If it falls between them, one of the two methods will
make a mistake. Since we do not know the location of the dashed line, we do not know
which method is in error. (In this example, the Lower Time Slice Counter Method is
incorrect three times, the Upper Time Slice Counter Method once.)
4.5.4.2. The Time Slice Counter Method Algorithm

While we know that at least one of the two methods will correctly classify jobs, we do
not know which one to choose if the two disagree. We do know that the categorization is
correct if both make the same classification. A method combining the two can use this
information to improve our probability estimates: We can track both counts and always
classify jobs as (not) delayed if both methods agree, when the START and DELAY
events occur in different time slices. Having the events fall in different time slices is a
necessary and sufficient condition for certainty of correct classification. If the two
methods do not agree, we must decide how to classify the job. Several options are
described in Appendix D.2. The implementation used here is described in Appendix
D.2.6. We discuss complete details of the implementation in Appendix D.3.
The main advantage of combining the Lower and Upper Time Slice Counter
Methods is that we are able to unambiguously classify a larger number of jobs than using
the methods independently. Specifically, for the Lower Time Slice Counter Method, we
only know that the jobs in the class SLDU have been correctly classified. We cannot
differentiate between the other three cases as we only observe the S and L curves. For the
Upper Time Slice Counter Method, we only know that the jobs in LDUS have been
correctly classified. Combining the two methods leaves uncertainty only for jobs in
LSDU and LDSU.
The basic Time Slice Counter Method algorithm follows.

Select values for γ_i and ξ (or ξ_i)
Initialize k = 1; Δ_{γ,i} = 0 (∀γ; i = 0, 1, ..., Ξ); N_S(t) = W_γ(t) = 0 (t ≥ 0); Φ_γ = 0 (∀γ)
For every job k
    When the job arrives at time A_k
        If the server is available, set START_flag = TRUE
        For every γ
            Calculate A_k(γ, ξ)
            If A_k(γ, ξ) > Φ_γ                                  // see Appendix D.3.3
                For i = Φ_γ + 1 to A_k(γ, ξ), set Δ_{γ,i} = Δ_{γ,Φ_γ}
                Set Φ_γ = A_k(γ, ξ)
            Increment Δ_{γ,A_k(γ,ξ)}                            // LTSCM counters
    When the job starts service at time S_k
        Increment N_S(S_k)
        Calculate S_k(0, ξ)
        For every γ
            If S_k(0, ξ) > Φ_γ
                For i = Φ_γ + 1 to S_k(0, ξ), set Δ_{γ,i} = Δ_{γ,Φ_γ}
                Set Φ_γ = S_k(0, ξ)
            If START_flag == TRUE
                Classify job k as not delayed, i.e., increment W_γ(S_k)
            Else if N_S(S_k) > Δ_{γ,A_k(γ,ξ)}, classify job k as not delayed, i.e., increment W_γ(S_k)
            Else if N_S(S_k) < Δ_{γ,A_k(γ,ξ)−1}, classify job k as delayed
            Else if N_S(S_k) > Δ_{γ,A_k(γ,ξ)−1} and S_k ≥ ξ·S_k(0,ξ) + ξ/2, classify job k as delayed
            Update F̂_W(γ) = W_γ(S_k) / k
        Set START_flag = FALSE

Theorem 2 states that the Time Slice Counter Method converges to the job-driven
probability estimate as the time slice size goes to zero:
Theorem 2 In a terminating single-stage queueing system with a FCFS service
discipline, as the time slice size ξ goes to zero, for any sample path, the estimate
F̂_W(γ) from the Time Slice Counter Method will converge to the job-driven
estimate of F_W(γ) for any γ ≥ 0: lim_{ξ→0} ( F̂_W^TSCM(γ) − F̂_W^JD(γ) ) = 0.

Proof: The proof is direct from Theorem 1. We are no longer restricting ourselves to a
G/G/1 queue because the number of servers does not matter: in a FCFS system, the
first job in queue is the first job to be served. ■
Again, as ξ goes to zero, the number of time slice counters may require more memory
than storing job information directly.
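The certain/ambiguous logic of the combined method can be sketched as follows. This is our simplified per-job illustration (the actual method works from counters, not stored per-job times), with the slice-midpoint rule standing in for the deterministic resolution of uncertain jobs described in Appendix D.2.6.

```python
import math

def tscm_classify(a, s, gamma, xi):
    """Simplified combined rule: classification is certain when the START
    and DELAY events fall in different time slices; otherwise the
    slice-midpoint rule (one of the options in Appendix D.2) decides."""
    delay_slice = math.floor((a + gamma) / xi)  # slice holding DELAY
    start_slice = math.floor(s / xi)            # slice holding START
    if start_slice > delay_slice:
        return "delayed"        # Lower and Upper methods agree
    if start_slice < delay_slice:
        return "not delayed"    # Lower and Upper methods agree
    # Ambiguous: same slice.  Treat a START late in its slice as
    # falling after the DELAY event (deterministic midpoint rule).
    return "delayed" if s >= xi * start_slice + xi / 2.0 else "not delayed"
```

When START falls in a later slice than DELAY, the start is certainly after arrival + γ, so both the Lower and Upper bookkeeping agree; only same-slice jobs require the heuristic.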
4.5.5. Experimentation: Single-Server System

In this section, we discuss experiments done with the Time Slice Counter Method on
single-server, single-stage queueing systems. Arrival and service distributions are
Exponential (M), Uniform (U), and Beta(0.5, 2) (B), with different arrival and service
rates. The choice of parameters for the Beta distribution is discussed below.
By changing the distributions and arrival/service rates, we are simulating different
levels of congestion and variability in the system. We use the traffic intensity ρ as a
measure of system congestion though it does not directly measure the number of jobs in
the system: an M/M/1 system with ρ = 0.9 has far fewer jobs in it than an M/M/1000
system with ρ = 0.9. Since the Time Slice Counter Method’s performance is insensitive
to the number of jobs in the system, this is an important point. Because this section deals
with a single-server system, ρ is adequate.
Variability in the system is more difficult to measure. This "global" variability
can be caused both by variability in the arrivals and variability in the service. Rather than
measuring variability by adding the coefficients of variation of the two processes, we use
the squared coefficient of variation of the departure process, specifically, of the
interdeparture times, C_D². (We explain the reason shortly.) For the M/M/1 system, we
are able to calculate this quantity directly. For other systems, we must use
approximations. The formulae used are from (Buzacott and Shanthikumar 1993), and are
summarized in Table 7. The general distribution stands for both the Uniform and Beta
distributions. C_A² and C_S² refer to the squared coefficients of variation for the interarrival
and service distributions, respectively.
Table 7. Formulae for C_D² in different systems

System   C_D²
M/M/1    C_A² = 1
M/G/1    1 − ρ² + ρ²·C_S²
G/G/1    (1 − ρ²)·C_A² + ρ²·C_S²
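Taking the G/G/1 entry to be the standard two-moment approximation, Table 7 reduces to a one-line computation. The function below is our illustrative sketch, not code from the original experiments.

```python
def c_d_squared(c_a2, c_s2, rho):
    """Two-moment approximation for the interdeparture-time SCV:
    C_D^2 = (1 - rho^2) * C_A^2 + rho^2 * C_S^2.
    Setting C_A^2 = 1 recovers the M/G/1 row of Table 7, and
    C_A^2 = C_S^2 = 1 gives the M/M/1 value of 1."""
    return (1.0 - rho ** 2) * c_a2 + rho ** 2 * c_s2
```

For example, the M/U/1 system with ρ = 0.9091 and C_S² = 0.0833 gives C_D² ≈ 0.2424, the value listed in Table 9.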
Table 8 lists the distributions used in the experimentation. Table 9 lists the experiments
for single-server systems in increasing order of C_D². The mean service rate was 1 in all
cases. In addition, the same sets were run doubling the arrival and service rates, and also
halving them. While ρ and C_D² remain unchanged for these additional trials, the
interarrival and service times are relatively shorter or longer compared to the time slice
sizes.
The limits for the Uniform distribution with rate η were [1/(2η), 3/(2η)], and the
parameters for the Beta distribution were α₁ = 0.5 and α₂ = 2; see
(Law and Kelton 2000) for an explanation of the parameters of the Beta distribution. The
parameters, range, and variance of the Beta distributions were chosen to ensure C² > 1.
As an illustration of the shape of this Beta, a histogram of the service times for the
Beta(0.5, 2) with mean 1 is shown in Figure 18. Its shape is similar to that of the
Exponential distribution. The mode of the Beta(2, 0.5) is at the right end of the
distribution, but its variability is much smaller than that of the Beta(0.5, 2). As increased
variability is of greater importance than a different shape, we use the Beta(0.5, 2).
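Service times from the scaled Beta(0.5, 2) of Table 8 can be sampled as follows; the scaling argument (a Beta(0.5, 2) on [0, 1] has mean 0.2, so a factor of 5·mean hits the target mean and the ranges in Table 8) is ours, and the function name is illustrative.

```python
import random

def beta_service_time(mean, rng=random):
    """Draw a service time from the scaled Beta(0.5, 2) of Table 8.
    Beta(0.5, 2) on [0, 1] has mean 0.5 / (0.5 + 2) = 0.2, so scaling
    by 5 * mean stretches the support to [0, 5 * mean] and hits the
    target mean -- matching the ranges listed in Table 8."""
    return 5.0 * mean * rng.betavariate(0.5, 2.0)

random.seed(0)
samples = [beta_service_time(1.0) for _ in range(10000)]
# The sample mean should sit near 1, with all values in [0, 5].
```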
Table 8 and Table 9 show why we chose not to simply add C_A² and C_S² as a
measure of "global" variability: the first two systems in Table 9 have the same sum of
C_A² and C_S², 0.1666, but different C_D² values. Since there is a difference between the two
systems, we would like our measure to reflect this.
Table 8. Distributions and their characteristics

Distribution   mean   range          variance   C²
Uniform        1      [0.5, 1.5]     0.0833     0.0833
Uniform        1.1    [0.55, 1.65]   0.1008     0.0833
Uniform        1.5    [0.75, 2.25]   0.1875     0.0833
Exponential    1      n/a            1          1
Exponential    1.1    n/a            1.21       1
Exponential    1.5    n/a            2.25       1
Beta(0.5, 2)   1      [0, 5]         1.1429     1.1429
Beta(0.5, 2)   1.1    [0, 5.5]       1.3829     1.1429
Beta(0.5, 2)   1.5    [0, 7.5]       2.5714     1.1429
Table 9. Systems in increasing order of C_D²

System   mean interarrival time   ρ        C_D²
U/U/1    1.5                      0.6667   0.0664
U/U/1    1.1                      0.9091   0.0770
M/U/1    1.1                      0.9091   0.2424
B/U/1    1.1                      0.9091   0.2807
U/M/1    1.5                      0.6667   0.4738
U/B/1    1.5                      0.6667   0.5373
M/U/1    1.5                      0.6667   0.5926
B/U/1    1.5                      0.6667   0.7082
U/M/1    1.1                      0.9091   0.8346
U/B/1    1.1                      0.9091   0.9527
M/M/1    1.5                      0.6667   1
M/M/1    1.1                      0.9091   1
B/M/1    1.1                      0.9091   1.0383
M/B/1    1.5                      0.6667   1.0635
B/M/1    1.5                      0.6667   1.1156
M/B/1    1.1                      0.9091   1.1181
B/B/1    1.1                      0.9091   1.1564
B/B/1    1.5                      0.6667   1.1791
Figure 18. Histogram of service times generated as Beta(0.5,2) with mean 1
In all cases, 20 values of γ were tracked: 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0,
4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 15.0, and 20.0. Ten independent replications of 10,000
jobs each were run, and the estimates F̂_W(γ_i) from each replication were averaged.
Common random numbers were used across the different systems. Exact waiting times
were also tracked for comparison. The waiting times for all ten runs were used to compute
the job-driven estimates F̂_W(γ_i) where analytic solutions are not available.
Figure 19 shows the lightly-loaded M/M/1 system described in Table 9. The time
slice sizes of 1, 2, and 3 time units were chosen arbitrarily (as were the values of γ).
Time slice sizes 1 and 2 appear to lie right on top of the analytic values of F_W(γ). Time
slice size 3 deviates slightly, though the difference is difficult to see in this graph.
Figure 20 shows the results for the heavily-loaded M/M/1 system. The Time Slice
Counter Method does well, and it is difficult to find any significant deviations from the
exact values.
Figure 21 shows the estimation errors for the two M/M/1 systems together. The
errors are plotted against the values of F_W(γ). The errors are calculated as

    error_γ = F̂_W^TSCM(γ) − F_W(γ).    (10)
Each curve has 20 points, corresponding to the 20 γ values listed above. The points for
the "busy" and "idle" systems do not line up vertically because the values of F_W(γ)
differ for the two. This graph shows how the Time Slice Counter Method performs for
the two levels of congestion as well as for different time slice sizes.
For both types of system, the errors get smaller (in absolute value) as the time
slice size decreases. Even the errors for time slice size 3 are always less than 10
percentage points (in absolute value). In both lightly- and heavily-loaded systems, the
largest errors occur in the left tail of the distribution, while at the right end, the errors are
indistinguishable from zero.
Figure 19. Approximated probabilities using the Time Slice Counter Method for a lightly-loaded M/M/1 system
In Section 4.5.4.2, we stated that the Time Slice Counter Method will correctly classify
jobs as delayed or not delayed if their START and DELAY events fall in different time
slices. Congestion increases the variability in system, and, we hypothesize, in the jobs’
departure process (since it now depends on more factors and jobs), though this is not
measured by 2DC . We further hypothesize that increasing the variability will make it less
likely that the START and DELAY events fall in the same time slice; therefore, the
approximation works work better for more variable or congested systems. Before we
show more results confirming this, we comment on the quality of the approximations for
the tails of the distribution.
Figure 20. Approximated probabilities using the Time Slice Counter Method for a heavily-loaded M/M/1 system
The approximations are most likely to be bad at the left tail of the distribution, i.e., for
small values of γ. Here, the ENTER and DELAY events are likely to fall in the same
time slice. If there is only a small delay between the job’s arrival and its START, the
START and DELAY values will fall in the same time slice; in this case, we will not be
able to definitively classify the job. This is more likely to happen in the case of the
lightly-loaded system, and explains why the estimates for those systems are less accurate
in the tail than for the more heavily-loaded ones.
Figure 21. Errors for M/M/1 Time Slice Counter Method trials
The right-tail estimates are extremely accurate. The γ values for the right tail are so large
that the DELAY events occur far after the START events. This result is positive, as we
are often interested in the probability of a rare event happening, which is in the right tail.
Figure 22 shows the errors for the lightly- and heavily-loaded U/U/1 systems, the
least variable systems, as functions of the job-driven estimates of F_W(γ). The observed
errors for the heavily-loaded system are greater than those for the lightly-loaded system,
even exceeding 10 percentage points. If we only compare errors in the areas in which we
have estimates for both systems (F̂_W(γ) ≥ 0.85), the estimates for the more heavily-loaded
system tend to be more accurate. As we have no estimate for the lower region for the
lightly-loaded system, we should draw no conclusions on the relative behaviors there. As
before, smaller time slice sizes have (significantly) smaller errors than larger ones do.
Figure 22. Errors for U/U/1 Time Slice Counter Method trials
Figure 23 shows the error comparison for the lightly- and heavily-loaded B/B/1 systems.
Again, the more heavily-loaded system has smaller errors than the lightly-loaded system.
They do not exceed 3 percentage points, even for time slice size 3 (which is three times
the average service time). For time slice size 1, the error is always small for the heavily-
loaded system, and only once exceeds 1 percentage point for the lightly-loaded system.
These results are at least as good as those for the M/M/1 systems, though we have
introduced significantly more variability. There are no estimates for the right tail of the
distribution for the heavily-loaded system.
For the experiments doubling and halving the arrival and service rates, the Time
Slice Counter Method works relatively well. For smaller interarrival and service times
(larger rates), the approximations using larger time slice sizes degrade. Time slice size 1
still performs well. The deterioration makes sense as the time slice sizes are relatively
larger than the average interarrival and service times. For larger interarrival and service
times (smaller rates), the method performs well except for the heavily-loaded B/B/1
system. In this case, our γ values far underestimate the upper bound of the distribution,
F̂_W(20) = 0.6301.
Figure 23. Errors for B/B/1 Time Slice Counter Method trials
Results for experimentation with multiple-server tandem queues are given in
Appendix D.4.
4.5.6. Discussion
4.5.6.1. Single-Stage Systems

We have proven that the Time Slice Counter Method converges to the job-driven
estimate of F_W(γ) as the time slice size decreases. We argue that more congested/variable
systems allow the Time Slice Counter Method to get better answers more easily (with
larger time slices). Since we can make our time slice sizes arbitrarily small at no
additional time cost (there are memory costs), this observation is less problematic than it
might be. It does highlight an advantage of the Time Slice Counter Method: Estimating
waiting times is more readily done for busy systems than for idle systems.
The memory costs will increase as time slice size goes to zero. They can be offset
somewhat by using circular arrays of size O(γ/ξ). Though this is less than Ξ for
reasonable γ, the memory requirements will still eventually overtake those of using job
information directly.
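One way to realize the O(γ/ξ) circular-array idea is a fixed ring of counters that recycles slots as new slices appear, so memory depends only on the window size, not on the run length. The class below is our sketch; the names and interface are illustrative, not taken from the text.

```python
class CircularCounters:
    """Fixed-size ring of time slice counters -- a sketch of the
    O(gamma/xi) circular-array idea.  Only the most recent n_slices
    slices are retained; slots for slices that scroll out of the
    window are zeroed and reused."""
    def __init__(self, n_slices):
        self.counts = [0] * n_slices
        self.newest = 0  # highest slice index seen so far

    def increment(self, slice_index):
        # Recycle entries for slices that have scrolled out of the window.
        while self.newest < slice_index:
            self.newest += 1
            self.counts[self.newest % len(self.counts)] = 0
        self.counts[slice_index % len(self.counts)] += 1

    def get(self, slice_index):
        return self.counts[slice_index % len(self.counts)]
```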
To get an idea of which factors are most important in determining the size of the
error, we performed a multiple linear regression (forward and backward stepwise and
best subset) on the error data. The factors included in the model are given in Table 10.
Table 10. Factors included in multiple regression

ACTUAL       the actual value of F_W(γ) (as estimated by tracing jobs)
TIME SLICE   the size of the time slice
CV2          C_D²
GAMMA        the current value of γ
RATE         the arrival rate λ
RHO          ρ = λ/ν
The data used were the job-driven distribution estimates and the errors made by the Time
Slice Counter Method for different time slice sizes. The numbers used were from all the
experiments listed in Table 9 (using the deterministic classification of uncertain jobs
described in Appendix D.2.6), as well as the same experiments with doubled and halved
rates.
All three methods of choosing a model resulted in the same selection. The model
is shown in Table 11. The adjusted R² of the model is 0.1828, which is low. All six
factors are statistically significant. Of these, the two with the largest coefficients are the
"actual" (job-driven) distribution value and the measure of congestion or busyness, ρ.
The intercept of the model is negative, and increasing values of F_W(γ) and ρ bring the
errors closer to zero.
Table 11. Multiple linear regression model of Time Slice Counter Method errors

PREDICTOR VARIABLES   COEFFICIENT   STD ERROR   STUDENT'S T   P        VIF
CONSTANT              -0.05639      0.00432     -13.05        0.0000
ACTUAL                 0.03867      0.00181      21.33        0.0000   4.5
TIME SLICE             0.00131      3.104E-04     4.23        0.0000   1.0
CV2                    0.00378      8.781E-04     4.31        0.0000   1.6
GAMMA                 -0.00148      8.352E-05   -17.68        0.0000   2.6
RATE                   0.00124      3.652E-04     3.39        0.0007   1.5
RHO                    0.04344      0.00361      12.04        0.0000   3.0

R-SQUARED 0.1844              RESID. MEAN SQUARE (MSE) 1.989E-04
ADJUSTED R-SQUARED 0.1828     STANDARD DEVIATION 0.01410

SOURCE       DF     SS        MS          F        P
REGRESSION   6      0.13889   0.02315     116.39   0.0000
RESIDUAL     3089   0.61439   1.989E-04
TOTAL        3095   0.75329

CASES INCLUDED 3096   MISSING CASES 0
Figure 24 shows a plot of the residual values versus the fitted values. There is a clear
pattern, indicating that there is behavior in the error not explained by the regression
model. Better models may include squared or product terms to try to capture
relationships between the variables, and will be the subject of future research.
Figure 24. Regression residuals versus fitted values
4.5.6.2. Other Service Disciplines

The Time Slice Counter Method works for almost all disciplines listed in Section 4.2.3.1
if we assume FCFS queueing within classes (e.g., priority classes or round-robin
groups). For example, for systems with priority service over multiple job types, if we
assume FCFS servicing within the priority groups and no preemption, we would maintain
separate counters for the different priority groups.
Without extensions, the method will not work for systems with multiple job types
but no priority service. Similarly, any discipline that allows job ordering in queue to
change (LCFS) or that otherwise needs information from region III cannot be modeled.
Possible extensions to the Time Slice Counter Method to accommodate such disciplines
are left to future research.
4.5.6.3. Summary

We summarize our conclusions and hypotheses from our experimentation for both single-
and multiple-stage tandem queues (Appendix D.4) in the following list.
• For single-stage G/G/s systems, decreasing time slice sizes leads to the correct
waiting-time probability estimate.
• For multiple-stage systems with multiple servers at one or more of the stages,
decreasing the time slice size will not necessarily lead to the correct cycle-time
estimate because of overtaking (the kth job to start service may not be the kth job to
finish service).
• For single-stage systems, greater variability and congestion (as measured by C_D²
and ρ) will make estimating the waiting-time probability easier, i.e., we can get
smaller errors with larger time slices.
• For multiple-stage systems, greater variability makes overtaking more likely and will
make cycle-time estimates less accurate. Higher congestion appears to increase the
accuracy of the Time Slice Counter Method, but does not cause a significant increase
in overtaking.
• Longer tandem queues do not have greater errors than shorter tandem queues.
Extensions to the Time Slice Counter Method are given in Appendix D.5.
4.5.7. Future Work

Future work in this area includes implementing the extensions proposed in Appendix D.5,
especially the one for LCFS service. If they work as hoped, they will greatly enhance
the Time Slice Counter Method, increasing its flexibility and applicability without
resorting to full job traces.
We would also like to develop further extensions that allow us to reduce
estimation errors for non-FCFS disciplines and cycle time estimation. A possible
approach uses the amount of overtaking done by jobs (determined during the simulation
run) to adjust the estimates obtained by the Time Slice Counter Method. For some
background on overtaking, see (Whitt 1984; Daganzo 1997). We may also be able to use
the Time Slice Counter Method to analyze the general cause and effect process studied by
statisticians.
We want to further investigate the properties of tandem queueing systems,
specifically to look at the validity of Claim 4. If it is true, we may be able to extend our
study to analyze the effects of queueing disciplines themselves. Perhaps, as the length of
the tandem queue increases, the effects of the service discipline (e.g. LCFS versus FCFS)
will be less noticeable.
A final idea for future research is relating the number of time slices with no
events in them to the estimation error. As ξ goes to zero, the error will decrease, while, at
the same time, the number of time slices will increase. We would like to quantify the
relationship between the two; it may be possible to determine the error size based on the
number of “empty” time slices. This reduces the need for guesswork, transaction
tagging, or trial runs.
5. CONCLUSIONS

The main contribution of this dissertation is the introduction of an information taxonomy
for simulation models. While a great amount of literature tries to classify models, many
of the classification schemes fall short. The classical world views focus on the
implementation of models, and are neither exclusive nor exhaustive. The resource-driven
and job-driven paradigms had not been defined explicitly, which reduces their value as a
taxonomy. The information taxonomy presented here allows their formal definition. A
comparison of the taxonomy to current formalisms is included in this dissertation.
The advantage of the information taxonomy is that it allows the modeler to focus
on the information that must be included in the model to capture system behavior and
return the desired output statistics. Consciously including only the necessary information
can reduce the time to create the model, as well as the potential for modeling and
implementation errors. The information taxonomy also allows the user to organize the
information contained in the model, which may suggest a natural means of
implementation.
The taxonomy can be used to identify what can be modeled and what output is
available given a set of information. For example, in a resource-driven simulation, it is
not obvious how to model critical ratio service disciplines accurately. Either more
information must be included in the model, or a suitable approximation must be found.
If an approximation is required, several aspects of the algorithm must be
considered: There are tradeoffs between algorithm accuracy, memory, and computational
requirements. We give an example of a single graphical error measure that incorporates
all three aspects.
Usually, the unavailable information is job-related because this type of
information is the most extensive for many systems. An example of the impact of lack of
modeling information is in dedication constraints. We have defined the problem faced by
the user, and have proposed an approximation algorithm to overcome the lack of
information. The approximation gives upper and lower bounds on the exact solution
without including job information in the model.
An example of the impact of lack of statistics information is in job delay
distributions. We have proposed an approximation for job waiting time distributions in
FCFS systems with no overtaking. No job information is needed for this approximation.
A fixed set of counters is added; this approximation is useful when the number of
counters is smaller than the number of jobs in the system.
We feel that this line of research is valuable in making technology more effective.
Technology should be a decision-making tool, and should not be a significant bottleneck
in the process. By providing a means of classifying the system information, we are able
to assist in organizing model development.
In the future, we would like to further develop the approximation error measure
introduced in Section 4.3.3.2. It has the advantage of combining three important
approximation aspects (accuracy, memory requirements, and computational
requirements) into one measure. It also allows the visualization of the measure, which
can aid in the comparison of different approximations.
We would also like to develop extensions to the approximations proposed. In
some cases, this means we must add more information to the model. In these cases, we
will investigate the tradeoffs.
6. BIBLIOGRAPHY

Aetna. 2001. Plan Explanation.
Akcalt, E., K. Nemoto, and R. Uzsoy. 2001. Cycle-Time Improvements for Photolithography Process in Semiconductor Manufacturing. IEEE Transactions on Semiconductor Manufacturing. 14(1): 48-56.
Atherton, L. and R. Atherton. 1995. Wafer Fabrication: Factory Performance and Analysis. Kluwer. Boston, MA.
AutoSimulations, Inc. 1999. AutoSched AP User's Guide v 6.2. Bountiful, Utah.
Axelsson, O. and R. Marinova. 1999. On a Hybrid Method of Characteristics and Central Difference Method for Convection-Diffusion Problems. Department of Mathematics Report No. 9941, University of Nijmegen.
Baker, A. D. 1998. A Survey of Factory Control Algorithms that Can Be Implemented in a Multi-Agent Heterarchy: Dispatching, Scheduling, and Pull. Journal of Manufacturing Systems. 17(4): 297-320.
Bayley, G. V. and J. M. Hammersley. 1946. The "Effective" Number of Independent Observations in an Autocorrelated Time Series. Journal of the Royal Statistical Society. 8:184-197.
Bowen, D. 2003. Personal Communication. 3 March 2003.
Brealey, R. A. and S. C. Myers. 1996. Principles of Corporate Finance. 5th Edition. McGraw-Hill.
Brown, S., F. Chance, J. W. Fowler, and J. K. Robinson. 1997. A Centralized Approach to Factory Simulation. 3. 83-86.
Buzacott, J. A. and J. G. Shanthikumar. 1993. Stochastic Models of Manufacturing Systems. W. J. Farbrycky and J. H. Mize. Prentice Hall International Series in Industrial and Systems Engineering. Prentice-Hall, Inc. Upper Saddle River, NJ.
Carson, J. S. 1993. Modeling and Simulation Worldviews. Proceedings of the 1993 Winter Simulation Conference. eds. G. W. Evans, M. Mollaghasemi, E. C. Russell and W. E. Biles. New York, NY, USA: IEEE. 18-23.
Commoner, F., A. W. Holt, S. Even, and A. Pnueli. 1971. Marked Directed Graphs. Journal of Computer and System Sciences. 5(5): 511-523.
Conway, R. W. and W. L. Maxwell. 1962. Network Dispatching by the Shortest-Operation Discipline. Operations Research. 10(1): 51-73.
Cota, B. A. and R. G. Sargent. 1992. A Modification of the Process Interaction World View. ACM Transactions on Modeling & Computer Simulation. 2(2): 109-129.
Daganzo, C. F. 1997. Fundamentals of Transportation and Traffic Operations. Oxford; New York: Pergamon.
Derrick, E. J. 1992. A Visual Simulation Support Environment-Based on a Multifaceted Conceptual Framework. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Derrick, E. J., O. Balci, and R. E. Nance. 1989. A Comparison of Selected Conceptual Frameworks for Simulation Modeling. Proceedings of the 1989 Winter Simulation Conference. eds. A. MacNair, K. J. Musselman and P. Heidelberger. SCS. 711-718.
Dillon, W. R. and M. Goldstein. 1984. Multivariate Analysis: Methods and Applications. John Wiley & Sons. New York.
Fendick, K. W. and W. Whitt. 1998. Verifying Cell Loss Requirements in High-Speed Communication Networks. Technical Report TR 98.33.1, AT&T Research Labs.
Fischbein, S. A. 2002. Personal Communication.
Fisher, R. A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character. 222:309-368.
Fishman, G. S. 1973. Concepts and Methods in Discrete Event Digital Simulation. Wiley-Interscience, John Wiley & Sons. New York.
Fleischer, L. 2004. Fast Approximation Algorithms for Fractional Covering Problems with Box Constraints. Proceedings of the 36th ACM/SIAM Symposium on Discrete Algorithms.
Fox, B. L. and P. W. Glynn. 1990. Discrete-Time Conversion for Simulating Finite-Horizon Markov Processes. SIAM Journal on Applied Mathematics. 50(5): 1457-1473.
Gahagan, S. M. and J. W. Herrmann. 2001. Improving Simulation Model Adaptability with a Production Control Framework. Proceeding of the 2001 Winter Simulation Conference. eds. B. A. Peters, J. S. Smith, D. J. Medeiros and M. W. Rohrer. Piscataway, NJ, USA: IEEE. 937-945.
Govil, M. K. and M. C. Fu. 1999. Queueing Theory in Manufacturing: A Survey. Journal of Manufacturing Systems. 18(3): 214-240.
Gross, D. and C. M. Harris. 1998. Fundamentals of Queueing Theory. V. Barnett, R. A. Bradley, N. I. Fisher, J. S. Hunter, J. B. Kadane, D. G. Kendall, D. W. Scott, A. F. M. Smith, J. L. Teugels and G. S. Watson. Wiley Series in Probability and Statistics. Third Edition. John Wiley & Sons Inc. New York.
Gutin, G., A. Vainshtein, and A. Yeo. 2003. Domination Analysis of Combinatorial Optimization Problems. Discrete Applied Mathematics. 129(2-3): 513-520.
Gutin, G. and A. Yeo. 2002. Polynomial Approximation Algorithms for the TSP and QAP with a Factorial Domination Number. Discrete Applied Mathematics. 119(1-2): 107-116.
Haas, P. J. 2002. Stochastic Petri Nets: Modeling, Stability, Simulation. P. W. Glynn and S. M. Robinson. Springer Series in Operations Research. Springer-Verlag. New York.
HealthNet. 2002. A Complete Explanation of Your Plan.
Healy, K. J. and R. A. Kilgore. 1997. Silk™: a Java-based process simulation language. Proceedings of the 1997 Winter Simulation Conference. 475-482.
Henriksen, J. O. 1981. GPSS - Finding the Appropriate World-View. Proceedings of the 1981 Winter Simulation Conference. 505-516.
Hopp, W. and M. Spearman. 1991. Throughput of a Constant Work in Process Manufacturing Line Subject to Failures. International Journal of Production Research. 29(3): 635-655.
Hyden, P., L. Schruben, and T. Roeder. 2001. Resource Graphs for Modeling Large-Scale, Highly Congested Systems. Proceeding of the 2001 Winter Simulation Conference. ed. M. W. Rohrer. Piscataway, NJ, USA: IEEE. 523-529.
Jackson, J. R. 1957. Simulation Research on Job Shop Production. Naval Research Logistics Quarterly. 4:287-295.
Jefferson, D. 1983. Virtual Time. Proceedings of the 1983 International Conference on Parallel Processing. eds. H. J. Siegel and L. Siegel. IEEE Comput. Soc. Press. 384-394.
Jensen, J. B., M. K. Malhotra, and P. R. Philipoom. 1996. Machine Dedication and Process Flexibility in a Group Technology Environment. Journal of Operations Management. 14(1): 19-39.
Jordan, M. 2003. An Introduction to Probabilistic Graphical Models. in preparation.
Jordan, M., Z. Ghahramani, T. Jaakkola, and L. Saul. 1999. An Introduction to Variational Methods for Graphical Models. Cambridge, MA. The MIT Press.
Kiviat, P. J. 1969. Digital Computer Simulation: Computer Programming Languages. Memorandum RM-5883-PR, The Rand Corporation. Santa Monica, CA.
Landau, E. 1909. Handbuch der Lehre von der Verteilung der Primzahlen. Teubner. Leipzig.
Law, A. and W. D. Kelton. 2000. Simulation Modeling and Analysis. Third Edition. McGraw-Hill Higher Education.
L'Ecuyer, P. 1994. Efficiency Improvement and Variance Reduction. Proceedings of the 1994 Winter Simulation Conference. eds. J. D. Tew, M. S. Manivannan, D. A. Sadowski and A. F. Seila. New York, NY, USA: IEEE. 122-132.
Little, J. D. 1961. A Proof of the Queueing Formula: L = λW. Operations Research. 9:383-387.
Markowitz, H. M. 1979. Simscript: Past, Present, and Some Thoughts about the Future. New York. Academic Press.
McNeill, J. E., G. T. Mackulak, and J. W. Fowler. 2003. Indirect Estimation of Cycle Time Quantiles from Discrete Event Simulation Models Using the Cornish-Fisher Expansion. Proceedings of the 2003 Winter Simulation Conference. eds. S. Chick, P. J. Sanchez, D. Ferrin and D. J. Morrice. 1377-1382.
Merriam-Webster. 1993. Merriam-Webster's Collegiate Dictionary. Tenth Edition. Springfield, Massachusetts.
Microsoft. 2004. Microsoft Application Center 2000: Load Balancing.
Miller, J. A., G. Baramidze, P. A. Fishwick, and A. P. Sheth. 2004. Investigating Ontologies for Simulation Modeling. Proceedings of the 37th Annual Simulation Symposium.
Moore, J. M. and J. R. Wilson. 1967. A Review of Simulation Research in Job Shop Scheduling. Journal of Production Inventory Management. 8:1-10.
Nance, R. E. 1979. Model Representation in Discrete Event Simulation: Prospects for Developing Documentation Standards. New York. Academic Press.
Nance, R. E. 1981. The Time and State Relationships in Simulation Modeling. Communications of the ACM. 24(4): 173-179.
Nance, R. E., C. M. Overstreet, and E. H. Page. 1999. Redundancy in Model Specifications for Discrete Event Simulation. ACM Transactions on Modeling and Computer Simulation. 9(3): 254-281.
Overstreet, C. M. 1982. Model Specification and Analysis for Discrete Event Simulation. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Overstreet, C. M. 1987. Using Graphs to Translate Between World Views. Proceedings of the 1987 Winter Simulation Conference. eds. A. Thesen, H. Grant and W. D. Kelton. San Diego, CA, USA: SCS. 582-589.
Page, E. H. 1994. Simulation Modeling Methodology: Principles and Etiology of Decision Support. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Panwalkar, S. S. and W. Iskander. 1977. A Survey of Scheduling Rules. Operations Research. 25(1): 45-60.
Petri, C. A. 1962. Fundamentals of a Theory of Asynchronous Information Flow. Proceedings of the IFIP Congress. 386-390.
Pinedo, M. 2002. Scheduling: Theory, Algorithms, and Systems. Second Edition. Prentice Hall. Upper Saddle, NJ.
Powell, M. J. D. 1981. Approximation Theory and Methods. Cambridge University Press. Cambridge.
Roeder, T. M., S. A. Fischbein, M. Janakiram, and L. W. Schruben. 2002. Resource-Driven and Job-Driven Simulations. 2002 International Conference on Modeling and Analysis of Semiconductor Manufacturing. 78-83.
Rohan, D. 1999. Machine Dedication under Product and Process Diversity. Proceedings of the 1999 Winter Simulation Conference. eds. P. A. Farrington, H. Black Nembhard, D. T. Sturrock and G. W. Evans. IEEE. 897-902.
Rose, O. 2002. Some Issues of the Critical Ratio Dispatch Rule in Semiconductor Manufacturing. Proceedings of the 2002 Winter Simulation Conference. eds. E. Yücesan, C.-H. Chen, J. L. Snowdon and J. M. Charnes. IEEE. 1401-1405.
Ross, S. M. 1997. Introduction to Probability Models. 6th Edition. Academic Press. San Diego, CA.
Savage, E. L., L. W. Schruben, and E. Yücesan. 2004. On the Generality of Event Graph Models. INFORMS Journal on Computing.
Schmeiser, B. and Y. Yeh. 2002. On Choosing a Single Criterion for Confidence-Interval Procedures. Proceedings of the 2002 Winter Simulation Conference. eds. E. Yücesan, C.-H. Chen, J. L. Snowdon and J. M. Charnes. IEEE. 345-352.
Schmidt, J. W. and R. E. Taylor. 1970. Simulation and Analysis of Industrial Systems. Richard D. Irwin. Homewood, IL.
Schriber, T. 1991. An Introduction to Simulation Using GPSS/H. John Wiley & Sons.
Schruben, D. and L. W. Schruben. 2001. Graphical Simulation Modeling Using SIGMA. 4th Edition. Custom Simulations.
Schruben, L. 1983. Simulation Modeling with Event Graphs. Communications of the ACM. 26(11): 957-963.
Schruben, L. W. 2000. Mathematical Programming Models of Discrete Event System Dynamics. Proceedings of the 2000 Winter Simulation Conference. eds. J. A. Joines, R. R. Barton, K. Kang and P. A. Fishwick. Piscataway, NJ, USA: IEEE. 381-385.
Schruben, L. W. 2003. Conditional Parametric Petri Nets and their Mapping to Simulation Event Graphs. European Simulation and Modelling Conference.
Schruben, L. W. and R. Kulkarni. 1982. Some Consequences of Estimating Parameters for the M/M/1 Queue. Operations Research Letters. 1(2): 75-78.
Schruben, L. W. and T. M. Roeder. 2003. Fast simulations of large-scale highly congested systems. Simulation: Transactions of the Society for Modeling and Simulation International. 79(3): 1-11.
Schruben, L. W. and E. Yücesan. 1988. Transaction Tagging in Highly Congested Queueing Simulations. Queueing Systems. 3(3): 257-264.
Seila, A. F., V. Ceric, and P. Tadikamalla. 2003. Applied Simulation Modeling. Duxbury Applied Series. Thomson Brooks/Cole. Belmont, CA.
Shafer, S. M. and J. M. Charnes. 1997. Offsetting Lower Routing Flexibility in Cellular Manufacturing due to Machine Dedication. International Journal of Production Research. 35(2): 551-567.
Shaked, M. and J. G. Shanthikumar. 1994. Stochastic Orders and Their Applications. D. Aldous and Y. L. Tong. Probability and Mathematical Statistics. Academic Press, Inc. San Diego.
Sturm, R. 2002. Personal Communication. 3 May 2002.
Tibshirani, R. 1988. Variance Stabilization and the Bootstrap. Biometrika. 75(3): 433-444.
Tocher, K. D. 1963. The Art of Simulation. English Universities Press. London.
Törn, A. 1981. Simulation Graphs: A General Tool for Modelling Simulation Designs. Simulation. 37(6): 187-194.
Vollmann, T. E., W. L. Berry, and D. C. Whybark. 1988. Manufacturing Planning and Control Systems. Second Edition. Dow Jones-Irwin. Homewood, IL.
Welch, P. D. 1981. On the Problem of the Initial Transient in Steady-State Simulation. IBM Watson Research Center. Yorktown Heights, NY.
Whitt, W. 1984. The Amount of Overtaking in a Network of Queues. Networks. 14(3): 411-426.
Williams, J. W. J. 1964. Algorithm 232 (Heap Sort). Communications of the ACM. 7: 347-348.
Witten, I. H., G. M. Birtwistle, J. Cleary, D. R. Hill, D. Levinson, G. Lomow, R. Neal, M. Peterson, B. W. Unger, and B. Wyvill. 1983. Jade: A Distributed Software Prototyping Environment. ACM Operating Systems Review. 17(3): 10-23.
Wonnacott, P. 1996. Run-time Support for Parallel Discrete Event Simulation Languages. Ph.D. Dissertation, Computer Science, University of Exeter. Exeter.
Woods, R. H. 1998. A Cost Benefit Analysis of Photolithography and Metrology Dedication in a Metrology Constrained Multipart Number Fabricator. IEEE/SEMI Advanced Semiconductor Manufacturing Conference and Workshop. IEEE. 145-147.
Yoshida, T. and H. Touzaki. 1999. A Study on Association Among Dispatching Rules in Manufacturing Scheduling Problems. 1999 7th IEEE International Conference on Emerging Technologies and Factory Automation. Proceedings ETFA '99. ed. J. M. Fuertes. IEEE. 1355-1360.
Zeigler, B. P. 1976. Theory of Modeling and Simulation. John Wiley. New York.
Zeigler, B. P. 2003. DEVS Today: Recent Advances in Discrete Event-based Information. Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems. 148-162.
Appendix A. DEFINITIONS AND NOTATION
A.1 DEFINITIONS AND ABBREVIATIONS
System A system is a collection of entities that interact with a common purpose
according to sets of laws and policies (Schmidt and Taylor 1970).
Model A model is “a system used as a surrogate for another system.”
(Schruben and Schruben 2001).
Simulation A simulation model is a (computer) program that is used as a surrogate for
another system.
Job A job is a transient part of a system. For example, a customer enters and
leaves a grocery store; the job is not in the store for the duration of the
store’s business hours. The terms “job,” “customer,” and “transient entity”
are used interchangeably. The literature also refers to jobs as “dynamic
entities.”
Resource A resource is a resident element of a system. For example, a cash register
will be in a store during the entire time the store is in operation. Workers
may be transient or resident elements, depending on the focus of the study.
The terms “resource,” “server,” and “resident entity” are used
interchangeably. The literature also refers to them as “static entities.”
Event “A change in object state, occurring at an instant” (Nance 1981).
Taxonomy In the context of this dissertation, a taxonomy is a framework to assist in
organizing information used for a simulation model.
RD Resource-driven; simulation approach where the focus is on the resident
entities in the model.
JD Job-driven; simulation approach where the focus is on the transient entities
in the model.
FEL The Future Events List is a data structure that stores the simulation events
that have been scheduled to occur in the future.
SPL Simulation Programming Language
STPN Stochastic Timed Petri Net, see Section 2.2.2.
HMM Hidden Markov Models; a type of statistical learning model.
Delay “Delay” is used as a generic term for the time between two events of
interest. Typically, a delay is the time between a job’s arrival at a server and
its service start. In this dissertation, it will also be used to refer to a job’s
cycle time (time in system). It can also refer generically to the time between
a job’s visits to points U and V in the system, or between events X and Y.
Average Unless otherwise specified, we refer to count, not time, averages in this
dissertation. That is, the sample average of random variable X is defined as
X̄ = (1/n) ∑_{i=1}^{n} X_i.
Service Rule or rules a server uses to decide which job to process next. The
Discipline rule(s) may be job-, server-, or system-based, or may be arbitrary. A service
discipline is also referred to as a service protocol or a dispatching rule. A
common service discipline is first-come-first-serve (FCFS).
FCFS First-come-first-serve service discipline. It is also known as first-in-first-out
(FIFO).
LCFS Last-come-first-serve service discipline. It is also known as last-in-first-out
(LIFO).
CR Critical ratio ranking; used to sort jobs in queue based on their due dates and
expected remaining processing time. See Section 4.2.3.4.
A.2 NOTATION
We organize the notation by the section in which it appears.
A.2.1 GENERAL NOTATION AND NOTATIONAL CONVENTIONS
kth job in the system Refers to the kth job to enter the system. In FCFS single-stage systems or single-server tandem queues, this will also correspond to the kth service in the system. In multiple-server n-stage tandem queues, overtaking is possible, so the kth job in the system may not be the kth service at stages 2, 3, …, n.
( )* * indicates the largest value (for an index); for example, i = 1, 2, …, i*
i index for a resource in a group of several resources; also time slice index
j index for job type
k generic counter
g index for a resource group
r index for a step in a route
s total number of servers at a stage in a queueing system
t index for time
t* length of the simulation run
N number of jobs simulated
n number of stages for tandem queueing systems
Q number of jobs waiting in queue, typically used in an event graph
R number of available servers, typically used in an event graph
X random variable; its realization is x, so X = x
θ set of performance measures
λ arrival rate
ν service rate
ρ traffic intensity; for a given system, ρ = λ/ν
I(i) indicator function; I(i) = 1 if i is true, 0 otherwise
µ_X average (or expected) value of random variable X
σ²_X variance of random variable X
C²_X squared coefficient of variation for random variable X, C²_X = σ²_X / µ²_X
MSE[X] mean squared error of the random variable X taken from a population with unknown mean µ_X; MSE[X] = E[(X − µ_X)²] = bias² + variance
RND a (function returning a) Uniform (0,1) random number
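To make the statistical notation above concrete, here is a minimal numeric sketch in Python. The sample values and the assumed true mean are invented for illustration only:

```python
# Illustrative check of the A.2.1 statistics; the sample below is invented.
xs = [2.0, 4.0, 4.0, 6.0]          # realizations x of the random variable X
n = len(xs)

mean = sum(xs) / n                              # sample average, (1/n) * sum of X_i
var = sum((x - mean) ** 2 for x in xs) / n      # variance about the sample mean
c2 = var / mean ** 2                            # squared coefficient of variation

# MSE decomposition: E[(X - mu_X)^2] = bias^2 + variance.
# Here `mean` is treated as an estimate of an assumed true mean mu = 4.5.
mu = 4.5
bias = mean - mu
mse = bias ** 2 + var

print(mean, var, c2, mse)
```

The same decomposition underlies the error analyses later in the dissertation: an approximation can trade a little bias for a large variance reduction and still have lower MSE.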
A.2.2 INFORMATION TAXONOMY
{G(t) : 0 ≤ t ≤ t*} set of general system information; when the set does not change over time, we omit the time index for simplicity, G(t) = G ∀t
{R(t) : 0 ≤ t ≤ t*} set of resources; when the set does not change over time, we omit the time index for simplicity, R(t) = R ∀t
{J(t) : 0 ≤ t ≤ t*} set of jobs; when the set does not change over time, we omit the time index for simplicity, J(t) = J ∀t
{2^R(t) : 0 ≤ t ≤ t*} power set of {R(t)}; that is, the set of all subsets of {R(t)}
{2^J(t) : 0 ≤ t ≤ t*} power set of {J(t)}; that is, the set of all subsets of {J(t)}
{R′(t) : 0 ≤ t ≤ t*} element of the power set of {R(t)}, R′(t) ∈ 2^R(t); that is, a subset of all resources
{J′(t) : 0 ≤ t ≤ t*} element of the power set of {J(t)}, J′(t) ∈ 2^J(t); that is, a subset of all jobs
Jmax maximum value of J(t), Jmax ≥ J(t) ∀t
M set of modeling behaviors or rules, M = M_G ∪ M_{R,local} ∪ M_{R,global} ∪ M_{J,local} ∪ M_{J,global}; the set of system aspects that are to be modeled, for example resource dedication or FCFS queueing. Subscripts “local” and “global” refer to local and global behaviors.
O set of output statistics, O = O_G ∪ O_{R,local} ∪ O_{R,global} ∪ O_{J,local} ∪ O_{J,global}; the set of desired output statistics, for example, job waiting time distributions.
A.2.3 APPROXIMATION ALGORITHM CHARACTERISTICS
C(X) amount of CPU time (processing power) required to get an estimate for random variable X.
Eff(X) an algorithm’s efficiency at attaining an estimate for random variable X.
A.2.4 ESTIMATING DEDICATION CONSTRAINTS
d service discipline used to process jobs
m number of revisits jobs make to a resource group
{Q_rg(t) : 1 ≤ r ≤ r*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs at each route step r waiting in queue for each server group g at time t (region I.2.ii in the taxonomy)
{S_ig(t) : 1 ≤ i ≤ s_g*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} status of server i in group g at time t; S_ig(t) ∈ {busy, idle} (region II.2.ii.b). The total number of available servers in group g at time t is S_·g(t) = ∑_{i=1}^{s_g*} I(S_ig(t) = idle); S_·g(t) is in region I.2.ii.
{Q_irg(t) : 1 ≤ i ≤ s_g*, 1 ≤ r ≤ r*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs at route step r waiting specifically for server i of group g at time t (region II.2.ii.b). The total number of jobs waiting at group g, either for a specific server or for any of the servers, is Q_g^tot(t) = ∑_{r=1}^{r*} ( Q_rg(t) + ∑_{i=1}^{s_g*} Q_irg(t) ) (region I.2.ii).
{D_ijg(t) : 1 ≤ i ≤ s_g*, 1 ≤ j ≤ j*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs of type j that server i of group g has processed that may return to i in the future (region II.2.ii.b).
J = {J_gk : 1 ≤ g ≤ g*, 1 ≤ k ≤ N}; for each job k, the resource in group g that k has been dedicated to. If no dedication is present at g, the array entry can be left blank (region III.2.i.b).
{A_g(t) : 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of times by time t that jobs arriving at g have been forced to wait although there was at least one idle server
{V_g(t) : 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of times by time t that a server in group g remains idle although it finds Q_g^tot(t) > 0
A.2.5 APPROXIMATING WAITING TIME DISTRIBUTIONS
γ delay value
ξ time slice size
Ξ total number of time slices, Ξ = ⌈t*/ξ⌉
W = {W_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} waiting time at the ith stage of the kth job in the system; for single-stage systems, we omit the second index for simplicity
ℓ(t, γ, ξ) Lower Time Slice Counter Method time slice number of a DELAY event scheduled at time t for γ time units in the future at time slice size ξ; ℓ(t, γ, ξ) = ⌊(t + γ)/ξ⌋
u(t, γ, ξ) Upper Time Slice Counter Method time slice number of a DELAY event scheduled at time t for γ time units in the future at time slice size ξ
{A_k : 1 ≤ k ≤ N} arrival time of the kth job in the system
{N_A(t) : 0 ≤ t ≤ t*} number of arrivals by time t
S = {S_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} start service time at the ith stage of the kth job in the system; for single-stage systems, we omit the stage index for simplicity
N_S(t) = {N_S,i(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of starts at stage i by time t; for single-stage systems, we omit the stage index for simplicity
S shorthand for N_S(t); may also refer to the time of the START event (S_k)
D_γ = {D_k,i,γ : 1 ≤ i ≤ n : 1 ≤ k ≤ N} time the kth job in the system has been delayed γ time units at the ith stage of the system; for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
N_D^γ(t) = {N_D,i^γ(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of jobs at stage i delayed γ time units by time t; N_D,1^γ(t) = N_A(t − γ) for t ≥ γ; for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
D shorthand for N_D^γ(t); may also refer to the time of the DELAY event (D_k)
W_γ(t) = {W_i,γ(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of jobs at stage i whose wait was less than γ time units by time t, W_γ(t) ≤ N_D^γ(t); for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
{∆_i,γ : 1 ≤ i ≤ Ξ} number of γ-time unit DELAY events through time slice i; we omit the delay subscript when its value is obvious
N_L^{γ,ξ}(t), 0 ≤ t ≤ t*, number of approximated γ-DELAY events from the Lower Time Slice Counter Method that have occurred by time t, for time slice size ξ; we may omit the subscript for simplicity. N_L^{γ,ξ}(t) = ∆_{ℓ(t,0,ξ),γ}
L shorthand for N_L^{γ,ξ}(t)
N_U^{γ,ξ}(t), 0 ≤ t ≤ t*, number of approximated γ-DELAY events from the Upper Time Slice Counter Method that have occurred by time t, for time slice size ξ; we may omit the subscript for simplicity. N_U^{γ,ξ}(t) = ∆_{u(t,0,ξ),γ}
U shorthand for N_U^{γ,ξ}(t)
F = {F_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} finish time at the ith stage of the kth job in the system; for single-stage systems, we omit the stage index for simplicity
F_W(·) distribution function of the job waiting (or other delay) times
F_W(γ) P{delay ≤ γ}
F̂_W(γ) estimate of P{delay ≤ γ}
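A small numeric sketch of the time-slice quantities defined above may help; the numbers are invented, and only the lower slice formula is coded because only it appears explicitly in the notation:

```python
from math import floor, ceil

def lower_slice(t, gamma, xi):
    # Lower Time Slice Counter Method: slice number of a DELAY event
    # scheduled at time t to complete gamma time units later,
    # floor((t + gamma) / xi).
    return floor((t + gamma) / xi)

t_star, xi = 100.0, 7.0
num_slices = ceil(t_star / xi)   # Xi, the total number of time slices

# A delay of 2.0 units scheduled at t = 3.2 with slice size 0.5
# completes at 5.2, which falls in slice 10.
print(lower_slice(3.2, 2.0, 0.5), num_slices)
```

Rounding down (the lower method) undercounts how long the delay appears to last, so the resulting delay-count process bounds the true one from one side; the upper method errs in the other direction.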
Appendix B. EXAMPLE: GENERATING CORRECT DEPARTURE PROCESS WITHOUT JOB INFORMATION
In this section, we define the problem of generating the correct departure process from a
queueing system if information from region III is unavailable. We propose a solution
procedure, and describe how the quality of a solution could be measured for this problem.
B.1 PROBLEM STATEMENT
Consider an open queueing network in which jobs of j* types are processed on g* server groups. Each server group g is composed of s_g* functionally identical servers. Each job type is characterized by known service-time distributions and routes. The r* route steps may be deterministic, stochastic, and/or state-dependent. Both the routings and service times may be non-stationary. At each of the server groups, jobs are processed according to some service discipline. The set {j*, g*, s_g*, r*, t*} and the associated information are in region I of the taxonomy.
Define the following processes. Taxonomy classifications are given in parentheses.
N(t) = {N_ig(t) : 1 ≤ i ≤ s_g*, 1 ≤ g ≤ g*}, 0 ≤ t ≤ t*, is the number of jobs that have left server i in group g by time t. It may be sufficient to consider N_·g(t) = ∑_{i=1}^{s_g*} N_ig(t), the number of jobs that have left the entire group, rather than looking at each server individually. ({N(t)} falls in region II.2.ii.b, N_·g(t) in region I.2.ii.)
J_g = {J_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the job type that is the kth departure from group g. J_kg ∈ {1, …, j*} ∀k, g (region I.2.ii).
M_g = {M_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the server that processed the kth departure from group g. For every g, M_kg ∈ {1, …, s_g*} (region III.2.i.b).
D_g = {D_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the time of the kth departure from server group g. The event {D_kg ≤ t} is equivalent to the event {N_·g(t) ≥ k}. D_0g = 0 for all g (region III.2.i.b).
Q(t) = {Q_rg(t) : 1 ≤ r ≤ r*, 1 ≤ g ≤ g*}, 0 ≤ t ≤ t*, defines the number of jobs at each route step r waiting in queue for each server group g at time t (region I.2.ii).
The problem is to generate the departure process {(N(t), J_g, M_g, D_g) : 0 ≤ t ≤ t*, 1 ≤ g ≤ g*} from the server groups, using only the steady-state observed stochastic process {Q(t) : 0 ≤ t ≤ t*}. That is, to use information from region I to find information in regions II.2.ii and III.2.ii.
B.2 SOLUTION CHARACTERISTICS AND ERROR MEASURES
Ideally, a solution to this problem will generate the same departure process {(N(t), J_g, M_g, D_g) : 0 ≤ t ≤ t*, 1 ≤ g ≤ g*} as a model that was maintaining all job information. Whitt (1984) discusses overtaking in queues, and proposes a comparison of
the distributions of the possible job orderings, based on the perturbations of the original
job ordering. This approach may be of value in this context.
If it is not possible to generate the same departure process, we would like to
minimize the disruption caused to the rest of the system by sending out the “wrong” job
type. That is, if sending out job type j would cause less disorder in the remaining system
than sending out job type j’, we would prefer to send out j. Possible objective functions
to minimize the disruption are to
Minimize P{wrong type}
or to
Maximize P{distance from correct type < ε}.
The first objective function is similar to analyzing queueing models. The second
objective function requires a measure of “distance” from the “correct type.”
One way to measure the distance from the correct type or the disruption caused in
the remaining system is looking at the deviations in queue sizes at all the server groups.
There are (at least) two possible ways of measuring these deviations. Let
Q^tot(t) = ∑_{g=1}^{g*} ∑_{r=1}^{r*} Q_rg(t)
be the total number of jobs waiting in queue in the system at time t. Then the first error measure measures the squared error of the total jobs waiting in system until time t*:
Error_1 = (1/t*) ∫_0^{t*} (Q^tot_complete(t) − Q^tot_approx(t))² dt   (11)
The second error measure differs slightly:
Error_2 = (1/t*) ∫_0^{t*} ∑_{g=1}^{g*} ∑_{r=1}^{r*} (Q_rg^complete(t) − Q_rg^approx(t))² dt   (12)
The following example illustrates the difference. Let r* = g* = 2.
Table 12. Example values to illustrate differences in error measures
r, g    Q_rg^complete    Q_rg^approx    (Q_rg^complete − Q_rg^approx)²
1, 1          3                5                  4
2, 1          1                2                  1
1, 2          2                5                  9
2, 2          4                5                  1
sum          10               17
Error_1 = (10 − 17)² = 49.
Error_2 = 4 + 1 + 9 + 1 = 15.
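The arithmetic in the example can be checked with a short Python sketch; since the queue sizes are held fixed, the time integrals in (11) and (12) reduce to the pointwise squared differences:

```python
# Worked check of the two error measures from Table 12 (r* = g* = 2).
# Keys are (route step r, server group g); values are queue sizes.
complete = {(1, 1): 3, (2, 1): 1, (1, 2): 2, (2, 2): 4}
approx   = {(1, 1): 5, (2, 1): 2, (1, 2): 5, (2, 2): 5}

# Error1: square of the difference in *total* queue sizes.
error1 = (sum(complete.values()) - sum(approx.values())) ** 2

# Error2: sum of squared differences at each individual (r, g) queue.
error2 = sum((complete[k] - approx[k]) ** 2 for k in complete)

print(error1, error2)  # → 49 15
```

Note that Error1 can exceed Error2 because the per-queue deviations here all have the same sign, so they accumulate in the total; deviations of opposite sign would instead cancel in Error1 while still contributing to Error2.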
The choice of error measure depends on the actual problem, and on whether the difference in total queue size (Error_1) or the relative differences between individual queues (Error_2) is more important. A mixture of the two calculates the differences in the total queue sizes at each server group. Define the number of jobs waiting at server group g as
Q_·g(t) = ∑_{r=1}^{r*} Q_rg(t).
Then
Error_3 = (1/t*) ∫_0^{t*} ∑_{g=1}^{g*} (Q_·g^complete(t) − Q_·g^approx(t))² dt   (13)
An unresolved question is whether the error measure should be a random variable or a
deterministic quantity. If it is a random variable, it may be possible to establish
dominance of one error measure over the others.
B.3 SOLUTION APPROACH: DISCRETIZED-TIME QUEUES
Rather than storing information from region III (job waiting times), we propose discretizing the time jobs spend in queue by adding information to region I: integer counts of the different types of jobs that have been waiting for specified discrete lengths of time. This allows the approximation of job orderings in queue.
We may be able to make use of ideas from the discrete-time conversion literature. See, for example, Fox and Glynn (1990). The idea there is to condition a continuous-time
Markov Chain on the states visited (the embedded chain), and to use the resulting
conversion to estimate quantities such as costs for the continuous case.
In a conventional resource-driven approach (using information from region I
only), the queue for a given server (group) is an array of integers, see Figure 25.
Figure 25. Sample queue for 5 parts that visit the same server 3 times
The associated discretized queue is a matrix where the rows represent the number of jobs
waiting for the discrete time intervals. This is illustrated in Figure 26 for an interval of 5
time units.
Figure 26. Sample discretized queue for the queue given in Figure 25
Using the information in the queue matrix, we can select jobs based on approximate
FCFS (or LCFS) disciplines. Similarly, we can estimate the lengths of time individual
jobs have been waiting in queue.
Parts that have been waiting 0-n minutes at time t will have waited 0+∆t to n+∆t
minutes at time t+∆t. Some mechanism must be found to correctly update the matrix. A
possible implementation is to shift the entries in the matrix “down” a row every n time
units. This is inefficient because it updates the matrix more often than is required. It
should be sufficient to update the matrix when jobs enter the queue or start service
because the queue is unaffected by other system events.
This is an approximation: a job that arrived at time n − ε will be reclassified as having waited more than n time units only ε time units after its arrival.
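The lazy-update mechanism described above can be sketched as follows. This is an illustrative sketch, not the dissertation's implementation: the class and method names are invented, and the per-job-type dimension of the matrix is omitted, keeping a single column of counts per waiting interval.

```python
from math import floor

class DiscretizedQueue:
    """Counts of waiting jobs grouped by discretized waiting time.
    rows[0] holds jobs that have waited less than `width` time units,
    rows[1] the next interval, and the last row everything longer."""

    def __init__(self, width, n_rows=3):
        self.width = width          # interval length (e.g. 5 time units)
        self.rows = [0] * n_rows    # integer counts per waiting interval
        self.last_shift = 0.0       # simulation time of the last row shift

    def _advance(self, now):
        # Shift counts "down" one row per elapsed interval. Done lazily,
        # only when the queue is touched (job arrival or service start),
        # since no other system event affects the queue.
        shifts = floor((now - self.last_shift) / self.width)
        for _ in range(min(shifts, len(self.rows))):
            self.rows[-1] += self.rows[-2]      # last row absorbs overflow
            for i in range(len(self.rows) - 2, 0, -1):
                self.rows[i] = self.rows[i - 1]
            self.rows[0] = 0
        if shifts:
            self.last_shift += shifts * self.width

    def enqueue(self, now):
        self._advance(now)
        self.rows[0] += 1           # new arrival has waited < width units

    def start_service_fcfs(self, now):
        # Approximate FCFS: serve from the oldest non-empty interval.
        self._advance(now)
        for i in range(len(self.rows) - 1, -1, -1):
            if self.rows[i]:
                self.rows[i] -= 1
                return i            # interval index the served job was in
        return None                 # queue is empty
```

For example, with `width = 5`, a job enqueued at time 0 and another at time 7 land in different rows, and `start_service_fcfs` picks from the older row first. Jobs within a row are indistinguishable, which is exactly the approximation to the job ordering discussed in the text.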
This approach allows us to approximate the necessary information from regions
III.2.i and III.2.ii. That is, we can approximate system behavior and get estimates of
waiting times. As the time interval length decreases, the memory requirements will
increase. In some cases, it may be more memory-efficient to use the information from
region III instead.
Appendix C. DEDICATION CONSTRAINTS
C.1 EXAMPLES OF SYSTEMS WITH DEDICATION
C.1.1 WAFER PRODUCTION
An example of a deterministic, re-entrant, and state-dependent queueing system is a fab-level model in
semiconductor manufacturing. There are many thousands of wafers that pass through the
system. An example of state-dependency is that routing to QA (quality assurance) may
be increased if a machine group is producing substandard wafers. One of the machine
groups that wafers visit repeatedly is the stepper tool group that motivated the research in
Section 4.4. On return visits to the stepper group, wafers are sent to the same subset of
stepper tools that processed them before.
We observe $\{Q_{\cdot g}(t)\}$, the number of wafers waiting at each resource group $g$ for any of its $s_g^*$ servers. For the stepper tools (and other machines that may be subject to dedication constraints), we additionally can observe $\{Q_{ig}(t)\}$, the number of wafers that are on a return visit. Tools with dedication constraints may also have positive values of $\{Q_{\cdot g}(t)\}$ if there are unassigned wafers waiting for their first visit to the tool.
We know $\{S_{ig}(t)\}$, the state of every machine in the factory. Finally, $\{D_{ig}(t)\}$ is known, but is of interest only for the tools with dedication. A resource-driven model of the fab cannot observe $\{Q_{ig}(t)\}$ because of its lack of knowledge of $\{J\}$.
C.1.2 HEALTH CARE

Health Maintenance Organizations (HMOs) can be viewed as very large, stochastic,
state-dependent, re-entrant (with dedication) systems. The following description is based
on the author’s experience with health care plans, for example (Aetna 2001;
HealthNet 2002).
To receive care and get referrals or prescriptions, patients are required to visit
their primary care physicians (PCPs), their “dedicated resources.” Even if other doctors
in the same office are available while the PCP is not, the patient must visit the PCP. The
system is stochastic as patients may see several doctors for the same condition, but the
order in which they visit the doctors need not be fixed. It is state-dependent because the
routing may depend on the availability of physicians other than the PCP. (If the preferred
cardiologist is not available, patients may be referred to a different one.) In emergencies,
it is even possible to be seen by a doctor other than the PCP.
In this example, we observe $\{Q_{\cdot g}(t)\}$, the number of patients waiting for a group of doctors, any of whom would be able to help the patient. For example, it likely does not matter which dentist in an office of four fills a small cavity. In most cases, the observed $\{Q_{\cdot g}(t)\}$ will be greater than $\{Q_{ig}(t)\}$. We can observe $\{S_{ig}(t)\}$ and $\{D_{ig}(t)\}$, although the meaning of the latter is not as significant here as it is in other examples. It represents the number of patients that have been assigned to a doctor. We assume that patients will continue to seek treatment from their PCPs indefinitely unless the patients switch doctors or pass away. (In reality, $\{D_{ig}(t)\}$ is used to determine whether doctors will accept new
patients. It is an interesting problem to try to do an analogous assignment of jobs to
machines in production systems like the semiconductor fab discussed above.)
C.1.3 WEB SERVERS

Another stochastic, re-entrant, state-dependent system is found in web browsing. Users browse the Internet by sending HTTP requests to web servers, which respond by sending back the web pages and other content. This is a very large stochastic queueing network: users may visit different websites or pages on a site as they desire. Here $g^*$, $s_g^*$, and $r^*$ are extremely large.
State dependency occurs on both sides of the information exchange: The user may
choose to visit a different website (cnn.com instead of msnbc.com, for example) if the
site (s)he is trying to access is not responding quickly enough. On the server side,
requests are routed to and processed by different server banks, depending on the volume
of requests. In extreme cases, additional servers may be added temporarily.
In web server administration, there is the notion of “sticky web pages”
(Microsoft 2004). This states that all HTTP requests from a client machine during a
particular browsing session must be processed by the same server, otherwise the session
information is lost. That is, the client is dedicated to this server for the duration of the browsing session. For this system, we can observe $\{Q_{\cdot g}(t)\}$, $\{Q_{ig}(t)\}$, and $\{S_{ig}(t)\}$, though their states change extremely rapidly. $\{D_{ig}(t)\}$ is less easily observable. We may know how many users a particular server has served, but it is not always clear whether the users are still perusing the site.
C.1.4 OTHER

There are cases where we may not have observations for all four of the stochastic processes mentioned above. For example, we may know when an ATM is in use, but not know how many customers are waiting in line for it. In this case, we have a very simple G/G/1 queue, but cannot observe $\{Q_{\cdot g}(t)\}$; $\{S_{ig}(t)\}$ is still observable.
C.2 ENHANCED APPROXIMATION

Approximate $P\{i \text{ can begin processing} \mid i \text{ now available},\; Q_g^{tot}(t) > 0\}$ by

$$
1 - \left( \frac{s_g^* - S_{\cdot g}(t)}{s_g^* - S_{\cdot g}(t) + 1} \right)^{Q_g^{tot}(t)}
\qquad (14)
$$
This probability is independent of the total number of servers in $g$, $s_g^*$, and uses only the number of servers that are currently busy; it takes into account the fact that waiting jobs are not waiting for currently idle servers. Define $0^0 = 1$:
• If there are no jobs waiting in queue, $Q_g^{tot}(t) = 0$: $1 - \left( \frac{s_g^* - S_{\cdot g}(t)}{s_g^* - S_{\cdot g}(t) + 1} \right)^0 = 1 - 1 = 0$; a server will not begin processing a job if there are no jobs waiting.
• If $Q_g^{tot}(t) > 0$ and all servers (but $i$) had been idle (and since $i$ has just completed, $S_{\cdot g}(t) = s_g^*$): $1 - \left( \frac{0}{1} \right)^{Q_g^{tot}(t)} = 1 - 0 = 1$; one server will have to begin serving jobs if all servers are currently idle.
• If $s_g^* = 10$, $Q_g^{tot}(t) = 1$, and $S_{\cdot g}(t) = 8$: $1 - \left( \frac{2}{3} \right)^1 = \frac{1}{3}$, which is greater than the probability of 0.1 we obtain from our current approximation because it assumes the jobs in queue are waiting for one of the busy machines.
• If $s_g^* = 10$, $Q_g^{tot}(t) = 5$, and $S_{\cdot g}(t) = 8$: $1 - \left( \frac{2}{3} \right)^5 \approx 0.8683$, which is greater than our probability of 0.4095 because it assumes jobs are not waiting for the 7 servers that were previously idle.
This refinement was formalized by David Bowen (Bowen 2003).
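Equation (14) and the worked cases above can be verified numerically; a minimal sketch (ours, not the dissertation's code), with `s_star` standing for $s_g^*$, `s` for $S_{\cdot g}(t)$, and `q` for $Q_g^{tot}(t)$:

```python
import math

def p_begin_refined(s_star, s, q):
    """Refined approximation: 1 - ((s* - S)/(s* - S + 1)) ** Q.
    Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1."""
    ratio = (s_star - s) / (s_star - s + 1)
    return 1.0 - ratio ** q

def p_begin_current(s_star, q):
    """The earlier approximation, 1 - ((s* - 1)/s*) ** Q, for comparison."""
    return 1.0 - ((s_star - 1) / s_star) ** q

# The four bullet cases above:
assert p_begin_refined(10, 8, 0) == 0.0                   # no jobs waiting
assert p_begin_refined(10, 10, 3) == 1.0                  # all servers idle
assert math.isclose(p_begin_refined(10, 8, 1), 1 / 3)
assert math.isclose(p_begin_refined(10, 8, 5), 0.8683, abs_tol=1e-4)
assert math.isclose(p_begin_current(10, 5), 0.40951)      # the 0.4095 figure
```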
Appendix D. APPROXIMATING WAITING TIME DISTRIBUTIONS

D.1 POSSIBLE TRANSITIONS BETWEEN CURVE ORDERINGS

Figure 27 shows the possible transitions between orderings for jobs whose DELAYs fall
in the same time slice. Each transition corresponds to the change in ordering from job k
to job k +1. It is possible to start in any of the states, and to transition from a state to
itself. It is also possible to move from the left to the right in Figure 27, but not from right
to left. The time values for L (Lower Time Slice Counter Method counter) and U (Upper
Time Slice Counter Method counter) are fixed for a given time slice. D (DELAY time) is
always between L and U, but it is possible for the relative positions of S (START time)
and D to change. Once S has moved to the right of L or U, it cannot move to their left
(for a given time slice).
Transitions between time slices can go from any ordering to any other ordering,
since the locations of L and U change for the new time slice. Possible transitions within
and between time slices are illustrated in Figure 17. The time slice between times 2ξ and
3ξ shows the transitions from SLDU to LSDU to LDSU to LDUS. The transition from
time slice 1 (indexing begins at 0) to time slice 2 gives a transition from LSDU to SLDU,
and the transition from time slice 2 to time slice 3 is LDUS to LSDU. Neither transition
is possible within a time slice.
The possible orderings and their transitions form a semi-Markov process. In Appendix D.2.7, we discuss the Markov property of this system. The times between transitions are random quantities that follow distributions $G_i$. It is likely that we cannot determine the $G_i$ in non-trivial cases. For the time being, we focus our discussion on the embedded Markov chain.
Figure 27. Possible transitions between orderings within a time slice

D.2 CLASSIFICATION OF UNCERTAIN JOBS

In this section, we discuss various ways of handling jobs whose DELAY and START events fall in the same time slice. We include simple probabilistic arguments why some methods perform more accurately than others. Jobs that find an idle server are not included in this analysis, as their delay is zero. That is, all probabilities given below are conditioned on $A_k \ne S_k$; we omit this notation.
There are two fundamental assumptions underlying the calculations for the
probabilities of correct classification. The first is that DELAY and START events are
equally likely to occur anywhere in the time slice (f(s) = f(d) = 1/ξ). This is true for
Exponential arrivals (DELAYS) as, conditioned on the number of events in a time period,
the occurrence times of these events have the same distribution as the order statistics of
the Uniform distribution. For other distributions, this is not true, though empirical
analysis has shown that the assumption has some support if we include no other
information (e.g., the number of DELAY events in the time slice). There was also little
experimental evidence to contradict f(s) = 1/ξ. For the time being, we will use this
assumption.
The second assumption is that the locations of D and S are independent of one
another without any other information (e.g., queue size, last start service time).
D.2.1 IGNORE UNCERTAIN JOBS

A simple approach is to ignore the jobs we cannot classify with certainty. This approach
gives good results if the distributions of the jobs that can and cannot be classified are the
same. If they differ, we are estimating the wrong distribution.
Experimentally, this approach did not work well. It performed consistently worse
than other methods described below. In the future, we would like to prove that the
distributions (of all jobs and those with known classifications) are different.
D.2.2 ALWAYS CLASSIFY AS (NOT) DELAYED

This approach biases the estimate of $F_W(\gamma)$ because the classification of all jobs whose DELAY and START events happen in the same time slice is not the same except in very special cases. For example, in a D/G/2 system with arrivals every time unit and a service time Uniformly distributed between 0.1 and 0.2 time units, all jobs are not delayed for any $\gamma \in [0.2, \xi)$.
Let $\xi$ be the time slice size, and let $i$ be the index of the time slice this D falls in, $i = \lfloor D/\xi \rfloor$. Figure 28 illustrates the relative positions of these quantities and labels the different time slice regions. The region between the beginning of the time slice and the START time (the current time) is A, and represents $\frac{S - i\xi}{\xi}$ of the total time slice. The line segment between the current time and the end of the time slice is B and is $\frac{(i+1)\xi - S}{\xi}$ of the time slice. We do not know the location of D, other than that $i\xi \le D < (i+1)\xi$.
Figure 28. Location of current time in time slice
If we choose to classify all jobs as delayed, the probability of a correct classification is:
$$
P\{\text{correct}\} = P\{\text{job delayed}\} = P\{D \le S\}
= \int_{i\xi}^{(i+1)\xi} P\{D \le s \mid S = s\}\, f(s)\, ds
= \int_{i\xi}^{(i+1)\xi} \frac{s - i\xi}{\xi} \cdot \frac{1}{\xi}\, ds
= \left[ \frac{(s - i\xi)^2}{2\xi^2} \right]_{i\xi}^{(i+1)\xi} = 0.5
\qquad (15)
$$
This is also the probability of a correct classification if all jobs are classified as “not
delayed.”
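The 0.5 figure in Equation (15) can be reproduced by direct simulation under the two assumptions above (D and S independent and Uniform on the time slice); a Monte Carlo sketch (ours, not the dissertation's code; all names are illustrative):

```python
import random

# D and S are drawn independently and Uniformly on slice [i*xi, (i+1)*xi);
# every job is classified "delayed", which is correct iff D <= S.
random.seed(1)
xi, i, n = 2.0, 3, 200_000
correct = 0
for _ in range(n):
    d = random.uniform(i * xi, (i + 1) * xi)
    s = random.uniform(i * xi, (i + 1) * xi)
    correct += d <= s
frac = correct / n  # close to 0.5, regardless of xi and i
```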
D.2.3 RANDOMLY CLASSIFY JOBS USING A FIXED PROBABILITY

We can classify uncertain jobs as delayed with a fixed probability p by flipping a coin with probability p for each job and basing our classification on the result of the coin flip. The probability of a correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\{D \le S,\; RND \le p\} + P\{D > S,\; RND > p\} \\
&= \int_{i\xi}^{(i+1)\xi} \left[ p \cdot \frac{s - i\xi}{\xi} + (1 - p) \cdot \frac{(i+1)\xi - s}{\xi} \right] \frac{1}{\xi}\, ds \\
&= \frac{p}{2} + \frac{1 - p}{2} = 0.5
\end{aligned}
\qquad (16)
$$
This is the same as the probability of always classifying a job as (not) delayed.
D.2.4 RANDOMLY CLASSIFY JOBS BASED ON THE CURRENT TIME SLICE LOCATION

A job is classified as delayed if D ≤ S. If we assume that D is equally likely to have occurred anywhere in the time slice, then the probability a job is delayed depends on S's relative position in the time slice. Specifically, $P\{\text{delayed}\} = P\{S > D\} = \frac{S - i\xi}{\xi}$. If a Uniform(0,1) random number is less than this probability, we classify the job as delayed. The probability of correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\left\{D \le S,\; RND \le \frac{S - i\xi}{\xi}\right\} + P\left\{D > S,\; RND > \frac{S - i\xi}{\xi}\right\} \\
&= \int_{i\xi}^{(i+1)\xi} \left[ \left( \frac{s - i\xi}{\xi} \right)^2 + \left( \frac{(i+1)\xi - s}{\xi} \right)^2 \right] \frac{1}{\xi}\, ds \\
&= \frac{1}{3} + \frac{1}{3} = \frac{2}{3}
\end{aligned}
\qquad (17)
$$
Allowing the probability of classification to depend on the current location in the time
slice increases our overall probability of correct classification from 0.5 (for a fixed
probability) to 0.67.
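The 2/3 figure in Equation (17) can likewise be checked by simulating a coin whose bias is the START event's relative position in the slice; a Monte Carlo sketch under the same assumptions (ours, not the dissertation's code):

```python
import random

# Classify "delayed" with probability (S - i*xi)/xi, i.e., S's relative
# position in the slice; the classification is correct iff it matches D <= S.
random.seed(2)
xi, i, n = 2.0, 3, 200_000
correct = 0
for _ in range(n):
    d = random.uniform(i * xi, (i + 1) * xi)
    s = random.uniform(i * xi, (i + 1) * xi)
    say_delayed = random.random() < (s - i * xi) / xi
    correct += say_delayed == (d <= s)
frac = correct / n  # close to 2/3
```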
D.2.5 USING EXPECTATIONS

This is similar in spirit to the idea proposed in Appendix D.2.4. Rather than estimating $P\{\text{job delayed}\}$ by Equation (9), we estimate it by
$$
P\{\text{delay} \le \gamma\}
= \frac{\sum_{\text{jobs } k} \text{expected relative bin location of job } k}{\text{total number of jobs}}
= \frac{1}{N_D(t^*)} \sum_{k=1}^{N_D(t^*)} \frac{S_k - \lfloor S_k/\xi \rfloor\, \xi}{\xi}
\qquad (18)
$$
This yields the same results as the method from Appendix D.2.4.
This approach can be refined to use the expected order statistics rather than the
current location in the time slice. This uses additional information (the counter values) in
making the estimates, and will be considered as an extension in Appendix D.5.4.
D.2.6 DETERMINISTICALLY CLASSIFY JOBS DEPENDING ON CURRENT LOCATION

This approach classifies jobs as “delayed” if their START occurs near the end of the time slice, specifically if it is in the second half of the time slice: $S > m$, $m = i\xi + \frac{\xi}{2}$. This is similar to the approach from Appendix D.2.3 with p = 0.5, but eliminates the additional randomness introduced by the coin flip. The probability of a correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\{D \le S,\; S > m\} + P\{D > S,\; S \le m\} \\
&= \int_{m}^{(i+1)\xi} \frac{s - i\xi}{\xi} \cdot \frac{1}{\xi}\, ds + \int_{i\xi}^{m} \frac{(i+1)\xi - s}{\xi} \cdot \frac{1}{\xi}\, ds \\
&= \frac{3}{8} + \frac{3}{8} = \frac{3}{4}
\end{aligned}
\qquad (19)
$$
Removing the randomness has increased the probability from 0.67 to 0.75. Figure 29 shows the functions over which we integrate to get the probabilities. In the legend, $b = \xi$ and $A = \frac{s - i\xi}{\xi}$. “Deterministic” refers to the integrand from Section D.2.6, “random” to the integrand in Section D.2.4.
Figure 29. Probabilities integrated over
The unbolded lines are the individual functions and the bold lines are the sums of the two
functions in the deterministic and random cases. Two things are illustrated in this graph.
The first is that randomness causes the parabolic shape in the integrand because we square the terms we wish to integrate over. In other words, the random case integrates over $\left( \frac{s - i\xi}{\xi} \right)^2 f(s)$ rather than just $\frac{s - i\xi}{\xi} f(s)$. The deterministic case has a factor of $1\{s > i\xi + \frac{\xi}{2}\}$, which acts as an indicator function and is either 0 or 1. Therefore, we are actually integrating over only one term in either half of the time slice for the
deterministic case. The random case has contributions from both terms throughout the
time slice.
The second thing we observe is that $m = i\xi + \frac{\xi}{2}$ is the point that maximizes the value of our integral in the deterministic case. Any point $m' \ne m$ would reduce the area under the function. This is illustrated in Figure 30. Choosing $m'$ rather than $m$ requires us to integrate over the solid bold line. This reduces the value of the integral by the area of the shaded triangle; this holds for any $m' < m$ and, by symmetry, for any point $m' > m$.

Figure 30. Area lost by using a point other than $m = i\xi + \frac{\xi}{2}$ as a cutoff
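The geometric argument above can also be made algebraically: for an arbitrary cutoff at $i\xi + a\xi$ with $a \in [0,1]$, carrying out the two integrals of Equation (19) gives $P\{\text{correct}\} = \frac{1}{2} + a - a^2$, a parabola maximized at $a = \frac{1}{2}$. A small sketch (ours, not the dissertation's):

```python
def p_correct(a):
    """P{correct} for the deterministic rule with cutoff at i*xi + a*xi:
    the integral of u over [a, 1] plus the integral of (1 - u) over [0, a],
    which simplifies to 1/2 + a - a**2."""
    return 0.5 + a - a * a

assert p_correct(0.5) == 0.75                     # the mid-slice cutoff
assert p_correct(0.0) == p_correct(1.0) == 0.5    # degenerate cutoffs
# Any other cutoff loses probability, as Figure 30 illustrates:
assert all(p_correct(0.5) >= p_correct(k / 100) for k in range(101))
```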
D.2.7 HIDDEN MARKOV MODELS

Hidden Markov Models (HMMs) are a type of model usually studied in the machine learning/artificial intelligence communities (Jordan et al. 1999; Jordan 2003). In HMMs, the system finds itself in multinomial state $X_i$ after the $i$th transition. In our case, $X_i$ is a 4-dimensional vector with each element representing one of the four states in Table 5.
The assumption is that, given $X_{i-1}$, the probability of $X_i$ is independent of the preceding states.
The models are “hidden” because we cannot observe $X_i$, i.e., $N_{LD}(t)$ in this example. Instead, we have observations $Y_i$ with values in the set $\{LSU, LUS, SLU\}$. Figure 31 shows the possible mappings of $X_i$'s to $Y_i$'s.
Figure 31. Mapping of hidden states to observable outputs of the states
It may be necessary to augment the state by including a Boolean indicating whether the
current observation is in the same time slice as the previous one; the transition
probabilities change depending on whether we are transitioning within or between time
slices. The value of the Boolean is observable using general information from region I.
The additional time slice information would make estimating transition probabilities
easier because we know that certain probabilities are zero.
The advantage of this HMM is that the probability of observing $Y_i$ given $X_i$ is 0 or 1. That means that only the transition probabilities between the $X_i$, $P\{X_{i+1} \mid X_i\}$, not the $P\{Y_i \mid X_i\}$, need be estimated.
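Under one natural reading of Figure 31 — the observation is the hidden ordering with D removed — the deterministic emission structure can be written down directly. A sketch (ours; the mapping is our reconstruction of the figure):

```python
# Each hidden ordering maps to the observable ordering with D dropped,
# so P{Y_i | X_i} is 0 or 1 and never needs to be estimated.
EMISSION = {"SLDU": "SLU", "LSDU": "LSU", "LDSU": "LSU", "LDUS": "LUS"}

def emission_prob(y, x):
    return 1.0 if EMISSION[x] == y else 0.0

assert emission_prob("LSU", "LSDU") == 1.0
assert emission_prob("LSU", "LDSU") == 1.0   # two hidden states map to LSU
assert emission_prob("SLU", "LDUS") == 0.0
assert set(EMISSION.values()) == {"SLU", "LSU", "LUS"}
```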
The HMM is a way of approximating delay probabilities in the cases where we do
not know whether a job was delayed. How to appropriately train the model without
generating many runs with a full job trace is an open question. Generally, machine
learning problems have large amounts of training data to allow the model to refine its
estimates. We are trying to avoid gathering large amounts of data, although it may
become necessary to do so if we wish to experiment with HMMs.
D.2.8 COMPARISON OF UNCERTAIN JOB CLASSIFICATION

The main competitor for the deterministic classification described in D.2.6 is the random classification described in D.2.4. In this section, we compare the approximation errors
for the two approaches. The deterministic approach reduces the estimation error
noticeably and consistently. The experiments run are those listed in Table 9, along with
the systems with doubled and halved rates.
Only in very few cases was there no discernible difference between the two
classification methods. These are the cases where the error was already extremely
insignificant. Figure 32 shows the difference in estimation errors for the M/U/1 system
with an average interarrival time of 0.55 and an average service time of 0.5. This system
is the one (of the 54 run, see Table 9) where the random classification has the best performance compared to the deterministic classification. The difference in error is $error_{random} - error_{deterministic}$.
The lines with the triangles correspond to the runs with a time slice size of 1 (twice the average service time), the lines with the squares to time slice size 2, and the lines with the diamonds to time slice size 3.
Figure 32. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/U/1 implementations
Figure 33 shows the errors for the system where the improvement is the largest, the
U/U/1 system with average interarrival 0.55 and average service time 0.5.
As the average interarrival and service times increase, the improvement becomes
less pronounced. The average errors for each of the 54 runs were always positive (if
extremely close to zero). Figure 34 shows the differences for an M/B/1 system with an
average interarrival time of 3 and an average service time of 2.
Figure 33. Difference in estimation errors for the random and deterministic Time Slice Counter Method U/U/1 implementations
Figure 34. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/B/1 implementations
D.3 IMPLEMENTATION DETAILS

D.3.1 CLASSIFYING JOBS BASED ON COUNTERS

Unlike the discussions in Sections 4.5.3 and 4.5.4.1, the implementation uses vertical time slices, not horizontal job characteristics. In all cases, the count comparisons are done after $N_S(t)$ has been incremented.
• $N_S(t) \le N_U(t)$: The START event occurs in a later time slice than the actual DELAY event; the job is delayed.
• $N_S(t) > N_L(t)$: The START event occurs in an earlier time slice than the actual DELAY event; the job is not delayed.
• $N_U(t) < N_S(t) \le N_L(t)$: The START event occurs in the same time slice as the actual DELAY event; we cannot tell whether the job is delayed. Based on experimentation and the analyses in Appendix D.2, we classify jobs deterministically based on the relative position of the START event in the time slice (Appendix D.2.6).
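The three rules above can be collected into a single classification routine; a sketch (ours, not the dissertation's implementation; the names are illustrative):

```python
# `n_s`, `n_l`, `n_u` are N_S(t), N_L(t), N_U(t), with N_S(t) already
# incremented; `rel_pos` in [0, 1) is the START event's relative position
# in its time slice, for the deterministic rule of Appendix D.2.6.

def classify_delayed(n_s, n_l, n_u, rel_pos):
    if n_s <= n_u:
        return True        # START in a later slice than DELAY: delayed
    if n_s > n_l:
        return False       # START in an earlier slice: not delayed
    # N_U(t) < N_S(t) <= N_L(t): same slice; cut at mid-slice
    return rel_pos > 0.5

assert classify_delayed(3, 5, 4, 0.1) is True
assert classify_delayed(6, 5, 4, 0.9) is False
assert classify_delayed(5, 5, 4, 0.6) is True    # uncertain, late in slice
assert classify_delayed(5, 5, 4, 0.4) is False   # uncertain, early in slice
```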
D.3.2 JOBS FINDING AN IDLE SERVER

The addition of a Boolean variable in region I.2.i to indicate whether a job can begin service immediately upon arrival increases the accuracy of the Time Slice Counter Method. If there is an idle server available for the job, the job's delay is zero and its wait is less than γ for any γ > 0.
D.3.3 UPDATING COUNTERS

The way in which the time slice counters are updated has a large impact on the computational requirements of the Time Slice Counter Method. Incrementing the counters as described in Section 4.5.2.1 has complexity O(NΞ) for each γ; each of the N jobs causes a traversal of the array of Ξ counters. For long and/or congested runs, the simulation will slow dramatically. We show we can update counters in O(Ξ) time per γ.
Claim 2 We can update and maintain correct time slice counts for one delay value γ in
O(Ξ) time.
Proof: Our proof is illustrated using Figure 35, which shows the relationships
between times for updating. We can update the array by performing one addition
for each job, which occurs in constant time. The speed of the algorithm is
therefore independent of the number of jobs simulated. We show we pass through
the array of time slice counters exactly once in our updating steps, so the
complexity is linear in the size of the array.
1. At time t, DELAY counters for times before t + γ are no longer incremented. Counters after t + γ need not yet be incremented because $N_S(t)$ is compared to the counter for time t. For example, in Figure 35, if the current time is $t_1$, we increment the counter for the time slice for time $t_1 + γ$, time slice i. We will no longer be incrementing time slice counters for j < i. Furthermore, we need not increment later counters (k > i), since we have not yet reached them. $N_S(t_1)$ is compared to $\Delta_{\gamma, i-4}$.
2. If time progresses to t', counters between t + γ and t' + γ can be set to the value of the counter at t + γ by the argument in 1. This is true whether the event at t' is an arrival or a start service. For example, if time in Figure 35 advances to an arrival at $t_2$, the DELAY occurs in time slice i+3, and there will be no more changes to the counts for time slices i, i+1, and i+2. The values for these time slices can be set to the current value of time slice i, $\Delta_{\gamma, i}$. If, instead, time has advanced to a start service event at time $t_3$, we perform the update step setting the values of time slices i+1 and i+2 to the current value of time slice i. In fact, time slices i through i+6 (corresponding to $t_3 + γ$) can be updated. ■
Figure 35. Illustration of relationships of times for updating time slices
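The lazy copy-forward scheme in the proof of Claim 2 can be sketched as follows (our illustration, not the dissertation's code; the class and variable names are ours). The `front` index only moves forward, so the array of Ξ counters is traversed exactly once over the whole run, as claimed:

```python
class LazyDelayCounters:
    """`counts[j]` plays the role of the DELAY counter for time slice j;
    `front` is the last slice brought up to date."""

    def __init__(self, num_slices, xi, gamma):
        self.counts = [0] * num_slices
        self.xi, self.gamma = xi, gamma
        self.front = 0

    def _advance(self, t):
        # Copy the finalized value forward to the slice containing t + gamma;
        # slices beyond it are left untouched until time reaches them.
        target = int((t + self.gamma) / self.xi)
        while self.front < target:
            self.front += 1
            self.counts[self.front] = self.counts[self.front - 1]

    def delay_event(self, t):
        self._advance(t)
        self.counts[self.front] += 1

c = LazyDelayCounters(10, 1.0, 2.0)
c.delay_event(0.5)   # t + gamma = 2.5 falls in slice 2
c.delay_event(1.5)   # t + gamma = 3.5 falls in slice 3
assert c.counts[:5] == [0, 0, 1, 2, 0]   # slice 4 not yet copied forward
```

Each event does a constant amount of work plus however far `front` moves, and `front` can move at most Ξ steps in total.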
D.3.4 INITIALIZATION BIAS

Currently, nothing has been done to deal with initialization bias. (Queues are initialized
to zero and servers as idle.) Typical approaches include initializing the queue to its
steady-state value, or truncating the output at a point when the user feels the system has
“warmed up.” The problem with the first is that we must know the steady-state queue
size a priori, and may have to do trial runs (themselves subject to initialization bias) to
determine it. For methods dealing with initialization bias, see (Welch 1981;
Schruben and Kulkarni 1982).
The problem with the second approach (truncation) is a question relevant to the Time Slice Counter Method itself: How does one “truncate” information from the Time Slice Counter Method? If we do not begin incrementing $W_\gamma(t)$ until some time t', the counters we use to classify jobs will have been affected by the jobs that arrived before t'. To obtain the correct probability estimate at the end of the simulation, we must divide the random variable $W_\gamma(t^*)$ by the random variable $N_S(t^*) - N_S(t')$. We must prove that this estimate would be accurate, and that it would overcome initialization bias.
Another possibility is to not increment any counters until time t'. This will introduce a different initialization bias, as we will begin incrementing $N_L(t)$ and $N_U(t)$, but not be able to increment $N_S(t)$ until we reach job k, where job k is the first job to have had $A_k \ge t'$.
How to deal with initialization bias is an open problem and is the subject of future
research.
D.3.5 SELECTING PARAMETER VALUES

Guidance is required on selecting the number and values of the $\gamma_i$, and on selecting a time slice size ξ.
The selection of the $\gamma_i$ is application-dependent. In some cases, special values
of γ are of interest. In cases where the Time Slice Counter Method is being used to
estimate the waiting time distribution, the modeler must use knowledge about the system
and perhaps intuition to select values. This knowledge may be acquired through trial
runs. Other possible approaches we will investigate in the future include doing
preliminary runs to estimate bounds on γ; bootstrapping; and performing transformations
on γ to create a bounded distribution. The latter is an attractive option if we are able to find an appropriate transformation (e.g., one that works for any $F_W(\cdot)$). For a method
using both bootstrapping and transformations to estimate a (bootstrap) confidence
interval on a “parameter of interest,” see (Tibshirani 1988). In Appendix D.5.2, we
discuss ideas for adding γ values during the run.
In Section 4.5.5, we show that a value of ξ ≈ average service time resulted in estimation errors of around 1 percentage point or less for single-stage queueing systems in most cases. Since the main cost of ξ is memory, smaller values can be chosen with little negative effect on speed. Although not implemented here, it is possible to choose different values of ξ for different $\gamma_i$. Estimates in the right tail of the distribution appear highly accurate, so larger values of ξ for these γ can reduce memory requirements. Other efficiencies can be gained using a circular array of size $O(\gamma/\xi)$.
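One way to realize the circular-array idea is sketched below (ours, not the dissertation's code; the exact size bookkeeping, $\lceil \gamma/\xi \rceil + 1$ entries, is our assumption). Only slices within γ of the current time can still change, so a small array indexed modulo its length can replace one entry per time slice:

```python
import math

class CircularSlots:
    def __init__(self, xi, gamma):
        # Assumed sizing: enough physical slots to cover [t, t + gamma].
        self.size = math.ceil(gamma / xi) + 1
        self.counts = [0] * self.size

    def slot(self, slice_index):
        # A slot is reused once its old slice can no longer change.
        return slice_index % self.size

ring = CircularSlots(xi=0.5, gamma=2.0)
assert ring.size == 5
assert ring.slot(7) == ring.slot(12) == 2   # slices 7 and 12 share a slot
```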
D.4 MULTIPLE-SERVER TANDEM QUEUEING SYSTEMS

D.4.1 EXPERIMENTATION

In this section, we show the results of our experiments for job cycle-time estimation in n-stage tandem queueing systems. We have results for n = 2 and n = 10, for both FCFS and last-come-first-served (LCFS) disciplines. To estimate cycle times, we compare DELAY to (final) FINISH events. Cycle times are information in region III.2.ii.b.
When there are multiple servers at one or more stages, it is possible for job k to leave a stage or the system before job j, even though job j arrived before k: $A_k > A_j$, $F_{k,i} \le F_{j,i}$ for i = 1, …, n. This phenomenon is known as “overtaking.”
The simplest case for multiple-server queues is having 2 servers at each stage,
which is what we do here; this situation also minimizes the opportunity for overtaking.
The service distributions and rates are the same at each of the n stages in our
experiments.
In the following graphs, we show only the results for time slice size 1. Larger
time slice sizes performed more poorly. Even small time slice sizes can result in errors if
overtaking is present. In a sense, these experiments test how big an impact the
overtaking has, i.e., how much the job ordering is shuffled during the jobs’ stay in the
system.
Up to four different scenarios were simulated for each of the following
arrival/service distribution combinations: M/M/⋅, M/U/⋅, U/U/⋅, M/B/⋅, and B/B/⋅.2
Table 13 lists the scenarios. The first two are the same as in the single-stage experiments.
Since we have two servers at each stage, these are very lightly-loaded systems. The third
and fourth systems are more congested. In the 2-stage runs, the γ are the same as in
Section 4.5.5. In some of the 10-stage runs, the values are multiplied by 5 to account for
the longer time in system. (The new γ are therefore 0.4, 0.8, 1.2, 1.6, 2.0 and so forth.)
2 The Beta variable ranges for Scenarios 3 and 4, respectively, are (0, 9) and (0, 10).
Table 13. Experiments for multiple-server tandem queueing systems

Scenario   arrival rate   service rate   rho
1          0.6667         1              0.3333
2          0.9091         1              0.4546
3          0.9091         0.5556         0.8181
4          0.9091         0.5            0.9091
Figure 36 shows the estimation errors (as functions of $F_W(\gamma)$) for a 2-stage M/M/2
system under all four scenarios for time slice size 1. The largest error for the estimates in
Figure 36 is around 7 percentage points. For scenarios 3 and 4, the γ did not cover the
complete distribution. For the lower tail of the distribution, our estimates are good. This
is positive, as the left tail was often prone to larger errors than the remaining distribution
for single-stage systems. The errors also become smaller as we approach the right end of
the distribution. We underestimate the variability for the first two scenarios.
In Section 4.5.5, we observed that more congested and variable systems had
smaller errors than the others. Figure 37 shows the errors for the 2-stage M/B/2 system
with time slice size 1. The errors are sizeable for all four scenarios, although they get
smaller as we reach the right end of the distribution. We believe the large errors are
because the variability is causing more overtaking to occur. We underestimate the
variability, especially in the first two scenarios.
Figure 38 shows the estimation errors for the 2-stage M/U/2 system with time
slice size 1. The estimation errors are relatively small compared to those for the M/B/2
system. They compare favorably with the errors shown in Figure 22. We believe the
Time Slice Counter Method performs better in the tandem queueing system because the
job’s stay in the system is longer and there are two stages, which increases the variability
a job is subjected to. Overtaking does not appear to be as big a problem as in the other
tandem queueing systems, perhaps because the service times are not too variable.
Figure 36. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and service times
The same four scenarios outlined in Table 13 were run for the 10-stage tandem queueing
network. We have the same service discipline and rates at each of the 10 stages, and 2
servers at each of the stages.
Figure 39 shows the estimation errors for the M/M/2 system. Scenarios 1 and 2
perform extremely poorly, while scenarios 3 and 4 do well. Their errors are smaller than
those in the 2-stage case.
Figure 40 shows the errors for the estimated cycle time distribution for the M/B/2
system. The errors for the more heavily-loaded systems (Scenarios 3 and 4) are smaller
than those for the lightly-loaded systems, and smaller than the errors for the 2-stage
tandem queue.
Figure 37. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Beta service times
Figure 38. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Uniform service times
Figure 39. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times
Figure 40. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Beta service times
Figure 41 shows error results for the first two scenarios in a 10-stage M/U/2 system. All
errors are less than 1 percentage point in absolute value. These errors are far smaller than
those for the 2-stage system. We summarize and discuss our conclusions from the
experiments in the next section.
Figure 42 shows the estimation errors for the systems in Figure 39, but with a
LCFS, not FCFS, service discipline at each stage. As the congestion in the system
increases (Scenarios 3 and 4), the estimation errors become huge (almost 50%). For the
more lightly-loaded systems, the errors are not significantly different from the FCFS
errors. These results are
expected, since a congested LCFS system experiences a great deal of job reordering.
Figure 41. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Uniform service times
Figure 42. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times, and a LCFS service protocol
D.4.2 DISCUSSION
Our conclusions on estimating cycle times are slightly different for the n-stage tandem
queue with multiple servers than in the single-stage queueing system. The estimation
problem is more complicated because of overtaking.
The Time Slice Counter Method is not guaranteed to give the correct answer
when overtaking is present, even as the time slice size gets arbitrarily small. We can
ensure that the ith DELAY and ith FINISH events fall in different time slices, but it is
possible that they do not correspond to the same job. There was virtually no difference in
the errors in experiments for time slice sizes 1, 0.5, and 0.1 with the four M/M/2 scenarios
for the 10-stage tandem queues.
Congestion and variability do not increase the accuracy of the estimates with
overtaking. Specifically, if there is much service time variability, the odds are increased
that one job will overtake the other job. This explains why the results for the 2-stage
M/U/2 system are better than those for the M/M/2 system, which are better than those for
the M/B/2 system. We observe these phenomena in both the 2-stage and 10-stage tandem
queueing systems. This leads to the following claim, stated without proof:
Claim 3 In an n-stage tandem queueing system with multiple servers at one or more of the
stages, increased variability of service times leads to overtaking, which leads to
less accurate cycle time estimates by the Time Slice Counter Method.
If we compare the same system types (e.g. M/M/2 under different scenarios), the more
congested runs still tend to perform better than the more idle runs. We believe that this is
because congestion allows DELAY and FINISH events to spread out more. Overtaking
does not appear to become more pronounced when the system gets more congested
(changing arrival/service rates, not distributions). Perhaps it would become more
pronounced if the number of servers were increased (while keeping ρ constant). On the
other hand, perhaps overtaking is driven more by the underlying system (e.g. M/M/2
versus M/B/2) than by the congestion in the system. We will research this in future work.
These observations lead to an important distinction between congestion and
variability. Variability is caused by the distributions used, while congestion is related to
the distribution parameters. If we use C_D^2 as our measure of variability in the system and
ρ as our measure of congestion, Table 9 shows that we can have both high variability and
low congestion (e.g., B/B/1 with mean arrival of 1.5) and low variability with high
congestion (e.g., U/U/1 with mean arrival of 1.1). More congested systems may lead
individual jobs to experience more variability than they would for more idle systems,
because longer busy periods mean the job’s behavior is affected by a larger number of
other jobs’ behaviors.
The errors for the 10-stage system are never greater than those for the 2-stage
system. They can even be substantially smaller than in the 2-stage case. We hypothesize
that this is because the greater number of stages allows jobs that were overtaken to regain
their position farther ahead in the system (mixing of jobs):
Claim 4 As the length of a tandem queueing system increases, the cycle-time estimation
error will not increase. It may decrease.
D.5 EXTENSIONS
In this section, we discuss extensions to the Time Slice Counter Method. They address
some of the problems of the method. In all cases, we are adding information to the
simulation model.
D.5.1 RUN-TIME ERROR ESTIMATION
A disadvantage of the Time Slice Counter Method is that we do not know the accuracy of
the method beyond the guidelines given in Section 4.5.6. To ensure small errors, we
must do a full trace simulation of the same system (or know the analytic solution) and
compare the estimates. In doing so, we have defeated the purpose of the approximation.
In (Schruben and Yücesan 1988), the authors propose transaction tagging, a
method of gathering job statistics by tracking only a fraction 1/k of all jobs. They are
motivated by memory limitations in a process interaction-based simulation package.
They determine the fraction of jobs that should be tagged to minimize the probability of
the program aborting before a sufficient number of jobs have been simulated.
Another advantage is that the effective sample size for congested systems can be
significantly smaller than the actual sample size (Bayley and Hammersley 1946).
Transaction tagging reduces the inefficiencies of tracing jobs that do not provide much
additional information.
Transaction tagging relies on a feature of the software used: The software tracks
every job, but the user has the option of not tracking all attributes for all jobs. It is
unlikely that this is the case for all software packages. In the context of the information
taxonomy, a subset J′(t) of all jobs is tracked completely, while only a minimal
amount of information is maintained on the remaining jobs J(t) \ J′(t).
In resource-driven models, we only maintain counts of the numbers of jobs at
different processing stages. Information on the jobs’ ordering in queue is lost. If we
track a fraction of the jobs to get trace information, we must know the queue location of
these jobs. The immediate answer is to have a counter for the number of jobs in queue
before the tagged job. If there is another tagged job in queue, we track the number of
jobs between the two. This is information in region I.
In determining the number of jobs to tag, there are two possibilities. We can fix
the number in a manner similar to the limited amount of work-in-progress (WIP) in a
CONWIP (CONstant WIP) production system (Hopp and Spearman 1991). An
advantage of this approach is that memory can be allocated at the beginning of the run,
rather than dynamically during the run. Disadvantages include the need to decide on a
limit before the run, and the possibility that all tracked jobs will converge in one part of
the system, leaving us without information on all other parts of the system.
The second possibility does not limit the number of jobs we are tracking at any
one point in time; we may specify which fraction of all jobs to tag. This approach grows
impractical as the number of tagged jobs increases. We need an unknown number of
counters for the unknown number of tagged jobs in queue. As the proportion of tagged
jobs approaches one, we need as many counters as there are jobs.
To solve this problem, we can add a counter to the job data structure in a
job-driven simulation. This data structure can then be used for the tagged jobs in the
resource-driven simulation. In this example, we assume each job is represented as an element in a
linked list. Each station has a counter for the total jobs waiting in queue as in a normal
resource-driven simulation, and a linked list of tagged jobs. This approach is equivalent
to tracing a subset of the jobs globally.
Each tagged element has a counter in its associated data structure. If it is the first
tagged job in queue, the counter will indicate how many untagged jobs are in queue in
front of the tagged job. If it is not the first job, the counter will indicate how many
untagged jobs are between the current tagged job and the one in front of it. This is
illustrated in Figure 43. There are 10 jobs in queue, 2 of which are tagged. They are
highlighted with stars. Below the queue is the linked list with the two tagged jobs and the
associated counts.
Figure 43. Example of job tagging
The advantage of this implementation is that, when a job begins service, we need only
decrement two counters, the global queue counter and the counter for the first tagged job
in queue. The relative positions of the tagged jobs in queue do not change, so we need
not make any changes to the counters for the other tagged jobs. When a job arrives to the
queue, we increment the global queue counter. If it is a tagged job, we must additionally
set its counter.
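The counter updates described above can be sketched in code. The following is an illustrative Python sketch, not the dissertation's implementation; all names (`TaggedQueue`, `behind_last`, and so on) are ours, and a deque stands in for the linked list of tagged jobs:

```python
from collections import deque

class TaggedQueue:
    """FCFS queue that tracks only tagged jobs individually.

    Untagged jobs are represented purely by counters; each tagged job
    stores the number of untagged jobs immediately ahead of it (i.e.,
    between it and the previous tagged job, or the front of the queue).
    """

    def __init__(self):
        self.total = 0          # global queue counter
        self.tagged = deque()   # [job_id, count_ahead] pairs, front first
        self.behind_last = 0    # untagged jobs behind the last tagged job

    def arrive(self, job_id=None):
        """Enqueue a job; pass a job_id to tag it."""
        self.total += 1
        if job_id is not None:
            # New tagged job: everything since the last tagged job is ahead of it.
            self.tagged.append([job_id, self.behind_last])
            self.behind_last = 0
        else:
            self.behind_last += 1

    def start_service(self):
        """Dequeue the front job; returns its id if it was tagged, else None."""
        assert self.total > 0
        self.total -= 1
        if not self.tagged:
            self.behind_last -= 1
            return None
        if self.tagged[0][1] > 0:       # an untagged job departs
            self.tagged[0][1] -= 1
            return None
        return self.tagged.popleft()[0]  # the first tagged job itself departs
```

Replaying the arrival pattern of Figure 43 (four untagged jobs, tagged job 17, two untagged, tagged job 29, two untagged) yields the global count of 10 and the per-tagged-job counts of 4 and 2 shown there.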
We can increase the computational efficiency of this implementation by storing
an additional (global) counter at each queue containing the number of jobs behind the last
tagged job in queue.
With the ability to accurately (and quickly) track individual jobs, we are able to
assess the accuracy of the Time Slice Counter Method in real-time. If we are not using
dynamic delay values (values of γ added during the run, see Appendix D.5.2), we can do
this by counting the number of (tagged) jobs that have been delayed at most γ time units,
for every γ. We then compare the estimates obtained using the Time Slice Counter
Method and those obtained using the tagged jobs. In addition, we have rough estimates
of F_W using the tagged job delays.
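This comparison could be carried out as in the following sketch (names are illustrative; `tsc_estimates` stands for the Time Slice Counter Method's estimates at the same γ values):

```python
def empirical_fw(delays, gammas):
    """Empirical P(W <= gamma) from the exact delays of the tagged jobs."""
    n = len(delays)
    return [sum(d <= g for d in delays) / n for g in gammas]

def max_abs_error(tsc_estimates, delays, gammas):
    """Worst-case disagreement between the two estimators over the gammas."""
    tagged = empirical_fw(delays, gammas)
    return max(abs(a - b) for a, b in zip(tsc_estimates, tagged))
```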
[Figure 43 detail: tagged job ID 17 (count 4, joined queue at time 22.19); tagged job ID 29 (count 2, joined queue at time 24.37); global queue info: Q = 10, Tagged = 2.]
If we use dynamic delay values, we still may be able to assess the accuracy of the
Time Slice Counter Method. Fewer jobs are used in assessing the accuracy for those
values of γ added later. If it is difficult to simulate jobs, or we do not have the resources
to simulate many more jobs, it is desirable to keep the exact information on all the tagged
jobs. To do so, we could write the job waiting and cycle times to a file upon service
completion. This additional information will come at a computational cost when we want
to evaluate a new probability. Data must be read from the file, and we calculate the
desired new probabilities in O(N/k) time, where N is the total number of jobs simulated
and 1/k is the tagged fraction. Nonetheless, it is possible to get probability
estimates for dynamic values of γ during the simulation run. Implementation details are
beyond the current scope and will be discussed in future work.
Even though we can estimate F_W(γ) from the exact waiting and cycle times of the
tagged jobs, the Time Slice Counter Method remains valuable: it gives accurate,
computationally efficient estimates of F_W(γ) for FCFS systems without
overtaking. It does so without allocating memory during the run for new jobs, and
without pointer manipulations. It also uses information on all jobs, not just a fraction of
the jobs, so we need not concern ourselves with which fraction of jobs should be tagged,
or with how to store and process the job information we obtain from the tagging.
Because they are based on all jobs, the estimates from the Time Slice Counter Method
also have lower variance than estimates of F_W(γ) based only on tagged job information.
D.5.2 DYNAMIC DELAY VALUES
A second drawback of the Time Slice Counter Method is that we are forced to pick γ
values before we know anything about the system (other than the arrival and service rates
and distributions). In some cases, we miss significant portions of the distribution. In
Figure 36, we have estimates only up to 40% for one of the systems and 80% for another.
We do not have any data on the upper tail of the distribution; this upper tail is often the
interesting section because it contains the events that are most likely to be harmful
(unacceptably long waiting times, catastrophic downtimes, etc.). In many of the other
systems studied, over half of the γ values have an associated F_W(γ) of 1. This is an
inefficient use of study resources.
The final disadvantage of picking γ’s before doing the simulation is that we do not
know on which sections of the distribution to focus. We have seen instances where
F̂_W(γi) = 0.3 and F̂_W(γi+1) = 0.7. Since there is a large difference between the two, it
would be worthwhile to add granularity between γi and γi+1 to have more information in
that area.
The obvious solution to these problems is to allow values of γ to be added or
deleted during the run. If we stop collecting statistics on a value of γ, we may reuse its
allocated memory for new values of γ. The resulting termination-bias problem is the
same one faced by the simulation as a whole.
Adding γ values raises the following questions:
1. When do we decide to add values?
2. How do we decide that we need more values?
3. How many new values do we need?
4. Which new values should we choose?
5. Is the current simulation run long enough to ensure good estimates for the new
values?
The first question requires further research. We want to make the decision once we are
confident that our current estimates are “accurate enough,” which may be situation-
dependent. For example, we may wish to set a desired half-width for a 95% confidence
interval on the estimate, or to specify a certain number of simulated jobs before we
evaluate. Since we are not ending the simulation run at this time, our estimates need not
have achieved the level of accuracy required of the final answer.
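A half-width rule for deciding when an estimate is "accurate enough" might look like the following sketch. Note that it treats the observations as roughly independent, which successive waiting times are not, so batching or a similar technique would be needed in practice; the function names and the default target are our own:

```python
import math

def halfwidth_95(p_hat, n):
    """Approximate 95% confidence-interval half-width for a probability
    estimate p_hat based on n (roughly independent) observations."""
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

def accurate_enough(p_hat, n, target=0.02):
    """Decide whether the estimate meets the desired half-width."""
    return halfwidth_95(p_hat, n) <= target
```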
The answer to the second question is situation-dependent. The user may specify
how many points on the distribution should be estimated, or that (s)he wants a point for at
least every 10% increase in probability, i.e., γ1 = F̂_W⁻¹(0.1), γ2 = F̂_W⁻¹(0.2), etc. If the
current values of γ are not fulfilling this requirement, more must be added. Further
research is required.
The third and fourth questions are tied to the second question. If there is a large
gap between two successive values, we need to add an appropriate number of
appropriately-chosen new values. The simplest answer is to pick a number proportional
to the gap between F_W(γi) and F_W(γi+1), and to space the new γ’s evenly between γi and
γi+1. More research is necessary to address this.
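This simplest rule, with the number of new points proportional to the gap in the estimated CDF, can be sketched as follows (the function name and the `max_gap` parameter are our own):

```python
import math

def refine_gammas(gammas, f_hats, max_gap=0.1):
    """Insert evenly spaced gamma values wherever the estimated CDF
    jumps by more than max_gap between successive gammas.

    gammas  -- sorted gamma values
    f_hats  -- estimated F_W at each gamma
    """
    out = []
    for i in range(len(gammas) - 1):
        out.append(gammas[i])
        gap = f_hats[i + 1] - f_hats[i]
        # Number of new points is proportional to the CDF gap.
        k = max(0, math.ceil(gap / max_gap) - 1)
        step = (gammas[i + 1] - gammas[i]) / (k + 1)
        out.extend(gammas[i] + j * step for j in range(1, k + 1))
    out.append(gammas[-1])
    return out
```

For the example above, with F̂_W = 0.3 at γ = 1 and F̂_W = 0.7 at γ = 2, the rule inserts three new values at 1.25, 1.5, and 1.75.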
The final question asks whether the remaining simulation time is long enough to
give us accurate estimates of F_W(γ) for the new values of γ. We will have an idea of
how much time, or how many jobs, are required to achieve the desired level of accuracy
from our initial run control strategy, and from the existing observations. Additional runs
may be required.
D.5.3 DYNAMIC TIME SLICES
Theorem 2 states that the Time Slice Counter Method can be made arbitrarily accurate for
a single-stage FCFS queueing system. As outlined in Appendix D.5.1, we can tell during
the simulation run how accurately we are able to estimate the job-driven F_W(γ). If our
estimates are inaccurate, we can increase the resolution of our time slices.
For example, we can double the number of time slices by halving the time slice
size. We either assign both time slices the same value as the original time slice, or we
can try to “smooth” the values. Figure 44 shows an example of this. The original time
slice counts are given on the top line. The bottom line shows the new time slice counts.
If there is a difference between two successive time slices, the counts for the new time
slices are roughly in the middle of the two original ones. This follows the assumption
that the DELAYs are uniformly distributed in the time slice.
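The doubling-with-smoothing step can be sketched as below; the integer midpoint reproduces the counts shown in Figure 44 (the function name is ours):

```python
def halve_time_slices(counts):
    """Double the number of time slices by halving the slice size.

    Each original count is kept, and the new slice between two originals
    gets (roughly) the midpoint of its neighbors, reflecting the assumption
    that DELAYs are uniformly distributed within a slice. The final count
    is simply duplicated.
    """
    out = []
    for i, c in enumerate(counts):
        out.append(c)
        if i + 1 < len(counts):
            out.append((c + counts[i + 1]) // 2)  # integer midpoint
        else:
            out.append(c)
    return out
```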
Figure 44. Time slice counts after doubling the number of time slices (original counts: 5, 7, 10, 10, 15; smoothed new counts: 5, 6, 7, 8, 10, 10, 10, 12, 15, 15)
The smoothing need only be done for time slices that fall between the current simulation
time t and t + γ for all γ; that is, if ξ is the original time slice size, we smooth only the
time slices lying between the slice containing t and the slice containing t + γ. We are no
longer interested in time slices in the past, and time slices beyond the one containing
t + γ (for all γ) have not yet been assigned values.
D.5.4 USING MORE INFORMATION: EXPECTED ORDER STATISTICS
The three extensions proposed in previous sections do not use additional information for
the Time Slice Counter Method itself. The extensions proposed next modify the Time
Slice Counter Method by using more information than just the current (time slice) counts.
The way of classifying uncertain jobs in Appendix D.2.5 using expectations can
be refined by using the expected order statistics rather than the expectation of the location
of a single job in the time slice. The additional information used is the relative position
of the job in the time slice. For example, if there are three DELAYs occurring in a time
slice and the current job is the first, the expectation used is that of the first of three order
statistics.
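Under the assumption that the DELAY instants are i.i.d. uniform over the slice, the r-th of n order statistics of a Uniform(0, 1) sample has expectation r/(n + 1), which scales directly to the slice. A minimal sketch (function name is ours):

```python
def expected_delay_position(slice_start, slice_size, r, n):
    """Expected time of the r-th of n DELAYs within a time slice,
    assuming DELAY instants are i.i.d. uniform over the slice:
    E[U_(r)] = r / (n + 1), scaled and shifted to the slice."""
    return slice_start + slice_size * r / (n + 1)
```

For three DELAYs in a unit slice starting at 0, the first is expected at 0.25, the second at 0.5, and the third at 0.75.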
We need to know the number of DELAYs that occur in the time slice; this
requires a comparison to the previous time slice’s count. We also need to know the
relative number of the current job. This can be derived from the number of DELAYs in the
previous and current time slices, and from the current START count. We must decide on
the order statistic distribution to use: Should we use the distribution of interarrival (and
therefore DELAY) times, or must we determine the distribution of DELAY events in a
time slice? More research is required.
D.5.5 USING MORE INFORMATION: LCFS SERVICE DISCIPLINE
We outline a method that may allow us to estimate delay probabilities for a LCFS
discipline. The disadvantage is that we need to store additional information.
Nonetheless, we need not explicitly track each job, and the additional information
consists of counters in region I.
If job i is the first or only job in a busy period for a G/G/1 system, it corresponds
to both the ith START and DELAY. If the job arrives to a non-zero queue or other jobs
arrive before it can begin service, this is not the case, and we must use additional
information to determine to which START the ith DELAY corresponds. To do so, we
store the queue sizes at the times of the START events. There are two possibilities when
comparing the queue sizes after successive START events. (We assume no balking.)
They are outlined in Table 14.
Table 14. Possible queue sizes and their interpretations

Scenario 1: Q(Sj) ≤ Q(Sj+1)
    There was at least one arrival between the two START events.
    START j+1 corresponds to job N_A(Sj+1).
Scenario 2: Q(Sj+1) = Q(Sj) − 1
    There has been no arrival. We cannot say which job START j+1
    corresponds to. It is not job N_A(Sj+1), but we require additional
    history to be able to uniquely determine which job it is.
From the queue counts, we can reconstruct how many jobs have arrived, which jobs have
already been served, and which job will be served next. The details of this approach are
the subject of future work.
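The comparison in Table 14 can be sketched as a small classifier (the function name and return labels are ours; no balking is assumed):

```python
def classify_start(q_prev, q_next):
    """Classify START j+1 from the queue sizes at successive START events.

    q_prev -- queue size Q(S_j) at START j
    q_next -- queue size Q(S_{j+1}) at START j+1
    """
    if q_next >= q_prev:
        # Scenario 1: at least one arrival in between;
        # START j+1 serves the latest arrival, job N_A(S_{j+1}).
        return "arrival"
    if q_next == q_prev - 1:
        # Scenario 2: no arrival; the job's identity requires more history.
        return "no-arrival"
    # Without balking, the queue cannot shrink by more than one
    # between successive START events.
    raise ValueError("inconsistent queue sizes for a system without balking")
```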