An Information Taxonomy for Discrete Event Simulations
by
Theresa Marie Kiyoko Roeder
B.S. (Case Western Reserve University) 1997
M.S. (Case Western Reserve University) 1999
M.S. (University of California, Berkeley) 2001
A dissertation submitted in partial satisfaction of the
requirements for the degree of
Doctor of Philosophy in
Engineering-Industrial Engineering and Operations Research
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Lee W. Schruben, Chair
Professor J. George Shanthikumar
Professor Hyun-soo Ahn
Professor David R. Brillinger
Spring 2004
An Information Taxonomy for Discrete Event Simulations
Copyright 2004
by
Theresa Marie Kiyoko Roeder
Abstract
An Information Taxonomy for Discrete Event Simulations
by
Theresa Marie Kiyoko Roeder
Doctor of Philosophy in Engineering-Industrial Engineering and Operations Research
University of California, Berkeley
Professor Lee W. Schruben, Chair
Discrete event simulation models are a popular means of decision support because of
their ability to model complex systems with relative ease. This ease, however, can lead
to the inclusion of more detail in the model than is necessary, which, in turn, can lead to
increased development time and a greater number of errors in the model. In this
dissertation, we introduce an information taxonomy to aid in identifying the information
contained in a system.
The taxonomy has a twofold advantage. First, it assists the modeler in organizing
the information on the system, which facilitates model development. Second, it helps the
modeler to identify the information needed to model system characteristics, and to obtain
the desired output. If the user is constrained in the information that can be included in the
model, the taxonomy can help determine whether the model can meet requirements. If
not, more information must be included, or approximations must be developed to
overcome the lack of information.
In this dissertation, we develop two approximations for simulating queueing
systems. The first approximates dedication constraints, where lack of job
information precludes modeling system behavior. This approximation provides upper
and lower bounds on system performance measures. The second approximation is used
to estimate job waiting time distributions, where lack of job information precludes
obtaining the desired output statistics. We show that the approximation converges to the
job-driven estimate of the waiting time distribution for certain systems.
To my family
TABLE OF CONTENTS
1. Introduction........................................................................................1
2. Current Simulation Taxonomy and World Views ............................4
2.1. Resource-Driven versus Job-Driven Simulations ............................................ 4
   2.1.1. Description ................................................................................................ 4
   2.1.2. Comparison ............................................................................................... 6
2.2. Simulation World Views .................................................................................... 7
   2.2.1. Process Interaction .................................................................................... 8
   2.2.2. Activity Scanning ..................................................................................... 10
   2.2.3. Event Scheduling ..................................................................................... 12
   2.2.4. Comparison of Process Interaction and Event Scheduling ..................... 15
2.3. Some Recent Experiences: A Semiconductor Fabrication Plant Model ......... 16
   2.3.1. System Description .................................................................................. 17
   2.3.2. Results ..................................................................................................... 18
2.4. Problems with a Taxonomy Based on the Simulation World Views ........... 21
2.5. Problems with a Taxonomy Based on the Resource-Driven and Job-Driven Paradigms ......................................................................................................... 23
3. Information Taxonomy....................................................................25
3.1. Types of Information ....................................................................................... 26
   3.1.1. General or Entity-Specific ...................................................................... 26
   3.1.2. Subscripted: Resource or Job ................................................................. 28
   3.1.3. Subscripted: Local or Global .................................................................. 29
   3.1.4. Modeling, Statistics, or Both ................................................................... 30
   3.1.5. Static or Dynamic ................................................................................... 31
3.2. Classification ................................................................................................... 31
   3.2.1. Diagram ................................................................................................... 31
   3.2.2. Example: Wafer Fab ............................................................................... 33
   3.2.3. Example: G/G/s Priority Queue with Two Job Types ............................. 39
3.3. Complexity Analysis ........................................................................................ 41
   3.3.1. Memory (Storage) Requirements ............................................................ 43
   3.3.2. Processor Requirements: Simulation Engine ......................................... 45
   3.3.3. Processor Requirements: Data Manipulation ........................................ 46
3.4. Relationship to Current Taxonomies and Formalisms ................................... 46
   3.4.1. Classical World Views ............................................................................ 48
   3.4.2. Resource-Driven and Job-Driven Paradigms ........................................ 50
   3.4.3. Entity-Attribute-Set ................................................................................. 50
   3.4.4. Discrete Event System Specification (DEVS) ......................................... 51
   3.4.5. Conical Methodology .............................................................................. 53
   3.4.6. Object-Oriented Modeling ...................................................................... 54
4. Implications......................................................................................55
4.1. General Implications on Modeling ................................................................. 55
4.2. Implications on Modeling: Service Disciplines .............................................. 56
   4.2.1. Clarification of Terminology .................................................................. 58
   4.2.2. Goals and Assumptions .......................................................................... 59
   4.2.3. Service Discipline Taxonomy ................................................................. 60
      4.2.3.1. General Information ........................................................................ 61
      4.2.3.2. Resource Information ...................................................................... 62
      4.2.3.3. Local Job Information ..................................................................... 62
      4.2.3.4. Global Job Information ................................................................... 63
   4.2.4. Evaluation ............................................................................................... 64
4.3. Desired Approximation Characteristics ......................................................... 65
   4.3.1. Approximation Accuracy ........................................................................ 65
   4.3.2. Computational Requirements ................................................................. 67
   4.3.3. Error Measures ...................................................................................... 68
      4.3.3.1. Existing Measures ........................................................................... 68
      4.3.3.2. Example: Graphical Approach to Measuring Approximation Error ... 70
4.4. Example: Approximating Dedication Constraints .......................................... 72
   4.4.1. Problem Statement .................................................................................. 73
   4.4.2. System Behavior from the Job Perspective ............................................ 76
   4.4.3. System Behavior from the Resource Perspective ................................... 77
   4.4.4. Approximation Implementation .............................................................. 80
   4.4.5. Results .................................................................................................... 80
   4.4.6. Possible Improvements and Future Work .............................................. 86
4.5. Example: Approximating Waiting Time Distributions ................................... 87
   4.5.1. Basic Approach ...................................................................................... 88
   4.5.2. Eliminating Individual Job Information ................................................. 93
      4.5.2.1. Lower Time Slice Counter Method ................................................. 93
      4.5.2.2. Upper Time Slice Counter Method ................................................. 99
   4.5.3. Observations ......................................................................................... 100
   4.5.4. Time Slice Counter Method .................................................................. 102
      4.5.4.1. Further Observations .................................................................... 102
      4.5.4.2. The Time Slice Counter Method Algorithm .................................. 104
   4.5.5. Experimentation: Single-Server System .............................................. 106
   4.5.6. Discussion ............................................................................................. 116
      4.5.6.1. Single-Stage Systems .................................................................... 116
      4.5.6.2. Other Service Disciplines ............................................................. 118
      4.5.6.3. Summary ....................................................................................... 119
4.5.7. Future Work .............................................................................................. 120
5. Conclusions ................................................................................... 122
6. Bibliography.................................................................................. 125
Appendix A. Definitions and Notation...................................................... 131
A.1 Definitions and Abbreviations....................................................................... 131
A.2 Notation ........................................................................................................... 133
   A.2.1 General Notation and Notational Conventions ...................................... 133
   A.2.2 Information Taxonomy ............................................................................ 135
   A.2.3 Approximation Algorithm Characteristics ............................................. 136
   A.2.4 Estimating Dedication Constraints ........................................................ 136
   A.2.5 Approximating Waiting Time Distributions ........................................... 137
Appendix B. Example: Generating Correct Departure Process without Job Information ................................................................................................ 140
B.1 Problem Statement......................................................................................... 140
B.2 Solution Characteristics and Error Measures............................................. 141
B.3 Solution Approach: Discretized-Time Queues ............................................ 144
Appendix C. Dedication Constraints......................................................... 146
C.1 Examples of Systems with Dedication ............................................................ 146
   C.1.1 Wafer Production ................................................................................... 146
   C.1.2 Health Care ............................................................................................ 147
   C.1.3 Web Servers ............................................................................................ 148
   C.1.4 Other ...................................................................................................... 149
C.2 Enhanced Approximation.............................................................................. 149
Appendix D. Approximating Waiting Time Distributions ....................... 151
D.1 Possible Transitions Between Curve Orderings.......................................... 151
D.2 Classification of Uncertain Jobs .................................................................... 152
   D.2.1 Ignore Uncertain Jobs ........................................................................... 153
   D.2.2 Always Classify as (Not) Delayed .......................................................... 153
   D.2.3 Randomly Classify Jobs Using a Fixed Probability .............................. 155
   D.2.4 Randomly Classify Jobs Based on the Current Time Slice Location .... 155
   D.2.5 Using Expectations ................................................................................ 156
   D.2.6 Deterministically Classify Jobs Depending on Current Location ......... 157
   D.2.7 Hidden Markov Models ......................................................................... 159
   D.2.8 Comparison of Uncertain Job Classification ........................................ 161
D.3 Implementation Details ................................................................................... 164
   D.3.1 Classifying Jobs Based on Counters ..................................................... 164
   D.3.2 Jobs Finding an Idle Server .................................................................. 164
   D.3.3 Updating Counters ................................................................................ 165
   D.3.4 Initialization Bias .................................................................................. 166
   D.3.5 Selecting Parameter Values .................................................................. 167
D.4 Multiple-Server Tandem Queueing Systems ................................................... 168
   D.4.1 Experimentation ..................................................................................... 168
   D.4.2 Discussion .............................................................................................. 175
D.5 Extensions ....................................................................................................... 177
   D.5.1 Run-Time Error Estimation ................................................................... 177
   D.5.2 Dynamic Delay Values ........................................................................... 182
   D.5.3 Dynamic Time Slices ............................................................................. 184
   D.5.4 Using More Information: Expected Order Statistics ............................. 185
   D.5.5 Using More Information: LCFS Service Discipline .............................. 186
LIST OF FIGURES

Figure 1. Traditional GPSS block diagram for a G/G/1 queue ........................................ 9
Figure 2. Petri Net (with initial markings) for a G/G/1 queue....................................... 11
Figure 3. Basic event graph component ......................................................................... 13
Figure 4. Event graph for an n-stage G/G/· queueing network...................................... 13
Figure 5. Two-event event graph for an n-stage G/G/· queueing network..................... 14
Figure 6. Information taxonomy ..................................................................................... 32
Figure 7. Event graph for an n-stage queueing network, with dedication constraints modeled using the approximation................................................................... 80
Figure 8. Q_g^tot(t) for the system described above ...................................................... 82
Figure 9. Q_g^tot(t) for the system in Figure 8, with short revisit times ........................ 85
Figure 10. A_approx(t*) and A_exact(t*) as proportions of the total number of jobs processed .... 85
Figure 11. V_approx(t*) and V_exact(t*) as proportions of the total number of jobs processed .... 86
Figure 12. Event graph model for estimating P{delay ≤ γ} ........................................... 90
Figure 13. Sample realization of START and DELAY event occurrences........................ 92
Figure 14. System realization from Figure 13 with approximated DELAY curve ........... 97
Figure 15. P{wait ≤ γ} for various values of γ and several time slice sizes ..................... 99
Figure 16. P{wait ≤ γ} for various values of γ and several time slice sizes using both the Lower and Upper Time Slice Counter Methods ........................................... 100
Figure 17. Accuracy of job classification using the Lower and Upper Time Slice Counter Methods......................................................................................................... 103
Figure 18. Histogram of service times generated as Beta(0.5,2) with mean 1 .............. 109
Figure 19. Approximated probabilities using the Time Slice Counter Method for a lightly-loaded M/M/1 system......................................................................... 111
Figure 20. Approximated probabilities using the Time Slice Counter Method for a heavily-loaded M/M/1 system ....................................................................... 112
Figure 21. Errors for M/M/1 Time Slice Counter Method trials.................................... 113
Figure 22. Errors for U/U/1 Time Slice Counter Method trials..................................... 114
Figure 23. Errors for B/B/1 Time Slice Counter Method trials...................................... 115
Figure 24. Regression residuals versus fitted values ..................................................... 118
Figure 25. Sample queue for 5 parts that visit the same server 3 times......................... 144
Figure 26. Sample discretized queue for the queue given in Figure 25 ......................... 144
Figure 27. Possible transitions between orderings within a time slice .......................... 152
Figure 28. Location of current time in time slice ........................................................... 154
Figure 29. Probabilities integrated over ........................................................................ 158
Figure 30. Area lost by using a point other than ξ = (ξ_m + ξ_i)/2 as a cutoff ............ 159
Figure 31. Mapping of hidden states to observable outputs of the states ...................... 160
Figure 32. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/U/1 implementations ..................................................... 162
Figure 33. Difference in estimation errors for the random and deterministic Time Slice Counter Method U/U/1 implementations...................................................... 163
Figure 34. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/B/1 implementations...................................................... 163
Figure 35. Illustration of relationships of times for updating time slices ...................... 166
Figure 36. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and service times................................................... 171
Figure 37. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Beta service times .......................................... 172
Figure 38. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Uniform service times .................................... 172
Figure 39. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times................................................... 173
Figure 40. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Beta service times .......................................... 173
Figure 41. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Uniform service times .................................... 174
Figure 42. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times, and a LCFS service protocol... 175
Figure 43. Example of job tagging ................................................................................. 180
Figure 44. Time slice counts after doubling the number of time slices .......................... 184
LIST OF TABLES

Table 1. Summary of terminology used by different applications ................................... 5
Table 2. Disadvantages of the job-driven and resource-driven paradigms ................... 7
Table 3. Disadvantages of the implementation approaches ......................................... 16
Table 4. Simulation run lengths for the fab model........................................................ 19
Table 5. Possible orderings of the curves ................................................................... 102
Table 6. Lower and Upper Time Slice Counter Method Classifications .................... 103
Table 7. Formulae for C_D^2 in different systems ........................................................ 107
Table 8. Distributions and their characteristics ......................................................... 108
Table 9. Systems in increasing order of C_D^2 ............................................................ 109
Table 10. Factors included in multiple regression ....................................................... 116
Table 11. Multiple linear regression model of Time Slice Counter Method errors...... 117
Table 12. Example values to illustrate differences in error measures.......................... 143
Table 13. Experiments for multiple-server tandem queueing systems.......................... 170
Table 14. Possible queue sizes and their interpretations.............................................. 186
ACKNOWLEDGEMENTS

I would like to thank my advisor, Lee W. Schruben, for his energy, creativity, and sound
advice. You always kept me on my toes. I would like to thank my committee members,
George Shanthikumar, Hyun-soo Ahn, and David Brillinger, for their guidance and
unfailing willingness to help me. Thank you also for the somewhat unexpected lessons
about Canadian hockey gear manufacturers.
I could always count on my families across two continents for support. I would
especially like to thank my parents, Karen and Lothar, and my brother, Paul. All of my
friends have been very supportive, but I must explicitly thank Laura and Beshara
Elmufdi, Max Robert, Deepak Rajan, Nirmal Govind, and Krista Damer for always being
there for me. My fellow “BSG” members Debbie Pederson and Wai Kin Chan were an
invaluable sounding board for ideas and had no small part in the development of this
dissertation. Thank you for reminding me how to do integrals.
Finally, I would not have been able to do this without my husband, Mike Cating.
Thank you for everything.
1. INTRODUCTION

Computer simulation is one of the most popular Operations Research/Management
Science tools used for decision support. Because its focus is sampling the behavior of a
system rather than finding the exact solution to a problem, it lends itself to the analysis of
complex systems that would be difficult to represent using analytical mathematical
modeling techniques. It is possible to represent non-deterministic behavior over time
using technology no more advanced than personal computers (PCs). However, because
of the relative ease of modeling a wide variety of situations (especially compared to more
traditional mathematical modeling techniques such as linear programming), there is the
temptation to include everything in the model, and the simulation models developed can
be too complex for their intended purpose. On p. 12 of (Seila et al. 2003), the authors
note:
Model building is more art than science and requires much practice to master. It is important to develop a model that is as simple as possible while also being highly credible and having a good chance of being found valid for the anticipated decisions. Most experienced modelers agree that it is better to start with a simple model and add details later if necessary than to start with an excessively complex and realistic model.
Excessive detail is detrimental to model efficacy in several ways. First, it can take longer
to develop the model, as the details must be modeled accurately. Second, modeling
system components that are not necessary can induce errors, reducing the accuracy of the
simulation. Finally, the additional information that must be stored and processed
can make simulation runtimes very long. (The modeling
approach taken also has a large impact on the runtimes. We will discuss this further in
later sections.) For example, a transportation network model with approximately 100
nodes took 18 hours to simulate 10 days (Sturm 2002).
To combat this problem of excessive runtimes, parallel (several replications run at
the same time on several processors) and distributed (single replication run on several
processors) computing algorithms are being developed to devote more processing
resources to the simulations. See, for example, (Jefferson 1983; Witten et al. 1983;
Wonnacott 1996). We take a different, complementary, approach here.
Rather than developing means of devoting more computing power to a model, our
approach focuses on the requirements of the model: by paring away unneeded information and
including only what is necessary to model the system and obtain the desired output
statistics, we can reduce runtimes. Faster simulations can provide more information for
the same amount of resources.
While all the above techniques can be used in any kind of simulation model, we
focus our efforts here on discrete-event systems. These are simulation models where the
system state changes at discrete points in time. In contrast, continuous-time simulations
have state variables changing value throughout time (Law and Kelton 2000). An
alternate definition of a discrete-event system is one whose rules or laws of behavior
change at discrete points in time. For example, we may have a system whose state is
changing constantly up to a threshold, when a different set of rules applies.
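This definition can be made concrete with a minimal event-scheduling loop. The sketch below is hypothetical Python (the function and variable names are illustrative, not drawn from any simulation package discussed later): the clock jumps from one scheduled event to the next, and the state (queue length, server status) changes only at those discrete points.

```python
import heapq

def simulate(arrivals, service_time, horizon):
    """Minimal event-scheduling loop for a single-server queue.

    The clock jumps between scheduled events; between events the
    state (queue length, server status) does not change at all.
    """
    future = []                          # future event list: (time, kind)
    for t in arrivals:
        heapq.heappush(future, (t, "arrive"))
    queue, busy = 0, False               # the complete system state
    log = []                             # (clock, queue, busy) after each event
    while future:
        clock, kind = heapq.heappop(future)
        if clock > horizon:
            break
        if kind == "arrive":
            if busy:
                queue += 1               # server taken: job must wait
            else:
                busy = True              # seize the server immediately
                heapq.heappush(future, (clock + service_time, "depart"))
        else:                            # a service just finished
            if queue > 0:
                queue -= 1               # next waiting job starts service
                heapq.heappush(future, (clock + service_time, "depart"))
            else:
                busy = False             # server goes idle
        log.append((clock, queue, busy))
    return log
```

For instance, `simulate([0, 1, 2], 1.5, 10)` produces six state changes, the last at time 4.5 when the server goes idle; no state variable changes between those event times.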
In Chapter 2, we discuss different approaches to simulation modeling, and the
effects these approaches can have on execution speed and simulation accuracy. These are
known as the classical simulation world views. We also review the resource-driven and
job-driven paradigms proposed in the literature. Chapter 3 introduces a new taxonomy
for the information contained in a system, with special focus on simulation models. We
provide a discussion of the information needed to model different types of service
disciplines. Lack of information can cause aspects of the system to be modeled
incorrectly, or desired output statistics to be unavailable. (Approximations may be used
to overcome these difficulties if the required information is unavailable.)
In Chapter 4, we discuss the implications of the taxonomy. We propose a
taxonomy of service disciplines in Section 4.2. Desired properties of estimation
algorithms, and ways to evaluate their effectiveness, are outlined in
Section 4.3. In Sections 4.4 and 4.5, we describe two approximation methods that allow
us to overcome lack of information and obtain accurate results. Definitions, notation, and
acronyms used throughout this paper are summarized in Appendix A. Further examples
and implementation details are given in Appendices B through D.
2. CURRENT SIMULATION TAXONOMY AND WORLD VIEWS

The research in this dissertation is motivated by work done using a simulation approach
that has come to be known as a resource-driven (RD) simulation. Here, the focus is on
the resident entities in the simulation. Most discrete event simulations in use are known
as job-driven (JD) simulations, where the focus is on the transient entities that pass
through the system (Schruben and Schruben 2001). These two approaches are discussed
in Section 2.1. They are different from the classical simulation world views, the topic of
Section 2.2.
The discussion in this section motivates the taxonomy in Chapter 3. It
summarizes work done in (Roeder et al. 2002; Schruben and Roeder 2003), and outlines
the current state of research in this area. As the literature does not provide much
guidance, the concepts of “information” and “amount of information” are vague in this
section. They will be quantified in Chapter 3. Resource-driven and job-driven
simulations have respectively been referred to as “active server” and “active transaction”
approaches to modeling (Henriksen 1981).
2.1. Resource-Driven versus Job-Driven Simulations
2.1.1. Description
In building and analyzing simulation models, it is useful to classify system entities as
resident or transient. Resident entities (resources) remain part of the system for long
intervals of time, whereas transient entities (jobs) enter into and depart from the system
with relative frequency. In a factory, a resident entity might be a machine; a transient
entity might be a part. Sometimes, it is not clear whether an entity is resident or transient.
The decision depends on the level of detail desired and objectives of the simulation study;
a factory worker might be regarded as a transient entity in one model and a resident entity
in another. Table 1 summarizes the terminology used by different applications in
referring to resources and jobs.
Table 1. Summary of terminology used by different applications

   Technology        Resources                Jobs
   ------------------------------------------------------------------
   Queueing Theory   Servers                  Customers
   Petri Nets        Tokens                   Tokens
   Event Graphs      Resident Entities        Transient Entities
   Arena             Resources                Entities
   Simscript         Permanent Entities       Temporary Entities
   GPSS              Facilities and Storages  Transactions
In describing the dynamic behavior of a system, it is often useful to focus on the cycles of
the resident entities. For example, one might describe the busy-idle cycles of machines
or workers. On the other hand, the focus may be on the paths along which transient
entities flow as they pass through the system (e.g., parts moving through a factory).
Transient entity system descriptions tend to be more detailed than resident entity
descriptions. Typically, a mixture of both viewpoints is used, but one or the other
predominates. (For example, jobs may be traced only at certain points in the system, not
throughout the system.)
In (Schruben and Roeder 2003), the authors propose using these approaches as a
way of classifying simulation models at a high level. It is different from the conventional
taxonomy based on simulation world views described in Section 2.2; the world views are
ways of implementing resource- or job-driven simulations.
2.1.2. Comparison

In highly-congested systems, where there are relatively few resident entities and a great
many transient entities, it is usually more efficient in terms of memory and computational
requirements to study the cycles of the resident entities than to track each transient entity.
These systems include semiconductor factories (“fabs”) with thousands of wafers,
communication systems with millions of messages, and transportation systems with tens
of thousands of vehicles. In simulating such systems, the cycles of resident entities might
be described by the values of only a few variables (e.g., the number of available servers
and the queue size), while the flow of transient entities might require many variables
(representing each job). This is a major disadvantage of job-driven simulations: as the
simulated system becomes congested, the simulation’s memory footprint grows and its
execution slows; in extreme cases, the simulation may consume all system resources and
cause the operating system to fail.
Very large and highly-congested queueing networks (with any number of jobs of
any number of types at any countable number of discrete processing stages) can be
modeled as resource-driven simulations with a relatively small set of integers. The memory footprint
of the simulation is proportional to the number of resources in the system, not the number
of jobs. As a consequence, the simulation is almost insensitive to system congestion; the
number of resources, and therefore the memory requirements, does not change with
congestion. (There may be a computational impact if the number of jobs arriving during
a time period increases because more arrival events take place during that time.)
However, using only integer counts reduces modeling ability and the number of available
output statistics.
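The count-based approach can be made concrete in a short sketch (ours, not the dissertation's code): a G/G/1 queue simulated in next-event fashion whose entire state is the two integers q (jobs waiting) and r (free servers).

```python
import heapq

def gg1_resource_driven(next_ta, next_ts, horizon):
    """G/G/1 queue simulated with only two integers of state:
    q = jobs waiting, r = free servers. next_ta and next_ts are
    zero-argument samplers for interarrival and service times."""
    q, r = 0, 1
    now, last, area_q = 0.0, 0.0, 0.0
    fel = [(next_ta(), 'enter')]           # future event list
    while fel and fel[0][0] <= horizon:
        now, ev = heapq.heappop(fel)
        area_q += q * (now - last)         # accumulate time-weighted queue
        last = now
        if ev == 'enter':
            q += 1
            heapq.heappush(fel, (now + next_ta(), 'enter'))
        else:                              # 'finish'
            r += 1
        if q > 0 and r > 0:                # start service when possible
            q -= 1; r -= 1
            heapq.heappush(fel, (now + next_ts(), 'finish'))
    return area_q / last if last > 0 else 0.0

# Deterministic arrivals every 1.0 with 0.8 service: the queue never builds.
print(gg1_resource_driven(lambda: 1.0, lambda: 0.8, 100.0))  # -> 0.0
```

No job records exist anywhere in this model; the memory footprint is constant regardless of how congested the queue becomes.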
On the other hand, systems where there are only a few transient entities and many
resident entities (a construction project or an airline maintenance facility) may be
efficiently studied by examining the flow of transient entities. The advantage of using a
job-driven model over a resource-driven model is that the detailed experiences of
individual jobs can be tracked easily. This, in turn, allows detailed modeling and output.
The advantages and disadvantages of the two approaches are summarized in Table 2.
Table 2. Advantages and disadvantages of the job-driven and resource-driven paradigms

  Job-Driven
    Disadvantages: large memory footprint for congested systems;
      simulation execution slows as system size increases.
    Advantages: detailed modeling possible; detailed output statistics
      available; easy animation.
  Resource-Driven
    Disadvantages: insufficient information to model certain system
      behaviors a priori; available output limited.
    Advantages: memory requirements insensitive to system congestion;
      simulation execution speed insensitive to system congestion.
2.2. Simulation World Views
A conventional taxonomy for simulation models classifies them according to three
primary simulation world views (“Weltanschauungen”). See, for example,
(Derrick et al. 1989; Derrick 1992; Carson 1993; Page 1994). (Overstreet 1987)
proposes an informal graphical method for translating models between world views. The
terminology of world views was brought into the mainstream by (Kiviat 1969).
However, in that work, they are not used as a means of classifying models. Rather, they
are used to describe the approaches different simulation programming languages (SPLs)
take in the implementation of simulation models. Kiviat’s original interpretation of
world views is taken in (Schruben and Roeder 2003). In the interim, the distinction
between the implementation and the model has become blurred, and the world views
have been used as the means of classification for simulation models. For this reason, we
present the three world views and highlight their advantages and disadvantages in the
following sections.
2.2.1. Process Interaction
The most prevalent traditional world view is the process interaction world view. A
process can be defined as “a set of events that are associated with a system behavior
description” (Kiviat 1969). For example, a job’s progression through the system is a
process. Each job represents a different process. Following (Nance 1981), we define an
event as “a change in object state, occurring at an instant.”
The name “process interaction” derives from the focus on modeling how different
processes in the system interact with each other. For example, loading and unloading
parts into/from a machine and repairing a machine when it fails are separate processes.
In this case, the production process needs the machine and an operator, either of which
may be busy with a repair process. The processes are interacting through their shared
resources. The simulation progresses by activating and deactivating processes, shifting
control between them.
Virtually all commercial simulation packages use a process interaction-based
approach to implement job-driven simulations. For a formal definition of this world
view, see (Cota and Sargent 1992).
The jobs are the active system entities that “seize” available system resources as
needed. This approach typically requires that every step in the processing flow path of
every job in the system be explicitly represented. Records of all jobs in the system are
created and maintained. The jobs move through their processing steps (often represented
by a block flow diagram), seizing and releasing system resources as needed.
Figure 1 shows the GPSS blocks (Schriber 1991) for a single-server queueing
system. Jobs are GENerated with a certain interarrival time TA. They then join the
QUEUE for resource 1. If resource 1 is available, they SEIZE the resource, and
DEPART the queue (computing waiting time statistics). After a service time TS, the job
RELEASEs the resource and is TERMinated.
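The flow above can be mimicked by a toy process-interaction executive in which each job is a coroutine that yields seize/delay/release requests (a sketch of the world view; the engine and its yield protocol are our illustration, not GPSS itself):

```python
import heapq

class Resource:
    """A resident entity with a fixed number of interchangeable units."""
    def __init__(self, capacity=1):
        self.free = capacity
        self.waiting = []                # processes blocked on a 'seize'

class Engine:
    """Toy process-interaction executive. A process is a generator that
    yields ('delay', t), ('seize', res) or ('release', res) requests."""
    def __init__(self):
        self.now, self._seq, self._fel = 0.0, 0, []

    def start(self, proc, delay=0.0):
        heapq.heappush(self._fel, (self.now + delay, self._seq, proc))
        self._seq += 1

    def _resume(self, proc):
        try:
            while True:
                verb, arg = next(proc)
                if verb == 'delay':
                    self.start(proc, arg)         # reactivate later
                    return
                elif verb == 'seize':
                    if arg.free:
                        arg.free -= 1             # unit granted; keep running
                    else:
                        arg.waiting.append(proc)  # passivate until released
                        return
                elif verb == 'release':
                    if arg.waiting:               # hand the unit straight to
                        self._resume(arg.waiting.pop(0))   # the first waiter
                    else:
                        arg.free += 1
        except StopIteration:
            pass                                  # process TERMinates

    def run(self):
        while self._fel:
            self.now, _, proc = heapq.heappop(self._fel)
            self._resume(proc)

def job(eng, server, service_time, departures):
    yield ('seize', server)          # QUEUE / SEIZE
    yield ('delay', service_time)    # service time TS
    yield ('release', server)        # RELEASE
    departures.append(eng.now)       # TERMinate, recording departure time

departures = []
eng, server = Engine(), Resource(capacity=1)
for t in (0.0, 1.0, 2.0):            # deterministic arrivals, 5.0 service
    eng.start(job(eng, server, 5.0, departures), delay=t)
eng.run()
print(departures)                    # -> [5.0, 10.0, 15.0]
```

Note that the simulation logic reads as the life history of a single job; the executive, not the modeler, manages the interleaving of the competing processes.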
Figure 1. Traditional GPSS block diagram for a G/G/1 queue
Process interaction models are inherently job-driven. They are convenient when
predicting system performance and fast simulation execution speed are not as important
as detailed system animation. A major advantage to using this approach is that there
often is a direct mapping from simulation logic to animation – the code focuses on
describing the things in the system that move. Another advantage is that modeling a
system using processes is often simpler than focusing on state-changing events (see
Section 2.2.3 for a discussion of the event scheduling world view). Less abstraction is
required to formulate the model when the focus is on processes, as they are usually the
most natural way to think about a system; this allows users with little training and
experience to build fairly sophisticated models.
2.2.2. Activity Scanning
The second classical world view is the activity scanning world view. Much as the
process interaction world view is closely related to job-driven simulations, activity
scanning is often used to develop a purely resource-driven model (in the sense of models
where only counts of resources and jobs are available). The name “activity scanning”
derives from the way the simulation model progresses: After each transaction, the whole
model is scanned to find whether any new activity can start or finish.
Stochastic Timed Petri Nets (STPNs) are a popular graphical implementation of
activity scanning that can be useful in simulating certain types of resident entity cycles
(Törn 1981; Haas 2002). They are an extension of Petri Nets (Petri 1962). The bars in a
STPN represent transitions, which may be instantaneous or have time delays. The circles
are places containing tokens; the token counts represent the system state. STPNs are a
special case of the event graphs discussed in Section 2.2.3. Their usefulness is limited to
a subset of the models easily simulated by event graphs. See (Schruben 2003) for an
algorithm to convert any Petri Net to a more compact event graph. (There is an
interesting confusion of terminology here due to the term “event graph” having been also
used to denote a very special class of a Petri Nets – a construct completely different from
the one referred to above (Commoner et al. 1971).)
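The token game that drives such nets takes only a few lines; the sketch below (ours) models a single-server queue, with a transition enabled only when its input places hold enough tokens:

```python
def enabled(marking, pre):
    """A transition may fire only if every input place holds enough tokens."""
    return all(marking[p] >= n for p, n in pre.items())

def fire(marking, pre, post):
    """Fire a transition: consume tokens from the input places and
    deposit tokens into the output places."""
    m = dict(marking)
    for p, n in pre.items():
        m[p] -= n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m

# A single-server queue: 'start' needs one waiting job and the idle server.
start_pre, start_post = {'queue': 1, 'idle': 1}, {'busy': 1}
m = {'queue': 2, 'idle': 1, 'busy': 0}
m = fire(m, start_pre, start_post)           # one job enters service
assert m == {'queue': 1, 'idle': 0, 'busy': 1}
assert not enabled(m, start_pre)             # the second job must wait
```

The marking (the token counts) is exactly the integer state of a resource-driven model; a timed net additionally attaches delays to the transitions.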
Figure 2 shows a Petri Net for the G/G/1 queueing system shown in Figure 1.
Figure 2. Petri Net (with initial markings) for a G/G/1 queue
An advantage of activity scanning models is that they can be used to quickly show the
relationships between resources in a system, and how the resource cycles relate to one
another (Kiviat 1969; Fishman 1973). A disadvantage is that they grow very quickly as
the model size increases because of lack of parameterization. This can slow the
execution when the entire model must repeatedly be scanned to find the next activity.
The Three-Phase Method is a modification to the activity scanning paradigm. It
reduces the number of activities that must be scanned by classifying activities as time-
bound (“B” events) and conditional (“C” events). An example of a “B” event is a finish
service, which is known to occur at a fixed time after a start service. Start service events,
on the other hand, require that conditions be met before they can begin service (available
job and resource); they are “C” events. In the Three-Phase Method, the simulation clock
time is advanced to the time of the next “B” event(s). After it is executed, all “C” events
are scanned to see if their conditions are now met. The Three-Phase Method was
introduced in (Tocher 1963).
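As an illustration (ours, with deterministic times for brevity), the three phases for a single-server queue might be sketched as follows:

```python
import heapq
from itertools import count

def three_phase(horizon, ta=1.0, ts=0.8):
    """Single-server queue via the Three-Phase Method (a sketch with
    deterministic inter-arrival time ta and service time ts)."""
    state = {'q': 0, 'r': 1, 'done': 0}
    cal, seq = [], count()                 # calendar of time-bound "B" events

    def bind(t, act):                      # schedule a "B" event
        heapq.heappush(cal, (t, next(seq), act))

    def arrive(now):                       # B: arrival is bound in time
        state['q'] += 1
        if now + ta <= horizon:            # stop generating past the horizon
            bind(now + ta, arrive)

    def finish(now):                       # B: end-of-service is bound in time
        state['r'] += 1
        state['done'] += 1

    def try_start(now):                    # C: start-of-service is conditional
        if state['q'] > 0 and state['r'] > 0:
            state['q'] -= 1; state['r'] -= 1
            bind(now + ts, finish)
            return True
        return False

    bind(0.0, arrive)
    while cal:
        now = cal[0][0]                    # Phase A: advance clock to next B
        while cal and cal[0][0] == now:    # Phase B: execute all due B events
            heapq.heappop(cal)[2](now)
        while try_start(now):              # Phase C: rescan conditional events
            pass
    return state['done']

print(three_phase(2.0))   # arrivals at t = 0, 1, 2 all complete -> 3
```

Only the conditional "C" events are rescanned after each clock advance; the time-bound "B" events sit on the calendar, which is the efficiency gain over a pure activity scan.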
2.2.3. Event Scheduling
The final classical world view is the event scheduling world view (Kiviat 1969). The
three elements of a discrete event system model are the state variables, the events that
change the values of these state variables, and the relationships between the events. The
event scheduling world view focuses on these events. The simulation proceeds by
executing the next event, which, in turn, may schedule further events or cancel events
that have already been scheduled.
Event graphs are a graphical way of implementing an event scheduling-based
simulation (Schruben 1983). They are related to state-transition diagrams in queueing
theory (Ross 1997). In an event graph, events are represented as vertices and the
relationships between events are represented as directed edges connecting pairs of event
vertices. In contrast, the vertices in a state-transition diagram represent the state, and the
edges represent possible transitions between states. This can make the diagram infinitely
large (if the state-space is infinite). Since the vertices represent changes in state, not the
states themselves, event graphs are unlikely to suffer this problem, especially when the
model can be parameterized.
Time elapsed between the occurrences of events is represented on the edges of an
event graph. Figure 3 shows its basic structure. It states that
If condition (i) is true at the instant event A occurs, then event B will immediately be
scheduled to occur t time units in the future with variables k assigned the values j.
Figure 3. Basic event graph component
An event graph for a (resource-driven version of a) multiple-server n-stage queueing
network is shown in Figure 4. Qi and Ri are the current queue size and the number of
available servers at stage i, respectively. DONE is a Boolean variable indicating whether
the job has completed its final processing stage.
Figure 4. Event graph for an n-stage G/G/· queueing network
In contrast to process interaction-based models, jobs do not actively seize the resources.
There is no explicit differentiation between resource and job entities.
Figure 5 shows the event graph for the model in Figure 4 without the START
event. The two event graphs have the same behavior. In Figure 5, Qi is the total number
of jobs in queue and in service at stage i. Ri is the total number of servers at i.
Figure 5. Two-event event graph for an n-stage G/G/· queueing network
Unlike in process interaction and activity scanning models, event scheduling can be used
to implement both job-driven and resource-driven simulations in a natural way. For
resource-driven simulations, rather than maintaining a record of every job in the system,
only integer counts of the numbers of jobs of particular types and different stages of
processing or in different states are necessary. The system’s state is described by the
availability of resources (also integers) and these job counts. Thus, all the state variables
in a resource-driven simulation are non-negative integers. The state changes for each
event are difference equations that increase or decrease one or more state variables by
integer amounts. This is illustrated in Figures 4 and 5. For a discussion of resource
graphs, a derivation of event graphs designed specifically for resource-driven
simulations, see (Hyden et al. 2001).
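The difference equations of Figure 4 can be sketched directly (our illustration; times are deterministic, and the START event re-checks its condition as a guard against simultaneous schedulings):

```python
import heapq
from itertools import count

def n_stage_network(n, servers, ta, ts, horizon):
    """Resource-driven n-stage queueing network in the spirit of Figure 4.
    The entire state is 2n non-negative integers: Q[i] and R[i]."""
    Q = [0] * (n + 1)                # jobs in queue at stages 1..n
    R = [servers] * (n + 1)          # available servers at stages 1..n
    done = 0
    fel, seq = [], count()           # future event list

    def sched(t, ev, i):
        heapq.heappush(fel, (t, next(seq), ev, i))

    sched(0.0, 'enter', 1)
    while fel and fel[0][0] <= horizon:
        now, _, ev, i = heapq.heappop(fel)
        if ev == 'enter':
            Q[i] += 1
            if i == 1:
                sched(now + ta, 'enter', 1)     # self-scheduling arrivals
            if R[i] > 0:
                sched(now, 'start', i)
        elif ev == 'start':
            if Q[i] > 0 and R[i] > 0:           # re-check (tie-breaking guard)
                Q[i] -= 1; R[i] -= 1
                sched(now + ts, 'finish', i)
        else:                                   # 'finish'
            R[i] += 1
            if Q[i] > 0:
                sched(now, 'start', i)
            if i == n:
                done += 1
            else:
                sched(now, 'enter', i + 1)      # pass the job downstream
    return done

print(n_stage_network(2, 1, 1.0, 0.8, 10.0))    # -> 9
```

Each event body is a set of integer difference equations plus conditional schedulings, exactly mirroring the state changes and edges of the event graph.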
When using event scheduling to implement a job-driven simulation, the integer
counts of the jobs at a given step are replaced by lists of these jobs. As in the process
interaction approach, the memory footprint of the simulation will increase with system
congestion.
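The contrast can be seen in the state a single stage must keep (the attribute names here are illustrative, not the fab model's):

```python
from collections import deque

# Resource-driven state for one stage: two integers suffice, and they
# do not grow with congestion.
q, r = 2, 1

# Job-driven state for the same stage: one record per job, so memory
# grows with congestion -- but per-job statistics become available.
now = 45.0
queue = deque([{'id': 17, 'arrival': 42.0},
               {'id': 18, 'arrival': 44.5}])
waits = [now - job['arrival'] for job in queue]
print(waits)             # [3.0, 0.5] -- individual waits, not just counts
assert q == len(queue)   # the count is all the resource-driven model keeps
```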
Event graphs have been shown to be able to model Turing Machines
(Savage et al. 2004). That is, event graphs can model anything that is computable.
2.2.4. Comparison of Process Interaction and Event Scheduling
Traditionally, the terms “job-driven simulation” and “process interaction world view”
have been used almost interchangeably because process interaction-based software is
inherently job-driven. Nonetheless, in (Schruben and Roeder 2003), we show this
paradigm can also be used to develop logic for resource-driven simulations. An early
example of this can be found in (Schriber 1991). Here, the workers processing a large
number of parts are introduced as “transactions” that cycle through states of being busy
and idle.
Some of the disadvantages traditionally associated with job-driven simulations,
see (Roeder et al. 2002), are actually phenomena of the process interaction world view,
not of the job-driven approach itself. A job-driven model implemented using an event
scheduling paradigm does not suffer these difficulties. Table 2 lists the
disadvantages of the job-driven and resource-driven paradigms themselves, while Table 3
below shows the difficulties associated with the actual implementations. The two tables
show that the event scheduling implementation of a resource-driven simulation is
straightforward, though the resource-driven paradigm itself has problems.
The problems of modularity and encapsulation (or lack thereof) for process
interaction-based simulations are addressed in (Cota and Sargent 1992). They are not
listed in Table 3 since the authors propose a modification of the world view to address
these issues. The modification redefines the control state of a process to allow the
process to end before its “time left in state” has run out if the conditions for reactivation
are met. This facilitates encapsulation by allowing processes to be ended prematurely
without the need for other processes to do the cancellation.
Table 3. Disadvantages of the implementation approaches

  Job-Driven
    Process Interaction: failure modeling inaccurate; deadlock possible.
    Activity Scanning (STPN): N/A without many customizations.
    Event Scheduling: many events scheduled in congested systems
      (slows simulation).
  Resource-Driven
    Process Interaction: must “trick” the software into performing the
      desired behavior.
    Activity Scanning (STPN): model size increases dramatically with
      system size; many extensions required to enable modeling.
    Event Scheduling: (no major difficulties).
An expansive evaluation of different modeling and implementation approaches is given
in Table 4.8 of (Page 1994).
2.3. Some Recent Experiences: A Semiconductor Fabrication Plant Model
Current simulation models in the semiconductor industry have prohibitively long run
times, preventing the simulations from being used to their full potential. The goal of the
research project described here was to create a resource-driven simulation of a
semiconductor wafer fabrication plant (“fab”) to see whether there would be an
increase in execution speed, and whether there were any disadvantages to this approach.
Increased speed in semiconductor simulations is a great concern in the industry
(Brown et al. 1997).
We briefly describe the system modeled, and then discuss the results of the study;
these will motivate the work done in the next sections. For a more detailed discussion of
the modeling differences, see (Roeder et al. 2002). For confidentiality reasons, exact
system parameters cannot be given.
2.3.1. System Description
The fab produces more than 5 part types using over 80 different tool types. Each tool
type has a varying number of (functionally identical) tools. The tools include serial
processing, batching, and stepper tools. Batch tools are tools such as furnaces where
several lots are processed at one time. Stepper tools are used for the photolithography
steps of the manufacturing process, and are frequently the bottleneck in the system. Each
stepper costs several million dollars.
Tool types are subject to a set of preventive maintenances (PMs) and failures that
occur according to certain probability distributions. Test and rework wafers visit the
tools periodically. All tools require load and unload times, and stepper tools also have
setup-dependent setup times. Wafer lots are queued based on critical ratio ranking, and
the first lot in queue is selected to be processed next. Critical ratio (CR) ranking orders
jobs based on their expected due dates and remaining processing times. (We define CR
more formally in Section 4.2.3.4.) Proprietary processing rules are used in processing
parts on the stepper tools. To simplify the model, operators and generic tooling (e.g.,
masks and reticles) were not modeled.
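A common textbook form of the critical ratio (the dissertation's precise definition appears in Section 4.2.3.4; this sketch uses the standard ratio) divides the time remaining until the due date by the remaining processing time and serves the smallest ratio first:

```python
def critical_ratio(due_date, now, remaining_work):
    """Textbook critical ratio: slack per unit of remaining processing.
    Values below 1 indicate a lot that is already behind schedule."""
    return (due_date - now) / remaining_work

# (lot id, due date, remaining processing time) -- illustrative numbers
lots = [('A', 10.0, 4.0), ('B', 8.0, 5.0), ('C', 12.0, 2.0)]
now = 2.0
ranked = sorted(lots, key=lambda lot: critical_ratio(lot[1], now, lot[2]))
print([lot[0] for lot in ranked])   # ['B', 'A', 'C']: B is most urgent
```

Note that evaluating the ratio requires per-lot attributes (due date and remaining work), which foreshadows the difficulty of supporting this discipline in a purely count-based model.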
Lots visit tools repeatedly, and visit subroutes with certain frequencies (e.g.,
every 10th job to reach Step 4 executes it; all others go directly to Step 5). The basic
route followed by the parts consists of over 500 processing steps.
The job-driven simulation used by Intel is coded in AutoSched AP (ASAP), a
commercial software package by Brooks/AutoSimulations, Inc., (AutoSimulations 1999).
It is heavily used in the semiconductor industry. Data and options are specified in
spreadsheet files and are processed by the software. Intel engineers code customized
functionality in C++. ASAP is a process interaction-based software package.
The resource-driven simulation developed at UC Berkeley was implemented
using the software package SIGMA for the underlying simulation engine and to generate
C source code (Schruben and Schruben 2001). Additional coding was done in C to
mimic the functionality of the job-driven simulation. The resulting simulation is an
executable file. The ASAP input files were imported into a Microsoft Access database,
which, along with Microsoft Excel, was used to create plain text input files. SIGMA uses
an event scheduling algorithm to model system dynamics.
In order to be able to compare the output of the two simulations accurately, the
resource-driven model simulates the job-driven model, not the fab itself. This is done to
prevent differences in output caused by different approaches to modeling the same
aspects of the fab.
2.3.2. Results
In undertaking the study, the expectation was that the resource-driven simulation would
be somewhat faster than the job-driven simulation currently in use. The results
confirmed this intuition. The simulation runtimes for 2 years of simulated time are
shown in Table 4.¹ The resource-driven model was more than two orders of magnitude
faster than the job-driven model when failures were included in the model. Failures
increase the runtime because they are modeled as jobs, artificially increasing system
congestion.
Table 4. Simulation run lengths for the fab model

                         Job-Driven    Resource-Driven
  With PMs/Failures      ≈ 3 hours     < 10 minutes (≈ 352 seconds)
  Without PMs/Failures   ≈ 1.5 hours   < 10 minutes (≈ 300 seconds)
While the resource-driven event-scheduling model is faster, the job-driven process-
interaction model has an advantage when it comes to modeling certain aspects of the
system, and to the available statistics. The majority of system features were modeled
accurately in the resource-driven model, and the available output for these parts of the
system was not statistically significantly different from that of the job-driven simulation.
The resource-driven simulation was able to achieve these accurate results much more
quickly than the job-driven simulation. There are, however, important components of the
system that were difficult to model in a resource-driven simulation.
These components are those that require knowledge of specific job attributes. One
such example is dedication constraints. Dedication constraints are rules typically used
with the photolithography (stepper) tools to ensure that successive layers of etchings on
wafers are the same depth: At each visit to the stepper tool, wafers are exposed to
ultraviolet light that is used to create the circuitry of the chips. Because the dimensions
¹ The model was run on a Dell Dimension Pentium 4 1.7 GHz PC with 512 MB of RAM.
of the chips are so small, differences in the wavelengths of the light emitted by the
individual stepper tools have an impact on the quality of the resulting chip (Woods 1998).
Dedication constraints require wafers to be processed by the same tool i on
different, though perhaps not all, visits to a tool group. A less restrictive version of this
rule, partial dedication, is that the wafer must be processed by one of a subset of the tools
in the group, not necessarily by tool i. Modeling dedication constraints in a
straightforward manner requires more information than can be obtained from merely
looking at job and resource counts.
Stepper tools are among the most expensive resources in the fab and are typically
the bottleneck, so it is important to model this constraint. Because not all wafers that
arrive at the tool group are processed immediately, even if there are idle tools, the work-
in-progress at that tool group is greater than if the constraints were not in place. We will
return to dedication constraints in Section 4.4.
In resource-driven models, queueing at resources has the twofold problem of
limited available output statistics, and of being difficult to model accurately in some
cases. While we may be interested in knowing the distribution of the waiting times of
jobs at the resource, we are typically limited to averages in resource-driven simulations.
The queue sizes are known at any given time, and the average queue size can be used to
calculate the average waiting time using Little’s Law (Little 1961).
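As an illustration of the calculation (not fab data), the time-average queue length can be read off a piecewise-constant sample path and converted to an average wait via Little's Law:

```python
def time_average(path, horizon):
    """Time-average of a piecewise-constant queue-length sample path,
    given as (time, new_value) change points starting at time 0."""
    area = 0.0
    t_prev, q_prev = path[0]
    for t, q in path[1:]:
        area += q_prev * (t - t_prev)
        t_prev, q_prev = t, q
    area += q_prev * (horizon - t_prev)   # final segment to the horizon
    return area / horizon

# Three arrivals over 8 time units; queue length changes at these instants.
L = time_average([(0.0, 0), (1.0, 1), (2.5, 0), (3.0, 2), (5.0, 0)], 8.0)
lam = 3 / 8.0                 # observed arrival rate
W = L / lam                   # Little's Law: L = lam * W
print(L, W)                   # 0.6875 and about 1.833
```

Only aggregates enter this calculation; the distribution of individual waiting times cannot be recovered from it, which is exactly the limitation noted above.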
Queueing disciplines that are difficult to model require specific job information,
often quantities such as how long a job has been in the system (global delay information),
or how long the job has been at the resource (local delay information). We will discuss
queueing disciplines more in Chapter 4 after proposing an information-based taxonomy
for simulation models.
2.4. Problems with a Taxonomy Based on the Simulation World Views
In (Schruben and Roeder 2003), the authors claim the simulation world views are better
treated as ways of implementing, not classifying, simulation models. This coincides with
their historical development, see (Kiviat 1969; Fishman 1973). Using them as a
taxonomy is problematic as they provide neither an exclusive nor an exhaustive means of
classification.
For example, there is periodic confusion on the categorization of certain models;
Petri Nets have been described both as activity scanning (Miller et al. 2004) and process
interaction (Seila et al. 2003). A proper taxonomy should classify models as one or the
other.
The world views are also unable to unambiguously account for resource-driven
and job-driven simulations: Both RD and JD simulations can be implemented as either
process interaction or event scheduling models. The RD/JD framework should not be
ignored, because it takes a higher-level view of models by focusing on the entities of
interest rather than on the actual implementation. It helps decide whether the emphasis of
a model should lie with the resources, with jobs, or with both; this is useful in developing
the model itself because it provides clarity.
The process interaction world view was developed as a mixture of the event
scheduling and activity scanning world views: “Conceptually, the process interaction
approach combines the sophisticated event scheduling feature of the event scheduling
approach with the concise modeling power of the activity scanning approach”
(Fishman 1973). This mixture of activity scanning and event scheduling is different from
the mixture found in the Three Phase Method, which explicitly adds event scheduling
elements to the activity scanning to reduce the number of activities that must be scanned
at each transaction. The activity scanning world view itself has been described either in
terms of activities that are started and ended by events (Kiviat 1969), or defined as
activities independently of events (Fishman 1973). The ability to overlap and mix world
views is necessary to adequately define the behavior of SPLs, but is not desirable in a
proper taxonomy, where there should be clear distinctions.
The simulation world views were originally put forth to describe the ways
simulation programming languages approach model implementation. Computer
programming and data structures have made great progress since the 1960s, and there are
many variations and refinements in implementations of simulation models. The Three
Phase Method is an early example of a refinement to the activity scanning approach to
reduce computational requirements. As methods of model implementation, the world
views still provide a useful high-level differentiation between SPLs. However, a modeler
should be aware that they are only rough means of classifying SPLs, and that there may
be many deviations in the details. For a detailed discussion of formalisms and
frameworks for simulation models, see (Page 1994).
2.5. Problems with a Taxonomy Based on the Resource-Driven and Job-Driven Paradigms
While intended to be an improvement on the simulation world views as a taxonomy, the
resource-driven and job-driven paradigms do not satisfactorily capture the behaviors of
systems and models of systems. For example, resource-driven simulations have been
defined as simulations where only integer counts of the jobs and available resources are
maintained (Schruben 2000). This is not the case in the simulation described in
Section 2.3; there, detailed information such as utilization or time since last failure is
maintained for all resources in the system. On the other hand, a job-driven simulation
has been defined as one where all jobs in the system are traced, with the implication that
no information on the resources is kept. This is also not true, as JD models often contain
all possible information in the system. In a sense, the name “job-driven” is a misnomer.
A large part of the problem with using the RD and JD paradigms as a taxonomy is
that they are ill-defined. While the model from Section 2.3 does not fit the definition
cited earlier from the literature, it is generally accepted as a resource-driven model. A
defining characteristic appears to be the fact that job information is not traced.
Classifying models based on job tracing is an option, but can lead to difficulties with so-
called mixed models, where jobs are traced in certain parts of the system but not in
others. It also does not allow differentiation between the RD model in Figure 4 and the
fab model from Section 2.3.
While it is not a requirement of a taxonomy, the RD/JD framework is not helpful
in deciding what aspects of a system can and cannot be modeled using a resource-driven
or mixed model (e.g., an RD simulation can model a FCFS service discipline with one
job type, but not two). The information taxonomy presented next stresses the information
needed to model desired system characteristics, or to obtain desired output statistics.
3. INFORMATION TAXONOMY
We have outlined problems associated with current taxonomies of simulation models,
both for the classical world views and the resource-driven/job-driven taxonomy proposed
in (Schruben and Roeder 2003). While much work continues to be done trying to use the
world views to capture models, we feel the fundamental problem is that the world views
are neither an exclusive nor an exhaustive classification because they try to characterize
implementations, not models.
The taxonomy proposed in this section draws heavily from the literature, but
focuses on the information contained in the system model. (Merriam-Webster 1993)
defines information as “facts, data.” The taxonomy allows the modeler to determine
what information is (data are) needed to model specific aspects of the system and,
conversely, to state what can and cannot be modeled given the informational constraints
on the model. The taxonomy will clarify why certain types of models are subject to
certain problems (e.g., inaccurate failure modeling in process interaction-based models);
and why certain statistics are not available for some types of problems (e.g., FCFS
waiting time distributions for a “resource-driven” model). Section 3.4 relates the
taxonomy to existing formalisms.
It is important to note that the taxonomy focuses on the model itself, not the
implementation of the model (though implementation can be aided through the use of the
taxonomy). Implementation can use a SPL based on any of the world views, or newer
methods such as object-oriented or web-based technologies (Healy and Kilgore 1997).
In addition to aiding in modeling decisions such as data structure design, the
taxonomy assists in the complexity analysis of the memory requirements of different
models. Computational requirements are not addressed directly, as they are very
dependent on the implementation; however, information (and memory) requirements may
give a rough estimate of the computational requirements, regardless of implementation.
For example, a queueing discipline based on a job’s total time in system requires ordered
insertions into a job queue, regardless of the chosen implementation.
3.1. Types of Information
The following subsections provide different high-level ways of classifying information.
While the classifications in any subsection are exclusive and exhaustive, the subsections
themselves are not mutually exclusive. We give examples of the different types of
information in each subsection. Section 3.2 combines the classifications from this section
into a single taxonomy. We then give examples of the information from the perspective
of the taxonomy as a whole.
In some cases, the classification of information depends on the system to be
studied, and on the purposes of the study. We give examples of different classifications
of the same information, depending on the situation.
3.1.1. General or Entity-Specific
At the highest level, we can classify information as being general information about the
system, or information about specific (resident or transient) entities in the system.
Examples of general information are the number of resources or the number of jobs
waiting in queue. Examples of entity-specific information are the last failure time for
each resource, or the waiting times for each job. We refer to the information about the
system itself as “non-subscripted,” while the information about entities is “subscripted”
(because that is how such information is generally represented mathematically). The
subscripted information is relevant if we are differentiating between the different entities.
If we are not interested in the waiting times of individual jobs, we may not need to
explicitly differentiate between jobs j and j’.
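The distinction maps naturally onto how the information would be stored; a minimal sketch (the variable names are ours, not from the model):

```python
# Non-subscripted (general) information: one value describes the system.
num_resources = 3          # number of resources in the group
jobs_in_queue = 5          # current queue count

# Subscripted (entity-specific) information: one value per entity.
last_failure_time = {"machine_1": 12.4, "machine_2": 30.1, "machine_3": 8.7}
waiting_time = {"job_17": 2.5, "job_18": 0.0}

# Memory grows with the number of entities only for subscripted information.
per_entity_records = len(last_failure_time) + len(waiting_time)
```

Dropping the subscripted dictionaries leaves the model able to run, but unable to answer entity-level questions such as "what was job 17's waiting time?"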
If we do not differentiate between resources, then the processing rate at a resource
group is general information about the system, i.e., non-subscripted. Different
classifications are possible if there are different processing rates for the k resources in the
same group. (k is system information.) We can either treat them as separate resource
groups, in which case the processing rates would be general information; or differentiate
between the k resources in one group, in which case the processing rates would be
subscripted information. The second approach is preferable if other resource information
is needed (e.g., utilizations), or to simplify job routing.
The subscripts referred to here should not be confused with subscripts introduced
to facilitate modeling. For example, a queueing network simulation that uses only counts
of the number of jobs and available resources at each stage may refer to jobs going from
station i to station i+1. The only information used in the model, however, is general
information.
3.1.2. Subscripted: Resource or Job

Subscripted (entity-related) information may pertain either to resources or to jobs.
Resources are resident entities that remain in the system for extended periods of time,
possibly for the entire duration of the simulation. Jobs are transient, and are more likely
to enter and leave the system.
In some cases, there may be ambiguity. For example, a worker is a resource, but
leaves the system to go on break or at the end of a shift. If the purpose of the simulation
is worker scheduling, it may make sense to treat the worker as a job. If the purpose is to
model the system as a whole, including the flow of jobs, the worker can be considered a
resource. (The same worker will be going on and off shift throughout the simulation,
while jobs do not usually re-enter the system once they have left.)
Another example of ambiguity is the case of a machine that requires servicing by
another resource, say, a worker. Although the machine is in the system throughout the
simulation, it becomes a job from the worker’s perspective. If the focus of the study is
maintenance scheduling, the machine could be treated as a job. If the study is also
explicitly modeling job flows, the machine more naturally fits in the role of a resource, as
it does not make sense to have a job processing another job. Since the machine is also
serviced by another resource, it can be seen in the context of a hierarchy of resources
(Schruben and Schruben 2001).
It is possible to classify entities as resident or transient in any simulation model.
As a general rule, entities that remain or reappear in the system throughout the study can
be considered resources. Those entities that will ultimately leave the system (even if
their sojourn is long) are jobs.
3.1.3. Subscripted: Local or Global

Entity information can be either local or global. When classifying information as local or
global, we are usually referring to temporal location: for example, a job's waiting time
at the current resource (locally temporal) or its total time in system (globally temporal).
Some literature also refers to spatially local or global information (Baker 1998). In our
context, globally spatial information is information that requires knowledge of the system
as a whole (general and entity information), while locally spatial information is
information (not necessarily delay-based) that is restricted to the current location or
resource (entity information).
Most resource information is global because resources are more likely to be
stationary entities (e.g., machines). There are instances where differentiating between
local and global resource information is necessary. For example, a worker servicing
machines may not be allowed to spend more than x time units at a given machine. The
time already spent at the machine can be considered local information for the worker.
The differentiation between local and global is more intuitive for jobs because
jobs are more likely to move through the system and have different experiences at
different stages. The experiences at different stages do not have to be dependent on one
another. In a queueing network, a job spends different amounts of time at the different
stations. These times are local information, and may be discarded when the job has
finished at a station.
3.1.4. Modeling, Statistics, or Both

Information may be used to model the system behavior, to estimate system statistics, or to
do both. For example, a job’s routing is used to model the system, but is not used
directly for output statistics. The cumulative area under the queue versus time curve for a
resource group is used for output statistics, to calculate the (time) average queue size.
The utilization of a resource may be used both to model system behavior (by assigning a
new job to the resource that has been utilized least) and to calculate statistics (to report
average utilization).
It is important to differentiate between these uses of the information. To merely simulate
a G/G/s queueing system, only one integer variable is needed: the number of jobs in the
system (see Figure 5). (System parameters are also required, i.e., the interarrival and
service rates, and the number of servers s.) If we wish to collect statistics
on the queue behavior, we need to introduce more variables, for example the largest
queue size, or the number of waiting job-hours. More desired output leads to larger
memory and computational requirements.
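The point can be illustrated with a minimal event-scheduling sketch of such a system. Exponential interarrival and service times stand in for the general G/G/s distributions, and all names are ours; the integer `jobs` is the only modeling variable, while `area` and `max_queue` exist purely to produce output statistics:

```python
import heapq
import random

def simulate_ggs(s, arrival_rate, service_rate, run_length, seed=0, collect_stats=False):
    """Minimal G/G/s sketch: `jobs` suffices for modeling; statistics need more."""
    rng = random.Random(seed)
    jobs = 0                      # the single integer modeling variable
    events = [(rng.expovariate(arrival_rate), "arrive")]
    area = 0.0                    # statistics only: area under queue-vs-time curve
    max_queue = 0                 # statistics only: largest observed queue
    last_t = 0.0
    while events:
        t, kind = heapq.heappop(events)
        if t > run_length:
            break
        if collect_stats:
            area += max(jobs - s, 0) * (t - last_t)
            last_t = t
        if kind == "arrive":
            jobs += 1
            if jobs <= s:         # an idle server starts service immediately
                heapq.heappush(events, (t + rng.expovariate(service_rate), "depart"))
            heapq.heappush(events, (t + rng.expovariate(arrival_rate), "arrive"))
        else:
            jobs -= 1
            if jobs >= s:         # a waiting job enters service
                heapq.heappush(events, (t + rng.expovariate(service_rate), "depart"))
        if collect_stats:
            max_queue = max(max_queue, max(jobs - s, 0))
    avg_queue = area / run_length if collect_stats else None
    return jobs, avg_queue, max_queue
```

With `collect_stats=False`, the state reduces to `jobs` plus the event list; each statistics variable added is extra memory and extra work per event.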
Similarly, the more detailed the system behavior, the more memory and
computational effort necessary. For example, simulating dedication constraints in
semiconductor fabs requires knowing which machine processed each wafer. If dedication
constraints are not an important element of the system, the job information need not be
stored (for this purpose).
This categorization is related to the classification of attributes used for control,
measurement, or both in (Gahagan and Herrmann 2001).
3.1.5. Static or Dynamic

Any information in a system is either static or dynamic. Static information does not
change over time, for example the number of resources in a particular resource group
(unless explicitly modeling resource acquisition/relinquishment). Static information is
usually an input parameter to the simulation.
Dynamic information can change value over time; an example is the number of
available resources at a resource group. Dynamic information is used during the
simulation run and reported as output statistics at the end.
An example where there may be ambiguity is the intensity function of a nonhomogeneous
Poisson process. In this case, we have two types of information: the functional
specification of the intensity is an input parameter (static), while the instantaneous
intensity is dynamic.
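A small sketch of the distinction, with an illustrative (made-up) daily-cycle intensity function:

```python
import math

# Static input parameter: the functional specification of the intensity.
def intensity(t):
    """Illustrative daily-cycle intensity for a nonhomogeneous Poisson process."""
    return 5.0 + 3.0 * math.sin(2 * math.pi * t / 24.0)

# Dynamic information: the instantaneous intensity at the current clock time.
clock = 6.0                       # current simulation time (hours)
current_rate = intensity(clock)   # changes as the simulation clock advances
```

The function object never changes during the run (static), but `current_rate` must be re-evaluated whenever the clock moves (dynamic).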
3.2. Classification
3.2.1. Diagram
Figure 6 shows a diagram of the information taxonomy we propose, incorporating the
classifications described above. The regions are not drawn to scale.
Figure 6: Information taxonomy
We define the following regions, and give examples of each in the next section.
I. Non-subscripted
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
      ii. Modeling and statistics
      iii. Statistics
II. Subscripted: Resources
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
         (a) Local
         (b) Global
      ii. Modeling and statistics
         (a) Local
         (b) Global
      iii. Statistics
         (a) Local
         (b) Global
III. Subscripted: Jobs
   1. Static
      i. Modeling
      ii. Modeling and statistics
   2. Dynamic
      i. Modeling
         (a) Local
         (b) Global
      ii. Modeling and statistics
         (a) Local
         (b) Global
      iii. Statistics
         (a) Local
         (b) Global
3.2.2. Example: Wafer Fab

In this section, we illustrate the information taxonomy using the semiconductor
simulation described in Section 2.3. Where appropriate, we indicate the impact the lack
of information had in the resource-driven event scheduling implementation of the model.
We also show where the process interaction model misclassifies information, leading to
unnecessary difficulties in model implementation. All resource information is global
because the resources in this model are stationary.
I.1.i.: Non-subscripted static information for modeling
• Number of job types
• Number of resource groups
• Wafer release times for each job type
• Queueing discipline at each resource group; queueing disciplines and other
operational rules are typically modeler- or system-defined, not characteristics of the
resources themselves.
• Service distributions; the distributions are modeler- or system-defined, not
characteristics of the resources themselves. In some cases, the true service
distributions are not even known.
• Processing rates; if the processing rates are the same for all resources in a group, this
information is general system information. If there are differences, it becomes
resource information (II.1.i).
• Job routings and processing times; this quantity can also be defined as a job
characteristic if desired. Since the routing is deterministic in this case, it makes more
sense as a general system parameter, rather than as a job attribute. In the case of
stochastic routing, the routing probabilities should be treated as system parameters,
while the actual routings themselves are job-specific.
• Preventive maintenance and failure schedule for each resource group.
• Batch sizes for each resource group. If several job types can be batched together, the
admissible combinations are also in this category.
I.1.ii.: Non-subscripted static information for modeling and statistics
• Simulation run length
• Number of resources in each group; used to assign jobs among the available idle
servers (with probability idle servers/total servers), and also to calculate resource
group statistics such as average utilization.
• Load/unload, setup times for resource groups
I.2.i.: Non-subscripted dynamic information for modeling
Bottleneck information is an example of the type of information in this category, though
it was not used in the fab simulation.
I.2.ii.: Non-subscripted dynamic information for modeling and statistics
• Current simulation clock time
• Number of available resources in each group; used to assign jobs among available
idle servers, and to calculate resource group statistics.
• Number of jobs in queue at each resource group; because this simulation does not
require much information about the queue sizes, it is sufficient to treat the queue
counts as global information. If more elaborate queue information were desired, the
queues themselves may be treated as resources.
I.2.iii.: Non-subscripted dynamic information for statistics
• Cumulative area under queue versus time curve; the sole purpose of these variables is
bookkeeping, to determine the average queue sizes during the simulation run.
II.1.i.: Resource static information for modeling
While there was no static resource modeling information here, an example is resource
qualifications.
II.1.ii.: Resource static information for modeling and statistics
This model contained no static resource information used for both modeling and
statistics.
II.2.i.a: Resource local dynamic information for modeling
While there was no local dynamic resource information in the model, an example is a
worker’s time at current location.
II.2.i.b: Resource global dynamic information for modeling
• Resource status
II.2.ii.a: Resource local dynamic information for modeling and statistics
While there was no local dynamic resource information for modeling and statistics in this
model, an example is the time a worker arrived at a machine group to perform
maintenance.
II.2.ii.b: Resource global dynamic information for modeling and statistics
• Dedicated queue size; in the case of resource dedication, each resource (in each
group) will have its own queue of jobs that have previously been processed by it. If
jobs are assigned to specific resources on arrival at the group, they would also be
included in this queue. The assignment may occur based on the number of jobs
already in the resource’s dedicated queue.
• Time to next preventive maintenance/failure for each type of PM/failure. In the
process interaction model, PMs and failures are not treated as attributes of a resource;
rather, they are seen either as general or as job information. As a consequence, PMs
and failures are modeled as high-priority jobs that make the resources “unavailable”
at the appropriate times.
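The dedicated-queue assignment rule mentioned above might be sketched as follows; the group structure and all names are hypothetical:

```python
# Hypothetical sketch: assign an arriving job to the resource (within a group)
# whose dedicated queue is currently shortest; ties broken by dictionary order.
def assign_to_dedicated_queue(dedicated_queues, job):
    resource = min(dedicated_queues, key=lambda r: len(dedicated_queues[r]))
    dedicated_queues[resource].append(job)
    return resource

queues = {"tool_a": ["j1", "j2"], "tool_b": ["j3"]}
chosen = assign_to_dedicated_queue(queues, "j4")
```

Note that this rule requires region II.2.ii.b information (one queue per resource); a model keeping only the aggregate queue count per group cannot express it.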
II.2.iii.a: Resource local dynamic information for statistics
While there is no local dynamic resource information for statistics in this model, an
example is the cumulative time a worker spends at different stations.
II.2.iii.b: Resource global dynamic information for statistics
• Cumulative time in PM/failure
• Cumulative time spent processing, idle, loading, unloading etc.
III.1.i.: Job static information for modeling
While there was no specific job-related static information used only for modeling here,
examples include due dates; travel rates for transitions between resources that do not
require a resource such as a conveyor belt (e.g. walking rate for amusement park
visitors); or job-specific routing for probabilistic routing. The due date could also be
used for book-keeping purposes, for example, if jobs are serviced based on earliest due
date.
III.1.ii.: Job static information for modeling and statistics
• Job type; if jobs are traced explicitly, the job type is used for routing purposes
(modeling), and also to gather statistics on jobs based on type. In a resource-driven
model, this information is known only implicitly in the events scheduled for a job
(e.g., start service on a job of type 3).
III.2.i.a: Job local dynamic information for modeling
• Position in queue; a job’s position in queue is local information, as it is no longer
relevant once the job has left the current queue. This is a job attribute, not a resource
or queue attribute. It is not known in a resource-driven simulation.
III.2.i.b: Job global dynamic information for modeling
None of the following are known in a resource-driven simulation.
• Remaining processing time; this is often used in determining job ordering in queues.
• Current location in system
• Resource ID used for processing; this information is used for modeling dedication
constraints. Since it is not available in the resource-driven implementation of the fab
simulation, dedication cannot be modeled.
III.2.ii.a: Job local dynamic information for modeling and statistics
• Time joined current queue; this information can be used for modeling, e.g., the FCFS
discipline will service the job with the earliest time. It is also used to gather waiting
time statistics for the job. This information is not known in a resource-driven
simulation.
III.2.ii.b: Job global dynamic information for modeling and statistics
• Time entered the system; this information can be used both for job ordering in queues
and for calculating cycle time statistics. This information is not known in a resource-
driven simulation.
III.2.iii.a: Job local dynamic information for statistics
• Waiting time. This information is not known in a resource-driven simulation.
III.2.iii.b: Job global dynamic information for statistics
• Cycle time. This information is not known in a resource-driven simulation.
• Though not used in this simulation, the job’s classification or grade (showing defect
levels) at the end of processing is also information that can be used for statistical
purposes. This information is not known in a resource-driven simulation.
3.2.3. Example: G/G/s Priority Queue with Two Job Types

In this section, we illustrate the information taxonomy with another example, a priority
queue with two job types and FCFS queueing within classes. Since it is a single-stage
system, there is no difference between local and global information. We assume detailed
output is desired.
I.1.i.: Non-subscripted static information for modeling
• Number of job types (2)
• Priority of each job type
• Arrival and service distributions for both job types
• Processing rates if s servers are functionally identical
I.1.ii.: Non-subscripted static information for modeling and statistics
• Simulation run length
• Number of resources (s)
I.2.ii.: Non-subscripted dynamic information for modeling and statistics
• Current simulation clock time
• Number of available resources
I.2.iii.: Non-subscripted dynamic information for statistics
• Cumulative area under queue versus time curve
• Cumulative area under queue versus time curves for the two job types
II.1.i.: Resource static information for modeling
• Processing rates if there are differences between the s servers
II.1.ii.: Resource static information for modeling and statistics
There is no static resource information for modeling and statistics.
II.2.i.: Resource dynamic information for modeling
• Resource status
II.2.ii.: Resource dynamic information for modeling and statistics
• Resource utilization
II.2.iii.: Resource dynamic information for statistics
• Cumulative time spent processing, idle, etc.
III.1.i.: Job static information for modeling
Not applicable in this example.
III.1.ii.: Job static information for modeling and statistics
• Job type
III.2.i.a: Job local dynamic information for modeling
• Position in queue
III.2.ii.a: Job local dynamic information for modeling and statistics
• Time joined current queue; within priority classes, there should be some discipline for
deciding which job to process next, e.g., FCFS.
III.2.iii.a: Job local dynamic information for statistics
• Waiting time
• Cycle time
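The modeling information listed here, job type (priority) and time joined queue, is exactly what is needed to order such a queue. A sketch, assuming lower numbers mean higher priority; job names and times are illustrative:

```python
import heapq

# Job ordering for a two-class priority queue, FCFS within classes.
# Each entry is (priority, arrival_time, job_id); the heap orders by
# priority first, then by arrival time (FCFS tie-break within a class).
queue = []
arrivals = [("j1", 2, 0.5), ("j2", 1, 1.0), ("j3", 1, 0.2), ("j4", 2, 0.1)]
for job_id, priority, arrival_time in arrivals:
    heapq.heappush(queue, (priority, arrival_time, job_id))

service_order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
```

Class-1 jobs are served in arrival order before any class-2 job, so `service_order` is `["j3", "j2", "j4", "j1"]`; without the "time joined queue" information, FCFS within classes could not be enforced.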
3.3. Complexity Analysis

One application of the information taxonomy presented here is that it facilitates the
complexity analysis of simulation models. The memory complexity can be derived from
the information needed to model the system and to find the desired output statistics.
Some computational complexity can be obtained from these requirements, though the
complexity depends heavily on the model implementation, which is not dictated by the
taxonomy. We use big-O notation (Landau 1909) to express the complexities.
We define the following quantities:
$\{\mathcal{G}(t) : 0 \le t \le t^*\}$ is the set of general system information; when the set does not change
over time, we omit the time index for simplicity, $\mathcal{G}(t) = \mathcal{G}\ \forall t$.

$\{\mathcal{R}(t) : 0 \le t \le t^*\}$ is the set of resources; when the set does not change over time, we omit
the time index for simplicity, $\mathcal{R}(t) = \mathcal{R}\ \forall t$.

$\{\mathcal{J}(t) : 0 \le t \le t^*\}$ is the set of jobs; when the set does not change over time, we omit the
time index for simplicity, $\mathcal{J}(t) = \mathcal{J}\ \forall t$.

$\{2^{\mathcal{R}(t)} : 0 \le t \le t^*\}$ is the power set of $\mathcal{R}(t)$; that is, the set of all subsets of $\mathcal{R}(t)$.

$\{2^{\mathcal{J}(t)} : 0 \le t \le t^*\}$ is the power set of $\mathcal{J}(t)$; that is, the set of all subsets of $\mathcal{J}(t)$.

$\{\mathcal{R}'(t) : 0 \le t \le t^*\}$ is an element of the power set of $\mathcal{R}(t)$, $\mathcal{R}'(t) \in 2^{\mathcal{R}(t)}$;
that is, a subset of all resources.

$\{\mathcal{J}'(t) : 0 \le t \le t^*\}$ is an element of the power set of $\mathcal{J}(t)$, $\mathcal{J}'(t) \in 2^{\mathcal{J}(t)}$;
that is, a subset of all jobs.

$J_{\max}$ is the maximum value of $|\mathcal{J}(t)|$, $J_{\max} \ge |\mathcal{J}(t)|\ \forall t$.

$\mathcal{M}$ is the set of modeling behaviors or rules,
$\mathcal{M} = \mathcal{M}_{\mathcal{G}} \cup \mathcal{M}_{\mathcal{R},\mathrm{local}} \cup \mathcal{M}_{\mathcal{R},\mathrm{global}} \cup \mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}}$; the set of system aspects that
are to be modeled, for example resource dedication or FCFS queueing. Subscripts
“local” and “global” refer to local and global behaviors.

$\mathcal{O}$ is the set of output statistics,
$\mathcal{O} = \mathcal{O}_{\mathcal{G}} \cup \mathcal{O}_{\mathcal{R},\mathrm{local}} \cup \mathcal{O}_{\mathcal{R},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}$; the set of
desired output statistics, for example, job waiting time distributions.
3.3.1. Memory (Storage) Requirements

We do not consider the memory required to store events on the future events list. This
quantity depends on the implementation chosen. Simulation engine processor
requirements are discussed in the next section.
The amount of general information typically used during a simulation run is
$O\big([\mathcal{M}_{\mathcal{G}} \cup \mathcal{O}_{\mathcal{G}}] \times \mathcal{G}\big)$. Since the amount of system information does not change during
the simulation, this quantity is fixed. For example, if we simulate an M/M/1 FCFS
system and desire only average queue statistics, $\mathcal{M}_{\mathcal{G}} \times \mathcal{G}$ contains the interarrival and
service rates, as well as the server status (two real-valued and one integer-valued variable).
The simulation run length and current time are also in this category (two real variables).
$[\mathcal{M}_{\mathcal{G}} \cap \mathcal{O}_{\mathcal{G}}] \times \mathcal{G}$ contains the current queue size (one integer variable), and $\mathcal{O}_{\mathcal{G}} \times \mathcal{G}$ is the
cumulative area under the queue versus time curve (one real variable).
If we trace resources, the amount of resource-related information is
$O\big([\mathcal{M}_{\mathcal{R},\mathrm{local}} \cup \mathcal{M}_{\mathcal{R},\mathrm{global}} \cup \mathcal{O}_{\mathcal{R},\mathrm{local}} \cup \mathcal{O}_{\mathcal{R},\mathrm{global}}] \times \mathcal{R}\big)$. (To simplify notation, we will
aggregate local and global resource information in this section, since there are a limited
number of cases where there is explicit local and global information for resources.) Since
the number of resources does not usually change after the simulation has begun, the
amount of information stored here is also a fixed quantity. The amount of information
depends on the number of resources in the system, $|\mathcal{R}|$.
In some cases, we may be interested only in tracing a subset of the resources, e.g.,
photolithography tools in a wafer fab. In this case, the amount of information is
$O\big([\mathcal{M}_{\mathcal{R}} \cup \mathcal{O}_{\mathcal{R}}] \times \mathcal{R}'\big)$. For example, let there be two machine groups, with $k_1$ and $k_2$
machines, respectively. Each machine has a different error rate (qualification), and we
wish to find the utilization and the average percentages of time loading/unloading for
each machine. The error rate information (which may be used to assign jobs to
machines) is in $\mathcal{M}_{\mathcal{R}} \times \mathcal{R}$ ($k_1 + k_2$ real variables). The cumulative times busy (utilization)
and loading/unloading are not used for modeling, and are information in $\mathcal{O}_{\mathcal{R}} \times \mathcal{R}$
($3(k_1 + k_2)$ real variables). If we only want the output information for group 2, $\mathcal{R}' = \{2\}$,
the amount of output information is $\mathcal{O}'_{\mathcal{R}} \times \mathcal{R}'$, i.e., $3k_2$ real variables.
If we trace jobs, we must differentiate between local and global information.
Overall, the information required is $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big)$.
Transaction tagging (Schruben and Yücesan 1988) will track the information for only a
subset of jobs, and will require $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}'\big)$
information.

Memory requirements can also be reduced by maintaining only global
information for all jobs. This requires $O\big([\mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big)$ information. Only
tracing jobs at certain resources (e.g., to find waiting time distributions at the bottleneck)
uses $O\big([\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}}] \times \mathcal{J}'\big)$ information, where $\mathcal{J}'$ may be a function of $\mathcal{R}'$.

The advantage of storing only local job information is that
$\big|[\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}}] \times \mathcal{J}'\big| \le \big|[\mathcal{M}_{\mathcal{J},\mathrm{local}} \cup \mathcal{M}_{\mathcal{J},\mathrm{global}} \cup \mathcal{O}_{\mathcal{J},\mathrm{local}} \cup \mathcal{O}_{\mathcal{J},\mathrm{global}}] \times \mathcal{J}\big|$
when $\mathcal{J}' \subseteq \mathcal{J}$. The
difference can be significant. In a fab, $|\mathcal{J}(t)|$ can be several thousand wafers, while
$|\mathcal{J}'(t)|$ may be only several hundred.
If job data are stored in static arrays of size $N$, $N$ must be large enough to ensure
the simulation will not run out of space. That is, $N \ge J_{\max}$, where $J_{\max}$ is a random
variable. This can be a very inefficient use of memory. If the data are stored in dynamic
ranked or ordered data structures, less memory may be required, $O(|\mathcal{J}(t)|)$ rather than
$O(J_{\max})$. However, more computational effort will be needed to create and delete list
elements, as well as for index-based lookup. This is addressed in Section 3.3.3.
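A sketch of the two storage strategies (the sizes and record layout are illustrative):

```python
from collections import deque

# Static allocation: N must exceed the (random) peak job count J_max,
# so memory is O(J_max) regardless of the current load.
J_MAX_BOUND = 10_000
static_store = [None] * J_MAX_BOUND

# Dynamic allocation: memory is O(|J(t)|), growing and shrinking with the
# jobs actually in the system, at the cost of allocation work per job.
dynamic_store = deque()
for job in range(250):                    # 250 jobs currently in system
    dynamic_store.append({"id": job, "entered": 0.0})
```

Here the static array reserves 10,000 slots to hold 250 records, while the deque holds exactly as many records as there are jobs.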
3.3.2. Processor Requirements: Simulation Engine

The requirements of the simulation engine will depend heavily on the implementation
chosen. For example, some activity scanning model implementations involve repeated
scanning of all activities, which is less efficient than scheduling events if the number of
activities is large.
At a low level, all three world views place “events” to occur in the future on an
events list. An important note is that, even if job information is not stored, at least one
event will be scheduled for each job. In the G/G/s system in Figure 5, each job will
arrive and complete service, so there are two events per job. If the system is congested,
there will be many events scheduled to occur in a short amount of time. This can slow
the simulation, so the effects of congestion are felt even if job information is not being
stored explicitly. The maximum number of events on the events list at any point in time
in this example is limited to s+1 (s finishes and one arrival), but if arrivals are happening
quickly, many events are added to and removed from the list in a short amount of time.
3.3.3. Processor Requirements: Data Manipulation

The amount of general system information is fixed, so data manipulation for it only
involves changing values. This can be done in O(1) time.
The number of jobs varies during the run for non-closed systems, so manipulation
is required if jobs are ordered. Specifically, it has been shown that insertions into an
ordered list are done in $O(|\mathcal{J}(t)|)$ time. Alternately, a list can be sorted in
$O(|\mathcal{J}(t)| \cdot \log_2 |\mathcal{J}(t)|)$ time (Williams 1964). This is the case whether job information is stored
in statically- or dynamically-sized data structures. If the size of the data structure
changes as $\mathcal{J}(t)$ changes, additional computation is required to add and delete
elements, and for pointer manipulations to sort items.
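A sketch of the two approaches for a queue ordered by, say, remaining processing time (the values are illustrative):

```python
import bisect

# Maintaining an ordered job queue: a single ordered insertion shifts up to
# |J(t)| elements, i.e., O(|J(t)|) time per insertion.
remaining_times = [1.5, 2.0, 4.0, 7.5]        # kept sorted at all times
bisect.insort(remaining_times, 3.2)           # one O(n) ordered insertion

# Alternately, an unordered list can be fully re-sorted when needed,
# which costs O(|J(t)| log |J(t)|) per sort.
unordered = [7.5, 1.5, 4.0, 2.0, 3.2]
unordered.sort()
```

Which approach is cheaper depends on how often jobs arrive relative to how often the ordering is consulted; both produce the same ordered queue.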
3.4. Relationship to Current Taxonomies and Formalisms

We have proposed a taxonomy of simulation models based on the information required to
accurately represent the system, and to obtain performance estimates. This taxonomy is
independent of the implementation approach. It differs from the formalisms described in
the literature by focusing on the levels of information detail needed to model the system.
Less focus is placed on classifying the information based on its type (descriptive, informational,
integer, real, etc.), or on the relationship between pieces of information and model
implementation; the taxonomy is a tool for model development. It may suggest a means
of implementation, but the implementation is not the focus.
The formalisms in the literature seek to integrate and synthesize the system
model, while the information taxonomy decouples the information in the model and
classifies each piece independently. Many of the existing methods decompose models
into objects and subobjects and then try to integrate them. The taxonomy does not force
the information into a specific form (i.e., an object or entity), but clarifies what is
contained in the system. This clarity serves the purpose of including only the necessary
information in a simulation model.
In (Nance et al. 1999), the authors argue that some redundancy in model
specification can be beneficial, e.g., for model extensions and reusability. We do not
dispute this statement, merely assert that redundancies in a model should be put there
knowingly, not inadvertently.
One of the purposes in undertaking this classification is to make explicit the
concept that often much less information is required than is included in the model.
Conversely, we use the classification to show why certain types of models are unable to
model certain things. Many of the ideas we propose are not new, but we frame them in a
different context. In this section, we compare the taxonomy to existing formalisms.
3.4.1. Classical World Views

As stated in Section 2.4, the classical world views show how SPLs approach model
implementation. Because of the many variations possible in implementation, a taxonomy
based on implementation is not able to provide a robust framework. The information
taxonomy can be used to classify the information used by both the classical world views,
and by other SPLs.
STPNs use exclusively general system modeling information (region I in the
taxonomy). Because of this, the ability to provide job statistics (region III) is extremely
limited.
To overcome this limitation, Haas (2002) develops a method to determine delay
distributions for certain model types by adding information in region III.2.iii (dynamic
job information for output statistics). Specifically, define $\{V_n : n \ge 0\}$ as the sequence of
“start vectors” recording the starts of delay intervals for all ongoing delays after the $n$th
STPN marking change. The start vectors are updated by inserting, deleting, or reordering
elements depending on the current state and the transition(s) that fired at the $n$th marking
change. The vector sequence $\{W_n : n \ge 0\}$ contains the indices of the marking changes that caused
each delay to be added to the start vector. It is used to sort the delay times in order of
increasing start times if overtaking occurs. $\{W_n : n \ge 0\}$ is updated in the same way as
$\{V_n : n \ge 0\}$.
For example, to find job cycle times for a G/G/1 queueing system, the current
time is inserted in the leftmost position of the start vector $V_n$ if the transition firing
corresponds to a job arrival. If the transition firing corresponds to a job finishing service,
the rightmost element of $V_n$ is removed and the job’s cycle time is calculated (by
subtracting the time just removed from the start vector from the current time). No other
transitions cause changes to $V_n$ in this example. Since there is no overtaking in a
G/G/1 queue, $W_n$ is not necessary.
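The update rule can be sketched directly; we use a deque for the start vector, and the transition labels are ours:

```python
from collections import deque

# Start-vector bookkeeping for G/G/1 cycle times: on an arrival, insert the
# current time at the left of V; on a service completion, remove the
# rightmost (oldest) element and record a cycle time. No overtaking occurs
# in a G/G/1 queue, so the index vector W is not needed.
def process_marking_change(V, cycle_times, transition, now):
    if transition == "arrival":
        V.appendleft(now)
    elif transition == "service_end":
        start = V.pop()               # oldest ongoing delay finishes first
        cycle_times.append(now - start)

V, cycle_times = deque(), []
for transition, now in [("arrival", 0.0), ("arrival", 1.0),
                        ("service_end", 2.5), ("service_end", 4.0)]:
    process_marking_change(V, cycle_times, transition, now)
```

After the four marking changes, the recorded cycle times are 2.5 and 3.0; note that the modeling state of the STPN itself is untouched, since this information serves output statistics only.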
The extension is not available for all types of models because the information
added to the model is output information only. It is not used for modeling; the
information used for modeling continues to draw from region I only.
Process interaction models use information from all areas, but suffer from a lack
of clear classification of information. For example, failures must be modeled as dummy
(high-priority) jobs because failure information is not properly classified as resource
information. The only other way to have failures affect resource behavior is to model
them as jobs.
The event scheduling world view per se does not limit itself to any specific
regions, nor does it misclassify information. At the lowest level, all SPLs schedule
events (whether to start and end activities or to advance a process). In that way, event
scheduling SPLs allow the lowest-level control of a model, and do not force the modeler
into unnatural programming situations.
3.4.2. Resource-Driven and Job-Driven Paradigms

The taxonomy allows us to formally define the previously vague notions of “resource-
driven” and “job-driven.” Specifically, a resource-driven model contains information
from at most regions I and II. A job-driven model contains information from all three
regions. It is possible for a resource-driven model to contain only information in
region I, e.g., the model in Figure 4. In contrast, the resource-driven fab model from
Section 2.3 falls in regions I and II.
The limitations of RD models are due to the lack of information in region III. For
example, lack of modeling information can lead to inability to model dedication
constraints, while the lack of statistics information leads to the inability to report waiting
time distributions. (We show how these problems can be addressed in Sections 4.3
and 4.4.)
The RD/JD paradigm takes a higher-level view of simulation models than the
world views by shifting away from the implementation and identifying the type of
information that is the focus of the model: resource or job information. The information
taxonomy presented here takes this classification a step further by formalizing the
classification categories and allowing a more strict sorting of information.
3.4.3. Entity-Attribute-Set

In the Entity-Attribute-Set approach to modeling, the system is represented as a
collection of entities, which may have attributes. Entities can be attributes of other
entities, and the system itself is also considered an entity. Entities can be grouped in sets.
The SPL SIMSCRIPT takes this approach to modeling (Markowitz 1979); the modeler
defines events that change entity attribute values.
In contrast to the information taxonomy presented here, the Entity-Attribute-Set
approach describes the way SIMSCRIPT implements a simulation model. Because of
this, the modeler is forced to classify everything as an entity, and to define the
relationships between entities. The information taxonomy encourages the most intuitive classification of information, whether as an entity (resource or job), as the attribute of a resource or job, or as independent system information, which need not be entity-related.
The two approaches are similar in that information can be defined as relating to (e.g., as an attribute of) a resource or job, and in that types (sets) of resources and jobs can be defined. However, if we are differentiating between types of entities, as opposed to between the entities themselves, it is likely that we are dealing with general system information, not information from regions II or III. If explicitly differentiating between entities, the commonalities of the entities are of less interest than the differences (e.g., different levels of utilization).
3.4.4. Discrete Event System Specification (DEVS)

The DEVS formalism takes a systems-theoretic approach to modeling (Zeigler 1976).
From (Zeigler 2003), define the following variables:
X      set of input places
X^b    collection of bags over X (sets with possibly repeated elements); i.e., the set of
       possible outside events
S      set of states
Y      set of output places
e      elapsed time since the last transition
t_a    time advance function, t_a : S → [0, ∞]
Q      total state set, Q = {(s, e) | s ∈ S, 0 ≤ e ≤ t_a(s)}
δ_int  internal transition function, δ_int : S → S; specifies the transition that occurs from
       state s if no external events occur and e is allowed to advance to t_a(s)
δ_ext  external transition function, δ_ext : Q × X^b → S; the transition that occurs from
       state s if external event x occurs after time e is given by δ_ext((s, e), x)
δ_con  confluent transition function, δ_con : Q × X^b → S
λ      output function, λ : S → Y^b

Then a DEVS model is specified by M = {X, S, Y, δ_int, δ_ext, δ_con, λ, t_a}.
The system can be broken down into submodels, which are coupled to form the
complete model. Each atomic component has input and output ports, and functions as a
black box when inserted into a larger model. DEVS is concerned with specifying the basic models and their relationships to each other. Reusability of submodels is an important aspect of this approach.
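As an illustration of the formalism (a hypothetical sketch of our own, not taken from Zeigler; all names and the constant service time are illustrative), a DEVS atomic model for a single server might be encoded as follows:

```python
import math

class Processor:
    """Sketch of a DEVS atomic model: a single server holding a job count.

    State s is (phase, n); inputs X are job arrivals; outputs Y are departures.
    """
    SERVICE_TIME = 2.0  # illustrative constant service time

    def __init__(self):
        self.phase, self.n = "idle", 0  # state s in S

    def ta(self):
        """Time advance t_a(s): time until the next internal transition."""
        return self.SERVICE_TIME if self.phase == "busy" else math.inf

    def delta_int(self):
        """Internal transition delta_int: a service completes."""
        self.n -= 1
        self.phase = "busy" if self.n > 0 else "idle"

    def delta_ext(self, e, xb):
        """External transition delta_ext((s, e), x^b): a bag of jobs arrives."""
        self.n += len(xb)
        self.phase = "busy"

    def out(self):
        """Output function lambda(s): emitted just before delta_int fires."""
        return ["job_done"]

# Minimal hand-driven trace (no coupled-model simulator):
p = Processor()
p.delta_ext(0.0, ["job1", "job2"])   # two arrivals at time 0
assert p.ta() == 2.0                 # busy, next completion in 2 time units
p.delta_int()                        # first job done
p.delta_int()                        # second job done
assert p.phase == "idle" and p.ta() == math.inf
```

A coupled model would connect the output port of one such component to the input port of another; the component itself behaves as a black box.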
The information taxonomy does not attempt a specification of the whole model.
It can assist in organizing the information used to identify X, S, and Y. The relationships
between states (as defined by the δ functions) or the actual output mapping λ are not
interesting in the context of the taxonomy.
3.4.5. Conical Methodology

The Conical Methodology is "an approach which embodies top-down definition of a
simulation model coupled with bottom-up specification of that model" (Nance 1979).
The two phases may be done iteratively and simultaneously. In the model definition
phase, the model is decomposed into objects and subobjects, similar to the approaches
taken in the Entity-Attribute-Set approach and DEVS (it draws from both). In this phase,
object attributes are also typed.
The specification phase uses the Condition Specification from (Overstreet 1982)
to describe model behavior by working up through the hierarchical list of objects and
attributes. There are three types of conditions:
1. interface specification: defines model inputs and outputs
2. model dynamic specification
   a. objects: defines attributes, as well as their relationships to the objects in the model
   b. action clusters: defines conditions and the set of actions to be taken when the conditions are met
3. report specification: defines output data, and how they are to be computed
The information taxonomy is comparable to conditions 1, 2a, and 3 in the Condition
Specification. Unlike condition 2b, it does not define action clusters or other model
dynamics. The information taxonomy does not force the classification into objects or
attributes of objects. Its classification works with the possibly independent pieces of
information contained in the model, but not with the relationship between these pieces of
information.
3.4.6. Object-Oriented Modeling

Object-oriented modeling has become more common with the popularity of languages
like C++ and Java (Healy and Kilgore 1997). In object-oriented
programming, the system is decomposed into objects; objects are encapsulated, which
means that information is assigned strictly to one object. There may be an overarching
system object that calls functions on the other objects to change their attribute values.
The information taxonomy is related to object-oriented modeling in that objects
may be a natural way of implementing a model that has been classified using the
taxonomy. For example, the system itself is an object with associated information. If
maintaining individual resource information, each resource is an instantiation of a
resource class.
The main difference between object-oriented modeling and the taxonomy is that
the former is a means of implementing models; the latter can serve as a tool to aid the
development of an object-oriented model by organizing the information, but need not
lead to an object-oriented model.
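A minimal sketch of this correspondence (class and attribute names are our own, purely illustrative): system-level (region I) information lives in a system object, while per-resource (region II) information lives in instances of a resource class.

```python
class Resource:
    """Region II information: attributes of an individual resource."""
    def __init__(self, name):
        self.name = name
        self.busy_time = 0.0   # per-resource statistic

class SystemModel:
    """Region I information: general, system-level information."""
    def __init__(self, n_resources):
        self.clock = 0.0
        self.n_in_queue = 0    # general system information
        self.resources = [Resource(f"R{i}") for i in range(n_resources)]

    def record_service(self, resource, duration):
        # The overarching system object calls methods on its resources,
        # changing their attribute values (encapsulation).
        resource.busy_time += duration

model = SystemModel(2)
model.record_service(model.resources[0], 5.0)
assert model.resources[0].busy_time == 5.0
assert model.resources[1].busy_time == 0.0
```

The taxonomy only organizes the information; whether it is then implemented with classes like these is a separate design decision.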
4. IMPLICATIONS
4.1. General Implications on Modeling

Our research has focused on making simulation models more efficient. On the one hand,
we have looked at the impact modeling methodologies themselves (world views) have on
simulation execution times. On the other hand, we have developed an information
taxonomy to assist in modeling. It focuses on the information needed for a simulation
model to accurately represent the system in terms of the modeling objective.
The taxonomy provides clarity on why certain models or modeling approaches are
insufficient in some cases. For example, a STPN cannot find waiting time distributions
for a LCFS G/G/s queueing system because it does not have access to the required job
information to model job ordering in queue; it has information from region I, not
region III.
The taxonomy further tells the user whether a given modeling approach will be
able to model the desired system behavior. For example, if the user wishes to determine
the waiting time distribution for a LCFS G/G/s queueing system using a STPN, the
taxonomy can tell the user that this is not possible, because this requires information from
region III.2.ii.a, which is not available in the STPN. This can save the user time because
(s)he need not look for an answer (i.e., a STPN) that will give the desired statistics. Such
an answer does not exist. Rather, the user can focus on an alternate implementation, or
on developing approximations or extensions to STPNs.
4.2. Implications on Modeling: Service Disciplines

In this section, we show how the information taxonomy can be applied to
queueing/service disciplines. There is some literature classifying service disciplines, but
the focus of the classifications has not been on model implementation requirements.
(Jackson 1957) distinguishes between static and dynamic rules; job priorities do
not change over time with static rules (e.g., priority classes with random selection within
classes). With dynamic rules, the priorities may change (e.g., critical ratio ranking).
(Conway and Maxwell 1962) classify jobs as using only local job information, or using
global information about any aspect of the system. These two approaches are combined
in (Moore and Wilson 1967) to form a two-dimensional classification of static/dynamic and local/global.
(Panwalkar and Iskander 1977) survey over 100 scheduling rules used in the
literature. They propose a broad classification of rules into Simple Priority Rules,
Heuristic Scheduling Rules, and Other Rules. The first category contains rules that use
job information, as well as rules that use queue sizes to assign jobs to servers and rules
that do not use any specific information (e.g., random assignment). Subclassification is
“based on information related to (i) processing times, (ii) due dates, (iii) number of
operations, (iv) costs, (v) setup times, (vi) arrival times (and random), (vii) slack (based
on processing and due dates), (viii) machines (machine-oriented rules), and (ix)
miscellaneous information.” Rules can be combined directly, or combined using
different weights. Heuristic rules use more complicated logic that often weighs different
scheduling alternatives before selecting the most appropriate one. Heuristic rules may
incorporate Simple Priority Rules. The third category of rules contains shop-specific
rules or other rules that do not fall in the other two groups.
While this classification does use information to classify service disciplines, there
is no differentiation between the types of information used. For example, processing
times and arrival times (at the machine) use only local job information, while due dates
use global job information. Machine-oriented rules do not use any job information, but
are placed in the same category. Heuristic rules also use information, but this
information is not used to explicitly categorize the rules. Rather, the classification of
“heuristic” is based on the types of decisions made to schedule jobs.
(Yoshida and Touzaki 1999) propose a quantitative measure to compare service
disciplines in a manufacturing environment. It evaluates the “closeness” of service
disciplines based on performance measures for the problem. The authors do not attempt
a general classification of service disciplines, especially in terms of the information
required to simulate or model them.
(Gahagan and Herrmann 2001) propose a queue controller that specifies the entity attribute by which the queue is sorted. There is no differentiation
between the type of information used (e.g., local or global job information). The queue
controller is able to model eight of the eleven most common dispatching rules given in
(Vollmann et al. 1988). The three it is unable to model are critical ratio, slack time per
operation, and “next queue.” Critical ratio is discussed more in Section 4.2.3.4. The
reason it cannot be implemented in this case is that the queue controller sorts entities
based on one attribute. Critical ratio requires two attributes (due date and remaining
processing time) to calculate the sorting criterion. Slack time per operation similarly
requires more than one attribute (due date, remaining processing time, number of
remaining operations). “Next queue” compares the queue sizes of the machine groups
jobs will visit next. The next job served is the one that will be joining the shortest queue.
This rule requires dynamic information about other parts of the system, and the jobs in
queue would have to be reordered repeatedly as queue sizes elsewhere change.
4.2.1. Clarification of Terminology

Service disciplines refer to the rules a server uses to decide which of the waiting jobs to
process next (Gross and Harris 1998). We are not considering the assignment of an
available server to a newly-arrived job. That is, we are interested in the job selection that
occurs on the FINISH-START edge in Figure 4, not the server selection on the ENTER-
START edge.
There is some ambiguity in the literature about the difference between a “service
discipline” and a “dispatching rule.” (Baker 1998) provides a helpful discussion
highlighting that the two are the same, though the term “dispatching rule” is typically
used in a scheduling context. There, we need to differentiate between dispatching and
scheduling. Dispatching uses rules like the ones we are discussing here, where servers
determine in real-time which job to process next. Scheduling determines in advance which jobs will be processed when, and by which server. It has some objective, for example,
to minimize the total waiting time experienced by all jobs.
Some of the service protocols we address here are service protocols only in a
loose sense. For example, the critical ratio ranking (Section 2.3.1) is not directly used by
a server to decide which job to process next. Rather, it is used to sort the server’s queue
when a job arrives. The server uses a FCFS protocol to pick its next job; the first job in
queue is the first to be served, though because of the critical ratio ranking, it may not
have been the first job to join the queue. We consider it a service protocol because we
could model this ordering as leaving jobs unsorted when they arrive, and having the
server choose the job with the smallest critical ratio to process next. The same applies to
other rules which sort jobs in the queue.
4.2.2. Goals and Assumptions

Service disciplines are an important component of system models, and may even be the
focus of the study. They can have an impact on queue sizes and, relatedly, job waiting
and cycle times.
Server utilization is unlikely to be affected by the service discipline, since we
assume the servers will eventually serve all jobs, unless the system is unstable (λ ≥ s⋅ν).
The service protocol may affect the number of jobs processed if it affects the
waiting/cycle times, which, in turn, cause jobs to balk or have to involuntarily leave the
queue. For example, parts may have to be scrapped if it takes too long to process them.
This will affect the server utilization. In these cases, however, we would be more
concerned with the number of balked jobs than the server utilization.
Correct modeling of service disciplines may be important not just for output
statistics (subregions iii), but also for modeling. For example, the order in which jobs
leave the server may be important. We discuss this problem and how to solve it without
using information from region III in Appendix B.
Job preemption (Buzacott and Shanthikumar 1993) can be modeled using a small
amount of additional general information; we must know what should be done with the
preempted job, and possibly store the remaining service time.
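A hypothetical sketch of that extra bookkeeping (the function and policy names are our own): for a preempt-resume policy the remaining service time must be stored; for a preempt-repeat policy no job-specific time is needed.

```python
def preempt(clock, start_time, service_time, policy="resume"):
    """Return the service time to store for a job preempted at `clock`.

    resume: keep the remaining service time (general bookkeeping only);
    repeat: service restarts from scratch, so the original time is reused.
    """
    if policy == "resume":
        remaining = service_time - (clock - start_time)
        return max(remaining, 0.0)   # clamp in case service already finished
    return service_time

assert preempt(7.0, 4.0, 10.0) == 7.0            # 3 of 10 units completed
assert preempt(7.0, 4.0, 10.0, "repeat") == 10.0
```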
4.2.3. Service Discipline Taxonomy

The information required to model a service discipline depends on the desired output
statistics. For example, if we only need the average queue size in a G/G/s system, the
number of jobs in queue and the number of available resources are sufficient to model the
system for both FCFS and LCFS disciplines. For now, we assume that job waiting and
cycle times are desired; that is, we want output in regions III.2.ii.a and b (or III.2.iii.a and
III.2.iii.b). We classify the service disciplines based on the types of modeling
information required (subregions 1.i and 2.i in all three main regions).
If information from several regions is used, the service discipline is classified in
the region whose memory requirements tend to dominate. For example, a discipline that
requires general and local job information is classified under local job information. If
local and global job information is needed, the discipline is classified under global
information.
4.2.3.1. General Information

Disciplines that require only general information (subregions 1.i and 2.i of region I) are
ones that do not prioritize jobs based on the time they have spent in the system, or their
arrival time at the current queue. Examples of such disciplines are
• Random selection
• “Mob Rule”: with multiple job types, next job is selected from the type that has
the largest number waiting (used as a substitute for FCFS in the RD fab model
from Section 2.3). Job “types” may be different classes of jobs, or may be the
same class of job at different processing stages (for re-entrant systems).
• Priority: certain job types have priority over others; examples include push and
pull disciplines based on processing stage. These are also known as the first-
buffer-first-served (earliest processing stage) or last-buffer-first-served (latest
processing stage) disciplines (Govil and Fu 1999).
• Batching: a certain minimum number of jobs is required before service can begin;
between (min batch size) and (max batch size) jobs are serviced at the same time
In all these examples, there cannot be delay-based queueing within groups. For example,
it is not possible to model a two-class priority server with FCFS service within a priority
class using only general information if waiting and cycle times are the desired output.
Among possible service disciplines, the ones in this category require the least
amount of information. The information used is already contained in the model and is
needed for basic model functionality.
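To make this concrete, the "Mob Rule" selection above can be implemented from counts alone (a hypothetical sketch; only the number waiting per job type, region I information, is consulted):

```python
def mob_rule(counts):
    """Select the job type with the largest number waiting.

    `counts` maps job type -> number in queue. No per-job (region III)
    information such as arrival times or time in system is used.
    """
    waiting = {t: n for t, n in counts.items() if n > 0}
    if not waiting:
        return None
    return max(waiting, key=waiting.get)

assert mob_rule({"A": 2, "B": 5, "C": 0}) == "B"
assert mob_rule({"A": 0, "B": 0}) is None
```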
4.2.3.2. Resource Information

Service disciplines that require resource information (subregions 1.i and 2.i of region II)
are much less common than those that use job information; resource information is more
likely to be used in selecting a server to process a job, rather than in selecting a job to
begin service.
An example of a job selection rule that uses resource and general information is
one where jobs are put in queue based on resource characteristics they need (e.g., a
resource qualified to perform a specific task). We are not differentiating between jobs, so
the job counts are general information. When a resource becomes available, it selects at random one of the jobs that need its qualifications if such jobs are present, and otherwise selects any waiting job at random. In this example, job-specific statistics would not be
available.
4.2.3.3. Local Job Information

Service disciplines that require local job information (subregions 1.i and 2.i.a of
region III) use information about the job’s current situation. Examples include:
• FCFS
• LCFS
• Shortest processing time: the job with the shortest processing time is served first.
This is common in scheduling problems (Pinedo 2002). It is possible to model
this using only general information if the service times are generated not at job
arrival, but rather as order statistics at the start of service. The ability to do this
may be situation-dependent.
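One situation where this trick clearly works is exponential service: the minimum of k i.i.d. Exp(μ) draws is itself Exp(kμ), so the first order statistic can be sampled directly at the start of service, without tagging each job with a time at arrival. The sketch below (our own illustration of the memoryless case, not a general proof) verifies the distributional equivalence by simulation:

```python
import random

random.seed(42)
MU = 1.0       # service rate; exponential (memoryless) case assumed
K = 4          # jobs in queue when the server becomes free
N = 200_000    # replications

# Generate all K times at arrival and serve the shortest first...
min_of_k = [min(random.expovariate(MU) for _ in range(K)) for _ in range(N)]
# ...versus sampling the first order statistic directly at the start
# of service: min of K Exp(MU) draws is Exp(K * MU).
direct = [random.expovariate(K * MU) for _ in range(N)]

m1 = sum(min_of_k) / N
m2 = sum(direct) / N
# Both sample means should be near the theoretical 1 / (K * MU) = 0.25.
assert abs(m1 - 1 / (K * MU)) < 0.01 and abs(m2 - 1 / (K * MU)) < 0.01
```

For non-memoryless service distributions the remaining jobs' times are no longer independent after the first draw, which is one reason the ability to do this is situation-dependent.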
4.2.3.4. Global Job Information

Service disciplines that require global job information (subregions 1.i and 2.i.b of
region III) use information about the job’s experience in the system as a whole. They
may also use general or local job information to select among jobs that have the same
global evaluation criteria. Examples include:
• Earliest due date: the job whose due date is closest is served first; if due dates are
past, the job that is most tardy is selected.
• Dedication: jobs queue for the specific resource by which they were previously
serviced, or to which they have been assigned.
• Critical ratio: see below
Critical ratio ranking (see Rose 2002) is a common discipline used in the semiconductor
industry. A survey of additional dispatching rules used in semiconductor fabs can be found in
(Atherton and Atherton 1995). While there are slight implementation differences in some
software packages, the critical ratio (CR) is fundamentally defined as the ratio of the
job's tardiness over its remaining processing time,

    CR = 1 + (due date − current time) / (total remaining processing time).    (1)
The job-driven implementation of the fab model in Section 2.3 implements a variation of
the critical ratio (Fischbein 2002):

    expected time in system − actual time in system.    (2)

While (2) does not require the specification of a due date, it does still require global job
information (actual time in system).
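The two priority indices can be sketched as simple functions (the helper names are hypothetical; the formulas are (1) and (2) above):

```python
def critical_ratio(due_date, now, remaining_processing_time):
    """Critical ratio as in (1): 1 plus the due-date margin over remaining work.

    Requires global job information: the due date and remaining processing
    time of each individual job (region III)."""
    return 1 + (due_date - now) / remaining_processing_time

def cr_variant(expected_time_in_system, actual_time_in_system):
    """Variation (2): needs no due date, but still uses the job's actual
    time in system, which is global job information."""
    return expected_time_in_system - actual_time_in_system

# A job due at t=20, evaluated at t=10, with 5 time units of work left:
assert critical_ratio(20, 10, 5) == 3.0
# A job that has been in the system 2 units longer than expected:
assert cr_variant(8.0, 10.0) == -2.0
```

Under either index, the waiting job with the smallest value would typically be selected next.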
4.2.4. Evaluation

The ability to classify service disciplines based on the information required to model
them is an extremely useful tool. The authors have spent much time trying to model
disciplines in a “resource-driven” simulation that simply cannot be modeled (e.g., critical
ratio).
The service discipline taxonomy above assumes detailed job statistics are required
from the simulation. If this is not the case, it may be possible to reduce the modeling
information needed. For example, average queue sizes (and job waiting times) for
(priority) FCFS and LCFS can be found using only general system information. The
same is true for shortest processing time scheduling if service times are not generated on
job arrival. Whether the information needed can be reduced depends on the actual
discipline used, and on the desired output statistics.
4.3. Desired Approximation Characteristics

In some cases, we may not want or be able to include the necessary information in our
simulation to model the system or obtain the desired output statistics. In these cases,
approximations are required to substitute for the unavailable information. In this section,
we discuss the criteria for comparing different approximation algorithms.
Two aspects of approximation algorithms should be considered when evaluating
their usefulness. The first is the approximation accuracy, discussed in Section 4.3.1; the second is the computational effort, discussed in Section 4.3.2. Section 4.3.3 gives an
example of a single measure that can incorporate both aspects.
4.3.1. Approximation Accuracy

If an approximation is to be of value, it must return performance measures that are near
the actual performance measures, or provide reliable bounds on the measures. This is
true for approximations in simulation models, optimization algorithms (e.g., integer
programming techniques), or in any other application. The definition of “near” is
application-dependent.
An approximation algorithm whose behavior is provable is preferable to one that
appears to work empirically, but cannot be formally shown to be accurate. Qualities of
the algorithm that can be proven include the following. They are listed in roughly
increasing order of desirability. That is, reliability is the first concern, followed by
bounding behavior, etc.
• Reliability: The algorithm should give the same results each time it is run. In a
deterministic setting, this means identical results should be obtained for the same
problem and the same algorithm parameter settings. In a stochastic environment, the
results across runs should be statistically identical; that is, any differences in output
are due only to randomness in the system. An algorithm that returns inconsistent
estimates across runs is an unreliable source of information on the underlying
process.
• Bounding behavior: Knowing the algorithm will always stochastically dominate (or
be dominated by) the true value is helpful as the user will know the (unknown)
system values are at least or at most as great as the ones returned by the
approximation. For example, the solution to an integer program will never be better
than its linear programming relaxation. Approximations may provide lower or upper
bounds, or both. Tightness of bounds is desirable.
• (Rate of) Convergence: Convergence is important for algorithms in general, and for
approximation algorithms in particular. Fast convergence is even more important, since
we hope to gain a computational advantage by sacrificing the accuracy of results.
Convergence is discussed in (Powell 1981).
• Percent error: If an approximation can be shown to always be within a certain
percentage of the system values or optimal solution, its usefulness will be increased.
Often, the percentage difference is related to the convergence rate of the algorithm,
see for example (Fleischer 2004). This characteristic is the strongest listed here, and
it is usually the most desired. It is also often difficult to attain, especially in the
context of simulation models. Percent error is frequently specified in requirements,
and is preferable to absolute error because it removes the unit and scale components
of the reported error.
The qualities listed above are all quantifiable and measurable. For example, the
reliability of the algorithm output can be tested using standard statistical hypothesis
testing. For the bounding behavior, we can specify the type of dominance the
approximation has. More discussion of approximation algorithms, and examples of
dominance behavior, can be found in (Axelsson and Marinova 1999;
Gutin and Yeo 2002; Gutin et al. 2003). An excellent reference for stochastic orderings
can be found in (Shaked and Shanthikumar 1994).
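For example, reliability across two batches of runs can be checked with a standard two-sample test. The sketch below hand-rolls a Welch t statistic with the standard library (in practice a library routine such as a packaged t-test would be used; the data are invented for illustration):

```python
import statistics, math

def welch_t(a, b):
    """Welch two-sample t statistic for comparing the mean outputs of two
    batches of runs of the same stochastic algorithm."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

run1 = [10.2, 9.8, 10.1, 10.0, 9.9]
run2 = [10.0, 10.3, 9.7, 10.1, 9.9]
# |t| well below ~2 gives no evidence that the two batches differ by
# more than the randomness in the system, i.e., no sign of unreliability.
assert abs(welch_t(run1, run2)) < 2.0
```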
These requirements should not be considered independently of the computational
requirements discussed below. For example, an algorithm that provides tight, reliable
bounds with little computation may be preferable to one that is within x% of the optimal
solution but that requires y percent more computational effort. Reliable, provable bounds
on the solution are very valuable information.
4.3.2. Computational Requirements

While the accuracy is important, there is another aspect of algorithms that is more
significant for approximations than it is for conventional algorithms: the computational
requirements. The amount of memory and processor time required is important for any
algorithm, but with an approximation, we are typically consciously giving up accuracy in
an effort to gain speed and reduce the system (usually computer system) requirements to
execute the algorithm. If the approximation is relatively accurate but requires more time
and resources to run than the exact method it is trying to replace, there is no benefit to
using the approximation.
As with the quality of solution characteristics, the computational requirements of
an algorithm are measurable. Both the amount of memory and the processor time
required can either be found theoretically, or can be estimated by looking at computer
statistics while running the algorithm.
4.3.3. Error Measures

Even when we can measure the accuracy of the algorithm and its computational
requirements explicitly, we must consider at least three numbers when comparing
alternate approaches to the same problem. The question of how to weight these different
aspects in the comparison is a challenge: Is an error of x% acceptable if we have reduced
our memory requirements while increasing the execution time?
4.3.3.1. Existing Measures

We have found limited literature on a single error measure. (L'Ecuyer 1994) defines
C(X) as the expected value of the CPU time required to compute a realization of the
random variable X, and MSE[X] as the mean squared error of estimating X. Then the
efficiency of X, not to be confused with statistical efficiency (Fisher 1922), is

    Eff(X) = 1 / (MSE[X] ⋅ C(X)).    (3)
A similar efficiency measure is given in (Fox and Glynn 1990); the authors focus on the
amount of variance reduction obtained through conditioning versus the computational
effort required to obtain the conditioned values.
If X and Y are unbiased random variables that are used to estimate the same
performance measure of our system, then X is preferable to Y if Eff(X) > Eff(Y).
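For instance (a hypothetical sketch of the criterion in (3), with invented numbers): an estimator with four times the MSE can still be preferable if it is more than four times cheaper to compute.

```python
def efficiency(mse, cpu_time):
    """Eff(X) = 1 / (MSE[X] * C(X)), as in (3)."""
    return 1.0 / (mse * cpu_time)

# Estimator X: accurate but slow; estimator Y: coarser but much faster.
eff_x = efficiency(mse=0.01, cpu_time=100.0)   # = 1.0
eff_y = efficiency(mse=0.04, cpu_time=10.0)    # = 2.5
assert eff_y > eff_x   # Y is preferable under this criterion
```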
There are several problems with this measure. The first is that we wish to take
into account not only CPU time, but also the amount of memory required for the solution.
The second is that this efficiency measure is not normalized, and it is not clear how to
interpret a quantity like "0.3 per m²⋅sec⋅MB." Clearly, efficiencies closer to zero are less desirable, but one can change the value of the efficiency simply by changing the units of MSE[X].
We would like an efficiency measure that is not dependent on the units used to
measure it. In addition, we would like it to be bounded like the correlation coefficient (a
normalized covariance); this would allow more meaningful comparisons to be drawn.
(Schmeiser and Yeh 2002) discuss a single criterion for evaluating confidence
interval procedures. It compares the results from any “real-world” confidence interval to
the “ideal” confidence interval. We take a similar approach in the error measure
discussed next.
4.3.3.2. Example: Graphical Approach to Measuring Approximation Error

One approach for including all three aspects in a single measure treats them as
coordinates in three-dimensional space. Each component should be expressed as a
percentage of the method that would give the exact (or optimal) solution, which
eliminates the measure’s dependence on units and the need to provide interpretation of
units. For example, if the exact solution is $10 and the approximation solution is $15, the
approximation coordinate will be $15/$10 = 150%. If the approximation were $5, the
coordinate would be $5/$10 = 50%. The baseline algorithm is located at (1,1,1). The
baseline algorithm can be any algorithm that represents the status quo, or that gives the
optimal or most accurate solution.
The three dimensions for this measure are memory, error, and CPU seconds.
Depending on the application, different metric spaces can be constructed from this to
measure the distance from the “origin” (1,1,1), for example a normed linear space
(Powell 1981). In general, the memory and CPU requirement coordinates cannot be
negative, though the approximation error can be. The coordinates are not bounded above
in any of the three dimensions.
If several algorithms are plotted together, similarities between the algorithms may
be detected based on clustering behavior. For example, some algorithms may be
extremely accurate but require a large amount of CPU power and little memory, while a different class of algorithms is extremely accurate but requires a lot of memory and little
CPU. If x1 and x2 are large positive numbers, the first set of algorithms would be
clustered around (1,1,x1) and the second around (x2,1,1).
Another interesting idea is that we can specify regions similar to the efficient
frontier in the measure’s space (Brealey and Myers 1996). For example, we can define
application-specific regions where the magnitudes of CPU, memory, or error coordinates
are unacceptably large. We can also define “breakeven” points where the cost of CPU
power and memory, or CPU/memory and error are equivalent.
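The coordinates and a distance from the baseline point (1,1,1) can be sketched as follows (a hypothetical illustration with invented numbers; the Euclidean norm is only one of the many possible distance measures):

```python
import math

def coords(approx, baseline):
    """Express (memory, error, CPU) as fractions of the baseline method."""
    return tuple(a / b for a, b in zip(approx, baseline))

def distance_from_baseline(c):
    """Euclidean distance from the baseline point (1, 1, 1)."""
    return math.sqrt(sum((x - 1.0) ** 2 for x in c))

# Baseline (exact) method: 200 MB, exact answer 10.0, 60 CPU seconds.
# Approximation: 50 MB, answer 15.0, 6 CPU seconds.
c = coords((50, 15.0, 6), (200, 10.0, 60))
assert c == (0.25, 1.5, 0.1)          # quarter the memory, 50% error, a tenth the CPU
assert round(distance_from_baseline(c), 3) == 1.274
```

Application-specific "unacceptable" regions or breakeven points would then be expressed in this same normalized coordinate space.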
While this measure integrates all three algorithm components into a single value and allows flexibility in the distance measure used to compare algorithms, it has several drawbacks. The first is that it may not be bounded, depending on the distance
measure chosen (e.g., a rectilinear distance measure in a normed linear space). Examples
of distance measures can be found in (Dillon and Goldstein 1984). If we are comparing
algorithms to each other, it may be sufficient to be able to compare the value of their
error measures, even if the measures are not bounded.
Another problem is that one must know or at least estimate the approximation
error. (The relative CPU and memory requirements are more easily determined since
they can be found theoretically by analyzing the algorithm, or measured when running
the algorithm.) The error measure also does not capture bounding behavior well. That is, if an algorithm gives reliable upper and lower bounds, it is not clear how to incorporate these bounds into the accuracy component. A possibility is to use the distance between the bounds in the calculation instead of the approximation error:

    (upper − lower) / (2 ⋅ exact).    (4)
Finally, one must know the baseline quantities to which the approximation is being compared. This may not always be possible, especially in the case of simulation models, where the entire purpose of the model is to find these baseline quantities.
Despite its disadvantages, this graphical approach to comparing approximation
algorithms is appealing. We will investigate it more in future work.
4.4. Example: Approximating Dedication Constraints

In some cases, we are unable to model system behavior correctly because of lack of
information. One such example is dedication constraints. As defined in Section 2.3.2,
dedication constraints (in our context) are requirements that jobs be serviced by the same
machine on successive visits to the machine group. To model the constraints accurately,
we need information from region III.2.i.b. We give a formal definition of dedication
constraints in Section 4.4.1.
Resource dedication is used in an attempt to improve the quality of service (e.g. in
using primary care physicians with HMOs), or the quality of the final product (e.g. “bin
yield” in wafer manufacturing, the percentage of wafers that meet specifications).
Dedication also has an impact on system dynamics and delays in the system: With
dedication, it is possible that a server may remain idle even though there are jobs waiting
at its server group. A job may wait longer than it would have if dedication were not
being used. Ignoring dedication when modeling the system will lead to biased queue
sizes and server utilizations. Queue sizes will be underestimated, while utilizations may
be overestimated. (Underestimating queue sizes is a more realistic and significant
problem. For congested systems, there will not be much forced idle time for the servers.)
There is literature discussing dedication, as well as its impact on operations. See
(Jensen et al. 1996; Shafer and Charnes 1997; Woods 1998; Rohan 1999;
Akcalt et al. 2001), for example. The literature we found assumes that all required
information is available for modeling.
In this section, we present a method that bounds performance statistics for a
system with dedication without using job information (region III). In Section 4.4.1, we
define notation and terminology. Sections 4.4.2 through 4.4.4 motivate and introduce the
approximation. Section 4.4.5 presents computational results, and Section 4.4.6 proposes
improvements and outlines future work. Examples of systems with dedication constraints
are given in Appendix C.1.
In addition to the dedication constraints referred to above, there are other types of
dedication. With tool dedication, certain tools are dedicated to very specific tasks
although they may be able to perform other tasks as well. This is often done to reduce
the number of setups or the amount of cleaning required for tools (Rohan 1999).
4.4.1. Problem Statement
Consider an open queueing network in which jobs of j* types are processed at g* server
groups. Each server group g is composed of $s_g^*$ functionally identical servers. Each job
type is characterized by known service-time distributions, priorities, and routes. The r*
route steps may be deterministic or stochastic and state-dependent, and routes are re-entrant
(jobs may visit the same server group more than once). One or more route steps have a
dedication constraint, with each job required to visit the same server, not just the same
server group, from a previous step. The set $\{j^*, g^*, r^*, s_g^*, t^*\}$ and the associated
information are in region I of the taxonomy. The network has a set of associated
performance measures θ.
Define the following processes (classification regions are given in parentheses):
$\mathbf{Q}(t) = \{Q_{rg}(t) : 1 \le r \le r^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, is the number of jobs at each route
step r waiting in queue for each server group g at time t (region I.2.ii).
$\mathbf{S}(t) = \{S_{ig}(t) : 1 \le i \le s_g^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, is the status of server i in group g at
time t, with $S_{ig}(t) \in \{\text{busy}, \text{idle}\}$ (region II.2.ii.b). The total number of available
servers in group g at time t is $S_{\cdot g}(t) = \sum_{i=1}^{s_g^*} I\left(S_{ig}(t) = \text{idle}\right)$, where $I(\cdot)$ is the
indicator function. $S_{\cdot g}(t)$ is in region I.2.ii.
In a resource-driven simulation, or to a casual observer of jobs in a system,
$\{(\mathbf{Q}(t), \mathbf{S}(t)) : 0 \le t \le t^*\}$ are observable.
Additionally, let
$\bar{\mathbf{Q}}(t) = \{\bar{Q}_{irg}(t) : 1 \le i \le s_g^*, 1 \le r \le r^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, be the number of jobs at route
step r waiting specifically for server i of group g at time t (region II.2.ii.b). The
total number of jobs waiting at group g, either for a specific server or for any of
the servers, is $Q_g^{tot}(t) = \sum_{r=1}^{r^*}\left(Q_{rg}(t) + \sum_{i=1}^{s_g^*} \bar{Q}_{irg}(t)\right)$ (region I.2.ii); and let
$\mathbf{D}(t) = \{D_{ijg}(t) : 1 \le i \le s_g^*, 1 \le j \le j^*, 1 \le g \le g^*\}$, $0 \le t \le t^*$, be the number of jobs of
type j that server i of group g has processed that may return to i in the future
(region II.2.ii.b).
In order to find $\{\bar{\mathbf{Q}}(t)\}$ with dedication constraints, we need the following information:
$\mathbf{J} = \{J_{gk} : 1 \le g \le g^*, 1 \le k \le N\}$ is, for each job k, the resource in group g that k has been
dedicated to. If no dedication is present at g, the array entry can be left blank
(region III.2.i.b).
If complete information on the system were available (if $\{\mathbf{J}\}$ were known), we would
know the stochastic process $\{(\bar{\mathbf{Q}}(t), \mathbf{D}(t)) : 0 \le t \le t^*\}$. If $\{\bar{\mathbf{Q}}(t)\}$ is unknown, the jobs
that would be counted there are added to $\{\mathbf{Q}(t)\}$.
We wish to use the observed stochastic process $\{(\mathbf{Q}(t), \mathbf{S}(t), \mathbf{D}(t)) : 0 \le t \le t^*\}$ to
mimic the behavior the system would exhibit if $\{\mathbf{J}\}$, and therefore $\{\bar{\mathbf{Q}}(t) : 0 \le t \le t^*\}$, were
observable.
4.4.2. System Behavior from the Job Perspective
To develop bounds for the system performance measures using a model where $\{\mathbf{J}\}$ (and
therefore $\{\bar{\mathbf{Q}}(t)\}$) is unobservable, we consider the system behavior from the jobs'
perspectives.
When dedication is not present, the typical logic for arriving jobs is that they are
immediately processed if there are available (idle) resources, and must wait if there are
none. When dedication is present, jobs will not always be processed when they arrive
(for a revisit) and resources are idle: It is possible that the available resources do not
include the ones that processed the job originally. We can approximate the probability
that this occurs by the following:
$$P\{\text{can start} \mid \text{job arrives}, S_{\cdot g}(t) > 0\} = P\{\text{required resource available}\} \approx \frac{\text{idle resources}}{\text{total resources}} = \frac{S_{\cdot g}(t)}{s_g^*} \qquad (5)$$
When a job arrives at server group g, subject to dedication constraints, at time t and finds
idle resources, the job does not begin processing unconditionally. Rather, it begins
processing with probability $S_{\cdot g}(t)/s_g^*$, which uses only existing information from region I.
This is an approximation because the probability of the required resource being
available is not the same as the proportion of idle resources. The approximation ignores
the memory that is implicit in the system, and also assumes that arriving jobs were
equally likely to have been assigned to any of the resources.
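The uniform-assignment assumption can be illustrated with a short Monte Carlo sketch (ours; the function name and counts are hypothetical): when the dedicated server is drawn uniformly from the group, independently of which servers are idle, the right-hand side of Eq. (5) is exact, so the approximation error comes entirely from the system memory that it ignores.

```python
import random

def required_available_prob(n_idle, s, n_trials=200_000, seed=0):
    """Estimate P{required resource available} when the dedicated server is
    drawn uniformly from the s servers, independently of the idle set."""
    rng = random.Random(seed)
    idle = set(range(n_idle))  # servers 0..n_idle-1 are the idle ones
    hits = sum(rng.randrange(s) in idle for _ in range(n_trials))
    return hits / n_trials

# Under these assumptions, the estimate matches Eq. (5)'s idle/total ratio.
print(required_available_prob(2, 5))  # close to 2/5
```

Any dependence between a job's dedicated server and the current idle set (the "memory" discussed above) breaks this equality, which is exactly where the approximation loses accuracy.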
We expect the job wait times in the approximation to be greater than they would
be in a system where dedication is modeled explicitly. That is, this approximation is
dominated by the exact model. For example, consider the case where times between
visits to g are short, and the system is lightly loaded. In this case, the probability that
server i, having just completed processing job k, is idle when k returns is high; this is
because there are not many other jobs in the system, and i completed processing k a short
time ago. The probability does not take this information (“memory of the system”) into
account, and may require k to wait.
Let $\{A_g(t) : t \ge 0\}$ be the number of times through time t an arriving job is forced
to wait at g although there is an idle server. If $A_g^{approx/exact}(t)$ is the number of times the
approximation/exact algorithm forces arriving jobs to wait at g by time t, we conjecture
without proof that $A_g^{approx}(t^*) > A_g^{exact}(t^*)$.
4.4.3. System Behavior from the Resource Perspective
This section examines the system behavior from the perspective of resource g, subject to
dedication constraints. When dedication is not present, a resource that becomes available
(finishes processing a job or is put back online after maintenance) immediately begins
processing a waiting job. When dedication is present, the resource may be forced to
remain idle even though there are jobs in queue. That is, server i in group g becomes
available at time t, finds $Q_g^{tot}(t) > 0$, but cannot begin processing. We can approximate
the probability that this occurs as follows:
$$\begin{aligned}
P\{\text{can begin processing} \mid i \text{ now available}, Q_g^{tot}(t) > 0\}
&= P\{\text{at least one job waiting for } i\} \\
&= 1 - P\{\text{no job waiting for } i\} \\
&\approx 1 - \prod_{k=1}^{Q_g^{tot}(t)} P\{\text{job } k \text{ not waiting for } i\} \\
&\approx 1 - \prod_{k=1}^{Q_g^{tot}(t)} \left(1 - \frac{1}{s_g^*}\right) = 1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)}
\end{aligned} \qquad (6)$$
In this case, when a server in group g becomes available at time t and finds jobs waiting,
it does not begin processing immediately. Rather, it begins processing with probability
$1 - \left(1 - 1/s_g^*\right)^{Q_g^{tot}(t)}$, which uses only existing information from region I.
If the number of jobs in queue goes to zero, the probability will go to 0:
$$1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)} \xrightarrow{\;Q_g^{tot}(t) \to 0\;} 1 - 1 = 0 \qquad (7)$$
If the number of jobs in queue gets very large, the probability of starting will tend to 1:
$$1 - \left(1 - \frac{1}{s_g^*}\right)^{Q_g^{tot}(t)} \xrightarrow{\;Q_g^{tot}(t) \to \infty\;} 1 - 0 = 1 \qquad (8)$$
This is an approximation because it assumes that all jobs waiting in queue are
independent of one another, which may not be the case. For example, if the servers have to be set
up to process a certain job type, they may process several jobs of that type before
switching to a different setup. The m jobs of that type that were processed
sequentially will then all return to the queue at roughly the same time, and their returns will be dependent.
Furthermore, the approximation assumes, as in Section 4.4.2, that the probability of
waiting for server i is $1/s_g^*$, which may not be true.
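The start probability in Eq. (6), and its limiting behavior in Eqs. (7) and (8), can be checked directly with a small sketch (the function name is ours):

```python
def p_begin_processing(q_tot, s):
    """Eq. (6): probability that an available server in a group of s servers
    can begin processing when q_tot jobs are waiting, assuming each job waits
    for any particular server with probability 1/s, independently."""
    return 1.0 - (1.0 - 1.0 / s) ** q_tot

# Eq. (7): the probability vanishes with the queue ...
print(p_begin_processing(0, 5))     # 0.0
# Eq. (8): ... and tends to one as the queue grows
print(p_begin_processing(1000, 5))  # approximately 1.0
```

The probability is also strictly increasing in the queue length, so a longer queue can only make forced idleness less likely under the approximation.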
It is not clear a priori whether this approximation will dominate or be dominated
by the exact algorithm. We can construct scenarios to argue either way. Consider the
example above where several jobs of the same type are processed sequentially because of
setup costs. In this case, we will underestimate the probability of being able to start,
since we are ignoring the fact that m of the jobs are waiting for i. Instead, we assign each
of them an equally small probability of waiting for this resource.
On the other hand, if server i has just been down for repair or maintenance for a
long time, and times between revisits are short, it is unlikely that there will be any jobs
waiting for i in queue (because it has not been processing any and sending them out into
the system). The probability ignores this fact and assumes that each job is equally likely
to be waiting for i. Both extensions require more information from region II. The current
approximation uses only existing information.
Let $\{V_g(t) : t \ge 0\}$ be the number of times by time t a server in group g was forced
to remain idle, even though there were jobs waiting, when it becomes available. If
$V_g^{approx/exact}(t)$ is the number of times by t the approximation/exact algorithm forces a
server at g to remain idle when there are jobs waiting, we cannot say a priori whether
$V_g^{approx}(t^*) > V_g^{exact}(t^*)$ or $V_g^{approx}(t^*) < V_g^{exact}(t^*)$.
4.4.4. Approximation Implementation
Figure 4 shows the event graph model for an n-stage queueing network. Figure 7 shows
the event graph after the approximation probabilities have been added. If stage i is not
subject to dedication constraints, the probabilities are omitted. RND is a function
returning a pseudorandom Uniform(0,1) number.
Figure 7. Event graph for an n-stage queueing network, with dedication constraints modeled using the approximation
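The gating logic can be sketched in a minimal event-scheduling loop. The sketch below is ours, not the dissertation's code: it covers a single server group with Poisson arrivals and exponential service (no routing or revisits, unlike the n-stage networks studied here), applying the Eq. (5) gate at arrivals and the Eq. (6) gate at service completions; all names and parameter values are assumptions.

```python
import heapq
import random

def simulate_dedication_approx(lam, mu, s, horizon, seed=1):
    """Single server group with the probabilistic dedication gates of
    Eqs. (5) and (6) in place of job-level dedication information."""
    rng = random.Random(seed)
    q = 0            # jobs waiting (region I information only)
    r = s            # idle servers
    a_count = 0      # arrivals forced to wait despite an idle server
    v_count = 0      # servers forced to stay idle despite waiting jobs
    area, last = 0.0, 0.0
    fel = [(rng.expovariate(lam), 'ENTER')]  # future events list
    while fel:
        t, ev = heapq.heappop(fel)
        if t > horizon:
            break
        area += q * (t - last)   # time-average queue-size accumulator
        last = t
        if ev == 'ENTER':
            q += 1
            heapq.heappush(fel, (t + rng.expovariate(lam), 'ENTER'))
            if r > 0:
                if rng.random() <= r / s:                   # Eq. (5) gate
                    q -= 1; r -= 1
                    heapq.heappush(fel, (t + rng.expovariate(mu), 'FINISH'))
                else:
                    a_count += 1                            # forced wait
        else:  # FINISH
            r += 1
            if q > 0:
                if rng.random() <= 1 - (1 - 1 / s) ** q:    # Eq. (6) gate
                    q -= 1; r -= 1
                    heapq.heappush(fel, (t + rng.expovariate(mu), 'FINISH'))
                else:
                    v_count += 1                            # forced idleness
    return area / last, a_count, v_count

avg_q, forced_waits, forced_idles = simulate_dedication_approx(0.5, 1.0, 1, 50_000)
print(forced_waits, forced_idles)  # 0 0  (s = 1: both gates always pass)
```

With s = 1 both gates pass with probability one, so the sketch reduces to an ordinary M/M/1 queue; with s > 1 the gates occasionally force arriving jobs to wait (a_count) or servers to idle (v_count), mimicking dedication without any region III information.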
4.4.5. Results
To test our model, we constructed a simulation of a deterministic re-entrant queueing
network with one job type and dedication constraints on one of the server groups. This
group is modeled explicitly, while the remaining network is modeled as a black box; it is
more interesting to know how long jobs were away from the server group, and less
interesting to know what they were doing while they were gone. Parameters varied in the
simulation are:
$s_g^*$  number of servers at the dedication group g
λ  arrival rate to the system
ν  service rate at group g (may differ for the different visits to g)
τ  "revisit rate": 1/(average amount of time between visits to g)
m  number of revisits to g by each job
d  service discipline at g
Our performance measures are the average queue size (total) at g, and $A_g(t^*)$ and $V_g(t^*)$.
Figure 8 shows the average total queue size when m = 2 (total visits to g by each
job is 3), d is a pull discipline (jobs at visit 3 get processed before those at visit 2 before
those at visit 1), and the time between revisits is “long” (τ = 0.5⋅λ).
Each of the points in Figure 8 is the average of 5 pairs of antithetic simulation
runs. The number of servers is fixed at 5, and the service rate is varied to provide an
even sampling across utilizations. The arrival and service distributions are Exponential.
The x-axis is resource utilization, which is not one of the system parameters. It is an
output from the simulation, and is used as a surrogate for the number of revisits, the
service rate, etc. We are using this surrogate measure of congestion because of the large
number of input parameters. The effective traffic intensity outlined in
(Fendick and Whitt 1998) is a possible independent measure we will investigate in the
future.
In Figure 8, the full dedication case (line with triangles, “Ded.” in the legend) is
bounded below by the no dedication case (line with diamonds, “No Ded.” in the legend),
as previously observed. (This corresponds to the model in Figure 4.) It is bounded above
by the approximation we have constructed (line with circles, “Prob.” in the legend). If
these results hold up, we would be able to run two fast, straightforward simulations and
obtain bounds on the behavior of the (slightly) slower and more detailed simulation.
Figure 8. $Q_g^{tot}(t)$ for the system described above
Figure 9 shows the queue size for the same system where the time between revisits is
short, τ = 2⋅λ. The same type of bounding behavior as in Figure 8 is evident. The
distances between the three curves remain relatively constant, whereas the bounds in
Figure 8 become tighter as the utilization (system congestion) increases. This follows the
intuition we formed earlier that the approximation does not maintain the system memory
that is more clearly defined if the times between revisits are short.
Figure 10 shows $A_g^{approx}(t^*)$ and $A_g^{exact}(t^*)$ as proportions of the total number of
jobs processed for both long and short times between revisits (solid and dashed lines,
respectively). When revisit times are long, $A_g^{approx}(t^*)$ is very close to $A_g^{exact}(t^*)$,
especially as the utilization increases. With long revisit times, and especially when the
servers are very busy, server i's status is independent of a job's arrival to or departure
from g. When the revisit times are short, there is a large discrepancy, especially for more
lightly-loaded systems. This confirms the arguments given in Sections 4.4.2 and 4.4.3.
When the system is heavily loaded, this effect is dampened, as i is likely to be able to
start working on another job right away, or at least before the job it just completed
returns.
We note that $A_g^{approx}(t^*) > A_g^{exact}(t^*)$, confirming the intuition in Section 4.4.2 that
the estimate would be conservative.
Figure 11 shows $V_g^{approx}(t^*)$ and $V_g^{exact}(t^*)$ as proportions of the total number of
jobs processed. Again, the approximation appears very close when the system memory
has had the chance to erase itself, that is, when the time between revisits is long, or the
time is short but the system is congested. This reinforces our earlier intuition. The
differences between $V_g^{approx}(t^*)$ and $V_g^{exact}(t^*)$ are smaller than those between $A_g^{approx}(t^*)$
and $A_g^{exact}(t^*)$, which is consistent with our a priori difficulty in determining whether the
approximation would be conservative.
The values for both $A_g^{approx}(t^*)$ and $V_g^{approx}(t^*)$ do not depend on the revisit rate τ,
though τ obviously does have a large impact on the values of $A_g^{exact}(t^*)$ and $V_g^{exact}(t^*)$.
This illustrates that we are not taking system information other than the queue size and
server status into account. Better approximations may attempt to include more
information.
We now give two conclusions we draw from our experimentation. The first we
state as a lemma and prove. The second is a hypothesis based on our intuition and
experimental results and is presented without proof.
Lemma 1 The queue size at a resource subject to dedication constraints is bounded
below by the queue size of the resource group in the same model, but without
dedication constraints.
Proof: The systems with and without dedication are identical except in the servers that
are allowed to process job k. If dedication is present, only server i' may serve k,
while there are no restrictions when dedication is not enforced. Define $\Phi_i(t)$ as
the next time on or after time t that server i will become available to process jobs.
If job k arrives at time t and there are no dedication constraints, k's wait is
$\min_i\{\Phi_i(t)\} - t$. With dedication, k's wait is $\Phi_{i'}(t) - t$. Because
$\min_i\{\Phi_i(t)\} \le \Phi_{i'}(t)$, k's wait with dedication constraints cannot be shorter than
in the unconstrained system. ■
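The comparison in the proof can be made concrete with a toy example (the availability times below are hypothetical):

```python
# Next-availability times Phi_i(t) for three servers, at time t = 10.0
t = 10.0
phi = {1: 12.0, 2: 10.5, 3: 17.0}

wait_unconstrained = min(phi.values()) - t   # min_i {Phi_i(t)} - t = 0.5
# Whichever server i' the job is dedicated to, its wait can only grow:
for p in phi.values():
    assert wait_unconstrained <= p - t       # Lemma 1's server-by-server bound
print(wait_unconstrained)  # 0.5
```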
Figure 9. $Q_g^{tot}(t)$ for the system in Figure 8, with short revisit times
Figure 10. $A^{approx}(t^*)$ and $A^{exact}(t^*)$ as proportions of the total number of jobs processed
Figure 11. $V^{approx}(t^*)$ and $V^{exact}(t^*)$ as proportions of the total number of jobs processed
Claim 1 The proposed approximation results in $A_g^{approx}(t^*) \ge A_g^{exact}(t^*)$ and
$V_g^{approx}(t^*) \ge V_g^{exact}(t^*)$.
4.4.6. Possible Improvements and Future Work
In the near future, we will
• perform more runs and calculate confidence intervals or other measures of
precision to rule out the possibility that our conclusions are due to noise;
• improve the approximations by taking into account things like repairs, sequential
processing of identical parts, or the number of parts of each type each server has
processed (if the server has processed none of this part type, the probability that
the part is waiting for this server is zero); that is, add more information to the
model and approximation;
• test the approximation given in Appendix C.2.
On a longer time horizon, we would like to
• prove that our approximations provide upper and lower bounds on the true
system; and
• quantify in which situations the approximations will provide good bounds.
4.5. Example: Approximating Waiting Time Distributions
Besides preventing accurate modeling of system behavior, lack of information can hurt a
simulation study by precluding the gathering of desired output statistics. For example,
job waiting time distributions may be required without the associated access to
information in region III of the information taxonomy.
In this section, we introduce the Time Slice Counter Method, which can be used
to find delay distributions for first-come-first-serve (FCFS) terminating systems. The
advantages of the Time Slice Counter Method are that it does not require information
from region III, and that it appears to work better for congested and variable systems than
it does for idle systems. It adds a fixed amount of information from region I to the basic
resource-driven model.
We have found little literature trying to estimate the entire distribution without
tracing individual jobs. A recent work that estimates quantiles of the distribution without
maintaining the relevant statistics throughout the whole simulation (e.g., waiting time at
an individual station in a queueing network during the simulation of the whole network)
is presented in (McNeill et al. 2003). The waiting times do need to be stored while the
job is at the station, i.e., this method uses information from III.2.iii.a. The quantiles are
estimated using a Cornish-Fisher expansion by tracking the higher moments of the
statistic in question during the run. The moments are used after the run is completed to
calculate quantiles. A big advantage is that any quantile can be calculated without having
to rerun the simulation.
This approach differs from the Time Slice Counter Method in two ways. First, it
uses information from III.2.iii.a rather than from I.2.iii. Second, the user specifies which
quantiles are of interest. The Time Slice Counter Method presented below looks at the
probability of waiting at most γ time units, for the γ’s of interest. The former specifies a
quantile and determines γ, while the latter specifies γ and tries to determine the quantile.
Both are valuable in different contexts. In the future, integration of the two may provide
useful results.
4.5.1. Basic Approach
If we are interested in the probability a job in a first-come-first-serve (FCFS) terminating
queueing system waits at most γ time units, we need not explicitly track each job.
Rather, we can use the event graph model shown in Figure 12. This is the general event
graph model for a G/G/s queueing system (like the one in Figure 4) with an additional
event vertex (Schruben and Schruben 2001). Define the following variables (all of which
are in region I):
$\{N_A(t) : 0 \le t \le t^*\}$  number of arrivals by time t
$\{N_S(t) : 0 \le t \le t^*\}$  number of starts by time t
$\{N_{D_\gamma}(t) : 0 \le t \le t^*\}$  number of jobs delayed γ time units by time t; $N_{D_\gamma}(t) = N_A(t - \gamma)$
for t ≥ γ; we omit the delay subscript when its value is obvious
$\{W_\gamma(t) : 0 \le t \le t^*\}$  number of jobs whose wait was less than γ time units by time t,
$W_\gamma(t) \le N_{D_\gamma}(t)$; we omit the delay subscript when its value is obvious
$\hat{F}_W(\gamma)$  current estimate of $P\{delay \le \gamma\}$; denoted as PROB in event graphs
$F_W(\gamma)$  true value of $P\{delay \le \gamma\}$; estimated using $\hat{F}_W(\gamma)$
When a job enters the system, a DELAY event is scheduled to occur after γ time units.
When a DELAY event occurs, counter ND is incremented. When a START event occurs,
counter NS is incremented. If more DELAY events than START events have occurred,
the current job’s waiting time was greater than γ. The estimate of the probability a job
will wait at most γ time units is
$$\hat{F}_W(\gamma) = \hat{P}\{delay \le \gamma\} = \frac{\text{number of jobs delayed at most } \gamma}{\text{total number of jobs}} = \frac{W_\gamma(t^*)}{N_{D_\gamma}(t^*)} \qquad (9)$$
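As a sanity check, the counter logic of Figure 12 can be sketched for an M/M/1 FCFS queue, for which the analytic waiting-time distribution $P\{W \le \gamma\} = 1 - \rho e^{-(\mu - \lambda)\gamma}$ is known. The sketch below is ours: it generates START times with the Lindley recursion rather than a full event loop, then walks the merged START/DELAY event stream using nothing but the three counters; all names and parameter values are assumptions.

```python
import random

def counter_method_mm1(lam, mu, gamma, n_jobs, seed=0):
    """Estimate P{delay <= gamma} for an M/M/1 FCFS queue using only the
    N_S, N_D, and W counters of the event graph in Figure 12."""
    rng = random.Random(seed)
    t, finish_prev = 0.0, 0.0
    events = []  # (time, kind): kind 0 = START, 1 = DELAY; ties favor START
    for _ in range(n_jobs):
        t += rng.expovariate(lam)                 # arrival (ENTER) time
        start = max(t, finish_prev)               # Lindley recursion
        finish_prev = start + rng.expovariate(mu)
        events.append((start, 0))                 # START event
        events.append((t + gamma, 1))             # DELAY event
    events.sort()
    ns = nd = w = 0
    for _, kind in events:
        if kind == 0:
            ns += 1
            if nd < ns:       # the job now starting was not yet "overdue"
                w += 1
        else:
            nd += 1
    return w / n_jobs         # PROB = W / N_D once all DELAYs have occurred

est = counter_method_mm1(lam=2/3, mu=1.0, gamma=2.0, n_jobs=20_000)
print(est)  # near 1 - (2/3) * exp(-(1/3) * 2) ≈ 0.658
```

Because the same seed fixes the sample path, re-running with a larger γ can only classify more jobs as having waited at most γ, so the estimate is monotone in γ as the distribution function requires.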
Lemma 2 $\hat{F}_W(\gamma)$ is an unbiased estimator for $F_W(\gamma)$.
Proof: The proof rests on the use of indicator random variables in the estimate.
$$\begin{aligned}
E\left[\hat{F}_W(\gamma)\right] &= E\left[\frac{W_\gamma(t^*)}{N_{D_\gamma}(t^*)}\right]
= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} I\left(\text{job } k\text{'s delay} \le \gamma\right)}{N_{D_\gamma}(t^*)}\right] \\
&= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} P\left\{\text{job } k\text{'s delay} \le \gamma\right\}}{N_{D_\gamma}(t^*)}\right]
= E\left[\frac{\sum_{k=1}^{N_{D_\gamma}(t^*)} F_W(\gamma)}{N_{D_\gamma}(t^*)}\right] = F_W(\gamma)
\end{aligned}$$
■
Figure 12. Event graph model for estimating $P\{delay \le \gamma\}$. (State changes: ENTER {Q = Q + 1}; START, reached when (Q > 0) and (R > 0), {Q = Q − 1, R = R − 1, NS = NS + 1, W = W + (ND < NS), PROB = W/ND}; FINISH {R = R + 1}; DELAY, scheduled γ time units after ENTER, {ND = ND + 1}.)
A realization of this system is given in Figure 13. The dashed line represents the number
of DELAY events that have occurred at time t, $\{N_D(t)\}$, while the solid line is the
number of START events that have occurred, $\{N_S(t)\}$. At time $t_1$, the START line lies
above the DELAY line, $N_S(t_1) > N_D(t_1)$; this means that more START events have
occurred by $t_1$, and that the last job that began service was not delayed more than γ time
units. At time $t_2$, the DELAY line lies above the START line, $N_D(t_2) > N_S(t_2)$,
indicating that more DELAY than START events have occurred; the next job to begin
service will have waited more than γ time units.
The advantage of this approach is that we do not need to store any information.
We are able to classify jobs based on event counts using the fact that the number of
events by time t is at least n if and only if the nth event occurred by time t:
$N_E(t) \ge n \Leftrightarrow E_n \le t$.
Switching our perspective to looking at horizontal "lines" from the event counts
on the y-axis rather than vertical "time slices" from the x-axis, we can read off the status
of each job. The first START event occurs before the first DELAY event ($S_1 < D_1$), so
the first job was serviced before it was delayed γ time units. The same holds for the
second, third, and fourth START and DELAY events and their corresponding jobs. For
the fifth and sixth jobs, the DELAY events occur before the START events, and these
two jobs had to wait more than γ time units. Finally, the seventh job was processed
before it became overdue. Our estimate of $P\{delay \le \gamma\}$ is $\hat{F}_W(\gamma) = 5/7$.
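This horizontal reading can be sketched with a hypothetical merged event sequence consistent with the description of Figure 13 (the specific ordering below is ours, not the figure's exact data):

```python
# Merged timeline of START (S) and DELAY (D) events, in time order
timeline = ['S', 'D', 'S', 'D', 'S', 'S', 'D', 'D', 'D', 'D', 'S', 'S', 'S', 'D']

# Pair the k-th START with the k-th DELAY: job k waited at most gamma
# time units iff its START precedes (or ties with) its DELAY.
s_pos = [i for i, e in enumerate(timeline) if e == 'S']
d_pos = [i for i, e in enumerate(timeline) if e == 'D']
w = sum(s <= d for s, d in zip(s_pos, d_pos))
print(w, '/', len(s_pos))  # 5 / 7
```

Jobs 5 and 6 are the two whose DELAY events precede their START events, matching the reading above.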
Figure 13. Sample realization of START and DELAY event occurrences
To estimate the distribution of the waiting times, $F_W(\cdot)$, the user schedules multiple
DELAY events for each job, each with a different value of γ.
The methods presented in this section are of interest because they allow the
estimation of the probability during the simulation run, not as a post-processing step.
Tracking waiting times and post-processing is an inefficient way of finding waiting time
distributions. It requires additional computational effort as described in Section 3.3.3.
Although we avoid the post-processing, we are still maintaining a record for each
job on the future events list (FEL). Since scheduling events is usually the most expensive
operation in computer simulation programs, we have created more work by scheduling
the DELAY event for every job in the system while losing the complete information
available to us by tracking all jobs. We have found the exact solution by adding
computational effort as described in Section 3.3.2.
4.5.2. Eliminating Individual Job Information
4.5.2.1. Lower Time Slice Counter Method
We now introduce the Lower Time Slice Counter Method. With it, we do not trace
individual jobs by scheduling DELAY events, but track an approximation of the number
of DELAY events that have occurred by a certain time. (Section 4.5.2.2 introduces a
slight modification, the Upper Time Slice Counter Method.) No DELAY events are
scheduled. Define the following variables, all of which are in region I of the taxonomy:
ξ  time slice size
Ξ  number of time slices used for the simulation; $\Xi = \lceil t^*/\xi \rceil$
$\{A_k : 1 \le k \le N\}$  time of the kth job's arrival
$\{S_k : 1 \le k \le N\}$  time of the kth job's service start
$\{D_{\gamma,k} : 1 \le k \le N\}$  time of the kth job's γ-time-unit delay
$\{\Delta_{\gamma,i} : 1 \le i \le \Xi\}$  number of γ-time-unit DELAY events by time interval i
$\{N^L_{\gamma,\xi}(t) : 0 \le t \le t^*\}$  number of approximated γ-DELAY events from the Lower Time
Slice Counter Method that have occurred by time t, for time slice size ξ
$\{N^U_{\gamma,\xi}(t) : 0 \le t \le t^*\}$  number of approximated γ-DELAY events from the Upper Time
Slice Counter Method that have occurred by time t, for time slice size ξ
$\underline{\tau}(t,\gamma,\xi)$  floor index of the time slice that a DELAY event scheduled at time t for γ time
units in the future with time slice size ξ falls into; $\underline{\tau}(t,\gamma,\xi) = \lfloor (t+\gamma)/\xi \rfloor$
$\overline{\tau}(t,\gamma,\xi)$  ceiling index of the time slice that a DELAY event scheduled at time t for γ
time units in the future with time slice size ξ falls into;
$\overline{\tau}(t,\gamma,\xi) = \lceil (t+\gamma)/\xi \rceil = \underline{\tau}(t,\gamma,\xi) + 1$
For all variables, we omit the subscripts for the delay γ and time slice size ξ when their
meaning is obvious from context.
The Lower Time Slice Counter Method divides the simulation run duration $t^*$ into
time slices of length ξ. When a job enters the system at time t, it will have waited at least
γ time units at time $t + \gamma$. We say this event falls into time slice $\underline{\tau}(t,\gamma,\xi)$. The DELAY
event is known to occur sometime in the time interval $[\underline{\tau}(t,\gamma,\xi)\cdot\xi,\ \overline{\tau}(t,\gamma,\xi)\cdot\xi)$. Since we
are tracking a cumulative count of the number of DELAY events, the DELAY counters
for time slices $\underline{\tau}(t,\gamma,\xi)$, $\underline{\tau}(t,\gamma,\xi) + 1$, …, Ξ will be incremented by one when a job arrives.
$\Delta_{\underline{\tau}(t,\gamma,\xi)}$ is the counter for time slice $\underline{\tau}(t,\gamma,\xi)$.
When a START event occurs at time t', it compares $N_S(t')$ to $\Delta_{\underline{\tau}(t',0,\xi)}$, and
increments $W_\gamma(t')$ as before, i.e., if $N_S(t') > \Delta_{\underline{\tau}(t',0,\xi)}$. We do not have a START counter
for the time slices, as the exact time of a START event is known.
A necessary but not sufficient condition for misclassification is that the START
event happens in the same time slice as the DELAY event, $\underline{\tau}(D_k,0,\xi)\cdot\xi \le S_k < D_k$.
Conversely, a sufficient condition for correct classification is that the two events happen
in different time slices. We state this as Lemma 3:
Lemma 3 A sufficient condition for correct classification of job k using the Lower Time
Slice Counter Method is that the kth START event does not occur in the same time
slice as the kth DELAY event. A necessary but not sufficient condition for
misclassification is that the events occur in the same time slice.
Proof: The kth START event occurs at time $S_k$, in time slice $\underline{\tau}(S_k,0,\xi)$.
The (unobserved) kth DELAY event occurs at time $D_k$, in time slice $\underline{\tau}(D_k,0,\xi)$.
The kth job is not delayed if $S_k \le D_k$. There are three possible relationships
between $\underline{\tau}(S_k,0,\xi)$ and $\underline{\tau}(D_k,0,\xi)$:
1. $\underline{\tau}(S_k,0,\xi) < \underline{\tau}(D_k,0,\xi)$:
This implies $S_k < D_k$, since $\underline{\tau}(S_k,0,\xi)\cdot\xi \le S_k < \underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k$; job k will be
correctly classified as not delayed.
2. $\underline{\tau}(S_k,0,\xi) > \underline{\tau}(D_k,0,\xi)$:
This implies $S_k > D_k$, since $\underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k < \underline{\tau}(S_k,0,\xi)\cdot\xi \le S_k$; job k will be
correctly classified as delayed.
3. $\underline{\tau}(S_k,0,\xi) = \underline{\tau}(D_k,0,\xi)$: No conclusion can be drawn about the relationship
between $S_k$ and $D_k$. It is possible that $\underline{\tau}(D_k,0,\xi)\cdot\xi \le D_k < S_k < \overline{\tau}(D_k,0,\xi)\cdot\xi$ or
that $\underline{\tau}(D_k,0,\xi)\cdot\xi \le S_k \le D_k < \overline{\tau}(D_k,0,\xi)\cdot\xi$. In the first case, a correct
classification will be made, in the second an incorrect one. ■
The smaller the time slice sizes are, the more accurate the estimate will be. As time slice
sizes get larger, we underestimate $F_W(\gamma)$, thinking more jobs are overdue than is really
the case.
Figure 14 shows the same system realization as Figure 13, but includes the
approximated DELAY curve. The time slice size is ξ time units. The approximated
DELAY is bolded.
The number of approximated DELAYs is at least as large as the number of actual
DELAYs at any time t. In three instances (jobs 2, 3, and 7), this leads us to falsely
believe the jobs waited longer than γ time units; these are the shaded areas in Figure 14.
The number of shaded rectangles tells us how often we overestimate the wait. The length
of the rectangles gives an indication of the approximation's sensitivity to the arrival times:
Any arrival that occurred γ time units before any of the points in the
shaded area will be misclassified.
The approximation will not underestimate the probability of forcing a job to wait
too long. If ξ is small, the approximation will be accurate, as
$\lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0$.
If ξ is large, the approximation will be too conservative, as
$\lim_{\xi \to t^*}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = D_{\gamma,k}$,
and we will conclude that all jobs were delayed for more than γ time units. We prove the
algorithm's accuracy in Theorem 1.
Figure 14. System realization from Figure 13 with approximated DELAY curve
Theorem 1 In a terminating single-station queueing system with a FCFS service
discipline, as the time slice size ξ goes to zero, for any sample path, the estimate
$\hat{F}_W(\gamma)$ from the Lower Time Slice Counter Method will converge to the job-
driven estimate of $F_W(\gamma)$ for any γ ≥ 0: $\lim_{\xi \to 0}\left(\hat{F}^{LTSCM}_W(\gamma) - \hat{F}^{JD}_W(\gamma)\right) = 0$.
Proof: We must show that the Lower Time Slice Counter Method will give the same
result as the model in Section 4.5.1. This is equivalent to showing that
$\lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0$ for any k and for any γ.
$$0 \le D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi = D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi < \xi$$
is true because $D_{\gamma,k}$ cannot occur before $\underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi$, its component that is an
integer multiple of ξ. Further, the distance
between $D_{\gamma,k}$ and $\underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi$ can be at most ξ. Taking limits on the left- and
right-hand sides gives
$$\begin{aligned}
\lim_{\xi \to 0} 0 \le \lim_{\xi \to 0}\left(D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi\right) < \lim_{\xi \to 0} \xi
&\Leftrightarrow 0 \le \lim_{\xi \to 0}\left(D_{\gamma,k} - \left\lfloor \frac{D_{\gamma,k}}{\xi} \right\rfloor \cdot \xi\right) \le 0 \\
&\Leftrightarrow \lim_{\xi \to 0}\left(D_{\gamma,k} - \underline{\tau}(D_{\gamma,k},0,\xi)\cdot\xi\right) = 0 \\
&\Leftrightarrow \lim_{\xi \to 0}\left(\hat{F}^{LTSCM}_W(\gamma) - \hat{F}^{JD}_W(\gamma)\right) = 0
\end{aligned}$$
which completes the proof. ■
We note that as time slice size ξ goes to zero, the number of time slices increases. At
some point, more memory may be required for the time slice counters than would be needed to
maintain job information. This is especially true for lightly-loaded systems.
Figure 15 shows results for 10,000 simulated jobs in an M/M/1 queueing system
with an interarrival rate of 0.667 and a service rate of 1. We show the results for various
values of γ and three time slice sizes. The solid bold line is the analytic solution. The
time slice approximations become more accurate as the time slice sizes decrease. For
smaller values of γ, the approximation can be significantly inaccurate, especially for
larger time slices. This confirms the intuition that the estimate of the probability goes to
zero as the time slice size increases. (Not only is the absolute time slice size larger, but
its size relative to the smaller delays is larger than it is to the longer delays.)
The inaccuracy is not observed for larger γ values as the time slice sizes have not
increased sufficiently to cause the START and DELAY events to happen in the same
time slice.
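For reference, the analytic curve in Figure 15 can be reproduced from the standard stationary M/M/1 FCFS waiting-time distribution, F_W(γ) = 1 − ρ·e^{−μ(1−ρ)γ}. The function below is our sketch of that textbook formula, not code from the original experiments.

```python
import math

def mm1_wait_cdf(gamma, lam, mu):
    """Stationary M/M/1 FCFS waiting-time distribution:
    P{wait <= gamma} = 1 - rho * exp(-mu * (1 - rho) * gamma)."""
    rho = lam / mu
    return 1.0 - rho * math.exp(-mu * (1.0 - rho) * gamma)

# For the system of Figure 15 (arrival rate 0.667, service rate 1):
# F_W(0) = 1 - rho, the probability of not waiting at all.
```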
Figure 15. P{wait ≤ γ} for various values of γ and several time slice sizes
4.5.2.2. Upper Time Slice Counter Method

A simple change to the Lower Time Slice Counter Method results in the Upper Time
Slice Counter Method: we increment the counters for time slices t(γ,ξ) = ⌈(t+γ)/ξ⌉,
⌈(t+γ)/ξ⌉ + 1, ..., Ξ rather than the time slices starting at t(γ,ξ) = ⌊(t+γ)/ξ⌋. In other
words, we update the counter at the end of the time slice in which the DELAY event
occurs, not at the beginning. This will result in an overestimation of F_W(γ). All results
presented for the Lower Time Slice Counter Method follow analogously for the Upper
Time Slice Counter Method, and will not be presented again here.
Figure 16 shows experimental results for the same system as in Figure 15 using
both the Lower and Upper Time Slice Counter Methods. The Upper Time Slice Counter
Method (“ceiling” in the legend) estimates lie above the true probabilities (bold line).
For other system parameters (arrival and service rates), the difference in errors
between the Lower and Upper Time Slice Counter Methods is not as pronounced as here.
The only generalization we can make is that the approximations become more accurate as
the time slice size decreases, and/or as we get closer to the right tail of the distribution.
Figure 16. P{wait ≤ γ} for various values of γ and several time slice sizes using both the Lower and Upper Time Slice Counter Methods
4.5.3. Observations

In this section, we make general observations about the two Time Slice Counter Methods
which will motivate the Time Slice Counter Method discussed in Section 4.5.4.
While so far we have looked at our step functions as functions of time, insights
are more easily gained by examining the step functions from the jobs’ perspectives.
In the original model shown in Figure 12 and Figure 13, job k can easily be
classified as delayed or not delayed by looking at the relative positions of the DELAY
and START curves between counts k and k+1. If the DELAY curve is to the left of the
START curve, job k has been delayed. If the curve is to the right, the job has not been
delayed. If the curves are on top of each other, the DELAY and START events happened
at the same time, and the job can be considered delayed.
In both the Lower and Upper Time Slice Counter Methods, rather than comparing
the relative positions of the START and DELAY curves, we compare the positions of the
START and approximated DELAY curves. We do not observe the actual DELAY curve.
In the Lower Time Slice Counter Method, we may incur a false positive, a type I error, by
misclassifying jobs as being delayed even though they are not. In the Upper Time Slice
Counter Method, we may incur a false negative, a type II error, by misclassifying jobs as
not delayed even though they are.
The problems with the Lower and Upper Time Slice Counter Methods are that we
do not know where the actual DELAY curve lies relative to the START curve, and that
we do not always know whether we have properly classified a job. If we use both
methods together, we are able to increase the proportion of jobs we can classify with
certainty. This is the Time Slice Counter Method, discussed next.
4.5.4. Time Slice Counter Method
4.5.4.1. Further Observations

At time t, N_S(t) is the START curve, N_D^γ(t) the DELAY curve, N_L^{γ,ξ}(t) the
approximated DELAY curve resulting from the Lower Time Slice Counter Method with
time slice size ξ, and N_U^{γ,ξ}(t) the approximated DELAY curve resulting from the Upper
Time Slice Counter Method with time slice size ξ and for delay γ. For the sake of
simplicity, we denote them as S, D, L, and U, respectively, in the following discussion.
The relative positions of L, D, and U are fixed, by definition. The only variability
comes from the location of S. The four possible configurations of the curves are given in
Table 5.
Table 5. Possible orderings of the curves

1   S L D U
2   L S D U
3   L D S U
4   L D U S
Table 6 shows the possible orderings of the curves, the correct classification of jobs with
these orderings, and how the Lower and Upper Time Slice Counter Methods perform.
Both methods are correct for three of the four possible states, and they are not both wrong
at the same time. Further, if both classify a job the same way, the classification is
correct. This corresponds to the instances where the START curve lies in a different time
slice than the actual DELAY curve does, and is illustrated in Figure 17.
Table 6. Lower and Upper Time Slice Counter Method Classifications

state   correct classification   LTSCM         UTSCM         Lower Counter correct?   Upper Counter correct?
SLDU    Not delayed              Not delayed   Not delayed   Yes                      Yes
LSDU    Not delayed              Delayed       Not delayed   No                       Yes
LDSU    Delayed                  Delayed       Not delayed   Yes                      No
LDUS    Delayed                  Delayed       Delayed       Yes                      Yes
Figure 17. Accuracy of job classification using the Lower and Upper Time Slice Counter Methods
The checkmarks indicate the method properly classified a job, and the X’s indicate an
error was made. The times when both methods have checkmarks are those times when
the solid bold line (START) falls outside the solid unbolded (Lower Delay Counter) and
dotted (Upper Delay Counter) lines. If it falls between them, one of the two methods will
make a mistake. Since we do not know the location of the dashed line, we do not know
which method is in error. (In this example, the Lower Time Slice Counter Method is
incorrect three times, the Upper Time Slice Counter Method once.)
4.5.4.2. The Time Slice Counter Method Algorithm

While we know that at least one of the two methods will correctly classify jobs, we do
not know which one to choose if the two disagree. We do know that the categorization is
correct if both make the same classification. A method combining the two can use this
information to improve our probability estimates: We can track both counts and always
classify jobs as (not) delayed if both methods agree, when the START and DELAY
events occur in different time slices. Having the events fall in different time slices is a
necessary and sufficient condition for certainty of correct classification. If the two
methods do not agree, we must decide how to classify the job. Several options are
described in Appendix D.2. The implementation used here is described in Appendix
D.2.6. We discuss complete details of the implementation in Appendix D.3.
The main advantage of combining the Lower and Upper Time Slice Counter
Methods is that we are able to unambiguously classify a larger number of jobs than using
the methods independently. Specifically, for the Lower Time Slice Counter Method, we
only know that the jobs in the class SLDU have been correctly classified. We cannot
differentiate between the other three cases as we only observe the S and L curves. For the
Upper Time Slice Counter Method, we only know that the jobs in LDUS have been
correctly classified. Combining the two methods leaves uncertainty only for jobs in
LSDU and LDSU.
The basic Time Slice Counter Method algorithm follows.

Select values for γ_i and ξ (or ξ_i)
Initialize k = 1; Δ_{γ,i} = 0 (∀γ; i = 0, 1, ..., Ξ); N_S(t) = W_γ(t) = 0 (t ≥ 0); Φ_γ = 0 (∀γ)
For every job k
    When the job arrives at time A_k
        If the server is available, set START_flag = TRUE
        For every γ
            Calculate A_k(γ, ξ)
            If A_k(γ, ξ) > Φ_γ                                  // see Appendix D.3.3
                For i = Φ_γ + 1 to A_k(γ, ξ), set Δ_{γ,i} = Δ_{γ,Φ_γ}
                Set Φ_γ = A_k(γ, ξ)
            Increment Δ_{γ,A_k(γ,ξ)}                            // LTSCM counters
    When the job starts service at time S_k
        Increment N_S(S_k)
        Calculate S_k(0, ξ)
        For every γ
            If S_k(0, ξ) > Φ_γ
                For i = Φ_γ + 1 to S_k(0, ξ), set Δ_{γ,i} = Δ_{γ,Φ_γ}
                Set Φ_γ = S_k(0, ξ)
            If START_flag == TRUE
                Classify job k as not delayed, i.e., increment W_γ(S_k)
            Else if N_S(S_k) > Δ_{γ,A_k(γ,ξ)}, classify job k as not delayed, i.e., increment W_γ(S_k)
            Else if N_S(S_k) < Δ_{γ,A_k(γ,ξ)−1}, classify job k as delayed
            Else if N_S(S_k) > Δ_{γ,A_k(γ,ξ)−1} and S_k ≥ ξ·S_k(0,ξ) + ξ/2, classify job k as delayed
            Update F̂_W(γ) = W_γ(S_k) / k
        Set START_flag = FALSE

Theorem 2 states that the Time Slice Counter Method converges to the job-driven
probability estimate as the time slice size goes to zero:
Theorem 2 In a terminating single-stage queueing system with a FCFS service
discipline, as the time slice size ξ goes to zero, for any sample path, the estimate
F̂_W(γ) from the Time Slice Counter Method will converge to the job-driven
estimate of F_W(γ) for any γ ≥ 0: lim_{ξ→0} ( F̂_W^TSCM(γ) − F̂_W^JD(γ) ) = 0.

Proof: The proof is direct from Theorem 1. We are no longer restricting ourselves to a
G/G/1 queue because the number of servers does not matter: in a FCFS system, the
first job in queue is the first job to be served. ■
Again, as ξ goes to zero, the number of time slice counters may require more memory
than storing job information directly.
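The certain/ambiguous logic of the combined method can be sketched as follows. This is our simplified per-job illustration (the actual method works from counters, not stored per-job times), with the slice-midpoint rule standing in for the deterministic resolution of uncertain jobs described in Appendix D.2.6.

```python
import math

def tscm_classify(a, s, gamma, xi):
    """Simplified combined rule: classification is certain when the START
    and DELAY events fall in different time slices; otherwise the
    slice-midpoint rule (one of the options in Appendix D.2) decides."""
    delay_slice = math.floor((a + gamma) / xi)  # slice holding DELAY
    start_slice = math.floor(s / xi)            # slice holding START
    if start_slice > delay_slice:
        return "delayed"        # Lower and Upper methods agree
    if start_slice < delay_slice:
        return "not delayed"    # Lower and Upper methods agree
    # Ambiguous: same slice.  Treat a START late in its slice as
    # falling after the DELAY event (deterministic midpoint rule).
    return "delayed" if s >= xi * start_slice + xi / 2.0 else "not delayed"
```

When START falls in a later slice than DELAY, the start is certainly after arrival + γ, so both the Lower and Upper bookkeeping agree; only same-slice jobs require the heuristic.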
4.5.5. Experimentation: Single-Server System

In this section, we discuss experiments done with the Time Slice Counter Method on
single-server, single-stage queueing systems. Arrival and service distributions are
Exponential (M), Uniform (U), and Beta(0.5, 2) (B), with different arrival and service
rates. The choice of parameters for the Beta distribution is discussed below.
By changing the distributions and arrival/service rates, we are simulating different
levels of congestion and variability in the system. We use the traffic intensity ρ as a
measure of system congestion though it does not directly measure the number of jobs in
the system: an M/M/1 system with ρ = 0.9 has far fewer jobs in it than an M/M/1000
system with ρ = 0.9. Since the Time Slice Counter Method’s performance is insensitive
to the number of jobs in the system, this is an important point. Because this section deals
with a single-server system, ρ is adequate.
Variability in the system is more difficult to measure. This "global" variability
can be caused both by variability in the arrivals and variability in the service. Rather than
measuring variability by adding the coefficients of variation of the two processes, we use
the squared coefficient of variation of the departure process, specifically, of the
interdeparture times, C_D². (We explain the reason shortly.) For the M/M/1 system, we
are able to calculate this quantity directly. For other systems, we must use
approximations. The formulae used are from (Buzacott and Shanthikumar 1993), and are
summarized in Table 7. The general distribution stands for both the Uniform and Beta
distributions. C_A² and C_S² refer to the squared coefficients of variation for the interarrival
and service distributions, respectively.
Table 7. Formulae for C_D² in different systems

System   C_D²
M/M/1    C_A² = 1
M/G/1    1 − ρ² + ρ²·C_S²
G/G/1    (1 − ρ²)·C_A² + ρ²·C_S²
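Taking the G/G/1 entry to be the standard two-moment approximation, Table 7 reduces to a one-line computation. The function below is our illustrative sketch, not code from the original experiments.

```python
def c_d_squared(c_a2, c_s2, rho):
    """Two-moment approximation for the interdeparture-time SCV:
    C_D^2 = (1 - rho^2) * C_A^2 + rho^2 * C_S^2.
    Setting C_A^2 = 1 recovers the M/G/1 row of Table 7, and
    C_A^2 = C_S^2 = 1 gives the M/M/1 value of 1."""
    return (1.0 - rho ** 2) * c_a2 + rho ** 2 * c_s2
```

For example, the M/U/1 system with ρ = 0.9091 and C_S² = 0.0833 gives C_D² ≈ 0.2424, the value listed in Table 9.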
Table 8 lists the distributions used in the experimentation. Table 9 lists the experiments
for single-server systems in increasing order of C_D². The mean service rate was 1 in all
cases. In addition, the same sets were run doubling the arrival and service rates, and also
halving them. While ρ and C_D² remain unchanged for these additional trials, the
interarrival and service times are relatively shorter or longer compared to the time slice
sizes.
The limits for the Uniform distribution with rate η were [1/(2η), 3/(2η)], and the
parameters for the Beta distribution were α₁ = 0.5 and α₂ = 2; see
(Law and Kelton 2000) for an explanation of the parameters of the Beta distribution. The
parameters, range, and variance of the Beta distributions were chosen to ensure C² > 1.
As an illustration of the shape of this Beta, a histogram of the service times for the
Beta(0.5, 2) with mean 1 is shown in Figure 18. Its shape is similar to that of the
Exponential distribution. The mode of the Beta(2, 0.5) is at the right end of the
distribution, but its variability is much smaller than that of the Beta(0.5, 2). As increased
variability is of greater importance than a different shape, we use the Beta(0.5, 2).
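Service times from the scaled Beta(0.5, 2) of Table 8 can be sampled as follows; the scaling argument (a Beta(0.5, 2) on [0, 1] has mean 0.2, so a factor of 5·mean hits the target mean and the ranges in Table 8) is ours, and the function name is illustrative.

```python
import random

def beta_service_time(mean, rng=random):
    """Draw a service time from the scaled Beta(0.5, 2) of Table 8.
    Beta(0.5, 2) on [0, 1] has mean 0.5 / (0.5 + 2) = 0.2, so scaling
    by 5 * mean stretches the support to [0, 5 * mean] and hits the
    target mean -- matching the ranges listed in Table 8."""
    return 5.0 * mean * rng.betavariate(0.5, 2.0)

random.seed(0)
samples = [beta_service_time(1.0) for _ in range(10000)]
# The sample mean should sit near 1, with all values in [0, 5].
```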
Table 8 and Table 9 show why we chose not to simply add C_A² and C_S² as a
measure of "global" variability: the first two systems in Table 9 have the same sum of
C_A² and C_S², 0.1666, but different C_D² values. Since there is a difference between the two
systems, we would like our measure to reflect this.
Table 8. Distributions and their characteristics

Distribution   mean   range          variance   C²
Uniform        1      [0.5, 1.5]     0.0833     0.0833
Uniform        1.1    [0.55, 1.65]   0.1008     0.0833
Uniform        1.5    [0.75, 2.25]   0.1875     0.0833
Exponential    1      n/a            1          1
Exponential    1.1    n/a            1.21       1
Exponential    1.5    n/a            2.25       1
Beta(0.5, 2)   1      [0, 5]         1.1429     1.1429
Beta(0.5, 2)   1.1    [0, 5.5]       1.3829     1.1429
Beta(0.5, 2)   1.5    [0, 7.5]       2.5714     1.1429
Table 9. Systems in increasing order of C_D²

System   mean interarrival time   ρ        C_D²
U/U/1    1.5                      0.6667   0.0664
U/U/1    1.1                      0.9091   0.0770
M/U/1    1.1                      0.9091   0.2424
B/U/1    1.1                      0.9091   0.2807
U/M/1    1.5                      0.6667   0.4738
U/B/1    1.5                      0.6667   0.5373
M/U/1    1.5                      0.6667   0.5926
B/U/1    1.5                      0.6667   0.7082
U/M/1    1.1                      0.9091   0.8346
U/B/1    1.1                      0.9091   0.9527
M/M/1    1.5                      0.6667   1
M/M/1    1.1                      0.9091   1
B/M/1    1.1                      0.9091   1.0383
M/B/1    1.5                      0.6667   1.0635
B/M/1    1.5                      0.6667   1.1156
M/B/1    1.1                      0.9091   1.1181
B/B/1    1.1                      0.9091   1.1564
B/B/1    1.5                      0.6667   1.1791
Figure 18. Histogram of service times generated as Beta(0.5,2) with mean 1
In all cases, 20 values of γ were tracked: 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0,
4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 15.0, and 20.0. Ten independent replications of 10,000
jobs each were run, and the estimates F̂_W(γ_i) from each replication were averaged.
Common random numbers were used across the different systems. Exact waiting times
were also tracked for comparison. The waiting times for all ten runs were used to compute
the job-driven estimates F̂_W(γ_i) where analytic solutions are not available.
Figure 19 shows the lightly-loaded M/M/1 system described in Table 9. The time
slice sizes of 1, 2, and 3 time units were chosen arbitrarily (as were the values of γ).
Time slice sizes 1 and 2 appear to lie right on top of the analytic values of F_W(γ). Time
slice size 3 deviates slightly, though the difference is difficult to see in this graph.
Figure 20 shows the results for the heavily-loaded M/M/1 system. The Time Slice
Counter Method does well, and it is difficult to find any significant deviations from the
exact values.
Figure 21 shows the estimation errors for the two M/M/1 systems together. The
errors are plotted against the values of F_W(γ). The errors are calculated as

    error_γ = F̂_W^TSCM(γ) − F_W(γ).    (10)
Each curve has 20 points, corresponding to the 20 γ values listed above. The points for
the "busy" and "idle" systems do not line up vertically because the values of F_W(γ)
differ for the two. This graph shows how the Time Slice Counter Method performs for
the two levels of congestion as well as for different time slice sizes.
For both types of system, the errors get smaller (in absolute value) as the time
slice size decreases. Even the errors for time slice size 3 are always less than 10
percentage points (in absolute value). In both lightly- and heavily-loaded systems, the
largest errors occur in the left tail of the distribution, while at the right end, the errors are
indistinguishable from zero.
Figure 19. Approximated probabilities using the Time Slice Counter Method for a lightly-loaded M/M/1 system
In Section 4.5.4.2, we stated that the Time Slice Counter Method will correctly classify
jobs as delayed or not delayed if their START and DELAY events fall in different time
slices. Congestion increases the variability in system, and, we hypothesize, in the jobs’
departure process (since it now depends on more factors and jobs), though this is not
measured by 2DC . We further hypothesize that increasing the variability will make it less
likely that the START and DELAY events fall in the same time slice; therefore, the
approximation works work better for more variable or congested systems. Before we
show more results confirming this, we comment on the quality of the approximations for
the tails of the distribution.
Figure 20. Approximated probabilities using the Time Slice Counter Method for a heavily-loaded M/M/1 system
The approximations are most likely to be bad at the left tail of the distribution, i.e., for
small values of γ. Here, the ENTER and DELAY events are likely to fall in the same
time slice. If there is only a small delay between the job’s arrival and its START, the
START and DELAY values will fall in the same time slice; in this case, we will not be
able to definitively classify the job. This is more likely to happen in the case of the
lightly-loaded system, and explains why the estimates for those systems are less accurate
in the tail than for the more heavily-loaded ones.
Figure 21. Errors for M/M/1 Time Slice Counter Method trials
The right-tail estimates are extremely accurate. The γ values for the right tail are so large
that the DELAY events occur far after the START events. This result is positive, as we
are often interested in the probability of a rare event happening, which is in the right tail.
Figure 22 shows the errors for the lightly- and heavily-loaded U/U/1 systems, the
least variable systems, as functions of the job-driven estimates of F_W(γ). The observed
errors for the heavily-loaded system are greater than those for the lightly-loaded system,
even exceeding 10 percentage points. If we only compare errors in the areas in which we
have estimates for both systems (F̂_W(γ) ≥ 0.85), the estimates for the more heavily-loaded
system tend to be more accurate. As we have no estimate for the lower region for the
lightly-loaded system, we should draw no conclusions on the relative behaviors there. As
before, smaller time slice sizes have (significantly) smaller errors than larger ones do.
Figure 22. Errors for U/U/1 Time Slice Counter Method trials
Figure 23 shows the error comparison for the lightly- and heavily-loaded B/B/1 systems.
Again, the more heavily-loaded system has smaller errors than the lightly-loaded system.
They do not exceed 3 percentage points, even for time slice size 3 (which is three times
the average service time). For time slice size 1, the error is always small for the heavily-
loaded system, and only once exceeds 1 percentage point for the lightly-loaded system.
These results are at least as good as those for the M/M/1 systems, though we have
introduced significantly more variability. There are no estimates for the right tail of the
distribution for the heavily-loaded system.
For the experiments doubling and halving the arrival and service rates, the Time
Slice Counter Method works relatively well. For smaller interarrival and service times
(larger rates), the approximations using larger time slice sizes degrade. Time slice size 1
still performs well. The deterioration makes sense as the time slice sizes are relatively
larger than the average interarrival and service times. For larger interarrival and service
times (smaller rates), the method performs well except for the heavily-loaded B/B/1
system. In this case, our γ values far underestimate the upper bound of the distribution,
F̂_W(20) = 0.6301.
Figure 23. Errors for B/B/1 Time Slice Counter Method trials
Results for experimentation with multiple-server tandem queues are given in
Appendix D.4.
4.5.6. Discussion
4.5.6.1. Single-Stage Systems

We have proven that the Time Slice Counter Method converges to the job-driven
estimate of F_W(γ) as the time slice size decreases. We argue that more congested/variable
systems allow the Time Slice Counter Method to get better answers more easily (with
larger time slices). Since we can make our time slice sizes arbitrarily small at no
additional time cost (there are memory costs), this observation is less problematic than it
might be. It does highlight an advantage of the Time Slice Counter Method: Estimating
waiting times is more readily done for busy systems than for idle systems.
The memory costs will increase as time slice size goes to zero. They can be offset
somewhat by using circular arrays of size O(γ/ξ). Though this is less than Ξ for
reasonable γ, the memory requirements will still eventually overtake those of using job
information directly.
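One way to realize the O(γ/ξ) circular-array idea is a fixed ring of counters that recycles slots as new slices appear, so memory depends only on the window size, not on the run length. The class below is our sketch; the names and interface are illustrative, not taken from the text.

```python
class CircularCounters:
    """Fixed-size ring of time slice counters -- a sketch of the
    O(gamma/xi) circular-array idea.  Only the most recent n_slices
    slices are retained; slots for slices that scroll out of the
    window are zeroed and reused."""
    def __init__(self, n_slices):
        self.counts = [0] * n_slices
        self.newest = 0  # highest slice index seen so far

    def increment(self, slice_index):
        # Recycle entries for slices that have scrolled out of the window.
        while self.newest < slice_index:
            self.newest += 1
            self.counts[self.newest % len(self.counts)] = 0
        self.counts[slice_index % len(self.counts)] += 1

    def get(self, slice_index):
        return self.counts[slice_index % len(self.counts)]
```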
To get an idea of which factors are most important in determining the size of the
error, we performed a multiple linear regression (forward and backward stepwise and
best subset) on the error data. The factors included in the model are given in Table 10.
Table 10. Factors included in multiple regression

ACTUAL       the actual value of F_W(γ) (as estimated by tracing jobs)
TIME SLICE   the size of the time slice
CV2          C_D²
GAMMA        the current value of γ
RATE         the arrival rate λ
RHO          ρ = λ/ν
The data used were the job-driven distribution estimates and the errors made by the Time
Slice Counter Method for different time slice sizes. The numbers used were from all the
experiments listed in Table 9 (using the deterministic classification of uncertain jobs
described in Appendix D.2.6), as well as the same experiments with doubled and halved
rates.
All three methods of choosing a model resulted in the same selection. The model
is shown in Table 11. The adjusted R² of the model is 0.1828, which is low. All six
factors are statistically significant. Of these, the two with the largest coefficients are the
"actual" (job-driven) distribution value and the measure of congestion or busyness, ρ.
The intercept of the model is negative, and increasing values of F_W(γ) and ρ bring the
errors closer to zero.
Table 11. Multiple linear regression model of Time Slice Counter Method errors

PREDICTOR VARIABLES   COEFFICIENT   STD ERROR   STUDENT'S T   P        VIF
CONSTANT              -0.05639      0.00432     -13.05        0.0000
ACTUAL                 0.03867      0.00181      21.33        0.0000   4.5
TIME SLICE             0.00131      3.104E-04     4.23        0.0000   1.0
CV2                    0.00378      8.781E-04     4.31        0.0000   1.6
GAMMA                 -0.00148      8.352E-05   -17.68        0.0000   2.6
RATE                   0.00124      3.652E-04     3.39        0.0007   1.5
RHO                    0.04344      0.00361      12.04        0.0000   3.0

R-SQUARED 0.1844              RESID. MEAN SQUARE (MSE) 1.989E-04
ADJUSTED R-SQUARED 0.1828     STANDARD DEVIATION 0.01410

SOURCE       DF     SS        MS          F        P
REGRESSION   6      0.13889   0.02315     116.39   0.0000
RESIDUAL     3089   0.61439   1.989E-04
TOTAL        3095   0.75329

CASES INCLUDED 3096   MISSING CASES 0
Figure 24 shows a plot of the residual values versus the fitted values. There is a clear
pattern, indicating that there is behavior in the error not explained by the regression
model. Better models may include squared or product terms to try to capture
relationships between the variables, and will be the subject of future research.
Figure 24. Regression residuals versus fitted values
4.5.6.2. Other Service Disciplines

The Time Slice Counter Method works for almost all disciplines listed in Section 4.2.3.1
if we assume FCFS queueing within classes (e.g., priority classes or round-robin
groups). For example, for systems with priority service over multiple job types, if we
assume FCFS servicing within the priority groups and no preemption, we would maintain
separate counters for the different priority groups.
Without extensions, the method will not work for systems with multiple job types
but no priority service. Similarly, any discipline that allows job ordering in queue to
change (LCFS) or that otherwise needs information from region III cannot be modeled.
Possible extensions to the Time Slice Counter Method to accommodate such disciplines
are left to future research.
4.5.6.3. Summary

We summarize our conclusions and hypotheses from our experimentation for both single-
and multiple-stage tandem queues (Appendix D.4) in the following list.
• For single-stage G/G/s systems, decreasing time slice sizes leads to the correct
waiting-time probability estimate.
• For multiple-stage systems with multiple servers at one or more of the stages,
decreasing the time slice size will not necessarily lead to the correct cycle-time
estimate because of overtaking (the kth job to start service may not be the kth job to
finish service).
• For single-stage systems, greater variability and congestion (as measured by C_D²
and ρ) will make estimating the waiting-time probability easier, i.e., we can get
smaller errors with larger time slices.
• For multiple-stage systems, greater variability makes overtaking more likely and will
make cycle-time estimates less accurate. Higher congestion appears to increase the
accuracy of the Time Slice Counter Method, but does not cause a significant increase
in overtaking.
• Longer tandem queues do not have greater errors than shorter tandem queues.
Extensions to the Time Slice Counter Method are given in Appendix D.5.
4.5.7. Future Work

Future work in this area includes implementing the extensions proposed in Appendix D.5,
especially the one for LCFS service. If they work as hoped, they will greatly enhance
the Time Slice Counter Method, increasing its flexibility and applicability without
resorting to full job traces.
We would also like to develop further extensions that allow us to reduce
estimation errors for non-FCFS disciplines and cycle time estimation. A possible
approach uses the amount of overtaking done by jobs (determined during the simulation
run) to adjust the estimates obtained by the Time Slice Counter Method. For some
background on overtaking, see (Whitt 1984; Daganzo 1997). We may also be able to use
the Time Slice Counter Method to analyze the general cause and effect process studied by
statisticians.
We want to further investigate the properties of tandem queueing systems,
specifically to look at the validity of Claim 4. If it is true, we may be able to extend our
study to analyze the effects of queueing disciplines themselves. Perhaps, as the length of
the tandem queue increases, the effects of the service discipline (e.g. LCFS versus FCFS)
will be less noticeable.
A final idea for future research is relating the number of time slices with no
events in them to the estimation error. As ξ goes to zero, the error will decrease, while, at
the same time, the number of time slices will increase. We would like to quantify the
relationship between the two; it may be possible to determine the error size based on the
number of “empty” time slices. This reduces the need for guesswork, transaction
tagging, or trial runs.
5. CONCLUSIONS

The main contribution of this dissertation is the introduction of an information taxonomy
for simulation models. While a great amount of literature tries to classify models, many
of the classification schemes fall short. The classical world views focus on the
implementation of models, and are neither exclusive nor exhaustive. The resource-driven
and job-driven paradigms had not been defined explicitly, which reduces their value as a
taxonomy. The information taxonomy presented here allows their formal definition. A
comparison of the taxonomy to current formalisms is included in this dissertation.
The advantage of the information taxonomy is that it allows the modeler to focus
on the information that must be included in the model to capture system behavior and
return the desired output statistics. Consciously including only the necessary information
can reduce the time to create the model, as well as the potential for modeling and
implementation errors. The information taxonomy also allows the user to organize the
information contained in the model, which may suggest a natural means of
implementation.
The taxonomy can be used to identify what can be modeled and what output is
available given a set of information. For example, in a resource-driven simulation, it is
not obvious how to model critical ratio service disciplines accurately. Either more
information must be included in the model, or a suitable approximation must be found.
If an approximation is required, several aspects of the algorithm must be
considered: There are tradeoffs between algorithm accuracy, memory, and computational
requirements. We give an example of a single graphical error measure that incorporates
all three aspects.
Usually, the unavailable information is job-related because this type of
information is the most extensive for many systems. An example of the impact of lack of
modeling information is in dedication constraints. We have defined the problem faced by
the user, and have proposed an approximation algorithm to overcome the lack of
information. The approximation gives upper and lower bounds on the exact solution
without including job information in the model.
An example of the impact of lack of statistics information is in job delay
distributions. We have proposed an approximation for job waiting time distributions in
FCFS systems with no overtaking. No job information is needed for this approximation.
A fixed set of counters is added; this approximation is useful when the number of
counters is smaller than the number of jobs in the system.
We feel that this line of research is valuable in making technology more effective.
Technology should be a decision-making tool, and should not be a significant bottleneck
in the process. By providing a means of classifying the system information, we are able
to assist in organizing model development.
In the future, we would like to further develop the approximation error measure
introduced in Section 4.3.3.2. It has the advantage of combining three important
approximation aspects (accuracy, memory requirements, and computational
requirements) into one measure. It also allows the visualization of the measure, which
can aid in the comparison of different approximations.
We would also like to develop extensions to the approximations proposed. In
some cases, this means we must add more information to the model. In these cases, we
will investigate the tradeoffs.
6. BIBLIOGRAPHY

Aetna. 2001. Plan Explanation.
Akcalt, E., K. Nemoto, and R. Uzsoy. 2001. Cycle-Time Improvements for Photolithography Process in Semiconductor Manufacturing. IEEE Transactions on Semiconductor Manufacturing. 14(1): 48-56.
Atherton, L. and R. Atherton. 1995. Wafer Fabrication: Factory Performance and Analysis. Kluwer. Boston, MA.
AutoSimulations, Inc. 1999. AutoSched AP User's Guide v 6.2. Bountiful, Utah.
Axelsson, O. and R. Marinova. 1999. On a Hybrid Method of Characteristics and Central Difference Method for Convection-Diffusion Problems. Department of Mathematics Report No. 9941, University of Nijmegen.
Baker, A. D. 1998. A Survey of Factory Control Algorithms that Can Be Implemented in a Multi-Agent Heterarchy: Dispatching, Scheduling, and Pull. Journal of Manufacturing Systems. 17(4): 297-320.
Bayley, G. V. and J. M. Hammersley. 1946. The "Effective" Number of Independent Observations in an Autocorrelated Time Series. Journal of the Royal Statistical Society. 8:184-197.
Bowen, D. 2003. Personal Communication. 3 March 2003.
Brealey, R. A. and S. C. Myers. 1996. Principles of Corporate Finance. 5th Edition. McGraw-Hill.
Brown, S., F. Chance, J. W. Fowler, and J. K. Robinson. 1997. A Centralized Approach to Factory Simulation. 3. 83-86.
Buzacott, J. A. and J. G. Shanthikumar. 1993. Stochastic Models of Manufacturing Systems. W. J. Farbrycky and J. H. Mize. Prentice Hall International Series in Industrial and Systems Engineering. Prentice-Hall, Inc. Upper Saddle River, NJ.
Carson, J. S. 1993. Modeling and Simulation Worldviews. Proceedings of the 1993 Winter Simulation Conference. eds. G. W. Evans, M. Mollaghasemi, E. C. Russell and W. E. Biles. New York, NY, USA: IEEE. 18-23.
Commoner, F., A. W. Holt, S. Even, and A. Pnueli. 1971. Marked Directed Graphs. Journal of Computer and System Sciences. 5(5): 511-523.
Conway, R. W. and W. L. Maxwell. 1962. Network Dispatching by the Shortest-Operation Discipline. Operations Research. 10(1): 51-73.
Cota, B. A. and R. G. Sargent. 1992. A Modification of the Process Interaction World View. ACM Transactions on Modeling & Computer Simulation. 2(2): 109-129.
Daganzo, C. F. 1997. Fundamentals of Transportation and Traffic Operations. Oxford; New York: Pergamon.
Derrick, E. J. 1992. A Visual Simulation Support Environment-Based on a Multifaceted Conceptual Framework. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Derrick, E. J., O. Balci, and R. E. Nance. 1989. A Comparison of Selected Conceptual Frameworks for Simulation Modeling. Proceedings of the 1989 Winter Simulation Conference. eds. A. MacNair, K. J. Musselman and P. Heidelberger. SCS. 711-718.
Dillon, W. R. and M. Goldstein. 1984. Multivariate Analysis: Methods and Applications. John Wiley & Sons. New York.
Fendick, K. W. and W. Whitt. 1998. Verifying Cell Loss Requirements in High-Speed Communication Networks. Technical Report TR 98.33.1, AT&T Research Labs.
Fischbein, S. A. 2002. Personal Communication.
Fisher, R. A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character. 222:309-368.
Fishman, G. S. 1973. Concepts and Methods in Discrete Event Digital Simulation. Wiley-Interscience, John Wiley & Sons. New York.
Fleischer, L. 2004. Fast Approximation Algorithms for Fractional Covering Problems with Box Constraints. Proceedings of the 36th ACM/SIAM Symposium on Discrete Algorithms.
Fox, B. L. and P. W. Glynn. 1990. Discrete-Time Conversion for Simulating Finite-Horizon Markov Processes. SIAM Journal on Applied Mathematics. 50(5): 1457-1473.
Gahagan, S. M. and J. W. Herrmann. 2001. Improving Simulation Model Adaptability with a Production Control Framework. Proceeding of the 2001 Winter Simulation Conference. eds. B. A. Peters, J. S. Smith, D. J. Medeiros and M. W. Rohrer. Piscataway, NJ, USA: IEEE. 937-945.
Govil, M. K. and M. C. Fu. 1999. Queueing Theory in Manufacturing: A Survey. Journal of Manufacturing Systems. 18(3): 214-240.
Gross, D. and C. M. Harris. 1998. Fundamentals of Queueing Theory. V. Barnett, R. A. Bradley, N. I. Fisher, J. S. Hunter, J. B. Kadane, D. G. Kendall, D. W. Scott, A. F. M. Smith, J. L. Teugels and G. S. Watson. Wiley Series in Probability and Statistics. Third Edition. John Wiley & Sons Inc. New York.
Gutin, G., A. Vainshtein, and A. Yeo. 2003. Domination Analysis of Combinatorial Optimization Problems. Discrete Applied Mathematics. 129(2-3): 513-520.
Gutin, G. and A. Yeo. 2002. Polynomial Approximation Algorithms for the TSP and QAP with a Factorial Domination Number. Discrete Applied Mathematics. 119(1-2): 107-116.
Haas, P. J. 2002. Stochastic Petri Nets: Modeling, Stability, Simulation. P. W. Glynn and S. M. Robinson. Springer Series in Operations Research. Springer-Verlag. New York.
HealthNet. 2002. A Complete Explanation of Your Plan.
Healy, K. J. and R. A. Kilgore. 1997. Silk™: a Java-based process simulation language. Proceedings of the 1997 Winter Simulation Conference. 475-482.
Henriksen, J. O. 1981. GPSS - Finding the Appropriate World-View. Proceedings of the 1981 Winter Simulation Conference. 505-516.
Hopp, W. and M. Spearman. 1991. Throughput of a Constant Work in Process Manufacturing Line Subject to Failures. International Journal of Production Research. 29(3): 635-655.
Hyden, P., L. Schruben, and T. Roeder. 2001. Resource Graphs for Modeling Large-Scale, Highly Congested Systems. Proceeding of the 2001 Winter Simulation Conference. ed. M. W. Rohrer. Piscataway, NJ, USA: IEEE. 523-529.
Jackson, J. R. 1957. Simulation Research on Job Shop Production. Naval Research Logistics Quarterly. 4:287-295.
Jefferson, D. 1983. Virtual Time. Proceedings of the 1983 International Conference on Parallel Processing. eds. H. J. Siegel and L. Siegel. IEEE Comput. Soc. Press. 384-394.
Jensen, J. B., M. K. Malhotra, and P. R. Philipoom. 1996. Machine Dedication and Process Flexibility in a Group Technology Environment. Journal of Operations Management. 14(1): 19-39.
Jordan, M. 2003. An Introduction to Probabilistic Graphical Models. in preparation.
Jordan, M., Z. Ghahramani, T. Jaakkola, and L. Saul. 1999. An Introduction to Variational Methods for Graphical Models. Cambridge, MA. The MIT Press.
Kiviat, P. J. 1969. Digital Computer Simulation: Computer Programming Languages. Memorandum RM-5883-PR, The Rand Corporation. Santa Monica, CA.
Landau, E. 1909. Handbuch der Lehre von der Verteilung der Primzahlen. Teubner. Leipzig.
Law, A. and W. D. Kelton. 2000. Simulation Modeling and Analysis. Third Edition. McGraw-Hill Higher Education.
L'Ecuyer, P. 1994. Efficiency Improvement and Variance Reduction. Proceedings of the 1994 Winter Simulation Conference. eds. J. D. Tew, M. S. Manivannan, D. A. Sadowski and A. F. Seila. New York, NY, USA: IEEE. 122-132.
Little, J. D. 1961. A Proof of the Queueing Formula: L = λW. Operations Research. 9:383-387.
Markowitz, H. M. 1979. Simscript: Past, Present, and Some Thoughts about the Future. New York. Academic Press.
McNeill, J. E., G. T. Mackulak, and J. W. Fowler. 2003. Indirect Estimation of Cycle Time Quantiles from Discrete Event Simulation Models Using the Cornish-Fisher Expansion. Proceedings of the 2003 Winter Simulation Conference. eds. S. Chick, P. J. Sanchez, D. Ferrin and D. J. Morrice. 1377-1382.
Merriam-Webster. 1993. Merriam-Webster's Collegiate Dictionary. Tenth Edition. Springfield, Massachusetts.
Microsoft. 2004. Microsoft Application Center 2000: Load Balancing.
Miller, J. A., G. Baramidze, P. A. Fishwick, and A. P. Sheth. 2004. Investigating Ontologies for Simulation Modeling. Proceedings of the 37th Annual Simulation Symposium.
Moore, J. M. and J. R. Wilson. 1967. A Review of Simulation Research in Job Shop Scheduling. Journal of Production Inventory Management. 8:1-10.
Nance, R. E. 1979. Model Representation in Discrete Event Simulation: Prospects for Developing Documentation Standards. New York. Academic Press.
Nance, R. E. 1981. The Time and State Relationships in Simulation Modeling. Communications of the ACM. 24(4): 173-179.
Nance, R. E., C. M. Overstreet, and E. H. Page. 1999. Redundancy in Model Specifications for Discrete Event Simulation. ACM Transactions on Modeling and Computer Simulation. 9(3): 254-281.
Overstreet, C. M. 1982. Model Specification and Analysis for Discrete Event Simulation. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Overstreet, C. M. 1987. Using Graphs to Translate Between World Views. Proceedings of the 1987 Winter Simulation Conference. eds. A. Thesen, H. Grant and W. D. Kelton. San Diego, CA, USA: SCS. 582-589.
Page, E. H. 1994. Simulation Modeling Methodology: Principles and Etiology of Decision Support. Ph.D. Dissertation, Department of Computer Science, Virginia Polytechnic Institute and State University. Blacksburg, VA.
Panwalkar, S. S. and W. Iskander. 1977. A Survey of Scheduling Rules. Operations Research. 25(1): 45-60.
Petri, C. A. 1962. Fundamentals of a Theory of Asynchronous Information Flow. Proceedings of the IFIP Congress. 386-390.
Pinedo, M. 2002. Scheduling: Theory, Algorithms, and Systems. Second Edition. Prentice Hall. Upper Saddle, NJ.
Powell, M. J. D. 1981. Approximation Theory and Methods. Cambridge University Press. Cambridge.
Roeder, T. M., S. A. Fischbein, M. Janakiram, and L. W. Schruben. 2002. Resource-Driven and Job-Driven Simulations. 2002 International Conference on Modeling and Analysis of Semiconductor Manufacturing. 78-83.
Rohan, D. 1999. Machine Dedication under Product and Process Diversity. Proceedings of the 1999 Winter Simulation Conference. eds. P. A. Farrington, H. Black Nembhard, D. T. Sturrock and G. W. Evans. IEEE. 897-902.
Rose, O. 2002. Some Issues of the Critical Ratio Dispatch Rule in Semiconductor Manufacturing. Proceedings of the 2002 Winter Simulation Conference. eds. E. Yücesan, C.-H. Chen, J. L. Snowdon and J. M. Charnes. IEEE. 1401-1405.
Ross, S. M. 1997. Introduction to Probability Models. 6th Edition. Academic Press. San Diego, CA.
Savage, E. L., L. W. Schruben, and E. Yücesan. 2004. On the Generality of Event Graph Models. INFORMS Journal on Computing.
Schmeiser, B. and Y. Yeh. 2002. On Choosing a Single Criterion for Confidence-Interval Procedures. Proceedings of the 2002 Winter Simulation Conference. eds. E. Yücesan, C.-H. Chen, J. L. Snowdon and J. M. Charnes. IEEE. 345-352.
Schmidt, J. W. and R. E. Taylor. 1970. Simulation and Analysis of Industrial Systems. Richard D. Irwin. Homewood, IL.
Schriber, T. 1991. An Introduction to Simulation Using GPSS/H. John Wiley & Sons.
Schruben, D. and L. W. Schruben. 2001. Graphical Simulation Modeling Using SIGMA. 4th Edition. Custom Simulations.
Schruben, L. 1983. Simulation Modeling with Event Graphs. Communications of the ACM. 26(11): 957-963.
Schruben, L. W. 2000. Mathematical Programming Models of Discrete Event System Dynamics. Proceedings of the 2000 Winter Simulation Conference. eds. J. A. Joines, R. R. Barton, K. Kang and P. A. Fishwick. Piscataway, NJ, USA: IEEE. 381-385.
Schruben, L. W. 2003. Conditional Parametric Petri Nets and their Mapping to Simulation Event Graphs. European Simulation and Modelling Conference.
Schruben, L. W. and R. Kulkarni. 1982. Some Consequences of Estimating Parameters for the M/M/1 Queue. Operations Research Letters. 1(2): 75-78.
Schruben, L. W. and T. M. Roeder. 2003. Fast simulations of large-scale highly congested systems. Simulation: Transactions of the Society for Modeling and Simulation International. 79(3): 1-11.
Schruben, L. W. and E. Yücesan. 1988. Transaction Tagging in Highly Congested Queueing Simulations. Queueing Systems. 3(3): 257-264.
Seila, A. F., V. Ceric, and P. Tadikamalla. 2003. Applied Simulation Modeling. Duxbury Applied Series. Thomson Brooks/Cole. Belmont, CA.
Shafer, S. M. and J. M. Charnes. 1997. Offsetting Lower Routing Flexibility in Cellular Manufacturing due to Machine Dedication. International Journal of Production Research. 35(2): 551-567.
Shaked, M. and J. G. Shanthikumar. 1994. Stochastic Orders and Their Applications. D. Aldous and Y. L. Tong. Probability and Mathematical Statistics. Academic Press, Inc. San Diego.
Sturm, R. 2002. Personal Communication. 3 May 2002.
Tibshirani, R. 1988. Variance Stabilization and the Bootstrap. Biometrika. 75(3): 433-444.
Tocher, K. D. 1963. The Art of Simulation. English Universities Press. London.
Törn, A. 1981. Simulation Graphs: A General Tool for Modelling Simulation Designs. Simulation. 37(6): 187-194.
Vollmann, T. E., W. L. Berry, and D. C. Whybark. 1988. Manufacturing Planning and Control Systems. Second Edition. Dow Jones-Irwin. Homewood, IL.
Welch, P. D. 1981. On the Problem of the Initial Transient in Steady-State Simulation. IBM Watson Research Center. Yorktown Heights, NY.
Whitt, W. 1984. The Amount of Overtaking in a Network of Queues. Networks. 14(3): 411-426.
Williams, J. W. J. 1964. Algorithm 232 (Heap Sort). Communications of the ACM. 7: 347-348.
Witten, I. H., G. M. Birtwistle, J. Cleary, D. R. Hill, D. Levinson, G. Lomow, R. Neal, M. Peterson, B. W. Unger, and B. Wyvill. 1983. Jade: A Distributed Software Prototyping Environment. ACM Operating Systems Review. 17(3): 10-23.
Wonnacott, P. 1996. Run-time Support for Parallel Discrete Event Simulation Languages. Ph.D. Dissertation, Computer Science, University of Exeter. Exeter.
Woods, R. H. 1998. A Cost Benefit Analysis of Photolithography and Metrology Dedication in a Metrology Constrained Multipart Number Fabricator. IEEE/SEMI Advanced Semiconductor Manufacturing Conference and Workshop. IEEE. 145-147.
Yoshida, T. and H. Touzaki. 1999. A Study on Association Among Dispatching Rules in Manufacturing Scheduling Problems. 1999 7th IEEE International Conference on Emerging Technologies and Factory Automation. Proceedings ETFA '99. ed. J. M. Fuertes. IEEE. 1355-1360.
Zeigler, B. P. 1976. Theory of Modeling and Simulation. John Wiley. New York.
Zeigler, B. P. 2003. DEVS Today: Recent Advances in Discrete Event-based Information. Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems. 148-162.
Appendix A. DEFINITIONS AND NOTATION
A.1 DEFINITIONS AND ABBREVIATIONS
System A system is a collection of entities that interact with a common purpose
according to sets of laws and policies (Schmidt and Taylor 1970).
Model A model is “a system used as a surrogate for another system.”
(Schruben and Schruben 2001).
Simulation A simulation model is a (computer) program that is used as a surrogate for
another system.
Job A job is a transient part of a system. For example, a customer enters and
leaves a grocery store; the job is not in the store for the duration of the
store’s business hours. The terms “job,” “customer,” and “transient entity”
are used interchangeably. The literature also refers to jobs as “dynamic
entities.”
Resource A resource is a resident element of a system. For example, a cash register
will be in a store during the entire time the store is in operation. Workers
may be transient or resident elements, depending on the focus of the study.
The terms “resource,” “server,” and “resident entity” are used
interchangeably. The literature also refers to them as “static entities.”
Event “A change in object state, occurring at an instant” (Nance 1981).
Taxonomy In the context of this dissertation, a taxonomy is a framework to assist in
organizing information used for a simulation model.
RD Resource-driven; simulation approach where the focus is on the resident
entities in the model.
JD Job-driven; simulation approach where the focus is on the transient entities
in the model.
FEL The Future Events List is a data structure that stores the simulation events
that have been scheduled to occur in the future.
SPL Simulation Programming Language
STPN Stochastic Timed Petri Net, see Section 2.2.2.
HMM Hidden Markov Models; a type of statistical learning model.
Delay “Delay” is used as a generic term for the time between two events of
interest. Typically, a delay is the time between a job’s arrival at a server and
its service start. In this dissertation, it will also be used to refer to a job’s
cycle time (time in system). It can also refer generically to the time between
a job’s visits to points U and V in the system, or between events X and Y.
Average Unless otherwise specified, we refer to count, not time, averages in this
dissertation. That is, the sample average of random variable X is defined as
X̄ = (1/n) ∑_{i=1}^{n} X_i.
Service Rule or rules a server uses to decide which job to process next. The
Discipline rule(s) may be job-, server-, or system-based, or may be arbitrary. A service
discipline is also referred to as a service protocol or a dispatching rule. A
common service discipline is first-come-first-serve (FCFS).
FCFS First-come-first-serve service discipline. It is also known as first-in-first-out
(FIFO).
LCFS Last-come-first-serve service discipline. It is also known as last-in-first-out
(LIFO).
CR Critical ratio ranking; used to sort jobs in queue based on their due dates and
expected remaining processing time. See Section 4.2.3.4.
A.2 NOTATION
We organize the notation by the section in which it appears.
A.2.1 GENERAL NOTATION AND NOTATIONAL CONVENTIONS
kth job in the system Refers to the kth job to enter the system. In FCFS single-stage systems or single-server tandem queues, this will also correspond to the kth service in the system. In multiple-server n-stage tandem queues, overtaking is possible, so the kth job in the system may not be the kth service at stages 2, 3, …, n.
( )* * indicates the largest value (for an index); for example, i = 1, 2, …, i*
i index for a resource in a group of several resources; also time slice index
j index for job type
k generic counter
g index for a resource group
r index for a step in a route
s total number of servers at a stage in a queueing system
t index for time
t* length of the simulation run
N number of jobs simulated
n number of stages for tandem queueing systems
Q number of jobs waiting in queue, typically used in an event graph
R number of available servers, typically used in an event graph
X random variable; its realization is x, so X = x
θ set of performance measures
λ arrival rate
ν service rate
ρ traffic intensity; for a given system, ρ = λ/ν
I(i) indicator function; I(i) = 1 if i is true, 0 otherwise
µ_X average (or expected) value of random variable X
σ²_X variance of random variable X
C²_X squared coefficient of variation for random variable X, C²_X = σ²_X / µ²_X
MSE[X] mean squared error of the random variable X taken from a population with unknown mean µ_X; MSE[X] = E[(X − µ_X)²] = bias² + variance
RND a (function returning a) Uniform (0,1) random number
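To make the statistical notation above concrete, here is a minimal numeric sketch in Python. The sample values and the assumed true mean are invented for illustration only:

```python
# Illustrative check of the A.2.1 statistics; the sample below is invented.
xs = [2.0, 4.0, 4.0, 6.0]          # realizations x of the random variable X
n = len(xs)

mean = sum(xs) / n                              # sample average, (1/n) * sum of X_i
var = sum((x - mean) ** 2 for x in xs) / n      # variance about the sample mean
c2 = var / mean ** 2                            # squared coefficient of variation

# MSE decomposition: E[(X - mu_X)^2] = bias^2 + variance.
# Here `mean` is treated as an estimate of an assumed true mean mu = 4.5.
mu = 4.5
bias = mean - mu
mse = bias ** 2 + var

print(mean, var, c2, mse)
```

The same decomposition underlies the error analyses later in the dissertation: an approximation can trade a little bias for a large variance reduction and still have lower MSE.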
A.2.2 INFORMATION TAXONOMY
{G(t) : 0 ≤ t ≤ t*} set of general system information; when the set does not change over time, we omit the time index for simplicity, G(t) = G ∀t
{R(t) : 0 ≤ t ≤ t*} set of resources; when the set does not change over time, we omit the time index for simplicity, R(t) = R ∀t
{J(t) : 0 ≤ t ≤ t*} set of jobs; when the set does not change over time, we omit the time index for simplicity, J(t) = J ∀t
{2^R(t) : 0 ≤ t ≤ t*} power set of {R(t)}; that is, the set of all subsets of {R(t)}
{2^J(t) : 0 ≤ t ≤ t*} power set of {J(t)}; that is, the set of all subsets of {J(t)}
{R′(t) : 0 ≤ t ≤ t*} element of the power set of {R(t)}, R′(t) ∈ 2^R(t); that is, a subset of all resources
{J′(t) : 0 ≤ t ≤ t*} element of the power set of {J(t)}, J′(t) ∈ 2^J(t); that is, a subset of all jobs
Jmax maximum value of J(t), Jmax ≥ J(t) ∀t
M set of modeling behaviors or rules, M = M_G ∪ M_{R,local} ∪ M_{R,global} ∪ M_{J,local} ∪ M_{J,global}; the set of system aspects that are to be modeled, for example resource dedication or FCFS queueing. Subscripts “local” and “global” refer to local and global behaviors.
O set of output statistics, O = O_G ∪ O_{R,local} ∪ O_{R,global} ∪ O_{J,local} ∪ O_{J,global}; the set of desired output statistics, for example, job waiting time distributions.
A.2.3 APPROXIMATION ALGORITHM CHARACTERISTICS
C(X) amount of CPU time (processing power) required to get an estimate for random variable X.
Eff(X) an algorithm’s efficiency at attaining an estimate for random variable X.
A.2.4 ESTIMATING DEDICATION CONSTRAINTS
d service discipline used to process jobs
m number of revisits jobs make to a resource group
{Q_rg(t) : 1 ≤ r ≤ r*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs at each route step r waiting in queue for each server group g at time t (region I.2.ii in the taxonomy)
{S_ig(t) : 1 ≤ i ≤ s_g*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} status of server i in group g at time t; S_ig(t) ∈ {busy, idle} (region II.2.ii.b). The total number of available servers in group g at time t is S_·g(t) = ∑_{i=1}^{s_g*} I(S_ig(t) = idle); S_·g(t) is in region I.2.ii.
{Q_irg(t) : 1 ≤ i ≤ s_g*, 1 ≤ r ≤ r*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs at route step r waiting specifically for server i of group g at time t (region II.2.ii.b). The total number of jobs waiting at group g, either for a specific server or for any of the servers, is Q_g^tot(t) = ∑_{r=1}^{r*} ( Q_rg(t) + ∑_{i=1}^{s_g*} Q_irg(t) ) (region I.2.ii).
{D_ijg(t) : 1 ≤ i ≤ s_g*, 1 ≤ j ≤ j*, 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of jobs of type j that server i of group g has processed that may return to i in the future (region II.2.ii.b).
J = {J_gk : 1 ≤ g ≤ g*, 1 ≤ k ≤ N}; for each job k, the resource in group g that k has been dedicated to. If no dedication is present at g, the array entry can be left blank (region III.2.i.b).
{A_g(t) : 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of times by time t that jobs arriving at g have been forced to wait although there was at least one idle server
{V_g(t) : 1 ≤ g ≤ g* : 0 ≤ t ≤ t*} number of times by time t that a server in group g remains idle although it finds Q_g^tot(t) > 0
A.2.5 APPROXIMATING WAITING TIME DISTRIBUTIONS
γ delay value
ξ time slice size
Ξ total number of time slices, Ξ = ⌈t*/ξ⌉
W = {W_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} waiting time at the ith stage of the kth job in the system; for single-stage systems, we omit the second index for simplicity
ℓ(t, γ, ξ) Lower Time Slice Counter Method time slice number of a DELAY event scheduled at time t for γ time units in the future at time slice size ξ; ℓ(t, γ, ξ) = ⌊(t + γ)/ξ⌋
u(t, γ, ξ) Upper Time Slice Counter Method time slice number of a DELAY event scheduled at time t for γ time units in the future at time slice size ξ
{A_k : 1 ≤ k ≤ N} arrival time of the kth job in the system
{N_A(t) : 0 ≤ t ≤ t*} number of arrivals by time t
S = {S_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} start service time at the ith stage of the kth job in the system; for single-stage systems, we omit the stage index for simplicity
N_S(t) = {N_S,i(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of starts at stage i by time t; for single-stage systems, we omit the stage index for simplicity
S shorthand for N_S(t); may also refer to the time of the START event (S_k)
D_γ = {D_k,i,γ : 1 ≤ i ≤ n : 1 ≤ k ≤ N} time the kth job in the system has been delayed γ time units at the ith stage of the system; for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
N_D^γ(t) = {N_D,i^γ(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of jobs at stage i delayed γ time units by time t; N_D,1^γ(t) = N_A(t − γ) for t ≥ γ; for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
D shorthand for N_D^γ(t); may also refer to the time of the DELAY event (D_k)
W_γ(t) = {W_i,γ(t) : 1 ≤ i ≤ n : 0 ≤ t ≤ t*} number of jobs at stage i whose wait was less than γ time units by time t, W_γ(t) ≤ N_D^γ(t); for single-stage systems, we omit the stage index for simplicity; we omit the delay subscript when its value is obvious
{∆_i,γ : 1 ≤ i ≤ Ξ} number of γ-time unit DELAY events through time slice i; we omit the delay subscript when its value is obvious
N_L^{γ,ξ}(t), 0 ≤ t ≤ t*, number of approximated γ-DELAY events from the Lower Time Slice Counter Method that have occurred by time t, for time slice size ξ; we may omit the subscript for simplicity. N_L^{γ,ξ}(t) = ∆_{ℓ(t,0,ξ),γ}
L shorthand for N_L^{γ,ξ}(t)
N_U^{γ,ξ}(t), 0 ≤ t ≤ t*, number of approximated γ-DELAY events from the Upper Time Slice Counter Method that have occurred by time t, for time slice size ξ; we may omit the subscript for simplicity. N_U^{γ,ξ}(t) = ∆_{u(t,0,ξ),γ}
U shorthand for N_U^{γ,ξ}(t)
F = {F_k,i : 1 ≤ i ≤ n : 1 ≤ k ≤ N} finish time at the ith stage of the kth job in the system; for single-stage systems, we omit the stage index for simplicity
F_W(·) distribution function of the job waiting (or other delay) times
F_W(γ) P{delay ≤ γ}
F̂_W(γ) estimate of P{delay ≤ γ}
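A small numeric sketch of the time-slice quantities defined above may help; the numbers are invented, and only the lower slice formula is coded because only it appears explicitly in the notation:

```python
from math import floor, ceil

def lower_slice(t, gamma, xi):
    # Lower Time Slice Counter Method: slice number of a DELAY event
    # scheduled at time t to complete gamma time units later,
    # floor((t + gamma) / xi).
    return floor((t + gamma) / xi)

t_star, xi = 100.0, 7.0
num_slices = ceil(t_star / xi)   # Xi, the total number of time slices

# A delay of 2.0 units scheduled at t = 3.2 with slice size 0.5
# completes at 5.2, which falls in slice 10.
print(lower_slice(3.2, 2.0, 0.5), num_slices)
```

Rounding down (the lower method) undercounts how long the delay appears to last, so the resulting delay-count process bounds the true one from one side; the upper method errs in the other direction.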
Appendix B. EXAMPLE: GENERATING CORRECT DEPARTURE PROCESS WITHOUT JOB INFORMATION
In this section, we define the problem of generating the correct departure process from a
queueing system if information from region III is unavailable. We propose a solution
procedure, and describe how the quality of a solution could be measured for this problem.
B.1 PROBLEM STATEMENT
Consider an open queueing network in which jobs of j* types are processed on g* server groups. Each server group g is composed of s_g* functionally identical servers. Each job type is characterized by known service-time distributions and routes. The r* route steps may be deterministic, stochastic, and/or state-dependent. Both the routings and service times may be non-stationary. At each of the server groups, jobs are processed according to some service discipline. The set {j*, g*, s_g*, r*, t*} and the associated information are in region I of the taxonomy.
Define the following processes. Taxonomy classifications are given in parentheses.
N(t) = {N_ig(t) : 1 ≤ i ≤ s_g*, 1 ≤ g ≤ g*}, 0 ≤ t ≤ t*, is the number of jobs that have left server i in group g by time t. It may be sufficient to consider N_·g(t) = ∑_{i=1}^{s_g*} N_ig(t), the number of jobs that have left the entire group, rather than looking at each server individually. ({N(t)} falls in region II.2.ii.b, N_·g(t) in region I.2.ii.)
J_g = {J_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the job type that is the kth departure from group g. J_kg ∈ {1, …, j*} ∀k, g (region I.2.ii).
M_g = {M_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the server that processed the kth departure from group g. For every g, M_kg ∈ {1, …, s_g*} (region III.2.i.b).
D_g = {D_kg : 1 ≤ k ≤ N_·g(t)}, 1 ≤ g ≤ g*, is the time of the kth departure from server group g. The event {D_kg ≤ t} is equivalent to the event {N_·g(t) ≥ k}. D_0g = 0 for all g (region III.2.i.b).
Q(t) = {Q_rg(t) : 1 ≤ r ≤ r*, 1 ≤ g ≤ g*}, 0 ≤ t ≤ t*, defines the number of jobs at each route step r waiting in queue for each server group g at time t (region I.2.ii).
The problem is to generate the departure process {(N(t), J_g, M_g, D_g) : 0 ≤ t ≤ t*, 1 ≤ g ≤ g*} from the server groups, using only the steady-state observed stochastic process {Q(t) : 0 ≤ t ≤ t*}. That is, to use information from region I to find information in regions II.2.ii and III.2.ii.
B.2 SOLUTION CHARACTERISTICS AND ERROR MEASURES
Ideally, a solution to this problem will generate the same departure process {(N(t), J_g, M_g, D_g) : 0 ≤ t ≤ t*, 1 ≤ g ≤ g*} as a model that was maintaining all job information. Whitt (1984) discusses overtaking in queues, and proposes a comparison of
the distributions of the possible job orderings, based on the perturbations of the original
job ordering. This approach may be of value in this context.
If it is not possible to generate the same departure process, we would like to
minimize the disruption caused to the rest of the system by sending out the “wrong” job
type. That is, if sending out job type j would cause less disorder in the remaining system
than sending out job type j’, we would prefer to send out j. Possible objective functions
to minimize the disruption are to
Minimize P{wrong type}
or to
Maximize P{distance from correct type < ε}.
The first objective function is similar to analyzing queueing models. The second
objective function requires a measure of “distance” from the “correct type.”
One way to measure the distance from the correct type or the disruption caused in
the remaining system is looking at the deviations in queue sizes at all the server groups.
There are (at least) two possible ways of measuring these deviations. Let
Q^tot(t) = ∑_{g=1}^{g*} ∑_{r=1}^{r*} Q_rg(t)
be the total number of jobs waiting in queue in the system at time t. Then the first error measure measures the squared error of the total jobs waiting in system until time t*:
Error_1 = (1/t*) ∫_0^{t*} (Q^tot_complete(t) − Q^tot_approx(t))² dt   (11)
The second error measure differs slightly:
Error_2 = (1/t*) ∫_0^{t*} ∑_{g=1}^{g*} ∑_{r=1}^{r*} (Q_rg^complete(t) − Q_rg^approx(t))² dt   (12)
The following example illustrates the difference. Let r* = g* = 2.
Table 12. Example values to illustrate differences in error measures
r, g    Q_rg^complete    Q_rg^approx    (Q_rg^complete − Q_rg^approx)²
1, 1          3                5                  4
2, 1          1                2                  1
1, 2          2                5                  9
2, 2          4                5                  1
sum          10               17
Error_1 = (10 − 17)² = 49.
Error_2 = 4 + 1 + 9 + 1 = 15.
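The arithmetic in the example can be checked with a short Python sketch; since the queue sizes are held fixed, the time integrals in (11) and (12) reduce to the pointwise squared differences:

```python
# Worked check of the two error measures from Table 12 (r* = g* = 2).
# Keys are (route step r, server group g); values are queue sizes.
complete = {(1, 1): 3, (2, 1): 1, (1, 2): 2, (2, 2): 4}
approx   = {(1, 1): 5, (2, 1): 2, (1, 2): 5, (2, 2): 5}

# Error1: square of the difference in *total* queue sizes.
error1 = (sum(complete.values()) - sum(approx.values())) ** 2

# Error2: sum of squared differences at each individual (r, g) queue.
error2 = sum((complete[k] - approx[k]) ** 2 for k in complete)

print(error1, error2)  # → 49 15
```

Note that Error1 can exceed Error2 because the per-queue deviations here all have the same sign, so they accumulate in the total; deviations of opposite sign would instead cancel in Error1 while still contributing to Error2.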
The choice of error measure depends on the actual problem, and on whether the difference in total queue size (Error_1) or the relative differences between individual queues (Error_2) is more important. A mixture of the two calculates the differences in the total queue sizes at each server group. Define the number of jobs waiting at server group g as
Q_·g(t) = ∑_{r=1}^{r*} Q_rg(t).
Then
Error_3 = (1/t*) ∫_0^{t*} ∑_{g=1}^{g*} (Q_·g^complete(t) − Q_·g^approx(t))² dt   (13)
An unresolved question is whether the error measure should be a random variable or a
deterministic quantity. If it is a random variable, it may be possible to establish
dominance of one error measure over the others.
B.3 SOLUTION APPROACH: DISCRETIZED-TIME QUEUES
Rather than storing information from region III (job waiting times), we propose discretizing the time jobs spend in queue by adding information to region I: integer counts of the different types of jobs that have been waiting for specified discrete lengths of time. This allows the approximation of job orderings in queue.
We may be able to make use of ideas from the discrete-time conversion literature. See, for example, Fox and Glynn (1990). The idea there is to condition a continuous-time
Markov Chain on the states visited (the embedded chain), and to use the resulting
conversion to estimate quantities such as costs for the continuous case.
In a conventional resource-driven approach (using information from region I
only), the queue for a given server (group) is an array of integers, see Figure 25.
Figure 25. Sample queue for 5 parts that visit the same server 3 times
The associated discretized queue is a matrix where the rows represent the number of jobs
waiting for the discrete time intervals. This is illustrated in Figure 26 for an interval of 5
time units.
Figure 26. Sample discretized queue for the queue given in Figure 25
Using the information in the queue matrix, we can select jobs based on approximate
FCFS (or LCFS) disciplines. Similarly, we can estimate the lengths of time individual
jobs have been waiting in queue.
Parts that have been waiting 0-n minutes at time t will have waited 0+∆t to n+∆t
minutes at time t+∆t. Some mechanism must be found to correctly update the matrix. A
possible implementation is to shift the entries in the matrix “down” a row every n time
units. This is inefficient because it updates the matrix more often than is required. It
should be sufficient to update the matrix when jobs enter the queue or start service
because the queue is unaffected by other system events.
This is an approximation: a job that arrived at time n − ε will be reclassified as having waited more than n time units only ε time units after its arrival.
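The lazy-update mechanism described above can be sketched as follows. This is an illustrative sketch, not the dissertation's implementation: the class and method names are invented, and the per-job-type dimension of the matrix is omitted, keeping a single column of counts per waiting interval.

```python
from math import floor

class DiscretizedQueue:
    """Counts of waiting jobs grouped by discretized waiting time.
    rows[0] holds jobs that have waited less than `width` time units,
    rows[1] the next interval, and the last row everything longer."""

    def __init__(self, width, n_rows=3):
        self.width = width          # interval length (e.g. 5 time units)
        self.rows = [0] * n_rows    # integer counts per waiting interval
        self.last_shift = 0.0       # simulation time of the last row shift

    def _advance(self, now):
        # Shift counts "down" one row per elapsed interval. Done lazily,
        # only when the queue is touched (job arrival or service start),
        # since no other system event affects the queue.
        shifts = floor((now - self.last_shift) / self.width)
        for _ in range(min(shifts, len(self.rows))):
            self.rows[-1] += self.rows[-2]      # last row absorbs overflow
            for i in range(len(self.rows) - 2, 0, -1):
                self.rows[i] = self.rows[i - 1]
            self.rows[0] = 0
        if shifts:
            self.last_shift += shifts * self.width

    def enqueue(self, now):
        self._advance(now)
        self.rows[0] += 1           # new arrival has waited < width units

    def start_service_fcfs(self, now):
        # Approximate FCFS: serve from the oldest non-empty interval.
        self._advance(now)
        for i in range(len(self.rows) - 1, -1, -1):
            if self.rows[i]:
                self.rows[i] -= 1
                return i            # interval index the served job was in
        return None                 # queue is empty
```

For example, with `width = 5`, a job enqueued at time 0 and another at time 7 land in different rows, and `start_service_fcfs` picks from the older row first. Jobs within a row are indistinguishable, which is exactly the approximation to the job ordering discussed in the text.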
This approach allows us to approximate the necessary information from regions
III.2.i and III.2.ii. That is, we can approximate system behavior and get estimates of
waiting times. As the time interval length decreases, the memory requirements will
increase. In some cases, it may be more memory-efficient to use the information from
region III instead.
Appendix C. DEDICATION CONSTRAINTS
C.1 EXAMPLES OF SYSTEMS WITH DEDICATION
C.1.1 WAFER PRODUCTION
An example of a deterministic, re-entrant, and state-dependent queueing system is a fab-level model in
semiconductor manufacturing. There are many thousands of wafers that pass through the
system. An example of state-dependency is that routing to QA (quality assurance) may
be increased if a machine group is producing substandard wafers. One of the machine
groups that wafers visit repeatedly is the stepper tool group that motivated the research in
Section 4.4. On return visits to the stepper group, wafers are sent to the same subset of
stepper tools that processed them before.
We observe $\{Q_{\cdot g}(t)\}$, the number of wafers waiting at each resource group $g$ for any of its $s_g^*$ servers. For the stepper tools (and other machines that may be subject to dedication constraints), we additionally can observe $\{Q_{ig}(t)\}$, the number of wafers that are on a return visit. Tools with dedication constraints may also have positive values of $\{Q_{\cdot g}(t)\}$ if there are unassigned wafers waiting for their first visit to the tool.
We know $\{S_{ig}(t)\}$, the state of every machine in the factory. Finally, $\{D_{ig}(t)\}$ is known, but is of interest only for the tools with dedication. A resource-driven model of the fab cannot observe $\{Q_{ig}(t)\}$ because of its lack of knowledge of $\{J\}$.
C.1.2 HEALTH CARE

Health Maintenance Organizations (HMOs) can be viewed as very large, stochastic,
state-dependent, re-entrant (with dedication) systems. The following description is based
on the author’s experience with health care plans, for example (Aetna 2001;
HealthNet 2002).
To receive care and get referrals or prescriptions, patients are required to visit
their primary care physicians (PCPs), their “dedicated resources.” Even if other doctors
in the same office are available while the PCP is not, the patient must visit the PCP. The
system is stochastic as patients may see several doctors for the same condition, but the
order in which they visit the doctors need not be fixed. It is state-dependent because the
routing may depend on the availability of physicians other than the PCP. (If the preferred
cardiologist is not available, patients may be referred to a different one.) In emergencies,
it is even possible to be seen by a doctor other than the PCP.
In this example, we observe $\{Q_{\cdot g}(t)\}$, the number of patients waiting for a group of doctors, any of whom would be able to help the patient. For example, it likely does not matter which dentist in an office of four fills a small cavity. In most cases, the observed $\{Q_{\cdot g}(t)\}$ will be greater than $\{Q_{ig}(t)\}$. We can observe $\{S_{ig}(t)\}$ and $\{D_{ig}(t)\}$, although the meaning of the latter is not as significant here as it is in other examples. It represents the number of patients that have been assigned to a doctor. We assume that patients will continue to seek treatment from their PCPs indefinitely unless the patients switch doctors or pass away. (In reality, $\{D_{ig}(t)\}$ is used to determine whether doctors will accept new
patients. It is an interesting problem to try to do an analogous assignment of jobs to
machines in production systems like the semiconductor fab discussed above.)
C.1.3 WEB SERVERS

Another stochastic, re-entrant, state-dependent system is found in web browsing. Users browse the Internet by sending HTTP requests to web servers, which respond by sending back the web pages and other content. This is a very large stochastic queueing network: users may visit different websites or pages on a site as they desire. Here $g^*$, $s_g^*$, and $r^*$ are extremely large.
State dependency occurs on both sides of the information exchange: The user may
choose to visit a different website (cnn.com instead of msnbc.com, for example) if the
site (s)he is trying to access is not responding quickly enough. On the server side,
requests are routed to and processed by different server banks, depending on the volume
of requests. In extreme cases, additional servers may be added temporarily.
In web server administration, there is the notion of “sticky web pages”
(Microsoft 2004). This states that all HTTP requests from a client machine during a
particular browsing session must be processed by the same server, otherwise the session
information is lost. That is, the client is dedicated to this server for the duration of the browsing session. For this system, we can observe $\{Q_{\cdot g}(t)\}$, $\{Q_{ig}(t)\}$, and $\{S_{ig}(t)\}$, though their states change extremely rapidly. $\{D_{ig}(t)\}$ is less easily observable. We may know how many users a particular server has served, but it is not always clear whether the users are still perusing the site.
C.1.4 OTHER

There are cases where we may not have observations for all four of the stochastic processes mentioned above. For example, we may know when an ATM is in use, but not know how many customers are waiting in line for it. In this case, we have a very simple G/G/1 queue, but cannot observe $\{Q_{\cdot g}(t)\}$; $\{S_{ig}(t)\}$ is still observable.
C.2 ENHANCED APPROXIMATION

Approximate $P\{i \text{ can begin processing} \mid i \text{ now available},\; Q_g^{tot}(t) > 0\}$ by

$$
1 - \left( \frac{s_g^* - S_{\cdot g}(t)}{s_g^* - S_{\cdot g}(t) + 1} \right)^{Q_g^{tot}(t)}
\qquad (14)
$$
This probability is independent of the total number of servers in $g$, $s_g^*$, and uses only the number of servers that are currently busy; it takes into account the fact that waiting jobs are not waiting for currently idle servers. Define $0^0 = 1$:
• If there are no jobs waiting in queue, $Q_g^{tot}(t) = 0$: $1 - \left( \frac{s_g^* - S_{\cdot g}(t)}{s_g^* - S_{\cdot g}(t) + 1} \right)^0 = 1 - 1 = 0$; a server will not begin processing a job if there are no jobs waiting.
• If $Q_g^{tot}(t) > 0$ and all servers (but $i$) had been idle (and since $i$ has just completed, $S_{\cdot g}(t) = s_g^*$): $1 - \left( \frac{0}{1} \right)^{Q_g^{tot}(t)} = 1 - 0 = 1$; one server will have to begin serving jobs if all servers are currently idle.
• If $s_g^* = 10$, $Q_g^{tot}(t) = 1$, and $S_{\cdot g}(t) = 8$: $1 - \left( \frac{2}{3} \right)^1 = \frac{1}{3}$, which is greater than the probability of 0.1 we obtain from our current approximation because it assumes the jobs in queue are waiting for one of the busy machines.
• If $s_g^* = 10$, $Q_g^{tot}(t) = 5$, and $S_{\cdot g}(t) = 8$: $1 - \left( \frac{2}{3} \right)^5 \approx 0.8683$, which is greater than our probability of 0.4095 because it assumes jobs are not waiting for the 7 servers that were previously idle.
This refinement was formalized by David Bowen (Bowen 2003).
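Equation (14) and the worked cases above can be verified numerically; a minimal sketch (ours, not the dissertation's code), with `s_star` standing for $s_g^*$, `s` for $S_{\cdot g}(t)$, and `q` for $Q_g^{tot}(t)$:

```python
import math

def p_begin_refined(s_star, s, q):
    """Refined approximation: 1 - ((s* - S)/(s* - S + 1)) ** Q.
    Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1."""
    ratio = (s_star - s) / (s_star - s + 1)
    return 1.0 - ratio ** q

def p_begin_current(s_star, q):
    """The earlier approximation, 1 - ((s* - 1)/s*) ** Q, for comparison."""
    return 1.0 - ((s_star - 1) / s_star) ** q

# The four bullet cases above:
assert p_begin_refined(10, 8, 0) == 0.0                   # no jobs waiting
assert p_begin_refined(10, 10, 3) == 1.0                  # all servers idle
assert math.isclose(p_begin_refined(10, 8, 1), 1 / 3)
assert math.isclose(p_begin_refined(10, 8, 5), 0.8683, abs_tol=1e-4)
assert math.isclose(p_begin_current(10, 5), 0.40951)      # the 0.4095 figure
```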
Appendix D. APPROXIMATING WAITING TIME DISTRIBUTIONS

D.1 POSSIBLE TRANSITIONS BETWEEN CURVE ORDERINGS

Figure 27 shows the possible transitions between orderings for jobs whose DELAYs fall
in the same time slice. Each transition corresponds to the change in ordering from job k
to job k +1. It is possible to start in any of the states, and to transition from a state to
itself. It is also possible to move from the left to the right in Figure 27, but not from right
to left. The time values for L (Lower Time Slice Counter Method counter) and U (Upper
Time Slice Counter Method counter) are fixed for a given time slice. D (DELAY time) is
always between L and U, but it is possible for the relative positions of S (START time)
and D to change. Once S has moved to the right of L or U, it cannot move to their left
(for a given time slice).
Transitions between time slices can go from any ordering to any other ordering,
since the locations of L and U change for the new time slice. Possible transitions within
and between time slices are illustrated in Figure 17. The time slice between times 2ξ and
3ξ shows the transitions from SLDU to LSDU to LDSU to LDUS. The transition from
time slice 1 (indexing begins at 0) to time slice 2 gives a transition from LSDU to SLDU,
and the transition from time slice 2 to time slice 3 is LDUS to LSDU. Neither transition
is possible within a time slice.
The possible orderings and their transitions form a semi-Markov process. In Appendix D.2.7, we discuss the Markov property of this system. The times between transitions are random quantities that follow distributions $G_i$. It is likely that we cannot determine the $G_i$ in non-trivial cases. For the time being, we focus our discussion on the embedded Markov chain.
Figure 27. Possible transitions between orderings within a time slice

D.2 CLASSIFICATION OF UNCERTAIN JOBS

In this section, we discuss various ways of handling jobs whose DELAY and START events fall in the same time slice. We include simple probabilistic arguments why some methods perform more accurately than others. Jobs that find an idle server are not included in this analysis, as their delay is zero. That is, all probabilities given below are conditioned on $A_k \ne S_k$; we omit this notation.
There are two fundamental assumptions underlying the calculations for the
probabilities of correct classification. The first is that DELAY and START events are
equally likely to occur anywhere in the time slice (f(s) = f(d) = 1/ξ). This is true for
Exponential arrivals (DELAYS) as, conditioned on the number of events in a time period,
the occurrence times of these events have the same distribution as the order statistics of
the Uniform distribution. For other distributions, this is not true, though empirical
analysis has shown that the assumption has some support if we include no other
information (e.g., the number of DELAY events in the time slice). There was also little
experimental evidence to contradict f(s) = 1/ξ. For the time being, we will use this
assumption.
The second assumption is that the locations of D and S are independent of one
another without any other information (e.g., queue size, last start service time).
D.2.1 IGNORE UNCERTAIN JOBS

A simple approach is to ignore the jobs we cannot classify with certainty. This approach
gives good results if the distributions of the jobs that can and cannot be classified are the
same. If they differ, we are estimating the wrong distribution.
Experimentally, this approach did not work well. It performed consistently worse
than other methods described below. In the future, we would like to prove that the
distributions (of all jobs and those with known classifications) are different.
D.2.2 ALWAYS CLASSIFY AS (NOT) DELAYED

This approach biases the estimate of $F_W(\gamma)$ because the classification of all jobs whose DELAY and START events happen in the same time slice is not the same except in very special cases. For example, in a D/G/2 system with arrivals every time unit and a service time Uniformly distributed between 0.1 and 0.2 time units, all jobs are not delayed for any $\gamma \in [0.2, \xi)$.
Let $\xi$ be the time slice size, and let $i$ be the index of the time slice this D falls in, $i = \lfloor D/\xi \rfloor$. Figure 28 illustrates the relative positions of these quantities and labels the different time slice regions. The region between the beginning of the time slice and the START time (the current time) is A, and represents $\frac{S - i\xi}{\xi}$ of the total time slice. The line segment between the current time and the end of the time slice is B and is $\frac{(i+1)\xi - S}{\xi}$ of the time slice. We do not know the location of D, other than that $i\xi \le D < (i+1)\xi$.
Figure 28. Location of current time in time slice
If we choose to classify all jobs as delayed, the probability of a correct classification is:
$$
P\{\text{correct}\} = P\{\text{job delayed}\} = P\{D \le S\}
= \int_{i\xi}^{(i+1)\xi} P\{D \le s \mid S = s\}\, f(s)\, ds
= \int_{i\xi}^{(i+1)\xi} \frac{s - i\xi}{\xi} \cdot \frac{1}{\xi}\, ds
= \left[ \frac{(s - i\xi)^2}{2\xi^2} \right]_{i\xi}^{(i+1)\xi} = 0.5
\qquad (15)
$$
This is also the probability of a correct classification if all jobs are classified as “not
delayed.”
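The 0.5 figure in Equation (15) can be reproduced by direct simulation under the two assumptions above (D and S independent and Uniform on the time slice); a Monte Carlo sketch (ours, not the dissertation's code; all names are illustrative):

```python
import random

# D and S are drawn independently and Uniformly on slice [i*xi, (i+1)*xi);
# every job is classified "delayed", which is correct iff D <= S.
random.seed(1)
xi, i, n = 2.0, 3, 200_000
correct = 0
for _ in range(n):
    d = random.uniform(i * xi, (i + 1) * xi)
    s = random.uniform(i * xi, (i + 1) * xi)
    correct += d <= s
frac = correct / n  # close to 0.5, regardless of xi and i
```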
D.2.3 RANDOMLY CLASSIFY JOBS USING A FIXED PROBABILITY

We can classify uncertain jobs as delayed with a fixed probability p by flipping a coin with probability p for each job and basing our classification on the result of the coin flip. The probability of a correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\{D \le S,\; RND \le p\} + P\{D > S,\; RND > p\} \\
&= \int_{i\xi}^{(i+1)\xi} \left[ p \cdot \frac{s - i\xi}{\xi} + (1 - p) \cdot \frac{(i+1)\xi - s}{\xi} \right] \frac{1}{\xi}\, ds \\
&= \frac{p}{2} + \frac{1 - p}{2} = 0.5
\end{aligned}
\qquad (16)
$$
This is the same as the probability of always classifying a job as (not) delayed.
D.2.4 RANDOMLY CLASSIFY JOBS BASED ON THE CURRENT TIME SLICE LOCATION

A job is classified as delayed if D ≤ S. If we assume that D is equally likely to have occurred anywhere in the time slice, then the probability a job is delayed depends on S's relative position in the time slice. Specifically, $P\{\text{delayed}\} = P\{S > D\} = \frac{S - i\xi}{\xi}$. If a Uniform(0,1) random number is less than this probability, we classify the job as delayed. The probability of correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\left\{D \le S,\; RND \le \frac{S - i\xi}{\xi}\right\} + P\left\{D > S,\; RND > \frac{S - i\xi}{\xi}\right\} \\
&= \int_{i\xi}^{(i+1)\xi} \left[ \left( \frac{s - i\xi}{\xi} \right)^2 + \left( \frac{(i+1)\xi - s}{\xi} \right)^2 \right] \frac{1}{\xi}\, ds \\
&= \frac{1}{3} + \frac{1}{3} = \frac{2}{3}
\end{aligned}
\qquad (17)
$$
Allowing the probability of classification to depend on the current location in the time
slice increases our overall probability of correct classification from 0.5 (for a fixed
probability) to 0.67.
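The 2/3 figure in Equation (17) can likewise be checked by simulating a coin whose bias is the START event's relative position in the slice; a Monte Carlo sketch under the same assumptions (ours, not the dissertation's code):

```python
import random

# Classify "delayed" with probability (S - i*xi)/xi, i.e., S's relative
# position in the slice; the classification is correct iff it matches D <= S.
random.seed(2)
xi, i, n = 2.0, 3, 200_000
correct = 0
for _ in range(n):
    d = random.uniform(i * xi, (i + 1) * xi)
    s = random.uniform(i * xi, (i + 1) * xi)
    say_delayed = random.random() < (s - i * xi) / xi
    correct += say_delayed == (d <= s)
frac = correct / n  # close to 2/3
```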
D.2.5 USING EXPECTATIONS

This is similar in spirit to the idea proposed in Appendix D.2.4. Rather than estimating $P\{\text{job delayed}\}$ by Equation (9), we estimate it by
$$
P\{\text{delay} \le \gamma\}
= \frac{\sum_{\text{jobs } k} \text{expected relative bin location of job } k}{\text{total number of jobs}}
= \frac{1}{N_D(t^*)} \sum_{k=1}^{N_D(t^*)} \frac{S_k - \lfloor S_k/\xi \rfloor\, \xi}{\xi}
\qquad (18)
$$
This yields the same results as the method from Appendix D.2.4.
This approach can be refined to use the expected order statistics rather than the
current location in the time slice. This uses additional information (the counter values) in
making the estimates, and will be considered as an extension in Appendix D.5.4.
D.2.6 DETERMINISTICALLY CLASSIFY JOBS DEPENDING ON CURRENT LOCATION

This approach classifies jobs as “delayed” if their START occurs near the end of the time slice, specifically if it is in the second half of the time slice: $S > m$, $m = i\xi + \frac{\xi}{2}$. This is similar to the approach from Appendix D.2.3 with p = 0.5, but eliminates the additional randomness introduced by the coin flip. The probability of a correct classification is:
$$
\begin{aligned}
P\{\text{correct}\} &= P\{\text{delayed, classified delayed}\} + P\{\text{not delayed, classified not delayed}\} \\
&= P\{D \le S,\; S > m\} + P\{D > S,\; S \le m\} \\
&= \int_{m}^{(i+1)\xi} \frac{s - i\xi}{\xi} \cdot \frac{1}{\xi}\, ds + \int_{i\xi}^{m} \frac{(i+1)\xi - s}{\xi} \cdot \frac{1}{\xi}\, ds \\
&= \frac{3}{8} + \frac{3}{8} = \frac{3}{4}
\end{aligned}
\qquad (19)
$$
Removing the randomness has increased the probability from 0.67 to 0.75. Figure 29 shows the functions over which we integrate to get the probabilities. In the legend, $b = \xi$ and $A = \frac{s - i\xi}{\xi}$. “Deterministic” refers to the integrand from Section D.2.6, “random” to the integrand in Section D.2.4.
Figure 29. Probabilities integrated over
The unbolded lines are the individual functions and the bold lines are the sums of the two
functions in the deterministic and random cases. Two things are illustrated in this graph.
The first is that randomness causes the parabolic shape in the integrand because we square the terms we wish to integrate over. In other words, the random case integrates over $\left( \frac{s - i\xi}{\xi} \right)^2 f(s)$ rather than just $\frac{s - i\xi}{\xi} f(s)$. The deterministic case has a factor of $1\{s > i\xi + \frac{\xi}{2}\}$, which acts as an indicator function and is either 0 or 1. Therefore, we are actually integrating over only one term in either half of the time slice for the
deterministic case. The random case has contributions from both terms throughout the
time slice.
The second thing we observe is that $m = i\xi + \frac{\xi}{2}$ is the point that maximizes the value of our integral in the deterministic case. Any point $m' \ne m$ would reduce the area under the function. This is illustrated in Figure 30. Choosing $m'$ rather than $m$ requires us to integrate over the solid bold line. This reduces the value of the integral by the area of the shaded triangle; this holds for any $m' < m$ and, by symmetry, for any point $m' > m$.

Figure 30. Area lost by using a point other than $m = i\xi + \frac{\xi}{2}$ as a cutoff
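The geometric argument above can also be made algebraically: for an arbitrary cutoff at $i\xi + a\xi$ with $a \in [0,1]$, carrying out the two integrals of Equation (19) gives $P\{\text{correct}\} = \frac{1}{2} + a - a^2$, a parabola maximized at $a = \frac{1}{2}$. A small sketch (ours, not the dissertation's):

```python
def p_correct(a):
    """P{correct} for the deterministic rule with cutoff at i*xi + a*xi:
    the integral of u over [a, 1] plus the integral of (1 - u) over [0, a],
    which simplifies to 1/2 + a - a**2."""
    return 0.5 + a - a * a

assert p_correct(0.5) == 0.75                     # the mid-slice cutoff
assert p_correct(0.0) == p_correct(1.0) == 0.5    # degenerate cutoffs
# Any other cutoff loses probability, as Figure 30 illustrates:
assert all(p_correct(0.5) >= p_correct(k / 100) for k in range(101))
```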
D.2.7 HIDDEN MARKOV MODELS

Hidden Markov Models (HMMs) are a type of model usually studied in the machine learning/artificial intelligence communities (Jordan et al. 1999; Jordan 2003). In HMMs, the system finds itself in multinomial state $X_i$ after the $i$th transition. In our case, $X_i$ is a 4-dimensional vector with each element representing one of the four states in Table 5.
The assumption is that, given $X_{i-1}$, the probability of $X_i$ is independent of the preceding states.
The models are “hidden” because we cannot observe $X_i$, i.e., $N_{LD}(t)$ in this example. Instead, we have observations $Y_i$ with values in the set $\{LSU, LUS, SLU\}$. Figure 31 shows the possible mappings of $X_i$'s to $Y_i$'s.
Figure 31. Mapping of hidden states to observable outputs of the states
It may be necessary to augment the state by including a Boolean indicating whether the
current observation is in the same time slice as the previous one; the transition
probabilities change depending on whether we are transitioning within or between time
slices. The value of the Boolean is observable using general information from region I.
The additional time slice information would make estimating transition probabilities
easier because we know that certain probabilities are zero.
The advantage of this HMM is that the probability of observing $Y_i$ given $X_i$ is 0 or 1. That means that only the transition probabilities between the $X_i$, $P\{X_{i+1} \mid X_i\}$, not the $P\{Y_i \mid X_i\}$, need be estimated.
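Under one natural reading of Figure 31 — the observation is the hidden ordering with D removed — the deterministic emission structure can be written down directly. A sketch (ours; the mapping is our reconstruction of the figure):

```python
# Each hidden ordering maps to the observable ordering with D dropped,
# so P{Y_i | X_i} is 0 or 1 and never needs to be estimated.
EMISSION = {"SLDU": "SLU", "LSDU": "LSU", "LDSU": "LSU", "LDUS": "LUS"}

def emission_prob(y, x):
    return 1.0 if EMISSION[x] == y else 0.0

assert emission_prob("LSU", "LSDU") == 1.0
assert emission_prob("LSU", "LDSU") == 1.0   # two hidden states map to LSU
assert emission_prob("SLU", "LDUS") == 0.0
assert set(EMISSION.values()) == {"SLU", "LSU", "LUS"}
```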
The HMM is a way of approximating delay probabilities in the cases where we do
not know whether a job was delayed. How to appropriately train the model without
generating many runs with a full job trace is an open question. Generally, machine
learning problems have large amounts of training data to allow the model to refine its
estimates. We are trying to avoid gathering large amounts of data, although it may
become necessary to do so if we wish to experiment with HMMs.
D.2.8 COMPARISON OF UNCERTAIN JOB CLASSIFICATION

The main competitor for the deterministic classification described in D.2.6 is the random classification described in D.2.4. In this section, we compare the approximation errors
for the two approaches. The deterministic approach reduces the estimation error
noticeably and consistently. The experiments run are those listed in Table 9, along with
the systems with doubled and halved rates.
Only in very few cases was there no discernible difference between the two
classification methods. These are the cases where the error was already extremely
insignificant. Figure 32 shows the difference in estimation errors for the M/U/1 system
with an average interarrival time of 0.55 and an average service time of 0.5. This system
is the one (of the 54 run, see Table 9) where the random classification has the best performance compared to the deterministic classification. The difference in error is $error_{random} - error_{deterministic}$.
The lines with the triangles correspond to the runs with a time slice size of 1 (twice the average service time), the lines with the squares to time slice size 2, and the lines with the diamonds to time slice size 3.
Figure 32. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/U/1 implementations
Figure 33 shows the errors for the system where the improvement is the largest, the
U/U/1 system with average interarrival 0.55 and average service time 0.5.
As the average interarrival and service times increase, the improvement becomes
less pronounced. The average errors for each of the 54 runs were always positive (if
extremely close to zero). Figure 34 shows the differences for an M/B/1 system with an
average interarrival time of 3 and an average service time of 2.
Figure 33. Difference in estimation errors for the random and deterministic Time Slice Counter Method U/U/1 implementations
Figure 34. Difference in estimation errors for the random and deterministic Time Slice Counter Method M/B/1 implementations
D.3 IMPLEMENTATION DETAILS

D.3.1 CLASSIFYING JOBS BASED ON COUNTERS

Unlike the discussions in Sections 4.5.3 and 4.5.4.1, the implementation uses vertical time slices, not horizontal job characteristics. In all cases, the count comparisons are done after $N_S(t)$ has been incremented.
• $N_S(t) \le N_U(t)$: The START event occurs in a later time slice than the actual DELAY event; the job is delayed.
• $N_S(t) > N_L(t)$: The START event occurs in an earlier time slice than the actual DELAY event; the job is not delayed.
• $N_U(t) < N_S(t) \le N_L(t)$: The START event occurs in the same time slice as the actual DELAY event; we cannot tell whether the job is delayed. Based on experimentation and the analyses in Appendix D.2, we classify jobs deterministically based on the relative position of the START event in the time slice (Appendix D.2.6).
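The three rules above can be collected into a single classification routine; a sketch (ours, not the dissertation's implementation; the names are illustrative):

```python
# `n_s`, `n_l`, `n_u` are N_S(t), N_L(t), N_U(t), with N_S(t) already
# incremented; `rel_pos` in [0, 1) is the START event's relative position
# in its time slice, for the deterministic rule of Appendix D.2.6.

def classify_delayed(n_s, n_l, n_u, rel_pos):
    if n_s <= n_u:
        return True        # START in a later slice than DELAY: delayed
    if n_s > n_l:
        return False       # START in an earlier slice: not delayed
    # N_U(t) < N_S(t) <= N_L(t): same slice; cut at mid-slice
    return rel_pos > 0.5

assert classify_delayed(3, 5, 4, 0.1) is True
assert classify_delayed(6, 5, 4, 0.9) is False
assert classify_delayed(5, 5, 4, 0.6) is True    # uncertain, late in slice
assert classify_delayed(5, 5, 4, 0.4) is False   # uncertain, early in slice
```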
D.3.2 JOBS FINDING AN IDLE SERVER

The addition of a Boolean variable in region I.2.i to indicate whether a job can begin service immediately upon arrival increases the accuracy of the Time Slice Counter Method. If there is an idle server available for the job, the job's delay is zero and its wait is less than γ for any γ > 0.
D.3.3 UPDATING COUNTERS

The way in which the time slice counters are updated has a large impact on the computational requirements of the Time Slice Counter Method. Incrementing the counters as described in Section 4.5.2.1 has complexity O(NΞ) for each γ; each of the N jobs causes a traversal of the array of Ξ counters. For long and/or congested runs, the simulation will slow dramatically. We show we can update counters in O(Ξ) time per γ.
Claim 2 We can update and maintain correct time slice counts for one delay value γ in
O(Ξ) time.
Proof: Our proof is illustrated using Figure 35, which shows the relationships
between times for updating. We can update the array by performing one addition
for each job, which occurs in constant time. The speed of the algorithm is
therefore independent of the number of jobs simulated. We show we pass through
the array of time slice counters exactly once in our updating steps, so the
complexity is linear in the size of the array.
1. At time t, DELAY counters for times before t + γ are no longer incremented. Counters after t + γ need not yet be incremented because $N_S(t)$ is compared to the counter for time t. For example, in Figure 35, if the current time is $t_1$, we increment the counter for the time slice for time $t_1 + γ$, time slice i. We will no longer be incrementing time slice counters for j < i. Furthermore, we need not increment later counters (k > i), since we have not yet reached them. $N_S(t_1)$ is compared to $\Delta_{\gamma, i-4}$.
2. If time progresses to t', counters between t + γ and t' + γ can be set to the value of the counter at t + γ by the argument in 1. This is true whether the event at t' is an arrival or a start service. For example, if time in Figure 35 advances to an arrival at $t_2$, the DELAY occurs in time slice i+3, and there will be no more changes to the counts for time slices i, i+1, and i+2. The values for these time slices can be set to the current value of time slice i, $\Delta_{\gamma, i}$. If, instead, time has advanced to a start service event at time $t_3$, we perform the update step setting the values of time slices i+1 and i+2 to the current value of time slice i. In fact, time slices i through i+6 (corresponding to $t_3 + γ$) can be updated. ■
Figure 35. Illustration of relationships of times for updating time slices
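The lazy copy-forward scheme in the proof of Claim 2 can be sketched as follows (our illustration, not the dissertation's code; the class and variable names are ours). The `front` index only moves forward, so the array of Ξ counters is traversed exactly once over the whole run, as claimed:

```python
class LazyDelayCounters:
    """`counts[j]` plays the role of the DELAY counter for time slice j;
    `front` is the last slice brought up to date."""

    def __init__(self, num_slices, xi, gamma):
        self.counts = [0] * num_slices
        self.xi, self.gamma = xi, gamma
        self.front = 0

    def _advance(self, t):
        # Copy the finalized value forward to the slice containing t + gamma;
        # slices beyond it are left untouched until time reaches them.
        target = int((t + self.gamma) / self.xi)
        while self.front < target:
            self.front += 1
            self.counts[self.front] = self.counts[self.front - 1]

    def delay_event(self, t):
        self._advance(t)
        self.counts[self.front] += 1

c = LazyDelayCounters(10, 1.0, 2.0)
c.delay_event(0.5)   # t + gamma = 2.5 falls in slice 2
c.delay_event(1.5)   # t + gamma = 3.5 falls in slice 3
assert c.counts[:5] == [0, 0, 1, 2, 0]   # slice 4 not yet copied forward
```

Each event does a constant amount of work plus however far `front` moves, and `front` can move at most Ξ steps in total.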
D.3.4 INITIALIZATION BIAS

Currently, nothing has been done to deal with initialization bias. (Queues are initialized
to zero and servers as idle.) Typical approaches include initializing the queue to its
steady-state value, or truncating the output at a point when the user feels the system has
“warmed up.” The problem with the first is that we must know the steady-state queue
size a priori, and may have to do trial runs (themselves subject to initialization bias) to
determine it. For methods dealing with initialization bias, see (Welch 1981;
Schruben and Kulkarni 1982).
The problem with the second approach (truncation) is a question relevant to the Time Slice Counter Method itself: How does one “truncate” information from the Time Slice Counter Method? If we do not begin incrementing $W_\gamma(t)$ until some time t', the counters we use to classify jobs will have been affected by the jobs that arrived before t'. To obtain the correct probability estimate at the end of the simulation, we must divide the random variable $W_\gamma(t^*)$ by the random variable $N_S(t^*) - N_S(t')$. We must prove that this estimate would be accurate, and that it would overcome initialization bias.
Another possibility is to not increment any counters until time t'. This will introduce a different initialization bias, as we will begin incrementing $N_L(t)$ and $N_U(t)$, but not be able to increment $N_S(t)$ until we reach job k, where job k is the first job to have had $A_k \ge t'$.
How to deal with initialization bias is an open problem and is the subject of future
research.
D.3.5 SELECTING PARAMETER VALUES

Guidance is required on selecting the number and values of the $\gamma_i$, and on selecting a time slice size ξ.
The selection of the $\gamma_i$ is application-dependent. In some cases, special values
of γ are of interest. In cases where the Time Slice Counter Method is being used to
estimate the waiting time distribution, the modeler must use knowledge about the system
and perhaps intuition to select values. This knowledge may be acquired through trial
runs. Other possible approaches we will investigate in the future include doing
preliminary runs to estimate bounds on γ; bootstrapping; and performing transformations
on γ to create a bounded distribution. The latter is an attractive option if we are able to find an appropriate transformation (e.g., one that works for any $F_W(\cdot)$). For a method
using both bootstrapping and transformations to estimate a (bootstrap) confidence
interval on a “parameter of interest,” see (Tibshirani 1988). In Appendix D.5.2, we
discuss ideas for adding γ values during the run.
In Section 4.5.5, we show that a value of ξ ≈ average service time resulted in estimation errors of around 1 percentage point or less for single-stage queueing systems in most cases. Since the main cost of ξ is memory, smaller values can be chosen with little negative effect on speed. Although not implemented here, it is possible to choose different values of ξ for different $\gamma_i$. Estimates in the right tail of the distribution appear highly accurate, so larger values of ξ for these γ can reduce memory requirements. Other efficiencies can be gained using a circular array of size $O(\gamma/\xi)$.
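One way to realize the circular-array idea is sketched below (ours, not the dissertation's code; the exact size bookkeeping, $\lceil \gamma/\xi \rceil + 1$ entries, is our assumption). Only slices within γ of the current time can still change, so a small array indexed modulo its length can replace one entry per time slice:

```python
import math

class CircularSlots:
    def __init__(self, xi, gamma):
        # Assumed sizing: enough physical slots to cover [t, t + gamma].
        self.size = math.ceil(gamma / xi) + 1
        self.counts = [0] * self.size

    def slot(self, slice_index):
        # A slot is reused once its old slice can no longer change.
        return slice_index % self.size

ring = CircularSlots(xi=0.5, gamma=2.0)
assert ring.size == 5
assert ring.slot(7) == ring.slot(12) == 2   # slices 7 and 12 share a slot
```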
D.4 MULTIPLE-SERVER TANDEM QUEUEING SYSTEMS

D.4.1 EXPERIMENTATION

In this section, we show the results of our experiments for job cycle-time estimation in n-stage tandem queueing systems. We have results for n = 2 and n = 10, for both FCFS and last-come-first-served (LCFS) disciplines. To estimate cycle times, we compare DELAY to (final) FINISH events. Cycle times are information in region III.2.ii.b.
When there are multiple servers at one or more stages, it is possible for job k to leave a stage or the system before job j, even though job j arrived before k: $A_k > A_j$, $F_{k,i} \le F_{j,i}$ for i = 1, …, n. This phenomenon is known as “overtaking.”
The simplest case for multiple-server queues is having 2 servers at each stage,
which is what we do here; this situation also minimizes the opportunity for overtaking.
The service distributions and rates are the same at each of the n stages in our
experiments.
In the following graphs, we show only the results for time slice size 1. Larger
time slice sizes performed more poorly. Even small time slice sizes can result in errors if
overtaking is present. In a sense, these experiments test how big an impact the
overtaking has, i.e., how much the job ordering is shuffled during the jobs’ stay in the
system.
Up to four different scenarios were simulated for each of the following
arrival/service distribution combinations: M/M/⋅, M/U/⋅, U/U/⋅, M/B/⋅, and B/B/⋅.2
Table 13 lists the scenarios. The first two are the same as in the single-stage experiments.
Since we have two servers at each stage, these are very lightly-loaded systems. The third
and fourth systems are more congested. In the 2-stage runs, the γ are the same as in
Section 4.5.5. In some of the 10-stage runs, the values are multiplied by 5 to account for
the longer time in system. (The new γ are therefore 0.4, 0.8, 1.2, 1.6, 2.0 and so forth.)
2 The Beta variable ranges for Scenarios 3 and 4, respectively, are (0, 9) and (0, 10).
Table 13. Experiments for multiple-server tandem queueing systems

Scenario   arrival rate   service rate   rho
1          0.6667         1              0.3333
2          0.9091         1              0.4546
3          0.9091         0.5556         0.8181
4          0.9091         0.5            0.9091
Figure 36 shows the estimation errors (as functions of $F_W(\gamma)$) for a 2-stage M/M/2
system under all four scenarios for time slice size 1. The largest error for the estimates in
Figure 36 is around 7 percentage points. For scenarios 3 and 4, the γ did not cover the
complete distribution. For the lower tail of the distribution, our estimates are good. This
is positive, as the left tail was often prone to larger errors than the remaining distribution
for single-stage systems. The errors also become smaller as we approach the right end of
the distribution. We underestimate the variability for the first two scenarios.
In Section 4.5.5, we observed that more congested and variable systems had
smaller errors than the others. Figure 37 shows the errors for the 2-stage M/B/2 system
with time slice size 1. The errors are sizeable for all four scenarios, although they get
smaller as we reach the right end of the distribution. We believe the large errors are
because the variability is causing more overtaking to occur. We underestimate the
variability, especially in the first two scenarios.
Figure 38 shows the estimation errors for the 2-stage M/U/2 system with time
slice size 1. The estimation errors are relatively small compared to those for the M/B/2
system. They compare favorably with the errors shown in Figure 22. We believe the
Time Slice Counter Method performs better in the tandem queueing system because the
job’s stay in the system is longer and there are two stages, which increases the variability
a job is subjected to. Overtaking does not appear to be as big a problem as in the other
tandem queueing systems, perhaps because the service times are not too variable.
Figure 36. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and service times
The same four scenarios outlined in Table 13 were run for the 10-stage tandem queueing
network. We have the same service discipline and rates at each of the 10 stages, and 2
servers at each of the stages.
Figure 39 shows the estimation errors for the M/M/2 system. Scenarios 1 and 2
perform extremely poorly, while scenarios 3 and 4 do well. Their errors are smaller than
those in the 2-stage case.
Figure 40 shows the errors for the estimated cycle time distribution for the M/B/2
system. The errors for the more heavily-loaded systems (Scenarios 3 and 4) are smaller
than those for the lightly-loaded systems, and smaller than the errors for the 2-stage
tandem queue.
Figure 37. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Beta service times
Figure 38. Estimation errors for a 2-stage tandem queueing system with various Exponential interarrival and Uniform service times
Figure 39. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times
Figure 40. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Beta service times
Figure 41 shows error results for the first two scenarios in a 10-stage M/U/2 system. All
errors are less than 1 percentage point in absolute value. These errors are far smaller than
those for the 2-stage system. We summarize and discuss our conclusions from the
experiments in the next section.
Figure 42 shows the estimation errors for the systems in Figure 39, but with a
LCFS, not FCFS, service discipline at each stage. As the congestion in the system
increases (Scenarios 3 and 4), the estimation errors become huge (almost 50%). For the
more lightly-loaded systems, the errors are not significantly different from the FCFS
errors. These results are
expected, since a congested LCFS system experiences a great deal of job reordering.
Figure 41. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and Uniform service times
Figure 42. Estimation errors for a 10-stage tandem queueing system with various Exponential interarrival and service times, and a LCFS service protocol
D.4.2 DISCUSSION
Our conclusions on estimating cycle times are slightly different for the n-stage tandem
queue with multiple servers than in the single-stage queueing system. The estimation
problem is more complicated because of overtaking.
The Time Slice Counter Method is not guaranteed to give the correct answer
when overtaking is present, even as the time slice size gets arbitrarily small. We can
ensure that the ith DELAY and ith FINISH events fall in different time slices, but it is
possible that they do not correspond to the same job. There was virtually no difference in
the errors in experiments for time slice sizes 1, 0.5, and 0.1 with the four M/M/2 scenarios
for the 10-stage tandem queues.
Congestion and variability do not increase the accuracy of the estimates with
overtaking. Specifically, if there is much service time variability, the odds are increased
that one job will overtake the other job. This explains why the results for the 2-stage
M/U/2 system are better than those for the M/M/2 system, which are better than those for
the M/B/2 system. We observe these phenomena in both the 2-stage and 10-stage tandem
queueing systems. This leads to the following claim, stated without proof:
Claim 3 In an n-stage tandem queueing system with multiple servers at one or more of the
stages, increased variability of service times leads to overtaking, which leads to
less accurate cycle time estimates by the Time Slice Counter Method.
If we compare the same system types (e.g. M/M/2 under different scenarios), the more
congested runs still tend to perform better than the more idle runs. We believe that this is
because congestion allows DELAY and FINISH events to spread out more. Overtaking
does not appear to become more pronounced when the system gets more congested
(changing arrival/service rates, not distributions). Perhaps it would become more
pronounced if the number of servers were increased (while keeping ρ constant). On the
other hand, perhaps overtaking is driven more by the underlying system (e.g. M/M/2
versus M/B/2) than by the congestion in the system. We will research this in future work.
These observations lead to an important distinction between congestion and
variability. Variability is caused by the distributions used, while congestion is related to
the distribution parameters. If we use C_D^2 as our measure of variability in the system and
ρ as our measure of congestion, Table 9 shows that we can have both high variability and
low congestion (e.g., B/B/1 with mean arrival of 1.5) and low variability with high
congestion (e.g., U/U/1 with mean arrival of 1.1). More congested systems may lead
individual jobs to experience more variability than they would for more idle systems,
because longer busy periods mean the job’s behavior is affected by a larger number of
other jobs’ behaviors.
The errors for the 10-stage system are never greater than those for the 2-stage
system. They can even be substantially smaller than in the 2-stage case. We hypothesize
that this is because the greater number of stages allows jobs that were overtaken to regain
their position farther ahead in the system (mixing of jobs):
Claim 4 As the length of a tandem queueing system increases, the cycle-time estimation
error will not increase. It may decrease.
D.5 EXTENSIONS
In this section, we discuss extensions to the Time Slice Counter Method. They address
some of the problems of the method. In all cases, we are adding information to the
simulation model.
D.5.1 RUN-TIME ERROR ESTIMATION
A disadvantage of the Time Slice Counter Method is that we do not know the accuracy of
the method beyond the guidelines given in Section 4.5.6. To ensure small errors, we
must do a full trace simulation of the same system (or know the analytic solution) and
compare the estimates. In doing so, we have defeated the purpose of the approximation.
In (Schruben and Yücesan 1988), the authors propose transaction tagging, a
method of gathering job statistics by tracking only a fraction 1/k of all jobs. They are
motivated by memory limitations in a process interaction-based simulation package.
They determine the fraction of jobs that should be tagged to minimize the probability of
the program aborting before a sufficient number of jobs have been simulated.
Another advantage is that the effective sample size for congested systems can be
significantly smaller than the actual sample size (Bayley and Hammersley 1946).
Transaction tagging reduces the inefficiencies of tracing jobs that do not provide much
additional information.
Transaction tagging relies on a feature of the software used: The software tracks
every job, but the user has the option of not tracking all attributes for all jobs. It is
unlikely that this is the case for all software packages. In the context of the information
taxonomy, a subset J′(t) of all jobs is tracked completely, while only a minimal
amount of information is maintained on the remaining jobs J(t) \ J′(t).
In resource-driven models, we only maintain counts of the numbers of jobs at
different processing stages. Information on the jobs’ ordering in queue is lost. If we
track a fraction of the jobs to get trace information, we must know the queue location of
these jobs. The immediate answer is to have a counter for the number of jobs in queue
before the tagged job. If there is another tagged job in queue, we track the number of
jobs between the two. This is information in region I.
In determining the number of jobs to tag, there are two possibilities. We can fix
the number in a manner similar to the limited amount of work-in-progress (WIP) in a
CONWIP (CONstant WIP) production system (Hopp and Spearman 1991). An
advantage of this approach is that memory can be allocated at the beginning of the run,
rather than dynamically during the run. Disadvantages include the need to decide on a
limit before the run, and the possibility that all tracked jobs will converge in one part of
the system, leaving us without information on all other parts of the system.
The second possibility does not limit the number of jobs we are tracking at any
one point in time; we may specify which fraction of all jobs to tag. This approach grows
impractical as the number of tagged jobs increases. We need an unknown number of
counters for the unknown number of tagged jobs in queue. As the proportion of tagged
jobs approaches one, we need as many counters as there are jobs.
To solve this problem, we can add a counter to the job data structure in a
job-driven simulation. This data structure can then be used for the tagged jobs in the
resource-driven simulation. In this example, we assume each job is represented as an element in a
linked list. Each station has a counter for the total jobs waiting in queue as in a normal
resource-driven simulation, and a linked list of tagged jobs. This approach is equivalent
to tracing a subset of the jobs globally.
Each tagged element has a counter in its associated data structure. If it is the first
tagged job in queue, the counter will indicate how many untagged jobs are in queue in
front of the tagged job. If it is not the first job, the counter will indicate how many
untagged jobs are between the current tagged job and the one in front of it. This is
illustrated in Figure 43. There are 10 jobs in queue, 2 of which are tagged. They are
highlighted with stars. Below the queue is the linked list with the two tagged jobs and the
associated counts.
Figure 43. Example of job tagging
The advantage of this implementation is that, when a job begins service, we need only
decrement two counters, the global queue counter and the counter for the first tagged job
in queue. The relative positions of the tagged jobs in queue do not change, so we need
not make any changes to the counters for the other tagged jobs. When a job arrives to the
queue, we increment the global queue counter. If it is a tagged job, we must additionally
set its counter.
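The counter updates described above can be sketched in code. The following is an illustrative Python sketch, not the dissertation's implementation; all names (`TaggedQueue`, `behind_last`, and so on) are ours, and a deque stands in for the linked list of tagged jobs:

```python
from collections import deque

class TaggedQueue:
    """FCFS queue that tracks only tagged jobs individually.

    Untagged jobs are represented purely by counters; each tagged job
    stores the number of untagged jobs immediately ahead of it (i.e.,
    between it and the previous tagged job, or the front of the queue).
    """

    def __init__(self):
        self.total = 0          # global queue counter
        self.tagged = deque()   # [job_id, count_ahead] pairs, front first
        self.behind_last = 0    # untagged jobs behind the last tagged job

    def arrive(self, job_id=None):
        """Enqueue a job; pass a job_id to tag it."""
        self.total += 1
        if job_id is not None:
            # New tagged job: everything since the last tagged job is ahead of it.
            self.tagged.append([job_id, self.behind_last])
            self.behind_last = 0
        else:
            self.behind_last += 1

    def start_service(self):
        """Dequeue the front job; returns its id if it was tagged, else None."""
        assert self.total > 0
        self.total -= 1
        if not self.tagged:
            self.behind_last -= 1
            return None
        if self.tagged[0][1] > 0:       # an untagged job departs
            self.tagged[0][1] -= 1
            return None
        return self.tagged.popleft()[0]  # the first tagged job itself departs
```

Replaying the arrival pattern of Figure 43 (four untagged jobs, tagged job 17, two untagged, tagged job 29, two untagged) yields the global count of 10 and the per-tagged-job counts of 4 and 2 shown there.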
We can increase the computational efficiency of this implementation by storing
an additional (global) counter at each queue containing the number of jobs behind the last
tagged job in queue.
With the ability to accurately (and quickly) track individual jobs, we are able to
assess the accuracy of the Time Slice Counter Method in real-time. If we are not using
dynamic delay values (values of γ added during the run, see Appendix D.5.2), we can do
this by counting the number of (tagged) jobs that have been delayed at most γ time units,
for every γ. We then compare the estimates obtained using the Time Slice Counter
Method and those obtained using the tagged jobs. In addition, we have rough estimates
of F_W using the tagged job delays.
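This comparison could be carried out as in the following sketch (names are illustrative; `tsc_estimates` stands for the Time Slice Counter Method's estimates at the same γ values):

```python
def empirical_fw(delays, gammas):
    """Empirical P(W <= gamma) from the exact delays of the tagged jobs."""
    n = len(delays)
    return [sum(d <= g for d in delays) / n for g in gammas]

def max_abs_error(tsc_estimates, delays, gammas):
    """Worst-case disagreement between the two estimators over the gammas."""
    tagged = empirical_fw(delays, gammas)
    return max(abs(a - b) for a, b in zip(tsc_estimates, tagged))
```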
[Figure 43 detail: tagged job ID 17 (count 4, joined queue at time 22.19); tagged job ID 29 (count 2, joined queue at time 24.37); global queue info: Q = 10, Tagged = 2.]
If we use dynamic delay values, we still may be able to assess the accuracy of the
Time Slice Counter Method. Fewer jobs are used in assessing the accuracy for those
values of γ added later. If it is difficult to simulate jobs, or we do not have the resources
to simulate many more jobs, it is desirable to keep the exact information on all the tagged
jobs. To do so, we could write the job waiting and cycle times to a file upon service
completion. This additional information will come at a computational cost when we want
to evaluate a new probability. Data must be read from the file, and we calculate the
desired new probabilities in O(N/k) time, where N is the total number of jobs simulated
and 1/k is the tagged fraction. Nonetheless, it is possible to get probability
estimates for dynamic values of γ during the simulation run. Implementation details are
beyond the current scope and will be discussed in future work.
Even though we can estimate F_W(γ) from the exact waiting and cycle times of the
tagged jobs, the Time Slice Counter Method remains valuable: it gives accurate,
computationally efficient estimates of F_W(γ) for FCFS systems without
overtaking. It does so without allocating memory during the run for new jobs, and
without pointer manipulations. It also uses information on all jobs, not just a fraction of
the jobs, so we need not concern ourselves with which fraction of jobs should be tagged,
or with how to store and process the job information we obtain from the tagging.
Because they are based on all jobs, the estimates from the Time Slice Counter Method
also have lower variance than estimates of F_W(γ) based only on tagged job information.
D.5.2 DYNAMIC DELAY VALUES
A second drawback of the Time Slice Counter Method is that we are forced to pick γ
values before we know anything about the system (other than the arrival and service rates
and distributions). In some cases, we miss significant portions of the distribution. In
Figure 36, we have estimates only up to 40% for one of the systems and 80% for another.
We do not have any data on the upper tail of the distribution; this upper tail is often the
interesting section because it contains the events that are most likely to be harmful
(unacceptably long waiting times, catastrophic downtimes, etc.). In many of the other
systems studied, over half of the γ values have an associated F_W(γ) of 1. This is an
inefficient use of study resources.
The final disadvantage of picking γ’s before doing the simulation is that we do not
know on which sections of the distribution to focus. We have seen instances where
F̂_W(γi) = 0.3 and F̂_W(γi+1) = 0.7. Since there is a large difference between the two, it
would be worthwhile to add granularity between γi and γi+1 to have more information in
that area.
The obvious solution to these problems is to allow values of γ to be added or
deleted during the run. If we stop collecting statistics on a value of γ, we may reuse its
allocated memory for new values of γ. The resulting termination-bias problem is the
same one faced by the simulation as a whole.
Adding γ values raises the following questions:
1. When do we decide to add values?
2. How do we decide that we need more values?
3. How many new values do we need?
4. Which new values should we choose?
5. Is the current simulation run long enough to ensure good estimates for the new
values?
The first question requires further research. We want to make the decision once we are
confident that our current estimates are “accurate enough,” which may be situation-
dependent. For example, we may wish to set a desired half-width for a 95% confidence
interval on the estimate, or to specify a certain number of simulated jobs before we
evaluate. Since we are not ending the simulation run at this time, our estimates need not
have achieved the level of accuracy required of the final answer.
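A half-width rule for deciding when an estimate is "accurate enough" might look like the following sketch. Note that it treats the observations as roughly independent, which successive waiting times are not, so batching or a similar technique would be needed in practice; the function names and the default target are our own:

```python
import math

def halfwidth_95(p_hat, n):
    """Approximate 95% confidence-interval half-width for a probability
    estimate p_hat based on n (roughly independent) observations."""
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

def accurate_enough(p_hat, n, target=0.02):
    """Decide whether the estimate meets the desired half-width."""
    return halfwidth_95(p_hat, n) <= target
```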
The answer to the second question is situation-dependent. The user may specify
how many points on the distribution should be estimated, or that (s)he wants a point for at
least every 10% increase in probability, i.e., γ1 = F̂_W⁻¹(0.1), γ2 = F̂_W⁻¹(0.2), etc. If the
current values of γ are not fulfilling this requirement, more must be added. Further
research is required.
The third and fourth questions are tied to the second question. If there is a large
gap between two successive values, we need to add an appropriate number of
appropriately-chosen new values. The simplest answer is to pick a number proportional
to the gap between F_W(γi) and F_W(γi+1), and to space the new γ’s evenly between γi and
γi+1. More research is necessary to address this.
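This simplest rule, with the number of new points proportional to the gap in the estimated CDF, can be sketched as follows (the function name and the `max_gap` parameter are our own):

```python
import math

def refine_gammas(gammas, f_hats, max_gap=0.1):
    """Insert evenly spaced gamma values wherever the estimated CDF
    jumps by more than max_gap between successive gammas.

    gammas  -- sorted gamma values
    f_hats  -- estimated F_W at each gamma
    """
    out = []
    for i in range(len(gammas) - 1):
        out.append(gammas[i])
        gap = f_hats[i + 1] - f_hats[i]
        # Number of new points is proportional to the CDF gap.
        k = max(0, math.ceil(gap / max_gap) - 1)
        step = (gammas[i + 1] - gammas[i]) / (k + 1)
        out.extend(gammas[i] + j * step for j in range(1, k + 1))
    out.append(gammas[-1])
    return out
```

For the example above, with F̂_W = 0.3 at γ = 1 and F̂_W = 0.7 at γ = 2, the rule inserts three new values at 1.25, 1.5, and 1.75.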
The final question asks whether the remaining simulation time is long enough to
give us accurate estimates of F_W(γ) for the new values of γ. We will have an idea of
how much time, or how many jobs, are required to achieve the desired level of accuracy
from our initial run control strategy, and from the existing observations. Additional runs
may be required.
D.5.3 DYNAMIC TIME SLICES
Theorem 2 states that the Time Slice Counter Method can be made arbitrarily accurate for
a single-stage FCFS queueing system. As outlined in Appendix D.5.1, we can tell during
the simulation run how accurately we are able to estimate the job-driven F_W(γ). If our
estimates are inaccurate, we can increase the resolution of our time slices.
For example, we can double the number of time slices by halving the time slice
size. We either assign both time slices the same value as the original time slice, or we
can try to “smooth” the values. Figure 44 shows an example of this. The original time
slice counts are given on the top line. The bottom line shows the new time slice counts.
If there is a difference between two successive time slices, the counts for the new time
slices are roughly in the middle of the two original ones. This follows the assumption
that the DELAYs are uniformly distributed in the time slice.
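The doubling-with-smoothing step can be sketched as below; the integer midpoint reproduces the counts shown in Figure 44 (the function name is ours):

```python
def halve_time_slices(counts):
    """Double the number of time slices by halving the slice size.

    Each original count is kept, and the new slice between two originals
    gets (roughly) the midpoint of its neighbors, reflecting the assumption
    that DELAYs are uniformly distributed within a slice. The final count
    is simply duplicated.
    """
    out = []
    for i, c in enumerate(counts):
        out.append(c)
        if i + 1 < len(counts):
            out.append((c + counts[i + 1]) // 2)  # integer midpoint
        else:
            out.append(c)
    return out
```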
Figure 44. Time slice counts after doubling the number of time slices (original counts: 5, 7, 10, 10, 15; smoothed new counts: 5, 6, 7, 8, 10, 10, 10, 12, 15, 15)
The smoothing need only be done for time slices that fall between the current simulation
time t and t + γ for all γ; that is, if ξ is the original time slice size, we smooth only the
time slices lying between the slice containing t and the slice containing t + γ. We are no
longer interested in time slices in the past, and time slices beyond the one containing
t + γ (for all γ) have not yet been assigned values.
D.5.4 USING MORE INFORMATION: EXPECTED ORDER STATISTICS
The three extensions proposed in previous sections do not use additional information for
the Time Slice Counter Method itself. The extensions proposed next modify the Time
Slice Counter Method by using more information than just the current (time slice) counts.
The way of classifying uncertain jobs in Appendix D.2.5 using expectations can
be refined by using the expected order statistics rather than the expectation of the location
of a single job in the time slice. The additional information used is the relative position
of the job in the time slice. For example, if there are three DELAYs occurring in a time
slice and the current job is the first, the expectation used is that of the first of three order
statistics.
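Under the assumption that the DELAY instants are i.i.d. uniform over the slice, the r-th of n order statistics of a Uniform(0, 1) sample has expectation r/(n + 1), which scales directly to the slice. A minimal sketch (function name is ours):

```python
def expected_delay_position(slice_start, slice_size, r, n):
    """Expected time of the r-th of n DELAYs within a time slice,
    assuming DELAY instants are i.i.d. uniform over the slice:
    E[U_(r)] = r / (n + 1), scaled and shifted to the slice."""
    return slice_start + slice_size * r / (n + 1)
```

For three DELAYs in a unit slice starting at 0, the first is expected at 0.25, the second at 0.5, and the third at 0.75.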
We need to know the number of DELAYs that occur in the time slice; this
requires a comparison to the previous time slice’s count. We also need to know the
relative number of the current job. This can be derived from the number of DELAYs in the
previous and current time slices, and from the current START count. We must decide on
the order statistic distribution to use: Should we use the distribution of interarrival (and
therefore DELAY) times, or must we determine the distribution of DELAY events in a
time slice? More research is required.
D.5.5 USING MORE INFORMATION: LCFS SERVICE DISCIPLINE
We outline a method that may allow us to estimate delay probabilities for a LCFS
discipline. The disadvantage is that we need to store additional information.
Nonetheless, we need not explicitly track each job, and the additional information
consists of counters in region I.
If job i is the first or only job in a busy period for a G/G/1 system, it corresponds
to both the ith START and DELAY. If the job arrives to a non-zero queue or other jobs
arrive before it can begin service, this is not the case, and we must use additional
information to determine to which START the ith DELAY corresponds. To do so, we
store the queue sizes at the times of the START events. There are two possibilities when
comparing the queue sizes after successive START events. (We assume no balking.)
They are outlined in Table 14.
Table 14. Possible queue sizes and their interpretations

Scenario 1: Q(Sj) ≤ Q(Sj+1)
    There was at least one arrival between the two START events.
    START j+1 corresponds to job N_A(Sj+1).
Scenario 2: Q(Sj+1) = Q(Sj) − 1
    There has been no arrival. We cannot say which job START j+1
    corresponds to. It is not job N_A(Sj+1), but we require additional
    history to be able to uniquely determine which job it is.
From the queue counts, we can reconstruct how many jobs have arrived, which jobs have
already been served, and which job will be served next. The details of this approach are
the subject of future work.
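The comparison in Table 14 can be sketched as a small classifier (the function name and return labels are ours; no balking is assumed):

```python
def classify_start(q_prev, q_next):
    """Classify START j+1 from the queue sizes at successive START events.

    q_prev -- queue size Q(S_j) at START j
    q_next -- queue size Q(S_{j+1}) at START j+1
    """
    if q_next >= q_prev:
        # Scenario 1: at least one arrival in between;
        # START j+1 serves the latest arrival, job N_A(S_{j+1}).
        return "arrival"
    if q_next == q_prev - 1:
        # Scenario 2: no arrival; the job's identity requires more history.
        return "no-arrival"
    # Without balking, the queue cannot shrink by more than one
    # between successive START events.
    raise ValueError("inconsistent queue sizes for a system without balking")
```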