univariate input models for stochastic simulation


  • 8/3/2019 Univariate Input Models for Stochastic Simulation


    Univariate input models for stochastic simulation

    ME Kuhl (1), JS Ivy (2), EK Lada (3), NM Steiger (4), MA Wagner (5) and JR Wilson (2)*

    (1) Rochester Institute of Technology, Rochester, NY, USA; (2) North Carolina State University, Raleigh, NC, USA; (3) SAS Institute Inc., Cary, NC, USA; (4) University of Maine, Orono, ME, USA; (5) SAIC, Vienna, VA, USA

    Techniques are presented for modelling and then randomly sampling many of the continuous univariate probabilistic

    input processes that drive discrete-event simulation experiments. Emphasis is given to the generalized beta distribution

    family, the Johnson translation system of distributions, and the Bézier distribution family because of the flexibility of

    these families to model a wide range of distributional shapes that arise in practical applications. Methods are described

    for rapidly fitting these distributions to data or to subjective information (expert opinion) and for randomly sampling

    from the fitted distributions. Also discussed are applications ranging from pharmaceutical manufacturing and medical

    decision analysis to smart-materials research and health-care systems analysis.

    Journal of Simulation (2010) 4, 81–97. doi:10.1057/jos.2009.31; published online 26 February 2010

    Keywords: simulation; continuous univariate input models; generalized beta distributions; Johnson translation

    system of distributions; Bézier distributions

    1. Introduction

    One of the main problems in the design and construction of stochastic simulation experiments is the selection of valid input models, that is, probability distributions that accurately mimic the behaviour of the random input processes driving the system under study. Often the following interrelated difficulties arise in attempts to use standard distribution families for simulation input modelling:

    1. Standard distribution families cannot adequately represent the probabilistic behaviour of many real-world input processes, especially in the tails of the underlying distribution.

    2. The parameters of the selected distribution family are troublesome to estimate from either sample data or subjective information (expert opinion).

    3. Fine-tuning or editing the shape of the fitted distribution is difficult because (i) there are a limited number of parameters available to control the shape of the fitted distribution, and (ii) there is no effective mechanism for directly manipulating the shape of the fitted distribution while simultaneously updating the corresponding parameter estimates.

    In modelling a simulation input process, the practitioner

    must identify an appropriate distribution family and then

    estimate the corresponding distribution parameters; and the

    problems enumerated above can hinder the progress of both

    of these model-building activities.

    The conventional approach to identification of a stochastic simulation input model encompasses several procedures for using sample data to accept, reject, or somehow rank each of the distribution families in a list of well-known alternatives. These procedures include (i) informal graphical techniques based on probability plots, frequency distributions, or box plots; and (ii) statistical goodness-of-fit tests such as the Kolmogorov–Smirnov, chi-squared, Anderson–Darling, and Cramér–von Mises tests. For a detailed discussion of these procedures, see Sections 6.3–6.6 of Law (2007) and Stephens (1974). Unfortunately, none of these procedures is guaranteed to yield a definitive conclusion. For example, identification of an input distribution can be based on visual comparison of superimposed graphs of a histogram of the available data set and the fitted probability density function (p.d.f.) for each of several alternative distribution families. In this situation, however, the final conclusion depends largely on the number of class intervals (also called bins or cells) in the histogram as well as the class boundaries; and a different layout for the histogram could lead the user to identify a different distribution family. Similar anomalies can occur in the use of statistical goodness-of-fit tests. In small samples, these tests can have very low power to detect lack of fit between the empirical distribution and each alternative theoretical distribution, resulting in an inability to reject any of the alternative distributions. In large samples, moreover, practically insignificant discrepancies between the empirical and theoretical distributions often appear to be statistically significant, resulting in rejection of all the alternative distributions.

    *Correspondence: JR Wilson, Edward P. Fitts Department of Industrial and Systems Engineering, North Carolina State University, 111 Lampe Drive, Daniels Hall, Room 370, Campus Box 7906, Raleigh, North Carolina 27695-7906, USA. E-mail: [email protected]

    Journal of Simulation (2010) 4, 81–97. © 2010 Operational Research Society Ltd. All rights reserved. 1747-7778/10

    www.palgrave-journals.com/jos/


    After somehow identifying an appropriate family of distributions to model an input process, the simulation user also faces problems in estimating the associated distribution parameters. The user often attempts to match the mean and standard deviation of the fitted distribution with the sample mean and standard deviation of a data set, but shape characteristics such as the sample skewness and kurtosis are less frequently considered when estimating the parameters of an input distribution. Some estimation methods, such as maximum likelihood and percentile matching, may simply fail to yield parameter estimates for some distribution families. Even if several distribution families are readily fitted to a set of sample data, the user generally lacks a definitive basis for selecting the appropriate best-fitting distribution; in particular, several commercial input-modelling packages base their model-selection procedure on an unspecified combination of some of the goodness-of-fit test statistics mentioned above, and the details of the model-selection procedure are actually concealed from the user on the grounds that such information is proprietary. A notable exception to this is the automatic distribution-fitting procedure of JMP 8 (SAS Institute Inc., 2008), which makes transparent use of the Akaike information criterion (Akaike, 1974) as the basis for selecting the distribution that yields the best fit to a given data set.

    The task of building a simulation input model is further complicated if sample data are not available. In this situation, identification of an appropriate distribution family is arbitrarily based on whatever information can be elicited from knowledgeable individuals (experts); and the corresponding distribution parameters are computed from subjective estimates of simple numerical characteristics of the underlying distribution such as the mode, selected percentiles, or low-order moments. In summary, there is some evidence that many simulation practitioners lack a clear-cut, definitive procedure for identifying and estimating high-fidelity stochastic input models (or even merely acceptable, rough-cut input models); consequently, simulation output analysis is often based on input processes of questionable validity. The latter observation, coupled with the current capabilities and limitations of typical off-the-shelf simulation input-modelling software, has led to the research that is surveyed in this article for handling some of the difficulties outlined above.

    This invited article is an expanded version of a series of introductory tutorials on simulation input modelling, which we have been asked to present at the Winter Simulation Conference for the past several years (Kuhl et al, 2006, 2008a,b). In this article techniques are presented for modelling and then randomly sampling many of the continuous univariate probabilistic input processes that drive discrete-event simulation experiments, with the primary focus on methods designed to alleviate the difficulties encountered in using conventional approaches to simulation input modelling. Emphasis is given to the generalized beta distribution family (Section 2), the Johnson translation system of distributions (Section 3), and the Bézier distribution family (Section 4) because in our experience these families can be most readily and effectively used in a broad diversity of simulation applications, especially in large-scale applications for which reasonably accurate input models must be delivered under severe time pressure, and the user may not have immediate access to detailed knowledge of the physics of all the input processes so that empirical input models must be formulated and fitted quickly using readily available sample data or subjective information. For each distribution family, we describe methods for fitting distributions to sample data or expert opinion and then for randomly sampling the fitted distributions. Much of the discussion concerns public-domain software and fitting procedures that facilitate rapid univariate simulation input modelling. To illustrate these procedures, we also discuss applications ranging from pharmaceutical manufacturing and medical decision analysis to smart-materials research and health-care systems analysis. Finally, in Section 5 conclusions and recommendations are presented, including a brief discussion of other discrete and continuous distribution families, which can be used for simulation input modelling. In a companion article (Kuhl et al, 2010), we discuss some multivariate distributions that frequently arise in probabilistic simulation input modelling; see also Sections 3–4 of Kuhl et al (2006).

    2. Generalized beta distribution family

    Suppose X is a continuous random variable with lower limit a and upper limit b whose distribution is to be approximated and then randomly sampled in a simulation experiment. In such a situation, it is often possible to model the probabilistic behaviour of X using a generalized beta distribution, whose p.d.f. has the form

    $$f_X(x) = \frac{\Gamma(\alpha_1 + \alpha_2)\,(x - a)^{\alpha_1 - 1}(b - x)^{\alpha_2 - 1}}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,(b - a)^{\alpha_1 + \alpha_2 - 1}} \quad \text{for } a \le x \le b \tag{1}$$

    where $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$ (for $z > 0$) denotes the gamma function. For graphs illustrating the wide range of distributional shapes achievable with generalized beta distributions, see one of the following references: pp 92–93 of Hahn and Shapiro (1967); pp 291–293 of Law (2007); or pp 11–14 of Kuhl et al (2008b), which is available online.

    If X has the p.d.f. (1), then the cumulative distribution function (c.d.f.) of X, which is defined by $F_X(x) = \Pr\{X \le x\} = \int_{-\infty}^{x} f_X(w)\,dw$ for all real x, unfortunately has no convenient analytical expression; but the mean and variance of X are respectively given by

    $$\mu_X = E[X] = \frac{\alpha_1 b + \alpha_2 a}{\alpha_1 + \alpha_2} \tag{2}$$

    82 Journal of Simulation Vol. 4, No. 2


    and

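    $$\sigma_X^2 = E\big[(X - \mu_X)^2\big] = \frac{(b - a)^2\,\alpha_1 \alpha_2}{(\alpha_1 + \alpha_2)^2\,(\alpha_1 + \alpha_2 + 1)} \tag{3}$$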

    Recall that for a continuous p.d.f. $f_X(\cdot)$, a mode m is a local maximum of that function; and if there is a unique global maximum for $f_X(\cdot)$, then the p.d.f. is said to be unimodal, and m is usually called the most likely value of the random variable X. If $\alpha_1, \alpha_2 \ge 1$ and either $\alpha_1 > 1$ or $\alpha_2 > 1$, then the beta p.d.f. (1) is unimodal; and the mode is given by

    $$m = \frac{(\alpha_1 - 1)\,b + (\alpha_2 - 1)\,a}{\alpha_1 + \alpha_2 - 2} \quad (\alpha_1, \alpha_2 \ge 1 \text{ and } \alpha_1 \alpha_2 > 1) \tag{4}$$

    Equations (2)–(4) reveal that key distributional characteristics of the generalized beta distribution are simple functions of the parameters $a$, $b$, $\alpha_1$, and $\alpha_2$; and this facilitates input modelling, especially in pilot studies in which rapid model development is critical.
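    To make Equations (2)–(4) concrete, the following minimal sketch evaluates the mean, variance, and mode of a generalized beta distribution directly from its four parameters. The function name and argument names are illustrative choices, not part of any software discussed in this article.

```python
def generalized_beta_summary(a, b, alpha1, alpha2):
    """Mean, variance, and mode of a generalized beta distribution on [a, b].

    Hypothetical helper for illustration; returns mode=None when the
    conditions attached to Equation (4) do not hold.
    """
    if not (a < b and alpha1 > 0 and alpha2 > 0):
        raise ValueError("need a < b and positive shape parameters")
    total = alpha1 + alpha2
    mean = (alpha1 * b + alpha2 * a) / total                      # Equation (2)
    variance = ((b - a) ** 2 * alpha1 * alpha2
                / (total ** 2 * (total + 1)))                     # Equation (3)
    # Equation (4) requires alpha1, alpha2 >= 1 and alpha1 * alpha2 > 1.
    if alpha1 >= 1 and alpha2 >= 1 and alpha1 * alpha2 > 1:
        mode = ((alpha1 - 1) * b + (alpha2 - 1) * a) / (total - 2)
    else:
        mode = None
    return mean, variance, mode
```

    For a symmetric case such as a = 0, b = 1, and both shape parameters equal to 2, the mean and mode coincide at the midpoint of the interval.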

    2.1. Fitting beta distributions to data or subjective information

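    Given a random sample $\{X_i : i = 1, \ldots, n\}$ of size n from the distribution to be estimated, let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics obtained by sorting the $\{X_i\}$ in ascending order so that $X_{(1)} = \min\{X_i : i = 1, \ldots, n\}$ and $X_{(n)} = \max\{X_i : i = 1, \ldots, n\}$. We can fit a generalized beta distribution to this data set using the following sample statistics:

    $$\hat{a} = 2X_{(1)} - X_{(2)}, \qquad \hat{b} = 2X_{(n)} - X_{(n-1)}, \qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{5}$$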

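    In particular the method of moment matching involves (i) setting the right-hand sides of (2) and (3) equal to the sample mean $\bar{X}$ and the sample variance $S^2$, respectively; and (ii) solving the resulting equations for the corresponding estimates $\hat{\alpha}_1$ and $\hat{\alpha}_2$ of the shape parameters. In terms of the auxiliary quantities

    $$d_1 = \frac{\bar{X} - \hat{a}}{\hat{b} - \hat{a}} \quad \text{and} \quad d_2 = \frac{S}{\hat{b} - \hat{a}}$$

    the moment-matching estimates of $\alpha_1$ and $\alpha_2$ are given by

    $$\hat{\alpha}_1 = \frac{d_1^2 (1 - d_1)}{d_2^2} - d_1, \qquad \hat{\alpha}_2 = \frac{d_1 (1 - d_1)^2}{d_2^2} - (1 - d_1) \tag{6}$$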
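    The moment-matching recipe in Equations (5) and (6) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, the data in the usage note are made up, and the sketch simply raises an error when the estimates fall outside the feasible region rather than applying the feasibility-constrained variant offered by BetaFit.

```python
from statistics import mean, variance


def fit_beta_by_moments(sample):
    """Moment-matching fit of a generalized beta distribution.

    Returns (a_hat, b_hat, alpha1_hat, alpha2_hat) following Equations (5)-(6).
    Hypothetical helper; not the BetaFit implementation.
    """
    xs = sorted(sample)
    if len(xs) < 3:
        raise ValueError("need at least three observations")
    a_hat = 2 * xs[0] - xs[1]          # Equation (5): extrapolated lower limit
    b_hat = 2 * xs[-1] - xs[-2]        # Equation (5): extrapolated upper limit
    if b_hat <= a_hat:
        raise ValueError("degenerate sample: estimated range is not positive")
    x_bar = mean(xs)
    s2 = variance(xs)                  # sample variance with divisor n - 1
    d1 = (x_bar - a_hat) / (b_hat - a_hat)
    d2_sq = s2 / (b_hat - a_hat) ** 2
    alpha1 = d1 ** 2 * (1 - d1) / d2_sq - d1              # Equation (6)
    alpha2 = d1 * (1 - d1) ** 2 / d2_sq - (1 - d1)        # Equation (6)
    if alpha1 <= 0 or alpha2 <= 0:
        raise ValueError("moment matching gave non-positive shape parameters")
    return a_hat, b_hat, alpha1, alpha2
```

    Applied to the made-up sample [1.0, 2.0, 2.5, 3.0, 4.5], the fitted end-points are 0 and 6, and substituting the fitted parameters back into Equations (2) and (3) reproduces the sample mean and sample variance.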

    AbouRizk et al (1994) discuss BetaFit, a Windows-based software package for fitting the generalized beta distribution to sample data by computing estimators $\hat{a}$, $\hat{b}$, $\hat{\alpha}_1$, and $\hat{\alpha}_2$ using the following estimation methods:

    • moment matching with $\hat{a} = X_{(1)}$ and $\hat{b} = X_{(n)}$;

    • feasibility-constrained moment matching, so that the feasibility conditions $\hat{a} < X_{(1)}$ and $X_{(n)} < \hat{b}$ are always satisfied;

    • maximum likelihood (assuming a and b are known and thus are not estimated); and

    • ordinary least squares (OLS) and diagonally weighted least squares (DWLS) estimation of the c.d.f.

    Figure 1 demonstrates the application of BetaFit to a sample of 9980 observations of end-to-end chain lengths (in ångströms) of the ionic polymer Nafion based on the method of moment matching. In Section 3.5 below, we provide further details on the origin of the Nafion data set and its relevance to the problem of predicting the stiffness properties of a certain class of smart materials. Like all the software packages mentioned in this article, BetaFit is in the public domain and is available on the Web site via www.ise.ncsu.edu/jwilson/page3.

    For rapid development of preliminary simulation models, practitioners often base an initial input model for the random variable X on subjective estimates $\hat{a}$, $\hat{m}$, and $\hat{b}$ of the minimum, mode, and maximum, respectively, of the distribution of X. Although the triangular distribution is often used in such circumstances, it can yield excessively heavy tails (and hence grossly unrealistic simulation results) when the distance $\hat{b} - \hat{m}$ between the estimates of the upper limit and mode is much larger than the distance $\hat{m} - \hat{a}$ between the estimates of the mode and lower limit, or vice versa. The generalized beta distribution is usually a better choice in such situations; but there is some difficulty in selecting the shape parameters to yield the desired value $\hat{m}$ for the mode. For an elaboration of this point in the context of project-management simulations, see Vanhoucke (2010).
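    As a rough illustration of the heavy-tail concern, the sketch below uses the standard closed-form mean and standard deviation of a triangular distribution. The three-point estimates (0, 1, 10) are hypothetical and were chosen only to be strongly asymmetric; the function name is likewise an illustrative choice.

```python
from math import sqrt


def triangular_mean_sd(a, m, b):
    """Mean and standard deviation of a triangular distribution
    with minimum a, mode m, and maximum b (standard closed forms)."""
    if not (a <= m <= b and a < b):
        raise ValueError("need a <= m <= b with a < b")
    mean = (a + m + b) / 3
    var = (a * a + m * m + b * b - a * m - a * b - m * b) / 18
    return mean, sqrt(var)


# Hypothetical, strongly right-skewed three-point estimate.
mean_t, sd_t = triangular_mean_sd(0.0, 1.0, 10.0)
range_sixth = (10.0 - 0.0) / 6   # common rule of thumb: sd = range / 6
```

    With these inputs the triangular mean lies far above the most likely value of 1, and the triangular standard deviation exceeds one-sixth of the range, which is the kind of tail weight that can distort simulation results.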

    In many project-management and quality-control applications, it is convenient to assume that the standard deviation of the random variable at hand is one-sixth of the corresponding range; and if we equate the right-hand sides of (3) and (4), respectively, with the subjective estimates $(\hat{b} - \hat{a})^2/36$ and $\hat{m}$ of the variance and mode of X, then we must solve a cubic equation to obtain the corresponding shape parameters of the beta p.d.f. (1). In terms of the auxiliary quantity

    $$q = \frac{\hat{m} - \hat{a}}{\hat{b} - \hat{a}}$$

    we see that in the special cases in which $q = 0$ or $q = 1$, the required shape parameters are exactly given by

    $$\left.\begin{array}{ll} \hat{\alpha}_1 = 1 \text{ and } \hat{\alpha}_2 = 3.87227 & \text{if } q = 0 \\ \hat{\alpha}_1 = 3.87227 \text{ and } \hat{\alpha}_2 = 1 & \text{if } q = 1 \end{array}\right\} \tag{7}$$

    (For a detailed justification of (7), see the Appendix of this article, which contains exact computing formulas for the shape parameters of a beta distribution with user-specified values of the end-points, mode, and variance.)

    For the more common case in which $0 < q < 1$, remarkably accurate, simple approximations to the shape parameters of the beta distribution with minimum $\hat{a}$, mode $\hat{m}$, maximum $\hat{b}$, and standard deviation $(\hat{b} - \hat{a})/6$ can be conveniently calculated from the asymmetry ratio

    $$r = \frac{\hat{b} - \hat{m}}{\hat{m} - \hat{a}} = \frac{1 - q}{q}$$

    so that the required shape parameters are given by

    $$\hat{\alpha}_1 = \frac{r^2 + 3r + 4}{r^2 + 1} \quad \text{and} \quad \hat{\alpha}_2 = \frac{4r^2 + 3r + 1}{r^2 + 1} \tag{8}$$

    see pp 202–203 of Wilson et al (1982) and McBride and McClelland (1967). If $0.02 \le q \le 0.98$, then the error in the approximation (8) is less than 3%; and if $0.1 \le q \le 0.9$, then the error in this approximation is less than 1.2%. To handle situations in which the estimated mode $\hat{m}$ is very close to one of the estimated end-points $\hat{a}$ and $\hat{b}$ (that is, $q < 0.02$ or $q > 0.98$), see the Appendix. In the application of beta distributions to a problem in medical decision making that is detailed in Section 2.4 below, the error in using the approximation (8) was essentially zero (that is, less than $10^{-8}$) on each of 50 different beta distributions used in the associated simulation study.
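    The three-point fitting rule in Equations (7) and (8) can be sketched as follows. The function name is hypothetical, and the sketch makes no attempt to apply the exact Appendix formulas that the article recommends when the mode is very close to an end-point (q < 0.02 or q > 0.98); it simply evaluates (7) at the end-points and (8) elsewhere.

```python
def beta_shapes_from_three_points(a_hat, m_hat, b_hat):
    """Approximate beta shape parameters from subjective estimates of the
    minimum, mode, and maximum, assuming sd = (b_hat - a_hat) / 6.

    Hypothetical helper illustrating Equations (7) and (8).
    """
    if not (a_hat <= m_hat <= b_hat and a_hat < b_hat):
        raise ValueError("need a_hat <= m_hat <= b_hat with a_hat < b_hat")
    if m_hat == a_hat:                 # q = 0, Equation (7)
        return 1.0, 3.87227
    if m_hat == b_hat:                 # q = 1, Equation (7)
        return 3.87227, 1.0
    r = (b_hat - m_hat) / (m_hat - a_hat)          # asymmetry ratio
    alpha1 = (r * r + 3 * r + 4) / (r * r + 1)     # Equation (8)
    alpha2 = (4 * r * r + 3 * r + 1) / (r * r + 1) # Equation (8)
    return alpha1, alpha2
```

    A useful sanity check is that substituting the result of (8) into Equation (4) returns the requested mode exactly, since $(\hat{\alpha}_1 - 1)/(\hat{\alpha}_1 + \hat{\alpha}_2 - 2) = 1/(1 + r) = q$; only the variance condition is approximate.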

    AbouRizk et al (1991) discuss the Visual Interactive Beta Estimation System (VIBES), a Windows-based software package that enables graphically oriented fitting of generalized beta distributions to subjective estimates of: (i) the end-points a and b; and (ii) any of the following combinations of distributional characteristics:

    • the mean $\mu_X$ and the variance $\sigma_X^2$,

    • the mean $\mu_X$ and the mode m,

    • the mode m and the variance $\sigma_X^2$,

    • the mode m and an arbitrary quantile $x_p = F_X^{-1}(p)$ for $p \in (0, 1)$, or

    • two quantiles $x_p$ and $x_q$ for $p, q \in (0, 1)$.

    Figure 1 Beta p.d.f. (top panel) and c.d.f. (bottom panel) fitted to 9980 Nafion chain lengths.


    As a general-purpose tool for simulation input modelling, the generalized beta distribution family has the following advantages:

    • It is sufficiently flexible to represent with reasonable accuracy a wide diversity of distributional shapes.

    • Its parameters are easily estimated from either sample data or subjective information.

    On the other hand, generating samples from the beta distribution is relatively slow; and in some applications, the time to generate beta random variables can be a substantial fraction of the overall simulation run time (Wilson et al, 1982).

    2.2. Generating beta variates

    Although most general-purpose simulation packages provide a generator of beta random variables, in our experience some care is required to verify the performance of a beta variate generator in cases where any shape parameter is less than one or is very large (say, greater than 30). Note that Equations (7)–(8) always yield $1 \le \alpha_1, \alpha_2 \le 4$ while Equations (A1)–(A5) in the Appendix always yield $\alpha_1, \alpha_2 \ge 1$; and in these situations, we have obtained excellent results using two procedures available in Press et al (2007). To generate a generalized beta random variable X with minimum a, maximum b, and shape parameters $\alpha_1$ and $\alpha_2$, the first method uses Gammadev of Press et al (2007) to generate $Y(\alpha_1, \alpha_2)$, a standard beta random variable on the unit interval [0, 1] with shape parameters $\alpha_1$ and $\alpha_2$; and then the desired random sample is given by

    $$X = a + (b - a)\,Y(\alpha_1, \alpha_2) \tag{9}$$

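    In terms of the incomplete beta function

    $$I_x(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)} \int_0^x t^{\alpha_1 - 1}(1 - t)^{\alpha_2 - 1}\,dt \quad \text{for } 0 \le x \le 1 \tag{10}$$

    (which coincides with the c.d.f. $F_{Y(\alpha_1, \alpha_2)}(x) = \Pr\{Y(\alpha_1, \alpha_2) \le x\}$ of a standard beta random variable $Y(\alpha_1, \alpha_2)$ for $0 \le x \le 1$), the second method for generating X is based on inversion of the c.d.f. of X,

    $$X = F_X^{-1}(U) = a + (b - a)\,F_{Y(\alpha_1, \alpha_2)}^{-1}(U) = a + (b - a)\,I_U^{-1}(\alpha_1, \alpha_2) \tag{11}$$

    where $U \sim \text{Uniform}[0, 1]$ is a random number and we use the procedure invbetai of Press et al (2007) to obtain a highly accurate approximation to $I_x^{-1}(\alpha_1, \alpha_2)$ for all x in [0, 1].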
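    The first generation method can be sketched with Python's standard library alone. This is not the Gammadev routine of Press et al (2007); as a stand-in, it uses `random.gammavariate` together with the standard fact that G1/(G1 + G2) has a standard beta distribution when G1 and G2 are independent gamma variates with shapes $\alpha_1$ and $\alpha_2$ and a common scale. The function name, seed, and parameter values are illustrative assumptions. The inversion method of Equation (11) is not shown because it needs an inverse incomplete beta routine that the standard library does not provide.

```python
import random


def generalized_beta_variate(a, b, alpha1, alpha2, rng=random):
    """Sample X = a + (b - a) * Y, Equation (9), where Y is a standard beta
    variate built as a ratio of independent gamma variates."""
    g1 = rng.gammavariate(alpha1, 1.0)   # shape alpha1, scale 1
    g2 = rng.gammavariate(alpha2, 1.0)   # shape alpha2, scale 1
    y = g1 / (g1 + g2)                   # standard beta on [0, 1]
    return a + (b - a) * y


# Illustrative run: symmetric beta on [2, 8] with both shape parameters 4.
rng = random.Random(12345)
sample = [generalized_beta_variate(2.0, 8.0, 4.0, 4.0, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
```

    By Equation (2) the theoretical mean in this run is 5, so the sample mean should land close to that value; comparing sample moments with Equations (2) and (3) in this way is a cheap check on any beta generator, in the spirit of the verification advice above.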

    Remark 1. In the companion paper on multivariate input modelling (Kuhl et al, 2010), $I_x^{-1}(\alpha_1, \alpha_2)$ and the associated approximation invbetai of Press et al (2007) are important tools in our approach to building multivariate beta distributions as well as stationary univariate time series whose marginals are generalized beta distributions.

    2.3. Application of beta distributions to pharmaceutical manufacturing

    Pearlswig (1995) provides a good example of a pharmaceutical manufacturing simulation whose credibility depended critically on the use of appropriate input models. In this study of the estimated production capacity of a plant that had been designed but not yet built, the usual three-time estimates ($\hat{a}$, $\hat{m}$, and $\hat{b}$) were obtained from the process engineer for each of the operations in manufacturing a certain type of effervescent tablet. Unfortunately very conservative (ie, large) estimates were provided for the upper limit $\hat{b}$ of each operation time; and when triangular distributions were used to represent batch-to-batch variation in actual processing times for each operation within each step of production, the resulting bottlenecks resulted in very low estimates of the probability of reaching a prespecified annual production level.

    As in many simulation applications in which subjective estimates $\hat{a}$, $\hat{m}$, and $\hat{b}$ are elicited from experts, the estimate $\hat{m}$ of the modal (most likely) time to perform a given operation was substantially more reliable than the estimates $\hat{a}$ and $\hat{b}$ of the lower and upper limits on the same operation time. When all the triangular distributions in the simulation were replaced by generalized beta distributions using (8) to ensure conformance to the engineer's estimate of the most likely processing time for each operation within each step, the resulting annual tablet production was in excellent agreement with the production of similar plants already in existence. This simple remedy restored the faith of management in the validity of the overall simulation model, which was subsequently used to finalize certain aspects of the design and operation of the new plant.

    2.4. Application of beta distributions to medical decision analysis

    In the following application of simulation input modelling to medical decision analysis, we compare two alternative methods for estimating the parameters of a generalized beta distribution from limited sample data or subjective information about the minimum, mode, and maximum values of the target random variable. The discussion is also intended to illustrate the extent to which simulation-generated outputs may depend on the end-points of the fitted beta distributions used in the simulation. This example provides insight into the issues surrounding the use of the generalized beta distribution to represent a simulation input that is subject to randomness or uncertainty when that distribution must be fitted to subjective information or some combination of limited sample data and subjective information.

    Cost-effectiveness studies are frequently used in medical decision making for comparing various treatment or intervention alternatives. The Panel on Cost-Effectiveness in Health and Medicine (Gold et al, 1996) defines cost-effectiveness analysis (CEA) as "… a method designed to assess the comparative impacts of expenditures on different health interventions … that … involves estimating the net, or incremental, costs and effects of an intervention – its costs and health outcomes compared with some alternative." Decision models for CEA involve a large number of input parameters, each subject to substantial uncertainty. In particular, these studies involve uncertainty and random variability with respect to the following quantities:

    (a) Probability of occurrence for each health-related outcome of interest;

    (b) Utility, that is, a number between 0 (death) and 1 (perfect health) that is assigned to each state of health or outcome relevant to item (a); and

    (c) Cost in constant dollars for each disease state and intervention.

    There is variability between patients and parameter uncertainty, each reflected in the standard errors associated with simulation-based estimates of mean performance, for example, the expected values of the costs, quality-adjusted life years, and utilities resulting from alternative treatments. Therefore an accurate assessment of cost effectiveness must involve sensitivity analysis and must attempt to model the inherent variability and uncertainty in these parameter estimates. Probabilistic sensitivity analysis is one method for performing a multiway sensitivity analysis in which all parameters subject to uncertainty are varied simultaneously by Monte Carlo sampling from the distributions postulated for those parameters.

    Xu et al (2010) develop a decision-tree model for determining the cost effectiveness of cesarean delivery upon maternal request (CDMR) for women having a single childbirth without indications. Their model compares CDMR with trial of labour (TOL) considering all possible short- and long-term outcomes and the resulting consequences for the mother and neonate. The model takes the form of a decision tree containing over 100 chance events. For each parameter in their decision model, Xu et al use either literature-based or expert-opinion-based estimates for the mode, minimum, and maximum values. Typically there is limited information available for parameter distribution estimation; moreover, there is significant variability in the parameter values because of substantial uncertainty regarding mode of delivery with respect to utility measures, the probabilities of outcomes, and outcome costs. Here we explore two examples from Xu et al in which we fit beta distributions for utility and probability parameter estimates by two different approaches:

    • Using the approximation based on Equations (7) and (8); and

    • Using the version of the so-called Beta PERT distribution that is implemented in the @RISK software (Palisade Corporation, 2009), which is usually termed the RiskPert distribution and is detailed in Equations (12) and (13) below.

    To illustrate each approach, we discuss in some detail how

    we formulated probabilistic input models of the following

    quantities:

    (i) P(Vag), the probability of a vaginal delivery given that

    the decision maker pursues a trial of labour; and

    (ii) U(SpVag), the utility associated with a spontaneous

    vaginal delivery given that the decision maker pursues a

    trial of labour.

    A trial of labour is a decision to attempt a vaginal

    delivery; this will result in a vaginal delivery or an emergency

    cesarean section. Given a vaginal delivery, there are two

    possible outcomes: a spontaneous vaginal delivery or an

    instrumental vaginal delivery. For the probability of a

    vaginal delivery P(Vag), the most likely value of 0.9

    was obtained from the published literature. Not only was

    0.9 the most frequently cited value, it was also judged

    to be the highest-quality estimate in terms of sample size

    and its applicability to populations cited in the literature.

    The values 0.844 and 0.97 were taken to be the lower and

    upper bounds on P(Vag), respectively, because they

    corresponded to the smallest and largest estimates found in

    the literature. The associated estimates of the utility

    U(SpVag) resulting from a spontaneous vaginal delivery

    were obtained similarly; and the mode, minimum, and

    maximum values found in the literature were 0.92, 0.69, and

    1.0, respectively.

    While the minimum and maximum values were the smallest and largest values found in the available literature, we recognized that the true lower bound might be less than the estimated minimum and the true upper bound might be greater than the estimated maximum in many cases. In contrast to Xu et al, who assume that the minimum and maximum values from the literature correspond to the 0.025 and 0.975 quantiles, we explored the effect of assuming that the true lower and upper bounds could be obtained by taking an appropriate offset from the original estimated minimum and maximum values, where the offset is expressed as a fraction c of the original estimate of the range,

    $$a' = \max\{0,\; a - c(b - a)\} \quad \text{and} \quad b' = \min\{b + c(b - a),\; 1\} \quad \text{for } c > 0$$


    Based on the original estimate of the mode m as well as the new estimates $a'$ and $b'$ of the true minimum and maximum values, respectively, for each distribution used in the probabilistic sensitivity analysis, we fitted a beta distribution using the approximation for the associated shape parameters given by Equations (7) and (8). In addition, we fitted the RiskPert version of the beta distribution by assuming that the mean and variance of the random variable X satisfy the following equations,

    $$\mu_X = \frac{a' + 4m + b'}{6} \quad \text{and} \quad \sigma_X^2 = \frac{(b' - \mu_X)(\mu_X - a')}{7} \tag{12}$$

    so that the corresponding shape parameters are given by

    $$\alpha_1 = 6\left(\frac{\mu_X - a'}{b' - a'}\right) \quad \text{and} \quad \alpha_2 = 6\left(\frac{b' - \mu_X}{b' - a'}\right) = 6 - \alpha_1 \tag{13}$$

    (Note that whereas Equations (2) and (3) are always true for a beta random variable X, Equations (12) and (13) are only satisfied when X has a RiskPert distribution, which is a special type of beta distribution.)
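    The end-point offset and the RiskPert shape parameters of Equations (12) and (13) can be sketched as below. The function names are hypothetical, and the sketch is written from the equations in this section rather than from the @RISK software itself. The offset function assumes, as in this application, that the quantity being modelled is a probability or utility confined to [0, 1].

```python
def widen_bounds(a, b, c):
    """Offset the estimated end-points by a fraction c of the range,
    clamped to the unit interval."""
    width = b - a
    return max(0.0, a - c * width), min(b + c * width, 1.0)


def riskpert_shapes(a0, m, b0):
    """RiskPert mean and shape parameters for minimum a0, mode m, maximum b0."""
    if not (a0 <= m <= b0 and a0 < b0):
        raise ValueError("need a0 <= m <= b0 with a0 < b0")
    mu = (a0 + 4 * m + b0) / 6                 # Equation (12)
    alpha1 = 6 * (mu - a0) / (b0 - a0)         # Equation (13)
    alpha2 = 6 * (b0 - mu) / (b0 - a0)         # Equation (13); equals 6 - alpha1
    return mu, alpha1, alpha2
```

    For the P(Vag) estimates quoted above (minimum 0.844, mode 0.9, maximum 0.97) with c = 0, the two shape parameters sum to 6, and substituting them into Equation (4) recovers the mode 0.9. Substituting them into Equation (3) reproduces the variance in Equation (12), which shows how the RiskPert variance is tied to the mean.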

    The value for c was varied from 0 to 0.1. Varying c yielded small changes in the shape parameters for the beta distributions fitted by each method. However, we found that the value of c had an effect on the cost-effectiveness decision; and the effect varied depending on the type of distribution used for all the probabilities and utilities in the decision tree. For $c \in [0, 0.02)$, there was a significant difference in the effectiveness of CDMR and TOL (ie, the 95% confidence interval for the mean difference in the utility between CDMR and TOL did not include zero) when using beta distributions fitted by each method. For $c \in [0.02, 0.07]$, there was a significant difference in the effectiveness of CDMR and TOL only when using beta distributions fitted via Equations (7) and (8). And for $c > 0.07$, the difference in effectiveness of CDMR and TOL was not significant for either method of fitting beta distributions.

    The difference in the effect of c as a function of the

    distributional assumptions can be explained by the shapes of

    the beta distributions fitted by each method. The p.d.f.s of

    the fitted beta distributions for P(Vag) and U(SpVag) are

    shown in Figure 2, subfigures 2(a)–2(f), for the cases in
    which c = 0, 0.05, and 0.1. For all the other beta
    distributions used in this application, similar behaviour
    was seen in the superimposed plots of the beta p.d.f. fitted
    via Equations (7) and (8) versus the beta p.d.f. fitted via

    Equations (12) and (13). While each fitted distribution has

    the desired mode in each case, the RiskPert distribution

    based on (12) and (13) has fatter tails than those of the p.d.f.

    based on (7) and (8); moreover, we see that for the RiskPert

    distribution, the variance clearly depends on the mean. As

    indicated above, the assumptions about the variance that

    underlie Equations (7) and (8) differ substantially from the

    assumptions about the mean and variance that underlie the

    RiskPert distribution; and these differences lead to different

    conclusions about the cost-effectiveness of CDMR compared
    with TOL when $c \in [0.02, 0.07]$.

    Remark 2. Several general conclusions emerged from the

    foregoing applications to pharmaceutical manufacturing and

    medical decision analysis. When input modelling is based on

    estimates of the minimum, most likely, and maximum values

    of a target random variable, there is often substantial

    uncertainty in the estimates of the extreme values; and in

    such situations the fitted distribution should generally have

    most of its probability concentrated in the vicinity of the

    estimated mode, which is much more accurate than the other

    two estimates. The generalized beta distribution is usually a

    good choice for rapid input modelling in these situations;

    and often acceptable results can be obtained using either

    Equations (7) and (8) or Equations (12) and (13). In our view

    the primary disadvantage of Equations (12) and (13) is that

    the variance of the fitted distribution is a function of its

    mean. In general the analysis of a simulation-generated

    response is complicated by dependence of the variance of the
    response on its mean; and numerous variance-stabilizing

    transformations have been proposed to avoid such undesir-

    able behaviour (Irizarry et al, 2003). In some types of

    applications, it may be necessary to study systematically the

    sensitivity of the simulation-generated results to changes in

    the assumed values of the mode and variance of each input

    random variable; and in this case the development given in

    the Appendix can be used to investigate the impact of

    independently varying the postulated values of the mode and

    variance of the fitted beta distribution.

    3. Johnson translation system of distributions

    Starting from a continuous random variable X whose

    distribution is unknown and is to be approximated and

    subsequently sampled, Johnson (1949) proposes the idea of

    inferring an appropriate distribution by identifying a suitable

    translation (or transformation) of X to a standard normal

    random variable Z with mean 0 and variance 1 so that
    $Z \sim N(0, 1)$. The translations have the form

    $$Z = \gamma + \delta\, g\!\left(\frac{X - \xi}{\lambda}\right), \tag{14}$$

    where $\gamma$ and $\delta$ are shape parameters, $\lambda$ is a scale parameter,
    $\xi$ is a location parameter, and $g(\cdot)$ is a function whose form
    defines the four distribution families in the Johnson
    translation system,

    $$g(y) = \begin{cases}
    \ln(y) & \text{for the } S_L \text{ (lognormal) family,} \\
    \ln\!\left(y + \sqrt{y^2 + 1}\right) & \text{for the } S_U \text{ (unbounded) family,} \\
    \ln[y/(1 - y)] & \text{for the } S_B \text{ (bounded) family,} \\
    y & \text{for the } S_N \text{ (normal) family.}
    \end{cases}$$

    DeBrota et al (1989a) detail the advantages of the Johnson

    translation system of distributions for simulation input


    modelling, especially in comparison with the triangular,

    beta, and normal distribution families.

    3.1. Johnson distribution and density functions

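    If (14) is an exact normalizing translation of X to a standard
    normal random variable, then the c.d.f. of X is given by

    $$F_X(x) = \Phi\!\left[\gamma + \delta\, g\!\left(\frac{x - \xi}{\lambda}\right)\right] \quad\text{for all } x \in H,$$

    where: (i) $\Phi(z) = (2\pi)^{-1/2} \int_{-\infty}^{z} \exp\!\left(-\tfrac{1}{2} w^2\right) dw$ denotes
    the c.d.f. of the N(0, 1) distribution; and (ii) the space H
    of X is

    $$H = \begin{cases}
    [\xi, +\infty) & \text{for the } S_L \text{ (lognormal) family,} \\
    (-\infty, +\infty) & \text{for the } S_U \text{ (unbounded) family,} \\
    [\xi, \xi + \lambda] & \text{for the } S_B \text{ (bounded) family,} \\
    (-\infty, +\infty) & \text{for the } S_N \text{ (normal) family.}
    \end{cases}$$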

    Figure 2 Beta distributions fitted to P(Vag), the probability of vaginal delivery (subfigures 2(a)–2(c), for c = 0.0, 0.05, and 0.10), and to U(SpVag), the utility of spontaneous vaginal delivery (subfigures 2(d)–2(f), for c = 0.0, 0.05, and 0.10), where the solid line is the fit using Equations (7) and (8) and the dashed line is the RiskPert fit using (12) and (13).


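    The p.d.f. of X is given by

    $$f_X(x) = \frac{\delta}{\lambda (2\pi)^{1/2}}\, g'\!\left(\frac{x - \xi}{\lambda}\right) \exp\!\left\{-\frac{1}{2}\left[\gamma + \delta\, g\!\left(\frac{x - \xi}{\lambda}\right)\right]^2\right\}$$

    for all $x \in H$, where

    $$g'(y) = \begin{cases}
    1/y & \text{for the } S_L \text{ (lognormal) family,} \\
    1/\sqrt{y^2 + 1} & \text{for the } S_U \text{ (unbounded) family,} \\
    1/[y(1 - y)] & \text{for the } S_B \text{ (bounded) family,} \\
    1 & \text{for the } S_N \text{ (normal) family.}
    \end{cases}$$

    For graphs illustrating the diversity of distributional shapes
    that can be achieved with the Johnson system of univariate
    distributions, see DeBrota et al (1989a) or pp 34–37 of Kuhl
    et al (2008b).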
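
    The c.d.f. and p.d.f. of this section can be evaluated directly from the translation function. The following Python sketch is illustrative only: the function names and the string labels "SL", "SU", "SB", and "SN" are our own assumptions, and the code is intended for x in the interior of the space H of the chosen family.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # the N(0, 1) distribution


def g(family, y):
    """Translation function g(y) for family 'SL', 'SU', 'SB', or 'SN'."""
    if family == "SL":
        return math.log(y)
    if family == "SU":
        return math.asinh(y)  # equals ln(y + sqrt(y^2 + 1))
    if family == "SB":
        return math.log(y / (1.0 - y))
    if family == "SN":
        return y
    raise ValueError(f"unknown Johnson family: {family!r}")


def g_prime(family, y):
    """Derivative g'(y) of the translation function."""
    if family == "SL":
        return 1.0 / y
    if family == "SU":
        return 1.0 / math.sqrt(y * y + 1.0)
    if family == "SB":
        return 1.0 / (y * (1.0 - y))
    if family == "SN":
        return 1.0
    raise ValueError(f"unknown Johnson family: {family!r}")


def johnson_cdf(x, family, gamma, delta, lam, xi):
    """F_X(x) = Phi[gamma + delta * g((x - xi) / lam)]."""
    y = (x - xi) / lam
    return _STD_NORMAL.cdf(gamma + delta * g(family, y))


def johnson_pdf(x, family, gamma, delta, lam, xi):
    """Johnson p.d.f. evaluated at x in the interior of H."""
    y = (x - xi) / lam
    z = gamma + delta * g(family, y)
    scale = delta / (lam * math.sqrt(2.0 * math.pi))
    return scale * g_prime(family, y) * math.exp(-0.5 * z * z)
```

    A simple sanity check on any such implementation is that a central-difference approximation to the derivative of johnson_cdf agrees with johnson_pdf at interior points.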

    3.2. Fitting Johnson distributions to sample data

    The process of fitting a Johnson distribution to sample data
    involves first selecting an estimation method and the desired
    translation function $g(\cdot)$ and then obtaining estimates of the
    four parameters $\gamma$, $\delta$, $\lambda$, and $\xi$. The Johnson translation
    system of distributions has the flexibility to match (i) any
    feasible combination of values for the mean $\mu_X$, variance $\sigma_X^2$,
    skewness

    $$\mathrm{Sk}_X = E\!\left[(X - \mu_X)^3\right] / \sigma_X^3,$$

    and kurtosis

    $$\mathrm{Ku}_X = E\!\left[(X - \mu_X)^4\right] / \sigma_X^4;$$

    or (ii) sample estimates of the moments $\mu_X$, $\sigma_X^2$, $\mathrm{Sk}_X$, and
    $\mathrm{Ku}_X$. Moreover, in principle the skewness $\mathrm{Sk}_X$ and kurtosis
    $\mathrm{Ku}_X$ uniquely identify the appropriate translation function
    $g(\cdot)$. Although there are no closed-form expressions for the
    parameter estimates based on the method of moment
    matching, these quantities can be accurately approximated
    using the iterative procedure of Hill et al (1976). Other
    estimation methods may also be used to fit Johnson
    distributions to sample data; for example, in the FITTR1
    software package (Swain et al, 1988), the following methods
    are available:

    • OLS and DWLS estimation of the c.d.f.;
    • minimum $L_1$ and $L_\infty$ norm estimation of the c.d.f.;
    • moment matching; and
    • percentile matching.
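
    As a small illustration of item (ii), the sample moments that serve as inputs to moment matching can be computed as follows. This Python sketch is not the iterative procedure of Hill et al (1976) and is not taken from FITTR1; the function name sample_moments is hypothetical, and the divisor n (rather than n − 1) in the central moments is an assumption made for simplicity.

```python
def sample_moments(data):
    """Return the sample mean, variance, skewness, and kurtosis of data."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n
    # Central sample moments of order 2, 3, and 4 (divisor n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0.0:
        raise ValueError("all observations are identical")
    skewness = m3 / m2 ** 1.5   # estimate of Sk_X
    kurtosis = m4 / m2 ** 2     # estimate of Ku_X (equals 3 for a normal)
    return mean, m2, skewness, kurtosis
```

    The resulting four values would then be passed to a moment-matching routine to obtain the translation function and the four Johnson parameters.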

    3.3. Fitting SB distributions to subjective information

    DeBrota et al (1989b) discuss VISIFIT, a public-domain

    software package for fitting Johnson SB distributions to

    subjective information, possibly combined with sample data.

    The user must provide estimates of the end-points a and b

    together with any two of the following characteristics:

    • the mode m;
    • the mean $\mu_X$;
    • the median $x_{0.5}$;
    • arbitrary quantile(s) $x_p$ or $x_q$ for $p, q \in (0, 1)$;
    • the width of the central 95% of the distribution; or
    • the standard deviation $\sigma_X$.

    3.4. Generating Johnson variates by inversion

    After a Johnson distribution has been fitted to a data set,
    generating samples from the fitted distribution is
    straightforward. First, a standard normal variate $Z \sim N(0, 1)$ is
    generated. Then the corresponding realization of the
    Johnson random variable X is found by applying to Z the
    inverse translation

    $$X = \xi + \lambda\, g^{-1}\!\left(\frac{Z - \gamma}{\delta}\right), \tag{15}$$

    where for all real z we define the inverse translation function

    $$g^{-1}(z) = \begin{cases}
    e^{z} & \text{for the } S_L \text{ (lognormal) family,} \\
    \left(e^{z} - e^{-z}\right)/2 & \text{for the } S_U \text{ (unbounded) family,} \\
    1/\left(1 + e^{-z}\right) & \text{for the } S_B \text{ (bounded) family,} \\
    z & \text{for the } S_N \text{ (normal) family.}
    \end{cases} \tag{16}$$

    Remark 3. Although most popular general-purpose

    simulation packages provide an acceptable generator of

    standard normal random variables, we are particularly

    interested in generating Z by the method of inversion,
    $Z = \Phi^{-1}(U)$, where $U \sim \text{Uniform}[0, 1]$ is a random number
    and we use the approximation to $\Phi^{-1}(\cdot)$ that is available
    via Normaldist of Press et al (2007). Also recommended is
    the approximation to $\Phi^{-1}(\cdot)$ given in Section 26.2.22 of
    Abramowitz and Stegun (1972). As documented in the
    companion paper on multivariate input modelling (Kuhl
    et al, 2010), an accurate approximation to $\Phi^{-1}(\cdot)$ will be
    a key element in our approach to building multivariate
    extensions of the Johnson translation system of distributions
    as well as stationary univariate time series whose
    marginals are Johnson distributions.
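
    The two-step scheme of Equations (15) and (16), with Z obtained by inversion as in Remark 3, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, and the inverse normal c.d.f. is taken from NormalDist.inv_cdf in the Python standard library rather than from the approximations cited above.

```python
import math
from statistics import NormalDist


def g_inverse(family, z):
    """Inverse translation function of Equation (16)."""
    if family == "SL":
        return math.exp(z)
    if family == "SU":
        return math.sinh(z)  # equals (e^z - e^-z) / 2
    if family == "SB":
        return 1.0 / (1.0 + math.exp(-z))
    if family == "SN":
        return z
    raise ValueError(f"unknown Johnson family: {family!r}")


def johnson_variate(u, family, gamma, delta, lam, xi):
    """Map a random number u in (0, 1) to a Johnson variate by inversion."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(u)                               # Z = Phi^{-1}(U)
    return xi + lam * g_inverse(family, (z - gamma) / delta)  # Equation (15)
```

    In use, u would come from a random-number generator (for instance random.random(), redrawing in the rare event that it returns exactly 0). Because the mapping from u to X is monotone, this generator preserves the ordering of the underlying random numbers.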

    3.5. Application of Johnson distributions to

    smart-materials research

    Matthews et al (2006), Weiland et al (2005), and Gao and

    Weiland (2008) present a multiscale modelling approach for

    the prediction of material stiffness of a certain class of smart

    materials called ionic polymers. The material stiffness

    depends on multiple parameters, including the effective

    length of the polymer chains composing the material. In a

    case study of Nafion, a specific type of ionic polymer,


    Matthews et al (2006) develop a simulation model of the

    conformation of Nafion polymer chains on a nanoscopic

    level, from which a large number of end-to-end chain lengths

    are generated. The p.d.f. of end-to-end distances is then

    estimated and used as an input to a macroscopic-level

    mathematical model to quantify material stiffness.

    Figure 3 shows the empirical distribution of 9980

    simulation-generated observations of end-to-end Nafion

    chain lengths (in ångströms). Superimposed on the empirical

    distribution is the result of using the DWLS estimation

    method to fit an unbounded Johnson (SU) distribution to the

    chain length data. Figure 3 reveals a remarkably accurate fit

    to the given data set. Furthermore, comparing the Johnson

    fit in Figure 3 with the beta fits for the same data set in

    Figure 1, we see that the Johnson distribution is able to

    capture certain key aspects of the Nafion data set that the

    beta distribution is unable to represent adequately.

    Gao and Weiland (2008), Matthews et al (2006), and

    Weiland et al (2005) conclude that the estimates of the

    distribution of chain lengths obtained by fitting an appropriate
    Johnson distribution to the data are more intuitive

    than those using other density estimation techniques for the

    following reasons. First, it is possible to write down an

    explicit functional form for the Johnson p.d.f. $f_X(x)$ that is
    simple to differentiate. This is a crucial property because the
    second derivative $f_X''(x)$ of the p.d.f. will be used as an input

    to a mathematical model to estimate material stiffness.

    Second, there is a relatively simple relationship between the

    Johnson parameters and the material stiffness. Weiland et al

    (2005) summarize the results of a sensitivity analysis for the

    Johnson parameters and the corresponding effect on

    material stiffness. In general, Weiland et al find that

    increasing the location parameter $\xi$ leads to an increase in
    predicted stiffness. Similarly, increasing the shape parameter
    $\delta$ or decreasing the scale parameter $\lambda$ both lead to marginally

    higher predicted material stiffness. Establishing a consistent

    relationship between these parameters and stiffness would

    first serve to extend the current theory to stiffness predic-

    tions, and may ultimately also serve as a step toward the

    custom design of materials with specific stiffness properties.

    3.6. Application of Johnson distributions to health-care

    systems analysis

    In a recent study of the arrival patterns of patients who have

    scheduled appointments at a community health-care clinic,

    Alexopoulos et al (2008) find that patient tardiness (ie, the

    patient's deviation from the scheduled appointment time) is
    most accurately modelled using an $S_U$ distribution.
    Specifically, they consider data on patient tardiness collected by the

    Partnership of Immunization Providers, a collaborative

    public-private project created by the University of California,

    San Diego School of Medicine, Division of Community

    Pediatrics, in association with community clinics and small,

    private provider practices. Alexopoulos et al (2008) perform

    an exhaustive analysis of 18 continuous distributions, and

    they conclude that the SU distribution provides superior fits

    to the available data.

    4. Bézier distribution family

    4.1. Definition of Bézier curves

    In computer graphics, a Bézier curve is often used to
    approximate a smooth (continuously differentiable) function
    on a bounded interval by forcing the Bézier curve to pass
    in the vicinity of selected control points
    $\{\mathbf{p}_i = (x_i, z_i)^{\mathrm{T}} : i = 0, 1, \ldots, n\}$ in two-dimensional Euclidean space.
    (Throughout this article, all vectors will be column vectors unless
    otherwise stated; and the roman superscript T will denote the
    transpose of a vector or matrix.) Formally, a Bézier curve of
    degree n with control points $\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_n\}$ is given
    parametrically by

    $$\mathbf{P}(t) = \sum_{i=0}^{n} B_{n,i}(t)\, \mathbf{p}_i \quad\text{for } t \in [0, 1], \tag{17}$$

    where the blending function $B_{n,i}(t)$ (for all $t \in [0, 1]$) is the
    Bernstein polynomial

    $$B_{n,i}(t) = \frac{n!}{i!\,(n - i)!}\, t^{i} (1 - t)^{n - i} \quad\text{for } i = 0, 1, \ldots, n. \tag{18}$$
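
    Equations (17) and (18) translate almost directly into code. The following Python sketch is illustrative only; the names bernstein and bezier_point, and the representation of control points as a list of (x, z) pairs, are assumptions of ours.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein polynomial B_{n,i}(t) of Equation (18)."""
    return comb(n, i) * t ** i * (1.0 - t) ** (n - i)


def bezier_point(control_points, t):
    """Evaluate P(t) of Equation (17) for control points [(x_0, z_0), ...]."""
    n = len(control_points) - 1
    weights = [bernstein(n, i, t) for i in range(n + 1)]
    x = sum(w * p[0] for w, p in zip(weights, control_points))
    z = sum(w * p[1] for w, p in zip(weights, control_points))
    return x, z
```

    Two properties are easy to confirm numerically with this sketch: the blending functions sum to 1 for every t in [0, 1], and the curve passes exactly through the first and last control points at t = 0 and t = 1, respectively.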

    Figure 3 Johnson $S_U$ c.d.f. (left panel) and p.d.f. (right panel) fitted to 9980 Nafion chain lengths.


    4.2. Bézier distribution and density functions

    If X is a continuous random variable whose space is the
    bounded interval [a, b] and if X has c.d.f. $F_X(\cdot)$ and p.d.f.
    $f_X(\cdot)$, then in principle we can approximate $F_X(\cdot)$ arbitrarily
    closely using a Bézier curve of the form (17) by taking a
    sufficient number (n + 1) of control points with appropriate
    values for the coordinates $(x_i, z_i)^{\mathrm{T}}$ of the ith control point
    $\mathbf{p}_i$ for $i = 0, \ldots, n$. If X is a Bézier random variable, then the
    c.d.f. of X is given parametrically by

    $$\mathbf{P}(t) = \{x(t),\ F_X[x(t)]\}^{\mathrm{T}} \quad\text{for } t \in [0, 1], \tag{19}$$

    where

    $$x(t) = \sum_{i=0}^{n} B_{n,i}(t)\, x_i \quad\text{and}\quad F_X[x(t)] = \sum_{i=0}^{n} B_{n,i}(t)\, z_i. \tag{20}$$

    Equation (20) reveals that the control points $\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_n$
    constitute the parameters regulating all the properties of
    a Bézier distribution. Thus the control points must be
    arranged so as to ensure the basic requirements of a c.d.f.: (i)
    $F_X(x)$ is monotonically nondecreasing in the cutoff value x;
    (ii) $F_X(a) = 0$; and (iii) $F_X(b) = 1$. By utilizing the Bézier
    property that the curve described by (19)–(20) passes
    through the control points $\mathbf{p}_0$ and $\mathbf{p}_n$ exactly, we can ensure
    that $F_X(a) = 0$ if we take $\mathbf{p}_0 = (a, 0)^{\mathrm{T}}$; and we can ensure that
    $F_X(b) = 1$ if we take $\mathbf{p}_n = (b, 1)^{\mathrm{T}}$. See Wagner and Wilson
    (1996a) for a complete discussion of univariate Bézier
    distributions and their use in simulation input modelling.

    If X is a Bézier random variable with c.d.f. $F_X(\cdot)$ given
    parametrically by (19), then it follows that the corresponding
    p.d.f. $f_X(x)$ for all real x is given parametrically by

    $$\mathbf{P}(t) = \{x(t),\ f_X[x(t)]\}^{\mathrm{T}} \quad\text{for } t \in [0, 1],$$

    where x(t) is given by (20) and

    $$f_X[x(t)] = \frac{\sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta z_i}{\sum_{i=0}^{n-1} B_{n-1,i}(t)\, \Delta x_i}.$$

    In the last equation, $\Delta x_i = x_{i+1} - x_i$ and $\Delta z_i = z_{i+1} - z_i$ (for
    $i = 0, 1, \ldots, n - 1$) represent the corresponding first differences
    of the x- and z-coordinates of the original control points
    $\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_n\}$ in the parametric representation (19) of the
    c.d.f.
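
    The parametric c.d.f. and p.d.f. of this section can be evaluated as in the following Python sketch. It is illustrative only: the function names and the use of separate coordinate lists xs and zs are our own assumptions, and the control points are assumed to define a valid c.d.f. whose x-coordinates are not all equal (so that the denominator of the p.d.f. is positive).

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein polynomial B_{n,i}(t) of Equation (18)."""
    return comb(n, i) * t ** i * (1.0 - t) ** (n - i)


def bezier_cdf_point(xs, zs, t):
    """Return (x(t), F_X[x(t)]) from Equations (19) and (20)."""
    n = len(xs) - 1
    x = sum(bernstein(n, i, t) * xs[i] for i in range(n + 1))
    cdf = sum(bernstein(n, i, t) * zs[i] for i in range(n + 1))
    return x, cdf


def bezier_pdf_point(xs, zs, t):
    """Return (x(t), f_X[x(t)]) using first differences of the coordinates."""
    n = len(xs) - 1
    dx = [xs[i + 1] - xs[i] for i in range(n)]
    dz = [zs[i + 1] - zs[i] for i in range(n)]
    numerator = sum(bernstein(n - 1, i, t) * dz[i] for i in range(n))
    denominator = sum(bernstein(n - 1, i, t) * dx[i] for i in range(n))
    x = sum(bernstein(n, i, t) * xs[i] for i in range(n + 1))
    return x, numerator / denominator
```

    As a check, the control points (0, 0), (1, 0.5), and (2, 1) yield x(t) = 2t and a c.d.f. value of t, so the sketch reproduces the uniform distribution on [0, 2] with constant density 0.5.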

    4.3. Generating Bézier variates by inversion

    The method of inversion can be used to generate a Bézier
    random variable whose c.d.f. has the parametric representation
    displayed in Equations (19) and (20). Given a random
    number $U \sim \text{Uniform}[0, 1]$, we perform the following steps:
    (i) find $t_U \in [0, 1]$ such that

    $$\sum_{i=0}^{n} B_{n,i}(t_U)\, z_i = U; \tag{21}$$

    and (ii) deliver the variate

    $$X = \sum_{i=0}^{n} B_{n,i}(t_U)\, x_i. \tag{22}$$

    The solution to (21) can be computed by any root-finding
    algorithm such as Müller's method, Newton's method, or
    the bisection method. Codes to implement this approach to
    generating Bézier variates are available on Web site
    www.ise.ncsu.edu/jwilson/page3.
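
    Steps (i) and (ii) can be sketched with the bisection method, the simplest of the root-finding options mentioned above. This Python sketch is not the code distributed on the Web site; the function names and the tolerance tol are hypothetical, and the z-coordinates of the control points are assumed to be nondecreasing so that the left-hand side of (21) is nondecreasing in t.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein polynomial B_{n,i}(t) of Equation (18)."""
    return comb(n, i) * t ** i * (1.0 - t) ** (n - i)


def bezier_variate(xs, zs, u, tol=1e-12):
    """Generate a Bezier variate from the random number u by inversion."""
    n = len(xs) - 1

    def cdf_at(t):
        # Left-hand side of Equation (21) as a function of t.
        return sum(bernstein(n, i, t) * zs[i] for i in range(n + 1))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:           # step (i): bisection search for t_U
        mid = 0.5 * (lo + hi)
        if cdf_at(mid) < u:
            lo = mid
        else:
            hi = mid
    t_u = 0.5 * (lo + hi)
    # Step (ii): deliver X from Equation (22).
    return sum(bernstein(n, i, t_u) * xs[i] for i in range(n + 1))
```

    Bisection is slower than Newton's method but needs no derivative and cannot leave the interval [0, 1], which makes it a reasonable default for a sketch of this kind.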

    Remark 4. As documented in the companion paper on

    multivariate input modelling (Kuhl et al, 2010), the inversion

    scheme specified in Equations (21) and (22) for generating

    Bézier random variables will be a key element in our
    approach to building multivariate extensions of the univariate
    Bézier distributions as well as stationary univariate
    time series whose marginals are Bézier distributions.

    4.4. Using PRIME to model Bézier distributions

    PRIME is a graphical, interactive software system that

    incorporates the methodology detailed in this section to help

    an analyst estimate the univariate input processes arising in

    simulation studies. PRIME is written entirely in the C

    programming language, and it has been developed to run

    under Microsoft Windows. A public-domain version of the

    software is available on the previously mentioned Web site.

    PRIME is designed to be easy and intuitive to use. The

    construction of a c.d.f. is performed through the actions of

    the mouse, and several options are conveniently available

    through menu selections. Control points are represented as

    small black squares, and each control point is given a unique

    label corresponding to its index i in Equation (17). Figure 4

    shows a typical session in PRIME, where the c.d.f. and p.d.f.

    windows are both displayed.

    In the absence of data, PRIME can be used to model an

    input process conceptualized from subjective information or

    expertise. Section 5.1 of Wagner and Wilson (1996a)

    contains a detailed example of the interactive use of PRIME
    for subjective input modelling; here we merely provide an

    overview of this approach to using PRIME. The representa-

    tion of the conceptualized distribution is achieved by adding,

    deleting, and moving the control points via the mouse. Each

    control point acts like a magnet that pulls the curve in the

    direction of the control point, where the blending functions

    (ie, the Bernstein polynomials defined by Equation (18))

    govern the strength of the magnetic attraction exerted on

    the curve by each control point. Clicking (ie, selecting) and

    dragging (ie, moving) a control point causes the displayed


    c.d.f. to be updated (nearly) instantaneously. If they are

    displayed, the corresponding p.d.f., the first four moments
    (that is, the mean, variance, skewness, and kurtosis), and
    selected percentile values of the Bézier distribution are

    updated (nearly) simultaneously in adjacent windows so that

    the user gets immediate feedback on the effects of moving

    selected control points. Thus, the user has a variety of

    readily available indicators and measures, as well as visually

    appealing displays, to aid in the construction of the

    conceptualized distribution.

    As detailed in Wagner and Wilson (1996a, b), PRIME

    includes several standard estimation procedures for fitting

    distributions to sample data sets:

    • OLS estimation of the c.d.f.;
    • minimum $L_1$ and $L_\infty$ norm estimation of the c.d.f.;
    • maximum likelihood estimation (assuming a and b are known);
    • moment matching; and
    • percentile matching.

    Figure 5 shows a Bézier distribution that was fitted to the
    same data set consisting of Nafion polymer chain lengths as
    shown in Figure 3. In this application of PRIME, we
    obtained the fitted Bézier distribution automatically, where:
    (i) the number of control points (n + 1) was determined by
    the likelihood ratio test detailed in Wagner and Wilson
    (1996b); and (ii) the components of the control points were
    estimated by the method of OLS. Figure 5 shows that
    a Bézier distribution yielded an excellent fit to the given

    data set.

    As another example that illustrates the capability of

    PRIME and the Bézier distribution family to handle

    multimodal data, we describe briefly an input-modelling

    problem that arose in a manufacturing simulation study. For

    more details on this application using an earlier version of

    PRIME that did not incorporate automatic determination of

    the number of control points to be used in the fitted
    Bézier distribution, see Section 5.2 of Wagner and Wilson

    (1996a). Surface mount capacitors were stored in lots of

    varying sizes in a facility adjacent to the insulation resistance

    (IR) testing area. To model the operation of the IR testing

    area, we needed to estimate the distribution of capacitor lot

    sizes in the storage facility.

    Capacitor lot-size data were available for 2083 tested lots.

    The left-hand panel of Figure 6 displays the empirical c.d.f.

    for this data set and the final fitted Bézier c.d.f.; and the
    right-hand panel displays a histogram and the final fitted
    Bézier p.d.f., where all of the original observations were

    divided by 1000 for simplicity. Notice that in the vicinity of

    20 and 270 on the new scale (that is, lot sizes expressed in

    1000s), there are pronounced peaks in the histogram. Usually

    such a bimodal distribution indicates that the sample was

    taken from two distinct distributions that must be fitted

    separately so that the overall fitted distribution is a mixture

    of the two component distributions; for an elaboration of this

    point, see Remark 5 below. However in the current context,

    the production engineers were unable to provide any addi-

    tional information that would have enabled us to model

    the lot-size distribution as a mixture of two simpler distri-

    butions; and thus we were forced to exploit the capabilities

    of PRIME for modelling multimodal distributions.

    The fitted Bézier distribution displayed in Figure 6 was
    obtained in two steps using the method of OLS. First we
    simply used the default settings of PRIME to fit a Bézier

    distribution with six control points; and the resulting fit was

    unimodal and was judged to be unsatisfactory based on

    visual inspection of the fitted p.d.f. and c.d.f. (As detailed in

    Wagner and Wilson (1996a), several other widely used

    commercial input-modelling packages also yielded unsatis-

    factory fits to this data set precisely because they do not

    include any distribution families that can adequately handle

    multimodal data sets.) In the second step of using PRIME to
    fit a Bézier distribution to the lot-size data set, we used the
    option for automatic determination of the number of control
    points starting from the current configuration. As shown in
    Figure 6, the final fitted Bézier distribution had 13 control
    points; and the fitted p.d.f. and c.d.f. closely approximated
    the corresponding histogram and empirical c.d.f. for the
    lot-size data set.

    Figure 4 PRIME windows showing the Bézier c.d.f. (left panel) with its control points and the p.d.f. (right panel).

    Remark 5. If a data set has two or more clearly
    distinguishable sources each with its own distribution, then
    an alternative approach to fitting a multimodal distribution
    to the overall data set is to represent the corresponding
    c.d.f. (or p.d.f.) as a mixture of the c.d.f.s (or p.d.f.s) for
    the individual sources, where the mixing probabilities are
    the associated long-run percentages of the overall data set
    obtained from each source; see Section 8.2.2 of Law (2007).
    In this situation it is natural to fit a distribution to the
    subsample from each source separately; and then the
    corresponding estimate of the mixing probability is simply
    the fraction of the entire data set obtained from the relevant
    source. This approach could not be used in the lot-size
    application described above because separate sources of data
    could not be identified.

    Figure 5 Bézier distribution fitted to 9980 Nafion chain lengths.

    Figure 6 Bézier distribution fitted to capacitor lot-size data set of size 2083.
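
    The mixture approach of Remark 5 can be sketched as follows for the case in which each observation carries a label identifying its source. This Python sketch is illustrative only: the function names are hypothetical, and each fitted component is assumed to be available as a callable that returns one variate when given a random-number generator.

```python
import random


def mixing_probabilities(labels):
    """Estimate each mixing probability as the fraction of data per source."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def sample_mixture(components, probabilities, rng):
    """Pick a source with its mixing probability, then sample that source.

    components maps each source label to a callable taking rng and
    returning one variate from the distribution fitted to that source.
    """
    sources = list(components)
    weights = [probabilities[source] for source in sources]
    chosen = rng.choices(sources, weights=weights, k=1)[0]
    return components[chosen](rng)
```

    For example, with rng = random.Random(2010) and two fitted components, repeated calls to sample_mixture return variates from the first source in roughly the proportion of observations that came from that source.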

    The Bézier distribution family, which is entirely specified
    by its control points $\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_n\}$, has the following
    advantages:

    • It is extremely flexible and can represent a wide diversity of distributional shapes. For instance, Figures 4 and 6 depict multimodal distributions that are easily constructed using PRIME, yet impossible to achieve with other distribution families.
    • If data are available, then the likelihood ratio test of Wagner and Wilson (1996b) can be used in conjunction with any of the estimation methods enumerated above to find automatically both the number and location of the control points.
    • In the absence of data, PRIME can be used to determine the conceptualized distribution based on known quantitative or qualitative information that the user perceives to be pertinent.
    • As the number (n + 1) of control points increases, so does the flexibility in fitting Bézier distributions. The interpretation and complexity of the control points, however, do not change with the number of control points.

    5. Conclusions and recommendations

    The common thread running through this article is the focus

    on robust input models that are computationally tractable

    and sufficiently flexible to represent adequately many of the

    probabilistic phenomena that arise in many applications of

    discrete-event stochastic simulation. For another approach

    to input modelling with no data, see Craney and White

    (2004).

    The emphasis in this article has been on the beta, Johnson,

    and Bézier families because of their flexibility and because
    we have found that in practice, they can be most effectively
    applied to simulation projects in which a large number of

    input models must be built under conditions in which the

    user lacks either of the following: (i) detailed information

    about the mechanism generating the target inputs; or (ii) the

    time to gather the information specified in (i) and use that

    information to derive the precise functional form of the

    relevant distribution. For situations in which the user has

    more information about the genesis of the continuous

    univariate distribution to be modelled, we have found the

    Pearson system of distributions can often be used effectively;

    see Chapter 4 of Elderton and Johnson (1969) and Sections

    6.2–6.13 of Stuart and Ord (1994). Johnson et al (1994, 2004)

    provide a comprehensive discussion of continuous univariate

    distributions; see also Kotz and van Dorp (2004). For a

    similar treatment of discrete univariate distributions, see

    Johnson et al (2005).

    Notably missing from this article is a discussion of

    Bayesian techniques for simulation input modelling, a topic

    that we think will receive increasing attention from

    practitioners and researchers alike in the future. In selecting

    the input models for a simulation, we must account for three

    main sources of uncertainty:

    1. Stochastic uncertainty arises from dependence of the

    simulation output on the random numbers generated and

    used on each run; for example, the random number U
    used to generate a generalized beta random variable X via
    Equation (11).

    2. Model uncertainty arises when the correct input model is
    unknown, and we must choose between alternative input
    models with different functional forms that adequately fit
    available sample data or subjective information; for
    example, the generalized beta, Johnson $S_U$, and Bézier

    distributions fitted to the Nafion data set as depicted in

    Figures 1, 3 and 5, respectively.

    3. Parameter uncertainty arises when the parameters of the

    selected input model(s) are unknown and must be

    estimated from sample data or subjective information.

    Although stochastic uncertainty is much more widely

    recognized by simulation practitioners than the other two
    types of uncertainty, it is not always a major source of

    variation in simulation output as demonstrated by Zouaoui

    and Wilson (2004) using an M/G/1 queueing system

    simulation in which stochastic uncertainty accounts for only

    2% of the posterior variance of the average waiting time in

    the queue, while model uncertainty regarding the exact

    functional form of the service-time distribution accounts for

    18% of the posterior variance, and thus 80% of the

    posterior variance is due to uncertainty regarding the exact

    numerical values of the arrival rate and the parameters of the

    service-time distribution. In such a situation, conventional

    approaches to input modelling have the potential to yield a

    grossly misleading picture of the inherent accuracy of

    simulation-generated system performance measures such as

    the average queue waiting time. For an introduction to

    Bayesian input modelling, see Chick (1999, 2001) and

    Zouaoui and Wilson (2003, 2004).

    Another topic not discussed in this article is the use of

    heavy-tailed distributions in simulation input modelling. If

    the random variable X has a heavy-tailed distribution, then

    $$1 - F_X(x) = \Pr\{X > x\} \sim c\, x^{-\alpha} \quad\text{as } x \to \infty, \tag{23}$$

    where $c > 0$ is a location parameter, $\alpha$ is a shape parameter
    with $\alpha \in (1, 2)$, and $\sim$ means that the ratio of the left- and
    right-hand sides of (23) tends to 1 as $x \to \infty$. Heavy-tailed

    distributions frequently arise in simulations of computer and

    communications systems (Crovella and Lipsky, 1997;

    Greiner et al, 1999; Heyde and Kou, 2004). Fishman and

    Adan (2005) discuss some situations in which the lognormal

    distribution (a member of the Johnson translation system)

    can provide a reasonable substitute for a heavy-tailed

    distribution.
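    As a rough illustration of the tail condition (23), the sketch below samples from a Pareto distribution, which is one simple distribution whose tail probability equals $c\,x^{-\alpha}$ exactly rather than only asymptotically. The function names `sample_power_tail` and `empirical_tail`, the seed, and the parameter values are assumptions made for this example and do not come from the article.

```python
import random

def sample_power_tail(c, alpha, rng):
    """Draw X with Pr{X > x} = c * x**(-alpha) for x >= c**(1/alpha) (a Pareto law)."""
    u = 1.0 - rng.random()            # uniform on (0, 1], avoids division by zero
    return (c / u) ** (1.0 / alpha)   # solve c * x**(-alpha) = u for x

def empirical_tail(samples, x):
    """Fraction of the samples that are strictly greater than x."""
    return sum(1 for s in samples if s > x) / len(samples)

if __name__ == "__main__":
    rng = random.Random(12345)        # arbitrary seed chosen for this example
    c, alpha = 1.0, 1.5               # alpha in (1, 2) as in the text
    xs = [sample_power_tail(c, alpha, rng) for _ in range(100_000)]
    for x in (2.0, 4.0, 8.0):
        # Empirical and theoretical tail probabilities should be close.
        print(x, empirical_tail(xs, x), c * x ** (-alpha))
```

    For $\alpha \in (1, 2)$ a Pareto distribution of this form has a finite mean but an infinite variance, which is one reason time averages driven by such inputs can converge slowly.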

    Additional material on techniques for simulation input

    modelling will be posted to the Web site http://www.ise.ncsu.edu/jwilson/more_info.

    Acknowledgements. Partial support for some of the research described in this article was provided by National Science Foundation Grant DMI-9900164.

    References

    AbouRizk SM, Halpin DW and Wilson JR (1991). Visual

    interactive fitting of beta distributions. J Constr Eng Mngt

    117: 589–605.

    AbouRizk SM, Halpin DW and Wilson JR (1994). Fitting beta

    distributions based on sample data. J Constr Eng Mngt 120:

    288–305.

    Abramowitz M and Stegun IA (1972). Handbook of Mathematical

    Functions with Formulas, Graphs, and Mathematical Tables.

    Dover: New York.

    Akaike H (1974). A new look at the statistical model identification.

    IEEE T Automat Contr AC-19: 716–723.

    Alexopoulos C et al (2008). Modeling patient arrival times in

    community clinics. Omega 36: 33–43.

    Chick SE (1999). Steps to implement Bayesian input distribution

    selection. In: Farrington PA, Nembhard HB, Sturrock DT and

    Evans GW (eds). Proceedings of the 1999 Winter Simulation

    Conference. Institute of Electrical and Electronics Engineers:

    Piscataway, NJ, pp 317–324, http://www.informs-sim.org/wsc99papers/044.PDF, accessed 28 March 2009.

    Chick SE (2001). Input distribution selection for simulation

    experiments: Accounting for input uncertainty. Opns Res 49:

    744–758.

    Craney TA and White N (2004). Distribution selection with no data

    using VBA and Excel. Qual Eng 16: 643–656.

    Crovella ME and Lipsky L (1997). Long-lasting transient

    conditions in simulations with heavy-tailed workloads. In:

    Andradottir S, Healy KJ, Withers DH and Nelson BL (eds).

    Proceedings of the 1997 Winter Simulation Conference. Institute
    of Electrical and Electronics Engineers: Piscataway, NJ,
    pp 1005–1012, http://www.informs-sim.org/wsc97papers/1005.PDF, accessed 8 July 2009.

    DeBrota DJ et al (1989a). Modeling input processes with

    Johnson distributions. In: MacNair EA, Musselman KJ and

    Heidelberger P (eds). Proceedings of the 1989 Winter Simulation

    Conference. Institute of Electrical and Electronics Engineers:

    Piscataway, NJ, pp 308–318, http://www.ise.ncsu.edu/jwilson/files/debrota89wsc.pdf, accessed 28 March 2009.

    DeBrota DJ, Dittus RS, Roberts SD and Wilson JR (1989b). Visual

    interactive fitting of bounded Johnson distributions. Simulation

    52: 199–205.

    Dickson LE (1939). New First Course in the Theory of Equations.

    Wiley: New York.

    Elderton WP and Johnson NL (1969). Systems of Frequency Curves.

    Cambridge University Press: Cambridge.

    Fishman GS and Adan IJB (2005). How heavy-tailed distributions

    affect simulation-generated time averages. ACM Trans Model

    Comput Simul 16: 152–173.

    Gao F and Weiland LM (2008). A multiscale model applied to ionic

    polymer stiffness prediction. J Mater Res 23: 833–841.

    Gold MR, Siegel JE, Russell LB and Weinstein MC (1996).

    Cost-effectiveness in Health and Medicine. Oxford University

    Press: New York.

    Greiner M, Jobmann M and Lipsky L (1999). The importance of

    power-tail distributions for modeling queueing systems. Opns

    Res 47: 313–326.

    Hahn GJ and Shapiro SS (1967). Statistical Models in Engineering.

    Wiley: New York.

    Heyde CC and Kou SG (2004). On the controversy over tailweight

    distributions. Opns Res Lett 32: 399–408.

    Hill ID, Hill R and Holder RL (1976). Algorithm AS99: Fitting

    Johnson curves by moments. Appl Stat 25: 180–189.

    Irizarry MA et al (2003). Analyzing transformation-based simulation
    metamodels. IIE Trans 35: 271–283.

    Johnson NL (1949). Systems of frequency curves generated by

    methods of translation. Biometrika 36: 149–176.

    Johnson NL, Kemp AW and Kotz S (2005). Univariate Discrete

    Distributions, 3rd edn, Wiley-Interscience: New York.

    Johnson NL, Kotz S and Balakrishnan N (1994). Continuous

    Univariate Distributions, Vol. 1, 2nd edn, Wiley-Interscience:

    New York.

    Johnson NL, Kotz S and Balakrishnan N (2004). Continuous

    Univariate Distributions, Vol. 2, 2nd edn, Wiley-Interscience:

    New York.

    Kotz S and van Dorp JR (2004). Beyond Beta: Other Continuous

    Families of Distributions with Bounded Support and Applications.

    World Scientific: Singapore.

    Kuhl ME et al (2006). Introduction to modeling and generating

    probabilistic input processes for simulation. In: Perrone LF, et al. (eds). Proceedings of the 2006 Winter Simulation

    Conference. Institute of Electrical and Electronics Engineers:

    Piscataway, NJ, pp 19–35, http://www.informs-sim.org/wsc06papers/003.pdf, accessed 28 March 2009.

    Kuhl ME et al (2008a). Introduction to modeling and generating

    probabilistic input processes for simulation. In: Mason SJ, et al.

    (eds). Proceedings of the 2008 Winter Simulation Conference.

    Institute of Electrical and Electronics Engineers: Piscataway,

    NJ, pp 48–61, http://www.informs-sim.org/wsc08papers/008.pdf, accessed 28 March 2009.

    Kuhl ME et al (2008b). Introduction to modeling and generating

    probabilistic input processes for simulation. Slides accompanying
    the oral presentation of Kuhl et al (2008a),
    http://www.ise.ncsu.edu/jwilson/files/wsc08imt.pdf, accessed 28 March 2009.

    Kuhl ME et al (2010). Multivariate input models for stochastic

    simulation. J Simul (in preparation).

    Law AM (2007). Simulation Modeling and Analysis 4th edn,

    McGraw-Hill: New York.

    Matthews JL et al (2006). Monte Carlo simulation of a solvated

    ionic polymer with cluster morphology. Smart Mater Struct 15:

    187–199.

    McBride WJ and McClelland CW (1967). PERT and the beta

    distribution. IEEE Trans Eng Mngt EM-14: 166–169.

    Palisade Corp (2009). Getting started in @RISK. Palisade

    Corp.: Ithaca, NY, http://www.palisade.com/risk/5/tips/EN/gs/,

    accessed 5 July 2009.


    Pearlswig DM (1995). Simulation modeling applied to the single pot

    processing of effervescent tablets. Master's thesis, Integrated

    Manufacturing Systems Engineering Institute, North Carolina

    State University, Raleigh, NC, http://www.ise.ncsu.edu/jwilson/files/pearlswig95.pdf, accessed 28 March 2009.

    Press WH, Teukolsky SA, Vetterling WT and Flannery BP (2007).

    Numerical Recipes: The Art of Scientific Computing, 3rd edn.

    Cambridge University Press: Cambridge.

    SAS Institute Inc (2008). JMP 8 Statistics and Graphics Guide.
    http://www.jmp.com/support/downloads/pdf/jmp8/jmp_stat_graph_guide.pdf, accessed 28 October 2009.

    Stephens MA (1974). EDF statistics for goodness of fit and some

    comparisons. J Am Stat Assoc 69: 730–737.

    Stuart A and Ord K (1994). Kendall's Advanced Theory of Statistics,

    Volume 1: Distribution Theory, 6th edn, Edward Arnold: London.

    Swain JJ, Venkatraman S and Wilson JR (1988). Least-squares

    estimation of distribution functions in Johnson's translation
    system. J Stat Comput Simul 29: 271–297.

    Vanhoucke M (2010). Using activity and sensitivity and network

    topology information to monitor project time performance.

    Omega (forthcoming).

    Wagner MAF and Wilson JR (1996a). Using univariate Bézier
    distributions to model simulation input processes. IIE Trans 28:
    699–711.

    Wagner MAF and Wilson JR (1996b). Recent developments in

    input modeling with Bezier distributions. In: Charnes JM,

    Morrice DJ, Brunner DT and Swain JJ (eds). Proceedings of the

    1996 Winter Simulation Conference. Institute of Electrical

    and Electronics Engineers: Piscataway, NJ, pp 1448–1456,

    http://www.ise.ncsu.edu/jwilson/files/wagner96wsc.pdf, accessed

    28 March 2009.

    Weiland LM, Lada EK, Smith RC and Leo DJ (2005). Application

    of rotational isomeric state theory to ionic polymer stiffness

    predictions. J Mater Res 20: 2443–2455.

    Wilson JR, Vaughan DK, Naylor E and Voss RG (1982). Analysis

    of Space Shuttle ground operations. Simulation 38: 187–203.

    Xu X et al (2010). Pelvic floor consequences of cesarean delivery

    on maternal request in women with a single birth: A cost-effectiveness analysis. J Womens Health 19: 147–160.

    Zouaoui F and Wilson JR (2003). Accounting for parameter uncertainty
    in simulation input modeling. IIE Trans 35: 781–792.

    Zouaoui F and Wilson JR (2004). Accounting for input-model

    and input-parameter uncertainties in simulation. IIE Trans 36:

    1135–1151.

    Appendix

    Exact computation of shape parameters for beta

    distribution fitted to user-specified mode and variance

    To simplify the notation in this appendix, we let $a$, $m$, and $b$ denote the user-specified minimum, mode, and
    maximum of the target distribution with $a < b$ and $m \in [a, b]$
    as if these quantities were known exactly; in practice of
    course it is often necessary to use estimates $\hat{a}$, $\hat{m}$, and $\hat{b}$ of these quantities in the following development. In
    this appendix, we provide exact computing formulas
    for the shape parameters $\alpha_1$ and $\alpha_2$ of the generalized
    beta distribution (1) on the interval $[a, b]$ that has the
    user-specified mode $m$ and the user-specified variance
    $\sigma_X^2 = (b - a)^2/\omega$.

    If $\omega > 12$ (so that the desired beta distribution has a
    smaller variance than that of the uniform distribution on the
    interval $[a, b]$), then for any value of $m \in [a, b]$, there is a
    unique generalized beta distribution on $[a, b]$ with a unique
    mode at $m$. (If $\omega = 12$, then it can be shown that we must have $\alpha_1 = \alpha_2 = 1$ so that the beta distribution with the given mode and variance coincides with the uniform distribution
    on $[a, b]$. Since the mode is assumed to be unique, this
    uninteresting case is eliminated from further consideration.)
    If we set the right-hand side of (4) equal to $m$ and the
    right-hand side of (3) equal to $(b - a)^2/\omega$, then we obtain the following equivalent system of equations in terms of the
    asymmetry ratio $r = (b - m)/(m - a)$, provided $m > a$ so that $r < \infty$:

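    $$\left.\begin{aligned}
    \alpha_1^3 + B\alpha_1^2 + C\alpha_1 + D &= 0 \\
    \alpha_2 &= r\alpha_1 + 1 - r
    \end{aligned}\right\} \tag{A1}$$

    where

    $$\left.\begin{aligned}
    B &= \frac{-3r^3 - 2r^2 + (5 - \omega)r + 4}{(1 + r)^3} \\
    C &= \frac{3r^3 - 5r^2 + (\omega - 3)r + 5 - \omega}{(1 + r)^3} \\
    D &= \frac{-r^3 + 4r^2 - 5r + 2}{(1 + r)^3}
    \end{aligned}\right\} \tag{A2}$$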
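    The cubic in (A1) can be recovered by direct substitution. The sketch below assumes that (3) and (4), which are not restated in this appendix, are the usual variance and mode formulas of the generalized beta distribution with $\alpha_1, \alpha_2 > 1$; the symbol $s$ is introduced here only as shorthand for $\alpha_1 + \alpha_2$.

```latex
\begin{align*}
r &= \frac{b-m}{m-a} = \frac{\alpha_2 - 1}{\alpha_1 - 1}
  \;\Longrightarrow\; \alpha_2 = r\alpha_1 + 1 - r, \\
\frac{1}{\omega} &= \frac{\alpha_1\alpha_2}{(\alpha_1+\alpha_2)^2(\alpha_1+\alpha_2+1)}
  \;\Longrightarrow\; \omega\,\alpha_1\,(r\alpha_1 + 1 - r) = s^2(s+1),
  \qquad s = (1+r)\alpha_1 + 1 - r .
\end{align*}
```

    Expanding $s^2(s+1)$ in powers of $\alpha_1$, moving the left-hand side across, and dividing by $(1 + r)^3$ yields the coefficients $B$, $C$, and $D$ listed in (A2).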

    Remark 6. In the case that $m = a$ so that $r = \infty$, we solve the mirror image problem for which $m = b$ and $r = 0$; and then we interchange the resulting shape parameters to obtain
    a generalized beta distribution whose mode coincides with its
    minimum. See also Remark 7 below.

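    It can be proved that if $\omega > 12$, then for all $r \in [0, \infty]$ the cubic equation in $\alpha_1$ defined by (A1)–(A2) has a nonnegative
    discriminant

    $$\Delta = 18BCD - 4B^3D + B^2C^2 - 4C^3 - 27D^2$$

    so that the cubic equation has three real roots $\{z_j : j = 1, 2, 3\}$ such that:

    $$\left.\begin{aligned}
    z_1 &> 1 \\
    z_2, z_3 &< 1
    \end{aligned}\right\} \tag{A3}$$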
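    As a quick sanity check of (A3), consider the symmetric case $r = 1$, in which the mode is the midpoint of $[a, b]$. This worked example is ours rather than the article's, and it relies on the standard variance formula for a symmetric beta distribution.

```latex
\begin{align*}
r = 1:\quad & B = \frac{4-\omega}{8}, \qquad C = 0, \qquad D = 0, \qquad \Delta = 0, \\
& \alpha_1^3 + B\alpha_1^2 = \alpha_1^2(\alpha_1 + B) = 0
  \;\Longrightarrow\; z_1 = \frac{\omega - 4}{8}, \quad z_2 = z_3 = 0 .
\end{align*}
```

    Here $z_1 > 1$ exactly when $\omega > 12$. For instance, $\omega = 20$ gives $\alpha_1 = \alpha_2 = 2$, whose variance ratio $\alpha_1\alpha_2 / [(\alpha_1+\alpha_2)^2(\alpha_1+\alpha_2+1)] = 4/80 = 1/20$ matches $1/\omega$.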

    As possible values of $\alpha_1$, the roots $z_2$ and $z_3$ are unacceptable
    for the following reasons:

    (i) The assignment $\alpha_1 \in (0, 1)$ yields a generalized beta
    distribution with an asymptote at its lower limit $a$,
    which seems intuitively problematic and is clearly
    unacceptable when the user-specified mode $m$ exceeds
    the lower limit.

    (ii) The assignment $\alpha_1 \le 0$ does not define a legitimate
    generalized beta distribution.

    We are therefore left with the unique assignment $\alpha_1 = z_1$; and a computing formula for $\alpha_1$ can be derived from the


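    explicit solution to a cubic equation as follows (see Sections
    33–38 of Dickson, 1939). In terms of the auxiliary quantities

    $$\left.\begin{aligned}
    P &= C - \tfrac{1}{3}B^2 \\
    Q &= D - \tfrac{1}{3}BC + \tfrac{2}{27}B^3
    \end{aligned}\right\} \tag{A4}$$

    we have

    $$\alpha_1 = z_1 = \begin{cases}
    \left(-\tfrac{4}{3}P\right)^{1/2} \cos\left\{\tfrac{1}{3}\cos^{-1}\left[-\tfrac{1}{2}Q\left(-\tfrac{3}{P}\right)^{3/2}\right]\right\} - \tfrac{1}{3}B, & \text{if } \Delta > 0, \\
    -B, & \text{if } \Delta = 0.
    \end{cases} \tag{A5}$$

    Finally we take $\alpha_2 = r\alpha_1 + 1 - r$ to complete the specification of the generalized beta distribution.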

    Remark 7. In general to avoid numerical difficulties that
    can occur with large values of $r$ (that is, when $r \gg 1$), we recommend the following approach to the use of
    Equations (A1)–(A5). If $(b - m)/(m - a) > 1$, then we solve the mirror image problem for which $r = (m - a)/(b - m) < 1$; and finally we interchange the resulting shape parameters to
    obtain a generalized beta distribution with the user-specified
    mode $m$.
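    The computation in (A1)–(A5), together with the mirror-image device of Remarks 6 and 7, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' software: the function name `beta_shapes_from_mode_variance` is hypothetical, and instead of branching on $\Delta = 0$ exactly (which is fragile in floating-point arithmetic) the sketch clamps the inverse-cosine argument to $[-1, 1]$, relying on the fact that the trigonometric expression returns the largest real root of the cubic.

```python
import math

def beta_shapes_from_mode_variance(a, m, b, omega):
    """Return (alpha1, alpha2) for the generalized beta distribution on [a, b]
    with mode m and variance (b - a)**2 / omega, following (A1)-(A5)."""
    if not (a < b and a <= m <= b):
        raise ValueError("need a < b and a <= m <= b")
    if omega <= 12:
        raise ValueError("need omega > 12 (variance below the uniform's)")

    # Remarks 6 and 7: when the asymmetry ratio would exceed 1 (including
    # m == a), solve the mirror-image problem and swap the shapes afterwards.
    mirrored = (b - m) > (m - a)
    r = (m - a) / (b - m) if mirrored else (b - m) / (m - a)

    # Coefficients of the cubic in alpha1, Equation (A2).
    k = (1.0 + r) ** 3
    B = (-3 * r**3 - 2 * r**2 + (5 - omega) * r + 4) / k
    C = (3 * r**3 - 5 * r**2 + (omega - 3) * r + 5 - omega) / k
    D = (-r**3 + 4 * r**2 - 5 * r + 2) / k

    # Auxiliary quantities, Equation (A4).
    P = C - B * B / 3.0
    Q = D - B * C / 3.0 + 2.0 * B**3 / 27.0
    if P >= 0:
        raise ArithmeticError("unexpected P >= 0; cubic lacks three real roots")

    # Largest root z1, Equation (A5). Clamping absorbs rounding error and
    # stands in for the explicit Delta == 0 branch (an assumption of this sketch).
    arg = max(-1.0, min(1.0, -0.5 * Q * (-3.0 / P) ** 1.5))
    alpha1 = math.sqrt(-4.0 * P / 3.0) * math.cos(math.acos(arg) / 3.0) - B / 3.0
    alpha2 = r * alpha1 + 1.0 - r
    return (alpha2, alpha1) if mirrored else (alpha1, alpha2)
```

    For example, calling the function with `a=0`, `m=0.5`, `b=1`, and `omega=20` should return shape parameters close to (2, 2), the symmetric beta distribution whose variance is 1/20 of the squared range.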

    Received 13 July 2009;

    accepted 9 November 2009 after one revision
