Random Variate Generation (with McKenzie handouts)


  • Generating Random Variates
    - Use trace-driven simulations
    - Use empirical distributions
    - Use parametric distributions

    Ref: L. Devroye, Non-Uniform Random Variate Generation

  • Trace-Driven Simulations
    - Strong validity argument
    - Easy to model systems
    - Expensive
    - Validity (is the data representative?)
    - Sensitivity analysis is difficult
    - Results cannot be generalized to other systems (restrictive)
    - Slow
    - No information about rare events

  • Empirical Distributions (histograms)
    - Strong validity argument
    - Easy to model systems
    - Can replicate
    - Data may not be representative
    - Sensitivity analysis is difficult
    - Results cannot be generalized to other systems
    - Difficult to model dependent data
    - Need a lot of data (must do conditional sampling)
    - Outlier and rare-event problems

  • Parametric Probability Models
    - Results can be generalized
    - Can do sensitivity analysis
    - Can replicate
    - Data may not be representative
    - Probability-distribution selection error
    - Parametric fitting error

  • Random Variate Generation: General Methods
    - Inverse transform
    - Composition
    - Acceptance/rejection
    - Special properties

  • Criteria for Comparing Algorithms
    - Mathematical validity: does it give what it is supposed to?
    - Numerical stability: do some seeds cause problems?
    - Speed
    - Memory requirements: are they excessively large?
    - Implementation (portability)
    - Parametric stability: is it uniformly fast for all input parameters (e.g., will it take longer to generate a Poisson process as the rate increases?)

  • Inverse Transform
    Consider a random variable X with (cumulative) distribution function F_X.
    Algorithm to generate X ~ F_X:
    1. Generate U ~ U(0,1)
    2. Return X = F_X^{-1}(U)
    F_X^{-1}(U) will always be defined because 0 ≤ U ≤ 1 and the range of F_X is [0,1] (if F_X is monotonic). What if F_X is discrete or has atoms? (Use the right-continuous generalized inverse.)

  • Consider a discrete random variate X taking values such as -4, 2, 1, with P{X = k} = p_k, and U ~ U(0,1).
    Since Σ p_k = 1, lay the p_k's out along the unit interval [0, 1].
    The random number U falls in interval k with probability p_k.
    Return the k corresponding to that interval (use an array or table).

  • Equivalent to inverting the CDF

    [Figure: step-function CDF F_X(k) = P{X ≤ k} for the discrete example; U_1 maps to X_1 = 2 and U_2 maps to X_2 = -4.]
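    A minimal Python sketch of this table-lookup idea; the support {-4, 2, 1} is taken from the slide, but the probabilities here are hypothetical placeholders:

```python
import random

values = [-4, 2, 1]
probs = [0.3, 0.5, 0.2]   # hypothetical p_k's; any list summing to 1 works

def discrete_inverse_transform():
    u, cum = random.random(), 0.0
    for x, p in zip(values, probs):
        cum += p
        if u <= cum:          # U fell in interval k
            return x
    return values[-1]         # guard against floating-point round-off
```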

  • Generating Continuous Random Variates

  • Proof the Algorithm Works
    Must show the X generated by the algorithm is, in fact, from F_X: P{X ≤ x} = F_X(x).
    P{X ≤ x} = P{F_X^{-1}(U) ≤ x} = P{U ≤ F_X(x)} = F_X(x).
    The first equality is the conjecture X = F_X^{-1}(U); the second applies F_X to both sides (valid since F_X is monotone non-decreasing); the last holds by the definition of U ~ U(0,1).
    Note: the "=" in the conjecture should really be equality in distribution, meaning the RVs have the same distribution.

  • E.g., Exponential R.V.
    X is an exponential random variable with mean β: F_X(x) = 1 - e^{-x/β}.

    To generate values of X, set U = F_X(X) and solve for (the rv) X:

    X = -β ln(1-U)   (or, since 1-U is also U(0,1), X = -β ln(U))
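    A minimal Python sketch of this inversion (beta is the mean):

```python
import math, random

# Exponential(mean=beta) by inversion: solving U = 1 - exp(-x/beta)
# gives x = -beta * ln(1 - U).
def exponential(beta):
    return -beta * math.log(1.0 - random.random())
```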

  • Weibull Random Variable (with shape parameter α and scale parameter β)
    PDF: f(x) = αβ^{-α} x^{α-1} e^{-(x/β)^α} for x ≥ 0, 0 otherwise
    CDF: F(x) = 1 - e^{-(x/β)^α} for x ≥ 0

    Setting F(x) = u ~ U(0,1) leads to x = β(-ln(1-u))^{1/α}.
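    The same inversion as a short Python sketch:

```python
import math, random

# Weibull(shape=alpha, scale=beta) by inverting F(x) = 1 - exp(-(x/beta)^alpha).
def weibull(alpha, beta):
    u = random.random()
    return beta * (-math.log(1.0 - u)) ** (1.0 / alpha)
```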

  • Triangular Distribution
    Used when a crude, single-mode distribution is needed.

    The PDF is
    f(x) = 2(x-a)/[(b-a)(c-a)]  for a ≤ x ≤ b
    f(x) = 2(c-x)/[(c-a)(c-b)]  for b ≤ x ≤ c
    f(x) = 0                    otherwise
    where a and c are the boundaries and b is the mode. [Figure: triangle on (a, c) peaking at b.]

  • The CDF is
    F(v) = (v-a)²/[(b-a)(c-a)]      for a ≤ v ≤ b
    F(v) = 1 - (c-v)²/[(c-a)(c-b)]  for b ≤ v ≤ c

    By the inversion method,
    v = a + [u(b-a)(c-a)]^{1/2}      for 0 ≤ u ≤ (b-a)/(c-a)
    v = c - [(1-u)(c-a)(c-b)]^{1/2}  for (b-a)/(c-a) ≤ u ≤ 1
    where F(b) = (b-a)/(c-a).
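    A sketch of the two-branch inversion:

```python
import math, random

# Triangular(a, b, c): a and c are the endpoints, b the mode.
# The CDF splits at F(b) = (b-a)/(c-a), so invert each branch separately.
def triangular(a, b, c):
    u = random.random()
    cut = (b - a) / (c - a)
    if u <= cut:
        return a + math.sqrt(u * (b - a) * (c - a))
    return c - math.sqrt((1.0 - u) * (c - a) * (c - b))
```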

  • Geometric Random Variable (with parameter p)
    - Number of trials until the first success (success probability p on each trial)
    - Probability mass function (PMF): P{X = j} = p(1-p)^{j-1}, j = 1, 2, ...
    - Cumulative distribution function (CDF): F(k) = Σ_{j=1}^{k} p(1-p)^{j-1} = 1 - (1-p)^k
  • Inverse Transform: Advantages
    - Intuitive
    - Can be very fast
    - Accurate
    - One random variate generated per U(0,1) random number
    - Allows variance-reduction techniques to be applied (later)
    - Truncation is easy
    - Order statistics are easy

  • Inverse Transform: Disadvantages
    - May be hard to compute F^{-1}(U) in terms of computer operations
    - F^{-1}(U) may not even exist in closed form
    - If power-series expansions are used, what is the stopping rule?
    - For discrete distributions, a search must be performed

  • Conditional Distributions
    Want to generate X conditioned to lie between a and b (truncated).
    Generate U = F_X(a) + RND*(F_X(b) - F_X(a)), so U is uniform on (F_X(a), F_X(b)), then return X = F_X^{-1}(U).
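    A sketch of this truncated sampling, where F and F_inv stand for whatever CDF/inverse pair is in use (placeholders, not specific functions from the slides):

```python
import random

# Squeeze U into (F(a), F(b)) before inverting, so every draw lands in [a, b].
def truncated_variate(F, F_inv, a, b):
    u = F(a) + random.random() * (F(b) - F(a))
    return F_inv(u)
```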

  • Inverse Transform for Discrete Random Variates
    The inverse transform can be used for discrete random variables as well.
    E.g., can use the empirical distribution function (see next slide).
    If using known probabilities, replace the 1/n step size with a p(x_i) step size.

  • Variate Generation Techniques: Empirical Distribution

    How to sample from a discrete empirical distribution:
    1. Determine {(x_1, F(x_1)), (x_2, F(x_2)), ..., (x_n, F(x_n))}
    2. Generate u
    3. Search for the interval with F(x_i) ≤ u ≤ F(x_{i+1})
    4. Report x_i
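    A sketch using binary search over stored cumulative probabilities; the data values here are hypothetical, and the usual convention (return the smallest x_i with u ≤ F(x_i)) is used:

```python
import bisect, random

xs = [1.2, 3.4, 5.0, 7.7]       # hypothetical observed values, sorted
cum = [0.25, 0.55, 0.80, 1.00]   # their cumulative probabilities F(x_i)

def empirical_variate():
    u = random.random()
    return xs[bisect.bisect_left(cum, u)]   # binary search for the interval
```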

  • Order Statistics
    The i-th order statistic is the i-th smallest of n observations.
    Generate n iid observations x_1, x_2, ..., x_n, and order them: x_[1], x_[2], ..., x_[n].
    x_[1] describes the failure time of a serial system; x_[n] describes the failure time of a parallel system.
    How can we generate x_[1] and x_[n] using one U(0,1) variate?

  • Order Statistics

    [Figures: a serial system (components in series) and a parallel system (components in parallel).]

  • Order Statistics
    F_[n](a) = P{x_[n] ≤ a} = P{max{x_i} ≤ a}
             = P{x_1 ≤ a, x_2 ≤ a, ..., x_n ≤ a}
             = P{x_1 ≤ a} P{x_2 ≤ a} ... P{x_n ≤ a}   (independence)
             = F(a)^n                                  (identically distributed)
    This is the CDF of the failure time of a parallel system in terms of the CDF of the failure time of an individual component.
    Inversion: F(a)^n = U implies a = F^{-1}(U^{1/n}).

  • Order Statistics
    F_[1](a) = P{x_[1] ≤ a} = 1 - P{x_[1] > a} = 1 - P{min{x_i} > a}
             = 1 - P{x_1 > a, x_2 > a, ..., x_n > a}
             = 1 - P{x_1 > a} P{x_2 > a} ... P{x_n > a}
             = 1 - (1 - F(a))^n
    This is the CDF of the failure time of a serial system in terms of the CDF of the failure time of an individual component.

  • Order Statistics
    Inversion: 1 - (1 - F(a))^n = u implies a = F^{-1}(1 - (1 - u)^{1/n}).
    To find the i-th order statistic, e.g., the 2nd: find X_[1], then sample the minimum of the remaining n-1 from U[F(X_[1]), 1]; two uniforms are needed to generate X_[2].
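    A sketch of both one-uniform generators, for any inverse CDF F_inv:

```python
import random

def max_order_stat(F_inv, n):
    # parallel-system failure time: invert F(a)^n = U
    return F_inv(random.random() ** (1.0 / n))

def min_order_stat(F_inv, n):
    # serial-system failure time: invert 1 - (1 - F(a))^n = U
    return F_inv(1.0 - (1.0 - random.random()) ** (1.0 / n))
```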

  • NOTE
    As n → ∞, u^{1/n} → 1 and 1 - (1-u)^{1/n} → 0 for u ~ U(0,1).
    Once the CDFs of the order statistics are known, the densities (PDFs) can be obtained by differentiating.
    [Figure: F^{-1}(1-(1-u)^{1/n}) ≤ F^{-1}(u) ≤ F^{-1}(u^{1/n}) illustrated on the CDF plot.]

  • Example
    Let X be exponential with mean 1/5, and u ~ U(0, 1). Let n = 100. Then
    x_[100] = (-1/5) ln(1 - u^{1/100}) and
    x_[1] = (-1/5) ln(1 - (1 - (1 - u)^{1/100})) = (-1/5) ln((1 - u)^{1/100})

    For u = .2, x_[100] = .827 and x_[1] = .0004.
    For u = .5, x_[100] = .995 and x_[1] = .0014.
    For u = .8, x_[100] = 1.22 and x_[1] = .0032.

  • Composition
    Can be used when F can be expressed as a convex combination of other distributions F_i, F(x) = Σ_i p_i F_i(x), where we hope to be able to sample from the F_i more easily than from F directly.

    p_i is the probability of generating from F_i (p_i ≥ 0, Σ p_i = 1).

  • Composition: Graphical Example

  • Composition: Algorithm
    1. Generate a positive random integer I such that P{I = i} = p_i for i = 1, 2, ...
    2. Return X with distribution function F_I.

    Think of step 1 as generating I with mass function p_I. (Can use the inverse transform.)
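    A minimal sketch of the two-step algorithm; the samplers argument is any list of callables for the F_i (hypothetical placeholders):

```python
import random

def composition(ps, samplers):
    # step 1: pick component i with probability p_i (inverse transform)
    u, cum = random.random(), 0.0
    for p, sample in zip(ps, samplers):
        cum += p
        if u <= cum:
            return sample()   # step 2: sample from F_i
    return samplers[-1]()     # floating-point round-off guard
```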

  • Composition: Another Example
    [Figure: density f(x) on [0, 1] split into a uniform part of area a and a right-triangular part of area 1 - a.]
    1. Generate U_1.
    2. If U_1 ≤ a, generate and return U_2.
    3. Else generate and return X from the right-triangular distribution.

  • Acceptance/Rejection
    X has density f(x) with bounded support.
    If F is hard (or impossible) to invert, and composition is too messy... what to do?

    Generate Y from a more manageable distribution and accept it as coming from f with a certain probability.

  • Acceptance/Rejection Intuition
    [Figure: an irregular density f(x) beneath a flat majorizing function M(x).]
    The density f(x) is really ugly... say, the orange curve. M is a nice majorizing function... say, a uniform.

  • Intuition: throw darts at the rectangle under M until one lands under f.
    [Figure: darts above f(x) are rejected ("Missed again!"); a dart below f(X) is accepted: done.]
    Key fact: P{accept X} is proportional to the height of f(X).

  • Acceptance/Rejection
    Create a majorizing function M(x) ≥ f(x) for all x; normalize M(·) to be a density (area = 1): r(x) = M(x)/c, where c is the area under M(x).
    Algorithm:
    1. Generate X from r(x).
    2. Generate Y from U(0, M(X)) (independent of X).
    3. If Y ≤ f(X), return X; else go to 1 and repeat.
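    A sketch with the simplest choice, a constant majorizing function of height m_height on a bounded support [lo, hi]:

```python
import random

def accept_reject(f, lo, hi, m_height):
    # assumes m_height >= f(x) for all x in [lo, hi]
    while True:
        x = random.uniform(lo, hi)          # step 1: X ~ r(x), here uniform
        y = random.uniform(0.0, m_height)   # step 2: Y ~ U(0, M(X))
        if y <= f(x):                       # step 3: accept if under f
            return x
```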

  • Generalized Acceptance/Rejection: Proof
    Let g(x) be the majorizing function with area A_g; generate X with density g(x)/A_g and, independently, Y uniform on (0, g(X)). Then
    P{X ≤ t | Y ≤ f(X)} = P{X ≤ t, Y ≤ f(X)} / P{Y ≤ f(X)}
    First,
    P{X ≤ t, Y ≤ f(X)} = ∫_{-∞}^{t} ∫_{0}^{f(x)} (1/g(x)) (g(x)/A_g) dy dx
                       = ∫_{-∞}^{t} ∫_{0}^{f(x)} (1/A_g) dy dx
                       = ∫_{-∞}^{t} (f(x)/A_g) dx

  • Second,
    P{Y ≤ f(X)} = ∫_{-∞}^{∞} ∫_{0}^{f(x)} (1/A_g) dy dx
                = ∫_{-∞}^{∞} (f(x)/A_g) dx = 1/A_g
    Therefore, P{X ≤ t | Y ≤ f(X)} = ∫_{-∞}^{t} f(x) dx.

  • Performance of the Algorithm
    Accept X with a probability equal to (area under f)/(area under M).
    If c = area under M(·), the probability of acceptance is 1/c, so we want c to be small.
    What is the expected number of U's needed to generate one X? 2c.

    Why? Each iteration is a coin flip (Bernoulli trial). The number of trials until the first success is Geometric, G(p), with E[G] = 1/p; here p = 1/c, and each iteration uses 2 U's.

  • Increase P{accept}, and stop sooner, with a tighter majorizing function M(x).
    [Figure: a piecewise majorizing function hugging f(x), raising the ratio f(X)/M(X).]

  • What if f(x) is hard to evaluate?
    Use a minorizing function m(x) ≤ f(x).
    [Figure: m(x) below f(x) below M(x).]

  • Acceptance/Rejection
    Final algorithm (with squeeze):
    1. Generate X from r(x).
    2. Generate U from U(0,1) (independent of X).
    3. If U ≤ m(X)/M(X), return X and stop.
    4. Else if U ≤ f(X)/M(X), return X and stop.
    5. Else go to 1 and try again.
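    A sketch of the squeeze version, again with a constant majorizing function of height m_height:

```python
import random

def squeeze_accept_reject(f, m, lo, hi, m_height):
    # m(x) <= f(x) <= m_height assumed on [lo, hi]
    while True:
        x = random.uniform(lo, hi)
        u = random.random()
        if u <= m(x) / m_height:   # cheap pre-test: avoids evaluating f
            return x
        if u <= f(x) / m_height:   # full test only when the squeeze fails
            return x
```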

  • Biggest Problems
    Choosing majorizing and minorizing functions such that it is easy to generate points under the majorizing function and the minorizing function is cheap to compute (f(x) itself may be expensive to compute).
    We want sampling under g(x) to be easy AND g(x) to be close to f(x): contradictory constraints.
    Result: as dim(x) → +∞, (area under f(x)) / (area in the bounding cube) → 0.

  • Special Properties
    Methods that exploit special properties of distributions to generate them.

    E.g., χ²_n = Σ_{i=1}^{n} X_i², where the X_i are iid N(0,1) random variables.
    If Z_1 and Z_2 are χ² with k_1 and k_2 degrees of freedom, then (Z_1/k_1)/(Z_2/k_2) is F_{k1,k2}.

  • Special Properties
    An Erlang is a sum of exponentials:
    ERL(r, β) = Σ_{i=1}^{r} (-β ln(u_i)) = -β ln(Π_{i=1}^{r} u_i)
    (The Gamma allows r to be a non-negative real.)
    If X is Gamma(α, 1) and Y = βX, then Y is Gamma(α, β). If α = 1, then X is Exp(1).
    Beta is a ratio of Gammas:
    - X_1 is distributed Gamma(α_1, β)
    - X_2 is distributed Gamma(α_2, β)
    - X_1 and X_2 are independent
    Then X_1/(X_1 + X_2) is distributed Beta(α_1, α_2).

  • Binomial Random Variable, Binomial(n, p) (Bernoulli approach)
    0) X = 0, i = 1
    1) Generate U distributed U(0, 1)
    2) If U ≤ p, set X ← X + 1; if U > p, leave X unchanged.
       If i < n, set i ← i + 1 and go to 1).
       If i = n, stop; X is distributed Binomial(n, p).
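    The same Bernoulli-counting idea as a one-line Python sketch:

```python
import random

# Binomial(n, p) as n Bernoulli trials: count the successes.
def binomial(n, p):
    return sum(1 for _ in range(n) if random.random() <= p)
```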

  • Geometric Random Variable, Geometric(p) (Bernoulli approach; we already saw the inversion approach)

    0) X = 1
    1) Generate U ~ U(0, 1)
    2) If U ≤ p, stop: X is distributed Geometric(p).
       If U > p, set X ← X + 1 and go to 1).

    A Negative Binomial(r, p) is the sum of r iid Geometric(p) variables.

  • Example of Special Properties:

    A Poisson(λ) random variable P is the number of Exp(mean 1/λ) inter-event times that fit in the unit interval.
    [Figure: the interval (0, 1) cut by successive Exp(1/λ) gaps; the count shown, 3, has the Poisson(λ) distribution.]
    Generate each gap as -(1/λ) ln(U_i) (L*LN{RND} in Sigma).

  • P = k exactly when Π_{i=1}^{k+1} U_i < e^{-λ} ≤ Π_{i=1}^{k} U_i: the sum of the gaps first exceeds 1 at step k+1 (the log of a product is the sum of the logs; multiply by -1/λ and exponentiate, both monotonic).
    Algorithm: to generate a Poisson with mean λ, multiply iid uniforms (RND) until the product is less than e^{-λ}; the number of factors, minus one, is the variate.
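    A sketch of the product-of-uniforms algorithm:

```python
import math, random

def poisson(lam):
    # multiply uniforms until the product drops below e^{-lam};
    # each factor stands for one exponential gap in the unit interval
    threshold, product, k = math.exp(-lam), random.random(), 0
    while product >= threshold:
        product *= random.random()
        k += 1
    return k   # number of factors minus one
```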

  • Special Properties:

    Assume we have census data only for Poisson arrivals, e.g., hospitals, deli queues, etc. Use the fact that, given K Poisson events occur in (0, T), the event times are distributed as K uniforms on (0, T).

    First arrival time: generate T_1 = T·U_[1:K] = the smallest of K uniforms on (0, T) (the min order statistic) = T·(1 - U^{1/K}) (beware of rounding errors on the computer).

    Next interarrival time = the smallest of K-1 uniforms on (0, T - T_1): generate T_2 = (T - T_1)·(1 - U^{1/(K-1)}).

    And so on, until K events have been placed in (0, T).
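    A sketch of this order-statistic recursion:

```python
import random

def arrival_times(K, T):
    # place K Poisson arrival times in (0, T) as ordered uniforms
    times, t, remaining = [], 0.0, K
    while remaining > 0:
        # smallest of `remaining` uniforms on (t, T): min order statistic
        t = t + (T - t) * (1.0 - random.random() ** (1.0 / remaining))
        times.append(t)
        remaining -= 1
    return times
```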

  • Special Properties
    Given Z ~ N(0,1), we can generate X ~ N(μ, σ²) as X = μ + σZ, so it suffices to generate N(0,1).
    Closed-form inversion is not possible.
    Acceptance/rejection is an option, but better methods have been developed.

  • Normal Random Variables
    Need only generate N(0, 1): if X is N(0, 1), then Y = σX + μ is N(μ, σ).

    1) Crude (Central Limit Theorem) method
    U ~ U(0, 1), hence E(U) = .5 and V(U) = 1/12.
    Set Y = Σ_{i=1}^{n} U_i, hence E(Y) = n/2 and V(Y) = n/12.
    By the CLT, (Y - n/2)/(n/12)^{1/2} →_D N(0, 1) as n → +∞.
    Consider Σ_{i=1}^{12} U_i - 6 (i.e., n = 12). However, n = 12 may be too small!

  • Box-Muller Method
    Generate U_1, U_2 ~ U(0,1) random numbers, then set
    X_1 = (-2 ln U_1)^{1/2} cos(2π U_2) and X_2 = (-2 ln U_1)^{1/2} sin(2π U_2).

    X_1 and X_2 will be iid N(0,1) random variables.
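    A direct sketch of Box-Muller (1 - U is used to avoid log(0)):

```python
import math, random

def box_muller():
    u1, u2 = 1.0 - random.random(), random.random()
    d = math.sqrt(-2.0 * math.log(u1))   # radius: sqrt of an Exponential(2)
    theta = 2.0 * math.pi * u2           # angle: uniform on (0, 2*pi)
    return d * math.cos(theta), d * math.sin(theta)
```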

  • Box-Muller Explanation
    If X_1 and X_2 are iid N(0,1), then D² = X_1² + X_2² is χ²_2, which is the SAME as an Exponential with mean 2, so D = (-2 ln U_1)^{1/2}.
    X_1 = D cos Θ and X_2 = D sin Θ, where Θ = 2π U_2.

  • Polar Method
    1. Let U_1, U_2 be U(0,1); define V_i = 2U_i - 1, so V_i is U(-1,1).
    2. Define S = V_1² + V_2².
    3. If S > 1, go to 1. If S ≤ 1, set Y = (-2 ln(S)/S)^{1/2}; then

    X_1 = V_1·Y and X_2 = V_2·Y are iid N(0,1).

  • Polar Method, Graphically. Let's look at some normal random variates...

  • Notes
    X_1² + X_2² = (V_1² + V_2²)·Y² = S·Y² = -2 ln(S) is distributed χ²_2, or Exponential(2),

    using Y² = -2 ln(S)/S.

    This is true because of the Theorem: S ~ U(0, 1) given S ≤ 1.

  • Proof:
    P{S ≤ z | S ≤ 1} = P{S ≤ z, S ≤ 1} / P{S ≤ 1} = P{S ≤ z} / P{S ≤ 1} for z ≤ 1
    P{S ≤ z} = P{-(z - V_1²)^{1/2} ≤ V_2 ≤ (z - V_1²)^{1/2}, V_1² ≤ z}
             = ∫_{-√z}^{√z} 2(z - v²)^{1/2}/4 dv      (V_1, V_2 each have density 1/2 on (-1, 1))
             = ∫_{-π/2}^{π/2} (z cos²(θ)/2) dθ        (substituting v = √z sin θ)
             = πz/4
    Dividing by P{S ≤ 1} = π/4 (next slide) gives P{S ≤ z | S ≤ 1} = z.

  • Proof:
    P{S ≤ 1} = P{V_1² + V_2² ≤ 1}
             = P{-(1 - V_1²)^{1/2} ≤ V_2 ≤ (1 - V_1²)^{1/2}}
             = ∫_{-1}^{1} 2(1 - v²)^{1/2}/4 dv
             = ∫_{-π/2}^{π/2} (cos²(θ)/2) dθ = π/4
    If S > 1, go back to step 1. Rejection occurs with probability P{S > 1} = 1 - π/4.

  • Scatter Plot of X1 and X2 from NOR{} in Sigma

  • Histograms of X1 and X2 from NOR{} in Sigma

  • Autocorrelations of X1 and X2 from NOR{} in Sigma

  • Histograms of X1 and X2 from the Exact Box-Muller Algorithm

  • Autocorrelations of X1 and X2 - Exact Box-Muller Algorithm

  • Scatter plot of X1 and X2 - Exact Box-Muller Algorithm
    This is weird! WHY? Marsaglia's theorem (points from congruential generators fall in planes), seen here in polar coordinates.

  • Discrete Random Variate Generation
    Discrete random variables can be tough to generate, e.g., a Binomial(N, p) with N large (say, the yield from a machine)... always check the web for updates (approximations).

    If a large number of discrete observations is needed, how can they be generated efficiently?

  • Discrete Random Variate Generation
    - Crude method: inversion. Requires searching (sort the mass function first).
    - Continuous approximation: e.g., a Geometric is the greatest integer less than an Exponential.
    - Alias method. Ref: Kronmal & Peterson, The American Statistician 33(4), pp. 214-218.

  • Alias Method
    Use when:
    - there are a large number of discrete values, and
    - you want to generate many variates from this distribution.
    Requires only one U(0, 1) variate.
    Transforms a discrete random variable into a discrete uniform random variable with an alias at each value (using conditional probability).

  • Example: p_1 = .2, p_2 = .4, p_3 = .35, p_4 = .05.
    If uniform, each p_i would be .25 for i = 1, 2, 3, 4.
    [Figure: four columns of height .25.]

  • Define Q_i = probability that i is actually chosen given that i is first selected = P{i chosen | i selected}.
    A_i is where to move (the alias) if i is not chosen.

  • [Figure: the four columns of height .25; the surplus from values 2 and 3 fills out columns 1 and 4.]
    A_1 = 2, A_2 = 3, A_3 = 3, A_4 = 2
    Q_1 = .8, Q_2 = .6, Q_3 = 1.0, Q_4 = .2
    Q_i = probability that i is actually chosen given that i is first selected = P{i chosen | i selected}.

    A_i is where to move (the alias) if i is not chosen.

  • Other Possible Alias Combinations

    A_1 = 3, A_2 = 2, A_3 = 2, A_4 = 3 with Q_1 = .8, Q_2 = 1.0, Q_3 = .4, Q_4 = .2
    A_1 = 3, A_2 = 3, A_3 = 3, A_4 = 2 with Q_1 = .8, Q_2 = .8, Q_3 = 1.0, Q_4 = .2
    [Figure: the unit interval with breakpoints 0, .2, .25, .4, .50, .75, .80, 1.0 assigned to values 1, 2, 2, 3, 3, 4, 2.]

  • Alias Table Generation Algorithm
    For i = 1, 2, ..., n, set Q_i = n·p_i.
    G = {i: Q_i > 1}   (has extra probability above 1/n)
    H = {i: Q_i < 1}   (needs an alias)
    While H is nonempty, do:
      j: any member of H; k: any member of G
      A_j = k
      Q_k = Q_k - (1 - Q_j)   (k donates the probability j lacks)
      If Q_k < 1, move k from G to H
      Remove j from H
    A sketch of this construction in code appears below.
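    A minimal Python sketch of the setup (assuming exact arithmetic; the pairing order, and hence the resulting table, may differ from the worked example below, but any valid pairing reproduces the p_i):

```python
def build_alias_table(p):
    n = len(p)
    Q = [n * pi for pi in p]
    A = [None] * n
    H = [i for i in range(n) if Q[i] < 1.0]   # need an alias
    G = [i for i in range(n) if Q[i] > 1.0]   # have surplus probability
    while H:
        j, k = H.pop(), G[-1]
        A[j] = k                    # j's leftover mass points at k
        Q[k] -= (1.0 - Q[j])        # k donates what j lacked
        if Q[k] < 1.0:              # k is now deficient itself
            G.pop()
            H.append(k)
    return Q, A
```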
  • Example

    i:    1      2      3      4      5
    p_i:  .210   .278   .089   .189   .234
    Q_i:  1.050  1.390  .445   .945   1.170
    G = {1, 2, 5} (Q_i > 1), H = {3, 4} (Q_i < 1)

    Step j = 3, k = 1: A_3 = 1, Q_1 = 1.050 - (1 - .445) = .495.
    Q_1 < 1, so 1 moves from G to H: G = {2, 5}, H = {1, 4}.

  • Step j = 1, k = 2: A_1 = 2, Q_2 = 1.390 - (1 - .495) = .885.
    Q_2 < 1, so 2 moves from G to H: G = {5}, H = {2, 4}.

    Step j = 2, k = 5: A_2 = 5, Q_5 = 1.170 - (1 - .885) = 1.055.
    G = {5}, H = {4}.

  • Step j = 4, k = 5: A_4 = 5, Q_5 = 1.055 - (1 - .945) = 1.000. H is empty, so stop.
    Final table: Q = (.495, .885, .445, .945, 1.000), A = (2, 5, 1, 5, -).

  • To verify that the table is correct, use
    P{i chosen} = Q_i/n + Σ_{j: A_j = i} (1 - Q_j)(1/n)
    p_1 = (1/5)(.495 + .555) = .210   (.555 = 1 - Q_3)
    p_2 = (1/5)(.885 + .505) = .278   (.505 = 1 - Q_1)
    p_3 = (1/5)(.445) = .089
    p_4 = (1/5)(.945) = .189
    p_5 = (1/5)(1.000 + .115 + .055) = .234   (.115 = 1 - Q_2, .055 = 1 - Q_4)

  • Using the Alias Table
    Suppose u = .67. Then 5u = 3.35, which selects column i = 4.
    Since .67 is .35 of the way between .6 and .8, the reused fractional part is .35; because .35 ≤ Q_4 = .945, value 4 is chosen (otherwise we would return the alias A_4 = 5).
    A sketch of the sampler appears below.
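    A minimal Python sketch of the sampler, preloaded with the Q and A arrays from the worked example (the last alias is a dummy because Q_5 = 1):

```python
import random

Q = [0.495, 0.885, 0.445, 0.945, 1.000]   # cut-offs from the worked example
A = [2, 5, 1, 5, 5]                        # aliases; A[4] is unused

def alias_sample():
    n = len(Q)
    nu = n * random.random()
    i = int(nu)                 # column 0..n-1 (the slide numbers them 1..n)
    frac = nu - i               # reused fractional part, U(0,1) given i
    return i + 1 if frac <= Q[i] else A[i]
```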
  • Marsaglia Tables
    For discrete random variables; the probabilities must have a power of 2 as their denominator.
    Use when:
    - there are a large number of discrete values (shifts the work from n values to log_2(n) urns), and
    - you want to generate many values from this distribution.
    Requires 2 U(0, 1) variates (actually only one is needed).

  • Example

    Prob.         Binary    .5    .25   .125   .0625   .03125
    p_0 = 7/32    .00111          -     x      x       x
    p_1 = 10/32   .01010          x     -      x       -
    p_2 = 10/32   .01010          x     -      x       -
    p_3 = 3/32    .00011          -     -      x       x
    p_4 = 2/32    .00010          -     -      x       -
    q_i:                    -     .5    .125   .3125   .0625

    Each bit column is an urn holding the values whose probability has a 1 in that position; urn i is selected with probability q_i = (number of values in the urn) × (column weight).
    Algorithm: 1) pick an urn with probability q_i; 2) pick a value from that urn with a discrete uniform.

  • NOTE: at most log_2(n) values of q_i are needed (= the number of columns).

    Check with the Law of Total Probability:
    p_0 = 0 + (.125)(1) + (.3125)(1/5) + (.0625)(1/2) = 7/32
    p_1 = (.5)(1/2) + 0 + (.3125)(1/5) + 0 = 10/32
    p_2 = (.5)(1/2) + 0 + (.3125)(1/5) + 0 = 10/32
    p_3 = 0 + 0 + (.3125)(1/5) + (.0625)(1/2) = 3/32
    p_4 = 0 + 0 + (.3125)(1/5) + 0 = 2/32

  • Poisson Process

    Key: times between arrivals are exponentially distributed

  • Nonhomogeneous PP
    The rate is no longer constant: λ(t), with cumulative rate Λ(t) = ∫_0^t λ(s) ds.

    Tempting: make the time between t_i and t_{i+1} exponentially distributed with rate λ(t_i). BAD IDEA (the rate can change over the generated gap).

  • Nonhomogeneous PP

  • NHPP
    [Figure: λ(t) beneath the constant level λ_max.]
    Use thinning, the same idea as acceptance/rejection:
    - Generate from a PP with rate λ_max.
    - Accept each arrival at time t with probability λ(t)/λ_max.

  • NHPP: Thinning with λ_max

  • Inhomogeneous Poisson Process (Thinning)
    [Figure: exp(m) candidate inter-arrivals with m = max λ(t) over (0, t*); arrivals accepted or rejected against λ(t).]
    * Generate a homogeneous Poisson process with rate m.
    * Accept each arrival at time t with probability λ(t)/m.
    See the Sigma model NONPOIS.MOD; a sketch in code follows.
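    A minimal thinning sketch, where lam is any rate function dominated by lam_max on (0, T):

```python
import math, random

def nhpp_arrivals(lam, lam_max, T):
    arrivals, t = [], 0.0
    while True:
        # candidate inter-arrival from the homogeneous PP with rate lam_max
        t += -math.log(1.0 - random.random()) / lam_max
        if t > T:
            return arrivals
        if random.random() <= lam(t) / lam_max:   # accept w.p. lam(t)/lam_max
            arrivals.append(t)
```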

  • NHPP: Inversion
    What about the fact that we don't really know λ(t)?

  • Generating Dependent Data
    Many ways to generate dependent data:
    - AR(p), autoregressive process: Y_t = a_1·Y_{t-1} + a_2·Y_{t-2} + ... + a_p·Y_{t-p} + ε_t
    - MA(q), moving-average process: Y_t = ε_t + b_1·ε_{t-1} + ... + b_q·ε_{t-q}
    - EAR, exponential autoregressive process: Y_t = R*Y_{t-1} + M*ERL{1}*(RND>R) (Sigma notation; sketched in code below)
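    A Python sketch of one EAR(1) step, mirroring the Sigma expression above (R is the retention fraction, M the exponential mean; this preserves an Exponential(M) marginal):

```python
import math, random

def ear_next(y_prev, R, M):
    # with probability 1-R, add an exponential innovation with mean M
    if random.random() > R:
        innovation = M * -math.log(1.0 - random.random())
    else:
        innovation = 0.0
    return R * y_prev + innovation
```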

  • Autocorrelation
    The AR(1) autocorrelation at lag k, ρ_k, satisfies the first-order difference equation ρ_k = α·ρ_{k-1} for k > 0,

    with boundary condition ρ_0 = 1. Solution: ρ_k = α^k.

  • EAR Process (EAR_Q.MOD)
    Histograms of the interarrival and service times will appear exponentially distributed, but plots of the values over time will look very different, and output from the model will be very different, as R is varied. Demo: earQ.MOD.

  • Driving Processes
    Problem: real-world stochastic processes are neither independent nor identically distributed... not even stationary.
    - Serial dependencies: yield drift
    - Spikes: hard and soft failures
    - Cross dependencies: maintenance
    - Nonstationarity: trends, cycles (rush hours)

  • Driving Processes
    Example: a machine subject to unplanned down-time. The TTR distribution is held constant but not independent... EAR(1). [Figure: count histogram.]

  • Driving Processes
    Example: WIP charts. [Figures: average WIP = 4.9 vs. average WIP = 44.3 under different dependence.]

  • Ref: Edward McKenzie, "Time Series Analysis in Water Resources," Water Resources Bulletin 21(4), pp. 645-650.

    Motivation: we wish to model the realistic situation where a random process has serial correlation. This differs from time-dependency (say, for modeling rush-hour traffic) in that the value of the process depends on its own history, perhaps in addition to depending on its index (time).

    Criteria (P. A. W. Lewis, in Multivariate Analysis V, pp. 151-166, North-Holland):
    1. Models specified by marginal distributions and correlation structure.

    2. Few parameters that can be easily interpreted.

    3. Structure is linear in the parameters, making the models easy to fit and generate.

    Models for Dependent Discrete-Valued Processes

  • AR(order 1) (Markov Models)
    A continuous autoregressive order-1 sequence {X_n} satisfies the difference equation (X_n - μ) = α(X_{n-1} - μ) + ε_n or, removing the means, X_n = αX_{n-1} + ε_n, with {ε_n} a sequence of iid random variables and α a positive fraction... the process retains the fraction α of its previous value.

    Note: if X_n is discrete, then ε_n must depend on X_{n-1}... we want to reduce X_{n-1} by the "right" amount so the marginal distribution is preserved.

  • Models for Dependent Discrete-Valued Processes
    McKenzie's idea: generate each unit in the integer X_{n-1} separately, and "keep" it with probability α.

    Replace αX_{n-1} with the binomial thinning α∘X_{n-1}, defined as

    α∘X_{n-1} = Σ_{i=1}^{X_{n-1}} B_i(α),

    where {B_i(α)} is an iid sequence of Bernoulli trials with P{B = 1} = α. This "reduces" X_{n-1} by the same (expected) amount as in the continuous autoregression.
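    A minimal sketch of the thinning operator:

```python
import random

# Binomial thinning: keep each of the x units independently with
# probability alpha, so E[thin(x, alpha)] = alpha * x.
def thin(x, alpha):
    return sum(1 for _ in range(x) if random.random() <= alpha)
```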

  • Applications: Poisson random variables (e.g., the number arriving in an interval at a bus stop).

    X_n = α∘X_{n-1} + ε_n

    With ε_n Poisson with mean μ(1-α): if X_0 is Poisson with mean μ, then so is every X_n.

    The correlation at lag k is α^k.

    This process is time-reversible. (Sketched in code below.)
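    A sketch of one step of this Poisson AR(1); `poisson` is any Poisson(mean) generator, e.g., the product-of-uniforms one sketched earlier:

```python
import random

def inar1_next(x_prev, alpha, mu, poisson):
    # binomially thin the previous count (keep each unit w.p. alpha) ...
    kept = sum(1 for _ in range(x_prev) if random.random() <= alpha)
    # ... then add a Poisson innovation with mean mu*(1-alpha),
    # which preserves the Poisson(mu) marginal
    return kept + poisson(mu * (1.0 - alpha))
```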

  • Negative Binomial (Note: check the reference on this?)

    (Number of trials until θ successes, where the probability of success on each trial is p.)

    X_n = α∘X_{n-1} + ε_n

    Here ε_n is NB(θ, p), and the Bernoulli probability α_i in the thinning term α∘X_{n-1} is Beta with parameters αθ and (1-α)θ.

    This has the same AR(1) correlation structure as the other models, ρ_k = α^k, and the process is also time-reversible.

  • Geometric (special case of the Negative Binomial)

    X_n = α∘X_{n-1} + B_n·G_n

    with B_n Bernoulli with P{B = 1} = 1 - α, and G_n Geometric with parameter p.

    This is the discrete analog of the EAR process.

    McKenzie also discusses Binomial and Bernoulli processes, as well as adding time dependencies (seasonality, trends, etc.).

  • Summary: Generating Random Variates
    - Could use trace-driven simulations
    - Could use empirical distributions
    - Could use parametric distributions

    Know the advantages and disadvantages of each approach...

  • Summary: General Methods
    - Inverse transform
    - Composition
    - Acceptance/rejection
    - Special properties

    Look at the data! Scatter plots, histograms, autocorrelations, etc.

  • Data Gathering
    - "Needing more data" is a common, but usually invalid, excuse. Why? (Sensitivity analysis tells you what matters.)
    - Timing devices: RFID chips. Weigh the benefits against the hassle of tracking customers.
    - Collect queue lengths. Why? Little's law: L/λ = W.
    - Coordinating between observers.
    - Hawthorne effect.

  • Distributions and BestFit or ExpertFit or Stat::Fit or ...
    - Gamma
    - Exponential (i.e., Gamma with shape parameter 1)
    - Erlang (i.e., Gamma with integer shape parameter)
    - Beta (→ Gamma in the limit)
    - Log-logistic
    - Lognormal *** ln, not log10 ***

  • 5 Dastardly D's of Data (updated from the Reader)
    Data may be:
    - Distorted: material-move times include (un)loading → underestimates the value of AGVs (loaders?); you want demand but get backorders (only those willing to wait) → overestimates service levels (resources?).

  • 5 Dastardly D's of Data, cont.
    - Dependent: you get means or histograms but not correlations, cycles, or trends → underestimates congestion (capacity?); you fail to get cross-dependencies → skill levels of operators, shift-change effects.

  • 5 Dastardly D's of Data, cont.
    - Deleted: data is censored (accounting) → the model is not valid with new data.
    - Damaged: data-entry errors, collection errors (observer effect).
    - Dated: last month's data, different product mix → valid models later fail validation.

  • 6th and 7th Dastardly D's of Data...
    - Doctored: well-intentioned people tried to clean it up...
    - Deceptive: any of the other problems might be intentional!

    Data is often used for performance evaluations, or thought to be...

  • Data Collection: General Concepts
    - Relevance: degree of control; sensitivity.
    - When: need observations early in the study; need sensitivities late in the study.
    - Cost: setup (including training and P.R.); sample size (sensitivity).
    - Accuracy: technique, skill, motivation, training, timing (Monday a.m.??).

  • Data Collection, cont.
    - Precision: technique, etc.; sample size; sample interval (dependencies).
    - Analysis: error control; verification; dependencies within a data set; dependencies between data sets.

  • What to Do? Recommendations
    - Don't wait for data before you build the model.
    - Run sensitivity experiments on the current system design to see what matters.
    - Collect data for validation, if required.
    - Remember: you are ultimately making a forecast, not predicting the past! So do sensitivity analysis on the new systems too.
    - Present output using interval estimators!!

  • Simulation study