Post on 04-Jan-2016
CPSC 531: Data Analysis 1
CPSC 531: Output Data Analysis
Instructor: Anirban Mahanti
Office: ICT 745
Email: mahanti@cpsc.ucalgary.ca
Class Location: TRB 101
Lectures: TR 15:30 – 16:45 hours
Slides primarily adapted from: “The Art of Computer Systems Performance Analysis” by Raj Jain, Wiley 1991. [Chapters 12, 13, and 25]

Outline
Measures of Central Tendency
  • Mean, Median, Mode
How to Summarize Variability?
Comparing Systems Using Sample Data
Comparing Two Alternatives
Transient Removal

Measures of Central Tendency (1)
Sample mean: sum of all observations divided by the total number of observations
  • Always exists and is unique
  • Gives equal weight to all observations
  • Is strongly affected by outliers
Sample median: list the observations in increasing order; the observation in the middle of the list is the median
  • With an even number of observations, take the mean of the middle two values
  • Always exists and is unique
  • Resistant to outliers (compared to the mean)
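A minimal Python sketch of the two definitions above, using the standard `statistics` module. The sample values are hypothetical and were chosen only to show how a single outlier shifts the mean while barely moving the median.

```python
import statistics

# Hypothetical response times in ms (not from the slides).
sample = [10, 12, 11, 13, 12, 11]
# Same sample with one outlier appended.
sample_with_outlier = sample + [500]

mean_before = statistics.mean(sample)
# Even number of observations: the median is the mean of the middle two values.
median_before = statistics.median(sample)

mean_after = statistics.mean(sample_with_outlier)
median_after = statistics.median(sample_with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

In this hypothetical sample the outlier moves the mean from 11.5 to roughly 81, while the median only moves from 11.5 to 12.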

Measures of Central Tendency (2)
Sample mode: plot a histogram of the observations and find the bucket with the peak frequency; the midpoint of this bucket is the mode
  • The mode may not exist (e.g., all samples have equal weight)
  • More than one mode may exist (e.g., a bimodal distribution)
  • If there is only one mode, the distribution is unimodal
[Figure: three plots of PDF f(x) versus x, with the modes marked]

Measures of Central Tendency (3)
Is the data categorical? Yes: use the mode
  • e.g., most used resource in a system
Is the total of interest? Yes: use the mean
  • e.g., total response time for Web requests
Is the distribution skewed? Yes: use the median
  • The median is less influenced by outliers than the mean
  • No: use the mean. Why?

Common Misuses of Means (1)
Usefulness of the mean depends on the number of observations and the variance
  • E.g., two response time samples: 10 ms and 1000 ms. The mean is 505 ms! Correct index but useless.
Using the mean without regard to skewness

            System A   System B
            10         5
            9          5
            11         5
            10         4
            10         31
Mean:       10         10
Mode:       10         5
Min, Max:   [9, 11]    [4, 31]
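The System A and System B summary rows on this slide can be reproduced with a short Python sketch. The two lists are the observations as read from the slide; the variable names are illustrative.

```python
import statistics

# Observations for the two systems, as read from the slide.
system_a = [10, 9, 11, 10, 10]
system_b = [5, 5, 5, 4, 31]

for name, data in (("A", system_a), ("B", system_b)):
    # Same mean for both systems, but very different mode and range.
    print(
        f"System {name}: mean={statistics.mean(data)}, "
        f"mode={statistics.mode(data)}, min,max=[{min(data)}, {max(data)}]"
    )
```

Both means come out as 10, which is why the mean alone hides the skew in System B.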

Common Misuses of Means (2)
Computing the mean of a product by multiplying means
  • The mean of a product equals the product of the means if the two random variables are independent
  • If x and y are correlated, E(xy) != E(x)E(y)
Example: the avg. number of users in the system is 23; the avg. number of processes/user is 2. What is the avg. number of processes in the system? Is it 46?
  • No! The number of processes spawned by users depends on the load.
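A small Python sketch of why multiplying means fails for correlated quantities. The paired observations below are hypothetical (they are not the 23-user example from the slide); they only assume that processes per user tends to rise with the number of users.

```python
import statistics

# Hypothetical paired observations taken at four points in time.
# Processes per user grows with load, so the two series are correlated.
users = [10, 20, 30, 40]
procs_per_user = [1, 2, 2, 3]

product_of_means = statistics.mean(users) * statistics.mean(procs_per_user)
mean_of_product = statistics.mean([u * p for u, p in zip(users, procs_per_user)])

print("E(x)E(y) =", product_of_means)
print("E(xy)    =", mean_of_product)
```

With these numbers the product of the means is 50, but the mean number of processes actually observed is 57.5.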

Summarizing Variability
Summarizing by a single number is rarely enough
Given two systems with the same mean, we generally prefer the one with less variability
[Figure: two histograms of frequency versus response time, each with mean = 2 s. In the first, 80% of responses take 1.5 s and 20% take 4 s. In the second, 60% take ~0.001 s and 40% take ~5 s.]
Indices of dispersion
  • Range, variance, 10- and 90-percentiles, semi-interquartile range, and mean absolute deviation

Range
Easy to calculate: range = max – min
In many scenarios, not very useful:
  • Min may be zero
  • Max may be an “outlier”
  • With more samples, max may keep increasing and min may keep decreasing → no “stable” point
Range is useful if system performance is bounded

Variance and Standard Deviation
Given a sample of n observations {x1, x2, …, xn}, the sample variance is calculated as:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \text{where } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Sample variance: s^2 (in the square of the unit of observation)
Sample standard deviation: s (in the unit of observation)
Note the (n-1) in the variance computation
  • (n-1) of the n differences are independent
  • Given (n-1) differences, the nth difference can be computed
  • The number of independent terms is the degrees of freedom (df)

Standard Deviation (SD)
Standard deviation and mean have the same units
  • Preferred!
  • E.g. a) Mean = 2 s, SD = 2 s; high variability?
  • E.g. b) Mean = 2 s, SD = 0.2 s; low variability?
Another widely used measure: C.O.V (coefficient of variation)
  • C.O.V = ratio of standard deviation to mean
  • C.O.V does not have any units
  • C.O.V shows the magnitude of variability
  • C.O.V in (a) is 1 and in (b) is 0.1
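The sample variance (with its n-1 divisor), the standard deviation, and the C.O.V can be sketched in Python as follows. The function names and the observations are illustrative assumptions, not part of the slides.

```python
import math

def sample_variance(xs):
    """Sample variance with the (n - 1) divisor, i.e. n - 1 degrees of freedom."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def coefficient_of_variation(xs):
    """Ratio of the sample standard deviation to the sample mean (unitless)."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sample_variance(xs)) / mean

# Hypothetical observations in seconds.
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance = sample_variance(observations)      # in s^2
std_dev = math.sqrt(variance)                 # in s, same unit as the mean
cov = coefficient_of_variation(observations)  # no unit

print(variance, std_dev, cov)
```

Because the C.O.V has no unit, it can be compared across measurements taken on different scales, which is the point of examples (a) and (b) on the slide.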

Percentiles, Quantiles, Quartiles
Lower and upper bounds expressed in percents or as fractions
  • 90-percentile → 0.9-quantile
  • α-quantile: sort and take the [(n-1)α+1]th observation
    • [] means round to nearest integer
Quartiles divide the data into parts at 25%, 50%, 75% → quartiles (Q1, Q2, Q3)
  • 25% of the observations ≤ Q1 (the first quartile)
  • The second quartile Q2 is also the median
The range (Q3 – Q1) is the interquartile range
  • (Q3 – Q1)/2 is the semi-interquartile range (SIQR)
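A Python sketch of the quantile rule on this slide: sort, then take the [(n-1)α+1]th observation with the position rounded to the nearest integer. The function name and the data are hypothetical, and rounding halves upward is an assumption, since the slide does not say how ties are broken.

```python
import math

def quantile(xs, alpha):
    """alpha-quantile: the [(n-1)*alpha + 1]th observation of the sorted data."""
    ordered = sorted(xs)
    n = len(ordered)
    position = (n - 1) * alpha + 1       # 1-based position, possibly fractional
    index = math.floor(position + 0.5)   # nearest integer, halves rounded up (assumption)
    return ordered[index - 1]            # convert to a 0-based list index

# Hypothetical observations.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

q1 = quantile(data, 0.25)
q2 = quantile(data, 0.50)   # second quartile, also the median
q3 = quantile(data, 0.75)
siqr = (q3 - q1) / 2        # semi-interquartile range

print(q1, q2, q3, siqr)
```

Python's built-in `round` uses round-half-to-even, which is why the sketch uses `floor(position + 0.5)` instead.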

Mean Absolute Deviation
Mean absolute deviation is calculated as:
$$ \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| $$
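A Python sketch of the mean absolute deviation, applied to the System A and System B observations from the "Common Misuses of Means (1)" slide. The function name is illustrative.

```python
def mean_absolute_deviation(xs):
    """Average absolute distance of the observations from their mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(abs(x - mean) for x in xs) / n

# Observations from the System A / System B example earlier in the deck.
system_a = [10, 9, 11, 10, 10]
system_b = [5, 5, 5, 4, 31]

mad_a = mean_absolute_deviation(system_a)
mad_b = mean_absolute_deviation(system_b)

print(mad_a, mad_b)
```

Both systems have a mean of 10, but the deviations differ by more than an order of magnitude (about 0.4 versus 8.4).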

Influence of Outliers
Range: considerably
Sample variance: considerably, but less than range
Mean absolute deviation: less than variance
  • Doesn’t square (aka magnify) the outliers
SIQR: very resistant
Use SIQR as the index of dispersion whenever the median is used as the index of central tendency

Outline
Measures of Central Tendency
How to Summarize Variability?
Comparing Systems Using Sample Data
  • Sample vs. Population
  • Confidence Interval for Mean
Comparing Two Alternatives
Transient Removal
Comparing Systems Using Sample Data
The words “sample” and “example” have a common root: “essample” (French)
One sample does not prove a theory; a sample is just an example
The point is that a definite statement cannot be made about the characteristics of all systems
However, probabilistic statements about the range of most systems can be made
The confidence interval concept serves as a building block

Sample versus Population
Generate 1 million random numbers with mean μ and SD σ and put them in an urn
Draw a sample of n observations
  • {x1, x2, …, xn} has mean x̄ and standard deviation s
  • x̄ is likely different from μ!
The population mean is unknown or impossible to obtain in many real-world scenarios
  • Therefore, obtain an estimate of μ from x̄

Confidence Interval for the Mean
Define bounds c1 and c2 such that: Prob{c1 < μ < c2} = 1-α
  • (c1, c2) is the confidence interval
  • α is the significance level
  • 100(1-α) is the confidence level
Typically a small α is desired
  • Confidence level of 90%, 95%, or 99%
One approach: take k samples, find the sample means, sort them, and take the [1+0.05(k-1)]th as c1 and the [1+0.95(k-1)]th as c2 (this gives a 90% confidence interval)

Central Limit Theorem
We do not need many samples. Confidence intervals can be determined from one sample because x̄ ~ N(μ, σ/sqrt(n))
  • The SD of the sample mean, σ/sqrt(n), is called the standard error
Using the CLT, a 100(1-α)% confidence interval for a population mean is
$$ \left( \bar{x} - z_{1-\alpha/2}\, \frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\, \frac{s}{\sqrt{n}} \right) $$
  • $z_{1-\alpha/2}$ is the (1-α/2)-quantile of a unit normal variate (and is obtained from a table!)
  • s is the sample SD

Confidence Interval Example
CPU times obtained by repeating an experiment 32 times. The sorted set consists of {1.9, 2.7, 2.8, 2.8, 2.8, 2.9, 3.1, 3.1, 3.2, 3.2, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 3.9, 4.1, 4.1, 4.2, 4.2, 4.4, 4.5, 4.5, 4.8, 4.9, 5.1, 5.1, 5.3, 5.6, 5.9}
Mean = 3.9, standard deviation (s) = 0.95, n = 32
For a 90% confidence interval, $z_{1-\alpha/2}$ = 1.645, and we get 3.90 ± (1.645)(0.95)/sqrt(32) = (3.62, 4.17)
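The interval in this example can be recomputed from the summary statistics on the slide (mean 3.90, s = 0.95, n = 32). The sketch below is one way to do it in Python; the function name is an assumption, and the normal quantile comes from `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, confidence):
    """CLT-based confidence interval for the mean; intended for n >= 30."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2)-quantile of N(0, 1)
    half_width = z * sd / math.sqrt(n)        # z times the standard error
    return mean - half_width, mean + half_width

# Summary statistics taken from the slide.
low, high = z_confidence_interval(mean=3.90, sd=0.95, n=32, confidence=0.90)

print(round(low, 3), round(high, 3))
```

The lower bound agrees with the slide's 3.62. The upper bound evaluates to about 4.176 here, which the slide reports as 4.17.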

Meaning of Confidence Interval
[Figure: an interval from x̄ - c to x̄ + c around x̄; there is a 90% chance that this interval contains μ]
What does this mean? With 90% confidence, we can say the population mean is within the above bounds; that is, the chance of error is 10%.
  • E.g., take 100 samples and construct a CI for each. In about 10 cases, the interval will not contain the population mean.

Length of Confidence Interval
Let $z_{1-\alpha/2}\, s/\sqrt{n} = c$
Then $z_{1-\alpha/2} = c\sqrt{n}/s$
  • Larger s implies a wider confidence interval
  • Larger n implies a shorter confidence interval
    • → with more observations, we are better able to predict the population mean
    • → the square-root-of-n relationship implies that increasing the number of observations by a factor of 4 only cuts the confidence interval by a factor of 2
The confidence interval computation, as described here, works for n ≥ 30

What if n is not large?
For smaller samples, confidence intervals can be constructed only if the observations come from a normally distributed population
$$ \left( \bar{x} - t_{[1-\alpha/2;\,n-1]}\, \frac{s}{\sqrt{n}},\ \bar{x} + t_{[1-\alpha/2;\,n-1]}\, \frac{s}{\sqrt{n}} \right) $$
  • t[1-α/2;n-1] is the (1-α/2)-quantile of a t-variate with (n-1) degrees of freedom

Testing for a Zero Mean
Check if a measured value is significantly different from zero
  • Determine the confidence interval
  • Then check if zero is inside the interval
Procedure applicable to any other value a
[Figure: confidence intervals for the mean drawn relative to 0, labeled “Mean is zero” and “Mean is nonzero”]

Comparing Two Alternatives
Often interested in comparing systems
  • “naïve” VOD vs. “batching” VOD (assignment 3)
  • “SJF” vs. “FIFO” request scheduling (assignment 1)
Statistical techniques for such comparison:
  • Paired Observations
  • Unpaired Observations (we will omit this!)
  • Approximate Visual Test
Did you use any of these in your assignments?

Paired Observations (1)
n experiments with a one-to-one correspondence between the test on system A and the test on system B
  • No correspondence => unpaired
  • This test uses the zero mean idea…
Treat the two samples as one sample of n pairs
  • For each pair, compute the difference
  • Construct a confidence interval for the difference
  • CI includes zero => systems not significantly different

Paired Observations (2)
Six similar workloads used on two systems: {(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)}. Is one system better?
The performance differences are {-13.7, 13.1, -2.8, -1.1, -3.0, 5.6}
  • Sample mean = -0.32, sample SD = 9.03
  • CI = -0.32 ± t·sqrt(81.62/6) = -0.32 ± t(3.69)
  • The 0.95-quantile of t with 5 DF is 2.015
  • 90% confidence interval = (-7.75, 7.11)
  • Systems not significantly different, as zero is in the CI
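The paired-observation calculation on this slide can be retraced in Python. The data and the t-quantile (2.015 for 5 degrees of freedom) are taken from the slide; the variable names are illustrative, and the quantile is hard-coded rather than looked up from a library.

```python
import math
import statistics

# (system A, system B) measurements for six workloads, from the slide.
pairs = [(5.4, 19.1), (16.6, 3.5), (0.6, 3.4), (1.4, 2.5), (0.6, 3.6), (7.3, 1.7)]

# Treat the two samples as one sample of n differences.
diffs = [a - b for a, b in pairs]
n = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)        # sample SD, (n - 1) divisor

t_quantile = 2.015                       # 0.95-quantile of t with 5 DF, as given on the slide
half_width = t_quantile * sd_diff / math.sqrt(n)
ci = (mean_diff - half_width, mean_diff + half_width)

# Zero-mean test: the systems differ only if the CI excludes zero.
significantly_different = not (ci[0] <= 0 <= ci[1])

print(round(mean_diff, 2), round(sd_diff, 2), ci, significantly_different)
```

The interval spans zero, so at 90% confidence the two systems are not significantly different, matching the slide's conclusion.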

Approximate Visual Test
Compute the confidence interval for the means
If the CIs don’t overlap, one system is better than the other
[Figure: three pairs of confidence intervals with their means marked]
  • CIs do not overlap => alternatives are different
  • CIs overlap and the mean of one is in the CI of the other => not significantly different
  • CIs overlap but the mean of one is not in the CI of the other => need more testing
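The three cases of the visual test can be written as a small decision function. This is a sketch under assumptions: the function name, the tuple representation of an interval, and the example numbers are all hypothetical.

```python
def visual_test(mean_a, ci_a, mean_b, ci_b):
    """Classify two alternatives from their means and confidence intervals."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b

    # Two intervals overlap when each one starts before the other ends.
    overlap = low_a <= high_b and low_b <= high_a
    if not overlap:
        return "different"

    # Overlapping CIs: check whether either mean lies inside the other's CI.
    if low_b <= mean_a <= high_b or low_a <= mean_b <= high_a:
        return "not significantly different"
    return "need more testing"

# Hypothetical means and intervals covering the three cases.
print(visual_test(10, (9, 11), 15, (14, 16)))
print(visual_test(10, (8, 12), 11, (9, 13)))
print(visual_test(10, (8, 12), 13.5, (11.5, 15.5)))
```

When the function returns "need more testing", a more formal comparison is needed, as the slide notes.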
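
Determining Sample Size
Goal: find the smallest sample size n that provides the desired confidence in the results
Method:
  • Take a small set of preliminary measurements
  • Estimate the variance from the measurements
  • Use the estimate to determine the sample size for the desired accuracy
r% accuracy => ±r% at 100(1-α)% confidence
$$ \bar{x} \mp z \frac{s}{\sqrt{n}} = \bar{x} \left( 1 \mp \frac{r}{100} \right) \quad \Rightarrow \quad n = \left( \frac{100\, z\, s}{r\, \bar{x}} \right)^2 $$
where z is the (1-α/2)-quantile of a unit normal variate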
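A Python sketch of the sample-size formula n = (100 z s / (r x̄))^2. The function name and the preliminary measurements (mean 20, SD 5) are hypothetical; the result is rounded up because n must be a whole number of observations.

```python
import math
from statistics import NormalDist

def required_sample_size(mean, sd, accuracy_percent, confidence):
    """Smallest n giving +/- accuracy_percent of the mean at the given confidence."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # (1 - alpha/2)-quantile of N(0, 1)
    n = (100 * z * sd / (accuracy_percent * mean)) ** 2
    return math.ceil(n)

# Hypothetical preliminary estimates: mean 20, SD 5; target 5% accuracy at 95% confidence.
n_needed = required_sample_size(mean=20, sd=5, accuracy_percent=5, confidence=0.95)

print(n_needed)
```

Because n grows with the square of 1/r, halving the allowed error roughly quadruples the number of observations required.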

Transient Removal
In many simulations, we are interested in steady state performance
  • Remove the initial transient state
However, defining exactly what constitutes the end of the transient state is difficult!
Several heuristics have been developed:
  • Long runs
  • Proper initialization
  • Truncation
  • Initial data deletion
  • Moving average of replications
  • Batch means

Long Runs
Use very long runs
  • Impact of the transient state becomes negligible
  • Wasteful use of resources
  • How long is “long enough”?
Raj Jain text recommends that this method not be used in isolation
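
Batch Means
Run the simulation for a long duration
Divide the observations (N) into m batches, each of size n
Compute the variance of the batch means using the procedure shown, for n = 2, 3, 4, 5, …
  1) Compute the batch means:
$$ \bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}, \quad i = 1, 2, \ldots, m $$
  2) Compute the overall mean:
$$ \bar{\bar{x}} = \frac{1}{m} \sum_{i=1}^{m} \bar{x}_i $$
  3) Compute the variance of the batch means:
$$ \mathrm{Var}(\bar{x}) = \frac{1}{m-1} \sum_{i=1}^{m} (\bar{x}_i - \bar{\bar{x}})^2 $$
Plot variance vs. batch size
[Figure: variance of batch means versus batch size n, with annotations “Ignore” and “Transient interval”]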
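The three-step batch means procedure can be sketched in Python as follows. The function name is illustrative and the observations are synthetic (a decaying start-up bias plus a small deterministic ripple), so the resulting curve demonstrates only the mechanics, not a real simulation.

```python
import statistics

def batch_mean_variance(observations, batch_size):
    """Variance of the batch means for one batch size n."""
    m = len(observations) // batch_size   # number of complete batches
    # Step 1: mean of each batch of n consecutive observations.
    batch_means = [
        statistics.mean(observations[i * batch_size:(i + 1) * batch_size])
        for i in range(m)
    ]
    # Steps 2 and 3: statistics.variance subtracts the overall mean of the
    # batch means and divides by (m - 1).
    return statistics.variance(batch_means)

# Synthetic run: start-up bias decaying toward a steady state around 10 (hypothetical).
observations = [10 + 20 * 0.9 ** j + ((j * 7) % 5 - 2) for j in range(400)]

# Variance of batch means for n = 2, 3, ..., 40; plot these pairs to get the curve.
curve = {n: batch_mean_variance(observations, n) for n in range(2, 41)}

for n in (2, 5, 10, 20, 40):
    print(n, round(curve[n], 3))
```

Plotting `curve` against the batch size gives the variance-versus-batch-size plot described on this slide.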