Chap 10: Summarizing Data
10.1: Intro
Univariate/multivariate data (random samples or batches) can be described using procedures that reveal their structure via graphical displays (empirical CDFs, histograms, ...), which are to data what PMFs and PDFs are to random variables. Numerical summaries (location and spread measures), the effects of outliers on these measures, and graphical summaries (boxplots) will be investigated.
10.2: CDF based methods
10.2.1: The Empirical CDF (ecdf)
ECDF is the data analogue of the CDF of a random variable.
The ECDF is a graphical display that conveniently summarizes data sets.
For a batch of numbers $x_1, \dots, x_n$:
$$F_n(x) = \frac{\#(x_i \le x)}{n}.$$
For a random sample $X_1, \dots, X_n$:
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i),
\qquad \text{where } I_A(t) = \begin{cases} 1, & t \in A \\ 0, & t \notin A \end{cases}$$
is an indicator function.
The Empirical CDF (cont’d)
The random variables $I_{(-\infty,x]}(X_i)$ are independent Bernoulli random variables:
$$I_{(-\infty,x]}(X_i) = \begin{cases} 1 & \text{with probability } F(x) \\ 0 & \text{with probability } 1 - F(x). \end{cases}$$
Thus
$$n F_n(x) = \sum_{i=1}^{n} I_{(-\infty,x]}(X_i) \sim \mathrm{Bin}\bigl(n, F(x)\bigr),$$
so that
$$E[F_n(x)] = F(x) \quad \text{and} \quad \mathrm{Var}[F_n(x)] = \frac{F(x)\bigl(1 - F(x)\bigr)}{n}.$$
Hence $F_n$ is an unbiased estimate of $F$; $\lim_{n\to\infty} \mathrm{Var}[F_n(x)] = 0$, and $F_n$ has its maximum variance at the median of $F$ (where $F(x) = 1/2$).
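The ECDF is straightforward to compute. A minimal Python sketch (the function and variable names are mine, not from the slides):

```python
import bisect

def ecdf(data):
    """Return F_n with F_n(x) = #(x_i <= x) / n for the batch `data`."""
    xs = sorted(data)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x
        return bisect.bisect_right(xs, x) / n

    return F_n

F = ecdf([3, 1, 4, 1, 5])
F(1), F(4), F(10)   # -> (0.4, 0.8, 1.0)
```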
10.2.2: The Survival Function
In medical or reliability studies, data sometimes consist of times of failure or death; it then becomes more convenient to use the survival function rather than the CDF. The sample survival function (ESF) gives the proportion of the data greater than t:
$$S_n(t) = 1 - F_n(t).$$
Its population analogue is $S(t) = 1 - F(t)$, where $T$ is a random variable with CDF $F$. Survival plots (plots of the ESF) may be used to provide information about the hazard function, which may be thought of as the instantaneous rate of mortality for an individual alive at time t, and is defined to be:
$$h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log\bigl(1 - F(t)\bigr) = -\frac{d}{dt}\log S(t).$$
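The ESF is just one minus the ECDF; a minimal sketch (names are illustrative):

```python
def esf(data):
    """Sample survival function S_n(t) = 1 - F_n(t): the proportion of
    the data strictly greater than t."""
    n = len(data)

    def S_n(t):
        return sum(1 for x in data if x > t) / n

    return S_n

S = esf([2, 5, 9, 14])
S(0), S(5), S(14)   # -> (1.0, 0.5, 0.0)
```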
The Survival Function (cont’d)
From page 149, the delta method to first order gives
$$\mathrm{Var}[g(X)] \approx \mathrm{Var}(X)\,\bigl[g'(\mu_X)\bigr]^2,$$
so
$$\mathrm{Var}\bigl[\log\bigl(1 - F_n(t)\bigr)\bigr] \approx \frac{\mathrm{Var}[F_n(t)]}{\bigl[1 - F(t)\bigr]^2} = \frac{F(t)}{n\,\bigl[1 - F(t)\bigr]},$$
which expresses how extremely unreliable (huge variance for large values of t) the empirical log-survival function is.
10.2.3: Q-Q Plots (quantile-quantile plots)
A Q-Q plot is useful for comparing CDFs: it plots the quantiles of one dist'n versus the quantiles of the other.
Additive treatment effect (additive model): $X \sim F$ vs $Y \sim G$ with $Y = X + h$, so $G(y) = F(y - h)$ and $y_p = x_p + h$;
the Q-Q plot is a straight line with slope 1 and y-intercept h.
Multiplicative treatment effect (multiplicative model): $X \sim F$ vs $Y \sim G$ with $Y = cX$ for $c > 0$, so $G(y) = F(y/c)$ and $y_p = c\,x_p$;
the Q-Q plot is a straight line with slope c and y-intercept 0.
10.3: Histograms, Density Curves & Stem-and-Leaf Plots
Kernel PDF estimate: let $X_1, \dots, X_n \overset{iid}{\sim} f$, where the PDF $f$ is to be estimated. Let $w(x)$ be a smooth weight function and set
$$w_h(x) = \frac{1}{h}\, w\!\left(\frac{x}{h}\right).$$
Then the kernel PDF estimate of $f$ is
$$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i).$$
Here h is a bandwidth parameter that controls the smoothness of $\hat f_h$, analogous to the bin width of a histogram. Choose a reasonable h: not too big, not too small.
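A minimal sketch of the kernel estimate above, using the standard normal density as one common choice of smooth weight function (the names are mine):

```python
import math

def kernel_density(data, h):
    """f_h(x) = (1/n) * sum_i w_h(x - X_i), with w the standard normal
    density as the smooth weight function and w_h(u) = w(u/h)/h."""
    n = len(data)

    def w(u):
        # standard normal weight (one common choice)
        return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

    def f_h(x):
        return sum(w((x - xi) / h) for xi in data) / (n * h)

    return f_h

f = kernel_density([1.0, 2.0, 2.5, 3.0], h=0.5)
```

Since each rescaled weight $w_h$ integrates to 1, $\hat f_h$ is itself a genuine density.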
10.4: Location Measures
10.4.1: The Arithmetic Mean $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to outliers (not robust).
10.4.2: The Median $\tilde x$ is a robust measure of location.
10.4.3: The Trimmed Mean is another robust location measure ($0.1 \le \alpha \le 0.2$ highly recommended):
Step 1: order the data set;
Step 2: discard the lowest $100\alpha\%$ and the highest $100\alpha\%$ of the observations;
Step 3: take the arithmetic mean of the remaining data;
Step 4: $\bar x_\alpha = \dfrac{x_{([n\alpha]+1)} + \cdots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}$ is the $\alpha$-trimmed mean.
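The four steps above can be sketched in a few lines of Python (function name is mine):

```python
import math

def trimmed_mean(data, alpha):
    """alpha-trimmed mean: order the data, drop the lowest and highest
    100*alpha% of observations, and average what remains."""
    xs = sorted(data)
    k = math.floor(len(xs) * alpha)   # [n*alpha] observations per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

trimmed_mean([1, 2, 3, 4, 100], alpha=0.2)   # drops 1 and 100 -> 3.0
```

Note how the single outlier 100 drags the mean to 22 but leaves the 0.2-trimmed mean at 3.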
Location Measures (cont’d)
The trimmed mean (discard only a certain number of the observations) is introduced as a natural compromise between the mean (discard no observations) and the median (discard all but one or two observations).
Another compromise between $\bar x$ and $\tilde x$ was proposed by Huber (1981), who suggested minimizing
$$\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right) \quad \text{with respect to } \mu, \text{ where } \sigma \text{ is to be given,}$$
or solving (its solution will be called an M-estimate)
$$\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0, \qquad \text{where } \psi = \Psi'.$$
10.4.4: M-Estimates (Huber, 1964)
Huber suggested
$$\Psi(x) = \begin{cases} \dfrac{1}{2}x^2, & |x| \le k \\[4pt] k|x| - \dfrac{1}{2}k^2, & |x| > k, \end{cases}
\qquad
\psi(x) = \Psi'(x) = \begin{cases} x, & |x| \le k \\ k\,\mathrm{sgn}(x), & |x| > k. \end{cases}$$
Note:
$\Psi(x) = \frac{1}{2}x^2$ corresponds to the mean, $\hat\mu = \bar x$; $\Psi(x) = |x|$ (i.e., $\psi(x) = \mathrm{sgn}(x)$) corresponds to the median, $\hat\mu = \tilde x$.
Huber's $\Psi(x)$ is proportional to $x^2$ inside $[-k, k]$ and replaces the parabolic arcs by straight lines outside $[-k, k]$.
Big k comes closer to the mean $\bar x$; small k comes closer to the median $\tilde x$ ($k \to \infty$ corresponds to $\bar x$, $k \to 0$ corresponds to $\tilde x$).
The cutoff k protects against outliers (observations more than k away from the center), and a moderate k is suggested as a compromise.
10.4.4: M-Estimates (cont’d)
M-estimates coincide with MLEs, because if $\Psi(x) = -\log f(x)$ with $X_1, \dots, X_n \overset{iid}{\sim} f$, then
$$\text{minimizing } \sum_{i=1}^{n} \Psi(X_i - \mu) \text{ w.r.t. } \mu
\iff
\text{maximizing } \sum_{i=1}^{n} \log f(X_i - \mu) \text{ w.r.t. } \mu.$$
The computation of an M-estimate is a nonlinear minimization problem that must be solved using an iterative method (such as Newton-Raphson, ...). Such a minimizer is unique for convex functions $\Psi$. Here we assume that $\sigma$ is known; but in practice, a robust estimate of $\sigma$ (to be seen in Section 10.5) should be used instead.
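One standard way to carry out the iteration is iteratively reweighted averaging: Huber's ψ gives weights $w_i = \psi(u_i)/u_i = \min(1, k/|u_i|)$, and the update is a weighted mean. A minimal sketch, assuming the scale σ is given as the slides require; the default tuning constant k = 1.345 is a conventional choice, not from the slides:

```python
def huber_m_estimate(data, k=1.345, scale=1.0, tol=1e-8, max_iter=100):
    """Solve sum_i psi((x_i - mu)/scale) = 0 for mu, where psi(u) = u for
    |u| <= k and k*sgn(u) otherwise, via iteratively reweighted averaging
    with weights w_i = min(1, k/|u_i|)."""
    mu = sorted(data)[len(data) // 2]   # start at (a) sample median
    for _ in range(max_iter):
        weights = []
        for x in data:
            u = (x - mu) / scale
            weights.append(1.0 if abs(u) <= k else k / abs(u))
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

mu_hat = huber_m_estimate([1, 2, 3, 4, 100])   # the outlier barely moves it
```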
10.4.5: Comparison of Location Estimates
Among the location estimates introduced in this section, which one is the best? Not easy!
For a symmetric underlying dist'n, all four statistics (sample mean, sample median, α-trimmed mean, and M-estimate) estimate the center of symmetry.
For a non-symmetric underlying dist'n, these four statistics estimate four different pop'n parameters, namely the pop'n mean, the pop'n median, the pop'n trimmed mean, and a functional of the CDF determined by the weight function Ψ.
Idea: Run some simulations; compute more than one estimate of location and pick the winner.
10.4.6: Estimating Variability of Location Estimates by the Bootstrap
Using a computer, we can generate (simulate) a large number B of samples of size n from a common known dist'n F. From each sample, we compute the value $\hat\theta^*_b$ of the location estimate $\hat\theta$.
The empirical dist'n of the resulting values $\hat\theta^*_1, \hat\theta^*_2, \dots, \hat\theta^*_B$ is a good approximation (for large B) to the dist'n function of $\hat\theta$. Unfortunately, F is NOT known in general. Just plug in the empirical CDF $F_n$ for F and bootstrap (= resample from $F_n$).
$F_n$ is a discrete PMF with the same probability $1/n$ for each observed value $x_1, x_2, \dots, x_n$.
10.4.6: Bootstrap (cont’d)
A sample of size n from $F_n$ is a sample of size n drawn with replacement from the observed data $x_1, \dots, x_n$ that produced $F_n$.
Thus,
$$s_{\hat\theta} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\bigl(\hat\theta^*_b - \bar\theta^*\bigr)^2},
\qquad \text{where } \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b
\text{ is the mean of the } \hat\theta^*_b \ (b = 1, 2, \dots, B).$$
Read Example A on page 368.
The bootstrap dist'n can be used to form an approximate CI and to test hypotheses.
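The resample-and-recompute loop is short in code. A minimal sketch of $s_{\hat\theta}$ (all names, the seed, and B = 1000 are illustrative choices):

```python
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=1):
    """Draw B samples of size n with replacement from the observed data
    (i.e., from F_n), compute the estimate on each, and take the SD of
    the B replicates (pstdev divides by B, matching s_theta-hat)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [estimator([rng.choice(data) for _ in range(n)])
                  for _ in range(B)]
    return statistics.pstdev(replicates)

data = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062]
se_median = bootstrap_se(data, statistics.median)
```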
10.5: Measures of Dispersion
A measure of dispersion (scale) gives a numerical indication of the "scatteredness" of a batch of numbers. The most common measure of dispersion is the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(X_i - \bar X\bigr)^2}.$$
Like the sample mean, the sample standard deviation is NOT robust (sensitive to outliers).
Two simple robust measures of dispersion are the IQR (interquartile range) and the MAD (median absolute deviation from the median).
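Both robust measures are one-liners; a minimal sketch (function names are mine):

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # the three quartiles
    return q3 - q1

def mad(data):
    """Median absolute deviation from the median."""
    m = statistics.median(data)
    return statistics.median(abs(x - m) for x in data)

mad([1, 2, 3, 4, 100])   # deviations from 3 are 2,1,0,1,97 -> MAD = 1
```

Note that the outlier 100 leaves the MAD at 1, whereas it inflates the sample SD enormously.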
10.6: Box Plots
Tukey invented a graphical display (the boxplot) that indicates the center of a data set (the median), the spread of the data (the IQR), and the presence of possible outliers.
A boxplot also gives an indication of the symmetry/asymmetry (skewness) of the dist'n of the data values.
Later, we will see how boxplots can be effectively used to compare batches of numbers.
10.7: Conclusion
Several graphical tools were introduced in this chapter as methods of presenting and summarizing data. Some aspects of the sampling dist'ns of these summaries (assuming a stochastic model for the data) were discussed.
Bootstrap methods (approximating a sampling dist’n and functionals) were also revisited.
Parametric Bootstrap
Example: Estimating a population mean
It is known that explosives used in mining leave a crater that is circular in shape with a diameter that follows an exponential dist’n . Suppose a new form of explosive is tested. The sample crater diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062
211 269 115 586 983 115 162 70 565 114
Here the exponential CDF is $F(x) = 1 - e^{-x/\theta}$, $x \ge 0$, with $\bar x = 514.15$ (sample mean) and $s = 685.60$ (sample SD). It would be inappropriate to use
$$\bar x \pm t_{0.95}\,\frac{s}{\sqrt{n}} = (249.07,\ 779.23)$$
as a 90% CI for the pop'n mean via the t-curve (df = 19),
Parametric Bootstrap (cont'd)
...because such a CI is based on the normality assumption for the parent pop'n. The parametric bootstrap replaces the exponential pop'n dist'n F with unknown mean by the known exponential dist'n F* with mean $\bar x = 514.15$.
Then resamples of size n = 20 are drawn from this surrogate pop'n. Using Minitab, we can generate B = 1000 such samples of size n = 20 and compute the sample mean of each of these B samples. A bootstrap CI can be obtained by trimming off 5% from each tail. Thus, a parametric bootstrap 90% CI is given by
(50th smallest = 332.51, 951st smallest = 726.45).
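The same procedure can be sketched in Python instead of Minitab. The seed is an arbitrary choice of mine, so the resulting percentiles will be close to, but not exactly, the (332.51, 726.45) reported above:

```python
import random
import statistics

# Crater diameters (cm) from the example
data = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
        211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

xbar = statistics.mean(data)   # 514.15 -> mean of the surrogate pop'n F*
rng = random.Random(7)         # illustrative seed
B, n = 1000, len(data)

# Draw B = 1000 samples of size n = 20 from Exp(mean = xbar) and record
# each sample mean; expovariate takes the rate 1/mean.
boot_means = sorted(
    statistics.mean(rng.expovariate(1 / xbar) for _ in range(n))
    for _ in range(B)
)

# Trim 5% off each tail: (50th smallest, 951st smallest)
ci = (boot_means[49], boot_means[950])
```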
Non-Parametric Bootstrap
If we do not assume that we are sampling from a normal pop'n, or from a pop'n of some other specified shape, then we must extract all the information about the pop'n from the sample itself.
Nonparametric bootstrapping bootstraps a sampling dist'n for our estimate by drawing samples with replacement from our original (raw) data.
Thus, a nonparametric bootstrap 90% CI for the pop'n mean is obtained by taking the 5th and 95th percentiles of the resampled means $\bar x^*$ among these B resamples.
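The nonparametric version differs from the parametric sketch only in where the resamples come from: the raw data itself rather than a fitted exponential. A minimal sketch (names, seed, and B are illustrative):

```python
import random
import statistics

def nonparam_boot_ci(data, B=1000, level=0.90, seed=3):
    """Percentile CI for the pop'n mean: resample with replacement from
    the raw data, compute each resample's mean, and take the 5th and
    95th percentiles of the B bootstrap means (for level = 0.90)."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(
        statistics.mean(rng.choice(data) for _ in range(n)) for _ in range(B)
    )
    lo = int(B * (1 - level) / 2)   # values trimmed from each tail
    return means[lo], means[B - lo - 1]

crater = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
          211, 269, 115, 586, 983, 115, 162, 70, 565, 114]
ci = nonparam_boot_ci(crater)
```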