Chap 10: Summarizing Data
10.1: Intro
Univariate/multivariate data (random samples or batches) can be described using procedures that reveal their structure via graphical displays (empirical CDFs, histograms, ...), which are to data what PMFs and PDFs are to random variables. Numerical summaries (location and spread measures), the effects of outliers on these measures, and graphical summaries (boxplots) will be investigated.
10.2: CDF based methods
10.2.1: The Empirical CDF (ecdf)
ECDF is the data analogue of the CDF of a random variable.
The ECDF is a graphical display that conveniently summarizes data sets.
For a batch of numbers $x_1, \dots, x_n$:
$$F_n(x) = \frac{\#(x_i \le x)}{n}.$$
For a random sample $X_1, \dots, X_n$:
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i),
\qquad \text{where } I_A(t) = \begin{cases} 1, & t \in A \\ 0, & t \notin A \end{cases}$$
is an indicator function.
The Empirical CDF (cont’d)
The random variables $I_{(-\infty,x]}(X_i)$ are independent Bernoulli random variables:
$$I_{(-\infty,x]}(X_i) = \begin{cases} 1 & \text{with probability } F(x) \\ 0 & \text{with probability } 1 - F(x). \end{cases}$$
Thus
$$n F_n(x) = \sum_{i=1}^{n} I_{(-\infty,x]}(X_i) \sim \mathrm{Bin}\bigl(n, F(x)\bigr),$$
so that
$$E[F_n(x)] = F(x) \quad \text{and} \quad \mathrm{Var}[F_n(x)] = \frac{F(x)\bigl(1 - F(x)\bigr)}{n}.$$
Hence $F_n$ is an unbiased estimate of $F$; $\lim_{n\to\infty} \mathrm{Var}[F_n(x)] = 0$, and $F_n$ has its maximum variance at the median of $F$ (where $F(x) = 1/2$).
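The ECDF is straightforward to compute. A minimal Python sketch (the function and variable names are mine, not from the slides):

```python
import bisect

def ecdf(data):
    """Return F_n with F_n(x) = #(x_i <= x) / n for the batch `data`."""
    xs = sorted(data)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x
        return bisect.bisect_right(xs, x) / n

    return F_n

F = ecdf([3, 1, 4, 1, 5])
F(1), F(4), F(10)   # -> (0.4, 0.8, 1.0)
```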
10.2.2: The Survival Function
In medical or reliability studies, data sometimes consist of times of failure or death; it then becomes more convenient to use the survival function rather than the CDF. The sample survival function (ESF) gives the proportion of the data greater than t:
$$S_n(t) = 1 - F_n(t).$$
Its population analogue is $S(t) = 1 - F(t)$, where $T$ is a random variable with CDF $F$. Survival plots (plots of the ESF) may be used to provide information about the hazard function, which may be thought of as the instantaneous rate of mortality for an individual alive at time t, and is defined to be:
$$h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log\bigl(1 - F(t)\bigr) = -\frac{d}{dt}\log S(t).$$
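The ESF is just one minus the ECDF; a minimal sketch (names are illustrative):

```python
def esf(data):
    """Sample survival function S_n(t) = 1 - F_n(t): the proportion of
    the data strictly greater than t."""
    n = len(data)

    def S_n(t):
        return sum(1 for x in data if x > t) / n

    return S_n

S = esf([2, 5, 9, 14])
S(0), S(5), S(14)   # -> (1.0, 0.5, 0.0)
```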
The Survival Function (cont’d)
From page 149, the delta method to first order gives
$$\mathrm{Var}[g(X)] \approx \mathrm{Var}(X)\,\bigl[g'(\mu_X)\bigr]^2,$$
so
$$\mathrm{Var}\bigl[\log\bigl(1 - F_n(t)\bigr)\bigr] \approx \frac{\mathrm{Var}[F_n(t)]}{\bigl[1 - F(t)\bigr]^2} = \frac{F(t)}{n\,\bigl[1 - F(t)\bigr]},$$
which expresses how extremely unreliable (huge variance for large values of t) the empirical log-survival function is.
10.2.3: Q-Q Plots (quantile-quantile plots)
A Q-Q plot is useful for comparing CDFs: it plots the quantiles of one dist'n versus the quantiles of the other.
Additive treatment effect (additive model): $X \sim F$ vs $Y \sim G$ with $Y = X + h$, so $G(y) = F(y - h)$ and $y_p = x_p + h$;
the Q-Q plot is a straight line with slope 1 and y-intercept h.
Multiplicative treatment effect (multiplicative model): $X \sim F$ vs $Y \sim G$ with $Y = cX$ for $c > 0$, so $G(y) = F(y/c)$ and $y_p = c\,x_p$;
the Q-Q plot is a straight line with slope c and y-intercept 0.
10.3: Histograms, Density Curves & Stem-and-Leaf Plots
Kernel PDF estimate: let $X_1, \dots, X_n \overset{iid}{\sim} f$, where the PDF $f$ is to be estimated. Let $w(x)$ be a smooth weight function and set
$$w_h(x) = \frac{1}{h}\, w\!\left(\frac{x}{h}\right).$$
Then the kernel PDF estimate of $f$ is
$$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i).$$
Here h is a bandwidth parameter that controls the smoothness of $\hat f_h$, analogous to the bin width of a histogram. Choose a reasonable h: not too big, not too small.
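A minimal sketch of the kernel estimate above, using the standard normal density as one common choice of smooth weight function (the names are mine):

```python
import math

def kernel_density(data, h):
    """f_h(x) = (1/n) * sum_i w_h(x - X_i), with w the standard normal
    density as the smooth weight function and w_h(u) = w(u/h)/h."""
    n = len(data)

    def w(u):
        # standard normal weight (one common choice)
        return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

    def f_h(x):
        return sum(w((x - xi) / h) for xi in data) / (n * h)

    return f_h

f = kernel_density([1.0, 2.0, 2.5, 3.0], h=0.5)
```

Since each rescaled weight $w_h$ integrates to 1, $\hat f_h$ is itself a genuine density.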
10.4: Location Measures
10.4.1: The Arithmetic Mean $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to outliers (not robust).
10.4.2: The Median $\tilde x$ is a robust measure of location.
10.4.3: The Trimmed Mean is another robust location measure ($0.1 \le \alpha \le 0.2$ highly recommended):
Step 1: order the data set;
Step 2: discard the lowest $100\alpha\%$ and the highest $100\alpha\%$ of the observations;
Step 3: take the arithmetic mean of the remaining data;
Step 4: $\bar x_\alpha = \dfrac{x_{([n\alpha]+1)} + \cdots + x_{(n-[n\alpha])}}{n - 2[n\alpha]}$ is the $\alpha$-trimmed mean.
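The four steps above can be sketched in a few lines of Python (function name is mine):

```python
import math

def trimmed_mean(data, alpha):
    """alpha-trimmed mean: order the data, drop the lowest and highest
    100*alpha% of observations, and average what remains."""
    xs = sorted(data)
    k = math.floor(len(xs) * alpha)   # [n*alpha] observations per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

trimmed_mean([1, 2, 3, 4, 100], alpha=0.2)   # drops 1 and 100 -> 3.0
```

Note how the single outlier 100 drags the mean to 22 but leaves the 0.2-trimmed mean at 3.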
Location Measures (cont’d)
The trimmed mean (discard only a certain number of the observations) is introduced as a natural compromise between the mean (discard no observations) and the median (discard all but one or two observations).
Another compromise between $\bar x$ and $\tilde x$ was proposed by Huber (1981), who suggested minimizing
$$\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right) \quad \text{with respect to } \mu, \text{ where } \sigma \text{ is to be given,}$$
or solving (its solution will be called an M-estimate)
$$\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0, \qquad \text{where } \psi = \Psi'.$$
10.4.4: M-Estimates (Huber, 1964)
Huber suggested
$$\Psi(x) = \begin{cases} \dfrac{1}{2}x^2, & |x| \le k \\[4pt] k|x| - \dfrac{1}{2}k^2, & |x| > k, \end{cases}
\qquad
\psi(x) = \Psi'(x) = \begin{cases} x, & |x| \le k \\ k\,\mathrm{sgn}(x), & |x| > k. \end{cases}$$
Note:
$\Psi(x) = \frac{1}{2}x^2$ corresponds to the mean, $\hat\mu = \bar x$; $\Psi(x) = |x|$ (i.e., $\psi(x) = \mathrm{sgn}(x)$) corresponds to the median, $\hat\mu = \tilde x$.
Huber's $\Psi(x)$ is proportional to $x^2$ inside $[-k, k]$ and replaces the parabolic arcs by straight lines outside $[-k, k]$.
Big k comes closer to the mean $\bar x$; small k comes closer to the median $\tilde x$ ($k \to \infty$ corresponds to $\bar x$, $k \to 0$ corresponds to $\tilde x$).
The cutoff k protects against outliers (observations more than k away from the center), and a moderate k is suggested as a compromise.
10.4.4: M-Estimates (cont’d)
M-estimates coincide with MLEs, because if $\Psi(x) = -\log f(x)$ with $X_1, \dots, X_n \overset{iid}{\sim} f$, then
$$\text{minimizing } \sum_{i=1}^{n} \Psi(X_i - \mu) \text{ w.r.t. } \mu
\iff
\text{maximizing } \sum_{i=1}^{n} \log f(X_i - \mu) \text{ w.r.t. } \mu.$$
The computation of an M-estimate is a nonlinear minimization problem that must be solved using an iterative method (such as Newton-Raphson, ...). Such a minimizer is unique for convex functions $\Psi$. Here we assume that $\sigma$ is known; but in practice, a robust estimate of $\sigma$ (to be seen in Section 10.5) should be used instead.
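One standard way to carry out the iteration is iteratively reweighted averaging: Huber's ψ gives weights $w_i = \psi(u_i)/u_i = \min(1, k/|u_i|)$, and the update is a weighted mean. A minimal sketch, assuming the scale σ is given as the slides require; the default tuning constant k = 1.345 is a conventional choice, not from the slides:

```python
def huber_m_estimate(data, k=1.345, scale=1.0, tol=1e-8, max_iter=100):
    """Solve sum_i psi((x_i - mu)/scale) = 0 for mu, where psi(u) = u for
    |u| <= k and k*sgn(u) otherwise, via iteratively reweighted averaging
    with weights w_i = min(1, k/|u_i|)."""
    mu = sorted(data)[len(data) // 2]   # start at (a) sample median
    for _ in range(max_iter):
        weights = []
        for x in data:
            u = (x - mu) / scale
            weights.append(1.0 if abs(u) <= k else k / abs(u))
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

mu_hat = huber_m_estimate([1, 2, 3, 4, 100])   # the outlier barely moves it
```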
10.4.5: Comparison of Location Estimates
Among the location estimates introduced in this section, which one is the best? Not easy!
For a symmetric underlying dist'n, all four statistics (sample mean, sample median, α-trimmed mean, and M-estimate) estimate the center of symmetry.
For a non-symmetric underlying dist'n, these four statistics estimate four different pop'n parameters, namely the pop'n mean, the pop'n median, the pop'n trimmed mean, and a functional of the CDF determined by the weight function Ψ.
Idea: Run some simulations; compute more than one estimate of location and pick the winner.
10.4.6: Estimating Variability of Location Estimates by the Bootstrap
Using a computer, we can generate (simulate) a large number B of samples of size n from a common known dist'n F. From each sample, we compute the value $\hat\theta^*_b$ of the location estimate $\hat\theta$.
The empirical dist'n of the resulting values $\hat\theta^*_1, \hat\theta^*_2, \dots, \hat\theta^*_B$ is a good approximation (for large B) to the dist'n function of $\hat\theta$. Unfortunately, F is NOT known in general. Just plug in the empirical CDF $F_n$ for F and bootstrap (= resample from $F_n$).
$F_n$ is a discrete PMF with the same probability $1/n$ for each observed value $x_1, x_2, \dots, x_n$.
10.4.6: Bootstrap (cont’d)
A sample of size n from $F_n$ is a sample of size n drawn with replacement from the observed data $x_1, \dots, x_n$ that produced $F_n$.
Thus,
$$s_{\hat\theta} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\bigl(\hat\theta^*_b - \bar\theta^*\bigr)^2},
\qquad \text{where } \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b
\text{ is the mean of the } \hat\theta^*_b \ (b = 1, 2, \dots, B).$$
Read Example A on page 368.
The bootstrap dist'n can be used to form an approximate CI and to test hypotheses.
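The resample-and-recompute loop is short in code. A minimal sketch of $s_{\hat\theta}$ (all names, the seed, and B = 1000 are illustrative choices):

```python
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=1):
    """Draw B samples of size n with replacement from the observed data
    (i.e., from F_n), compute the estimate on each, and take the SD of
    the B replicates (pstdev divides by B, matching s_theta-hat)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [estimator([rng.choice(data) for _ in range(n)])
                  for _ in range(B)]
    return statistics.pstdev(replicates)

data = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062]
se_median = bootstrap_se(data, statistics.median)
```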
10.5: Measures of Dispersion
A measure of dispersion (scale) gives a numerical indication of the "scatteredness" of a batch of numbers. The most common measure of dispersion is the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(X_i - \bar X\bigr)^2}.$$
Like the sample mean, the sample standard deviation is NOT robust (sensitive to outliers).
Two simple robust measures of dispersion are the IQR (interquartile range) and the MAD (median absolute deviation from the median).
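Both robust measures are one-liners; a minimal sketch (function names are mine):

```python
import statistics

def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # the three quartiles
    return q3 - q1

def mad(data):
    """Median absolute deviation from the median."""
    m = statistics.median(data)
    return statistics.median(abs(x - m) for x in data)

mad([1, 2, 3, 4, 100])   # deviations from 3 are 2,1,0,1,97 -> MAD = 1
```

Note that the outlier 100 leaves the MAD at 1, whereas it inflates the sample SD enormously.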
10.6: Box Plots
Tukey invented a graphical display (the boxplot) that indicates the center of a data set (the median), the spread of the data (the IQR), and the presence of possible outliers.
A boxplot also gives an indication of the symmetry/asymmetry (skewness) of the dist'n of the data values.
Later, we will see how boxplots can be effectively used to compare batches of numbers.
10.7: Conclusion
Several graphical tools were introduced in this chapter as methods of presenting and summarizing data. Some aspects of the sampling dist'ns of these summaries (assuming a stochastic model for the data) were discussed.
Bootstrap methods (approximating a sampling dist’n and functionals) were also revisited.
Parametric Bootstrap
Example: Estimating a population mean
It is known that explosives used in mining leave a crater that is circular in shape with a diameter that follows an exponential dist’n . Suppose a new form of explosive is tested. The sample crater diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062
211 269 115 586 983 115 162 70 565 114
Here the exponential CDF is $F(x) = 1 - e^{-x/\theta}$, $x \ge 0$, with $\bar x = 514.15$ (sample mean) and $s = 685.60$ (sample SD). It would be inappropriate to use
$$\bar x \pm t_{0.95}\,\frac{s}{\sqrt{n}} = (249.07,\ 779.23)$$
as a 90% CI for the pop'n mean via the t-curve (df = 19),
Parametric Bootstrap (cont'd)
...because such a CI is based on the normality assumption for the parent pop'n. The parametric bootstrap replaces the exponential pop'n dist'n F with unknown mean by the known exponential dist'n F* with mean $\bar x = 514.15$.
Then resamples of size n = 20 are drawn from this surrogate pop'n. Using Minitab, we can generate B = 1000 such samples of size n = 20 and compute the sample mean of each of these B samples. A bootstrap CI can be obtained by trimming off 5% from each tail. Thus, a parametric bootstrap 90% CI is given by
(50th smallest = 332.51, 951st smallest = 726.45).
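The same procedure can be sketched in Python instead of Minitab. The seed is an arbitrary choice of mine, so the resulting percentiles will be close to, but not exactly, the (332.51, 726.45) reported above:

```python
import random
import statistics

# Crater diameters (cm) from the example
data = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
        211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

xbar = statistics.mean(data)   # 514.15 -> mean of the surrogate pop'n F*
rng = random.Random(7)         # illustrative seed
B, n = 1000, len(data)

# Draw B = 1000 samples of size n = 20 from Exp(mean = xbar) and record
# each sample mean; expovariate takes the rate 1/mean.
boot_means = sorted(
    statistics.mean(rng.expovariate(1 / xbar) for _ in range(n))
    for _ in range(B)
)

# Trim 5% off each tail: (50th smallest, 951st smallest)
ci = (boot_means[49], boot_means[950])
```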
Non-Parametric Bootstrap
If we do not assume that we are sampling from a normal pop'n, or from a pop'n of some other specified shape, then we must extract all the information about the pop'n from the sample itself.
Nonparametric bootstrapping bootstraps a sampling dist'n for our estimate by drawing samples with replacement from our original (raw) data.
Thus, a nonparametric bootstrap 90% CI for the pop'n mean is obtained by taking the 5th and 95th percentiles of the resampled means $\bar x^*$ among these B resamples.
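The nonparametric version differs from the parametric sketch only in where the resamples come from: the raw data itself rather than a fitted exponential. A minimal sketch (names, seed, and B are illustrative):

```python
import random
import statistics

def nonparam_boot_ci(data, B=1000, level=0.90, seed=3):
    """Percentile CI for the pop'n mean: resample with replacement from
    the raw data, compute each resample's mean, and take the 5th and
    95th percentiles of the B bootstrap means (for level = 0.90)."""
    rng = random.Random(seed)
    n = len(data)
    means = sorted(
        statistics.mean(rng.choice(data) for _ in range(n)) for _ in range(B)
    )
    lo = int(B * (1 - level) / 2)   # values trimmed from each tail
    return means[lo], means[B - lo - 1]

crater = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
          211, 269, 115, 586, 983, 115, 162, 70, 565, 114]
ci = nonparam_boot_ci(crater)
```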