Statistical Data Analysis 2011/2012, M. de Gunst, Lecture 3


Page 1: Statistical Data Analysis 2011/2012, M. de Gunst, Lecture 3

Page 2

Statistical Data Analysis: Introduction

Topics:
- Summarizing data
- Exploring distributions (continued)
- Bootstrap (first part)
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

Page 3

Today’s topics:
- Exploring distributions (Chapter 3: 3.5.2, 3.5.3)
- Bootstrap (Chapter 4: 4.1, 4.2)

Exploring distributions
3.5. Tests for goodness of fit
3.5.1. Shapiro-Wilk test for normal distribution (last week)
3.5.2. Kolmogorov-Smirnov test for general distribution
3.5.3. Chi-square test for goodness of fit for general distribution

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for distribution
- Parametric bootstrap estimators
- Empirical bootstrap estimators

Page 4

3.5. Exploring distributions: reminder

Testing

Ingredients of a test:
- Hypotheses H0 and H1
- Test statistic T
- Distribution of T under H0, and how it is changed/shifted under H1
- Rule for when H0 will be rejected: rejection rule based either on critical region or on p-value

How to perform a test?
1. Describe the above
2. Choose significance level α
3. Calculate and report the value t of T
4. Report whether t is in the critical region, or whether p-value < α
5. Formulate the conclusion of the test: “H0 rejected” or “H0 not rejected”
6. If possible, translate the conclusion to the practical context

NB. When asked to perform a test, you have to do all 6 steps!

Page 5

Tests for goodness of fit: for (one) general distribution

Situation: x1, …, xn independent realizations from unknown distribution F

Now: H0: F = F0, with F0 one specific, fully specified distribution (against H1: F ≠ F0)

Which statistic gives information about the distribution F?

Page 6

3.5.2. Kolmogorov-Smirnov test (1)

x1, …, xn independent realizations from unknown distribution F; H0: F = F0

Idea: use the empirical distribution function F̂n(x) = (1/n) · #{i : xi ≤ x}

Makes sense: nF̂n(x) is a random variable with nF̂n(x) ~ binom(n, F(x)),

so that for n → ∞, F̂n(x) → F(x) (law of large numbers).

Then also under H0, for n → ∞, F̂n(x) → F0(x).

Base the test on the distance between F̂n and F0.
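The distance idea can be made concrete in R. A minimal sketch (toy data; F0 = N(0,1) is an assumption for illustration) computes the maximal deviation of the empirical distribution function from F0 by hand and compares it with ks.test:

```r
# Kolmogorov-Smirnov distance between the empirical distribution
# function Fn and a hypothesized F0 = N(0,1), computed by hand.
set.seed(1)
x <- rnorm(30)                      # toy data
xs <- sort(x)
n <- length(xs)
F0 <- pnorm(xs)                     # F0 at the sorted data points
# sup_x |Fn(x) - F0(x)| is attained at a data point, either just
# after a jump of Fn (value i/n) or just before it (value (i-1)/n):
Dn <- max(pmax((1:n)/n - F0, F0 - (0:(n - 1))/n))
# Compare with the built-in test statistic:
Dn - unname(ks.test(x, pnorm)$statistic)   # essentially zero
```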

Page 7

Kolmogorov-Smirnov test (2)

Test statistic: Dn = sup_x | F̂n(x) − F0(x) |

Distribution of Dn under H0: the same for all continuous F0, so Dn is distribution free over the class of continuous distribution functions, and the K-S test is a nonparametric test.

Because under H0 the transformed variables F0(Xi) are uniform on (0,1), so that the distribution of Dn depends on the data only through uniform order statistics.

When is H0 rejected? For large values of Dn
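That Dn is distribution free can be illustrated by simulation (a sketch, not a proof): simulate Dn under two quite different continuous null distributions and compare the quantiles.

```r
# Dn is distribution free: its H0 distribution is the same for every
# continuous F0. Simulate Dn (n = 25) under N(0,1) and under Exp(1):
set.seed(1)
n <- 25
d.norm <- replicate(2000, unname(ks.test(rnorm(n), pnorm)$statistic))
d.exp  <- replicate(2000, unname(ks.test(rexp(n), pexp)$statistic))
round(quantile(d.norm, c(0.5, 0.9, 0.95)), 3)
round(quantile(d.exp,  c(0.5, 0.9, 0.95)), 3)   # nearly identical
```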

Page 8

Kolmogorov-Smirnov test (3)

Test statistic: Dn = sup_x | F̂n(x) − F0(x) |

p-values from tables or computer package.

Note: the standard K-S test with these p-values is not suitable for a composite H0. Then use an adjusted K-S test with adjusted p-values.

Example: for “H0: F is normal” the adjusted test statistic for the K-S test is

Dadj = sup_x | F̂n(x) − Φ((x − X̄)/S) |

What is the difference with Dn? The estimated parameters X̄ and S introduce additional stochasticity!

Page 9

Kolmogorov-Smirnov test (4)

Data: x
H0: F = N(0,1)
H1: F ≠ N(0,1)

Test statistic: Dn

R: > ks.test(x, pnorm)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.1163, p-value = 0.4735
alternative hypothesis: two-sided

H0 rejected? No: p-value = 0.4735 > α = 0.05, so H0 is not rejected.

Example (histogram of x not shown)

Page 10

Kolmogorov-Smirnov test (5)

Data: y
H0: F is normal ← composite null hypothesis
H1: F is not normal

Test statistic: Dadj

R: > ks.test(y, pnorm)
D = 0.6922, p-value = 6.661e-16
Incorrect: this is the test for H0: F = N(0,1) vs H1: F ≠ N(0,1).

> ks.test(y, pnorm, mean = mean(y), sd = sd(y))
D = 0.1081, p-value = 0.5655
> mean(y)
[1] 3.62158
> sd(y)
[1] 3.043356
Incorrect: this is the test for H0: F = N(3.62158, (3.043356)²) vs H1: F ≠ N(3.62158, (3.043356)²). The statistic D = 0.1081 equals Dadj, but we have not used the distribution of Dadj! The p-value should be 0.126 (next week).

Example (data y)
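The adjusted p-value announced above can already be approximated by simulation, Lilliefors-style: simulate the adjusted statistic with the parameters re-estimated in every simulated sample. A sketch (the slide's data vector y is not listed, so a toy stand-in is generated here):

```r
# Simulated p-value for the composite H0 "F is normal", using the
# adjusted statistic Dadj (mean and sd re-estimated per sample).
set.seed(1)
y <- rnorm(59, mean = 3.6, sd = 3.0)   # stand-in for the slide's y
dadj <- function(z)
  unname(ks.test(z, pnorm, mean = mean(z), sd = sd(z))$statistic)
d.obs <- dadj(y)
# Under H0 the distribution of Dadj does not depend on (mu, sigma),
# so simulate from N(0,1) and re-estimate each time:
d.sim <- replicate(1000, dadj(rnorm(length(y))))
p.adj <- mean(d.sim >= d.obs)          # simulated adjusted p-value
```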

Page 11

3.5.3. Chi-square test for goodness of fit (1)

x1, …, xn independent realizations from unknown distribution F; H0: F = F0

Idea: use the empirical distribution in a different way: divide the real line into intervals I1, …, Ik and compare the number of data points in the intervals with the expected numbers under H0.

Ni = number of observations in Ii

pi = probability of an observation in Ii under F0

Then npi = expected number of observations in Ii under H0

Test statistic: X² = Σᵢ (Ni − npi)² / (npi)
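The construction above in a minimal R sketch (toy data; the choice k = 5 and H0: F0 = N(0,1) are assumptions for illustration):

```r
# Chi-square goodness-of-fit statistic computed by hand.
set.seed(1)
x <- rnorm(100)                              # toy data
breaks <- c(-Inf, -1, 0, 1, 2, Inf)          # intervals I1, ..., I5
N <- table(cut(x, breaks))                   # Ni = observed counts
p <- diff(pnorm(breaks))                     # pi under F0 = N(0,1)
n <- length(x)
X2 <- sum((N - n * p)^2 / (n * p))           # the statistic above
pchisq(X2, df = length(p) - 1, lower.tail = FALSE)  # asymptotic p-value
```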

Page 12

Chi-square test for goodness of fit (2)

Test statistic: X² = Σᵢ (Ni − npi)² / (npi)

Distribution of X² under H0: different for different F0, but for n → ∞ the distribution of X² under H0 is chi-square with k − 1 degrees of freedom, the same for all F0.

For large enough n, X² is (approximately) distribution free, so the chi-square test is a nonparametric test.

When is H0 rejected? For large values of X2

Page 13

Chi-square test for goodness of fit (3)

Test statistic: X² = Σᵢ (Ni − npi)² / (npi)

How to choose the intervals I1, …, Ik?

How many? More is better, but not too many. Rule of thumb: at least 5 observations expected in each interval under H0.

Where? Choose the intervals so that about the same number is expected in each interval under H0.
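The “about the same number in each interval” rule is easy to implement: put the breaks at quantiles of F0. A sketch for F0 = N(4, 9), as in the example that follows:

```r
# Equiprobable intervals under H0: F0 = N(4, 3^2): each of the k
# intervals gets probability 1/k, hence expected count n/k.
k <- 8
b <- qnorm(seq(0, 1, length.out = k + 1), mean = 4, sd = 3)
# b runs from -Inf to Inf; check the interval probabilities:
diff(pnorm(b, mean = 4, sd = 3))   # all equal to 1/k = 0.125
# The rule of thumb n*pi >= 5 then just requires n >= 5*k.
```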

Page 14

Chi-square test for goodness of fit (4)

Data: y
H0: F = N(4, 9)
H1: F ≠ N(4, 9)

Test statistic: X²

R: > chisquare(y, pnorm, k=8, lb=0, ub=16, mean=4, sd=3)
$chisquare
[1] 13.11222
$pr
[1] 0.06942085
$N
 (0,2]  (2,4]  (4,6]  (6,8] (8,10] (10,12] (12,14] (14,16]
    14     13      9      4      5       0       1       0
$np
[1]  8 12 12  8  3  0  0  0

# Expected numbers under H0 do not satisfy the rule of thumb.
# Better: choose a suitable vector b of ‘breaks’:
> chisquare(y, pnorm, breaks=b, mean=4, sd=3)

Example (data y)
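The chisquare function used on this slide is a local course function, not part of base R. A hypothetical re-implementation (the interface is guessed from the slide's call; argument names and behavior are assumptions) could look like:

```r
# Hypothetical re-implementation of the course's local 'chisquare'
# function (not in base R); interface guessed from the slide.
chisquare <- function(x, distr, k = 8, lb = min(x), ub = max(x),
                      breaks = NULL, ...) {
  if (is.null(breaks)) breaks <- seq(lb, ub, length.out = k + 1)
  N  <- table(cut(x, breaks))          # observed counts Ni
  p  <- diff(distr(breaks, ...))       # cell probabilities under H0
  np <- length(x) * p                  # expected counts n*pi
  X2 <- sum((N - np)^2 / np)           # chi-square statistic
  list(chisquare = X2,
       pr = pchisq(X2, df = length(p) - 1, lower.tail = FALSE),
       N = N, np = np)
}
```

For instance chisquare(y, pnorm, breaks = b, mean = 4, sd = 3), as on the slide.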

Page 15

Chi-square test for goodness of fit (5)

Test statistic: X² = Σᵢ (Ni − npi)² / (npi), under H0 approximately χ²(k−1)

The standard chi-square test is not suitable for a composite H0. Then use an adjusted chi-square test with an adjusted chi-square distribution.

Example: for “H0: F is normal” the adjusted chi-square test statistic is

X²adj = Σᵢ (Ni − n p̂i)² / (n p̂i), with p̂i the probability of Ii under the estimated normal distribution,

which under H0 is approximately χ²(k−m−1), with m the number of estimated parameters (here m = 2), and only for one specific type of estimators.

Page 16

Recap

Exploring distributions
3.5. Tests for goodness of fit
3.5.2. Kolmogorov-Smirnov test for general distribution
3.5.3. Chi-square test for goodness of fit for general distribution

Page 17

4. Bootstrap

Page 18

Bootstrap: Introduction (1)

Data: 59 melting temperatures of beewax

P = unknown true underlying distribution of the beewax data
Estimator of location of P? Tn = (sample) Mean
Estimate of location of P? tn = mean(beewax) = 63.589

How accurate is the estimate? How good is the estimator? Is the distribution of Tn broad or narrow?

Main question: how to estimate the unknown distribution of the estimator Tn. Notation: QP

Example

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624

Page 19

Bootstrap: Introduction (2) (continued)

Simple case: assume P = N(μ, σ²); Tn = (sample) Mean

What is the distribution QP of Tn? We estimate it by N(63.589, 0.121/59)

How did we find this?

i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: N((sample) Mean, (sample) Variance/59)
v) Estimate of QP: N(63.589, 0.121/59)
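Step iv) can be checked by simulation; a sketch with the numbers above:

```r
# Check of step iv): the Mean of n = 59 observations from
# N(mu, sigma^2) has distribution N(mu, sigma^2/n).
set.seed(1)
mu <- 63.589; s2 <- 0.121; n <- 59
means <- replicate(5000, mean(rnorm(n, mu, sqrt(s2))))
c(var(means), s2 / n)   # both close to 0.121/59 = 0.00205
```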

Example

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624
> length(beewax)
[1] 59

Page 20

Bootstrap: Introduction (3) (continued)

Other case: assume P = N(μ, σ²); now Tn = (sample) Median. What is the distribution QP of Tn?

How to proceed now?
i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Median of 59 independent observations from P
iv) Estimator of QP: ?
v) Estimate of QP: ?

This is what the bootstrap is about: estimate the distribution QP of a function Tn of 59 independent observations from unknown P

Example


Page 21

Bootstrap: Introduction (4) (continued)

Again another case: no assumption about P; Tn = (sample) Mean. What is the distribution QP of Tn?

How to proceed now?
i) Estimator of P: ?
ii) Estimate: ?
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: ?
v) Estimate of QP: ?

This is what the bootstrap is about: estimate the distribution QP of a function Tn of 59 independent observations from unknown P

Example


Page 22

4.2. Bootstrap estimators for a distribution

This is what bootstrap is about: estimate distribution QP of function Tn of n independent observations from unknown P

Situation: x1, …, xn realizations of X1, …, Xn, independent, with unknown distribution P

Goal: estimate the distribution QP of an estimator Tn = Tn(X1, …, Xn)

Cases:
1. Assume P is some parametric distribution with unknown parameters
2. Assume nothing about P

Page 23

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (1)

(Beewax; case 1)

Case 1: assume P = N(μ, σ²); Tn = (sample) Median. What is the distribution QP of Tn?

How to proceed?
i) Estimator of P: N(X̄, S²) = P̂
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Median of 59 independent observations from P
iv) Estimator of QP: distribution of the Median of 59 independent observations from N(X̄, S²) = P̂
v) Estimate of QP: distribution of the Median of 59 independent observations from N(63.589, 0.121)

Unknown in closed form: use the computer to generate realizations from this estimate of QP

Empirical distribution of generated set is parametric bootstrap estimate of QP

Example


Which distribution is this?

Page 24

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (2)

(Continued; case 1)

How to generate realizations from the estimate of QP, i.e. from the distribution of the Median of 59 independent observations from N(63.589, 0.121)?

# 1. Generate one bootstrap sample:
> xstar = rnorm(59, 63.589, sqrt(0.121))
# Check:
> xstar
 [1] 63.84819 62.88915 63.71705 64.06793 …..
[57] 63.56481 64.03403 63.75276
# Note: xstar is of same length as beewax

# 2. Now compute one bootstrap value tstar from xstar:
> tstar = median(xstar)
> tstar
[1] 63.70498

# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of QP
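Steps 1 to 3 above fit in one replicate call; a sketch with B = 1000:

```r
# B parametric bootstrap values of the Median in one go.
set.seed(1)
B <- 1000
tstar <- replicate(B, median(rnorm(59, 63.589, sqrt(0.121))))
# The empirical distribution of tstar is the parametric bootstrap
# estimate of QP; for instance its variance:
var(tstar)
```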

Example

Page 25

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (3)

(Continued; case 1)

The B values tstar are generated realizations from estimate of QP

i.e. from distribution of Median of 59 independent observations from N(63.589,0.121)

Recall: empirical distribution of generated set is parametric bootstrap estimate of QP

Also: sample variance 0.0038 of generated set is parametric bootstrap estimate of variance of QP

Example

Page 26

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (1)

(Beewax; case 2)

Case 2: assume nothing about P; Tn = (sample) Mean. What is the distribution QP of Tn?

How to proceed?
i) Estimator of P: empirical distribution of the data = P̂n
ii) Estimate: empirical distribution of the beewax data
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: distribution of the Mean of 59 independent observations from the empirical distribution P̂n of the data
v) Estimate of QP: distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data

Unknown in closed form: use the computer to generate realizations from this estimate of QP

Empirical distribution of generated set is Empirical bootstrap estimate of QP

Example

Which distribution is this?


Page 27

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (2)

(Continued; case 2)

How to generate realizations from the estimate of QP, i.e. from the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data?

# 1. Generate one bootstrap sample:
> xstar = sample(beewax, replace = TRUE)
# Check:
> xstar
 [1] 63.69 64.42 63.30 63.03 63.13 63.13 63.08 63.27 63.08 64.12 64.21 63.43 …..
# Note: xstar is of same length as beewax and consists of values
# sampled from the set of beewax values.

# 2. Now compute one bootstrap value tstar from xstar:
> tstar = mean(xstar)
> tstar
[1] 63.60271

# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of QP
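As in the parametric case, steps 1 to 3 fit in one replicate call. (The beewax data are not listed in full on the slides, so a stand-in sample of the same size and scale is generated here.)

```r
# B empirical bootstrap values of the Mean: resample the data itself.
set.seed(1)
beewax <- rnorm(59, 63.589, 0.347)   # stand-in for the real data
B <- 1000
tstar <- replicate(B, mean(sample(beewax, replace = TRUE)))
var(tstar)   # empirical bootstrap estimate of the variance of QP
```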

Example

Page 28

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (3)

(Continued; case 2)

The B values tstar are generated realizations from estimate of QP

i.e. from distribution of Mean of 59 independent observations from empirical distribution of beewax data

Recall: empirical distribution of this generated set is empirical bootstrap estimate of QP

Also: sample variance 0.00193 of this generated set is empirical bootstrap estimate of variance of QP

Note: this value is comparable to the value 0.121/59 = 0.0020 of the estimate of the variance of QP under the normality assumption for P.

Example

Page 29

Empirical bootstrap with R

# Can be done in one go with the local R-function bootstrap:
> bootstrap = function(x, statistic, B = 100, ...) {
    # returns a vector of B bootstrap values of a real-valued statistic.
    # statistic(x) should be an R function; further arguments of
    # statistic can be inserted on ...
    # resampling is done from the empirical distribution of x
    y <- numeric(B)
    for (j in 1:B)
      y[j] <- statistic(sample(x, replace = TRUE), ...)
    y
  }

# Compute 1000 bootstrap values tstar:
> tstarvector = bootstrap(beewax, mean, B = 1000)

Page 30

Bootstrap: two errors

Recall the goal: to estimate the distribution QP of a function Tn of n independent observations from unknown P. Note: in the bootstrap estimation procedure two types of “errors” are made. Which ones?

Given the data:
- Estimate of QP: the distribution of Tn of n independent observations from the estimate of P ← first error
- which is estimated in turn by the empirical distribution of computer-generated realizations from this distribution ← second error

How can we make these errors small?
- The size of the first error depends on the quality of the estimator of P
- The size of the second error can be made small by taking B large

Page 31

Recap

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for distribution
- Parametric bootstrap estimators
- Empirical bootstrap estimators

Page 32

Exploring distributions/Bootstrap

The end