
Page 1: The Multiple Comparisons Problem in IES Impact Evaluations: Guidelines and Applications

Peter Z. Schochet and John Deke

June 2009, IES Research Conference

Page 2: What Is the Problem?

Multiple hypothesis tests are often conducted in impact studies:

– Outcomes
– Subgroups
– Treatment groups

Standard testing methods could yield:

– Spurious significant impacts
– Incorrect policy conclusions

Page 3: Overview of Presentation

Background

Testing guidelines adopted by IES

Examples of their use by the RELs

New guidance on statistical methods for “between-domain” analyses

Page 4: Background

Page 5: Assume a Classical Hypothesis Testing Framework

Test H0j: Impactj = 0

Reject H0j if the p-value of the t-test is less than α = .05

Chance of finding a spurious impact is 5 percent for each test alone

Page 6: But If Tests Are Considered Together and No True Impacts…

Probability that at least 1 t-test is statistically significant:

Number of Tests a    Probability
 1                   .05
 5                   .23
10                   .40
20                   .64
50                   .92

a Assumes independent tests.
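These probabilities follow directly from the independence assumption: with N independent tests at α = .05, the chance of at least one false rejection is 1 − (1 − .05)^N. A one-line check in R:

    # FWER for N independent tests, each conducted at alpha = .05
    N <- c(1, 5, 10, 20, 50)
    round(1 - (1 - 0.05)^N, 2)
    #> 0.05 0.23 0.40 0.64 0.92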

Page 7: Impact Findings Can Be Misrepresented

Publishing bias

A focus on “stars”

Page 8: Adjustment Procedures Lower α Levels for Individual Tests

Methods control the “combined” error rate

Many available methods:

– Bonferroni: compare p-values to (.05 / number of tests)
– Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
– Resampling methods (Westfall and Young 1993)
– Benjamini-Hochberg (1995)

Page 9: These Methods Reduce Statistical Power: The Chances of Finding Real Effects

Simulated statistical power a

Number of Tests    Unadjusted    Bonferroni
 5                 .80           .59
10                 .80           .50
20                 .80           .41
50                 .80           .31

a Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests.
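The Bonferroni column can be approximated without simulation: if the unadjusted two-sided test has 80 percent power at α = .05, the standardized effect corresponds to a noncentrality of about z.975 + z.80 ≈ 2.80, and Bonferroni simply tightens the critical value to α = .05/N. A rough analytic check in R (an approximation, not the authors' simulation):

    # Approximate power of a two-sided z-test after Bonferroni adjustment,
    # calibrated so that unadjusted power at alpha = .05 is 0.80
    lambda <- qnorm(0.975) + qnorm(0.80)          # noncentrality, approx. 2.80
    N <- c(5, 10, 20, 50)
    round(pnorm(lambda - qnorm(1 - 0.05 / N / 2)), 2)
    #> 0.59 0.50 0.41 0.31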

Page 10: Basic Testing Guidelines: Balance Type I and II Errors

Page 11: Problem Should Be Addressed by First Structuring the Data

Structure will depend on the research questions, previous evidence, and theory

Adjustments should not be conducted blindly across all contrasts

Page 12: The Plan Must Be Specified Up Front

To avoid “fishing” for findings

Study protocols should specify:

– Data structure
– Confirmatory analyses
– Exploratory analyses
– Testing strategy

Page 13: Delineate Separate Outcome Domains

Based on a conceptual framework

Represent key clusters of constructs

Domain “items” are likely to measure the same underlying trait (have high correlations):

– Test scores
– Teacher practices
– Student behavior

Page 14: Testing Strategy: Both Confirmatory and Exploratory Components

Confirmatory component:

– Addresses central study hypotheses
– Used to make overall decisions about the program
– Must adjust for multiple comparisons

Exploratory component:

– Identifies impacts or relationships for future study
– Findings should be regarded as preliminary

Page 15: Focus of Confirmatory Analysis Is on Experimental Impacts

Focus is on key child outcomes, such as test scores

Targeted subgroups (e.g., ELL students)

Some experimental impacts could be exploratory:

– Subgroups
– Secondary child and teacher outcomes

Page 16: Confirmatory Analysis Has Two Potential Parts

1. Domain-specific analysis

2. Between-domain analysis

Page 17: Domain-Specific Analysis: Test Impacts for Outcomes as a Group

Create a composite domain outcome:

– A weighted average of standardized outcomes
– Weights can be equal weights, expert judgment, predictive validity weights, or factor analysis weights
– MANOVA is not recommended

Conduct a t-test on the composite (see the sketch below)
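A minimal sketch of the composite-and-t-test step, using simulated data and equal weights; the variable names (treat, y1–y3) and the choice of standardizing against the control group are illustrative assumptions, not the authors' code:

    # Domain-specific analysis sketch: equal-weight composite of
    # standardized outcomes, followed by a t-test on the composite
    set.seed(1)
    df <- data.frame(treat = rep(0:1, each = 100),
                     y1 = rnorm(200), y2 = rnorm(200), y3 = rnorm(200))
    ctrl <- df$treat == 0
    z <- sapply(c("y1", "y2", "y3"), function(v)     # standardize each outcome
      (df[[v]] - mean(df[[v]][ctrl])) / sd(df[[v]][ctrl]))
    df$composite <- drop(z %*% rep(1/3, 3))          # equal weights; swap in
    t.test(composite ~ treat, data = df)             # other weighting schemes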

Page 18: Between-Domain Analysis: Test Impacts for Composites Across Domains

Are impacts significant in all domains?

– No adjustments are needed: the “all” conclusion requires every test to reject, so its error rate cannot exceed the α of a single test

Are impacts significant in any domain?

– Adjustments are needed
– Discussed later

Page 19: Application of Guidelines by the Regional Educational Labs

Page 20: Basic Features of the REL Studies

25 randomized controlled trials:

– Single treatment and control groups
– Testing diverse interventions
– Typically grades K-8
– Fall-spring data collection, some longer
– Collecting data on teachers and students

Page 21: Each RCT Provided a Detailed Analysis Plan to IES

Each plan included information on:

– Confirmatory research questions
– Confirmatory domains and outcomes
– Within- and between-domain testing strategy
– Study samples
– Statistical power levels

Page 22: Key Features of Confirmatory Domains

Student academic achievement domains are specified in all RCTs

Some domains pertain to:

– Behavioral outcomes
– A specific time period for longitudinal studies
– Subgroups (e.g., ELL students)

Page 23: Most RCTs Have Specified Structured Research Questions

Most have fewer than 3 domains:

– Some have only 1
– Most domains have a small number of outcomes

Main between-domain question: “Are there positive impacts in any domain?”

Page 24: Adjustment Methods for Between-Domain Confirmatory Analyses

Page 25: Focus on Methods to Control the Familywise Error Rate (FWER)

FWER = Pr(find ≥ 1 significant impact given that no impacts truly exist)

Preferred over the false discovery rate (FDR) developed by Benjamini-Hochberg (BH):

– BH is a preponderance-of-evidence method
– BH does not control the FDR for all forms of dependencies across test statistics

Page 26: Consider Four FWER Adjustment Methods

Sidak: exact adjustment when tests are independent

Bonferroni: approximate adjustment when tests are independent

Generalized Tukey: adjusts for correlated tests that follow a multivariate t-distribution

Resampling: robust adjustment for correlated tests for general distributions

Page 27: Main Research Questions

How do these four methods work?

Are the more complex methods likely to provide more powerful tests for between-domain analyses?

– There are no single-routine statistical packages for the complex methods under clustered designs

Page 28: Basic Setup for the Between-Domain Analysis

Assume N domain composites

Test whether any domain composite is statistically significant

Aim to control the FWER at α = .05

All methods reduce the α level for individual tests: α* = .05 / fact

Page 29: Sidak

Uses the relation FWER = 1 − Pr(correctly rejecting all N null hypotheses)

For independent tests, FWER = 1 − (1 − α*)^N

Sidak picks α* so that FWER = 0.05

For example, if N = 3:

– α* = 0.017
– fact = 0.05 / 0.017 = 2.949
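Solving 1 − (1 − α*)^N = 0.05 for α* gives α* = 1 − 0.95^(1/N). A short R check reproduces the N = 3 example and the fact values tabulated on the next page:

    # Sidak per-test alpha and the implied divisor "fact"
    N <- 1:5
    alpha_star <- 1 - (1 - 0.05)^(1 / N)
    round(data.frame(N, alpha_star, fact = 0.05 / alpha_star), 3)
    # N = 3 gives alpha_star = 0.017 and fact = 2.949;
    # Bonferroni would simply use fact = N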

Page 30: The Bonferroni Method Tends to Be More Conservative

Bonferroni: α* = (.05 / N); fact = N

The value of fact for the Sidak and Bonferroni:

N    Sidak    Bonferroni
1    1        1
2    1.975    2
3    2.949    3
4    3.924    4
5    4.899    5

Page 31: Sidak and Bonferroni Are Likely To Be Conservative with Correlated Tests

Correlated tests can occur if:

– Domain composites are correlated
– Treatment effects are heterogeneous

Yields tests with lower power

Page 32: Generalized Tukey and Resampling Methods Adjust for Correlated Tests

Let pi be the p-value from test i

Both methods use the relation: FWER = Pr(min(p1, p2, …, pN) ≤ .05 | H0 is true)

Both methods calculate the FWER using the distribution of min(p1, p2, …, pN) or max(t1, t2, …, tN)

Page 33: Generalized Tukey

Assumes test statistics have multivariate t distributions with known correlations

The MULTCOMP package in R can implement this adjustment (Hothorn, Bretz, Westfall 2008)

– Multi-stage procedure that requires user inputs

Page 34: Using the MULTCOMP Package

Inputs are a vector of impact estimates and the corresponding variance-covariance matrix (see the sketch below)

Challenge is to get cross-equation covariances of the impact estimates

One option: use the suest command in STATA, then copy the resulting covariance matrix to R

– Uses GEE rather than HLM to adjust for clustering
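A minimal sketch of this workflow, with made-up impact estimates and covariance matrix standing in for output copied from suest; the domain names and all numbers are illustrative assumptions:

    # Generalized Tukey adjustment via multcomp, starting from externally
    # estimated impacts (est) and their variance-covariance matrix (V)
    library(multcomp)
    est <- c(domain1 = 0.08, domain2 = 0.15, domain3 = 0.05)
    V <- matrix(c(0.0025, 0.0010, 0.0008,
                  0.0010, 0.0025, 0.0012,
                  0.0008, 0.0012, 0.0025), nrow = 3)
    # parm() wraps estimates produced outside R's model-fitting functions
    g <- glht(model = parm(coef = est, vcov = V), linfct = diag(3))
    summary(g)   # single-step adjusted p-values for the three impacts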

Page 35: Resampling/Bootstrapping

The distribution of the maximum t-statistic can be estimated through resampling (Westfall and Young 1993)

– Allows for general forms of correlations and outcome distributions

Resampling must be performed “under the null hypothesis”

Page 36: Homoskedastic Bootstrap Algorithm

1. Calculate impacts and tstats using the original data

2. Define Y* as the residuals from these regressions

3. Repeat the following at least 10,000 times:

– Randomly sample schools, with replacement, from Y*
– Randomly assign the sampled schools to treatment and control groups in the same proportions as in the original data
– Calculate impacts and save the maximum absolute tstat

4. Adjusted p-values = proportion of maximum tstats that lie above the absolute value of the original tstats

A sketch of these steps appears below.
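A compact sketch of steps 1–4 for school-level (aggregated) data with two hypothetical outcomes; a real application would need the clustering refinements discussed on Page 38:

    # Homoskedastic bootstrap of the maximum |t| statistic
    set.seed(1)
    n <- 60                                    # schools
    dat <- data.frame(treat = rep(0:1, each = n / 2),
                      y1 = rnorm(n), y2 = rnorm(n))
    tstats <- function(d) sapply(c("y1", "y2"), function(v)
      unname(t.test(d[[v]] ~ d$treat)$statistic))
    t_orig <- tstats(dat)                      # step 1: original tstats
    ystar <- sapply(c("y1", "y2"), function(v) # step 2: residuals from the
      residuals(lm(dat[[v]] ~ dat$treat)))     #   impact regressions (Y*)
    max_t <- replicate(10000, {                # step 3: resample 10,000 times
      idx <- sample(n, replace = TRUE)         #   schools, with replacement,
      d <- data.frame(treat = dat$treat,       #   reassigned to T/C in the
                      ystar[idx, ])            #   original proportions
      max(abs(tstats(d)))                      #   save the max absolute tstat
    })
    sapply(abs(t_orig), function(t0)           # step 4: adjusted p-values
      mean(max_t > t0))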

Page 37: Example of Resampling Method

Original tstats are 0.793 and 3.247; adjusted p-values are 0.89 and 0.00

tstat 1    tstat 2    Maximum abs(tstat) a
 0.909      2.635      2.635 (1)
 0.892      1.227      1.227 (1)
-2.768      1.342      2.768 (1)
 0.570     -0.237      0.570
-0.574     -1.472      1.472 (1)
-1.245     -0.545      1.245 (1)
 0.798      0.083      0.798 (1)
-0.138      0.027      0.138 (1)
-1.810      0.494      1.810 (1)

a (1) = max tstat > 0.793; (2) = max tstat > 3.247

Page 38: Implementation of Resampling

The MULTTEST procedure in SAS implements resampling, but only for non-clustered data

Simple approach: aggregate data to the school level, and use MULTTEST

More complex approach: write a program to implement the algorithm with clustering

Page 39: Comparing Methods

Assume 3 composite domain outcomes with correlations of 0.20, 0.50, and 0.80

Outcomes are normally distributed or heavily skewed normals (focus on skewed)

Four types of comparisons:

– FWER
– Values of fact
– Minimum detectable effect size (MDES)
– “Goal line” scenario

Page 40: FWER Values Are Similar by Method Except With Large Correlations

FWER values, by method and test correlations

Method               ρ=0.2    ρ=0.5    ρ=0.8
No Adjustment        0.146    0.125    0.097
Bonferroni           0.048    0.045    0.034
Sidak                0.050    0.048    0.036
Generalized Tukey    0.049    0.051    0.049
Bootstrap            0.054    0.052    0.051
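The pattern for the simpler methods can be checked with a small multivariate-normal simulation; this is a rough sketch under normality, not the authors' simulation, which also covers skewed outcomes:

    # FWER of unadjusted and Bonferroni testing, 3 equicorrelated tests
    library(MASS)
    set.seed(1)
    fwer <- function(rho, crit) {
      S <- matrix(rho, 3, 3); diag(S) <- 1
      z <- mvrnorm(100000, mu = rep(0, 3), Sigma = S)  # all nulls true
      mean(apply(abs(z) > crit, 1, any))               # any false rejection
    }
    sapply(c(0.2, 0.5, 0.8), fwer, crit = qnorm(0.975))       # unadjusted
    sapply(c(0.2, 0.5, 0.8), fwer, crit = qnorm(1 - 0.05/6))  # Bonferroni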

Page 41: Values of fact Are Similar by Method Except With Large Correlations

Values of fact, by method and test correlations

Method               ρ=0.2    ρ=0.5    ρ=0.8
Bonferroni           3.00     3.00     3.00
Sidak                2.85     2.85     2.85
Generalized Tukey    2.84     2.58     2.02
Bootstrap            2.83     2.57     2.01

Page 42: All Methods Yield Similar MDES

MDES values, by method and test correlations a

Method               ρ=0.2    ρ=0.5    ρ=0.8
No Adjustment        0.21     0.21     0.21
Bonferroni           0.25     0.25     0.25
Sidak                0.24     0.24     0.24
Generalized Tukey    0.24     0.24     0.23
Bootstrap            0.24     0.24     0.23

a Assumes 60 schools, 60 students per school, R2 = 0.50, ICC = 0.15
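The first three rows can be roughly reproduced from the standard MDES formula for a school-randomized design with a 50/50 split and 80 percent power. This reconstruction is an assumption, not the authors' calculation, and it uses normal rather than t critical values:

    # Approximate MDES under the stated design parameters
    J <- 60; n <- 60; icc <- 0.15; R2 <- 0.50
    se <- sqrt((icc + (1 - icc) / n) * (1 - R2) / (0.25 * J))
    mdes <- function(alpha) (qnorm(1 - alpha / 2) + qnorm(0.80)) * se
    round(c(unadjusted = mdes(0.05),
            bonferroni = mdes(0.05 / 3),
            sidak      = mdes(1 - 0.95^(1/3))), 2)
    # roughly 0.21, 0.24, 0.24; small gaps from the table reflect
    # the normal approximation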

Page 43: “Goal Line” Scenario: The Method Could Matter for Marginally Significant Impacts

Adjusted p-values, by method and test correlations a

Method               ρ=0.2    ρ=0.5    ρ=0.8
No Adjustment        0.019    0.019    0.019
Bonferroni           0.057    0.057    0.057
Sidak                0.054    0.054    0.054
Generalized Tukey    0.054    0.049    0.038
Bootstrap            0.054    0.049    0.038

a Assumes 60 schools, 60 students per school, R2 = 0.50, ICC = 0.15

Page 44: Summary and Conclusions

Multiple comparisons guidelines:

– Specify confirmatory analyses in study protocols
– Delineate outcome domains
– Conduct hypothesis tests on domain composites

RELs have implemented the guidelines

Page 45: Summary and Conclusions

Adjustments are needed for between-domain analyses:

– For calculating MDEs in the design stage, using the Bonferroni is sufficient
– For estimating impacts, the more complex methods may be preferred in “goal-line” situations when test correlations are large

Page 46: References and Contact Information

Guidelines in Multiple Testing in Impact Evaluations (Schochet 2008)
– ies.ed.gov/ncee/pubs/20084018.asp

Resampling-Based Multiple Testing (Westfall and Young 1993; John Wiley and Sons)

[email protected]

[email protected]