program eval review

7/27/2019 Program Eval Review


Program Evaluation Study Guide

Below is a description of a program. By answering each of the 11 questions listed here, explain the research design

you would use to carry out an evaluation of this program. Number your responses to each of the 11 questions.

1. Clearly state the evaluation question(s) you seek to answer.

Make sure it is clear, focused, and appropriately complex.

2. Clearly state the null hypothesis(es) that your analysis will test.

The null hypothesis is a statement that you want to test. In general, the null hypothesis is that things are the same as each other,

or the same as a theoretical expectation. Here, under the null, the slope (coefficient) on the treatment variable is zero, indicating no effect. The alternative

hypothesis is that things are different from each other, or different from a theoretical expectation.
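
As a sketch of how this might be written formally, assume a simple regression of the outcome on a treatment indicator. The symbols $Y_i$, $T_i$ and $\beta_1$ are illustrative assumptions, not notation from this guide:

```latex
Y_i = \beta_0 + \beta_1 T_i + \varepsilon_i, \qquad
H_0 : \beta_1 = 0, \qquad
H_A : \beta_1 \neq 0
```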

3. Define the unit of analysis that your dataset(s) will use.

The unit of analysis is the major entity that you are analyzing in your study. Why is it called the 'unit of analysis' and not

something else (like, the unit of sampling)? Because it is the analysis you do in your study that determines what the unit is. For

instance, if you are comparing the children in two classrooms on achievement test scores, the unit is the individual child because

you have a score for each child. On the other hand, if you are comparing the two classes on classroom climate, your unit of

analysis is the group, in this case the classroom, because you only have a classroom climate score for the class as a whole and

not for each individual student. For different analyses in the same study you may have different units of analysis. If you decide to

base an analysis on student scores, the individual is the unit. But you might decide to compare average classroom performance. In this case, since the data that goes into the analysis is the average itself (and not the individuals' scores) the unit of analysis is

actually the group. Even though you had data at the student level, you use aggregates in the analysis. In many areas of social

research these hierarchies of analysis units have become particularly important and have spawned a whole area of statistical

analysis sometimes referred to as hierarchical modeling. This is true in education, for instance, where we often compare

classroom performance but collected achievement data at the individual student level.
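
A minimal sketch of the classroom example above, using made-up scores: the same student-level data can feed an analysis whose unit is the student (six observations) or the classroom (two observations, the means). The data and the helper name `classroom_means` are hypothetical.

```python
from statistics import mean

# Hypothetical student-level records: (classroom, test score).
student_scores = [
    ("A", 70), ("A", 80), ("A", 90),
    ("B", 60), ("B", 65), ("B", 85),
]

def classroom_means(records):
    """Aggregate student scores so that each classroom becomes one unit."""
    by_class = {}
    for classroom, score in records:
        by_class.setdefault(classroom, []).append(score)
    return {c: mean(scores) for c, scores in by_class.items()}

# Unit of analysis = student: every individual score enters the analysis.
n_student_units = len(student_scores)

# Unit of analysis = classroom: only the aggregated means enter the analysis.
class_means = classroom_means(student_scores)
n_class_units = len(class_means)
```

Note how aggregation shrinks the number of units from six to two, which is why the choice of unit matters for statistical power.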



4. Define the dependent variable(s) and then describe how you would operationalize (measure) the dependent

variable(s).

Operationalizing a variable means creating a clear protocol (list of steps) for how to quantitatively measure the concept (i.e. to change the concept into a variable with measurable indicators). The protocol must be specific and precise enough so that

someone else could use your operational definition and obtain the same results you did. To sum up, operationalization means to

add measurable indicators to concepts to make them measurable variables.

5. Define the key explanatory variable(s) and then describe how you would operationalize (measure) the

independent key explanatory variable(s).

See above

6. Make a list of control variables that you will include in your analysis(es)

A control variable is a variable that is held constant in a research analysis. The use of control variables is generally done to

answer four basic kinds of questions: 1. Is an observed relationship between two variables just a statistical accident? 2. If one

variable has a causal effect on another, is this effect a direct one or is it indirect with other variables intervening? 3. If several

variables all have causal effects on the dependent variable, how does the strength of those effects vary? 4. Does a particular

relationship between two variables look the same under various conditions? Common control variables include age, gender, race,

income, education, weight, and marital status.

7. Describe your research design. Do not just give a term. Explain exactly what your research design will be. If your research design involves one or more Treatment groups and control or comparison groups, clearly explain

what each of these groups is and how these groups will be formed.

Experiment (Gold Standard)



Each subject has the same chance of being in the Treatment or the control group

One subject's assignment is independent of any other subject's assignment

Selection into one group or the other is unrelated to the individual's characteristics

If assignment to the Treatment group vs. the control group is truly random, then the groups should on average have the same characteristics.

Works best with large numbers of subjects

Limits of Experiments

Cannot yield estimates of the impact on any unit higher than the kind of unit sampled and randomly assigned

Ex: An experiment that randomly assigned individuals to Treatment and control groups cannot give estimates of program

impact on neighborhoods, cities, the economy, etc.

Cannot tell us the impact on specific individuals. Can only give an estimate of the average impact on groups of

participants.

External validity is questionable

Classic Experiment

Random assignment of subjects into the Treatment and control group

Pre-test administered to both groups, measuring each on the DV

Treatment group gets the Treatment. Control group does not. Both groups should otherwise be exposed to the same

conditions.

Post-test administered to both groups, measuring each on the DV

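Difference in differences (the Treatment group's pre-to-post change minus the control group's pre-to-post change) is attributed to the Treatment

Treatment is the explanatory variable

Treatment Group: DV1 -> Treatment -> DV2

Control Group: DV1 -> ( ) -> DV2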
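
The difference-in-differences arithmetic for this design can be sketched as follows. The scores are invented for illustration, and the function name `diff_in_diff` is hypothetical.

```python
from statistics import mean

def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Treatment group's pre-to-post change minus the control group's change."""
    treat_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treat_change - control_change

# Hypothetical DV1 (pre-test) and DV2 (post-test) scores.
treat_pre = [50, 52, 48, 50]
treat_post = [60, 63, 57, 60]
control_pre = [51, 49, 50, 50]
control_post = [54, 52, 53, 53]

# Treatment group gains 10 points, control group gains 3 points.
effect = diff_in_diff(treat_pre, treat_post, control_pre, control_post)
```

Subtracting the control group's change removes whatever would have happened without the Treatment (history, maturation, testing), assuming both groups were exposed to the same conditions.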

Randomized Post-Test Only Experiment



Treatment is the explanatory variable

Treatment Group: Treatment -> DV2

Control Group: ( ) -> DV2

No pretest is needed: because of random assignment, the groups are expected to be equivalent on average before the Treatment.

Randomized Block Design

Instead of random assignment at the unit level, first assign units to blocks that reflect known differences among subjects,

then randomly assign units from within each block

Blocks are internally homogeneous

Example: If an experimenter had reason to believe that education level would be a significant factor in the effect of a

given program, she/he might first divide the experimental subjects into education level groups: less than high school, high

school, some college, bachelor's, graduate. Then, within each education level group, individuals would be assigned to

Treatment groups using random assignment.

Advantage: can lower variance and thus yield narrower confidence intervals

This is NOT the same thing as randomly assigning at the cluster level rather than the unit level. You are still carrying out

random assignment at the unit level and can carry out analyses at the unit level.
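
A minimal sketch of blocked random assignment, assuming subjects are identified by an ID and the block is education level. The function name `blocked_assignment`, the subject IDs, and the even split within each block are illustrative assumptions.

```python
import random

def blocked_assignment(subjects, block_of, seed=0):
    """Randomly assign units to treatment/control separately within each block."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of[subject], []).append(subject)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)          # randomization happens at the unit level
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = "treatment"
        for subject in shuffled[half:]:
            assignment[subject] = "control"
    return assignment

# Hypothetical subjects and their education-level block.
education = {
    1: "high school", 2: "high school", 3: "high school", 4: "high school",
    5: "bachelor's", 6: "bachelor's", 7: "bachelor's", 8: "bachelor's",
}
assignment = blocked_assignment(list(education), education, seed=42)
```

Because every block contributes equally to both groups, education level cannot differ between Treatment and control by chance.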

Quasi-Experiments (When experiments just aren't an option)

Need to assess how similar the comparison group is to the Treatment group. The more similar they are, the better

Are the groups' pre-program means similar?

Are the SDs of the pre-program scores also similar?

Do historical forces affect both groups the same way and to the same degree?

Are rates of participant loss similar?

Are the groups maturing at the same rate?

Did people self-select into one group or the other? If so, this is bad, because the groups may differ in unmeasured ways



Lack of randomization leads to internal validity problems

Regression Discontinuity Design

Considered to be one of the strongest quasi-experimental designs

Estimates the effect of program eligibility (not receipt) on some outcome

Requirements:

Continuous outcome variable (DV)

Pre-test and post-test (pre-program and post-program) data for all respondents

May be a pre-test on the outcome variable

Some cutoff score (cut-point) on the assignment variable puts cases on one side of the score into Treatment (program)

group and others not

The cut-point must not be chosen in order to deliberately place particular cases in Treatment and others not.

Other than assignment to receive the program vs. not, the cases on one side or the other of the cut-point need to be

treated in the same way

Assignment Variable/Rating Variable

The assignment variable can be any continuous variable measured before the Treatment. It must not be a variable

that the Treatment causes or influences.

Needs to relate to the DV in an unbroken (though not necessarily a linear) manner throughout its range.

Absent Treatment, we should not see any discontinuities if we plot the relationship between the outcome variable

and the assignment variable.

When graphing for RD, the assignment variable goes on the x-axis of a graph. The y-axis is the post-Treatment

outcome. Drop a vertical line at the cut-point.

Sharp RD Designs: All cases receive the Treatment or non-Treatment to which they are assigned (our focus)

Fuzzy RD Designs: Some cases do not receive the Treatment or non-Treatment to which they were assigned. Problem

of no-shows and/or crossovers



Two types of RDD

- Discontinuity at a cut-point

* The direction and magnitude of the jump at the cut-point is taken as a measure of the causal effect of Treatment on the

outcome variable for cases near the cut-point.

* Use every observation in the sample, even those that are far to the right or left of the cut-point

* Global/ Parametric Estimation

* Recommended: Center the assignment variable on the cut-point by creating a new version of the score

(centered score = score - cut-point)

Logic: Pre-test score should be a strong predictor of the post-test score. Is the program (the Treatment) also a strong

predictor of post-test?

*Coefficient on Program indicator is the program effect

*Important: Try different functional forms when modeling

*Linear relationship b/w assignment variable and posttest score

*Quadratic

*Cubic

*Interactions with Treatment

* If we have specified the functional form of the relationship between the outcome and assignment variable correctly, the

estimator (coefficient on program) will be an unbiased estimator of the mean program impact at the cut-point.

* If functional form is incorrect, it will be biased.

* More power, but more potential for bias

* Analyze with regression using parametric techniques.



- Nonparametric/ Local randomization

* Differences between cases who pass the cut-point and those who just miss it are random

* Example: random error determines whether someone scores right above or right below the cut-point on some exam

(e.g. scores a 49 or a 51)

* So on average the only difference between those just above and just below the cut-point will be exposure to

the program

* So compare the mean outcomes for cases just to the left and just to the right of the cut-point (i.e. within a chosen bandwidth)

* This is the "how near?" question

* Selecting the right bandwidth is the challenge with this method

* Less potential for bias than with the global approach, but also more limited power given more limited n

* Simplest scenario is to calculate a simple difference of means: Using just those in the bandwidth, compare the mean

outcome for those who got Treatment and those who did not

* But can produce a biased estimator unless the bandwidth is exceedingly narrow

* Solution? Local linear regression

* Local linear regression is acceptable since the functional form within the limited range is likely to be linear

$Y_i = \alpha + \beta_0 T_i + \beta_1 X_i + \beta_2 X_i T_i + \varepsilon_i$

Where:

$Y_i$ is case $i$'s outcome

$X_i$ is case $i$'s score on the assignment variable, centered at the cut-point

$T_i$ is 1 if case $i$ is in the program (Treatment) and 0 otherwise

The resulting intercept for the control group ($\alpha$) estimates the mean outcome w/o the program

The resulting intercept for the program group ($\alpha + \beta_0$) estimates the outcome with the program

The difference between them ($\beta_0$) is an estimate of the Treatment effect

* Can calculate a simple difference of means if using just those cases close to the cut-point.

You can try both parametric and non-parametric approaches



It is useful to couple the analyses above with a test of whether other DVs that are similar to your DV of interest, but not

expected to be influenced by the program, exhibit discontinuity at the cutoff point

i.e. Look at a similar DV that you do not expect to be impacted by the program (such DVs are called control-construct

measures)

You hope they do NOT show discontinuity at the cutoff point

Suppose DV of interest is a math exam and the Treatment is a math program

Good control construct measure: Reading exam

Bad control construct measure: Physics exam (physics draws on math skills, so the program could plausibly affect it)

Internal Validity Issues:

Cut point must have been chosen independent of knowledge of cases' scores on the assignment variable, and scores must have

been assigned independent of the cut point. Should check for manipulation at the cut point.

Assessing RD Internal Validity:

Learn as much as possible about how cases were scored on the assignment variable and about how cut point was set

Plot the probability of receiving Treatment (y-axis) against the assignment variable (x-axis). There should be a jump (ideally a

jump of 1) at the cut-point.

Plot the relationship between non-outcome variables (y-axis) and the assignment variable (x-axis). These non-outcome

variables should not be affected by Treatment. There should NOT be a jump at the cut-point.

Plot the assignment variable against the number of observations at each of its values. There should NOT be a discontinuity in

the number of observations just above or just below the cut-point. A jump suggests manipulation of either the scores on the assignment variable, or the choice of the cut-point.

External Validity Issues (Generalizability)

Criticism of RD is that the estimated impact only applies to observations near the cut-point

But, Lee (2008) argues that the cases around the cut-point are actually fairly heterogeneous since random error plays some

role in determining scores on the assignment variable

Therefore the results may generalize more broadly

Border Matching

A geographic border clearly separates the Treatment and comparison group.

Look at data just to either side of the geographic border

A weakness is that not every Treatment has clear geographic borders. This works best with small geographic scales.

State borders are not a great way to differentiate the Treatment and comparison group.

It is a strong quasi-experimental design.

It addresses history as a threat to internal validity.

Analyze it using regression with control variables.

Instrumental Variables

Can be useful when you have unobserved omitted variables.

Can be useful when you have simultaneity

Must have at least one instrument for each endogenous regressor.

Instruments must be exogenous and relevant.

If an instrument is relevant, this means that variance in the instrument relates to variance in the independent variable. As

an instrument becomes more relevant, it is able to explain more of the variance in the independent variable. This makes

it more useful and a more accurate estimator for regression analysis. In addition, as an instrument becomes more relevant, the normal approximation to the sampling distribution becomes better. If an instrument is unable to explain

variance in the independent variable, it is known as a weak instrument. This is a problem because the normal

distribution becomes a weak approximation to the sampling distribution of the estimator. This problem cannot be fixed

by simply increasing the sample size. This problem can be identified by looking at the F-statistic which tests the

hypothesis that the coefficients on the instruments are zero in the first stage. This can be done when there is a single

endogenous regressor. If the F-statistic is greater than 10 at this stage, your instrument is most likely not weak. If you

have several instrumental variables, it may be a good idea to eliminate some of the weaker ones. In this case you would

use the most relevant subset for your analysis. If you drop weak instrumental variables, your standard errors can

increase but that is okay because the original standard errors were not meaningful. On the other hand, if the coefficients are exactly identified, you should not eliminate the weak instruments. This is because you would not have

enough instruments left to identify the coefficients. In this case, you can attempt to identify new strong instrumental variables. You could

also use the weak instrument but look at employing different methods of analysis.

If an instrument is exogenous, then a part of the variance in the independent variable identified by the instrumental

variable is exogenous. In other words, the instrument needs to contain information about variation in the independent

variable that is not related to the error term. If an instrument is not exogenous, the two-stage least squares estimator

becomes inconsistent. The instrument cannot identify exogenous variation in the independent variable. As a result, the

regression is unable to provide a consistent estimator. It is not possible to statistically test the hypothesis that the instrument is

exogenous when the coefficients are exactly identified. This makes it difficult to detect violations of this assumption.

On the other hand, if a coefficient is overidentified, researchers can test the overidentifying restrictions. This means that they

can test the hypothesis that the additional instruments are exogenous under the assumption that there are a sufficient number

of valid instruments available to identify the coefficients of interest. If the coefficients are exactly identified, the only way to

assess instrument exogeneity is by referencing expert opinion. If there are more instruments than endogenous

regressors, researchers can use a statistical tool called the test of overidentifying restrictions to assess instrument

exogeneity.

It can be tough to satisfy the exogeneity condition.

Not feasible to statistically test the exogeneity condition unless the model is overidentified

Analyze using 2SLS

Consider an equation $Y_i = \beta_0 + \beta_1 X_i + u_i$ where $X_i$ is endogenous

IV regression splits $X_i$ into two parts: 1. a part correlated with $u_i$ 2. a part uncorrelated with $u_i$

The basic idea of the IV method is to identify and use only the exogenous variation of $X_i$ to estimate $\beta_1$



Instrumental variables are variables that mimic the troublesome regressors, but are uncorrelated with the error term.

Comparison Group Pre/ Post Test

Identify a group of subjects post-hoc that appear comparable to the group that already received, or is already slated to receive, the Treatment

Match each unit in the Treatment group with a unit that did not receive the Treatment but that is similar on key

characteristics.

Identify and acknowledge systematic differences between the groups later.

Comparison group is not perfectly equivalent to the Treatment group since random assignment did not produce the

groups.

Treatment group: DV1 -> Treatment -> DV2

Comparison group: DV1 -> ( ) -> DV2

If panel data, can do diff-in-diff w/control variables

If repeated cross-sectional data, diff-in-diff is not possible but analysis can be done

One group pretest posttest

No comparison group

Better than a post-test only

Difference in means test

Regression using difference method



One group time series

Observations taken at regular, evenly spaced, and frequent intervals

A time series traces the path of a single variable over a period of months, years, etc. Think of that variable as the DV

DV1 DV2 DV3 DV4 DV5 -> Treatment -> DV6 DV7 DV8 DV9 DV10

Use panel or TSCS (time-series cross-section) data

TSCS Data: # time periods > # entities

Panel Data: # entities > # time periods

Balanced or unbalanced panel

2 wave panel

Include the intercept term to allow for the possibility that the mean change in the DV is non-zero even if there is no

change in your key EV of interest

Effectively controls for any variable V that does not change over time. This is useful if V is some variable that is

unobservable

You should not apply this method to multi-wave panels. In other words, do not run across several years differencing

year Y and Y+1 repeatedly. Instead, use the multi-wave panel approach (fixed effects regression).

Multi-wave panel

Use fixed effects regression.

- Entity fixed effects

* Control for variables that vary across entities (e.g. vary across states) but not across time

* Slope is the same for all entities

* Intercept is allowed to differ by entity

- Time fixed effects

* Control for variables that vary across time but not across units (e.g. a national level policy change that affects all

states)



* Each time period has a different intercept.

- Entity and time fixed effects together

* Use when you have some (omitted) variables that vary across entities but are fixed across time, as well as some

(omitted) variables that vary across time but are fixed across entities

Heteroskedasticity may be an issue.

This means the variance (spread) of the errors is not constant across the values of the key explanatory variables.

OLS estimator will be unbiased and consistent but will not be efficient

The standard error on the estimator is likely to be understated, meaning that we may conclude that a coefficient is

statistically significant when it is not

Solution is to use Robust/White/Huber-White standard errors. This is not sufficient, however, if there is any chance

the errors are correlated over time (auto-correlation/serial correlation).

Auto-Correlation and Serial Correlation

There is a correlation (positive or negative) between a variable's score at time t and its score at time t-1.

Pervasive with time-series data.

Identify it using the Durbin-Watson d statistic.

Fix it using HAC (heteroskedasticity and autocorrelation consistent) standard errors

One kind of HAC is the clustered standard error

- Allows errors to be correlated within a cluster but assumes that the errors are uncorrelated across clusters

- For autocorrelated panel data, the cluster is the observations of the same entity across time (e.g. the state is the

cluster)

- Panel Corrected Standard Errors (PCSE)

* Use for TSCS data
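
The Durbin-Watson d statistic mentioned above can be computed directly from regression residuals ordered in time. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residual series below are made up for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that stay on one side for a while: positive autocorrelation.
d_positive = durbin_watson([1, 1, 1, -1, -1, -1])

# Hypothetical residuals that flip sign every period: negative autocorrelation.
d_negative = durbin_watson([1, -1, 1, -1])
```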

- Segmented regression analysis.

Must have a continuous DV

Data must be observed at regular, evenly-spaced intervals.



Generally want at least 12 time points before and 12 time points after intervention, especially if you think there is

seasonal variation.

Segment: A sequence of measures is divided into two or more portions at change points (points in time where we

expect the values of the time series to change because of an event/intervention). Each segment has a level (intercept) and a trend (slope). Each segment is allowed to have its own level and trend.

Comparison Group Time Series

Same as interrupted time series, but now we add in a comparison group that did not experience the event or intervention

and also look at its pattern on that variable over time

8. Describe the major threats to the internal validity of your specific study. Do not just list terms. For each threat

you discuss, explain why it is a threat.

Issues with internal validity

Does the key EV actually account for change we may see in the DV?

Are we sure that the relationship between the assumed EV and the DV does not owe to some other factor?

Selection effects are not a threat to internal validity if the Treatment and control groups are truly formed by random

assignment

But, experiments ARE susceptible to selection-mortality threats to internal validity, especially if one of the conditions

(Treatment or Control) is noxious or intolerable

And, experiments ARE also susceptible to social interaction threats to internal validity

Threats Relevant to the Single Group Pretest-Posttest Quasi-Experimental Designs

History

History refers to any external or historical event that occurred during the course of the study that may be responsible for the

effects instead of the program itself. For instance, in the late 1960s, a few years after Head Start began, Sesame Street also began airing.


Because Sesame Street includes lots of educational information, perhaps it could have been responsible for the apparent effects

of Head Start. If we were studying psychotherapy with depressed patients at the time Prozac went on the market, all of our

patients might have gotten better by the end of the study, because they went to a psychiatrist and got a prescription for Prozac.

Maturation

Maturation refers to a natural process that leads participants to change on the dependent measure. For instance, perhaps Head

Start kids start getting better just because they were getting older during the study. One classic problem with many health studies

is that patients get better by themselves (e.g., headache medicine trial). Maturation can refer to a decline as well as an

improvement. For instance, some loss of short-term memory abilities as you get older, or divergence in girls' math scores and

boys' verbal scores around middle school age. You can distinguish between history and maturation because maturation is

internal, a natural course of things having to do with some quality of the participants in the study. History has to do with an

external event of some kind.

Instrumentation

Instrumentation refers to an improvement or decline that occurs because of the measure itself. Instrumentation is most commonly found

with observational measures. There is a common problem with observers getting better at observing. As an example, say we

were observing young kids' verbal behaviors in Head Start. It may be that the observers miss a number of behaviors indicative

of good verbal skills during the pretest, but they were more likely to count these behaviors at post-test. Instrumentation can also

be a decline.

Testing

A threat that involves an improvement in scores on the post-test due to having taken the pretest is called testing. People

commonly improve on standardized tests such as intelligence tests, SATs, or GREs. Something about taking the test the first

time leads to a change in the test the second time, such as learning the answers or how the trick questions are set up. Perhaps

the Head Start students learned the types of verbal questions (e.g., analogy) on the pretest and therefore knew how to do them

better by posttest. In this case, it would not be the Treatment that had an effect but the experience of taking the test once that


led to an improvement.

Mortality (Attrition)

Mortality refers to people dropping out of the study during its course. For instance, if the Head Start children having the most difficulty drop out of the program, then by the end of the study the participants who remain have higher academic skills on

average. I wonder whether the recent trends of more frequently expelling students from public schools will help improve a

school's test scores.

Statistical Regression to the Mean

Regression to the mean is the most difficult to understand. It has to do with a sort of statistical fluke. Whenever scores fluctuate

over time for any reason, extreme scores tend to move toward the middle, and middle scores tend to move toward the extreme.

Regression to the mean is most likely to be a problem in a study in which participants are chosen for their extreme values. For

example, an early study evaluating the effects of Sesame Street indicated that it was especially helpful to the most disadvantaged

kids. The kids with the lowest skills improved the most. This could have been because the educational material that Sesame

Street presents was the most helpful to the kids who knew the least, or because those who did especially poorly on the pretest

had low scores partly by chance, so that their improvement was simply a matter of random changes back toward their true scores.
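
A small simulation can make regression to the mean concrete. In this sketch there is no program at all: each child has a stable true skill, and the pretest and posttest each add independent random noise. Children selected for the lowest pretest scores still "improve" on average. All parameters (means, standard deviations, the 10% cut) are arbitrary assumptions.

```python
import random
from statistics import mean

def simulate_regression_to_mean(n=5000, seed=1):
    """Mean pretest and posttest score of the lowest 10% of pretest scorers."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    true_skill = [rng.gauss(100, 10) for _ in range(n)]
    # Each test equals true skill plus independent measurement error; no Treatment.
    pretest = [skill + rng.gauss(0, 10) for skill in true_skill]
    posttest = [skill + rng.gauss(0, 10) for skill in true_skill]

    threshold = sorted(pretest)[n // 10]
    lowest = [i for i in range(n) if pretest[i] <= threshold]
    return mean(pretest[i] for i in lowest), mean(posttest[i] for i in lowest)

low_pre_mean, low_post_mean = simulate_regression_to_mean()
```

The selected group's posttest mean moves back toward the overall mean of 100 even though nothing was done to them, which is exactly the pattern that can be mistaken for a program effect.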

Threats Relevant to Two-Group Quasi-experimental Designs

Selection Bias

Selection bias refers to any difference between the groups before the start of the study. Two common ways selection bias can occur are through self-selection and experimenter selection. Self-selection bias is when the participants themselves choose which

group they are in. In the Head Start example, parents who choose to be in the program may have more motivation to teach their

kids certain skills. In this example, kids may improve not because of the program but because the families choosing to be in Head Start were more

motivated at the outset than those who chose not to be in the program. Experimenter selection bias is when the researcher

chooses who is assigned to each group (e.g., program vs. control group). In the Head Start study, for instance, because of


scarce resources, there may only be a limited number of slots available in the program, so the families that contact the program

first are assigned to the program group and the families who contact the program last are assigned to the control. The families

that contact the program earliest may be more motivated to educate their kids, and, therefore, the kids do better by the end of

the study, not because of the effectiveness of the program but because of initial differences between the characteristics of the program and control groups. Because the researcher assigned families to the groups in a way that selected more motivated

families, the results are due to experimenter selection bias. As long as we start out with comparable groups initially, the previous

threats to internal validity are not a problem. It is very difficult to ensure that groups are completely comparable initially,

however. The general weakness of two-group designs is selection factors, but selection can take on a number of different forms.

All of the following represent a differential change in the groups as a result of selection plus other threats.

Selection & History: Head Start kids are more motivated and more likely to watch Sesame Street when it came on.

Selection & Maturation: Head Start kids are more motivated and thus more likely to develop skills faster than those in control.

Selection & Testing: Head Start kids are more motivated so they are more likely to figure out the tricks to the math questions.

Consequently, with a second testing, they improved.

Selection & Instrumentation: Perhaps observers in the control group get bored and are less likely to pick up on kids' improvement, so

it looks like the program kids do better.

Selection & Mortality: In the control group, the most motivated kids and families are more likely to drop out of the study

because they found another preschool opportunity. This leads to differential attrition in which the kids most likely to improve will

leave the control group. At posttest, the Treatment group will be better than the control group.

Selection & Regression: Kids who are the furthest behind or have the most disadvantaged families are assigned to the

Treatment group. Because they are more extreme to begin with, they show greater improvement due to regression rather than program effects.

Threats Relevant to Social Interaction

Diffusion or Contamination


This occurs when a comparison group learns about the program either directly or indirectly from program group participants. In

a school context, children from different groups within the same school might share experiences during lunch hour. Or,

comparison group students, seeing what the program group is getting, might set up their own experience to try to imitate that of

the program group. In either case, if the diffusion or imitation affects the posttest performance of the comparison group, it can jeopardize your ability to assess whether your program is causing the outcome. Notice that this threat to validity tends

to equalize the outcomes between groups, minimizing the chance of seeing a program effect even if there is one.

Compensatory Rivalry

Here, the comparison group knows what the program group is getting and develops a competitive attitude with them. The

students in the comparison group might see the special math tutoring program the program group is getting and feel jealous. This

could lead them to deciding to compete with the program group "just to show them" how well they can do. Sometimes, in

contexts like these, the participants are even encouraged by well-meaning teachers or administrators to compete with each other

(while this might make educational sense as a motivation for the students in both groups to work harder, it works against our

ability to see the effects of the program). If the rivalry between groups affects posttest performance, it could make it more

difficult to detect the effects of the program. As with diffusion and imitation, this threat generally works in the direction of

equalizing the posttest performance across groups, increasing the chance that you won't see a program effect, even if the

program is effective.

Resentful Demoralization

This is almost the opposite of compensatory rivalry. Here, students in the comparison group know what the program group is

getting. But here, instead of developing a rivalry, they get discouraged or angry and they give up (sometimes referred to as the "screw you" effect!). Unlike the previous two threats, this one is likely to exaggerate posttest differences between groups,

making your program look even more effective than it actually is.

Compensatory Equalization of Treatment

This is the only threat of the four that primarily involves the people who help manage the research context rather than the

participants themselves. When program and comparison group participants are aware of each other's conditions they may wish

they were in the other group (depending on the perceived desirability of the program it could work either way). Often they or

their parents or teachers will put pressure on the administrators to have them reassigned to the other group. The administrators

may begin to feel that the allocation of goods to the groups is not "fair" and may be pressured to, or independently undertake to, compensate one group for the perceived advantage of the other. If the special math tutoring program was being done with

state-of-the-art computers, you can bet that the parents of the children assigned to the traditional non-computerized comparison

group will pressure the principal to "equalize" the situation. Perhaps the principal will give the comparison group some other

good, or let them have access to the computers for other subjects. If these "compensating" programs equalize the groups on

posttest performance, it will tend to work against your detecting an effective program even when it does work. For instance, a

compensatory program might improve the self-esteem of the comparison group and eliminate your chance to discover whether

the math program would cause changes in self-esteem relative to traditional math training.

9. Discuss the major threats to the external validity of your specific study. Do not just list terms. For each threat

you discuss, explain why it is a threat.

External validity is related to generalizing. That's the major thing you need to keep in mind. Recall that validity refers to the

approximate truth of propositions, inferences, or conclusions. So, external validity refers to the approximate truth of

conclusions that involve generalizations. Put in more pedestrian terms, external validity is the degree to which the conclusions in

your study would hold for other persons in other places and at other times.

A threat to external validity is an explanation of how you might be wrong in making a generalization. For instance, you conclude

that the results of your study (which was done in a specific place, with certain types of people, and at a specific time) can be

generalized to another context (for instance, another place, with slightly different people, at a slightly later time). There are three major threats to external validity because there are three ways you could be wrong: people, places or times. Your critics

could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the

study. Or, they could argue that it might only work because of the unusual place you did the study in (perhaps you did your

educational study in a college town with lots of high-achieving educationally-oriented kids). Or, they might suggest that you did

your study in a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issues

the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the

week before.

How can we improve external validity? One way, based on the sampling model, suggests that you do a good job of drawing a

sample from a population. For instance, you should use random selection, if possible, rather than a nonrandom procedure. And, once selected, you should try to assure that the respondents participate in your study and that you keep your dropout rates low.

A second approach would be to use the theory of proximal similarity more effectively. How? Perhaps you could do a better job

of describing the ways your contexts and others differ, providing lots of data about the degree of similarity between various

groups of people, places, and even times. You might even be able to map out the degree of proximal similarity among various

contexts with a methodology like concept mapping. Perhaps the best approach to criticisms of generalizations is simply to

show them that they're wrong -- do your study in a variety of places, with different people and at different times. That is, your

external validity (ability to generalize) will be stronger the more you replicate your study.

Overall Issues with external validity

Do the results generalize to the extent that the research claims they do?

- Generalize from sample to target population?

- Generalize to other settings, situations, or locations?

- Generalize from research arrangement to real world? The participant is doing something that directly or indirectly

generates the behavior that is being measured. Will the results generalize to other tasks or stimuli?

- Will the findings continue to apply as society changes over the years (societal/temporal changes)

10. Given the design and the operationalized variables you propose, what statistical tools will you use to test the hypothesis(es)? (If you are doing regression, what kind of regression will you perform? Will there be anything

special about the standard errors you use?)

Regression-Adjusted Impact Estimates

Random assignment should produce Treatment and control groups that do not differ systematically on a range of characteristics,

thus reducing the need for statistical modeling that controls for baseline differences between the Treatment and C groups. It

should theoretically be possible to just compare means across the groups, or run a bivariate regression of the outcome variable

on the Treatment indicator.

But, many think tanks use multiple regression to adjust for random baseline differences among the groups. (Note that random

assignment does not guarantee that Treatment and control groups are identical on all characteristics that could relate to differences in the outcome variable (Orr: 188)). Taking account of these extant differences can improve the power of the

design, allowing the researcher to detect smaller program effects than would otherwise be visible for the given n (Orr: 188).

Collecting the baseline data on assorted covariates to permit regression-adjusted impact estimates is often cheaper than

increasing the sample size n to increase the power of the design. This basic model estimates the average impact of the program

on the entire Treatment group, not on any subgroup.

One baseline covariate that goes into the model could well be the pre-program score on the outcome variable. Be careful

choosing covariates! Only characteristics that are prior to random assignment (pre-Treatment) should be used as covariates in

the model. Including post-Treatment characteristics in the model could bias the estimate of the effect of Treatment on the

outcome variable. Aside from being pre-Treatment, the control variables should be correlated with the outcome and not

affected by the Treatment.

Adjusted Means

The adjusted mean of Y (the outcome variable) for a given group is the regression function for that group evaluated at the mean

values across all the groups for the covariates. If $\mu_x$ is the mean of x for the groups combined, then the adjusted mean of Y for

any particular group is the group's regression function (the predicted equation for that group) evaluated at $\mu_x$ (evaluated at $\bar{x}$ in

practice). Note that $\bar{x}$ is the average across all cases in the dataset for variable x. Note: If the adjusted means differ greatly from

the arithmetic means, this indicates that the groups had very different means on covariates in the model, and the results you are

getting from calculating adjusted means are premised on two assumptions that may be unrealistic: (1) that it makes sense to

adjust the groups on this covariate, and (2) that the relationship between the DV and the covariate has the same linear form

within each group, even if the mean of the covariate is quite different between groups. This latter assumption in particular may be

unwarranted.
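To make the definition concrete, here is a minimal sketch in Python of adjusted means on simulated data. Everything in it is an assumption made for illustration: the variable names, the sample size, the simulated effect of 4 points, and in particular the common-slope model (one regression with a group indicator and a single covariate), which is exactly assumption (2) above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
group = np.repeat([0, 1], n // 2)                 # 0 = control, 1 = Treatment
# Hypothetical covariate; the Treatment group starts about 3 points higher
x = rng.normal(50, 10, size=n) + 3.0 * group
# Hypothetical outcome with a simulated Treatment effect of 4 points
y = 20 + 0.5 * x + 4.0 * group + rng.normal(0, 5, size=n)

# Common-slope model: Y = b0 + b1*group + b2*x
design = np.column_stack([np.ones(n), group, x])
b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]

x_bar = x.mean()                                  # mean of x across ALL cases

# Each group's regression function evaluated at the overall mean of x
adj_mean_control = b0 + b2 * x_bar
adj_mean_treat = b0 + b1 + b2 * x_bar

raw_mean_control = y[group == 0].mean()
raw_mean_treat = y[group == 1].mean()

print("raw means:     ", raw_mean_control, raw_mean_treat)
print("adjusted means:", adj_mean_control, adj_mean_treat)
```

In this setup the gap between the two adjusted means equals the coefficient on the group indicator, while the gap between the raw means also picks up the baseline difference in x.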

The Difference Estimator

$Y_i = \beta_0 + \beta_1 X_i + u_i$

$Y_i$ is the post-Treatment score earned by case $i$

$X_i$ is the Treatment indicator (or level) to which case $i$ was assigned

$\beta_1$ is the Treatment effect, the causal effect of a unit change in X, but if any differences exist between the Treatment and

Control groups before Treatment (i.e. any randomization failure), it will be biased

If X is binary (Treatment or no Treatment), then $\beta_1$ is the difference estimator: $\beta_1 = E(Y \mid X = 1) - E(Y \mid X = 0)$

Avg score for Treatment group after Treatment - Avg score for Control group after Treatment: $\hat{\beta}_1 = \bar{Y}_{T} - \bar{Y}_{C}$
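As an illustration, the following Python sketch checks on simulated data that the OLS slope on a binary Treatment indicator is the same number as the simple difference in group means. The data-generating values (a baseline of 50, a true effect of 5 points, noise with standard deviation 10) are hypothetical and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 2, size=n)                    # random assignment: 1 = Treatment, 0 = Control
y = 50.0 + 5.0 * x + rng.normal(0, 10, size=n)    # hypothetical post-Treatment scores

# Bivariate regression of the outcome on a constant and the Treatment indicator
design = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(design, y, rcond=None)[0]

# Avg score for Treatment group minus avg score for Control group
diff_in_means = y[x == 1].mean() - y[x == 0].mean()

print("OLS slope on Treatment:", beta[1])
print("difference in means:   ", diff_in_means)
```

The two printed values agree up to floating-point error, which is why the bivariate regression and the comparison of means are interchangeable here.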

Regression-Adjusted Difference Estimator

If there was true random assignment, then the difference estimator is unbiased

But it could be made more efficient (smaller variance) through the addition of relevant covariates to the model

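$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$

where $W_{1i}, \dots, W_{ri}$ are the pre-Treatment covariates for case $i$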

Careful: Do NOT interpret coefficients on other covariates as causal!

Do not put any post-Treatment variables into the model!
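A rough sketch of the efficiency gain, again on simulated data: the same experiment is analyzed with and without a baseline covariate. The covariate name `pretest`, the simulated coefficients, and the use of conventional (homoskedastic) OLS standard errors are all assumptions made for this example; a real analysis might call for robust or clustered standard errors.

```python
import numpy as np

def ols(design, y):
    """Return OLS coefficients and conventional (homoskedastic) standard errors."""
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 1000
pretest = rng.normal(50, 10, size=n)              # measured BEFORE random assignment
x = rng.integers(0, 2, size=n)                    # 1 = Treatment, 0 = Control
y = 10 + 0.8 * pretest + 5.0 * x + rng.normal(0, 5, size=n)   # simulated effect = 5

ones = np.ones(n)
b_unadj, se_unadj = ols(np.column_stack([ones, x]), y)
b_adj, se_adj = ols(np.column_stack([ones, x, pretest]), y)

print("unadjusted effect:", b_unadj[1], "SE:", se_unadj[1])
print("adjusted effect:  ", b_adj[1], "SE:", se_adj[1])
```

Both estimates target the same Treatment effect; the adjusted one has the smaller standard error because the pretest soaks up outcome variance that is unrelated to Treatment.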

The Differences-in-Differences Estimator AKA Double Difference Estimator (Panel Data)

$\Delta Y_i = \beta_0 + \beta_1 X_i + u_i$

$\Delta Y_i$ is the difference between the post-Treatment score and the pre-Treatment score for case $i$

$\beta_1$ is the differences-in-differences estimator, the causal effect of the Treatment

(Avg score for Treatment group after Treatment - Avg score for Treatment group before Treatment) - (Avg score for Control group after Treatment - Avg score for Control group before Treatment)

$\hat{\beta}_1 = \Delta\bar{Y}_{T} - \Delta\bar{Y}_{C}$

Note: Will be biased if the Treatment and control groups follow different time trends for some reason (failure of parallel trends

assumption).
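The sketch below simulates panel data in which the Treatment group starts out 4 points ahead, both groups share a common time trend of 2 points, and the true effect is 5 points. All of these numbers and variable names are hypothetical. It shows that regressing the change score on the Treatment indicator reproduces the double difference, while a post-only comparison is contaminated by the baseline gap.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.integers(0, 2, size=n)                    # 1 = Treatment, 0 = Control
ability = rng.normal(0, 10, size=n)               # fixed individual trait, cancels in the change score

# Hypothetical baseline gap of 4 points between the groups
pre = 50 + ability + 4.0 * x + rng.normal(0, 3, size=n)
# Common time trend of 2 points plus a simulated Treatment effect of 5 points
post = 50 + ability + 4.0 * x + 2.0 + 5.0 * x + rng.normal(0, 3, size=n)

delta = post - pre                                # change score for each case
design = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(design, delta, rcond=None)[0]

double_diff = (post[x == 1].mean() - pre[x == 1].mean()) - (
    post[x == 0].mean() - pre[x == 0].mean()
)
post_only = post[x == 1].mean() - post[x == 0].mean()

print("regression on change score:", beta[1])
print("double difference:         ", double_diff)
print("post-only difference:      ", post_only)
```

Because the simulation gives both groups the same trend, parallel trends holds by construction; if the trends differed, the double difference would absorb that gap as if it were a Treatment effect.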

The Regression-Adjusted Differences-in-Differences Estimator

$\Delta Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$

Do not interpret coefficients on other covariates as causal

Do not put any post-Treatment variables into the model

Intent to Treat (ITT)

You can use random assignment to create a Treatment group and a C group

But differences can arise between the groups after randomization (e.g. some people drop out of experiment)

If there is attrition from both groups or either group, then you cannot simply compare those who actually end up being treated

to those who actually ended up staying in the control group and assume the difference is the causal impact of the Treatment. These

two groups were not produced by random assignment.

Assessing Impact w/Repeated Cross-Sectional Data

What do you do with the dropouts/non-compliers?

Usually it is not possible to observe their outcomes

But simply excluding them would bias the Treatment evaluation

There is no single, agreed-upon method for handling this

You can build an estimate of what their outcome would have been had they stayed in Treatment

Assume their outcome was the undesirable one to generate conservative estimates of ITT and TOT

ITT is the average effect of assignment to Treatment.

E.g. the average effect of being offered a voucher to move

Does not accurately portray the effect of actually receiving Treatment if there is any non-compliance.

E.g. does not tell us the effect of actually using the voucher to move

Analysis uses all cases, in the groups to which they were randomly assigned. Ignore imperfect compliance. Ignore deviations

from protocol.

If you have crossover (units assigned to Treatment actually being subjected to C and vice versa), ITT usually gives a

conservative estimate of the effect of Treatment

All other estimators involve making model-dependent (and often very sensitive) corrections for crossover or attrition. ITT does

not and therefore is the most robust

If the noncompliance rate in the population is similar to the noncompliance rate in the experiment, then ITT can give a useful

estimate of impact of Treatment were it actually adopted in population.

But recall the limits on external validity that come from having to conduct random assignment among volunteers

If you are doing ITT analysis, say so clearly:

Do not claim to be estimating the causal effect of actually receiving Treatment.

Rather, you are estimating the causal effect of being assigned to Treatment vs. to control.

Calculating ITT

- ITT is the average effect of being assigned to Treatment: the causal effect of assignment to Treatment or control, i.e. the mean

outcome difference when Treatment is offered and when it is not. It is the Difference Estimator we saw last week, $\beta_1 = E(Y \mid X = 1) - E(Y \mid X = 0)$: the difference between the average outcome for cases assigned to the Treatment scenario and the average outcome for cases

assigned to the C scenario. You can compute regression-adjusted or unadjusted impact estimates.

But what if you want to estimate the impact of receiving Treatment? Effect of Treatment on the treated (TOT). TOT > ITT if

you had non-compliance with Treatment.
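A minimal simulated sketch of these quantities follows. It assumes a hypothetical compliance rule (only sufficiently motivated people who are offered Treatment actually show up), no crossover from the control group into Treatment, and a program effect of exactly zero on no-shows. The TOT line uses the standard no-show adjustment, ITT divided by the participation rate, which is valid only under that zero-effect assumption. All names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
assigned = rng.integers(0, 2, size=n)             # random assignment: 1 = offered Treatment
motivation = rng.normal(0, 1, size=n)             # unobserved trait that also raises outcomes

# Hypothetical compliance rule: only motivated people who are offered Treatment show up
participates = (assigned == 1) & (motivation > -0.5)

effect = 6.0                                      # simulated effect of actually receiving Treatment
y = 50 + 3.0 * motivation + effect * participates + rng.normal(0, 5, size=n)

# ITT: compare everyone as randomized, ignoring who actually showed up
itt = y[assigned == 1].mean() - y[assigned == 0].mean()

# No-show adjustment: scale ITT by the share of the Treatment group that participated
participation_rate = participates[assigned == 1].mean()
tot = itt / participation_rate

# Naive comparison of actual participants with controls (groups NOT formed by randomization)
naive = y[participates].mean() - y[assigned == 0].mean()

print("ITT:  ", itt)
print("TOT:  ", tot)
print("naive:", naive)
```

In this simulation ITT is smaller than the effect of receiving Treatment, the adjusted TOT lands near the simulated 6 points, and the naive comparison overshoots because participants are more motivated than the average control.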

- One approach, which works if the impact of the program on non-participants is zero, is the no-show adjustment: divide the ITT estimate by the share of the Treatment group that actually participated. This

approach makes no assumptions about whether those who drop out are similar to those who participate. It only assumes that

the effect of the program on non-participants is zero. This is probably a valid assumption in voluntary programs but not

necessarily a valid assumption when the program is mandatory (e.g. work for welfare).

- King et al. arguments against regression adjustments

Randomization allows model-free impact estimates. These estimates are not dependent on particular decisions and thus are not

going to vanish with slight changes in modeling decisions.

If regression adjustments suddenly made a Treatment effect appear significant when it did not without those adjustments, there

was probably something wrong with the experiment. It is possible that a post-Treatment variable was put into the model amid

these adjustments.

They say it is better to do a good job at the design stage than to try to compensate for poor designs after the fact

Matching

-A way to try to create a comparison group that looks very similar to the Treatment group when you did not have random

assignment to generate these groups

-A way to pre-process an observational dataset before running regression

-A way to thereby reduce the degree to which the results one gets depend on the model one specifies

- Get your dataset

* Run a series of matching procedures

* Make sure that matching is improving the balance on the covariates (ideally all of them, but minimally those that you have theoretical reason to think are most important)

* Run regression analysis, but do so using the matched dataset rather than the original dataset

* Calculate the simple difference in means between the Treatment and Comparison groups formed from matching

* Except where exact matching is possible (rarely), you should still use regression adjustments rather than just calculating the

difference in means

One-to-one exact matching

Match each Treatment unit with one C unit. For the matched cases, values are identical on all covariates on which you

matched.

If feasible, this procedure eliminates all dependence on functional form when running regression.

Here, you can just do a difference in means test

But it is usually infeasible.

General Exact Matching

Matches all control units to a Treatment unit with exactly the same covariate values

So one Treatment case could be matched to 3 C cases

Need to use a weighted difference in means to account for there being a different number of Treatment and C units
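One way to sketch general exact matching with a weighted difference in means is shown below, using only the Python standard library. The records, covariate values, and the function name `exact_match_att` are hypothetical. The weighting scheme (each matched stratum weighted by its number of Treatment units, which targets the effect on the treated) is one reasonable choice, not the only one.

```python
from collections import defaultdict

# Hypothetical records: (treated flag, covariate values, outcome)
records = [
    (1, ("F", "low"), 12.0),
    (0, ("F", "low"), 9.0),     # three C cases share this Treatment case's covariates
    (0, ("F", "low"), 11.0),
    (0, ("F", "low"), 10.0),
    (1, ("M", "high"), 20.0),
    (1, ("M", "high"), 22.0),
    (0, ("M", "high"), 18.0),
    (0, ("M", "low"), 7.0),     # no Treatment case with these values, so it is dropped
]

def exact_match_att(rows):
    """Weighted difference in means over strata with identical covariate values."""
    strata = defaultdict(lambda: {1: [], 0: []})
    for treated, covariates, outcome in rows:
        strata[covariates][treated].append(outcome)

    weighted_sum = 0.0
    total_treated = 0
    for groups in strata.values():
        treat, control = groups[1], groups[0]
        if not treat or not control:
            continue                              # unmatched stratum: no comparison possible
        gap = sum(treat) / len(treat) - sum(control) / len(control)
        weighted_sum += len(treat) * gap          # weight by number of Treatment units
        total_treated += len(treat)

    if total_treated == 0:
        raise ValueError("no Treatment unit has an exact match")
    return weighted_sum / total_treated

estimate = exact_match_att(records)
print(estimate)
```

Here the ("F", "low") stratum contributes a gap of 2 with weight 1 and the ("M", "high") stratum a gap of 3 with weight 2, so the weighted estimate is 8/3.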

Nearest Neighbor Matching on the Propensity Score

Match each Treatment unit with the C unit having the most similar estimated propensity score

Less restrictive than exact matching

Check for Balance Improvements

Check Matching

Best: QQ plots for each covariate

Compare the histograms

Compare the mean of each covariate for the Treatment and the C group before vs. after matching. The smaller the

difference the better. Do not carry out difference of means t-test to assess balance (see Ho et al. 2011: 4)

Check the balance even for covariates that are not part of the matching procedure.

What if the data cannot be improved? The data may simply be too fragile to permit causal inference

Matching only works when the data permit it to work

Data where all Treatment cases are severely different from all C cases may not be amenable to matching.

Linear Probability Model (Vs. Logistic Regression)

LPM is the linear multiple regression model applied to a binary DV. It models the probability that the DV is 1.

If Y is a binary DV, and we use the linear regression model:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + u_i$

One advantage is that it is easy to use. Coefficients are directly interpretable, whereas those from probit and logit are not.

A drawback is that it can predict probabilities less than 0 or greater than 1, which make no sense but are an inevitable consequence of linear regression. It also cannot capture nonlinear relationships. Probit and logit force predicted values

to be between 0 and 1 and are specifically designed for binary DVs.
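The out-of-range problem is easy to see in a small simulation. The sketch below generates a binary outcome from a logistic curve (the slope of 1.5 and the range of x are arbitrary choices for illustration), fits an LPM by ordinary least squares, and counts how many fitted "probabilities" fall outside the unit interval.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(-4, 4, size=n)                    # hypothetical regressor with a wide range
p_true = 1 / (1 + np.exp(-1.5 * x))               # true probabilities follow a logistic curve
y = (rng.uniform(size=n) < p_true).astype(float)  # binary DV

# Linear probability model: OLS of the 0/1 outcome on a constant and x
design = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.lstsq(design, y, rcond=None)[0]

fitted = b0 + b1 * x                              # predicted "probabilities" from the LPM
share_outside = np.mean((fitted < 0) | (fitted > 1))

print("LPM slope (change in P(Y=1) per unit of x):", b1)
print("min/max fitted value:", fitted.min(), fitted.max())
print("share of fitted values outside [0, 1]:", share_outside)
```

The slope reads directly as a change in probability per unit of x, which is the interpretability advantage, but the straight line inevitably leaves the unit interval at the extremes of x.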

11. Describe at least one robustness check that you will do in addition to the analyses you have described so far.

If you make assumptions about data availability, or about how programs work, clearly state those assumptions. If

the prompt indicates that data is unavailable, or that a program will not be administered in some particular way, do

not contradict those statements with your assumptions. The program described below is hypothetical.
