program eval review
7/27/2019 Program Eval Review
Program Evaluation Study Guide
Below is a description of a program. By answering each of the 11 questions listed here, explain the research design
you would use to carry out an evaluation of this program. Number your responses to each of the 11 questions.
1. Clearly state the evaluation question(s) you seek to answer.
Make sure it is clear, focused, and appropriately complex.
2. Clearly state the null hypothesis(es) that your analysis will test.
The null hypothesis is a statement that you want to test. In general, the null hypothesis is that things are the same as each other,
or the same as a theoretical expectation. Here the slope of the treatment should be zero indicating no change. The alternative
hypothesis is that things are different from each other, or different from a theoretical expectation.
3. Define the unit of analysis that your dataset(s) will use.
The unit of analysis is the major entity that you are analyzing in your study. Why is it called the 'unit of analysis' and not
something else (like, the unit of sampling)? Because it is the analysis you do in your study that determines what the unit is. For
instance, if you are comparing the children in two classrooms on achievement test scores, the unit is the individual child because
you have a score for each child. On the other hand, if you are comparing the two classes on classroom climate, your unit of
analysis is the group, in this case the classroom, because you only have a classroom climate score for the class as a whole and
not for each individual student. For different analyses in the same study you may have different units of analysis. If you decide to
base an analysis on student scores, the individual is the unit. But you might decide to compare average classroom performance. In this case, since the data that goes into the analysis is the average itself (and not the individuals' scores) the unit of analysis is
actually the group. Even though you had data at the student level, you use aggregates in the analysis. In many areas of social
research these hierarchies of analysis units have become particularly important and have spawned a whole area of statistical
analysis sometimes referred to as hierarchical modeling. This is true in education, for instance, where we often compare
classroom performance but collected achievement data at the individual student level.
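The shift from individual to group as the unit of analysis can be illustrated with a minimal Python sketch. The classroom labels and scores are hypothetical and only show how aggregating changes what counts as one observation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student-level records: (classroom, test score).
students = [
    ("A", 70), ("A", 80), ("A", 90),
    ("B", 60), ("B", 65), ("B", 85),
]

# Student-level analysis: one row per child, so the unit is the individual.
individual_units = len(students)

# Classroom-level analysis: collapse to one average per classroom,
# so the unit of analysis becomes the group.
scores_by_class = defaultdict(list)
for classroom, score in students:
    scores_by_class[classroom].append(score)
class_means = {c: mean(s) for c, s in scores_by_class.items()}

print(individual_units, class_means)
```

Here the same data yield six units when analyzed by student but only two when analyzed by classroom average.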
4. Define the dependent variable(s) and then describe how you would operationalize (measure) the dependent
variable(s).
Operationalizing a variable means creating a clear protocol (list of steps) for how to quantitatively measure the concept (i.e. to change the concept into a variable with measurable indicators). The protocol must be specific and precise enough so that
someone else could use your operational definition and obtain the same results you did. To sum up, operationalization means to
add measurable indicators to concepts to make them measurable variables.
5. Define the key explanatory variable(s) and then describe how you would operationalize (measure) the
independent key explanatory variable(s).
See above
6. Make a list of control variables that you will include in your analysis(es)
A control variable is a variable that is held constant in a research analysis. The use of control variables is generally done to
answer four basic kinds of questions: 1. Is an observed relationship between two variables just a statistical accident? 2. If one
variable has a causal effect on another, is this effect a direct one or is it indirect with other variable intervening? 3. If several
variables all have causal effects on the dependent variable, how does the strength of those effects vary? 4. Does a particular
relationship between two variables look the same under various conditions? Common variables include age, gender, race,
income, education, weight, and marital status.
7. Describe your research design. Do not just give a term. Explain exactly what your research design will be. If your research design involves one or more Treatment groups and control or comparison groups, clearly explain
what each of these groups is and how these groups will be formed.
Experiment (Gold Standard)
Each subject has the same chance of being in the Treatment or the control group
One subject's assignment is independent of any other subject's assignment
Selection into one group or the other is unrelated to the individual's characteristics
If assignment to the Treatment group vs. the control group is truly random, then they should on average have the same characteristics.
Works best with large numbers of subjects
Limits of Experiments
Cannot yield estimates of the impact on any unit higher than the kind of unit sampled and randomly assigned
Ex: An experiment that randomly assigned individuals to Treatment and control groups cannot give estimates of program
impact on neighborhoods, cities, the economy, etc.
Cannot tell us the impact on specific individuals. Can only give an estimate of the average impact on groups of
participants. External validity is questionable
Classic Experiment
Random assignment of subjects into the Treatment and control group
Pre-test administered to both groups, measuring each on the DV
Treatment group gets the Treatment. Control group does not. Both groups should otherwise be exposed to the same
conditions.
Post-test administered to both groups, measuring each on the DV
Difference in difference is attributed to the Treatment
Treatment is the explanatory variable
Treatment Group: DV1 Treatment DV2
Control Group: DV1 ( ) DV2
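The difference-in-difference logic of the classic experiment can be sketched in a few lines of Python. The scores below are hypothetical; the point is only the arithmetic of subtracting the control group's change from the Treatment group's change.

```python
from statistics import mean

# Hypothetical pre-test (DV1) and post-test (DV2) scores.
treatment_pre, treatment_post = [50, 55, 60], [65, 70, 75]
control_pre, control_post = [52, 54, 59], [57, 59, 64]

treatment_change = mean(treatment_post) - mean(treatment_pre)  # DV2 - DV1, Treatment
control_change = mean(control_post) - mean(control_pre)        # DV2 - DV1, control

# Difference in differences: the change in the Treatment group beyond the
# change the control group experienced anyway.
did_estimate = treatment_change - control_change
print(did_estimate)
```

In practice the same quantity is usually estimated with a regression so that a standard error and control variables can be included.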
Randomized Post-Test Only Experiment
Treatment is the explanatory variable
Treatment Group: Treatment DV2
Control Group: ( ) DV2
No pretest is needed: with random assignment, the groups are expected to be equivalent on average before the Treatment.
Randomized Block Design
Instead of random assignment at the unit level, first assign units to blocks that reflect known differences among subjects,
then randomly assign units from within each block
Blocks are internally homogeneous
Example: If an experimenter had reason to believe that education level would be a significant factor in the effect of a
given program, she/he might first divide the experimental subjects into education level groups: less than high school, high
school, some college, bachelors, graduate. Then, within each education level group, individuals would be assigned to
Treatment groups using random assignment.
Advantage: can lower variance and thus yield narrower confidence intervals
This is NOT the same thing as randomly assigning at the cluster level rather than the unit level. You are still carrying out
random assignment at the unit level and can carry out analyses at the unit level.
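A minimal sketch of block randomization, assuming hypothetical subject IDs and education-level blocks. The function name `block_randomize` and the half/half split within each block are illustrative choices, not part of the study guide.

```python
import random
from collections import defaultdict

def block_randomize(subjects, block_of, seed=0):
    """Randomly assign subjects to Treatment/control within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for s in subjects:
        blocks[block_of[s]].append(s)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)              # random order within the block
        half = len(members) // 2
        for i, s in enumerate(members):
            assignment[s] = "Treatment" if i < half else "control"
    return assignment

# Hypothetical subjects and their education-level blocks.
education = {
    "s1": "high school", "s2": "high school", "s3": "high school", "s4": "high school",
    "s5": "bachelors", "s6": "bachelors", "s7": "bachelors", "s8": "bachelors",
}
assignment = block_randomize(list(education), education)
print(assignment)
```

Assignment still happens at the unit level; the blocks only guarantee that each education level is balanced across the two groups.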
Quasi-Experiments (When experiments just aren't an option)
Need to assess how similar the comparison group is to the Treatment group. The more similar they are, the better
Are the groups' pre-program means similar? Are the SDs of these pre-program means also similar?
Do historical forces affect both groups the same way and to the same degree?
Are rates of participant loss similar?
Are the groups maturing at the same rate?
Did people self-select into one group or the other? If so, this is bad
Lack of randomization leads to internal validity problems
Regression Discontinuity Design
Considered to be one of the strongest quasi-experimental designs
Estimates the effect of program eligibility (not receipt) on some outcome
Requirements:
Continuous outcome variable (DV)
Pre-test and post-test (pre-program and post-program) data for all respondents
May be a pre-test on the outcome variable
Some cutoff score (cut-point) on the assignment variable puts cases on one side of the score into Treatment (program)
group and others not
The cut-point must not be designed to deliberately assign some cases to Treatment and others not.
Other than assignment to receive the program vs. not, the cases on one side or the other of the cut-point need to be
treated in the same way
Assignment Variable/Rating Variable
The assignment variable can be any continuous variable measured before the Treatment. It must not be a variable
that the Treatment causes or influences.
Needs to relate to the DV in an unbroken (though not necessarily a linear) manner throughout its range.
Absent Treatment, we should not see any discontinuities if we plot the relationship between the outcome variable
and the assignment variable.
When graphing for RD, the assignment variable goes on the x-axis of a graph. The y-axis is the post-Treatment
outcome. Drop a vertical line at the cut-point.
Sharp RD Designs: All cases receive the Treatment or non-Treatment to which they are assigned (our focus)
Fuzzy RD Designs: Some cases do not receive the Treatment or non-Treatment to which they were assigned. Problem
of no-shows and/or crossovers
Two types of RDD
- Discontinuity at a cut-point
* The direction and magnitude of the jump at the cut-point is taken as a measure of the causal effect of Treatment on the
outcome variable for cases near the cut-point.
* Use every observation in the sample, even those that are far to the right or left of the cut-point
* Global/ Parametric Estimation
* Recommended: Center the assignment variable on the cut-point by creating a new version of the score
(centered score = original score - cut-point)
Logic: Pre-test score should be a strong predictor of the post-test score. Is the program (the Treatment) also a strong
predictor of post-test?
*Coefficient on Program indicator is the program effect
*Important: Try different functional forms when modeling
*Linear relationship b/w assignment variable and posttest score
*Quadratic
*Cubic
*Interactions with Treatment
* If we have specified the functional form of the relationship between the outcome and assignment variable correctly, the
estimator (coefficient on program) will be unbiased estimator of the mean program impact at the cut-point.
* If functional form is incorrect, it will be biased.
* More power, but more potential for bias, than the local approach
* Analyze with regression using parametric techniques.
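The global approach can be sketched as follows, under loud assumptions: the data are simulated with a known program effect of 5 points, cases scoring below a hypothetical cut-point of 50 receive the program, and `ols` is a small illustrative helper (normal equations solved by elimination) rather than a statistics library routine. The sketch fits a linear and a quadratic functional form and reads the program effect off the coefficient on the program indicator.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

cut_point = 50                                   # hypothetical cut-point
scores = list(range(30, 71))                     # assignment variable (pre-test)
treated = [1 if s < cut_point else 0 for s in scores]
# Simulated post-test with a built-in program effect of 5 points.
posttest = [20 + 0.5 * s + 5 * t for s, t in zip(scores, treated)]

centered = [s - cut_point for s in scores]       # center on the cut-point

# Columns: intercept, program indicator, centered score (and its square).
linear = ols([[1, t, r] for t, r in zip(treated, centered)], posttest)
quadratic = ols([[1, t, r, r * r] for t, r in zip(treated, centered)], posttest)

# The coefficient on the program indicator (index 1) is the estimated effect.
print(round(linear[1], 3), round(quadratic[1], 3))
```

Because the simulated relationship really is linear, both forms recover the built-in effect; with real data, estimates that move a lot across functional forms are a warning sign of misspecification.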
- Nonparametric/ Local randomization
* Differences between cases who pass the cut-point and those who just miss it are random
* Example: random error determines whether someone scores right above or right below the cut-point on some exam
(e.g. scores a 49 or a 51)
* So on average the only difference between those just above and just below the cut-point (within the bandwidth) will be
exposure to the program
* So compare the mean outcomes for cases just to the left and just to the right of the cut-point
* This is the "how near" question
* Selecting the right bandwidth is the challenge with this method
* Less potential for bias than with the global approach, but also more limited power given more limited n
* Simplest scenario is to calculate a simple difference of means: Using just those in the bandwidth, compare the mean
outcome for those who got Treatment and those who did not
* But can produce a biased estimator unless the bandwidth is exceedingly narrow
* Solution? Local linear regression
* Local linear regression is acceptable since the functional form within the limited range is likely to be linear
$Y_i = \alpha + \beta_0 T_i + \beta_1 r_i + \beta_2 (T_i \times r_i) + \varepsilon_i$
Where:
$r_i$ is case i's score on the assignment variable, centered at the cut-point
$T_i$ is 1 if case i is in the program (Treatment) and 0 otherwise
The resulting intercept for the control group estimates the mean outcome w/o the program
The resulting intercept for the program group estimates the outcome with the program
The difference between them is an estimate of the Treatment effect
* Can calculate a simple difference of means if using just those cases close to the cut-point.
You can try both parametric and non-parametric approaches
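A minimal sketch of local linear regression within a bandwidth, under stated assumptions: the data are simulated with a true effect of 5 and slight curvature, cases below a hypothetical cut-point of 50 are treated, and `fit_line` and `local_linear_rd` are illustrative helper names. Fitting a separate line on each side of the cut-point is one way to let the slope differ between the program and control groups; the effect is the gap between the two intercepts at the cut-point.

```python
def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def local_linear_rd(score, outcome, cut_point, bandwidth):
    """Estimated effect at the cut-point; cases below it are treated (assumption)."""
    pairs = [(s - cut_point, o) for s, o in zip(score, outcome)]  # center scores
    left = [(r, o) for r, o in pairs if -bandwidth <= r < 0]      # program group
    right = [(r, o) for r, o in pairs if 0 <= r <= bandwidth]     # control group
    program_intercept, _ = fit_line(*zip(*left))
    control_intercept, _ = fit_line(*zip(*right))
    return program_intercept - control_intercept

cut_point = 50
scores = list(range(30, 71))
# Simulated outcome: mild curvature plus a true program effect of 5.
outcome = [45 + 0.5 * (s - 50) + 0.01 * (s - 50) ** 2 + (5 if s < 50 else 0)
           for s in scores]

narrow = local_linear_rd(scores, outcome, cut_point, bandwidth=5)
wide = local_linear_rd(scores, outcome, cut_point, bandwidth=20)
print(round(narrow, 3), round(wide, 3))
```

In this simulation the narrow bandwidth lands closer to the true effect because a straight line fits the curved relationship better over a short range, at the cost of using fewer cases.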
It is useful to couple the analyses above with a test of whether other DVs that are similar to your DV of interest, but not
expected to be influenced by the program, exhibit discontinuity at the cutoff point
i.e. Look at a similar DV that you do not expect to be impacted by the program (such other DVs are called control-construct
measures)
You hope they do NOT show discontinuity at the cutoff point
Suppose DV of interest is a math exam and the Treatment is a math program
Good control construct measure: Reading exam
Bad control construct measure: Physics exam
Internal Validity Issues:
Cut point must have been chosen independent of knowledge of cases' scores on the assignment variable, and scores must have
been assigned independent of the cut point. Should check for manipulation at the cut point.
Assessing RD Internal Validity:
Learn as much as possible about how cases were scored on the assignment variable and about how cut point was set
Plot the probability of receiving Treatment (y-axis) against the assignment variable (x-axis). There should be a jump (ideally a
jump of 1) at the cut-point.
Plot the relationship between non-outcome variables (y-axis) and the assignment variable (x-axis). These non-outcome
variables should not be affected by Treatment. There should NOT be a jump at the cut-point.
Plot the assignment variable against the number of observations at each of its values. There should NOT be a discontinuity in
the number of observations just above or just below the cut-point. A jump suggests manipulation of either the scores on the assignment variable, or the choice of the cut-point.
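The check on the number of observations around the cut-point can be sketched as a simple count. The scores, the cut-point of 50, and the window width are hypothetical, and `counts_near_cut_point` is an illustrative helper rather than a formal density test.

```python
def counts_near_cut_point(scores, cut_point, width):
    """Count observations in equal-width windows just below and just above the cut-point."""
    below = sum(1 for s in scores if cut_point - width <= s < cut_point)
    above = sum(1 for s in scores if cut_point <= s < cut_point + width)
    return below, above

# Hypothetical assignment scores: ten cases at every value from 40 to 59.
smooth = [s for s in range(40, 60) for _ in range(10)]
# Same data, except eight cases that scored 49 were bumped up to 50.
manipulated = [s for s in smooth if s != 49] + [49] * 2 + [50] * 8

print(counts_near_cut_point(smooth, 50, 3))
print(counts_near_cut_point(manipulated, 50, 3))
```

A large imbalance between the two counts, like the one in the manipulated data, is the kind of jump that should prompt a closer look at how scores were assigned.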
External Validity Issues (Generalizability)
Criticism of RD is that the estimated impact only applies to observations near the cut-point
But, Lee (2008) argues that the cases around the cut-point are actually fairly heterogeneous since random error plays some
role in determining scores on the assignment variable
Therefore, the results may generalize more broadly
Border Matching
A geographic border clearly separates the Treatment and comparison group.
Look at data just to the side of the geographic border
A weakness is that not every Treatment has clear geographic borders. This works best with small geographic scales.
State borders are not a great way to differentiate the Treatment and comparison group.
It is a strong quasi experimental design.
It addresses history as a threat to internal validity.
Analyze it using regression with control variables.
Instrumental Variables
Can be useful when you have unobserved omitted variables.
Can be useful when you have simultaneity
Must have at least one instrument for each endogenous regressor.
Instrument must be exogenous and relevant.
If an instrument is relevant, this means that variance in the instrument relates to variance in the independent variable. As
an instrument becomes more relevant, it is able to explain more of the variance in the independent variable. This makes
it more useful and a more accurate estimator for regression analysis. In addition, as an instrument becomes more relevant, the normal approximation to the sampling distribution becomes better. If an instrument is unable to explain
variance in the independent variable, it is known as a weak instrument. This is a problem because the normal
distribution becomes a weak approximation to the sampling distribution of the estimator. This problem cannot be fixed
by simply increasing the sample size. This problem can be identified by looking at the F-statistic which tests the
hypothesis that the coefficients on the instruments are zero in the first stage. This can be done when there is a single
endogenous regressor. If the F-statistic is greater than 10 at this stage, your instrument is most likely not weak. If you
have several instrumental variables, it may be a good idea to eliminate some of the weaker ones. In this case you would
also use the most relevant subset for your analysis. If you drop weak instrument variables, your standard errors can
increase but that is okay because the original standard errors bear no significant meaning. On the other hand, if coefficients are exactly identified, you should not eliminate the weak instruments. This is because you might not have
enough needed strong instruments. In this case, you can attempt to identify new strong instrument variables. You could
also use the weak instrument but look at employing different methods of analysis.
If an instrument is exogenous, then a part of the variance in the independent variable identified by the instrumental
variable is exogenous. In other words, the instrument needs to contain information about variation in the independent
variable that is not related to the error term. If an instrument is not exogenous, the two-stage least squares regression
becomes inconsistent. The instrument cannot identify exogenous variation in the independent variable. As a result, the
regression is unable to provide a consistent estimator. It is not possible to test the hypothesis that the instrument is
exogenous when the coefficients are exactly identified. This makes it difficult to detect violations of this assumption.
On the other hand, if a coefficient is overidentified, researchers can test the overidentifying restrictions. This means that they
can test the hypothesis that additional instruments are exogenous using the assumption that there are a sufficient number
of instruments available to identify the coefficients of interest. If the coefficients are exactly identified, the only way to
assess instrument exogeneity is by referencing expert opinion. If there are more instruments than endogenous
regressors, researchers can use a statistical tool called the test of overidentifying restrictions to assess instrument
exogeneity.
It can be tough to satisfy the exogeneity condition.
Not feasible to statistically test the exogeneity condition unless you have overidentification
Analyze using 2SLS
Consider an equation Yi = B0 + B1Xi + ui where Xi is endogenous
IV regression splits Xi into two parts: 1. correlated with ui 2. uncorrelated with ui
The basic idea of IV method is to identify and use only the exogenous variation of Xi and estimate B1
Instrumental variables are variables that mimic the troublesome regressors, but are uncorrelated with the error term.
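The two-stage idea can be sketched with simulated data, under stated assumptions: one endogenous regressor, one instrument, a true B1 of 2, and an illustrative `fit_line` helper. Stage 1 keeps only the part of X predicted by the instrument; stage 2 regresses Y on that part.

```python
import random

def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]                   # instrument (exogenous)
u = [rng.gauss(0, 1) for _ in range(n)]                   # unobserved error term
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]   # endogenous: depends on u
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]         # true B1 = 2

_, ols_slope = fit_line(x, y)        # biased, because x is correlated with u

# Stage 1: regress X on the instrument and keep the fitted values.
a0, a1 = fit_line(z, x)
x_hat = [a0 + a1 * zi for zi in z]
# Stage 2: regress Y on the fitted values from stage 1.
_, iv_slope = fit_line(x_hat, y)

print(round(ols_slope, 2), round(iv_slope, 2))
```

Running the two stages by hand gives the 2SLS point estimate, but the second-stage standard errors from a plain OLS routine are not valid; dedicated 2SLS software corrects them.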
Comparison Group Pre/ Post Test
Identify a group of subjects post-hoc that appear comparable to the group that already received, or is already slated to receive, the Treatment
Match each unit in the Treatment group with a unit that did not receive the Treatment but that is similar on key
characteristics.
Identify and acknowledge systematic differences between the groups later.
Comparison group is not perfectly equivalent to the Treatment group since random assignment did not produce the
groups.
Treatment group: DV1 Treatment DV2
Comparison group: DV1 ( ) DV2
If panel data, can do diff-in-diff w/control variables
If repeated cross-sectional data, unit-level differencing is not possible, but group-level pre/post comparisons can still be analyzed
One group pretest posttest
No comparison group
Better than a post-test only
Difference in means test
Regression using difference method
One group time series
Observations taken at regular, evenly spaced, and frequent intervals
A time series traces the path of a single variable over a period of months, years, etc. Think of that variable as the DV
DV1 DV2 DV3 DV4 DV5 Treatment DV6 DV7 DV8 DV9 DV10
Use panel or TSCS data
TSCS Data: # time periods > # entities
Panel Data: # entities > # time periods
Balanced or unbalanced panel
2 wave panel
Include the intercept term to allow for the possibility that the mean change in the DV is non-zero even if there is no
change in your key EV of interest
Effectively controls for any variable V that does not change over time. This is useful if V is some variable that is
unobservable
You should not apply this method to multi-wave panels. In other words, do not run across several years differencing
year Y and Y+1 repeatedly. Instead, use the multi-wave panel approach.
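A minimal sketch of the two-wave differencing approach, using a hypothetical panel in which each entity's DV contains a fixed, unmeasured trait V, a common shift of 1 between waves, and a true EV effect of 2. The helper `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical panel: (entity, EV wave 1, EV wave 2, DV wave 1, DV wave 2).
# DV = V + (1 in wave 2) + 2 * EV, with unmeasured V = 10, 20, 30, 40.
panel = [
    ("a", 1, 2, 12.0, 15.0),
    ("b", 3, 3, 26.0, 27.0),
    ("c", 2, 5, 34.0, 41.0),
    ("d", 4, 6, 48.0, 53.0),
]

# Differencing removes V because it is the same in both waves.
delta_ev = [ev2 - ev1 for _, ev1, ev2, _, _ in panel]
delta_dv = [dv2 - dv1 for _, _, _, dv1, dv2 in panel]

# The intercept captures the mean change in the DV when the EV does not change.
intercept, slope = fit_line(delta_ev, delta_dv)
print(intercept, slope)
```

The regression on the differences recovers the EV effect without ever measuring V, and the intercept picks up the change that would have happened anyway.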
Multi-wave panel
Use fixed effects regression.
- Entity fixed effects
* Control for variables that vary across entities (e.g. vary across states) but not across time
* Slope is the same for all entities
* Intercept is allowed to differ by entity
- Time fixed effects
* Control for variables that vary across time but not across units (e.g. a national level policy change that affects all
states)
* Each time period has a different intercept.
- Entity and time fixed effects together
* Use when you have some (omitted) variables that vary across entities but are fixed across time, as well as some
(omitted) variables that vary across time but are fixed across entities
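Entity fixed effects can be sketched with the "within" transformation: subtract each entity's own mean from its EV and DV, then regress the demeaned DV on the demeaned EV. The three entities below are hypothetical, with different intercepts and a common true slope of 3; `slope_of` and `demean_by_entity` are illustrative helper names.

```python
from collections import defaultdict

def slope_of(x, y):
    """Slope from a simple OLS regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def demean_by_entity(rows):
    """Subtract each entity's mean EV and DV (the within transformation)."""
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))
    xs, ys = [], []
    for obs in groups.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            xs.append(x - mx)
            ys.append(y - my)
    return xs, ys

# Hypothetical (entity, EV, DV) rows: DV = entity intercept + 3 * EV,
# with intercepts 20, 5, and -10 that are lower where the EV is higher.
rows = [
    ("s1", 1, 23), ("s1", 2, 26), ("s1", 3, 29),
    ("s2", 4, 17), ("s2", 5, 20), ("s2", 6, 23),
    ("s3", 7, 11), ("s3", 8, 14), ("s3", 9, 17),
]

pooled_slope = slope_of([x for _, x, _ in rows], [y for _, _, y in rows])
within_slope = slope_of(*demean_by_entity(rows))
print(pooled_slope, within_slope)
```

Ignoring the entity intercepts gives a pooled slope with the wrong sign here, while the within estimate recovers the common slope.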
Heteroskedasticity may be an issue.
This means the variance (spread) of the errors is not constant across the values of the key explanatory variables.
OLS estimator will be unbiased and consistent but will not be efficient
The standard error on the estimator is likely to be understated, meaning that we may conclude that a coefficient is
statistically significant when it is not
Solution is to use Robust/White/Huber-White standard errors. This is not sufficient, however, if there is any chance
the errors are correlated over time (auto-correlation/serial correlation).
Auto-Correlation and Serial Correlation
There is a correlation (positive or negative) between a variable's score at time t and its score at time t-1.
Pervasive with time-series data.
Identify it using the Durbin-Watson d statistic.
Fix it using HAC (heteroskedasticity and autocorrelation consistent) standard errors
One kind of HAC is the clustered standard error
- Allows errors to be correlated within a cluster but assumes that the errors are uncorrelated across clusters
- For autocorrelated panel data, the cluster is the observations of the same entity across time (e.g. the state is the
cluster)
- Panel Corrected Standard Errors (PCSE)
* Use for TSCS data
- Segmented regression analysis.
Must have a continuous DV
Data must be observed at regular, evenly-spaced intervals.
Generally want at least 12 time points before and 12 time points after intervention, especially if you think there is
seasonal variation.
Segment: A sequence of measures is divided into two or more portions at change points (points in time where we
expect the values of the time series to change because of an event/intervention). Each segment has a level (intercept) and a trend (slope). Each segment is allowed to have its own level and trend.
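A minimal sketch of segmented regression with one change point, assuming a simulated monthly series with 12 points before and 12 after a hypothetical intervention, a built-in level change of 5, and a trend change of 0.5. Fitting a separate line to each segment is one simple way to give each segment its own level and trend; the helper names are illustrative, and the sketch ignores seasonality and autocorrelation.

```python
def fit_line(x, y):
    """Simple OLS of y on x; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def segmented_fit(times, values, change_point):
    """Level and trend change at change_point (first post-intervention time)."""
    pre = [(t, v) for t, v in zip(times, values) if t < change_point]
    post = [(t, v) for t, v in zip(times, values) if t >= change_point]
    pre_int, pre_slope = fit_line(*zip(*pre))
    post_int, post_slope = fit_line(*zip(*post))
    # Level change: gap between the two fitted lines at the change point.
    level_change = ((post_int + post_slope * change_point)
                    - (pre_int + pre_slope * change_point))
    trend_change = post_slope - pre_slope
    return level_change, trend_change

times = list(range(1, 25))          # 12 time points before, 12 after
change_point = 13                   # hypothetical first post-intervention month
values = [10 + t + (5 + 0.5 * (t - change_point) if t >= change_point else 0)
          for t in times]

level_change, trend_change = segmented_fit(times, values, change_point)
print(round(level_change, 3), round(trend_change, 3))
```

The pre-intervention line serves as the counterfactual: the level change is how far the post-intervention line sits above it at the change point, and the trend change is the difference in slopes.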
Comparison Group Time Series
Same as interrupted time series, but now we add in a comparison group that did not experience the event or intervention
and also look at its pattern on that variable over time
8. Describe the major threats to the internal validity of your specific study. Do not just list terms. For each threat
you discuss, explain why it is a threat.
Issues with internal validity
Does the key EV actually account for change we may see in the DV?
Are we sure that the relationship between the assumed EV and the DV does not owe to some other factor?
Selection effects are not a threat to internal validity if the Treatment and C groups are truly formed by random
assignment
But, experiments ARE susceptible to selection-mortality threats to internal validity, especially if one of the conditions
(Treatment or Control) is noxious or intolerable
And, experiments ARE also susceptible to social interaction threats to internal validity
Threats Relevant to the Single Group Pretest-Posttest Quasi-Experimental Designs
History
History refers to any external or historical event that occurred during the course of the study that may be responsible for the
effects instead of the program itself. For instance, in the 70s soon after Head Start began, Sesame Street also began airing.
Because Sesame Street includes lots of educational information, perhaps it could have been responsible for the apparent effects
of Head Start. If we were studying psychotherapy with depressed patients at the time Prozac went on the market, all of our
patients might have gotten better by the end of the study, because they went to a psychiatrist and got a prescription for Prozac.
Maturation
Maturation refers to a natural process that leads participants to change on the dependent measure. For instance, perhaps Head
Start kids start getting better just because they were getting older during the study. One classic problem with many health studies
is that patients get better by themselves (e.g., headache medicine trial). Maturation can refer to a decline as well as an
improvement. For instance, some loss of short term memory abilities as you get older, or divergence in girls' math scores and
boys' verbal scores around middle school age. You can distinguish between history and maturation because maturation is
internal, a natural course of things having to do with some quality of the participants in the study. History has to do with an
external event of some kind.
Instrumentation
Instrumentation refers to an improvement or decline that occurs because of the measure itself. Instrumentation is most commonly found
with observational measures. There is a common problem with observers getting better at observing. As an example, say we
were observing young kids' verbal behaviors in Head Start. It may be that the observers miss a number of behaviors indicative
of good verbal skills during the pretest, but they were more likely to count these behaviors at post-test. Instrumentation can also
be a decline.
Testing
A threat that involves an improvement in scores on the post-test due to having taken the test as a pretest is called testing. People
commonly improve on standardized tests such as intelligence tests, SATs, or GREs. Something about taking the test the first
time leads to a change in the test the second time, such as learning the answers or how the trick questions are set up. Perhaps
the Head Start students learned the types of verbal questions (e.g., analogy) on the pretest and therefore knew how to do them
better by posttest. In this case, it would not be the Treatment that had an effect but the experience of taking the test once that
led to an improvement.
Mortality (Attrition)
Mortality refers to people dropping out of the study during its course. For instance, Head Start children having the most difficulty drop out of the program and by the end of the study the participants who remain have higher academic skills on
average. I wonder whether the recent trends of more frequently expelling students from public schools will help improve a
school's test scores.
Statistical Regression to the Mean
Regression to the mean is the most difficult to understand. It has to do with a sort of statistical fluke. Whenever scores fluctuate
over time for any reason, extreme scores tend to move toward the middle, and middle scores tend to move toward the extreme.
Regression to the mean is most likely to be a problem in a study in which participants are chosen for their extreme values. For
example, an early study evaluating the effects of Sesame Street indicated that it was especially helpful to the most disadvantaged
kids. The kids with the lowest skills improved the most. This could have been because the educational material that Sesame
Street presents was the most helpful to the kids who knew the least, or because those who did especially poorly on the pretest
had lower scores by chance, so that their improvement was simply a matter of random changes back toward their true scores.
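Regression to the mean can be demonstrated with a small simulation in which there is no program at all. The skill and noise distributions are arbitrary assumptions; the point is that a group selected for extremely low pretest scores appears to improve on the posttest purely by chance.

```python
import random
from statistics import mean

rng = random.Random(1)
n = 2000
true_skill = [rng.gauss(50, 10) for _ in range(n)]
# Each test score is true skill plus independent random error; no Treatment occurs.
pretest = [s + rng.gauss(0, 10) for s in true_skill]
posttest = [s + rng.gauss(0, 10) for s in true_skill]

# Select roughly the lowest-scoring 10% on the pretest, as a targeted program might.
cutoff = sorted(pretest)[n // 10]
selected = [i for i, p in enumerate(pretest) if p <= cutoff]

gain = mean(posttest[i] for i in selected) - mean(pretest[i] for i in selected)
overall_change = mean(posttest) - mean(pretest)
print(round(gain, 1), round(overall_change, 1))
```

The selected group shows a sizable average gain while the full sample barely changes, which is why an apparent improvement among the lowest scorers cannot by itself be attributed to the program.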
Threats Relevant to Two-Group Quasi-experimental Designs
Selection Bias
Selection bias refers to any difference between the groups before the start of the study. Two common ways selection bias can occur are through self-selection or experimenter-selection. Self-selection bias is when the participants themselves choose which
group they are in. In the Head Start example, parents that choose to be in the program may have more motivation to teach their
kids certain skills. In this example, kids improved because the kids or families choosing to be in Head Start were more
motivated at the outset than those who chose not to be in the program. Experimenter selection bias is when the researcher
chooses who is assigned to each group (e.g., program vs. control group). In the Head Start study, for instance, because of
scarce resources, there may only be a limited number of slots available in the program, so the families that contact the program
first are assigned to the program group and the families who contact the program last are assigned to the control. The families
that contact the program earliest may be more motivated to educate their kids, and, therefore, the kids do better by the end of
the study, not because of the effectiveness of the program but because of initial differences between the characteristics of the program and control groups. Because the researcher assigned families to the groups in a way that selected more motivated
families, the results are due to experimenter selection bias. As long as we start out with comparable groups initially, the previous
threats to internal validity are not a problem. It is very difficult to ensure that groups are completely comparable initially,
however. The general weakness of two-group designs is selection factors, but selection can take on a number of different forms.
All of the following represent a differential change in the groups as a result of selection plus other threats.
Selection & History: Head Start kids are more motivated and more likely to watch Sesame Street when it came on.
Selection & Maturation: Head Start kids are more motivated and thus more likely to develop skills faster than those in control.
Selection & Testing: Head Start kids are more motivated so they are more likely to figure out the tricks to the math questions.
Consequently, with a second testing, they improved.
Selection & Instrumentation: Perhaps observers in the control group get bored and are less likely to pick up on kids' improvement, so
it looks like the program kids do better.
Selection & Mortality: In the control group, the most motivated kids and families are more likely to drop out of the study
because they found another preschool opportunity. This leads to differential attrition in which the kids most likely to improve will
leave the control group. At posttest, the Treatment group will be better than the control group.
Selection & Regression: Kids who are the furthest behind or have the most disadvantaged families are assigned to the
Treatment group. Because they are more extreme to begin with, they show greater improvement due to regression rather than program effects.
Threats Relevant to Social Interaction
Diffusion or Contamination
This occurs when a comparison group learns about the program either directly or indirectly from program group participants. In
a school context, children from different groups within the same school might share experiences during lunch hour. Or,
comparison group students, seeing what the program group is getting, might set up their own experience to try to imitate that of
the program group. In either case, if the diffusion or imitation affects the posttest performance of the comparison group, it can jeopardize your ability to assess whether your program is causing the outcome. Notice that this threat to validity tends
to equalize the outcomes between groups, minimizing the chance of seeing a program effect even if there is one.
Compensatory Rivalry
Here, the comparison group knows what the program group is getting and develops a competitive attitude with them. The
students in the comparison group might see the special math tutoring program the program group is getting and feel jealous. This
could lead them to deciding to compete with the program group "just to show them" how well they can do. Sometimes, in
contexts like these, the participants are even encouraged by well-meaning teachers or administrators to compete with each other
(while this might make educational sense as a motivation for the students in both groups to work harder, it works against our
ability to see the effects of the program). If the rivalry between groups affects posttest performance, it could make it more
difficult to detect the effects of the program. As with diffusion and imitation, this threat generally works in the direction of
equalizing the posttest performance across groups, increasing the chance that you won't see a program effect, even if the
program is effective.
Resentful Demoralization
This is almost the opposite of compensatory rivalry. Here, students in the comparison group know what the program group is
getting. But here, instead of developing a rivalry, they get discouraged or angry and they give up (sometimes referred to as the "screw you" effect!). Unlike the previous two threats, this one is likely to exaggerate posttest differences between groups,
making your program look even more effective than it actually is.
Compensatory Equalization of Treatment
This is the only threat of the four that primarily involves the people who help manage the research context rather than the
participants themselves. When program and comparison group participants are aware of each other's conditions they may wish
they were in the other group (depending on the perceived desirability of the program it could work either way). Often they or
their parents or teachers will put pressure on the administrators to have them reassigned to the other group. The administrators
may begin to feel that the allocation of goods to the groups is not "fair" and may be pressured to or independently undertake to
compensate one group for the perceived advantage of the other. If the special math tutoring program was being done with
state-of-the-art computers, you can bet that the parents of the children assigned to the traditional non-computerized comparison
group will pressure the principal to "equalize" the situation. Perhaps the principal will give the comparison group some other
good, or let them have access to the computers for other subjects. If these "compensating" programs equalize the groups on
posttest performance, it will tend to work against your detecting an effective program even when it does work. For instance, a
compensatory program might improve the self-esteem of the comparison group and eliminate your chance to discover whether
the math program would cause changes in self-esteem relative to traditional math training.
9. Discuss the major threats to the external validity of your specific study. Do not just list terms. For each threat
you discuss, explain why it is a threat.
External validity is related to generalizing. That's the major thing you need to keep in mind. Recall that validity refers to the
approximate truth of propositions, inferences, or conclusions. So, external validity refers to the approximate truth of
conclusions that involve generalizations. Put in more pedestrian terms, external validity is the degree to which the conclusions in
your study would hold for other persons in other places and at other times.
A threat to external validity is an explanation of how you might be wrong in making a generalization. For instance, you conclude
that the results of your study (which was done in a specific place, with certain types of people, and at a specific time) can be
generalized to another context (for instance, another place, with slightly different people, at a slightly later time). There are three
major threats to external validity because there are three ways you could be wrong -- people, places or times. Your critics
could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the
study. Or, they could argue that it might only work because of the unusual place you did the study in (perhaps you did your
educational study in a college town with lots of high-achieving educationally-oriented kids). Or, they might suggest that you did
your study in a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issued
the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the
week before.
How can we improve external validity? One way, based on the sampling model, suggests that you do a good job of drawing a
sample from a population. For instance, you should use random selection, if possible, rather than a nonrandom procedure. And,
once selected, you should try to assure that the respondents participate in your study and that you keep your dropout rates low.
A second approach would be to use the theory of proximal similarity more effectively. How? Perhaps you could do a better job
of describing the ways your contexts and others differ, providing lots of data about the degree of similarity between various
groups of people, places, and even times. You might even be able to map out the degree of proximal similarity among various
contexts with a methodology like concept mapping. Perhaps the best approach to criticisms of generalizations is simply to
show them that they're wrong -- do your study in a variety of places, with different people and at different times. That is, your
external validity (ability to generalize) will be stronger the more you replicate your study.
Overall Issues with external validity
Do the results generalize to the extent that the research claims they do?
- Generalize from sample to target population?
- Generalize to other settings, situations, or locations?
- Generalize from research arrangement to real world? The participant is doing something that directly or indirectly
generates the behavior that is being measured. Will the results generalize to other tasks or stimuli?
- Will the findings continue to apply as society changes over the years (societal/temporal changes)?
10. Given the design and the operationalized variables you propose, what statistical tools will you use to test the
hypothesis(es)? (If you are doing regression, what kind of regression will you perform? Will there be anything
special about the standard errors you use?)
Regression-Adjusted Impact Estimates
Random assignment should produce Treatment and control groups that do not differ systematically on a range of characteristics,
thus reducing the need for statistical modeling that controls for baseline differences between the Treatment and C groups. It
should theoretically be possible to just compare means across the groups, or run a bivariate regression of the outcome variable
on the Treatment indicator.
But, many think tanks use multiple regression to adjust for random baseline differences among the groups. (Note that random
assignment does not guarantee that Treatment and control groups are identical on all characteristics that could relate to
differences in the outcome variable (Orr: 188)). Taking account of these extant differences can improve the power of the
design, allowing the researcher to detect smaller program effects than would otherwise be visible for the given n (Orr: 188).
Collecting the baseline data on assorted covariates to permit regression-adjusted impact estimates is often cheaper than
increasing the sample size n to increase the power of the design. This basic model estimates the average impact of the program
on the entire Treatment group, not on any subgroup.
One baseline covariate that goes into the model could well be the pre-program score on the outcome variable. Be careful
choosing covariates! Only characteristics that are prior to random assignment (pre-Treatment) should be used as covariates in
the model. Including post-Treatment characteristics in the model could bias the estimate of the effect of Treatment on the
outcome variable. Aside from being pre-Treatment, the control variables should be correlated with the outcome and not
affected by the Treatment.
Adjusted Means
The adjusted mean of Y (the outcome variable) for a given group is the regression function for that group evaluated at the mean
values across all the groups for the covariates. If $\mu_x$ is the mean of x for the groups combined, then the adjusted mean of Y for
any particular group is the group's regression function (the predicted equation for that group) evaluated at $\mu_x$ (evaluated at $\bar{x}$ in
practice). Note that $\bar{x}$ is the average across all cases in the dataset for variable x. Note: If the adjusted means differ greatly from
the arithmetic means, this indicates that the groups had very different means on covariates in the model, and the results you are
getting from calculating adjusted means are premised on two assumptions that may be unrealistic: (1) that it makes sense to
adjust the groups on this covariate, and (2) that the relationship between the DV and the covariate has the same linear form
within each group, even if the mean of the covariate is quite different between groups. This latter assumption in particular may be
unwarranted.
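As a minimal sketch of the adjusted-means idea, the Python below fits one pooled within-group slope (so it builds in assumption (2), equal slopes) and evaluates each group's line at the grand mean of the covariate. The function name `adjusted_means` and the (pretest, posttest) numbers are hypothetical, made up only for illustration.

```python
from statistics import mean

def adjusted_means(groups):
    """Return {group: adjusted mean of y}, evaluated at the grand mean of x.

    groups maps a group name to a list of (x, y) pairs, where x is a single
    pre-Treatment covariate and y is the outcome. Assumes a common
    within-group slope (the equal-slopes assumption).
    """
    grand_x = mean([x for pairs in groups.values() for x, _ in pairs])
    sxy = sxx = 0.0
    group_means = {}
    for name, pairs in groups.items():
        x_bar = mean([x for x, _ in pairs])
        y_bar = mean([y for _, y in pairs])
        group_means[name] = (x_bar, y_bar)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in pairs)
        sxx += sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = sxy / sxx  # pooled within-group slope of y on x
    # Each group's line passes through (x_bar, y_bar); evaluate it at grand_x.
    return {name: y_bar + slope * (grand_x - x_bar)
            for name, (x_bar, y_bar) in group_means.items()}

# Hypothetical (pretest, posttest) pairs, invented for illustration only.
data = {
    "treatment": [(40, 60), (50, 70), (60, 80)],
    "control": [(50, 60), (60, 70), (70, 80)],
}
adj = adjusted_means(data)
print(adj)
```

In this made-up data both groups have the same raw posttest mean (70), but the treatment group started 10 points lower on the pretest, so the adjusted means differ. A gap this large between raw and adjusted means is exactly the situation in which the two assumptions deserve scrutiny.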
The Difference Estimator
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i$ is the post-Treatment score earned by case $i$
$X_i$ is the Treatment indicator (or level) to which case $i$ was assigned
$\beta_1$ is the Treatment effect: the causal effect of a unit change in X, but if any differences exist between the Treatment and
Control groups before Treatment (i.e. any randomization failure), it will be biased
If X is binary (Treatment or no Treatment), then $\beta_1$ is the difference estimator: $\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$
Avg score for Treatment group after Treatment - Avg score for Control group after Treatment: $\hat{\beta}_1 = \bar{Y}_{\text{Treatment}} - \bar{Y}_{\text{Control}}$
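To see why the slope on a binary Treatment indicator is just a difference in group means, here is a small sketch using only the Python standard library. The helper `ols_slope_intercept` and the posttest scores are hypothetical, chosen only for illustration.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Bivariate OLS of y on x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: 1 = assigned to Treatment, 0 = assigned to control.
treat = [1, 1, 1, 0, 0, 0]
score = [78.0, 85.0, 80.0, 70.0, 74.0, 72.0]

b0, b1 = ols_slope_intercept(treat, score)
treat_mean = mean([s for t, s in zip(treat, score) if t == 1])
ctrl_mean = mean([s for t, s in zip(treat, score) if t == 0])
diff_in_means = treat_mean - ctrl_mean

print(b0, b1, diff_in_means)
```

The intercept equals the control-group mean and the slope equals the Treatment-minus-control difference in means, so running the bivariate regression and comparing means give the same point estimate.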
Regression-Adjusted Difference Estimator
If there was true random assignment, then the difference estimator is unbiased
But it could be made more efficient (smaller variance) through the addition of relevant covariates to the model
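$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + \varepsilon_i$, where $W_{1i}, \dots, W_{ri}$ are the pre-Treatment covariates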
Careful: Do NOT interpret coefficients on other covariates as causal!
Do not put any post-Treatment variables into the model!
The Differences-in-Differences Estimator AKA Double Difference Estimator (Panel Data)
$\Delta Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\Delta Y_i$ is the difference between the post-Treatment score and the pre-Treatment score for case $i$
$\beta_1$ is the differences-in-differences estimator: the causal effect of the Treatment
(Avg score for Treatment group after Treatment - Avg score for Treatment group before Treatment) - (Avg score for Control group
after Treatment - Avg score for Control group before Treatment)
$\hat{\beta}_1 = \Delta\bar{Y}_{\text{Treatment}} - \Delta\bar{Y}_{\text{Control}}$
Note: Will be biased if the Treatment and control groups follow different time trends for some reason (failure of parallel trends
assumption).
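The double difference can be sketched in a few lines of Python. The function name `diff_in_diff` and the pre/post scores are hypothetical, and the sketch assumes a balanced panel (every case has both a pre and a post score).

```python
from statistics import mean

def diff_in_diff(pre_t, post_t, pre_c, post_c):
    """Mean gain in the Treatment group minus mean gain in the control group."""
    gain_t = mean([post - pre for pre, post in zip(pre_t, post_t)])
    gain_c = mean([post - pre for pre, post in zip(pre_c, post_c)])
    return gain_t - gain_c

# Hypothetical panel data, invented for illustration only.
pre_t, post_t = [50, 55, 60], [62, 66, 70]   # Treatment group
pre_c, post_c = [52, 54, 59], [56, 59, 62]   # control group

did = diff_in_diff(pre_t, post_t, pre_c, post_c)

# Same number computed from the four group averages.
did_from_means = (mean(post_t) - mean(pre_t)) - (mean(post_c) - mean(pre_c))
print(did, did_from_means)
```

Here the Treatment group gains 11 points on average and the control group gains 4, so the estimate is 7. The control group's gain stands in for what the Treatment group would have gained without the program, which is why the parallel trends assumption matters.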
The Regression-Adjusted Differences-in-Differences Estimator
$\Delta Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + \varepsilon_i$
Do not interpret coefficients on other covariates as causal
Do not put any post-Treatment variables into the model
Intent to Treat (ITT)
You can use random assignment to create a Treatment group and a C group
But differences can arise between the groups after randomization (e.g. some people drop out of experiment)
If there is attrition from either group or from both, then you cannot simply compare those who actually end up being treated
to those who actually ended up staying in the control group and assume the difference is the causal impact of the Treatment. These
two groups were not produced by random assignment.
Assessing Impact w/Repeated Cross-Sectional Data
What do you do with the dropouts/non-compliers?
Usually it is not possible to observe their outcomes
But simply excluding them would bias the Treatment evaluation
There is no single, agreed-upon method for handling this
You can build an estimate of what their outcome would have been had they stayed in Treatment
Assume their outcome was the undesirable one to generate conservative estimates of ITT and TOT
ITT is the average effect of assignment to Treatment.
E.g. the average effect of being offered a voucher to move
Does not accurately portray the effect of actually receiving Treatment if there is any non-compliance.
E.g. does not tell us the effect of actually using the voucher to move
Analysis uses all cases, in the groups to which they were randomly assigned. Ignore imperfect compliance. Ignore deviations
from protocol.
If you have crossover (units assigned to Treatment actually being subjected to C and vice versa), ITT usually gives a
conservative estimate of the effect of Treatment
All other estimators involve making model-dependent (and often very sensitive) corrections for crossover or attrition. ITT does
not and therefore is the most robust
If the noncompliance rate in the population is similar to the noncompliance rate in the experiment, then ITT can give a useful
estimate of impact of Treatment were it actually adopted in population.
But recall the limits on external validity that come from having to conduct random assignment among volunteers
If you are doing ITT analysis, say so clearly:
Do not claim to be estimating the causal effect of actually receiving Treatment.
Rather, you are estimating the causal effect of being assigned to Treatment vs. to control.
Calculating ITT
- ITT is the average effect of being assigned to Treatment: the causal effect of assignment to Treatment or control, i.e. the mean
outcome difference between when Treatment is offered and when it is not. This is the Difference Estimator we saw last week: $\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$, the average outcome for cases assigned to the Treatment scenario minus the average outcome for cases
assigned to the C scenario. Can do regression-adjusted impact estimates or unadjusted impact estimates.
But what if you want to estimate the impact of receiving Treatment? Effect of Treatment on the treated (TOT). TOT > ITT if
you had non-compliance with Treatment.
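A hedged sketch of how ITT and TOT relate under non-compliance: it assumes the no-show adjustment takes the usual form TOT = ITT / (share of the Treatment group that actually participated), that the program has zero effect on no-shows, and that no control cases cross over into Treatment. The function name `itt_and_tot` and all numbers are hypothetical.

```python
from statistics import mean

def itt_and_tot(outcome, assigned, participated):
    """Return (ITT, take-up rate, TOT) using the no-show adjustment.

    Cases are analyzed in the groups to which they were randomly assigned.
    Assumes zero program effect on no-shows and no crossover by controls.
    """
    treat_y = [y for y, a in zip(outcome, assigned) if a == 1]
    ctrl_y = [y for y, a in zip(outcome, assigned) if a == 0]
    itt = mean(treat_y) - mean(ctrl_y)
    take_up = mean([p for p, a in zip(participated, assigned) if a == 1])
    return itt, take_up, itt / take_up

# Hypothetical data: 4 cases assigned to Treatment (one no-show), 4 to control.
assigned     = [1, 1, 1, 1, 0, 0, 0, 0]
participated = [1, 1, 1, 0, 0, 0, 0, 0]
outcome      = [20, 22, 24, 14, 14, 16, 12, 14]

itt, take_up, tot = itt_and_tot(outcome, assigned, participated)
print(itt, take_up, tot)
```

With a take-up rate of 0.75, the ITT of 6 is spread over participants only, giving a TOT of 8. Under these assumptions TOT exceeds ITT in magnitude whenever take-up is below 100 percent.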
- One approach, which works if impact of the program on non-participants is zero, is the no-show adjustment. This
approach makes no assumptions about whether those who drop out are similar to those who participate. It only assumes that
the effect of the program on non-participants is zero. This is probably a valid assumption in voluntary programs but not
necessarily a valid assumption when program is mandatory (e.g. work for welfare)
- King et al. arguments against regression-adjustments
Randomization allows model-free impact estimates. These estimates are not dependent on particular decisions and thus are not
going to vanish with slight changes in modeling decisions.
If regression adjustments suddenly made a Treatment effect appear significant when it did not without those adjustments, there
was probably something wrong with the experiment. It is possible that a post-Treatment variable was put into the model amid
these adjustments.
They say it is better to do a good job at the design stage than to try to compensate for poor designs after the fact
Matching
-A way to try to create a comparison group that looks very similar to the Treatment group when random
assignment did not generate these groups
-A way to pre-process an observational dataset before running regression
-A way to thereby reduce the degree to which the results one gets depend on the model one specifies
- Get your dataset
* Run a series of matching procedures
* Make sure that matching is improving the balance on the covariates (ideally all of them, but minimally those that you have
theoretical reason to think are most important)
* Run regression analysis, but do so using the matched dataset rather than the original dataset
* Calculate the simple difference in means between the Treatment and Comparison groups formed from matching
* Except where exact matching is possible (rarely), you should still use regression adjustments rather than just calculating the
difference in means
One-to-one exact matching
Match each Treatment unit with one C unit. For the matched cases, values are identical on all covariates on which you
matched.
If feasible, this procedure eliminates all dependence on functional form when running regression.
Here, can just do difference in means test
But it is usually infeasible.
General Exact Matching
Matches all control units to a Treatment unit with exactly the same covariate values
So one Treatment case could be matched to 3 C cases
Need to use a weighted difference in means to account for there being a different number of Treatment and C units
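A minimal sketch of general exact matching with a weighted difference in means. Weighting each stratum by its number of Treatment units is one common choice (it targets the effect on the treated) and is an assumption here, as are the function name `exact_match_effect` and the data.

```python
from collections import defaultdict
from statistics import mean

def exact_match_effect(units):
    """units: list of (treated, covariates, outcome) tuples.

    Groups units into strata with identical covariate values, keeps strata
    containing both Treatment and control units, and returns the
    treated-weighted average of within-stratum differences in means,
    together with the number of Treatment units that were matched.
    """
    strata = defaultdict(lambda: {"t": [], "c": []})
    for treated, covs, y in units:
        strata[covs]["t" if treated else "c"].append(y)
    matched = [s for s in strata.values() if s["t"] and s["c"]]
    n_treated = sum(len(s["t"]) for s in matched)
    if n_treated == 0:
        raise ValueError("no exact matches found")
    total = sum(len(s["t"]) * (mean(s["t"]) - mean(s["c"])) for s in matched)
    return total / n_treated, n_treated

# Hypothetical units: (treated?, (sex, age), outcome).
units = [
    (1, ("F", 4), 80), (0, ("F", 4), 70), (0, ("F", 4), 74), (0, ("F", 4), 72),
    (1, ("M", 4), 75), (1, ("M", 4), 77), (0, ("M", 4), 70),
    (1, ("M", 5), 90),  # no control with the same covariates: dropped
    (0, ("F", 5), 60),  # no Treatment unit with the same covariates: dropped
]
effect, n_matched = exact_match_effect(units)
print(effect, n_matched)
```

Note that unmatched units are discarded, so the estimate applies only to the kinds of cases for which an exact match exists.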
Nearest Neighbor Matching on the Propensity Score
Match each Treatment unit with the C having the most similar estimated propensity score
Less restrictive than exact matching
Check for Balance Improvements
Check Matching
Best: QQ plots for each covariate
Compare the histograms
Compare the mean of each covariate for the Treatment and the C group before vs. after matching. The smaller the
difference the better. Do not carry out difference of means t-test to assess balance (see Ho et al. 2011: 4)
Check the balance even for covariates that are not part of the matching procedure.
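As a rough sketch of the mean-comparison check, the code below computes the Treatment-minus-C difference in a covariate's mean before and after matching. Dividing by the Treatment group's standard deviation, so that covariates on different scales are comparable, is an added convention and not something the notes above specify. The data and the matched control set are hypothetical.

```python
from statistics import mean, stdev

def std_mean_diff(treated, control, scale):
    """Difference in covariate means, expressed in units of `scale`."""
    return (mean(treated) - mean(control)) / scale

# Hypothetical pretest scores, invented for illustration only.
treated = [52, 55, 58, 61]
control_all = [40, 44, 50, 53, 56, 60, 62, 70]  # before matching
control_matched = [53, 56, 60, 62]              # after matching (assumed result)

scale = stdev(treated)  # same scale before and after, so the two are comparable
before = std_mean_diff(treated, control_all, scale)
after = std_mean_diff(treated, control_matched, scale)
print(round(before, 3), round(after, 3))
```

The comparison of interest is whether the absolute difference shrinks after matching. No significance test is involved, consistent with the advice against using t-tests to assess balance.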
What if the data cannot be improved? The data may simply be too fragile to permit causal inference
Matching only works when the data permit it to work
Data where all Treatment cases are severely different from all C cases may not be amenable to matching.
Linear Probability Model (Vs. Logistic Regression)
LPM is the linear multiple regression model applied to a binary DV. It models the probability that the DV is 1.
If Y is a binary DV, we use the linear regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \varepsilon_i$
One advantage is that it is easy to use. Coefficients are directly interpretable where those from probit and logit are not.
A drawback is that it can predict probabilities less than 0 or greater than 1, which makes no sense but is an inevitable
consequence of linear regression. It also cannot capture nonlinear relationships. Probit and logit force predicted values
to be between 0 and 1 and are specifically designed for binary DVs.
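To illustrate the out-of-range problem, this sketch fits a one-predictor LPM by ordinary least squares on a small hypothetical dataset and evaluates the fitted line at the extremes of x. The helper names and the data are assumptions made for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Bivariate OLS of y on x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: binary DV that switches from 0 to 1 as x increases.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_line(x, y)

def predict(value):
    """LPM predicted probability that the DV equals 1."""
    return b0 + b1 * value

low, high = predict(1), predict(6)
print(round(low, 3), round(high, 3))
```

Even within the observed range of x, the fitted "probability" is negative at x = 1 and above 1 at x = 6, which is the drawback that probit and logit avoid by construction.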
11. Describe at least one robustness check that you will do in addition to the analyses you have described so far.
If you make assumptions about data availability, or about how programs work, clearly state those assumptions. If
the prompt indicates that data is unavailable, or that a program will not be administered in some particular way, do
not contradict those statements with your assumptions. The program described below is hypothetical.