program eval review

Upload: sjgrubbs

Post on 02-Apr-2018

  • 7/27/2019 Program Eval Review

    Program Evaluation Study Guide

    Below is a description of a program. By answering each of the 11 questions listed here, explain the research design

    you would use to carry out an evaluation of this program. Number your responses to each of the 11 questions.

    1. Clearly state the evaluation question(s) you seek to answer.

    Make sure it is clear, focused, and appropriately complex.

    2. Clearly state the null hypothesis(es) that your analysis will test.

    The null hypothesis is a statement that you want to test. In general, the null hypothesis is that things are the same as each other,

    or the same as a theoretical expectation. In a regression, the null is that the slope (coefficient) on the Treatment variable is zero, indicating no change. The alternative

    hypothesis is that things are different from each other, or different from a theoretical expectation.

    3. Define the unit of analysis that your dataset(s) will use.

    The unit of analysis is the major entity that you are analyzing in your study. Why is it called the 'unit of analysis' and not

    something else (like, the unit of sampling)? Because it is the analysis you do in your study that determines what the unit is. For

    instance, if you are comparing the children in two classrooms on achievement test scores, the unit is the individual child because

    you have a score for each child. On the other hand, if you are comparing the two classes on classroom climate, your unit of

    analysis is the group, in this case the classroom, because you only have a classroom climate score for the class as a whole and

    not for each individual student. For different analyses in the same study you may have different units of analysis. If you decide to

    base an analysis on student scores, the individual is the unit. But you might decide to compare average classroom performance.

    In this case, since the data that goes into the analysis is the average itself (and not the individuals' scores) the unit of analysis is

    actually the group. Even though you had data at the student level, you use aggregates in the analysis. In many areas of social

    research these hierarchies of analysis units have become particularly important and have spawned a whole area of statistical

    analysis sometimes referred to as hierarchical modeling. This is true in education, for instance, where we often compare

    classroom performance but collected achievement data at the individual student level.
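
    The shift from an individual-level to a group-level unit of analysis can be sketched as follows. This is a minimal illustration with made-up scores; the classroom labels, the data, and the helper name `classroom_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student-level records: (classroom, test score).
students = [
    ("A", 70), ("A", 80), ("A", 90),
    ("B", 60), ("B", 65), ("B", 85),
]

def classroom_means(records):
    """Aggregate student scores into one mean score per classroom."""
    by_class = defaultdict(list)
    for classroom, score in records:
        by_class[classroom].append(score)
    return {c: mean(scores) for c, scores in by_class.items()}

class_level = classroom_means(students)

# Six units when the student is the unit of analysis,
# two units when the classroom average is what enters the analysis.
print(len(students), "student-level units")
print(len(class_level), "classroom-level units:", class_level)
```

    Once the averages are what go into the analysis, the sample size is the number of classrooms, not the number of students.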

    4. Define the dependent variable(s) and then describe how you would operationalize (measure) the dependent

    variable(s).

    Operationalizing a variable means creating a clear protocol (list of steps) for how to quantitatively measure the concept (i.e. to change the concept into a variable with measurable indicators). The protocol must be specific and precise enough so that

    someone else could use your operational definition and obtain the same results you did. To sum up, operationalization means to

    add measurable indicators to concepts to make them measurable variables.

    5. Define the key explanatory variable(s) and then describe how you would operationalize (measure) the

    independent key explanatory variable(s).

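    See the discussion of operationalization under question 4; the same logic applies to the key explanatory variable(s).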

    6. Make a list of control variables that you will include in your analysis(es)

    A control variable is a variable that is held constant in a research analysis. The use of control variables is generally done to

    answer four basic kinds of questions: 1. Is an observed relationship between two variables just a statistical accident? 2. If one

    variable has a causal effect on another, is this effect a direct one or is it indirect, with other variables intervening? 3. If several

    variables all have causal effects on the dependent variable, how does the strength of those effects vary? 4. Does a particular

    relationship between two variables look the same under various conditions? Common control variables include age, gender, race,

    income, education, weight, and marital status.

    7. Describe your research design. Do not just give a term. Explain exactly what your research design will be. If your research design involves one or more Treatment groups and control or comparison groups, clearly explain

    what each of these groups is and how these groups will be formed.

    Experiment (Gold Standard)

    Each subject has the same chance of being in the Treatment or the control group

    One subject's assignment is independent of any other subject's assignment

    Selection into one group or the other is unrelated to the individual's characteristics

    If assignment to the Treatment group vs. the control group is truly random, then they should on average have the same characteristics.

    Works best with large numbers of subjects

    Limits of Experiments

    Cannot yield estimates of the impact on any unit higher than the kind of unit sampled and randomly assigned

    Ex: An experiment that randomly assigned individuals to Treatment and control groups cannot give estimates of program

    impact on neighborhoods, cities, the economy, etc.

    Cannot tell us the impact on specific individuals. Can only give an estimate of the average impact on groups of

    participants.

    External validity is questionable

    Classic Experiment

    Random assignment of subjects into the Treatment and control group

    Pre-test administered to both groups, measuring each on the DV

    Treatment group gets the Treatment. Control group does not. Both groups should otherwise be exposed to the same

    conditions.

    Post-test administered to both groups, measuring each on the DV

    The difference between the two groups' pre-to-post changes (the difference in differences) is attributed to the Treatment

    Treatment is the explanatory variable

    Treatment Group: DV1 -> Treatment -> DV2

    Control Group: DV1 -> ( ) -> DV2
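
    The difference-in-differences logic of the classic experiment can be sketched in a few lines. This is a minimal illustration with made-up pre-test and post-test scores; the function name `diff_in_diff` and all numbers are hypothetical.

```python
from statistics import mean

def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Change in the Treatment group minus change in the control group."""
    treat_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treat_change - control_change

# Hypothetical DV1 (pre-test) and DV2 (post-test) scores.
treat_pre, treat_post = [50, 52, 48], [60, 63, 57]
control_pre, control_post = [51, 49, 50], [54, 52, 53]

effect = diff_in_diff(treat_pre, treat_post, control_pre, control_post)
print("Estimated Treatment effect:", effect)
```

    Here the Treatment group gains 10 points and the control group gains 3, so 7 points are attributed to the Treatment. The control group's gain stands in for whatever would have happened without the program.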

    Randomized Post-Test Only Experiment

    Treatment is the explanatory variable

    Treatment Group: Treatment -> DV2

    Control Group: ( ) -> DV2

    No pretest is needed: because of random assignment, the groups are expected to be equivalent on average before Treatment.

    Randomized Block Design

    Instead of random assignment at the unit level, first assign units to blocks that reflect known differences among subjects,

    then randomly assign units from within each block

    Blocks are internally homogeneous

    Example: If an experimenter had reason to believe that education level would be a significant factor in the effect of a

    given program, she/he might first divide the experimental subjects into education level groups: less than high school, high

    school, some college, bachelors, graduate. Then, within each education level group, individuals would be assigned to

    Treatment groups using random assignment.

    Advantage: can lower variance and thus yield narrower confidence intervals

    This is NOT the same thing as randomly assigning at the cluster level rather than the unit level. You are still carrying out

    random assignment at the unit level and can carry out analyses at the unit level.
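
    Blocked random assignment can be sketched as below. This is a minimal illustration, assuming education level is the blocking variable as in the example above; the function name `blocked_assignment`, the unit ids, and the fixed seed are hypothetical. It splits each block in half at random, which is one common way to randomize within blocks.

```python
import random
from collections import defaultdict

def blocked_assignment(units, seed=0):
    """units: list of (unit_id, block_label).

    Returns {unit_id: "Treatment" or "Control"}, randomizing within each block.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    blocks = defaultdict(list)
    for unit_id, block in units:
        blocks[block].append(unit_id)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)           # random order within the block
        half = len(members) // 2
        for i, unit_id in enumerate(members):
            assignment[unit_id] = "Treatment" if i < half else "Control"
    return assignment

# Hypothetical subjects grouped by education level.
units = [
    (1, "high school"), (2, "high school"), (3, "high school"), (4, "high school"),
    (5, "bachelors"), (6, "bachelors"), (7, "bachelors"), (8, "bachelors"),
]
assignment = blocked_assignment(units)
for unit_id, block in units:
    print(unit_id, block, assignment[unit_id])
```

    Assignment still happens at the unit level; the blocks only guarantee that each education level is balanced across Treatment and control.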

    Quasi-Experiments (When experiments just aren't an option)

    Need to assess how similar the comparison group is to the Treatment group. The more similar they are, the better

    Are the groups' pre-program means similar?

    Are the SDs around these pre-program means also similar?

    Do historical forces affect both groups the same way and to the same degree?

    Are rates of participant loss similar?

    Are the groups maturing at the same rate?

    Did people self-select into one group or the other? If so, selection bias becomes a threat to internal validity

    Lack of randomization leads to internal validity problems

    Regression Discontinuity Design

    Considered to be one of the strongest quasi-experimental designs

    Estimates the effect of program eligibility (not receipt) on some outcome

    Requirements:

    Continuous outcome variable (DV)

    Pre-test and post-test (pre-program and post-program) data for all respondents

    May be a pre-test on the outcome variable

    Some cutoff score (cut-point) on the assignment variable puts cases on one side of the score into Treatment (program)

    group and others not

    The cut-point must not be designed to deliberately assign some cases to Treatment and others not.

    Other than assignment to receive the program vs. not, the cases on one side or the other of the cut-point need to be

    treated in the same way

    Assignment Variable/Rating Variable

    The assignment variable can be any continuous variable measured before the Treatment. It must not be a variable

    that the Treatment causes or influences.

    Needs to relate to the DV in an unbroken (though not necessarily a linear) manner throughout its range.

    Absent Treatment, we should not see any discontinuities if we plot the relationship between the outcome variable

    and the assignment variable.

    When graphing for RD, the assignment variable goes on the x-axis of a graph. The y-axis is the post-Treatment outcome. Drop a vertical line at the cut-point.

    Sharp RD Designs: All cases receive the Treatment or non-Treatment to which they are assigned (our focus)

    Fuzzy RD Designs: Some cases do not receive the Treatment or non-Treatment to which they were assigned. Problem

    of no-shows and/or crossovers

    Two types of RDD

    - Discontinuity at a cut-point

    * The direction and magnitude of the jump at the cut-point is taken as a measure of the causal effect of Treatment on the

    outcome variable for cases near the cut-point.

    * Use every observation in the sample, even those that are far to the right or left of the cut-point

    * Global/ Parametric Estimation

    * Recommended: Center the assignment variable on the cut-point by creating a new version of the score

    (centered score = original score - cut-point score)

    Logic: Pre-test score should be a strong predictor of the post-test score. Is the program (the Treatment) also a strong

    predictor of post-test?

    *Coefficient on Program indicator is the program effect

    *Important: Try different functional forms when modeling

    *Linear relationship b/w assignment variable and posttest score

    *Quadratic

    *Cubic

    *Interactions with Treatment

    * If we have specified the functional form of the relationship between the outcome and assignment variable correctly, the

    estimator (coefficient on program) will be an unbiased estimator of the mean program impact at the cut-point.

    * If functional form is incorrect, it will be biased.

    * More power, but also more potential for bias, than the local approach

    * Analyze with regression using parametric techniques.
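
    The global (parametric) approach can be sketched as an ordinary regression of the outcome on a program indicator and the centered assignment variable. This is a minimal illustration on made-up, noise-free data, assuming a linear functional form and a sharp design; the helper `ols`, the cut-point of 50, and all values are hypothetical. In practice you would use a statistics package and compare several functional forms.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

CUT_POINT = 50
scores = list(range(40, 61))                 # hypothetical assignment-variable scores
centered = [s - CUT_POINT for s in scores]   # center the score on the cut-point
program = [1 if c >= 0 else 0 for c in centered]  # sharp RD: at/above cut-point gets the program

# Made-up outcomes: baseline 10, slope 0.5 on the centered score, program effect 5.
outcome = [10 + 0.5 * c + 5 * t for c, t in zip(centered, program)]

# Columns: constant, program indicator, centered assignment variable.
X = [[1, t, c] for t, c in zip(program, centered)]
intercept, program_effect, slope = ols(X, outcome)
print("Coefficient on program indicator:", round(program_effect, 6))
```

    Because the score is centered, the coefficient on the program indicator is read directly as the jump at the cut-point. Quadratic, cubic, or interaction terms would be added as extra columns of `X`.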

    - Nonparametric/ Local randomization

    * Differences between cases who pass the cut-point and those who just miss it are random

    * Example: random error determines whether someone scores right above or right below the cut-point on some exam

    (e.g. scores a 49 or a 51)

    * So on average the only difference between those just above and just below the cut-point (i.e. within the bandwidth) will be exposure to

    the program

    * So compare the mean outcomes for cases just to the left and just to the right of the cut-point

    * This is the "how near" question

    * Selecting the right bandwidth is the challenge with this method

    * Less potential for bias than with the global approach, but also more limited power given more limited n

    * Simplest scenario is to calculate a simple difference of means: Using just those in the bandwidth, compare the mean

    outcome for those who got Treatment and those who did not

    * But can produce a biased estimator unless the bandwidth is exceedingly narrow

    * Solution? Local linear regression

    * Local linear regression is acceptable since the functional form within the limited range is likely to be linear

    $Y_i = \alpha + \beta_0 T_i + \beta_1 r_i + \beta_2 T_i r_i + \varepsilon_i$

    Where:

    $r_i$ is case $i$'s score on the assignment variable, centered at the cut-point

    $T_i$ is 1 if case $i$ is in the program (Treatment) and 0 otherwise

    The resulting intercept for the control group estimates the mean outcome w/o the program

    The resulting intercept for the program group estimates the outcome with the program

    The difference between them is an estimate of the Treatment effect

    * Can calculate a simple difference of means if using just those cases close to the cut-point.

    You can try both parametric and non-parametric approaches
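
    The bandwidth trade-off in the local approach can be sketched with a simple difference of means. This is a minimal illustration on made-up, noise-free data in which the true program effect is 5 and the outcome rises with the assignment score; the function name `local_difference`, the cut-point of 50, and the bandwidths are hypothetical.

```python
from statistics import mean

CUT_POINT = 50

def local_difference(cases, bandwidth):
    """cases: (score, outcome) pairs.

    Compare mean outcomes just right and just left of the cut-point,
    using only cases within `bandwidth` points of it.
    """
    left = [y for s, y in cases if CUT_POINT - bandwidth <= s < CUT_POINT]
    right = [y for s, y in cases if CUT_POINT <= s <= CUT_POINT + bandwidth]
    return mean(right) - mean(left)

# Made-up outcomes: slope 0.5 on the centered score, true jump of 5 at the cut-point.
cases = [
    (s, 10 + 0.5 * (s - CUT_POINT) + (5 if s >= CUT_POINT else 0))
    for s in range(40, 61)
]

wide = local_difference(cases, bandwidth=10)
narrow = local_difference(cases, bandwidth=1)
print("Wide bandwidth estimate:", wide)
print("Narrow bandwidth estimate:", narrow)
```

    With the wide bandwidth the difference of means also picks up the underlying slope, so it overstates the effect. The narrow bandwidth lands closer to 5 but uses only three cases, which is the bias versus power trade-off described above. Local linear regression addresses this by modeling the slope within the bandwidth.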

    It is useful to couple the analyses above with a test of whether other DVs that are similar to your DV of interest, but not

    expected to be influenced by the program, exhibit discontinuity at the cutoff point

    i.e. Look at similar DVs that you do not expect to be impacted by the program (such other DVs are called control-construct

    measures)

    You hope they do NOT show discontinuity at the cutoff point

    Suppose DV of interest is a math exam and the Treatment is a math program

    Good control construct measure: Reading exam

    Bad control construct measure: Physics exam

    Internal Validity Issues:

    Cut point must have been chosen independent of knowledge of cases' scores on the assignment variable, and scores must have

    been assigned independent of the cut point. Should check for manipulation at the cut point.

    Assessing RD Internal Validity:

    Learn as much as possible about how cases were scored on the assignment variable and about how cut point was set

    Plot the probability of receiving Treatment (y-axis) against the assignment variable (x-axis). There should be a jump (ideally a

    jump of 1) at the cut-point.

    Plot the relationship between non-outcome variables (y-axis) and the assignment variable (x-axis). These non-outcome

    variables should not be affected by Treatment. There should NOT be a jump at the cut-point.

    Plot the assignment variable against the number of observations at each of its values. There should NOT be a discontinuity in

    the number of observations just above or just below the cut-point. A jump suggests manipulation of either the scores on the assignment variable, or the choice of the cut-point.
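
    A crude version of the last check, counting observations on each side of the cut-point, can be sketched as follows. This is only an informal illustration with made-up scores, not a formal density test; the function name `counts_near_cut_point` and the data are hypothetical.

```python
def counts_near_cut_point(scores, cut_point, width):
    """Count cases within `width` points below and above the cut-point."""
    below = sum(1 for s in scores if cut_point - width <= s < cut_point)
    above = sum(1 for s in scores if cut_point <= s < cut_point + width)
    return below, above

# Hypothetical assignment scores with suspicious bunching right at the cut-point of 50.
scores = [44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 51, 52, 53]
below, above = counts_near_cut_point(scores, cut_point=50, width=2)
print("Just below:", below, "Just above:", above)
```

    A large imbalance like this one would be a prompt to investigate how scores were assigned and how the cut-point was chosen.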

    External Validity Issues (Generalizability)

    Criticism of RD is that the estimated impact only applies to observations near the cut-point

    But, Lee (2008) argues that the cases around the cut-point are actually fairly heterogeneous since random error plays some

    role in determining scores on the assignment variable

    Therefore, the results may generalize more broadly

    Border Matching

    A geographic border clearly separates the Treatment and comparison group.

    Look at data just to either side of the geographic border

    A weakness is that not every Treatment has clear geographic borders. This works best with small geographic scales.

    State borders are not a great way to differentiate the Treatment and comparison group.

    It is a strong quasi experimental design.

    It addresses history as a threat to internal validity.

    Analyze it using regression with control variables.

    Instrumental Variables

    Can be useful when you have unobserved omitted variables.

    Can be useful when you have simultaneity

    Must have at least one instrument for each endogenous regressor.

    Instruments must be exogenous and relevant.

    If an instrument is relevant, this means that variance in the instrument relates to variance in the independent variable. As

    an instrument becomes more relevant, it is able to explain more of the variance in the independent variable. This makes

    it more useful and a more accurate estimator for regression analysis. In addition, as an instrument becomes more relevant, the normal approximation to the sampling distribution becomes better. If an instrument is unable to explain

    variance in the independent variable, it is known as a weak instrument. This is a problem because the normal

    distribution becomes a weak approximation to the sampling distribution of the estimator. This problem cannot be fixed

    by simply increasing the sample size. This problem can be identified by looking at the F-statistic which tests the

    hypothesis that the coefficients on the instruments are zero in the first stage. This can be done when there is a single

    endogenous regressor. If the F-statistic is greater than 10 at this stage, your instrument is most likely not weak. If you

    have several instrumental variables, it may be a good idea to eliminate some of the weaker ones. In this case you would

    use the most relevant subset for your analysis. If you drop weak instrumental variables, your standard errors can

    increase but that is okay because the original standard errors were not meaningful. On the other hand, if the coefficients are exactly identified, you should not eliminate the weak instruments. This is because you might not have

    enough needed strong instruments. In this case, you can attempt to identify new strong instrument variables. You could

    also use the weak instrument but look at employing different methods of analysis.

    If an instrument is exogenous, then a part of the variance in the independent variable identified by the instrumental

    variable is exogenous. In other words, the instrument needs to contain information about variation in the independent

    variable that is not related to the error term. If an instrument is not exogenous, the two-stage least squares regression

    becomes inconsistent. The instrument cannot identify exogenous variation in the independent variable. As a result, the

    regression is unable to provide a consistent estimator. It is not possible to test a hypothesis that the instrument is

    exogenous when the coefficients are exactly identified. This makes it difficult to detect violations of this assumption.

    On the other hand, if a coefficient is overidentified, researchers can test the overidentifying restrictions. This means that they

    can test the hypothesis that additional instruments are exogenous using the assumption that there are enough

    instruments available to identify the coefficients of interest. If the coefficients are exactly identified, the only way to

    assess instrument exogeneity is by referencing expert opinion. If there are more instruments than endogenous

    regressors, researchers can use a statistical tool called the test of overidentifying restrictions to assess instrument

    exogeneity.

    It can be tough to satisfy the exogeneity condition.

    Not feasible to statistically test the exogeneity condition unless the model is overidentified

    Analyze using 2SLS

    Consider an equation $Y_i = \beta_0 + \beta_1 X_i + u_i$ where $X_i$ is endogenous

    IV regression splits $X_i$ into two parts: 1. a part correlated with $u_i$ 2. a part uncorrelated with $u_i$

    The basic idea of the IV method is to identify and use only the exogenous variation of $X_i$ to estimate $\beta_1$

    Instrumental variables are variables that mimic the troublesome regressors, but are uncorrelated with the error term.
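
    The two stages of 2SLS can be sketched by hand for the simplest case of one endogenous regressor and one instrument. This is a minimal illustration on made-up data constructed so that the true slope is 3, the error term is correlated with X, and the instrument Z is uncorrelated with the error; the helper `simple_ols` and all values are hypothetical.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a single regressor x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Made-up data. u is the unobserved error; it also drives X, so X is endogenous.
z = [0, 0, 0, 0, 1, 1, 1, 1]                    # instrument, uncorrelated with u
u = [-1, 1, -1, 1, -1, 1, -1, 1]                # error term
x = [2 * zi + ui for zi, ui in zip(z, u)]       # endogenous regressor
y = [1 + 3 * xi + ui for xi, ui in zip(x, u)]   # true slope on X is 3

# Naive OLS is biased because X is correlated with u.
_, ols_slope = simple_ols(x, y)

# Stage 1: regress X on Z and keep the fitted values (the exogenous part of X).
a, b = simple_ols(z, x)
x_hat = [a + b * zi for zi in z]

# Stage 2: regress Y on the fitted values from stage 1.
_, iv_slope = simple_ols(x_hat, y)

print("OLS slope:", ols_slope)
print("2SLS slope:", iv_slope)
```

    The manual second stage gives the right coefficient but not the right standard errors, so real analyses should rely on a dedicated 2SLS routine.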

    Comparison Group Pre/ Post Test

    Identify a group of subjects post-hoc that appear comparable to the group that already received, or is already slated to receive, the Treatment

    Match each unit in the Treatment group with a unit that did not receive the Treatment but that is similar on key

    characteristics.

    Identify and acknowledge systematic differences between the groups later.

    Comparison group is not perfectly equivalent to the Treatment group since random assignment did not produce the

    groups.

    Treatment group: DV1 -> Treatment -> DV2

    Comparison group: DV1 -> ( ) -> DV2

    If panel data, can do diff-in-diff w/control variables

    If repeated cross-sectional data, unit-level differencing is not possible (the same units are not observed twice), but the analysis can still be done on group-level changes

    One group pretest posttest

    No comparison group

    Better than a post-test only

    Difference in means test

    Regression using difference method

    One group time series

    Observations taken at regular, evenly spaced, and frequent intervals

    A time series traces the path of a single variable over a period of months, years, etc. Think of that variable as the DV

    DV1 DV2 DV3 DV4 DV5 Treatment DV6 DV7 DV8 DV9 DV10

    Use panel or TSCS (time-series cross-section) data

    TSCS Data: # time periods > # entities

    Panel Data: # entities > # time periods

    Balanced or unbalanced panel

    2 wave panel

    Include the intercept term to allow for the possibility that the mean change in the DV is non-zero even if there is no

    change in your key EV of interest

    Effectively controls for any variable V that does not change over time. This is useful if V is some variable that is

    unobservable

    You should not apply this method to multi-wave panels. In other words, do not run across several years differencing

    year Y and Y+1 repeatedly. Instead use the multi-wave panel approach (fixed effects regression).
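
    The two-wave differencing method can be sketched as a regression of the change in the DV on the change in the key EV, with an intercept. This is a minimal illustration on made-up data in which each entity has its own unobserved constant V, the true effect of the EV is 2, and the DV drifts up by 1 between waves for everyone; the helper `simple_ols` and all values are hypothetical.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a single regressor x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Made-up two-wave panel for four entities.
v = [10, 20, 30, 40]          # unobserved, time-invariant entity characteristic
x_wave1 = [1, 2, 3, 4]        # key EV at wave 1
x_wave2 = [2, 4, 3, 7]        # key EV at wave 2
y_wave1 = [vi + 2 * xi for vi, xi in zip(v, x_wave1)]
y_wave2 = [vi + 2 * xi + 1 for vi, xi in zip(v, x_wave2)]   # +1 common drift

# Differencing removes V because it is the same in both waves.
dx = [b - a for a, b in zip(x_wave1, x_wave2)]
dy = [b - a for a, b in zip(y_wave1, y_wave2)]

intercept, slope = simple_ols(dx, dy)
print("Mean change with no change in the EV (intercept):", intercept)
print("Effect of the EV (slope):", slope)
```

    The intercept captures the change in the DV that would occur even with no change in the EV, which is why it is kept in the model.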

    Multi-wave panel

    Use fixed effects regression.

    - Entity fixed effects

    * Control for variables that vary across entities (e.g. vary across states) but not across time

    * Slope is the same for all entities

    * Intercept is allowed to differ by entity

    - Time fixed effects

    * Control for variables that vary across time but not across units (e.g. a national level policy change that affects all

    states)

    * Each time period has a different intercept.

    - Entity and time fixed effects together

    * Use when you have some (omitted) variables that vary across entities but are fixed across time, as well as some

    (omitted) variables that vary across time but are fixed across entities
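
    Entity fixed effects can be sketched with the within transformation: subtract each entity's own mean from its observations, then regress. This is a minimal illustration on a made-up two-entity panel where the true slope is 2 and each entity has a different intercept; the helper names `simple_ols` and `demean_by_entity` and the data are hypothetical. Time fixed effects would be handled analogously by demeaning within time periods.

```python
from collections import defaultdict
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a single regressor x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def demean_by_entity(entities, values):
    """Subtract each entity's mean from its own observations."""
    groups = defaultdict(list)
    for e, val in zip(entities, values):
        groups[e].append(val)
    means = {e: mean(vals) for e, vals in groups.items()}
    return [val - means[e] for e, val in zip(entities, values)]

# Made-up panel: entity A has intercept 10, entity B has intercept 0, slope is 2 for both.
entity = ["A", "A", "A", "B", "B", "B"]
x = [1, 2, 3, 4, 5, 6]
y = [12, 14, 16, 8, 10, 12]

# Pooled OLS ignores the entity-specific intercepts and gets the slope wrong.
_, pooled_slope = simple_ols(x, y)

# Fixed effects: same slope for all entities, intercept allowed to differ by entity.
_, fe_slope = simple_ols(demean_by_entity(entity, x), demean_by_entity(entity, y))

print("Pooled slope:", round(pooled_slope, 3))
print("Entity fixed effects slope:", fe_slope)
```

    In this example the omitted entity-level difference is large enough that the pooled slope even has the wrong sign, while the within estimate recovers the common slope.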

    Heteroskedasticity may be an issue.

    This means the variance (spread) of the errors is not constant across the values of the key explanatory variables.

    OLS estimator will be unbiased and consistent but will not be efficient

    The standard error on the estimator is likely to be understated, meaning that we may conclude that a coefficient is

    statistically significant when it is not

    Solution is to use Robust/White/Huber-White standard errors. This is not sufficient, however, if there is any chance

    the errors are correlated over time (auto-correlation/serial correlation).

    Auto-Correlation and Serial Correlation

    There is a correlation (positive or negative) between a variable's score at time t and its score at time t-1.

    Pervasive with time-series data.

    Identify it using the Durbin-Watson d statistic.

    Fix it using HAC (heteroskedasticity and autocorrelation consistent) standard errors

    One kind of HAC is the clustered standard error

    - Allows errors to be correlated within a cluster but assumes that the errors are uncorrelated across clusters

    - For autocorrelated panel data, the cluster is the observations of the same entity across time (e.g. the state is the

    cluster)

    - Panel Corrected Standard Errors (PCSE)

    * Use for TSCS data
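
    The Durbin-Watson d statistic mentioned above is computed from regression residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below is a minimal illustration with made-up residual series; the function name `durbin_watson` is hypothetical. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2), for residuals ordered in time."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals: errors that persist over time vs. errors that flip sign each period.
persistent = [1, 1, 1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1]

d_persistent = durbin_watson(persistent)
d_alternating = durbin_watson(alternating)
print("Persistent errors:", round(d_persistent, 3))
print("Alternating errors:", round(d_alternating, 3))
```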

    - Segmented regression analysis.

    Must have a continuous DV

    Data must be observed at regular, evenly-spaced intervals.

    Generally want at least 12 time points before and 12 time points after intervention, especially if you think there is

    seasonal variation.

    Segment: A sequence of measures is divided into two or more portions at change points (points in time where we

    expect the values of the time series to change because of an event/intervention).

    Each segment has a level (intercept) and a trend (slope). Each segment is allowed to have its own level and trend.
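
    Segmented regression with a single change point can be sketched as a regression on four columns: a constant, time, a post-intervention indicator (change in level), and time since the intervention (change in trend). This is a minimal illustration on a made-up, noise-free monthly series with 12 points before and 12 after; the helper `ols` and all values are hypothetical, and a real analysis would also need to address autocorrelation and seasonality.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

CHANGE_POINT = 12                      # intervention happens after time point 12
times = list(range(1, 25))             # 12 points before, 12 after
post = [1 if t > CHANGE_POINT else 0 for t in times]
time_after = [max(0, t - CHANGE_POINT) for t in times]

# Made-up series: pre-level 20, pre-trend 1, level jump 5, trend increase 0.5.
dv = [20 + 1 * t + 5 * p + 0.5 * ta for t, p, ta in zip(times, post, time_after)]

X = [[1, t, p, ta] for t, p, ta in zip(times, post, time_after)]
pre_level, pre_trend, level_change, trend_change = ols(X, dv)
print("Change in level:", round(level_change, 6))
print("Change in trend:", round(trend_change, 6))
```

    The last two coefficients are the quantities of interest: how much the level shifted at the intervention and how much the slope changed afterward.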

    Comparison Group Time Series

    Same as interrupted time series, but now we add in a comparison group that did not experience the event or intervention

    and also look at its pattern on that variable over time

    8. Describe the major threats to the internal validity of your specific study. Do not just list terms. For each threat

    you discuss, explain why it is a threat.

    Issues with internal validity

    Does the key EV actually account for change we may see in the DV?

    Are we sure that the relationship between the assumed EV and the DV does not owe to some other factor?

    Selection effects are not a threat to internal validity if the Treatment and Control groups are truly formed by random

    assignment

    But, experiments ARE susceptible to selection-mortality threats to internal validity, especially if one of the conditions

    (Treatment or Control) is noxious or intolerable

    And, experiments ARE also susceptible to social interaction threats to internal validity

    Threats Relevant to the Single Group Pretest-Posttest Quasi-Experimental Designs

    History

    History refers to any external or historical event that occurred during the course of the study that may be responsible for the

    effects instead of the program itself. For instance, in the late 1960s, soon after Head Start began, Sesame Street also began airing.

    Because Sesame Street includes lots of educational information, perhaps it could have been responsible for the apparent effects

    of Head Start. If we were studying psychotherapy with depressed patients at the time Prozac went on the market, all of our

    patients might have gotten better by the end of the study, because they went to a psychiatrist and got a prescription for Prozac.

    Maturation

    Maturation refers to a natural process that leads participants to change on the dependent measure. For instance, perhaps Head

    Start kids start getting better just because they were getting older during the study. One classic problem with many health studies

    is that patients get better by themselves (e.g., headache medicine trial). Maturation can refer to a decline as well as an

    improvement. For instance, some loss of short-term memory abilities as you get older, or divergence in girls' math scores and

    boys' verbal scores around middle school age. You can distinguish between history and maturation because maturation is

    internal, a natural course of things having to do with some quality of the participants in the study. History has to do with an

    external event of some kind.

    Instrumentation

    Instrumentation refers to an improvement or decline that occurs because of the measure itself. Instrumentation is most commonly found

    with observational measures. There is a common problem with observers getting better at observing. As an example, say we

    were observing young kids' verbal behaviors in Head Start. It may be that the observers miss a number of behaviors indicative

    of good verbal skills during the pretest, but they were more likely to count these behaviors at post-test. Instrumentation can also

    be a decline.

    Testing

    A threat that involves an improvement in scores on the post-test due to having taken the pretest is called testing. People

    commonly improve on standardized tests such as intelligence tests, SATs, or GREs. Something about taking the test the first

    time leads to a change in the test the second time, such as learning the answers or how the trick questions are set up. Perhaps

    the Head Start students learned the types of verbal questions (e.g., analogy) on the pretest and therefore knew how to do them

    better by posttest. In this case, it would not be the Treatment that had an effect but the experience of taking the test once that

    led to an improvement.

    Mortality (Attrition)

    Mortality refers to people dropping out of the study during its course. For instance, if the Head Start children having the most difficulty drop out of the program, then by the end of the study the participants who remain will have higher academic skills on

    average. I wonder whether the recent trends of more frequently expelling students from public schools will help improve a

    school's test scores.

    Statistical Regression to the Mean

    Regression to the mean is the most difficult to understand. It has to do with a sort of statistical fluke. Whenever scores fluctuate

    over time for any reason, extreme scores tend to move toward the middle, and middle scores tend to move toward the extreme.

    Regression to the mean is most likely to be a problem in a study in which participants are chosen for their extreme values. For

    example, an early study evaluating the effects of Sesame Street indicated that it was especially helpful to the most disadvantaged

    kids. The kids with the lowest skills improved the most. This could have been because the educational material that Sesame

    Street presents was the most helpful to the kids who knew the least, or because those who did especially poorly on the pretest

    had lower scores by chance, so that their improvement was simply a matter of random changes back toward their true scores.
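
    Regression to the mean can be demonstrated with a small simulation in which there is no program and no true change at all. This is a minimal sketch under assumed values: true skill and measurement noise are both drawn from normal distributions with a standard deviation of 10, and the seed, sample size, and selection of the 200 lowest pretest scorers are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Each child has a stable true score; each test adds independent random error.
true_scores = [rng.gauss(100, 10) for _ in range(2000)]
pretest = [t + rng.gauss(0, 10) for t in true_scores]
posttest = [t + rng.gauss(0, 10) for t in true_scores]   # no program, no real change

# Select the 200 lowest pretest scorers, as a program targeting the neediest kids might.
lowest = sorted(range(len(pretest)), key=lambda i: pretest[i])[:200]

pre_mean = mean(pretest[i] for i in lowest)
post_mean = mean(posttest[i] for i in lowest)
print("Selected group pretest mean:", round(pre_mean, 1))
print("Selected group posttest mean:", round(post_mean, 1))
```

    The selected group scores noticeably higher at posttest even though nothing was done to them, because part of what made their pretest scores so low was bad luck that does not repeat.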

    Threats Relevant to Two-Group Quasi-experimental Designs

    Selection Bias

    Selection bias refers to any difference between the groups before the start of the study. Two common ways selection bias can occur are through self-selection or experimenter-selection. Self-selection bias is when the participants themselves choose which

    group they are in. In the Head Start example, parents that choose to be in the program may have more motivation to teach their

    kids certain skills. In this example, kids improved because the kids or families choosing to be in Head Start were more

    motivated at the outset than those who chose not to be in the program. Experimenter selection bias is when the researcher

    chooses who is assigned to each group (e.g., program vs. control group). In the Head Start study, for instance, because of

    scarce resources, there may only be a limited number of slots available in the program, so the families that contact the program

    first are assigned to the program group and the families who contact the program last are assigned to the control. The families

    that contact the program earliest may be more motivated to educate their kids, and, therefore, the kids do better by the end of

    the study, not because of the effectiveness of the program but because of initial differences between the characteristics of the program and control groups. Because the researcher assigned families to the groups in a way that selected more motivated

    families, the results are due to experimenter selection bias. As long as we start out with comparable groups initially, the previous

    threats to internal validity are not a problem. It is very difficult to ensure that groups are completely comparable initially,

    however. The general weakness of two-group designs is selection factors, but selection can take on a number of different forms.

    All of the following represent a differential change in the groups as a result of selection plus other threats.

    Selection & History: Head Start kids are more motivated and more likely to watch Sesame Street when it came on.

    Selection & Maturation: Head Start kids are more motivated and thus more likely to develop skills faster than those in the control group.

    Selection & Testing: Head Start kids are more motivated, so they are more likely to figure out the tricks to the math questions.

    Consequently, with a second testing, they improved.

    Selection & Instrumentation: Perhaps observers in the control group get bored and are less likely to pick up on kids' improvement, so

    it looks like the program kids do better.

    Selection & Mortality: In the control group, the most motivated kids and families are more likely to drop out of the study

    because they found another preschool opportunity. This leads to differential attrition in which the kids most likely to improve will

    leave the control group. At posttest, the Treatment group will be better than the control group.

    Selection & Regression: Kids who are the furthest behind or have the most disadvantaged families are assigned to the

    Treatment group. Because they are more extreme to begin with, they show greater improvement due to regression rather than

    program effects.
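    The Selection & Regression case can be illustrated with a small simulation. This is only a sketch under assumed numbers (a true skill mean of 50, test noise with a standard deviation of 10, and a hypothetical rule that assigns the lowest-scoring 10% on the pretest to Treatment); none of these figures come from the study guide. No program effect is built in, yet the selected group still shows a gain.

```python
import random
import statistics

def simulate_regression_to_mean(n=10000, seed=0):
    """Assign the lowest pretest scorers to 'Treatment' with NO true program effect."""
    rng = random.Random(seed)
    # Each child has a stable true skill; every test adds independent noise.
    true_skill = [rng.gauss(50, 10) for _ in range(n)]
    pretest = [t + rng.gauss(0, 10) for t in true_skill]
    posttest = [t + rng.gauss(0, 10) for t in true_skill]  # no Treatment effect added

    # Hypothetical assignment rule: the bottom 10% on the pretest get the program.
    cutoff = sorted(pretest)[n // 10]
    treated = [i for i in range(n) if pretest[i] < cutoff]
    others = [i for i in range(n) if pretest[i] >= cutoff]

    gain_treated = statistics.mean(posttest[i] - pretest[i] for i in treated)
    gain_others = statistics.mean(posttest[i] - pretest[i] for i in others)
    return gain_treated, gain_others

gain_treated, gain_others = simulate_regression_to_mean()
print(f"Mean gain, lowest-scoring group: {gain_treated:.1f}")
print(f"Mean gain, everyone else:        {gain_others:.1f}")
```

    The apparent improvement of the selected group comes entirely from extreme pretest scores drifting back toward the mean, which is why a comparison group chosen the same way is needed.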

    Threats Relevant to Social Interaction

    Diffusion or Contamination

    This occurs when a comparison group learns about the program either directly or indirectly from program group participants. In

    a school context, children from different groups within the same school might share experiences during lunch hour. Or,

    comparison group students, seeing what the program group is getting, might set up their own experience to try to imitate that of

    the program group. In either case, if the diffusion or imitation affects the posttest performance of the comparison group, it can

    jeopardize your ability to assess whether your program is causing the outcome. Notice that this threat to validity tends

    to equalize the outcomes between groups, minimizing the chance of seeing a program effect even if there is one.

    Compensatory Rivalry

    Here, the comparison group knows what the program group is getting and develops a competitive attitude with them. The

    students in the comparison group might see the special math tutoring program the program group is getting and feel jealous. This

    could lead them to deciding to compete with the program group "just to show them" how well they can do. Sometimes, in

    contexts like these, the participants are even encouraged by well-meaning teachers or administrators to compete with each other

    (while this might make educational sense as a motivation for the students in both groups to work harder, it works against our

    ability to see the effects of the program). If the rivalry between groups affects posttest performance, it could make it more

    difficult to detect the effects of the program. As with diffusion and imitation, this threat generally works in the direction of

    equalizing the posttest performance across groups, increasing the chance that you won't see a program effect, even if the

    program is effective.

    Resentful Demoralization

    This is almost the opposite of compensatory rivalry. Here, students in the comparison group know what the program group is

    getting. But here, instead of developing a rivalry, they get discouraged or angry and they give up (sometimes referred to as the

    "screw you" effect!). Unlike the previous two threats, this one is likely to exaggerate posttest differences between groups,

    making your program look even more effective than it actually is.

    Compensatory Equalization of Treatment

    This is the only threat of the four that primarily involves the people who help manage the research context rather than the

    participants themselves. When program and comparison group participants are aware of each other's conditions they may wish

    they were in the other group (depending on the perceived desirability of the program it could work either way). Often they or

    their parents or teachers will put pressure on the administrators to have them reassigned to the other group. The administrators

    may begin to feel that the allocation of goods to the groups is not "fair" and may be pressured to, or independently undertake to,

    compensate one group for the perceived advantage of the other. If the special math tutoring program was being done with

    state-of-the-art computers, you can bet that the parents of the children assigned to the traditional non-computerized comparison

    group will pressure the principal to "equalize" the situation. Perhaps the principal will give the comparison group some other

    good, or let them have access to the computers for other subjects. If these "compensating" programs equalize the groups on

    posttest performance, it will tend to work against your detecting an effective program even when it does work. For instance, a

    compensatory program might improve the self-esteem of the comparison group and eliminate your chance to discover whether

    the math program would cause changes in self-esteem relative to traditional math training.

    9. Discuss the major threats to the external validity of your specific study. Do not just list terms. For each threat

    you discuss, explain why it is a threat.

    External validity is related to generalizing. That's the major thing you need to keep in mind. Recall that validity refers to the

    approximate truth of propositions, inferences, or conclusions. So, external validity refers to the approximate truth of

    conclusions that involve generalizations. Put in more pedestrian terms, external validity is the degree to which the conclusions in

    your study would hold for other persons in other places and at other times.

    A threat to external validity is an explanation of how you might be wrong in making a generalization. For instance, you conclude

    that the results of your study (which was done in a specific place, with certain types of people, and at a specific time) can be

    generalized to another context (for instance, another place, with slightly different people, at a slightly later time). There are three

    major threats to external validity because there are three ways you could be wrong: people, places or times. Your critics

    could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the

    study. Or, they could argue that it might only work because of the unusual place you did the study in (perhaps you did your

    educational study in a college town with lots of high-achieving educationally-oriented kids). Or, they might suggest that you did

    your study in a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issues

    the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the

    week before.

    How can we improve external validity? One way, based on the sampling model, suggests that you do a good job of drawing a

    sample from a population. For instance, you should use random selection, if possible, rather than a nonrandom procedure. And,

    once selected, you should try to assure that the respondents participate in your study and that you keep your dropout rates low.

    A second approach would be to use the theory of proximal similarity more effectively. How? Perhaps you could do a better job

    of describing the ways your contexts and others differ, providing lots of data about the degree of similarity between various

    groups of people, places, and even times. You might even be able to map out the degree of proximal similarity among various

    contexts with a methodology like concept mapping. Perhaps the best approach to criticisms of generalizations is simply to

    show them that they're wrong -- do your study in a variety of places, with different people and at different times. That is, your

    external validity (ability to generalize) will be stronger the more you replicate your study.

    Overall Issues with external validity

    Do the results generalize to the extent that the research claims they do?

    - Generalize from sample to target population?

    - Generalize to other settings, situations, or locations?

    - Generalize from research arrangement to real world? The participant is doing something that directly or indirectly

    generates the behavior that is being measured. Will the results generalize to other tasks or stimuli?

    - Will the findings continue to apply as society changes over the years (societal/temporal changes)?

    10. Given the design and the operationalized variables you propose, what statistical tools will you use to test the

    hypothesis(es)? (If you are doing regression, what kind of regression will you perform? Will there be anything

    special about the standard errors you use?)

    Regression-Adjusted Impact Estimates

    Random assignment should produce Treatment and control groups that do not differ systematically on a range of characteristics,

    thus reducing the need for statistical modeling that controls for baseline differences between the Treatment and C groups. It

    should theoretically be possible to just compare means across the groups, or run a bivariate regression of the outcome variable

    on the Treatment indicator.

    But, many think tanks use multiple regression to adjust for random baseline differences among the groups. (Note that random

    assignment does not guarantee that Treatment and control groups are identical on all characteristics that could relate to

    differences in the outcome variable (Orr: 188)). Taking account of these extant differences can improve the power of the

    design, allowing the researcher to detect smaller program effects than would otherwise be visible for the given n (Orr: 188).

    Collecting the baseline data on assorted covariates to permit regression-adjusted impact estimates is often cheaper than

    increasing the sample size n to increase the power of the design. This basic model estimates the average impact of the program

    on the entire Treatment group, not on any subgroup.
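    The power gain from covariates can be put in rough numbers. The sketch below relies on a standard approximation that is an assumption here, not something stated in the guide: under random assignment, covariates that explain a share R² of the outcome variance shrink the variance of the impact estimate by a factor of about (1 - R²). The helper names are hypothetical.

```python
import math

def se_ratio(r_squared):
    """Approximate ratio of adjusted to unadjusted standard error of the impact estimate."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return math.sqrt(1 - r_squared)

def equivalent_sample_size(n, r_squared):
    """Unadjusted sample size needed to match the precision of an adjusted design with n cases."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return n / (1 - r_squared)

# Example: a pretest that explains half of the outcome variance.
print(round(se_ratio(0.5), 3))            # 0.707
print(equivalent_sample_size(1000, 0.5))  # 2000.0
```

    Under this approximation, a baseline covariate explaining half the outcome variance is worth roughly a doubling of the sample, which is the sense in which collecting baseline data can be cheaper than increasing n.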

    One baseline covariate that goes into the model could well be the pre-program score on the outcome variable. Be careful

    choosing covariates! Only characteristics that are prior to random assignment (pre-Treatment) should be used as covariates in

    the model. Including post-Treatment characteristics in the model could bias the estimate of the effect of Treatment on the

    outcome variable. Aside from being pre-Treatment, the control variables should be correlated with the outcome and not

    affected by the Treatment.

    Adjusted Means

    The adjusted mean of Y (the outcome variable) for a given group is the regression function for that group evaluated at the mean

    values across all the groups for the covariates. If $\mu_x$ is the mean of x for the groups combined, then the adjusted mean of Y for

    any particular group is that group's regression function (the predicted equation for that group) evaluated at $\mu_x$ (evaluated at $\bar{x}$ in

    practice). Note that $\bar{x}$ is the average across all cases in the dataset for variable x. Note: If the adjusted means differ greatly from

    the arithmetic means, this indicates that the groups had very different means on covariates in the model, and the results you are

    getting from calculating adjusted means are premised on two assumptions that may be unrealistic: (1) that it makes sense to

    adjust the groups on this covariate, and (2) that the relationship between the DV and the covariate has the same linear form

    within each group, even if the mean of the covariate is quite different between groups. This latter assumption in particular may be

    unwarranted.
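    The adjusted-mean calculation described above can be sketched as follows. The data are made up, there is a single covariate x (think of it as a pretest score), and the function names are hypothetical. Each group's own fitted line is evaluated at the mean of x across all cases combined.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def adjusted_means(groups):
    """groups maps a group name to (x_values, y_values).

    Each group's regression line is evaluated at the mean of x
    taken over all cases in all groups combined.
    """
    all_x = [xi for x, _ in groups.values() for xi in x]
    grand_mean_x = statistics.mean(all_x)
    result = {}
    for name, (x, y) in groups.items():
        intercept, slope = fit_line(x, y)
        result[name] = intercept + slope * grand_mean_x
    return result

# Made-up data: the Treatment group starts with higher covariate values.
groups = {
    "treatment": ([4, 5, 6], [9, 11, 13]),  # raw mean of y is 11
    "control": ([1, 2, 3], [3, 5, 7]),      # raw mean of y is 5
}
print(adjusted_means(groups))
```

    In this example the raw means differ by 6, but both adjusted means equal 8.0 because both groups follow the same line and differ only on the covariate. A gap this large between raw and adjusted means is exactly the situation in which the two assumptions noted above deserve scrutiny.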

    The Difference Estimator

    $Y_i = \beta_0 + \beta_1 X_i + u_i$

    $Y_i$ is the post-Treatment score earned by case $i$

    $X_i$ is the Treatment indicator (or level) to which case $i$ was assigned

    $\beta_1$ is the Treatment effect: the causal effect of a unit change in X. If any differences exist between the Treatment and

    Control groups before Treatment (i.e. any randomization failure), the estimate will be biased

    If X is binary (Treatment or no Treatment), then the difference estimator $\hat{\beta}_1$ estimates $E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0)$

    Avg score for Treatment group after Treatment - Avg score for Control group after Treatment: $\hat{\beta}_1 = \bar{Y}^{treatment} - \bar{Y}^{control}$
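    A minimal sketch, using made-up scores, showing that with a binary Treatment indicator the OLS slope from regressing the outcome on the indicator is the same number as the difference in group means. The function names are hypothetical.

```python
import statistics

def difference_estimator(outcomes, treated):
    """Mean post-Treatment outcome of the Treatment group minus that of the Control group."""
    y_t = [y for y, t in zip(outcomes, treated) if t == 1]
    y_c = [y for y, t in zip(outcomes, treated) if t == 0]
    return statistics.mean(y_t) - statistics.mean(y_c)

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Made-up post-Treatment scores; 1 = Treatment, 0 = Control.
scores = [12, 15, 11, 14, 9, 10, 8, 13]
assigned = [1, 1, 1, 1, 0, 0, 0, 0]

print(difference_estimator(scores, assigned))  # 13 - 10 = 3
print(ols_slope(assigned, scores))             # 3.0
```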

    Regression-Adjusted Difference Estimator

    If there was true random assignment, then the difference estimator is unbiased

    But it could be made more efficient (smaller variance) through the addition of relevant covariates to the model

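    $Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$, where $W_{1i}, \dots, W_{ri}$ are the pre-Treatment covariates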

    Careful: Do NOT interpret coefficients on other covariates as causal!

    Do not put any post-Treatment variables into the model!
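    The efficiency claim can be checked with a small simulation. This is a sketch under assumed values (a true Treatment effect of 2, one pre-Treatment covariate W with coefficient 3, 200 cases per experiment); none of these numbers come from the guide, and the tiny `ols` helper is a hypothetical stand-in for a statistics package.

```python
import random
import statistics

def ols(X, y):
    """Least-squares coefficients for y on the columns of X (each row starts with 1)."""
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def one_experiment(rng, n=200, effect=2.0):
    """Run one randomized experiment; return (unadjusted, adjusted) estimates of the effect."""
    w = [rng.gauss(0, 1) for _ in range(n)]   # pre-Treatment covariate
    x = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(x)                            # random assignment
    y = [1 + effect * xi + 3 * wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
    unadjusted = ols([[1, xi] for xi in x], y)[1]
    adjusted = ols([[1, xi, wi] for xi, wi in zip(x, w)], y)[1]
    return unadjusted, adjusted

rng = random.Random(1)
results = [one_experiment(rng) for _ in range(200)]
unadj = [r[0] for r in results]
adj = [r[1] for r in results]
print(f"unadjusted: mean {statistics.mean(unadj):.2f}, sd {statistics.stdev(unadj):.2f}")
print(f"adjusted:   mean {statistics.mean(adj):.2f}, sd {statistics.stdev(adj):.2f}")
```

    Both estimators center on the true effect across repeated experiments, but the adjusted one varies much less from experiment to experiment. Only the coefficient on the Treatment indicator is read as causal here.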

    The Differences-in-Differences Estimator AKA Double Difference Estimator (Panel Data)

    $\Delta Y_i = \beta_0 + \beta_1 X_i + u_i$

    $\Delta Y_i$ is the difference between the post-Treatment score and the pre-Treatment score for case $i$

    $\beta_1$ is the differences-in-differences estimator: the causal effect of the Treatment

    (Avg score for Treatment group after Treatment - Avg score for Treatment group before Treatment) - (Avg score for Control group

    after Treatment - Avg score for Control group before Treatment)

    $\hat{\beta}_1 = \Delta\bar{Y}^{treatment} - \Delta\bar{Y}^{control}$

    Note: Will be biased if the Treatment and control groups follow different time trends for some reason (failure of parallel trends

    assumption).
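    A minimal sketch of the differences-in-differences calculation from group averages, using made-up panel scores. The function name and the data are hypothetical.

```python
import statistics

def diff_in_diff(pre_t, post_t, pre_c, post_c):
    """(Treatment after - Treatment before) - (Control after - Control before)."""
    change_t = statistics.mean(post_t) - statistics.mean(pre_t)
    change_c = statistics.mean(post_c) - statistics.mean(pre_c)
    return change_t - change_c

# Made-up scores for the same cases before and after the program.
pre_t, post_t = [50, 54, 58], [60, 63, 69]   # Treatment group: 54 -> 64
pre_c, post_c = [60, 62, 64], [64, 66, 68]   # Control group:   62 -> 66

print(diff_in_diff(pre_t, post_t, pre_c, post_c))            # 10 - 4 = 6
print(statistics.mean(post_t) - statistics.mean(post_c))     # simple post-only difference: -2
```

    Here the Control group started ahead, so the post-only difference estimator would suggest a negative effect. Differencing out each group's starting point gives 6, which is a causal estimate only if both groups would otherwise have followed the same trend.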

    The Regression-Adjusted Differences-in-Differences Estimator

    $\Delta Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$

    Do not interpret coefficients on other covariates as causal

    Do not put any post-Treatment variables into the model

    Intent to Treat (ITT)

    You can use random assignment to create a Treatment group and a C group

    But differences can arise between the groups after randomization (e.g. some people drop out of experiment)

    If there is attrition from either group or both groups, then you cannot simply compare those who actually end up being treated

    to those who actually ended up staying in the control group and assume the difference is the causal impact of the Treatment. These

    two groups were not produced by random assignment.

    Assessing Impact w/Repeated Cross-Sectional Data

    What do you do with the dropouts/non-compliers?

    Usually it is not possible to observe their outcomes

    But simply excluding them would bias the Treatment evaluation

    There is no single, agreed-upon method for handling this

    You can build an estimate of what their outcome would have been had they stayed in Treatment

    Assume their outcome was the undesirable one to generate conservative estimates of ITT and TOT

    ITT is the average effect of assignment to Treatment.

    E.g. the average effect of being offered a voucher to move

    Does not accurately portray the effect of actually receiving Treatment if there is any non-compliance.

    E.g. does not tell us the effect of actually using the voucher to move

    Analysis uses all cases, in the groups to which they were randomly assigned. Ignore imperfect compliance. Ignore deviations

    from protocol.

    If you have crossover (units assigned to Treatment actually being subjected to C and vice versa), ITT usually gives a

    conservative estimate of the effect of Treatment

    All other estimators involve making model-dependent (and often very sensitive) corrections for crossover or attrition. ITT does

    not and therefore is the most robust

    If the noncompliance rate in the population is similar to the noncompliance rate in the experiment, then ITT can give a useful

    estimate of impact of Treatment were it actually adopted in population.

    But recall the limits on external validity that come from having to conduct random assignment among volunteers

    If you are doing ITT analysis, say so clearly:

    Do not claim to be estimating the causal effect of actually receiving Treatment.

    Rather, you are estimating the causal effect of being assigned to Treatment vs. to control.

    Calculating ITT

    - ITT is the average effect of being assigned to Treatment: the causal effect of assignment to Treatment or control, i.e. the mean

    outcome difference when Treatment is offered and when it is not. It is the Difference Estimator we saw last week, $E(Y \mid X = 1) - E(Y \mid X = 0)$, where X is the assignment indicator: the average outcome for cases assigned to the Treatment scenario minus the average outcome for cases

    assigned to the C scenario. Can do regression-adjusted impact estimates or unadjusted impact estimates.

    But what if you want to estimate the impact of receiving Treatment? Effect of Treatment on the treated (TOT). TOT > ITT if

    you had non-compliance with Treatment.
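    Both quantities can be sketched on made-up data. The TOT function uses the standard no-show form (ITT divided by the share of the Treatment group that actually participated), which rests on the assumptions discussed in the next bullet: zero effect on no-shows and no Control cases receiving Treatment. The function names are hypothetical.

```python
import statistics

def itt(outcomes, assigned):
    """Mean outcome of cases assigned to Treatment minus mean outcome of cases assigned to C."""
    y_t = [y for y, a in zip(outcomes, assigned) if a == 1]
    y_c = [y for y, a in zip(outcomes, assigned) if a == 0]
    return statistics.mean(y_t) - statistics.mean(y_c)

def tot_no_show(outcomes, assigned, participated):
    """No-show adjustment: ITT divided by the participation rate among those assigned to Treatment.

    Assumes the program has zero effect on no-shows and that nobody
    assigned to C received Treatment.
    """
    rate = statistics.mean(p for p, a in zip(participated, assigned) if a == 1)
    if rate == 0:
        raise ValueError("no one assigned to Treatment participated")
    return itt(outcomes, assigned) / rate

# Made-up data: 4 cases assigned to Treatment (only 2 show up), 4 assigned to C.
assigned     = [1, 1, 1, 1, 0, 0, 0, 0]
participated = [1, 1, 0, 0, 0, 0, 0, 0]
outcomes     = [14, 14, 10, 10, 10, 10, 10, 10]

print(itt(outcomes, assigned))                        # 12 - 10 = 2
print(tot_no_show(outcomes, assigned, participated))  # 2 / 0.5 = 4.0
```

    Every case stays in the group to which it was assigned, so the ITT comparison keeps the benefit of randomization. The TOT figure is larger because the same total gain is attributed to the half of the group that actually participated.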

    - One approach, which works if the impact of the program on non-participants is zero, is the no-show adjustment: divide the ITT estimate by the share of the Treatment group that actually participated. This

    approach makes no assumptions about whether those who drop out are similar to those who participate. It only assumes that

    the effect of the program on non-participants is zero. This is probably a valid assumption in voluntary programs but not

    necessarily a valid assumption when the program is mandatory (e.g. work for welfare)

    - King et al. arguments against regression-adjustments

    Randomization allows model-free impact estimates. These estimates are not dependent on particular decisions and thus are not

    going to vanish with slight changes in modeling decisions.

    If regression adjustments suddenly made a Treatment effect appear significant when it was not significant without those adjustments, there

    was probably something wrong with the experiment. It is possible that a post-Treatment variable was put into the model amid

    these adjustments.

    They say it is better to do a good job at the design stage than to try to compensate for poor designs after the fact

    Matching

    -A way to try to create a comparison group that looks very similar to the Treatment group when random

    assignment did not generate these groups

    -A way to pre-process an observational dataset before running regression

    -A way to thereby reduce the degree to which the results one gets depend on the model one specifies

    - Get your dataset

    * Run a series of matching procedures

    * Make sure that matching is improving the balance on the covariates (ideally all of them, but minimally those that you have

    theoretical reason to think are most important)

    * Run regression analysis, but do so using the matched dataset rather than the original dataset

    * Calculate the simple difference in means between the Treatment and Comparison groups formed from matching

    * Except where exact matching is possible (rarely), you should still use regression adjustments rather than just calculating the

    difference in means

    One-to-one exact matching

    Match each Treatment unit with one C unit. For the matched cases, values are identical on all covariates on which you

    matched.

    If feasible, this procedure eliminates all dependence on functional form when running regression.

    Here, can just do difference in means test

    But it is usually infeasible.

    General Exact Matching

    Matches all control units to a Treatment unit with exactly the same covariate values

    So one Treatment case could be matched to 3 C cases

    Need to use a weighted difference in means to account for there being a different number of Treatment and C units
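    A minimal sketch of general exact matching with a weighted difference in means. The weighting used here (each stratum's difference weighted by its number of Treatment units) is one reasonable choice and an assumption of this sketch, as are the covariate labels and data.

```python
from collections import defaultdict
import statistics

def exact_match_effect(units):
    """units: list of (treated, covariates_tuple, outcome).

    Within each stratum of identical covariate values, compare the mean outcome
    of Treatment units with the mean outcome of C units. Strata lacking either
    kind of unit are dropped. Stratum differences are averaged with weights
    equal to the number of Treatment units in the stratum.
    """
    strata = defaultdict(lambda: {1: [], 0: []})
    for treated, covs, outcome in units:
        strata[covs][treated].append(outcome)
    total, n_treated = 0.0, 0
    for groups in strata.values():
        if groups[1] and groups[0]:
            diff = statistics.mean(groups[1]) - statistics.mean(groups[0])
            total += diff * len(groups[1])
            n_treated += len(groups[1])
    if n_treated == 0:
        raise ValueError("no stratum contains both Treatment and C units")
    return total / n_treated

units = [
    # One Treatment case matched to 3 C cases with identical covariates.
    (1, ("urban", "low"), 12), (0, ("urban", "low"), 9),
    (0, ("urban", "low"), 11), (0, ("urban", "low"), 10),
    # Two Treatment cases matched to one C case.
    (1, ("rural", "high"), 20), (1, ("rural", "high"), 22), (0, ("rural", "high"), 18),
    # A C case with no exact Treatment match is dropped.
    (0, ("rural", "low"), 5),
]
print(round(exact_match_effect(units), 3))  # (1*2 + 2*3) / 3 = 2.667
```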

    Nearest Neighbor Matching on the Propensity Score

    Match each Treatment unit with the C unit having the most similar estimated propensity score

    Less restrictive than exact matching
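    A minimal sketch of nearest neighbor matching, assuming the propensity scores have already been estimated elsewhere (for example from a logit model of Treatment on the covariates). Matching is done with replacement here, which is a simplifying assumption; the unit ids, scores, and outcomes are made up.

```python
import statistics

def nearest_neighbor_match(treated, controls):
    """treated, controls: lists of (unit_id, propensity_score, outcome).

    Returns (treated_id, control_id) pairs; each Treatment unit gets the C unit
    with the closest propensity score (matching with replacement).
    """
    pairs = []
    for t_id, t_score, _ in treated:
        c_id, _, _ = min(controls, key=lambda c: abs(c[1] - t_score))
        pairs.append((t_id, c_id))
    return pairs

def matched_difference(treated, controls):
    """Average outcome difference across the matched pairs."""
    outcome_t = {t_id: y for t_id, _, y in treated}
    outcome_c = {c_id: y for c_id, _, y in controls}
    pairs = nearest_neighbor_match(treated, controls)
    return statistics.mean(outcome_t[t] - outcome_c[c] for t, c in pairs)

treated = [("t1", 0.80, 15), ("t2", 0.55, 12)]
controls = [("c1", 0.78, 11), ("c2", 0.50, 10), ("c3", 0.10, 4)]

print(nearest_neighbor_match(treated, controls))  # [('t1', 'c1'), ('t2', 'c2')]
print(matched_difference(treated, controls))      # (4 + 2) / 2 = 3
```

    The C unit with a score of 0.10 is never used because no Treatment unit resembles it. As the guide notes, the matched data would normally still go through regression adjustment rather than a bare difference in means.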

    Check for Balance Improvements

    Check Matching

    Best: QQ plots for each covariate

    Compare the histograms

    Compare the mean of each covariate for the Treatment and the C group before vs. after matching. The smaller the

    difference the better. Do not carry out difference of means t-test to assess balance (see Ho et al. 2011: 4)

    Check the balance even for covariates that are not part of the matching procedure.

    What if data cannot be improved? The data may simply be too fragile to permit causal inference

    Matching only works when the data permit it to work

    Data where all Treatment cases are severely different from all C cases may not be amenable to matching.
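    The mean-comparison balance check described above can be sketched as follows. The covariate names and values are made up, and the function names are hypothetical; QQ plots and histograms would complement this in practice.

```python
import statistics

def mean_differences(treatment_rows, c_rows, covariates):
    """Absolute difference in covariate means between Treatment and C rows (lists of dicts)."""
    return {
        name: abs(statistics.mean(r[name] for r in treatment_rows)
                  - statistics.mean(r[name] for r in c_rows))
        for name in covariates
    }

def balance_table(treatment_rows, c_before, c_after, covariates):
    """For each covariate: (difference before matching, difference after, improved?)."""
    before = mean_differences(treatment_rows, c_before, covariates)
    after = mean_differences(treatment_rows, c_after, covariates)
    return {name: (before[name], after[name], after[name] < before[name])
            for name in covariates}

treatment = [{"age": 30, "income": 40}, {"age": 34, "income": 44}]
c_before = [{"age": 31, "income": 41}, {"age": 33, "income": 43},
            {"age": 50, "income": 60}, {"age": 62, "income": 80}]
c_after = c_before[:2]  # the C cases kept by a hypothetical matching procedure

for name, row in balance_table(treatment, c_before, c_after, ["age", "income"]).items():
    print(name, row)
```

    In this example the difference in means falls from 12 to 0 for age and from 14 to 0 for income. The comparison is of the size of the differences, not of a t-test on them.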

    Linear Probability Model (Vs. Logistic Regression)

    LPM is the linear multiple regression model applied to a binary DV. It models the probability that the DV is 1.

    If Y is a binary DV, we use the linear regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + u_i$

    One advantage is that it is easy to use. Coefficients are directly interpretable as changes in the probability that the DV is 1, whereas those from probit and logit are not.

    A drawback is that it can predict probabilities less than 0 or greater than 1, which make no sense but are an inevitable

    consequence of linear regression. It also cannot capture nonlinear relationships. Probit and logit force predicted values

    to be between 0 and 1 and are specifically designed for binary DVs.
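    The out-of-range problem is easy to see on a small made-up dataset with one predictor. This sketch fits the LPM by ordinary least squares; the data and function names are hypothetical.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data: binary outcome y, single predictor x.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

intercept, slope = fit_line(x, y)

def predicted_probability(value):
    """LPM prediction: a straight line, so nothing keeps it inside [0, 1]."""
    return intercept + slope * value

print(f"slope: {slope:.3f}")                         # change in P(y = 1) per unit of x
print(f"P(y=1 | x=0) = {predicted_probability(0):.3f}")
print(f"P(y=1 | x=9) = {predicted_probability(9):.3f}")
```

    The slope reads directly as the change in the predicted probability per unit of x, which is the interpretability advantage. The predictions at the ends of the range fall below 0 and above 1, which is the drawback that probit and logit avoid.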

    11. Describe at least one robustness check that you will do in addition to the analyses you have described so far.

    If you make assumptions about data availability, or about how programs work, clearly state those assumptions. If

    the prompt indicates that data is unavailable, or that a program will not be administered in some particular way, do

    not contradict those statements with your assumptions. The program described below is hypothetical.
