Standards of Evidence for Program Efficacy,
Effectiveness and Readiness for Dissemination
Society for Prevention Research (SPR)
Brian R. Flay, D.Phil., Oregon State University
For the SPR Standards Committee
Three Main Sections of the Talk
• Presentation of The SPR Standards
• Elements of an Ideal Trial
• Discussion of Particular Methodological Issues
Members of the SPR Committee
• Brian R. Flay (Chair), D. Phil., U of Illinois at Chicago
• Anthony Biglan, Ph.D., Oregon Research Institute
• Robert F. Boruch, Ph.D., U of Pennsylvania
• Felipe G. Castro, Ph.D., MPH, Arizona State U
• Denise Gottfredson, Ph.D., Maryland U
• Sheppard Kellam, M.D., AIR
• Eve K. Moscicki, Sc.D., MPH, NIMH
• Steven Schinke, Ph.D., Columbia U
• Jeff Valentine, Ph.D., Duke University
• With help from Peter Ji, Ph.D., U of Illinois at Chicago
Pressure for programs of “proven effectiveness”
• The Federal Government increasingly requires that Federal money be spent only on programs of “proven effectiveness”
– Substance Abuse and Mental Health Services Administration (SAMHSA) and Center for Substance Abuse Prevention (CSAP)
– U.S. Department of Education (DE)
– Office of Juvenile Justice and Delinquency Prevention (OJJDP)
What is “Proven Effectiveness”?
• Requires rigorous research methods to determine that observed effects were caused by the program being tested, rather than some other cause.
• Randomized controlled trials (RCTs) are the gold standard.
• E.g., the Food and Drug Administration (FDA) requires at least two randomized clinical trials before approving a new drug.
• Randomized controlled trials (RCTs) are always expensive, and there are many challenges to conducting RCTs in schools.
Multiple Approaches to Standards of Evidence
• Each government agency and academic group that has reviewed programs for lists has come up with its own set of standards.
• Some examples:
– CDC - Guide to Preventive Services
– SAMHSA - National Registry of Evidence-based Programs and Practices
– US Dept of Education - What Works Clearinghouse
– Justice Department – Multiple lists
– Many others ….
Why So Many Lists of Evidence-Based Programs and Practices?
• They are all similar but not equal
– E.g., CSAP allows more studies in than ED
• All concern the rigor of the research
• The Society for Prevention Research recently created standards for the field
• Our innovation was to consider nested sets of standards for efficacy, effectiveness and readiness for dissemination
Four Kinds of Validity (Cook & Campbell, 1979; Shadish, Cook & Campbell, 2002)
• Construct validity
– Program description and measures of outcomes
• Internal validity
– Was the intervention the cause of the change in the outcomes?
• External validity (Generalizability)
– Was the intervention tested on relevant participants and in relevant settings?
• Statistical validity
– Can accurate effect sizes be derived from the study?
Standards for 3 Levels
• Efficacy
– What effects can the intervention have under ideal conditions?
• Effectiveness
– What effects does the intervention have under real-world conditions?
• Dissemination
– Is an effective intervention ready for broad application or distribution?
• Desirable
– Additional criteria that provide added value to evaluated interventions
Phases of Research in Prevention Program Development (Flay, 1986)
1. Basic Research
2. Hypothesis Development
3. Component Development and Pilot Studies
4. Prototype Studies of Complete Programs
5. Efficacy Trials of Refined Programs
6. Treatment Effectiveness Trials
– Generalizability of effects under standardized delivery
7. Implementation Effectiveness Trials
– Effectiveness with real-world variations in implementation
8. Demonstration Studies
– Implementation and evaluation in multiple systems
Specificity of Efficacy Statement
• “Program X is efficacious for producing Y outcomes for Z population.”
– The program (or policy, treatment, strategy) is named and described
– The outcomes for which proven effects are claimed are clearly stated
– The population to which the claim can be generalized is clearly defined
Program Description
• Efficacy
– Intervention must be described at a level that would allow others to implement or replicate it
• Effectiveness
– Manuals, training and technical support must be available
– The intervention should be delivered under the same kinds of conditions as one would expect in the real world
– A clear theory of causal mechanisms should be stated
– Clear statement of “for whom?” and “under what conditions?” the intervention is expected to work
• Dissemination
– Provider must have the ability to “go-to-scale”
Outcomes
• ALL claimed public health or behavioral outcome(s) must be measured
– Attitudes or intentions cannot substitute for actual behavior
• At least one long-term follow-up
– The appropriate interval may vary by type of intervention and state-of-the-field
Measures
• Efficacy
– Psychometrically sound
• Valid
• Reliable (internal consistency, test-retest or inter-rater reliability)
• Data collectors independent of the intervention
• Effectiveness
– Level of exposure also should be measured
• Integrity and level of implementation
• Acceptance/compliance/adherence/involvement of target audience in the intervention
• Dissemination
– Monitoring and evaluation tools available
Desirable Measures
• For ALL
– Multiple measures
– Mediating variables (or immediate effects)
– Moderating variables
– Potential side-effects
– Potential iatrogenic effects
Design – for Causal Clarity
• At least one comparison group
– No-treatment, usual care, placebo or wait-list
• Assignment to conditions must maximize causal clarity
– Random assignment is “the gold standard”
– Other acceptable designs
• Multiple baseline or repeated time-series designs
• Regression-discontinuity
• Well-done matched controls
– Demonstrated pretest equivalence on multiple measures
– Known selection mechanism
Generalizability of Findings
• Efficacy
– Sample is defined
• Who it is (from what “defined” population)
• How it was obtained (sampling methods)
• Effectiveness
– Description of real-world target population and sampling methods
– Degree of generalizability should be evaluated
• Desirable
– Subgroup analyses
– Dosage studies/analyses
– Replication with different populations
– Replication with different program providers
Precision of Outcomes: Statistical Analysis
• Statistical analysis allows unambiguous causal statements
– At same level as randomization and includes all cases assigned to conditions
– Tests for pretest differences
– Adjustments for multiple comparisons
– Analyses of (and adjustments for) attrition
• Rates, patterns and types
• Desirable
– Report extent and patterns of missing data
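The slide calls for adjustments for multiple comparisons without naming a method. As one illustration only, here is a minimal sketch of the Holm step-down adjustment; the choice of Holm, the function name `holm_adjust` and the outcome names and p-values are all assumptions made for this example, not part of the SPR standards.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values for four measured outcomes.
outcome_p = {"smoking": 0.004, "alcohol": 0.03, "violence": 0.02, "attendance": 0.40}
for name, p_adj in zip(outcome_p, holm_adjust(list(outcome_p.values()))):
    print(f"{name}: adjusted p = {p_adj:.3f}")
```

Holm controls the family-wise error rate and is never less powerful than a plain Bonferroni correction, which is why it is a common default; other procedures may suit a given trial better.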
Precision of Outcomes: Statistical Significance
• Statistically significant effects
– Results must be reported for all measured outcomes
– Efficacy can be claimed only for constructs with a consistent pattern of statistically significant positive effects
– There must be no statistically significant negative (iatrogenic) effects on important outcomes
Precision of Outcomes: Practical Value
• Efficacy
– Demonstrated practical significance in terms of public health (or other relevant) impact
– Report of effects for at least one follow-up
• Effectiveness
– Report empirical evidence of practical importance
• Dissemination
– Clear cost information available
• Desirable
– Cost-effectiveness or cost-benefit analyses
Precision of Outcomes: Replication
• Consistent findings from at least two different high-quality studies/replicates that meet all of the other criteria for efficacy and each of which has adequate statistical power
– Flexibility may be required in the application of this standard in some substantive areas
• When more than 2 studies are available, the preponderance of evidence must be consistent with that from the 2 most rigorous studies
• Desirable
– The more replications the better
Additional Desirable Criteria for Dissemination
• Organizations that choose to adopt a prevention program that barely or not quite meets all criteria should seriously consider undertaking a replication study as part of the adoption effort so as to add to the body of knowledge.
• A clear statement of the factors that are expected to assure the sustainability of the program once it is implemented.
Nested Standards
Efficacy Effectiveness Dissemination Desirable
20 28 31 43
How you can use the Standards when Questioning Public Officials
• Has the program been evaluated in an RCT?
• Were units randomized to program and control (no program or alternative program) conditions?
• Has the program been evaluated on populations like yours?
• Have the findings been replicated?
• Were any evaluations independent from the program developers?
School- or Community-Based Prevention/Promotion Studies are Large and Complex
• Large randomized trials
– With multiple schools or other units per condition
• Comparisons with “treatment as usual”
• Measurement of implementation process and program integrity
• Assessment of effects on presumed mediators
– Helps test theories
• Multiple measures/sources of data
– Surveys of students, parents, teachers, staff, community
– Teacher and parent reports of behavior
– School records for behavior and achievement
• Multiple, independent trials of promising programs
– At both efficacy and effectiveness levels
• Cost-effectiveness analyses
Example Programs that Come Close to Meeting The SPR Standards?
• Life Skills Training (Botvin)
– Multiple RCTs with different populations, implementers and types of training
– Only one long-term follow-up
– Independent replications of short-term effects are now appearing (as well as some failures)
– No independent replications of long-term effects yet
• Positive Action
– Two high-quality matched control studies
– Two randomized trials
• Others?
Programs for Which the Research Meets the Standards, But Do Not Work
• DARE
– Many quasi-experimental and non-experimental studies suggested effectiveness
– Multiple RCTs found no effects (Ennett meta-analysis)
• Hutchinson
– Well-designed RCT
– But no published information on the program or short-term effects – need to demonstrate short-term effects before you can say anything meaningful about long-term effects
– But no long-term effects
– Cannot be interpreted because of lack of information
Much of Prevention is STUCK … We’re Spinning our Wheels
• At the Efficacy Trial phase
– How can we get more programs into effectiveness trials?
• At the Effectiveness Trial phase
– How can we get more proven programs adopted?
• At the “Model Program” phase
– How can we ensure the ongoing effectiveness of model programs?
• Lots more prevention research is needed – at all levels!
II. Example of a well-designed evaluation
RCT: Randomized Controlled (Community) Trial
Community-Based Tobacco Control Program
Comprehensive Intervention
• Mass Media Campaign
– Legacy “truth” Campaign
• School
– Curriculum
– School-wide Policy
– Family Involvement
• Community Mobilization
– Youth Access
– Smoke-free environments
– Other policy changes
RCT with Matched Pairs
• 6-12 matched pairs (need power analysis)
• Community = Defined Media Markets
• Matched pairs or strata:
– Use existing data to match
• Population, smoking rates, media reach
– Possible use of baseline data to match
• Youth smoking survey
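The "need power analysis" note can be made concrete with a rough calculation. The sketch below uses the standard design-effect approximation for a two-arm cluster-randomized trial with a normal approximation to the test statistic. The function name and every input value (400 surveyed youths per community, ICC of .02, a standardized effect of 0.2) are hypothetical, and the calculation ignores any precision gained from matching, so treat it as a first pass rather than a substitute for a full power analysis.

```python
from math import sqrt
from statistics import NormalDist


def cluster_trial_power(clusters_per_arm, members_per_cluster, icc,
                        effect_size, alpha=0.05):
    """Approximate power for comparing two arms of a cluster-randomized trial.

    effect_size is a standardized mean difference. The normal approximation
    is optimistic when the number of clusters is small.
    """
    # Variance inflation caused by correlated outcomes within a cluster.
    design_effect = 1 + (members_per_cluster - 1) * icc
    # Number of independent observations the clustered sample is worth, per arm.
    n_effective = clusters_per_arm * members_per_cluster / design_effect
    standard_error = sqrt(2 / n_effective)
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / standard_error - z_critical)


# Hypothetical inputs: 6 to 12 communities per arm, 400 youths surveyed in each.
for communities in (6, 8, 10, 12):
    power = cluster_trial_power(communities, 400, icc=0.02, effect_size=0.2)
    print(f"{communities} communities per arm: approximate power {power:.2f}")
```

With few clusters, a t-based or simulation-based calculation would be more trustworthy than this approximation.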
Randomization
• Each member of each pair agrees to randomization after baseline data collected
• OR
• Match on basis of archival data, recruit to study, then randomize
• OR
• Randomize to conditions, then recruit
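Whichever sequence is used, the assignment step itself is a coin flip within each matched pair. A minimal sketch, assuming pairs have already been formed; the community names and the helper `randomize_matched_pairs` are hypothetical, and a fixed seed is used only so the allocation can be audited and reproduced.

```python
import random


def randomize_matched_pairs(pairs, seed):
    """Assign one member of each matched pair to program and the other to control."""
    rng = random.Random(seed)  # seeded generator gives a reproducible allocation
    assignment = {}
    for first, second in pairs:
        # A fair coin flip decides which member of the pair gets the program.
        if rng.random() < 0.5:
            program, control = first, second
        else:
            program, control = second, first
        assignment[program] = "program"
        assignment[control] = "control"
    return assignment


# Hypothetical media markets, matched on population, smoking rates and media reach.
matched_pairs = [("Market A", "Market B"), ("Market C", "Market D"),
                 ("Market E", "Market F")]
for community, condition in randomize_matched_pairs(matched_pairs, seed=2024).items():
    print(community, "->", condition)
```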
Design Elements
O OxxxxOxxxxOxxxxOxxxxOxxxxOO
RO O O O O O O O O
Improve Power:
• Multiple baseline measures
• Multiple control groups per intervention group
• Multiple waves of data
Steps
• Baseline studies
– Tobacco use prevalence and survey
– Qualitative evaluation of appropriateness of messages
• Second or more baselines
• Implement all components of intervention
• Monitoring of intervention implementation
• Follow-up surveys
– Awareness of campaign components
– Of intermediate changes - e.g., prices, availability
– Youth tobacco use attitudes and behaviors
Specific Measures: Process
• Pre-intervention measure of readiness
• Implementation
– Exposure to media campaign or number of lessons taught
– Fidelity of implementation
– # youth involved in empowerment programs
– Community organizational structures/involvement
• Post-study sustainability of intervention
Measures: Intermediate
• Policy Change
– School tobacco-free policies
– Tobacco prices (and change in prices)
– Community smoke-free policies
• Enforcement of tobacco-free policies
– School
– Community
Specific Measures: Outcomes
• Tobacco use by adolescents 13-15
– % ever used
– % use currently
• % of youth “susceptible” to initiation
• Exposure to 2nd hand smoke
– % children, nonsmoking adolescents/young adults
• Per capita sales of tobacco products
– Cigarettes
– Other tobacco products
Appropriate Analyses
• Take account of unit of assignment
• Take account of nesting
– times within subjects within settings
• Take account of subject mobility
– E.g., stayers, leavers and joiners
• Take account of missing data
• Make full use of longitudinal data
– E.g., growth curves
III. Methodological Issues – Some More Detail
• What would an ideal trial look like?
• Issues re Randomization
• Sample sizes
• Where are the control groups?
• Intensive measurement
• Unit of Analysis
• Nature of the target population
• Moderation and Mediation
Why Random Assignment of Schools
• Intervention is delivered to intact classrooms
• Random assignment of classes is subject to contamination across classes within schools
• Program includes school-wide components
• Credible causal statements require group equivalence at both the group and individual levels
– On the outcome variable
– On presumed mediating variables
– On motivation or desire to change
Randomized Prevention Research Studies
• First Waterloo Study was the first with a sufficient N of schools for randomization to be “real”
– Some earlier studies were claimed as “randomized” with only one or two schools per condition
• Other smoking prevention studies in early-mid 80’s
– McAlister, Hansen/Evans, Murray/Luepker, Perry/Murray, Biglan/Ary, Dielman
• Other substance abuse prevention studies in late 80’s and 90’s
– Johnson/Hansen/Flay/Pentz, Botvin and colleagues
• Extended to sexual behavior, AIDS, & Violence in 90’s
– Also McArthur Network initiated trials of Comer
• Character Education in 2002
– New DoE funding
Issues re Randomization
• Ethical resistance to the idea of randomization needs to be addressed, though it’s becoming rare
• Control schools like to have a program too
– Use usual Health Education (treatment as usual)
– Offer special, but unrelated, program
• E.g., Aban Aya Health Enhancement Curriculum as control for Social Development (violence, sex, drug prevention) Curriculum
– Pay schools for access to collect data from students, parents and teachers -- $500-$2,000 per year
• Currently, because of NCLB demands, many schools are too busy to want to be in an intervention condition
– Too many teaching and testing demands
– Too many other special programs already
– Pay them for staff support and special activities
Approaches to Randomization
• Pure randomization from a large population
• Obtain agreement first
– Even prior agreements can break down (Waterloo)
• Then randomize from matched sets defined by
– Presumed predictors of the outcome (Graham et al., Aban Aya)
– Actual predictors of the outcome (Hawaii Positive Action trial)
– Individual-level pretest levels of the outcome (has anyone ever achieved this?)
• If schools refuse or drop out, replace from the same set
– Only one school of 15 initial selections/assignments for Aban Aya refused and was replaced
– But this compromises randomization – different probabilities
• Drop the set
– We had to drop multiple sets in the Hawaii Positive Action trial because of refusals by schools assigned to the program
Breakdown of Randomization/Design
• Failure of randomization
– Don’t use posttest-only designs (to my knowledge none have)
• Schools drop out during course of study
– Use signed agreements (none dropped out of Aban Aya or Hawaii)
• Configuration of schools is changed during course of study
– E.g., A school is closed, two schools are combined
– Drop the paired school as well (& add replacement set if it’s soon enough)
• A program school refuses to deliver the program, or delivers it poorly
– Try to avoid this with incentive and technical support
– E.g., Schaps Child Development Study only had 5 schools of 12 implement the program well – and reports emphasize results from these 5.
– Botvin often reported results only for students who received more than 60% of the lessons
– “Intention to Treat” analysis should be reported first. Reporting results for the high-implementation group is appropriate only as a secondary level of analysis
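The difference between the two reporting choices is easy to see in a toy calculation. The sketch below contrasts an intention-to-treat difference with a high-implementation difference on invented data; the record layout, the 60% cutoff as a parameter and the scores are all hypothetical, and a real analysis would also account for clustering within schools.

```python
from statistics import mean


def program_minus_control(records, min_dose=None):
    """Difference in mean outcome, program minus control.

    min_dose=None gives the intention-to-treat contrast: every student is
    analyzed in the condition they were assigned to. A numeric min_dose keeps
    only program students who received at least that share of the lessons.
    """
    program = [r["outcome"] for r in records
               if r["assigned"] == "program"
               and (min_dose is None or r["dose"] >= min_dose)]
    control = [r["outcome"] for r in records if r["assigned"] == "control"]
    return mean(program) - mean(control)


# Hypothetical problem-behavior scores (lower is better); dose = share of lessons received.
records = [
    {"assigned": "program", "dose": 0.9, "outcome": 2.0},
    {"assigned": "program", "dose": 0.8, "outcome": 3.0},
    {"assigned": "program", "dose": 0.3, "outcome": 5.0},
    {"assigned": "program", "dose": 0.1, "outcome": 6.0},
    {"assigned": "control", "dose": 0.0, "outcome": 5.0},
    {"assigned": "control", "dose": 0.0, "outcome": 4.0},
    {"assigned": "control", "dose": 0.0, "outcome": 6.0},
    {"assigned": "control", "dose": 0.0, "outcome": 5.0},
]

print("Intention to treat:", program_minus_control(records))
print("High implementation only:", program_minus_control(records, min_dose=0.6))
```

The high-implementation contrast looks larger here, but the students who received most of the lessons are no longer a randomized group, so the difference may reflect who attended rather than what the program did. That is the reason it belongs in a secondary analysis.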
Expense --> Small Ns?
• Yes, in many cases
– Average efficacy trial to date (where research funds support the intervention) had 4-8 schools per condition, and cost ~$500,000-900,000 per year.
– Effectiveness trials (where intervention is less costly) have 10-20 schools per condition for $500,000 per year.
• Limit costs by using more small schools
– Raises questions about generalizability of results to large schools
• Limit costs by limiting variability between schools
– Also limits generalizability of results
The Changing Nature of Control Groups
• The medical model suggests use of a placebo and double blinding, neither of which is possible for educational programs
• Subjects (both students and schools) should have equal expectations of what they will get from the program
• Few studies have used alternative programs to control for Hawthorne effect or student expectancies
– TVSFP, Sussman, Aban Aya
• It is not possible to have pure controls in schools today – they all have multiple programs
– Must monitor other programs in both sets of schools
Implications of no blinding
• Requires careful monitoring of program delivery
• Assessment of acceptance of, involvement in, and expectations of program by target audience
• Monitoring of what happens in control schools
• Data collectors blinded to conditions
– Or at least to comparisons being made
– This condition has rarely been met in prevention research
• Data collectors not known to students
– To ensure greater confidentiality and more honest reports of behavior
• Classroom teachers should not be present (or be unobtrusive) during student surveys
• Use unobtrusive measures -- rarely used so far
– Use of archival data and playground observations are possibilities
– Though they have their own problems
Parental Consent Issues
• Historical use of “passive” consent
– Parents informed, but only respond if want to “opt out” their child or themselves
• More and more IRBs are requiring active signed consent
• When is active consent required?
– If asking “sensitive” questions
• Drug use, sexual behavior, illegal behavior, family relationships
– If students are “required” to participate
• Protection of Pupil Rights Amendment (PPRA)
– Data are not anonymous (or totally confidential)
– There is more than minimal risk if data become non-confidential
• Thus, passive consent should be allowed if:
– Not asking about sensitive issues
• Allows surveys of young students (K-3/4)
– Students not required to participate
• By NIH rules, students already must be given the opportunity to opt out of complete surveys or to skip questions
• Requires careful “assent” procedures
– Strict non-disclosure protocols are followed
• Multiple levels of ID numbers for tracking
• No individual (or classroom or school)–level data are ever released
Changes in Student Body During a Study
• Transfers out and in
– Students who transfer out of or into a study school are, on average, at higher risk than other students
– Are transfers out replaced by transfers in, or are rates different?
– Are rates the same across experimental conditions?
• Absenteeism
– Students with higher rates of absenteeism are also, on average, at higher risk than others
– Are rates the same across experimental conditions?
• Rates of transfers in/out, absenteeism, or dropout that are differential by condition present the most serious problem
– Requires careful assessment and analysis
– Missing data techniques of limited value when rates are differential because data are then not MCAR (missing completely at random)
– But may be useful for MAR (missing at random, that is, if missing is predictable)
• We do not follow students who leave study schools and we add students who enter during the study
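A first check on whether attrition is differential by condition can be as simple as comparing the two loss rates. The sketch below uses a pooled two-proportion z-test; the function name and the counts are hypothetical, and because it treats students as independent it understates uncertainty when losses cluster within schools. It is a screening step, not the careful analysis the slide asks for.

```python
from math import sqrt
from statistics import NormalDist


def attrition_difference_test(lost_program, n_program, lost_control, n_control):
    """Two-sided z-test for a difference in attrition rates between conditions.

    Returns (rate difference, z statistic, p-value). Ignores clustering of
    students within schools.
    """
    rate_program = lost_program / n_program
    rate_control = lost_control / n_control
    # Pooled rate under the null hypothesis of equal attrition.
    pooled = (lost_program + lost_control) / (n_program + n_control)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_program + 1 / n_control))
    z = (rate_program - rate_control) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_program - rate_control, z, p_value


# Hypothetical counts: students lost to follow-up out of those enrolled at baseline.
difference, z, p_value = attrition_difference_test(150, 500, 100, 500)
print(f"rate difference = {difference:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```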
Complex Interventions
• Always thought of as curricula, or whole programs, not separate components
– Few field-based tests of efficacy of separate components to date
– But curricula/programs based on basic and hypothesis-driven research
• Programs have grown more complex over the years
• Multiple outcomes are the norm
– Achievement + multiple Behaviors + Character (ABCs)
– Also multiple ecologies are involved – moderators are likely
– School-wide
– Involvement of parents/families
– Involvement of community (e.g., Aban Aya)
• Therefore, multiple mediators, both distal and proximal
– Distal: Family patterns, school climate, community involvement
– Proximal: Attitudes, normative beliefs, self-efficacy, intentions
Complex Outcomes, Intensive Measurement and Long-term Follow-up
• Many expected outcomes and mediators lead to extensive and intensive measurement
• Early concern with measurement reactivity
– Led to recommendation of complex designs to rule it out
– No longer considered very seriously --
– “If only behavior were so easily changed!”
• Long-term follow-up imperative
– Few programs with documented effects into or through high school
• The longer the study, the more the attrition
– Due to drop-outs, transfers, absenteeism, refusals
– Include incoming students in the study
Unit of Analysis
• Has received the most persistent attention
• Early studies were analyzed at the student level
• Early recommendation was to analyze at the school level – the level of random assignment
• Much attention to intraclass correlation
– Typically only in the .01-.05 range
– With 4-10 schools per condition, analyses at the student and school level can produce same p values
• Development of multi-level analysis techniques
– Bryk & Raudenbush, Goldstein, Hedeker & Gibbons
– Longitudinal data seen as another level of nesting
– Growth curve analyses becoming popular
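To show what the intraclass correlation (ICC) measures and why even values in the .01-.05 range matter, here is a minimal sketch of the one-way ANOVA estimator of the ICC for equal-size schools, together with the usual design effect 1 + (m - 1) × ICC. The function names and the scores are hypothetical; multi-level models of the kind cited on the slide handle unequal school sizes and covariates, which this sketch does not.

```python
from statistics import mean


def anova_icc(groups):
    """One-way ANOVA estimate of the intraclass correlation (equal-size groups only)."""
    k = len(groups)       # number of schools
    m = len(groups[0])    # students per school
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes equal-size groups")
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    # Between-school and within-school mean squares.
    ms_between = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (k - 1)
    ms_within = sum((x - gm) ** 2
                    for g, gm in zip(groups, group_means) for x in g) / (k * (m - 1))
    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


def design_effect(icc, members_per_cluster):
    """Factor by which clustering inflates the variance of a mean."""
    return 1 + (members_per_cluster - 1) * icc


# Hypothetical behavior scores for three schools of four students each.
schools = [[3.0, 4.0, 5.0, 4.0], [5.0, 6.0, 5.0, 6.0], [2.0, 3.0, 3.0, 4.0]]
print("Estimated ICC:", round(anova_icc(schools), 3))
# Even a small ICC matters when many students are sampled per school.
print("Design effect at ICC .02 with 101 students per school:", design_effect(0.02, 101))
```

A design effect of 3 means the sample carries roughly a third of the information a simple random sample of the same size would, which is why student-level analyses that ignore nesting can overstate significance.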
The Nature of the Target Population
• Universal, Selective and Indicated Interventions
– Universal = complete population
– Selective = those at higher risk
– Indicated = those already evidencing early stages
• Implications of variation in risk levels of students in universal interventions
– Suggests multi-level/nested interventions might be desirable
• E.g., Fast Track
– Suggests analyses by risk level
Hypothetical example of differential effects by risk level
[Figure: level of behavior (y-axis, 0 to 6) by time of measurement (T1 to T5), with separate lines for Hi Risk Program, Hi Risk Control, Med Risk Program, Med Risk Control, Lo Risk Program and Lo Risk Control.]
Examples of Moderation and Mediation
• Example of major moderation from Aban Aya
– Effects for males only
• Examples of mediation from Aban Aya
– Following slides
• Example of another kind of process analysis
– Later slide from Positive Action
Aban Aya: Male violence was brought down to the level of female violence
[Figure: line chart with series Males C, Males Tx and All Females; axis titles not recoverable from the transcript.]
Summary of Mediation Results for Males (Aban Aya)
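[Figure: mediation path diagrams for SDC & SC, one for Sub. Use and one for Violence; mediator labels include Attitudes, Frnd Bhv., Encourage, Intentions and Estimate. Path values not recoverable from the transcript.]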
SDC Condom Use
Mediation analyses not yet done
Another kind of process analysis
EFFECTS OF THE POSITIVE ACTION PROGRAM
Multi-group analysis, School-level data for 55 PA schools and 29 control schools. First path parameter (Standardized) is for Controls, second is for PA schools.
Average % or means shown for all variables. Percentage of variance explained (R2) shown for outcomes.
[Path diagram: individual path coefficients cannot be matched to paths from the transcript and are omitted.]
Predictors (mean): % African-American (25.2), % Mobility (43.7), % Free/reduced lunch (59.6), % White (51.7)
Outcomes (mean) and R2:
• VIOLENCE: Incidents per 100 students (3), R2 .35/.13
• Out of School SUSPENSIONS (1.7), R2 .73/.51
• % ABSENT >20 days (3.5), R2 .64/.61
• ACHIEVEMENT, R2 .92/.81: Florida Comprehensive Aptitude (Total) (330); Grade 5 NRT (Total) (319)
Model fit: χ2 = 48.03 @ 40 df, p = .18, RMSEA = .069
Constrained model fit: χ2 = 69.4 @ 51 df, p = .09, RMSEA = .09
χ2 diff = 21.37 @ 11 df, p = .03