
Standards of Evidence for Program Efficacy, Effectiveness and Readiness for Dissemination

Society for Prevention Research (SPR)

Brian R. Flay, D.Phil., Oregon State University

For the SPR Standards Committee

Three Main Sections of the Talk

• Presentation of The SPR Standards

• Elements of an Ideal Trial

• Discussion of Particular Methodological Issues

Members of the SPR Committee

• Brian R. Flay (Chair), D. Phil., U of Illinois at Chicago

• Anthony Biglan, Ph.D., Oregon Research Institute

• Robert F. Boruch, Ph.D., U of Pennsylvania

• Felipe G. Castro, Ph.D., MPH, Arizona State U

• Denise Gottfredson, Ph.D., Maryland U

• Sheppard Kellam, M.D., AIR

• Eve K. Moscicki, Sc.D., MPH, NIMH

• Steven Schinke, Ph.D., Columbia U

• Jeff Valentine, Ph.D., Duke University

• With help from Peter Ji, Ph.D., U of Illinois at Chicago

Pressure for programs of “proven effectiveness”

• The Federal Government increasingly requires that Federal money be spent only on programs of “proven effectiveness”
  – Substance Abuse and Mental Health Services Administration (SAMHSA) and Center for Substance Abuse Prevention (CSAP)
  – U.S. Department of Education (ED)
  – Office of Juvenile Justice and Delinquency Prevention (OJJDP)

What is “Proven Effectiveness”?

• Requires rigorous research methods to determine that observed effects were caused by the program being tested, rather than some other cause.

• Randomized controlled trials (RCTs) are the gold standard.

• E.g., the Food and Drug Administration (FDA) requires at least two randomized clinical trials before approving a new drug.

• Randomized controlled trials (RCTs) are always expensive, and there are many challenges to conducting RCTs in schools.

Multiple Approaches to Standards of Evidence

• Each government agency and academic group that has reviewed programs for lists has come up with its own set of standards.

• Some examples:
  – CDC - Guide to Preventive Services
  – SAMHSA - National Registry of Evidence-based Programs and Practices
  – US Dept of Education - What Works Clearinghouse
  – Justice Department – Multiple lists
  – Many others ….

Why So Many Lists of Evidence-Based Programs and Practices?

• They are all similar but not equal
  – E.g., CSAP allows more studies in than ED
• All concern the rigor of the research
• The Society for Prevention Research recently created standards for the field
• Our innovation was to consider nested sets of standards for efficacy, effectiveness and readiness for dissemination

Four Kinds of Validity (Cook & Campbell, 1979; Shadish, Cook & Campbell, 2002)

• Construct validity
  – Program description and measures of outcomes
• Internal validity
  – Was the intervention the cause of the change in the outcomes?
• External validity (Generalizability)
  – Was the intervention tested on relevant participants and in relevant settings?
• Statistical validity
  – Can accurate effect sizes be derived from the study?

Standards for 3 Levels

• Efficacy
  – What effects can the intervention have under ideal conditions?
• Effectiveness
  – What effects does the intervention have under real-world conditions?
• Dissemination
  – Is an effective intervention ready for broad application or distribution?
• Desirable
  – Additional criteria that provide added value to evaluated interventions

Phases of Research in Prevention Program Development (Flay, 1986)

1. Basic Research
2. Hypothesis Development
3. Component Development and Pilot Studies
4. Prototype Studies of Complete Programs
5. Efficacy Trials of Refined Programs
6. Treatment Effectiveness Trials
  – Generalizability of effects under standardized delivery
7. Implementation Effectiveness Trials
  – Effectiveness with real-world variations in implementation
8. Demonstration Studies
  – Implementation and evaluation in multiple systems

Specificity of Efficacy Statement

• “Program X is efficacious for producing Y outcomes for Z population.”
  – The program (or policy, treatment, strategy) is named and described
  – The outcomes for which proven effects are claimed are clearly stated
  – The population to which the claim can be generalized is clearly defined

Program Description

• Efficacy
  – Intervention must be described at a level that would allow others to implement or replicate it
• Effectiveness
  – Manuals, training and technical support must be available
  – The intervention should be delivered under the same kinds of conditions as one would expect in the real world
  – A clear theory of causal mechanisms should be stated
  – Clear statement of “for whom?” and “under what conditions?” the intervention is expected to work
• Dissemination
  – Provider must have the ability to “go-to-scale”

Outcomes

• ALL claimed public health or behavioral outcome(s) must be measured
  – Attitudes or intentions cannot substitute for actual behavior
• At least one long-term follow-up
  – The appropriate interval may vary by type of intervention and state-of-the-field

Measures

• Efficacy
  – Psychometrically sound
    • Valid
    • Reliable (internal consistency, test-retest or inter-rater reliability)
    • Data collectors independent of the intervention
• Effectiveness
  – Level of exposure also should be measured
    • Integrity and level of implementation
    • Acceptance/compliance/adherence/involvement of target audience in the intervention
• Dissemination
  – Monitoring and evaluation tools available

Desirable Measures

• For ALL
  – Multiple measures
  – Mediating variables (or immediate effects)
  – Moderating variables
  – Potential side-effects
  – Potential iatrogenic effects

Design – for Causal Clarity

• At least one comparison group
  – No-treatment, usual care, placebo or wait-list
• Assignment to conditions must maximize causal clarity
  – Random assignment is “the gold standard”
  – Other acceptable designs
    • Multiple baseline or repeated time-series designs
    • Regression-discontinuity
    • Well-done matched controls
      – Demonstrated pretest equivalence on multiple measures
      – Known selection mechanism

Generalizability of Findings

• Efficacy
  – Sample is defined
    • Who it is (from what “defined” population)
    • How it was obtained (sampling methods)
• Effectiveness
  – Description of real-world target population and sampling methods
  – Degree of generalizability should be evaluated
• Desirable
  – Subgroup analyses
  – Dosage studies/analyses
  – Replication with different populations
  – Replication with different program providers

Precision of Outcomes: Statistical Analysis

• Statistical analysis allows unambiguous causal statements
  – At same level as randomization and includes all cases assigned to conditions
  – Tests for pretest differences
  – Adjustments for multiple comparisons
  – Analyses of (and adjustments for) attrition
    • Rates, patterns and types
• Desirable
  – Report extent and patterns of missing data
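The standards call for adjustments for multiple comparisons but do not name a method. As one possible illustration, the sketch below applies the Holm step-down adjustment to a set of p-values. The function name `holm_adjust` and the example p-values are hypothetical and are not taken from the SPR standards.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for four measured outcomes (made-up numbers).
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])  # prints [0.04, 0.09, 0.09, 0.2]
```

With these made-up values, only the first outcome stays below .05 after adjustment, which shows why a claim of efficacy across several outcomes needs the adjusted rather than the raw p-values.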

Precision of Outcomes: Statistical Significance

• Statistically significant effects
  – Results must be reported for all measured outcomes
  – Efficacy can be claimed only for constructs with a consistent pattern of statistically significant positive effects
  – There must be no statistically significant negative (iatrogenic) effects on important outcomes

Precision of Outcomes: Practical Value

• Efficacy
  – Demonstrated practical significance in terms of public health (or other relevant) impact
  – Report of effects for at least one follow-up
• Effectiveness
  – Report empirical evidence of practical importance
• Dissemination
  – Clear cost information available
• Desirable
  – Cost-effectiveness or cost-benefit analyses

Precision of Outcomes: Replication

• Consistent findings from at least two different high-quality studies/replicates that meet all of the other criteria for efficacy and each of which has adequate statistical power
  – Flexibility may be required in the application of this standard in some substantive areas
• When more than 2 studies are available, the preponderance of evidence must be consistent with that from the 2 most rigorous studies
• Desirable
  – The more replications the better

Additional Desirable Criteria for Dissemination

• Organizations that choose to adopt a prevention program that barely or not quite meets all criteria should seriously consider undertaking a replication study as part of the adoption effort so as to add to the body of knowledge.

• A clear statement of the factors that are expected to assure the sustainability of the program once it is implemented.

Nested Standards

Efficacy   Effectiveness   Dissemination   Desirable
20         28              31              43

How you can use the Standards when Questioning Public Officials

• Has the program been evaluated in a RCT?
• Were units randomized to program and control (no program or alternative program) conditions?
• Has the program been evaluated on populations like yours?
• Have the findings been replicated?
• Were any evaluations independent from the program developers?

School- or Community-Based Prevention/Promotion Studies are Large and Complex

• Large randomized trials
  – With multiple schools or other units per condition
• Comparisons with “treatment as usual”
• Measurement of implementation process and program integrity
• Assessment of effects on presumed mediators
  – Helps test theories
• Multiple measures/sources of data
  – Surveys of students, parents, teachers, staff, community
  – Teacher and parent reports of behavior
  – School records for behavior and achievement
• Multiple, independent trials of promising programs
  – At both efficacy and effectiveness levels
• Cost-effectiveness analyses

Example Programs that Come Close to Meeting The SPR Standards?

• Life Skills Training (Botvin)
  – Multiple RCTs with different populations, implementers and types of training
  – Only one long-term follow-up
  – Independent replications of short-term effects are now appearing (as well as some failures)
  – No independent replications of long-term effects yet
• Positive Action
  – Two high-quality matched control studies
  – Two randomized trials
• Others?

Programs for Which the Research Meets the Standards, But Which Do Not Work

• DARE
  – Many quasi-experimental and non-experimental studies suggested effectiveness
  – Multiple RCTs found no effects (Ennett meta-analysis)
• Hutchinson
  – Well-designed RCT
  – But no published information on the program or short-term effects – need to demonstrate short-term effects before you can say anything meaningful about long-term effects
  – But no long-term effects
  – Cannot be interpreted because of lack of information

Much of Prevention is STUCK … We’re Spinning our Wheels

• At the Efficacy Trial phase
  – How can we get more programs into effectiveness trials?
• At the Effectiveness Trial phase
  – How can we get more proven programs adopted?
• At the “Model Program” phase
  – How can we ensure the ongoing effectiveness of model programs?
• Lots more prevention research is needed – at all levels!

II. Example of a well-designed evaluation

RCT: Randomized Controlled (Community) Trial

Community-Based Tobacco Control Program

Comprehensive Intervention

• Mass Media Campaign
  – Legacy “truth” Campaign
• School
  – Curriculum
  – School-wide Policy
  – Family Involvement
• Community Mobilization
  – Youth Access
  – Smoke-free environments
  – Other policy changes

RCT with Matched Pairs

• 6-12 matched pairs (need power analysis)

• Community = Defined Media Markets

• Matched pairs or strata:
  – Use existing data to match
    • Population, smoking rates, media reach
  – Possible use of baseline data to match
    • Youth smoking survey

Randomization

• Each member of each pair agrees to randomization after baseline data collected

• OR

• Match on basis of archival data, recruit to study, then randomize

• OR

• Randomize to conditions, then recruit
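The "match, then randomize" option can be sketched as follows. This is a minimal illustration and not the procedure used in any trial named in this talk. It assumes a single matching score per community (here a made-up baseline youth smoking rate), pairs adjacent communities after sorting on that score, and assigns conditions within each pair at random. The function name, market names and rates are all hypothetical.

```python
import random


def randomize_matched_pairs(units, match_key, seed=None):
    """Pair units with adjacent matching scores, then randomize within pairs."""
    rng = random.Random(seed)
    ranked = sorted(units, key=match_key)
    if len(ranked) % 2:
        raise ValueError("need an even number of units to form pairs")
    plan = []
    for i in range(0, len(ranked), 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random order decides who gets the program
        plan.append({"pair": i // 2 + 1, "program": pair[0], "control": pair[1]})
    return plan


# Hypothetical media markets with made-up baseline youth smoking rates (%).
markets = [
    {"name": "A", "smoking_rate": 18.2},
    {"name": "B", "smoking_rate": 24.5},
    {"name": "C", "smoking_rate": 17.9},
    {"name": "D", "smoking_rate": 25.1},
    {"name": "E", "smoking_rate": 21.0},
    {"name": "F", "smoking_rate": 20.4},
]

plan = randomize_matched_pairs(markets, match_key=lambda m: m["smoking_rate"], seed=42)
for row in plan:
    print(row["pair"], "program:", row["program"]["name"], "control:", row["control"]["name"])
```

A real design would match on several variables at once (population, smoking rates, media reach) and would fix the number of pairs from a power analysis; the single-score sort here only shows the order of operations.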

Design Elements

O OxxxxOxxxxOxxxxOxxxxOxxxxOO

RO O O O O O O O O

Improve Power:
• Multiple baseline measures
• Multiple control groups per intervention group
• Multiple waves of data

Steps

• Baseline studies
  – Tobacco use prevalence and survey
  – Qualitative eval of appropriateness of messages
• Second or more baselines
• Implement all components of intervention
• Monitoring of intervention implementation
• Follow-up surveys
  – Awareness of campaign components
  – Of intermediate changes - e.g., prices, availability
  – Youth tobacco use attitudes and behaviors

Specific Measures: Process

• Pre-intervention measure of readiness
• Implementation
  – Exposure to media campaign or number of lessons taught
  – Fidelity of implementation
  – # youth involved in empowerment programs
  – Community organizational structures/involvement

• Post-study sustainability of intervention

Measures: Intermediate

• Policy Change
  – School tobacco-free policies
  – Tobacco prices (and change in prices)
  – Community smoke-free policies
• Enforcement of tobacco-free policies
  – School
  – Community

Specific Measures: Outcomes

• Tobacco use by adolescents 13-15
  – % ever used
  – % use currently
• % of youth “susceptible” to initiation
• Exposure to 2nd hand smoke
  – % children, nonsmoking adolescents/young adults
• Per capita sales of tobacco products
  – Cigarettes
  – Other tobacco products

Appropriate Analyses

• Take account of unit of assignment
• Take account of nesting
  – times within subjects within settings
• Take account of subject mobility
  – E.g., stayers, leavers and joiners
• Take account of missing data
• Make full use of longitudinal data
  – E.g., growth curves

III. Methodological Issues – Some More Detail

• What would an ideal trial look like?
• Issues re Randomization
• Sample sizes
• Where are the control groups?
• Intensive measurement
• Unit of Analysis
• Nature of the target population
• Moderation and Mediation

Why Random Assignment of Schools

• Intervention is delivered to intact classrooms

• Random assignment of classes is subject to contamination across classes within schools

• Program includes school-wide components
• Credible causal statements require group equivalence at both the group and individual levels
  – On the outcome variable
  – On presumed mediating variables
  – On motivation or desire to change

Randomized Prevention Research Studies

• First Waterloo Study was the first with a sufficient N of schools for randomization to be “real”
  – Some earlier studies were claimed as “randomized” with only one or two schools per condition
• Other smoking prevention studies in early-mid 80’s
  – McAlister, Hansen/Evans, Murray/Luepker, Perry/Murray, Biglan/Ary, Dielman
• Other substance abuse prevention studies in late 80’s and 90’s
  – Johnson/Hansen/Flay/Pentz, Botvin and colleagues
• Extended to sexual behavior, AIDS, & violence in 90’s
  – Also MacArthur Network initiated trials of Comer
• Character Education in 2002
  – New DoE funding

Issues re Randomization

• Ethical resistance to the idea of randomization needs to be addressed, though it’s becoming rare
• Control schools like to have a program too
  – Use usual Health Education (treatment as usual)
  – Offer special, but unrelated, program
    • E.g., Aban Aya Health Enhancement Curriculum as control for Social Development (violence, sex, drug prevention) Curriculum
  – Pay schools for access to collect data from students, parents and teachers -- $500-$2,000 per year
• Currently, because of NCLB demands, many schools are too busy to want to be in an intervention condition
  – Too many teaching and testing demands
  – Too many other special programs already
  – Pay them for staff support and special activities

Approaches to Randomization

• Pure randomization from a large population
• Obtain agreement first
  – Even prior agreements can break down (Waterloo)
• Then randomize from matched sets defined by
  – Presumed predictors of the outcome (Graham et al., Aban Aya)
  – Actual predictors of the outcome (Hawaii Positive Action trial)
  – Individual-level pretest levels of the outcome (has anyone ever achieved this?)
• If schools refuse or drop out, replace from the same set
  – Only one school of 15 initial selections/assignments for Aban Aya refused and was replaced
  – But this compromises randomization – different probabilities
• Drop the set
  – We had to drop multiple sets in the Hawaii Positive Action trial because of refusals by schools assigned to the program

Breakdown of Randomization/Design

• Failure of randomization
  – Don’t use posttest-only designs (to my knowledge none have)
• Schools drop out during course of study
  – Use signed agreements (none dropped out of Aban Aya or Hawaii)
• Configuration of schools is changed during course of study
  – E.g., a school is closed, two schools are combined
  – Drop the paired school as well (& add replacement set if it’s soon enough)
• A program school refuses to deliver the program, or delivers it poorly
  – Try to avoid this with incentive and technical support
  – E.g., Schaps Child Development Study only had 5 schools of 12 implement the program well – and reports emphasize results from these 5.
  – Botvin often reported results only for students who received more than 60% of the lessons
  – “Intention to Treat” analysis should be reported first. Reporting results for the high-implementation group is appropriate only as a secondary level of analysis
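The difference between the two analyses can be shown with a toy calculation. The records below are entirely made up, the 60% cut-off is borrowed from the Botvin example above only for illustration, and the comparison is a simple difference in means that ignores the clustering of students within schools, so it is a sketch of the logic rather than an appropriate trial analysis.

```python
from statistics import mean

# Hypothetical student records: condition assigned, share of lessons
# received ("dose"), and a problem-behavior score (lower is better).
students = [
    {"assigned": "program", "dose": 0.9, "outcome": 2.0},
    {"assigned": "program", "dose": 0.8, "outcome": 3.0},
    {"assigned": "program", "dose": 0.3, "outcome": 6.0},
    {"assigned": "program", "dose": 0.1, "outcome": 5.0},
    {"assigned": "control", "dose": 0.0, "outcome": 5.0},
    {"assigned": "control", "dose": 0.0, "outcome": 4.0},
    {"assigned": "control", "dose": 0.0, "outcome": 6.0},
    {"assigned": "control", "dose": 0.0, "outcome": 5.0},
]


def mean_difference(records, keep_program=lambda r: True):
    """Program mean minus control mean, optionally filtering program cases."""
    program = [r["outcome"] for r in records
               if r["assigned"] == "program" and keep_program(r)]
    control = [r["outcome"] for r in records if r["assigned"] == "control"]
    return mean(program) - mean(control)


# Primary analysis: everyone analyzed as assigned (intention to treat).
itt = mean_difference(students)

# Secondary analysis: only program students with more than 60% of lessons.
high_implementation = mean_difference(students, keep_program=lambda r: r["dose"] > 0.6)

print(itt, high_implementation)  # prints -1.0 -2.5
```

In this invented data the high-implementation estimate is much larger than the intention-to-treat estimate because the students who received few lessons also had the worst outcomes. That selection is exactly why the filtered result cannot stand in for the primary analysis.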

Expense --> Small Ns?

• Yes, in many cases
  – Average efficacy trial to date (where research funds support the intervention) had 4-8 schools per condition, and cost ~$500,000-900,000 per year.
  – Effectiveness trials (where intervention is less costly) have 10-20 schools per condition for $500,000 per year.
• Limit costs by using more small schools
  – Raises questions about generalizability of results to large schools
• Limit costs by limiting variability between schools
  – Also limits generalizability of results

The Changing Nature of Control Groups

• The medical model suggests use of a placebo and double blinding, neither of which is possible for educational programs

• Subjects (both students and schools) should have equal expectations of what they will get from the program

• Few studies have used alternative programs to control for Hawthorne effect or student expectancies
  – TVSFP, Sussman, Aban Aya
• It is not possible to have pure controls in schools today – they all have multiple programs
  – Must monitor other programs in both sets of schools

Implications of no blinding

• Requires careful monitoring of program delivery
• Assessment of acceptance of, involvement in, and expectations of program by target audience
• Monitoring of what happens in control schools
• Data collectors blinded to conditions
  – Or at least to comparisons being made
  – This condition has rarely been met in prevention research
• Data collectors not known to students
  – To ensure greater confidentiality and more honest reports of behavior
• Classroom teachers should not be present (or be unobtrusive) during student surveys
• Use unobtrusive measures -- rarely used so far
  – Use of archival data and playground observations are possibilities
  – Though they have their own problems

Parental Consent Issues

• Historical use of “passive” consent
  – Parents informed, but only respond if want to “opt out” their child or themselves
• More and more IRBs are requiring active signed consent
• When is active consent required?
  – If asking “sensitive” questions
    • Drug use, sexual behavior, illegal behavior, family relationships
  – If students are “required” to participate
    • Protection of Pupil Rights Amendment (PPRA)
  – Data are not anonymous (or totally confidential)
  – There is more than minimal risk if data become non-confidential
• Thus, passive consent should be allowed if:
  – Not asking about sensitive issues
    • Allows surveys of young students (K-3/4)
  – Students not required to participate
    • By NIH rules, students already must be given the opportunity to opt out of complete surveys or to skip questions
    • Requires careful “assent” procedures
  – Strict non-disclosure protocols are followed
    • Multiple levels of ID numbers for tracking
    • No individual (or classroom or school)–level data are ever released

Changes in Student Body During a Study

• Transfers out and in
  – Students who transfer out of or into a study school are, on average, at higher risk than other students
  – Are transfers out replaced by transfers in, or are rates different?
  – Are rates the same across experimental conditions?
• Absenteeism
  – Students with higher rates of absenteeism are also, on average, at higher risk than others
  – Are rates the same across experimental conditions?
• Rates of transfers in/out, absenteeism, or dropout that are differential by condition present the most serious problem
  – Requires careful assessment and analysis
  – Missing data techniques of limited value when rates are differential because data are then not MCAR (missing completely at random)
  – But may be useful for MAR (missing at random; that is, if missing is predictable)

• We do not follow students who leave study schools and we add students who enter during the study
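One simple first check on whether attrition is differential by condition is to compare the two attrition rates directly. The sketch below uses a two-proportion z-test with a normal approximation. The counts are made up, the function name is hypothetical, and the test treats students as independent, so with school-level assignment it should be read only as a screening step before a proper multilevel analysis.

```python
import math


def attrition_difference(lost_a, n_a, lost_b, n_b):
    """Compare attrition rates in two conditions.

    Returns (rate difference, z statistic, two-sided p-value) from a
    pooled two-proportion z-test.
    """
    p_a, p_b = lost_a / n_a, lost_b / n_b
    pooled = (lost_a + lost_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p_a - p_b, z, p_value


# Hypothetical counts: students lost to follow-up out of those at baseline.
diff, z, p = attrition_difference(lost_a=60, n_a=400, lost_b=30, n_b=400)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.4f}")
```

A clearly non-zero difference, as in this invented example, would signal that missingness depends on condition and that the cautions above about missing data techniques apply.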

Complex Interventions

• Always thought of as curricula, or whole programs, not separate components
  – Few field-based tests of efficacy of separate components to date
  – But curricula/programs based on basic and hypothesis-driven research
• Programs have grown more complex over the years
• Multiple outcomes are the norm
  – Achievement + multiple Behaviors + Character (ABCs)
  – Also multiple ecologies are involved – moderators are likely
  – School-wide
  – Involvement of parents/families
  – Involvement of community (e.g., Aban Aya)
• Therefore, multiple mediators, both distal and proximal
  – Distal: Family patterns, school climate, community involvement
  – Proximal: Attitudes, normative beliefs, self-efficacy, intentions

Complex Outcomes, Intensive Measurement and Long-term Follow-up

• Many expected outcomes and mediators lead to extensive and intensive measurement
• Early concern with measurement reactivity
  – Led to recommendation of complex designs to rule it out
  – No longer considered very seriously -- “If only behavior were so easily changed!”
• Long-term follow-up imperative
  – Few programs with documented effects into or through high school
• The longer the study, the more the attrition
  – Due to drop-outs, transfers, absenteeism, refusals
  – Include incoming students in the study

Unit of Analysis

• Has received the most persistent attention
• Early studies were analyzed at the student level
• Early recommendation was to analyze at the school level – the level of random assignment
• Much attention to intraclass correlation
  – Typically only in the .01-.05 range
  – With 4-10 schools per condition, analyses at the student and school level can produce same p values
• Development of multi-level analysis techniques
  – Bryk & Raudenbush, Goldstein, Hedeker & Gibbons
  – Longitudinal data seen as another level of nesting
  – Growth curve analyses becoming popular
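Why even a small intraclass correlation matters can be seen from the standard design-effect formula, DEFF = 1 + (m - 1) × ICC, where m is the number of students per school. The sketch below assumes equal-sized schools and uses a made-up figure of 100 students per school and 8 schools per condition; only the .01-.05 ICC range comes from the slide.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


# Hypothetical design: 8 schools per condition, 100 students per school.
for icc in (0.01, 0.05):
    deff = design_effect(100, icc)
    n_eff = effective_sample_size(8, 100, icc)
    print(f"ICC={icc:.2f}  design effect={deff:.2f}  effective n={n_eff:.0f} of 800")
```

Under these assumptions an ICC of .05 reduces 800 students to roughly the information of 134 independent students, which is the motivation for the multi-level techniques listed above.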

The Nature of the Target Population

• Universal, Selective and Indicated Interventions– Universal = complete population– Selective = those at higher risk– Indicated = those already evidencing early stages

• Implications of variation in risk levels of students in universal interventions
  – Suggests multi-level/nested interventions might be desirable
    • E.g., Fast Track
  – Suggests analyses by risk level

Hypothetical example of differential effects by risk level

[Figure: line chart. Y-axis: Level of behavior (0 to 6). X-axis: Time of measurement (T1 to T5). Series: Hi Risk Program, Hi Risk Control, Med Risk Program, Med Risk Control, Lo Risk Program, Lo Risk Control.]

Examples of Moderation and Mediation

• Example of major moderation from Aban Aya
  – Effects for males only
• Examples of mediation from Aban Aya
  – Following slides
• Example of another kind of process analysis
  – Later slide from Positive Action

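Aban Aya: Male violence was brought down to the level of female violence

[Figure: line chart with series Males C, Males Tx and All Females. Y-axis ticks run from 8 to 15 and x-axis ticks from 0 to 3.5; the axis labels are not recoverable from the extraction.]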

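Summary of Mediation Results for Males (Aban Aya)

[Figure: mediation diagrams; the paths and estimates are not recoverable from the extraction. Labels shown for substance use: SDC & SC, Estimate, Sub. Use, Attitudes, Frnd Bhv., Encourage. Labels shown for violence: SDC & SC, Estimate, Violence, Attitudes, Intentions, Frnd Bhv.]

SDC Condom Use: Mediation analyses not yet done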

Another kind of process analysis: EFFECTS OF THE POSITIVE ACTION PROGRAM

Multi-group analysis, School-level data for 55 PA schools and 29 control schools. First path parameter (Standardized) is for Controls, second is for PA schools. Average % or means shown for all variables. Percentage of variance explained (R2) shown for outcomes.

[Figure: path diagram. The path coefficients below cannot be matched to individual paths in this extraction.]

Predictors (mean):
• % African-American (25.2)
• % Mobility (43.7)
• % Free/reduced lunch (59.6)
• % White (51.7)

Outcomes (mean) and R2:
• VIOLENCE: Incidents per 100 students (3). R2 .35/.13
• Out of School SUSPENSIONS (1.7). R2 .73/.51
• % ABSENT >20 days (3.5). R2 .64/.61
• ACHIEVEMENT. R2 .92/.81
  – Florida Comprehensive Aptitude (Total) (330)
  – Grade 5 NRT (Total) (319)

Path coefficients as extracted: 1.00, .59/.36, .43/.24, .54/.44, .34/.25, .30/.41, .99, -.30/-.67, -.53/-.14, .26, .70, -.73, -.36, .63, -.76, .53/.59, .00/.32

Model fit: χ2 = 48.03 @ 40 df, p=.18, RMSEA .069
Constrained model fit: χ2 = 69.4 @ 51 df, p=.09, RMSEA=.09
χ2 diff = 21.37 @ 11 df, p=.03
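The comparison between the free and constrained Positive Action models is a chi-square difference test, and the reported figures can be checked by hand. The subscript labels below are introduced here for illustration; the numbers are the fit statistics reported on the slide.

```latex
\[
\Delta\chi^2 = \chi^2_{\text{constrained}} - \chi^2_{\text{free}}
             = 69.4 - 48.03 = 21.37,
\qquad
\Delta df = 51 - 40 = 11 .
\]
```

For 11 degrees of freedom the chi-square critical values are about 19.68 at the .05 level and 21.92 at the .025 level, so a difference of 21.37 gives a p-value between .025 and .05, which agrees with the reported p=.03. A significant difference means that constraining the paths to be equal across control and PA schools worsens fit, so at least some paths differ between the two groups.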
