TRANSCRIPT
Randomized controlled trial: a historical perspective
Luc Behaghel (PSE and Crest) and Philippe Zamora (Crest)
J-PAL Advanced course, Paris 2012
Randomized controlled trials: a long history in social sciences
experimental psychology (late 19th century)
education (early 20th century)
experimental sociology (E. Greenwood, F.S. Chapin – early 20th century)
▸ rural health education, social effects of public housing, recreation programs for “delinquent” boys...
⇒ The statistical framework used nowadays only came later (R.A. Fisher, Design of Experiments, 1935)
What about randomized clinical trials?
Large-scale randomized clinical trials are sometimes viewed as a model for social sciences
A norm since the 1962 Drug Amendments:
proof of “efficacy” required prior to marketing (“safety” required since 1938)
based on “adequate and well-controlled studies”
▸ rapidly interpreted as implying a control group and random assignment to control or treatment, i.e. an RCT
Yet the norm emerged only gradually (Marks, The Progress of Experiment, 1997)
and went (and still goes) through substantial debates
The Progress of Experiment
No influence of statistics on (clinical) medicine before 1950
⇒ how can we explain its influence today?
1. Successful alliance, in the 1950s, of statisticians and “therapeutic reformers” (academic physicians) anxious to discipline physicians (in their prescriptions) and pharmaceutical companies (in their claims)
Coincides with the diffusion of statistical concepts and methods in many areas (genetics, psychology, economics, physics)
2. Main arguments: objectivity and “common sense” against the investigator’s subjective bias: “The random method removes all responsibility from the observer.” (Bradford Hill, 1953)
3. Caveat: “an incomplete revolution, one in which most physicians were acquainted neither with the intellectual power that lay behind the procedures advocated by statisticians nor with the limitations of statistical methods.” (Marks, 1997, p. 138)
Dilemmas of authority
Latent resistance of practicing physicians and clinicians
▸ RCTs were not the first attempt by therapeutic reformers:
American Medical Association Council on Pharmacy and Chemistry (1906)
▸ a system of consultants to gather and assess evidence on new drugs
“Collective investigations” (late 1920s)
▸ organized collaboration of university clinics for the standardized evaluation of therapies on hundreds of patients (but no randomization)
⇒ The shift to RCTs can be viewed as an attempt to transfer authority “from institutions to methods”.
The University Group Diabetes Program Study
Large RCT in the late 1960s, strongly supported by the NIH, to settle a lasting debate on diabetes therapy and to develop the RCT methodology
Surprising results (the tested drug increases mortality)
▸ treatment discontinued, FDA informed
A 15-year-long controversy:
1. Statistical inference issues ▸ the study was validated by the Biometric Society
2. Relevance of the tested treatment (external validity) ▸ a standardized protocol goes against the practice of individual diagnosis and customized treatment
3. Who has authority? ▸ “When reasonable people disagree, where do the boundaries of unreasonable behavior begin?” (Marks, 1997, p. 233)
1 The early years
2 The golden age of evaluation
3 The credibility revolution
4 The creative years
5 The times of maturity
1. The early years: R.A. Fisher and the Econometric Society
Formal theory of RCTs due to R.A. Fisher (Statistical Methods for Research Workers, 1925).
Influences economists, in particular members of the Econometric Society (e.g. Hotelling, a former volunteer at the Rothamsted farm)
Yet Fisher was reluctant to apply statistics to the social sciences, due to their “non-experimental” nature
Early developments of econometrics steer away from experimentation:
Extreme view: theory cannot be proven false by statistical evidence (Keynes!)
A more constructive compromise (Haavelmo, 1944): statistics used to confirm theory in a probabilistic way
Heckman (2010): the fundamental insight of the first econometricians of the Cowles Commission is that no causal inference is possible without a theoretical model
▸ in this view, the influence of statisticians (e.g. Rubin) on econometrics in the 1990s amounts to forgetting this lesson
2. From the late 1960s to the early 1980s: a golden age of evaluation
From the mid-1960s, a sharp increase in randomized experiments
According to Boruch (1978), 245 randomized field experiments had been conducted in the U.S. to evaluate social policies up to 1978
Some of them were ambitious and very costly
They covered different kinds of policies (subsidized work, income maintenance, job search counseling)
2. From the late 1960s to the early 1980s: a golden age of evaluation
One of the first RCTs is the famous “Perry Preschool Program” (1961), whose results still feed current papers: the follow-up survey has tracked the 123 participants (treated and controls) up to 2000.
This large effort was prompted by the rule devoting 1% of every social program budget to evaluation
Two famous examples: the Perry Preschool Program and the RAND Health Insurance Experiment
Pioneer study: the Perry Preschool Program
123 children born between 1958 and 1962 in Michigan
Half of them (drawn at random) entered the Perry preschool program at age 3 or 4.
Education by skilled professionals in nurseries and kindergarten
The program also included help to parents, to improve their involvement.
Program duration: circa 30 weeks
Follow-up surveys (at ages 14, 15, 19, 27 and 40)
Pioneer study: the Perry Preschool Program
| Outcome | N | Treatment group | Control group | p-value |
| Cognitive skills test score at age 15 | 95 | 122.2 | 94.5 | < .001 |
| Access to university | 121 | 38% | 21% | .029 |
| Jailed or arrested at least once | 121 | 31% | 51% | .022 |
| On welfare | 120 | 18% | 32% | .044 |
| Employed at age 19 | 121 | 50% | 32% | .032 |
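As a check on how such p-values can be computed, here is a minimal two-proportion z-test in Python for the “access to university” row. The 58/63 split of the N = 121 sample is an assumption for illustration (the study enrolled 58 program and 65 no-program children, with some attrition):

```python
import numpy as np
from scipy import stats

# Assumed group sizes for the N = 121 sample (illustrative split)
n_t, n_c = 58, 63
p_t, p_c = 0.38, 0.21                 # access to university, from the table

# Pooled two-proportion z-test
p_pool = (p_t * n_t + p_c * n_c) / (n_t + n_c)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (p_t - p_c) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-sided
print(f"z = {z:.2f}, p = {p_value:.3f}")   # same order as the reported .029
```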
Pioneer study 2: The Rand Health Insurance Experiment
5,809 people were randomly assigned in 1974 to insurance plans with 0%, 25%, 50% and 95% cost sharing.
They were followed until 1982.
Main result: paying a share of health costs makes people give up some “superfluous” care, with little harm to their health
But some heterogeneity: this result does not seem to hold for poor people.
▸ strong influence on the development of cost sharing
▸ debate on the external validity of the results (see later on)
RCTs in the US today
Still used in the US to evaluate large and ambitious programs, and routinely used in education and public health
Moving To Opportunity (1994)
Job Corps (training program for youth) and replication studies
But at a lower pace. Two reasons:
Impacts are often disappointingly small: in the welfare state, the marginal effects of new policies are weak
Evaluation takes time: too long for the political agenda
3. The credibility revolution: overview
Starting in the 1990s, applied (micro)econometrics undergoes a “credibility revolution” (Angrist and Pischke, Journal of Economic Perspectives, 2010)
Selection bias taken (even) more seriously
▸ influential within-study comparisons (LaLonde, 1986)
Standard selection-correction procedures (like the heckit) questioned for their lack of robustness
Search for:
more credible sources of variation ▸ “design-based” studies
fewer parametric assumptions ▸ a treatment effect model allowing for heterogeneous effects, flexible estimation, local interpretation of estimates
The treatment effect model
(a.k.a. the “Rubin causal model”)
Counterfactual outcomes y(1), y(0)
⇒ (individual) treatment effect y(1) − y(0)
Observe only one realization of the counterfactual outcomes:
y = (1 − T) y(0) + T y(1)
Parameters of interest: average causal effects (ATE: E(y(1) − y(0)); ATT: E(y(1) − y(0) | T = 1)), but also other features of the distribution of treatment effects (e.g. the fraction losing)
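A minimal simulation of the model, with made-up numbers: it shows the observation rule and the two average effects, and that under random assignment the naive difference in means recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Counterfactual outcomes y(0) and y(1); the individual effect is heterogeneous
y0 = rng.normal(0.0, 1.0, n)
effect = rng.normal(1.0, 0.5, n)       # y(1) - y(0) ~ N(1, 0.5^2), invented
y1 = y0 + effect

# Random assignment: T independent of (y(0), y(1))
T = rng.integers(0, 2, n)

# Only one counterfactual is ever observed
y = (1 - T) * y0 + T * y1

ate = effect.mean()                    # E[y(1) - y(0)], knowable only in simulation
att = effect[T == 1].mean()            # E[y(1) - y(0) | T = 1]
diff = y[T == 1].mean() - y[T == 0].mean()
frac_losing = (effect < 0).mean()      # another feature of the effect distribution
print(ate, att, diff, frac_losing)     # diff ≈ ATE ≈ ATT ≈ 1 under randomization
```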
Selection bias
E(y | T = 1) − E(y | T = 0)
  = E(y(1) | T = 1) − E(y(0) | T = 0)
  = E(y(1) − y(0) | T = 1) + [E(y(0) | T = 1) − E(y(0) | T = 0)]
E(y(0) | T = 1) − E(y(0) | T = 0): selection bias.
▸ Treated individuals do better not because of the treatment, but because they would have done better anyway without the treatment.
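A numeric check of the decomposition (a sketch with invented numbers; here treatment is taken up by units with high y(0), so the naive comparison overstates the effect):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                          # constant treatment effect of 1, for clarity

# Self-selection: units with high y(0) are more likely to take the treatment
T = (y0 + rng.normal(0.0, 1.0, n) > 0).astype(int)
y = (1 - T) * y0 + T * y1

naive = y[T == 1].mean() - y[T == 0].mean()
att = (y1 - y0)[T == 1].mean()                    # = 1 by construction
bias = y0[T == 1].mean() - y0[T == 0].mean()      # selection bias term
print(naive, att + bias)               # identical: the decomposition is an identity
```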
Approaches to selection bias
Model selection (structural approach)
▸ Roy model: self-selection into treatment is informative about potential outcomes
Assume conditional independence (CIA):
y(1), y(0) ⊥ T | X
▸ selection after controlling for X is “as good as random”
▸ matching, regression (a minimal matching sketch follows below)
possibly controlling for lagged outcomes
▸ diff-in-diff, diff-in-diff matching
“Design-based” approach
Randomized experiments
Quasi-experiments
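A minimal sketch of matching under the CIA, assuming (purely for illustration) selection on a single observed covariate X; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# One confounder X drives both selection and outcomes, so CIA holds given X
X = rng.normal(0.0, 1.0, n)
T = (X + rng.normal(0.0, 1.0, n) > 0).astype(int)
y0 = X + rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                                      # true effect: 1
y = (1 - T) * y0 + T * y1

# 1-nearest-neighbor matching on X: for each treated unit, take the
# closest control and average the treated-minus-matched differences
Xc, yc = X[T == 0], y[T == 0]
order = np.argsort(Xc)
Xc, yc = Xc[order], yc[order]
idx = np.clip(np.searchsorted(Xc, X[T == 1]), 1, len(Xc) - 1)
nearer_left = X[T == 1] - Xc[idx - 1] < Xc[idx] - X[T == 1]
match = np.where(nearer_left, idx - 1, idx)
att_hat = (y[T == 1] - yc[match]).mean()

naive = y[T == 1].mean() - y[T == 0].mean()
print(f"naive = {naive:.2f}, matching ATT = {att_hat:.2f}")   # matching ≈ 1
```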
Can non-experimental evaluations match experimental results? LaLonde vs. Dehejia-Wahba
LaLonde (1986) had the idea of applying ex-post evaluation methods to the data of a randomized experiment (a training program within the larger National Supported Work program) [“within-study comparison”]
He uses a two-step Heckman method with different sets of exclusion variables
The non-experimental results are very dependent on the choice of exclusion variables, and far from the experimental benchmark
[Table: LaLonde (1986), non-experimental vs. experimental estimates]
Matching LaLonde
Dehejia and Wahba (2002) use propensity score matching methods (a propensity-score sketch follows below)
varying the set of matching variables and the comparison samples
providing guidelines to assess the quality of matching variables
▸ Results closer to the benchmark
▸ At the time, matching methods appeared to be a big advance.
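A minimal sketch of the propensity-score version: estimate P(T = 1 | X) with a logistic regression, then apply the same nearest-neighbor logic as above on the estimated score. The data are invented; this is not the Dehejia-Wahba specification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000

# Two confounders; selection and outcomes depend on both
X = rng.normal(0.0, 1.0, (n, 2))
T = (X.sum(axis=1) + rng.normal(0.0, 1.0, n) > 0).astype(int)
y = X.sum(axis=1) + 1.0 * T + rng.normal(0.0, 1.0, n)   # true effect: 1

# Step 1: estimate the propensity score P(T = 1 | X)
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the closest score
ps_c, y_c = ps[T == 0], y[T == 0]
order = np.argsort(ps_c)
ps_c, y_c = ps_c[order], y_c[order]
idx = np.clip(np.searchsorted(ps_c, ps[T == 1]), 1, len(ps_c) - 1)
nearer_left = ps[T == 1] - ps_c[idx - 1] < ps_c[idx] - ps[T == 1]
match = np.where(nearer_left, idx - 1, idx)
att_hat = (y[T == 1] - y_c[match]).mean()
print(f"propensity-score matching ATT = {att_hat:.2f}")  # ≈ 1
```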
[Table: Dehejia and Wahba (2002) estimates]
Matching epilogue
The debate is now less lively
Smith and Todd (2005) identify three criteria that are not sufficient for matching estimators:
same data sources for control and treated groups
same local areas for treated and control groups
a rich set of matching variables
In fact these are not general results, but matching has lost its past glory...
... and most researchers prefer quasi-experiments.
Within-study comparisons: Arceneaux et al.
“Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment”
▸ shows the difficulty of correcting selection biases with ex-post methods that account only for observable selection variables
The program: a randomized phone call reminding voters of the importance of voting (2002 midterm elections)
Outcome: electoral turnout (data from electoral registers)
Result: no effect
[Figures: Arceneaux et al., experimental vs. non-experimental estimates]
Trying to replicate experimental results
First non-experimental method: OLS on the whole sample
▸ whatever covariates are used, the (biased) estimate is positive and significant
Second non-experimental method: matching
▸ still a positive (biased) estimate, though a smaller one
Large influence on the U.S. political science research community (which had considered matching methods the new Grail)
4. The creative years: RCTs in development economics
Banerjee and Duflo (2010): “The experimental approach to development economics”
Since the mid-1990s, a rapid surge in experiments in developing countries
Starting with simple trials of inputs in the education/health production function (e.g. textbooks or flipcharts in education?)
Moving to increasingly “smart” designs
The new wave: key characteristics
1. Test conventional wisdom
2. Micro approach, field involvement of researchers
3. Smaller experiments
⇒ in a less saturated policy environment, interventions have large impacts that can be detected in small samples
4. Importing insights from theory / the lab (e.g. List, Levitt, Karlan)
▸ Useful for all this: a variety of randomization approaches
How to randomize? Individual vs. collective
Depends on the type of intervention and the type of question
Some interventions only make sense at the collective level (e.g. class-level interventions)
Ethical or political issues (avoid treating peers differently)
May want to combine the two, typically when spillovers are the issue (e.g. deworming in Miguel and Kremer, 2004)
Randomizing entire groups may come closer to what would happen at scale-up (equilibrium / crowding-out effects)
But randomizing entire groups reduces statistical precision – see next lecture
Four standard designs (a sketch of the assignment mechanics follows below):
Lottery
Phase-in
Encouragement
Rotation
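A minimal sketch of how such assignments can be drawn, in Python with made-up sample sizes; it also illustrates the individual vs. collective distinction from the previous slide (rotation, not coded here, alternates groups in and out of treatment across periods):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_200

# Lottery: randomize individuals directly, exactly half treated
lottery = rng.permutation(np.repeat([0, 1], n // 2))

# Collective randomization: assign whole clusters (villages, classes)
n_clusters = 40
cluster_of = rng.integers(0, n_clusters, n)                  # unit -> cluster id
cluster_assign = rng.permutation(np.repeat([0, 1], n_clusters // 2))
collective = cluster_assign[cluster_of]

# Phase-in: everyone is eventually treated; randomize the entry wave
wave = rng.permutation(np.tile([1, 2, 3], n // 3))           # n divisible by 3 here
year1_treatment = (wave == 1).astype(int)                    # waves 2-3: year-1 controls

# Encouragement: randomize an invitation/incentive, not the treatment itself;
# take-up is then voluntary (analyzed with the Wald estimator, sketched later)
encouraged = rng.integers(0, 2, n)
```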
5. The times of maturity (and debates!)
Are randomized experiments the “gold standard”?
Hot debate in development economics (“randomistas” – Banerjee, Duflo, Kremer, ... – and their critics – Deaton, Ravallion, Rodrik)
Discussion among econometricians (Angrist and Imbens vs. Heckman)
French debate too
Internal validity (1): Hawthorne and John Henry effects
Occur when an experimental group (control or treatment) reacts to being part of an experiment (and to being monitored)
A real concern (and one mostly specific to experiments)
The risk varies across treatments and contexts (e.g. probably higher when randomizing individuals)
Possible solutions
“Blind” experiments
Two-stage randomization (proposing vs. not proposing the experiment; assigning to treatment or not): creates an additional control group for a “placebo test” (see the sketch below)
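A minimal sketch of the two-stage design, with invented group sizes; the group labels are hypothetical names chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3_000

# Stage 1: randomize who is enrolled in (and monitored by) the experiment
n_out = n // 3
in_experiment = rng.permutation(np.repeat([0, 1], [n_out, n - n_out]))

# Stage 2: among enrolled units, randomize treatment vs. monitored control
assign = np.where(in_experiment == 0, "pure_control", "monitored_control").astype(object)
enrolled = np.flatnonzero(in_experiment)
treated = rng.choice(enrolled, size=len(enrolled) // 2, replace=False)
assign[treated] = "treatment"

# Placebo test: any outcome gap between monitored_control and pure_control
# reflects Hawthorne / John Henry effects rather than the treatment itself
```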
Internal validity (2): Spillover effects
Occur when the control group is affected by the existence of a treatment group nearby
⇒ the difference in outcomes is no longer the impact on the treatment group
A. Finkelstein’s famous recent paper argues that the effects computed from the RAND HIE were hugely biased by such effects (HIE estimate: +37% vs. Finkelstein’s estimate of +400%, using an ex-post evaluation of the introduction of Medicare)
The introduction of Medicare had impacts on both control and treated groups (induced technological progress and diffusion of behavior changes beyond the treatment group)
Possible solutions:
Ensure control and treatment units are sufficiently far apart
Use variations in distance between control and treatment units to identify (and net out) the spillover effects (Miguel and Kremer, Econometrica, 2004); a sketch follows below
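A minimal sketch of the distance-based idea, with invented geography and effect sizes: exposure to nearby treated units is random given locations, so regressing the outcome on own treatment and exposure separates the direct effect from the spillover:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000

pos = rng.uniform(0, 10, (n, 2))                      # unit locations on a 10x10 map
T = rng.integers(0, 2, n)                             # random assignment

# Exposure: share of treated units among neighbors within distance 1
d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
near = (d2 < 1.0) & (d2 > 0)                          # exclude self
exposure = (near * T).sum(axis=1) / np.maximum(near.sum(axis=1), 1)

# Outcome: own effect 1.0 plus a spillover of 0.5 x exposure (made-up numbers)
y = 1.0 * T + 0.5 * exposure + rng.normal(0, 1, n)

naive = y[T == 1].mean() - y[T == 0].mean()           # controls are exposed too
X = np.column_stack([np.ones(n), T, exposure])        # net out the spillover
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"naive = {naive:.2f}; own effect = {beta[1]:.2f}, spillover = {beta[2]:.2f}")
```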
External validity (1): environmental dependence
1. Heterogeneous effects: the same program may have different effects across contexts and target populations
2. “Implementer effect”: the “same” program may have different effects (and different contents) depending on the implementer
Responses to environmental dependence
1. Heterogeneous effect:
Individual studies:
1. check for heterogeneous effects by sub-group
2. replicate
Meta-analysis: cumulate knowledge from similar experiments (e.g. Kremer and Holla (2004): high price elasticity of demand for health and education, especially around a price of zero); a pooling sketch follows below
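A minimal fixed-effect meta-analysis sketch (inverse-variance weighting); the site-level estimates and standard errors are invented for illustration:

```python
import numpy as np
from scipy import stats

# Invented estimates and standard errors from K replications of a program
est = np.array([0.30, 0.12, 0.45, 0.20, -0.05])
se = np.array([0.10, 0.08, 0.20, 0.15, 0.12])

# Fixed-effect meta-analysis: inverse-variance weighted average
w = 1 / se**2
pooled = (w * est).sum() / w.sum()
pooled_se = np.sqrt(1 / w.sum())

# Cochran's Q: a simple test of effect homogeneity across sites
Q = (w * (est - pooled) ** 2).sum()
p_homog = stats.chi2.sf(Q, df=len(est) - 1)
print(f"pooled = {pooled:.3f} (se {pooled_se:.3f}), homogeneity p = {p_homog:.2f}")
```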
Responses to environmental dependence
2. Implementer effect:
Individual studies: need to emphasize the place of the program in the overall action plan of the implementing organization, and the place of the organization itself
Allocate the evaluation effort systematically: role of the funding agencies
⇒ a strength of the experimental approach: it can – in principle – be systematically applied across a variety of environments
External validity (2): compliance issues
Not all members of the treatment group end up benefiting from the program
⇒ only the impact on beneficiaries is identified; it may differ from the impact on the whole population (see the Wald-estimator sketch below)
1. May reflect the actual policy
2. Related to analyzing the heterogeneity of program impacts – but with sub-populations defined in terms of willingness to enter the program: requires revealing that willingness
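A minimal sketch of the standard rescaling: the intention-to-treat (ITT) effect divided by the difference in take-up rates gives the Wald / IV estimator of the effect on compliers (the LATE). Take-up rates and the true effect are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

Z = rng.integers(0, 2, n)                 # randomized offer (assignment)
u = rng.random(n)
# Take-up: 60% of the offered group participates; 5% of controls get treated anyway
D = np.where(Z == 1, u < 0.60, u < 0.05).astype(int)

y = 1.0 * D + rng.normal(0, 1, n)         # true effect of the program: 1 (invented)

itt = y[Z == 1].mean() - y[Z == 0].mean()             # diluted by non-compliance
takeup = D[Z == 1].mean() - D[Z == 0].mean()          # compliance differential
late = itt / takeup                                   # Wald / IV estimator ≈ 1
print(f"ITT = {itt:.2f}, take-up diff = {takeup:.2f}, LATE = {late:.2f}")
```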
External validity (3): equilibrium effects
What if everybody were to benefit from the program?
Hard to experiment: no control group! But:
1. Experimental designs: vary the size of the treatment group across local markets (see the saturation sketch below)
2. Use partial-equilibrium “assumption-free” estimates as a building block in a broader, “assumption-dependent” model
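A minimal sketch of such a randomized-saturation design, with made-up market counts and saturation levels:

```python
import numpy as np

rng = np.random.default_rng(8)
n_markets, per_market = 60, 200

# Each local market draws a treatment saturation; individuals are then
# randomized within the market at that rate
saturation = rng.choice([0.0, 0.25, 0.50, 0.75], size=n_markets)
treat = np.concatenate([
    rng.permutation((np.arange(per_market) < s * per_market).astype(int))
    for s in saturation
])
market = np.repeat(np.arange(n_markets), per_market)
# Comparing treated (and untreated) outcomes across saturation levels traces
# out how effects change as the program approaches full coverage
```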
To sum up:
External validity issues need to be addressed for increased policy relevance
This suggests combining experiments with one another, and combining experiments with theory
⇒ can be embedded in a broader process: “creative experimentation”
A few references
Introduction to RCTs
Duflo, Glennerster and Kremer (2008), “Using Randomization in Development Economics Research: A Toolkit”, in Handbook of Development Economics, vol. 4, ch. 61.
The debate: where it stands
Journal of Economic Perspectives, 24 (2010): “The Credibility Revolution in Empirical Economics” (by Angrist and Pischke) and comments by macroeconomists.
Heckman (2010), “Building Bridges Between Structural and Program Evaluation Approaches to Evaluating Policy”, Journal of Economic Literature, 48(2).
Banerjee and Duflo (2010), “The Experimental Approach to Development Economics”, Annual Review of Economics, 1, 151-178.
A few references (2)
Benchmarking different approaches
LaLonde (1986), “Evaluating the Econometric Evaluations of Training Programs with Experimental Data”, American Economic Review, 76(4).
Heckman, Ichimura, Smith and Todd (1998), “Characterizing Selection Bias Using Experimental Data”, Econometrica, 66(5), 1017-1098.
Dehejia and Wahba (2002), “Propensity Score Matching Methods for Nonexperimental Causal Studies”, Review of Economics and Statistics, 84(1).