inferring welfare maximizing treatment assignment under...

29
Journal of Econometrics 167 (2012) 168–196 Contents lists available at SciVerse ScienceDirect Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom Inferring welfare maximizing treatment assignment under budget constraints Debopam Bhattacharya a,, Pascaline Dupas b a University of Oxford, United Kingdom b Stanford University, United States article info Article history: Received 11 November 2010 Received in revised form 19 June 2011 Accepted 3 November 2011 Available online 25 November 2011 JEL classification: C14 C21 abstract This paper concerns the problem of allocating a binary treatment among a target population based on observed covariates. The goal is to (i) maximize the mean social welfare arising from an eventual outcome distribution, when a budget constraint limits what fraction of the population can be treated and (ii) to infer the dual value, i.e. the minimum resources needed to attain a specific level of mean welfare via efficient treatment assignment. We consider a treatment allocation procedure based on sample data from randomized treatment assignment and derive asymptotic frequentist confidence interval for the welfare generated from it. We propose choosing the conditioning covariates through cross-validation. The methodology is applied to the efficient provision of anti-malaria bed net subsidies, using data from a randomized experiment conducted in Western Kenya. We find that subsidy allocation based on wealth, presence of children and possession of bank account can lead to a rise in subsidy use by about 9% points compared to allocation based on wealth only, and by 17% points compared to a purely random allocation. © 2011 Elsevier B.V. All rights reserved. 1. Introduction Governments of developing countries often subsidize access to key health and educational resources for the most vulnerable sections of their populations. However, such subsidizing efforts are typically constrained by binding budget ceilings. When budgets are such that only a small fraction of a target population can receive a given subsidy, the eligibility rule used to decide who will receive the subsidy can have an important effect on the overall benefit arising from the subsidy program. In this paper, we consider the problem of allocating a fixed amount of treatment resources to a target population with the aim of maximizing the mean population outcome, and the dual problem of estimating the minimum cost of achieving a given mean outcome in the population by efficient targeting of a treatment. We set-up an econometric framework for studying this problem and apply it to design welfare-maximizing allocation of subsidies for an effective malaria control tool – insecticide-treated bed nets (ITNs) – among households in a malaria-endemic region of Kenya. Our paper contributes to a recent but steadily expanding literature in statistics and economics on how experimental evidence on treatment effect heterogeneity may be used to maximize gains from social programs. This problem has been analyzed in the literature as a statistical decision problem with Correspondence to: Department of Economics, University of Oxford, Manor Road Building, OX1 3UQ, United Kingdom. E-mail address: [email protected] (D. Bhattacharya). finite samples under ambiguity but with no budget constraint (c.f., Dehejia (2005), Manski (2004), Manski (2005), Stoye (2009), Hirano and Porter (2009) and Tetenov (2012)). Our substantive contribution is to study such problems in the presence of aggregate budget constraints — an extremely common situation in real life but largely ignored in the treatment choice literature. 1 The constraint makes the treatment assignment problem substantively realistic, especially for developing countries, and raises a set of unique substantive and technical issues that are absent in the unconstrained case. On the empirical side, treatment targeting has been investi- gated in job-training contexts, c.f., Berger et al. (2001), Frolich (2008), Behnke et al. (2009) and Lechner and Smith (2007). See also the papers in Eberts et al. (2002), the special issue of the Economic Journal (Nov, 2006) on profiling and Bhattacharya (2009) on opti- mal peer assignment. A few recent studies have used experimen- tal data to estimate the parameters of dynamic structural models of behavior and utilized the estimates to simulate the effects of coun- terfactual policy interventions (c.f. Attanasio et al. (in press) and Duflo et al. (forthcoming)). Mahajan et al. (2009) discuss identifi- cation and estimation of a static structural model of ITN adoption using observational data alone and use the estimated parameters to perform counterfactual policy analysis. Todd and Wolpin (2006) 1 Manski (2005) studies planning problems which satisfy ‘separability’ and specifically mentions (pp. 10–11) budget constraints as a situation where separability is violated and, consequently, not studied by him. 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.11.007

Upload: others

Post on 25-Apr-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

Journal of Econometrics 167 (2012) 168–196

Contents lists available at SciVerse ScienceDirect

Journal of Econometrics

journal homepage: www.elsevier.com/locate/jeconom

Inferring welfare maximizing treatment assignment under budget constraintsDebopam Bhattacharya a,∗, Pascaline Dupas b

a University of Oxford, United Kingdomb Stanford University, United States

a r t i c l e i n f o

Article history:Received 11 November 2010Received in revised form19 June 2011Accepted 3 November 2011Available online 25 November 2011

JEL classification:C14C21

a b s t r a c t

This paper concerns the problem of allocating a binary treatment among a target population based onobserved covariates. The goal is to (i) maximize themean social welfare arising from an eventual outcomedistribution, when a budget constraint limits what fraction of the population can be treated and (ii) toinfer the dual value, i.e. the minimum resources needed to attain a specific level of mean welfare viaefficient treatment assignment. We consider a treatment allocation procedure based on sample datafrom randomized treatment assignment and derive asymptotic frequentist confidence interval for thewelfare generated from it. We propose choosing the conditioning covariates through cross-validation.The methodology is applied to the efficient provision of anti-malaria bed net subsidies, using data from arandomized experiment conducted in Western Kenya. We find that subsidy allocation based on wealth,presence of children and possession of bank account can lead to a rise in subsidy use by about 9% pointscompared to allocation based onwealth only, and by 17% points compared to a purely random allocation.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Governments of developing countries often subsidize accessto key health and educational resources for the most vulnerablesections of their populations. However, such subsidizing efforts aretypically constrained by binding budget ceilings. When budgetsare such that only a small fraction of a target population canreceive a given subsidy, the eligibility rule used to decide whowill receive the subsidy can have an important effect on theoverall benefit arising from the subsidy program. In this paper, weconsider the problem of allocating a fixed amount of treatmentresources to a target population with the aim of maximizing themean population outcome, and the dual problem of estimatingthe minimum cost of achieving a given mean outcome in thepopulation by efficient targeting of a treatment. We set-up aneconometric framework for studying this problem and apply it todesign welfare-maximizing allocation of subsidies for an effectivemalaria control tool – insecticide-treated bed nets (ITNs) – amonghouseholds in a malaria-endemic region of Kenya.

Our paper contributes to a recent but steadily expandingliterature in statistics and economics on how experimentalevidence on treatment effect heterogeneity may be used tomaximize gains from social programs. This problem has beenanalyzed in the literature as a statistical decision problem with

∗ Correspondence to: Department of Economics, University of Oxford, ManorRoad Building, OX1 3UQ, United Kingdom.

E-mail address: [email protected] (D. Bhattacharya).

0304-4076/$ – see front matter© 2011 Elsevier B.V. All rights reserved.doi:10.1016/j.jeconom.2011.11.007

finite samples under ambiguity but with no budget constraint(c.f., Dehejia (2005), Manski (2004), Manski (2005), Stoye (2009),Hirano and Porter (2009) and Tetenov (2012)). Our substantivecontribution is to study such problems in the presence of aggregatebudget constraints — an extremely common situation in reallife but largely ignored in the treatment choice literature.1 Theconstraintmakes the treatment assignment problem substantivelyrealistic, especially for developing countries, and raises a set ofunique substantive and technical issues that are absent in theunconstrained case.

On the empirical side, treatment targeting has been investi-gated in job-training contexts, c.f., Berger et al. (2001), Frolich(2008), Behnke et al. (2009) and Lechner and Smith (2007). See alsothe papers in Eberts et al. (2002), the special issue of the EconomicJournal (Nov, 2006) on profiling and Bhattacharya (2009) on opti-mal peer assignment. A few recent studies have used experimen-tal data to estimate the parameters of dynamic structuralmodels ofbehavior and utilized the estimates to simulate the effects of coun-terfactual policy interventions (c.f. Attanasio et al. (in press) andDuflo et al. (forthcoming)). Mahajan et al. (2009) discuss identifi-cation and estimation of a static structural model of ITN adoptionusing observational data alone and use the estimated parametersto perform counterfactual policy analysis. Todd andWolpin (2006)

1 Manski (2005) studies planning problems which satisfy ‘separability’ andspecifically mentions (pp. 10–11) budget constraints as a situation whereseparability is violated and, consequently, not studied by him.

Page 2: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 169

discuss the estimation of structural models of behavior using pre-program data and compare predictions of their estimated modelwith subsequent experimental data.

The rest of the paper is organized as follows. Section 2 describesthe treatment assignment problem under budget constraints. Sec-tion 3 describes our sample-based approach. Section 4 introducesthe estimators; Section 5 develops the relevant distribution theoryand Section 6 discusses the covariate choice problemwhen assign-ment is based on parametric models of treatment effects. Section 7presents the application to the welfare-maximizing allocation ofbed nets in Kenya and presents a Monte Carlo exercise to informthe covariate choice problem. Section 8 concludes with directionsfor future research. The Appendix contains technical proofs.

In the text, the symbol := will mean equal by definition, thesymbols φ and Φ will refer to the standard normal density andc.d.f., respectively and FW |V (w | v) will denote the c.d.f. of therandom variableW atw, conditional on V = v.

2. Set-up

The assignment problem: To fix ideas, let Y denote a household-level outcome and let D denote a binary treatment whose valuecan be affected directly by policy. Let X denote observed covariateswhich includes both discrete and continuous components. In thebed-net example, analyzed below in detail, the population ofinterest is rural households of Western Kenya. We have a simplerandom sample drawn from two districts in Western Kenya. Eachhousehold is an observation. Y is a binary outcome which equals1 if the household owns and uses a bed net. X is the presence ofa child under 10, the wealth per capita and ownership of a bankaccount. The treatment, D = 1, is offering a highly subsidizedbed net to the household. Y1 and Y0 are the value of the potentialoutcome Y with and without the treatment, respectively: Y =

DY1 + (1 − D)Y0. Let βd(x) denote the expected outcome forhouseholds with X = x when their treatment status D is setexternally at d ∈ {0, 1}, i.e., βd(x) = E(Yd | X = x). If the observedD is independent of (Y1, Y0) conditional on X , as in a randomizedtrial (the case studied here), then a nonparametric regression ofY on X for households with D = d in the sample can be used torecover this function.

First, consider an idealized version of a social planner’s problemwhere βd(x) and the marginal distribution of X with support Xare known to the planner. The planner faces a budget constraintthat the fraction of households who can be treated is at most c.We define the planner’s idealized problem as the choice of a mapp : X → [0, 1] where p(x) equals the probability that a householdwith X = x is assigned to D = 1 with probability p(x). We willassume that the planner wants to maximize mean outcome.2 Thenthe planner’s problem is

maxp(·)

x∈X

[β1(x)p(x)+ β0(x) {1 − p(x)}] dFX (x)

subject to

c =

x∈X

p(x)dFX (x). (1)

Define β(x) to be β1(x) − β0(x), i.e., the conditional averagetreatment effect (CATE). We allow for negative program impacts(i.e., Pr[β(X) < 0] ≥ 0), but to keep the present problemsubstantively interesting and analytically tractable, we willmaintain the following condition throughout the paper:

2 More generally, if the planner is interested in maximizing (a possibly covariateweighted) outcomeutility, thenβd(x) represents the expected value of theplanner’sutility defined on outcomes for individuals with X = x and D = d.

Assumption maintained (AM): (i) For some δ ≥ 0, we havePr(β(X) > δ) > c (the constraint binds); (ii) β(X) has boundedsupport with Lebesgue density bounded away from zero.

Condition AM(i) lets us ignore some technical complicationsdue to (additional) nondifferentiability which would arise ifPr(β(X) > 0) is close to c , relative to the sample size (c.f., Andrews(1999)). It also makes the problem more realistic in a developingcountry setting by making the budget constraint bind. conditionAM(ii) is not strictly necessary for us but it simplifies the expositionand proofs.3

It is straightforward to show that under condition AM(i) andAM(ii), the solution to the problem subject to (1) is of the form p(x) =

1 {β(x) ≥ γ } where γ satisfies: c =x∈X

1 (β(x) ≥ γ ) dFX (x).That is, γ is the (1 − c)th quantile in the marginal distributionof the random variable β(X) which is unique under assumptionAM(ii). Define the resulting ‘‘ideal’’ welfare as

ρ =

x∈X

[β1(x)1 (β(x) ≥ γ )+ β0 (x) 1 (β(x) < γ )] dFX (x), (2)

which would be attainable if β1(·), β0(·) and the distribution ofX were known exactly to the planner. Note that γ also equalsρ ′ (c), the shadow cost of the budget constraint, i.e., the expectedtreatment effect on the marginal household made eligible fortreatment under our budget-constrained rationing rule.Minimizing expenditure: The dual formulation of the problem iswhere the planner’s objective is to achieve an expected outcomeequal to b by allocating treatment based on covariates. Theparameter of interest is the minimum amount of funds necessaryto achieve b. This problem can be represented as

minp(·)

x∈X

p (x) dFX (x) (3)

subject tox∈X

[β1(x)p (x)+ β0(x) {1 − p(x)}] dFX (x) = b. (4)

It is not hard to show that the optimal p(·)will again be of the formp(x) = 1 {β (x) ≥ γ (b)} where γ (b) is such that p(·) satisfies (4).Note that by duality, the minimum value of (3) is simply ρ−1 (b)where ρ (·) is defined in (2) and ρ−1 (b) = inf {c : ρ (c) ≥ b}.Notice that ρ(·) is monotone increasing and so ρ−1(·) is well-defined.Some extensions: While we focus the current paper on the meanutility of outcome as the objective, the idea applies in principle toany functional of the overall outcome distribution. Let F1 (· | x) andF0 (· | x) denote the marginal C.D.F. of the outcome, conditional onX , under treatment andno treatment respectively. Thesemarginalscan be identified using experimental data from randomizedtreatment allocation. If we fix the treatment probability as p(·),then the C.D.F. of the overall outcome corresponding to the choicep(·) is

G (·; p) =

x∈X

[F1 (· | x)× p(x)+ F0 (· | x)

× {1 − p(x)}] dFX (x).

If the planner wishes to maximize a functional F (·) of the C.D.F.G (·, p), then the optimization problem reduces to

maxp(·)

F (G (·; p)) s.t. c =

x∈X

p(x)dFX (x).

3 In the Appendix part 3, we (a) show how a simple model of householdconsumption would imply AM(ii) and (b) briefly discuss how the analysis wouldproceed if we were to drop AM.

Page 3: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

170 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

For the mean utility case studied in this paper, F (G (·, p)) ≡XU (w) dG (w, p), with U (w) ≡ w denoting the mean outcome

case. Similarly, F (G (·; p)) ≡ G−1 (0.5; p) denotes the medianmaximization case. The form of the solution and the relateddistribution theory will of course change depending on the choiceof F (·).

A simple extension can also be made to the case where thetreatment cost varies by x. Let h(x) denote the per capita additionalcost of treating type x. This is estimable from the experimental datawhen cost data are available. The original problem is nowmodifiedto

maxp(·)

x∈X

[β1(x)p(x)+ β0(x) {1 − p(x)}] dFX (x),

s.t. c =

x∈X

h(x)p (x) dFX (x).

The solution is of the cost-benefit form: treat if and only if β (x)−h(x) ≥ γ , withx∈X

h(x)1 (β(x)− h(x) ≥ γ ) dFX (x) = c.

3. Sample-based rule

The rule 1 {β(x) ≥ γ } is infeasible because β(·) (and usuallyalso F (·)) are unknown. But we can approximate it with sampledata from an experiment where the treatment was randomlyassigned. In particular, we consider Manski (2004)’s ConditionalEmpirical Success (CES, henceforth)-type rule which is the sampleanalog of the rule in the previous display, viz.,

p(x) = 1x : β(x) ≥ γ

with c =

1n

ni=1

1β (xi) ≥ γ

.

Here β(x) an γ are consistent estimates of β(x) and γ .4 One canthen define the expected welfare of the feasible rule as

ρn :=

X

β1(x) Pr

β(x) ≥ γ

+ β0 (x) Pr

β(x) < γ

dFX (x), (5)

where FX (·) represents the population distribution of covariate X ,based on which the treatment will be allocated to new applicants.Under regularity conditions (e.g., uniform convergence of β(·) toβ(·), detailed in the Appendix) ρn will converge to ρ defined in (2)as n → ∞.

The condition c =1n

ni=1 1

β (xi) ≥ γ

is one that holds in

the sample and will hold in the population only asymptotically.In order to have a γ that will make the budget constraint holdexactly in the population, the planner needs to know the marginaldistribution of X , whence γ may be obtained by solving

c =

X

1β(x) ≥ γ

dFX (x).

In our empirical application, we do not know the marginaldistribution of X and so we stick to the sample budget constraintformulation with the understanding that the resulting choice of γwill satisfy the budget constraint approximately in the population.

The technical results below concern the estimation of ρn, itsinverse and the large sample properties of these estimators. Thisdistribution theory, which is significantly more interesting than in

4 It is worth investigating whether the CES rule itself is optimal in a Hirano andPorter (2009) sense here but this question is outside the scope of the present paper.It would require extension of their methods to allow for unknown parameters, suchas γ , which cannot be estimated at the

√n-rate.

the unconstrained case where γ is known to be zero, is used to(i) construct asymptotically valid confidence intervals for expectedwelfareρn and its dual (introducedbelow) and (ii) address the issueof covariate choice in treatment allocation. Interestingly, it turnsout here that the estimator ρn of ρn (and thus of ρ in large samples)has the parametric rate of convergence but the corresponding γis not

√n-consistently estimable in general. To see why, consider

the special case where X is a scalar continuous covariate with c.d.f.FX (·) and β(·) is strictly increasing but otherwise unknown. Then

1 − c = Pr (β(X) < γ ) = PrX < β−1 (γ )

= FX

β−1 (γ )

.

Inverting this, we get that

γ = βF−1X (1 − c)

= E

Y1 − Y0 | X = F−1

X (1 − c).

Thus γ is the nonparametric conditional expectation evaluatedfor the marginal treatment recipient and is therefore not

√n-

consistent even when the marginal of X is known. Although thedistribution theory for γ is not of primary interest for the restof the paper (it is used to get the distribution of ρ), our resultis somewhat interesting from a theoretical perspective. It showsthatwhenβ (·) is an infinite-dimensional unknownparameter andcannot be estimated in a

√n-consistent way, then a quantile of

the scalar random variable β(X) is generally not√n-consistently

estimable although the mean of β(X) is. To our knowledge, thisresult is new and appears to us to be somewhat interesting becausein most cases of interest, quantiles and means have the same ratesof convergence.

Note that we can define the relevant population quantities,viz., γ , β(·) and ρ as solutions to the (semiparametric) momentconditions

E (Y1 − Y0 − β(X) | X) = 0,E [1 (β(X) ≤ γ )− 1 + c] = 0,E [β(X)× 1 (β(X) ≥ γ )− ρ] = 0.

Of these, the second moment condition is differentiable in thescalar γ if β(X) has a density but, in general, not functionallydifferentiable in β(·), owing to the presence of the indicator. Thismakes it infeasible to directly apply the methods of e.g., Chenet al. (2003) which requires (Hadamard) differentiability of all thepopulation moment conditions with respect to both the finite andthe infinite dimensional parameters.

4. Estimation

We now formally define our estimators corresponding to theCES approach defined above. Suppose β(x) denotes an estimatorof β(x) and β1(x) an estimator of β1(x). In order to circumvent thenonsmoothness of the population moment for γ discussed above,we use extra smoothing to construct our estimators. Suppose thatβ(X) is bounded between [−M,M] on the support of X . Thenchoose a symmetric (about zero) kernel L(·)with bounded support,w.l.o.g. [−1, 1], the corresponding C.D.F. kernel L (t) =

t−1 L (a) da

for each t ∈ [−1, 1] and a sequence of bandwidths hn converging(slowly) to zero as n → ∞. The C.D.F. kernel simply convertsthe indicator function 1 {β(x) ≤ γ } to a function that changessmoothly from 0 to 1 as β(x) − γ changes sign from positive tonegative in finite samples but approaches the indicator as n → ∞.

Now define γ , and ρ by

1n

ni=1

L

γ − β (Xi)

hn

− {1 − c}

= 0,

ρ −1n

ni=1

β1 (Xi)− β (Xi) L

γ − β (Xi)

hn

= 0. (6)

Page 4: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 171

Notice that the solution to the first equation in the previous displayis unique. Indeed, the derivative of the LHS of the first equationw.r.t. γ is positive since L(·) is the integral of a (positive) kernel.Given that γ can be uniquely solved, the solution to the secondequation must necessarily be unique because ρ is defined as theaverage of something involving γ and other observed quantities.The smoothing applied in (6) is similar in spirit to Horowitz’s(1992) analysis of the smoothedmaximumscore (see alsoHall et al.(1999)). But in that problem, the finite-dimensional parameter ofinterest does not explicitly depend on any infinite-dimensionalunderlying parameter. In contrast, here the key parameters ofinterest, γ andρ, are based on the infinite-dimensional componentβ(·) through population moments that are not smooth in β(·).Thus the present estimators lie at the intersection of classical 2-step semiparametric estimators and smoothing-based estimatorsfor countering non-differentiability. This makes both the resultsand the proofs substantially different from both strands of theliterature.

We now consider two cases as follows.Parametric case: First consider the case where we have a finitelyparametrized regressionmodel for Y giving us β(x) = G (x, β) andβ1(x) = G1 (x, β), where the functional forms G and G1 are knownand β is a finite dimensional parameter. The efficient estimatorsof β can be constructed here either by a direct optimal minimumdistance approach or by a one-step GMM procedure. For generalnon-linear models with a distributional condition, the MLE will beefficient.Nonparametric case: Suppose X ≡

Xd, X c

where Xd contains

the discrete components of X and X c is a p-variate vector of thecontinuous components of X with support Xc and density f (·).Let K (·) be a density kernel and ωn a sequence of bandwidthsconverging to zero at an appropriate rate, to be specifiedlater, as n → ∞. Define µ1 (Xi), µ0 (Xi), π1 (Xi), π0 (Xi) asthe Nadaraya–Watson regression estimators of respectively DY ,(1 − D) Y , D and (1 − D) on X , evaluated at X = Xi, and calculatedwhile leaving out the ith observation, e.g.,

µ1 (Xi) =

1n−1

j=i

yjDj

ωpnK Xc

j −Xci

ωn

1Xdj = Xd

i

1

n−1

j=i

1ωpnK Xc

j −Xci

ωn

1Xdj = Xd

i

.Now β (Xi) can be defined in terms of the above quantities as

β (Xi) =µ1 (Xi)

π1 (Xi)−µ0 (Xi)

π0(Xi).

For a large number of discrete regressors, one can alsouse multi-dimensional kernels (typically product of severalunidimensional kernels) and smooth across the discrete valuesas well. As is well-known, this does not affect the asymptoticdistribution theory under the same bandwidth conditions (c.f.Birens (1994), Theorem 10.4.3, Li and Racine (2007), Chapter 4).If, however, in place of a continuous regressor, we have a discreteregressor taking infinitely many values (such as a Poisson randomvariable), one can simply use cell-averages to estimate conditionalmeans. Following the arguments of Delgado and Mora (1995), itfollows that smooth functionals of such nonparametric estimatorsare

√n-normal, with no bias term. In our application, all discrete

variables are binary and the key continuous variable (wealth)does appear to be continuously distributed. So we work with thetraditional Nadaraya–Watson type estimators.Fixed trimming: To avoid some technical issues related to boundarybias, we use the following ‘‘fixed trimming’’ modification, as inNewey (1994, Theorem 4.1). Let X denote the support of X and letX be a compact subset of the interior of X, such that the density of

X is bounded away from zero on X. Then to define our estimands,we will simply work within X. Notice that in this problem,which X ’s and what range of X ’s we want to use for conditioningtreatment assignment is up to us. So the fixed trimming conditioncan be interpreted as part of the problem description.5 However,we will suppress the trimming term 1 (Xi ∈ X) in the rest of ourexposition to prevent notational clutter.Finite sample bias: Note that our estimator ρ may be expressed as

1n

ni=1

β1 (Xi)−1n

ni=1

β (Xi)× L

γ − β (Xi)

hn

.

The second term is composed of the product of β (Xi) and the termLγ−β(Xi)

hn

and the latter is large when β (Xi) is small. Because of

the local variance in β(x), there is potentially a negative correlationin the estimation errors in β (Xi) and L

γ−β(Xi)

hn

which would

tend to introduce a positive bias in the estimate of ρ. In order tomitigate the bias, one option is to use a split-sample method asfirst suggested in Geisser (1975). A simulation exercise using thismethod is reported in the Appendix, part 2. Given that the split-sample results seem to be worse to us – probably because theeffective sample size is halved – we conduct the nonparametricapplication using full-sample methods. We will return to split-sample methods again when we discuss covariate selection butin that case, we restrict attention to parametric specifications, assuggested by a referee.

5. Large sample theory

In this section, we consider the asymptotic distribution theoryfor our estimators. Section 5.1 outlines the simpler results for theparametric case where β(x) is known up to a finite-dimensionalparameter. Section 5.2 focuses on the nonparametric case. For thelatter, Theorem 1 concerns asymptotic properties of γ , Theorem 2concerns ρ. The key result is that ρ but not γ has parametric rateof convergence under regularity conditions.

5.1. Parametric case

To get an idea for the distribution theory, consider the set-up where β(x) is parametrically specified as G (x, β), with G(·)known; typically β (the so-called ‘‘pseudo-true value’’) is a slopecoefficient in a linear or non-linear regression and can be estimatedat parametric rates using, say, GMM or the MLE. For some specificfunctional forms of G (·, ·), e.g., a linear one, the function h (β) = M−M 1 {G (x, β) ≤ γ } dFX (x) may be differentiable in β and then

no smoothing would be necessary; but smoothing-based methods(i.e., using L) aremore generally applicable and sowe focus on that.The key result is that both γ and ρ can be estimated at the

√n-rate.

We provide an outline of proof in the Appendix at the end of part 1.

5.2. Nonparametric case

The nonparametric case is far more involved and occupies thebulk of our technical proofs in the Appendix. It has the obviousadvantage of being robust to distributional misspecifications andis therefore our preferred method. We now present two resultsregarding the large sample behavior of our estimators of γ andρ. The exact conditions under which these hold are listed in theAppendix together with the proofs of the results.

The next theorem deals with the distribution of γ which isultimately needed for obtaining confidence intervals for ρn. Thekey assumptions for the following Theorem 1 are that for some

5 Imbens and Ridder (2009) have recently considered some alternatives to fixedtrimming using a ‘‘nearest interior point’’ modification to NW estimates.

Page 5: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

172 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

r ≥ 2, density of β(·) has at least r − 1 uniformly boundedderivatives, the (r − 1)th one denoted by f (r−1)(·), the bandwidthsequence satisfies nh2r+1

n → λ < ∞, L(·) is an rth order kernelsatisfying

1−1 u

rL (u) du < ∞, and σ 2j (x) := var

Yj | X = x

are

finite for j = 0, 1. The asymptotic bias of γ will be proportionalto

√λ which can be made to go to zero by choosing hn to be of

smaller order than n−1

2r+1 but at the cost of increasing the variance(see below for more on this).

Theorem 1. Under conditions A0–A3, A4(i), conditions B3(i), B4(i)and B5, listed in the Appendix, we have that

γ − γ = op (1) .

Further, under conditions A0–A4, and B3–B11, detailed in the Ap-pendix, we have thatnhn

γ − γ

d−→ N

κ,τ 2 (γ )+ ξ 2 (γ )

fβ (γ )

1

−1L2 (u) du

,

where

τ 2 (γ ) = E

σ 20 (X)

Pr (D = 0 | X)

β(X) = γ

ξ 2 (γ ) = E

σ 21 (X)

Pr (D = 1 | X)

β(X) = γ

κ = (−1)r

√λ

r!× f (r−1)

β (γ )

1

−1urL (u) du.

Proof. Appendix. �

Finally, we state the result for ρ, under somewhat strongerconditions than the ones above. The r defined above now needsto be at least 4 and the kernel L is then of a ‘‘higher-order’’ variety.Defineψ1i = γ

Fβ (γ )− 1

βXj

≤ γ,

ψ2i = β (Xi)× 1 {β (Xi) ≤ γ } − ζ ,

ψ3i = 1 (β (Xi) ≤ γ )×Di

Pr (D = 1 | Xi){Y1i − E (Y1 | Xi)} ,

ψ4i = 1 (β (Xi) ≤ γ )×1 − Di

Pr (D = 0 | Xi){Y0i − E (Y0 | Xi)} ,

ψi = ψ1i + ψ2i + ψ3i − ψ4i.

Theorem 2. Under conditions A0–A4 and B3–B11, listed in the Ap-pendix, we have that

ρ − ρn = op (1) .

Further, under conditions A0–A5 and B3–B12, listed in the Appendix,

√nρ − ρn

=

1√n

ni=1

DiY1i − β1 (Xi)

Pr (D = 1 | Xi)

−1

√n

ni=1

ψi + op (1) . (7)

Proof. Appendix. �

The proof works by using

ρ =1n

ni=1

β1 (Xi)−1n

ni=1

β (Xi)× L

γ − β (Xi)

hn

.

The first term is then analyzed via Lemma 3 in the Appendix anda Taylor expansion coupled with U-statistics results are used toshow that the second term has an asymptotically linear form. Thefinal variance can be consistently estimated using sample cross-products, under standard conditions for theWLLN. It may be noted

here that the estimation error in β(·) affects the distribution of ρthrough the terms ψ3i and ψ4i.Bias removal: Our proofs of the above results use bias-removal. Iffβ(·) has bounded derivatives up to order (r − 1), then the MSE ofγ − γ

is given by

h2rn ×

f (r−1)β (γ )

r!

1

−1urL (u) du

2

C

+1

nhn

τ 2 (γ )+ ξ 2 (γ )

fβ (γ )

1

−1L2 (u) du

B

,

implying an MSE minimizing bandwidth choice of hn = λ∗n−1

2r+1 ,

where λ∗=

C

2rB2

12r+1

. This choice of hn does not work for getting

a√n-rate for ρ because (c.f. step 6A in the proof) it implies that

√nhr

n = On

12(2r+1)

which blows up to+∞. Soweneed to choose

hn to be smaller than the one that is MSE-optimal for γ .Bandwidth conditions: The conditions in the Appendix imposeseveral restrictions on bandwidths which are related to thedimension p of X , the extent of smoothness r of the density of β(X)(higher r implying faster convergence for γ ), smoothness q of thefunction β(·) (higher q implying faster convergence for β(·)) andthe number of momentsm of the dependent variables. They implysufficient moment and bandwidth conditions for a rate of uniformconvergence, obtained in Hansen (2008) and Newey (1994). It isuseful to have a specific choice of bandwidths which imply theconditions for (7) (which requires the strongest conditions). Wecan choose the bandwidth ωn for estimating β(·) as ωn = n−ω

and the bandwidth hn = n−h for some w, h > 0 which satisfythe following sufficient restrictions: r ≥ 4, m ≥ 4, q ≥

r+2r p,

16 > h > 1

2r ,512 < qω < 1 and pω < 1

4 . This choice will satisfy allthe conditions for (7) and, in particular satisfy sufficient conditionsfor the result that sup

β(·)− β(·)

= opn−1/4

. In particular, for

p = 1, one can assume q = 3, r = 4, m = 4 and set h =17 and

524 < ω < 1

4 , which is what we use in the application.Distribution theory for dual: Recall that the value function for thedual problem δn (b) represents the smallest fraction of householdswho have to be assigned to treatment (optimally) to guarantee thatthe expectedmean outcome is at least b. In other words, ρn [δn (b)]equals b, where δn (b) plays the role of c in the primal problem. Theδn (b) is estimated by the sample analog δ (b) = ρ−1 (b). Using astandard first-order expansion, and noting that ρ ′ (c) equals γ (c),one gets that

√nδ (b)− δn (b)

= −

√nρ (δ (b))− ρn {δ (b)}

γ {δ (b)}

+ op (1) ,

from which the asymptotic normality of√nδ (b)− δn (b)

follows by applying the above theorems.

6. Choosing covariates

The above analysis has considered treatment targeting by afixed set of covariates and experimental estimates from largesamples were used to determine CES type allocation rules basedon these covariates. The resulting expected welfare (ρn) is thenapproximated in a large-sample sense by an appropriate sampleanalog (ρ). However, one practical consideration in designingtreatment protocols is to decide which covariates to choose

Page 6: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 173

for targeting. A brief outline of the problem and a practicalsolution based on cross-validation is discussed below. A full-blowntheoretical analysis of this question is beyond the scope of thepresent paper. But in Appendix Appendix A.4, we provide anoutline of a potential theoretical approach, based to local-to-zerotype asymptotics, for analyzing the problem.

Basically, there are two considerations involved. One is thefinancial cost of gathering covariate information required forimplementing the desired allocations. The second is the issueof statistical precision of covariate-conditioned treatment effectsobtained in the finite experimental sample.

For the following discussion (as recommended by a referee),we concentrate on the case where β(·) is parametrically specifiedas x′β , β ∈ Rk and let γ := γ (β, FX (·)) denote the treatmentthreshold.Finite sample accuracy: Suppose we want to choose one among aset of covariate combinations, indexed by {1, . . . , J} with β and βjdenoting the parameter and itsMLE, respectively, corresponding tocovariate set j. Analogously, denote the treatment threshold and itsestimator by γj and γj. The case j = 0 will denote the full covariateset. Let δ denote assigning an individual to treatment and considerthe loss function

L (δ; (β, F)) =

x′β − γ (β, F)

×1x′β > γ (β, F)

− δ

dF(x). (8)

The above loss function penalizes when knowledge of the true(β, F) would recommend giving D = 1 (1) but δ leads torecommendation of 0 (1). When using all covariates, the CES ruleis of the form δ = 1

x′β > γ

β, F

, where

β, F

denote

estimates of (β, F). Now consider two rules, one where allocationis based on covariate set j (with respective parameters βj and γj)and onewhere it is based on the full feasible set of covariates. Usingan Edgeworth expansion, the risk difference between using the fullset versus the set j = 1, say, up to first order is given by

R ((β, F) , σ0(·))− R1 ((β, F) , σ1(·)) := ρ1n − ρn

=

x′β − γ (β, F)

×

Φ

x′β1 − γ (β1, F)σ1(x)/

√n

− Φ

x′β − γ (β, F)σ (x) /

√n

dF(x),

where σ1(x) and σ(x) denote respectively the asymptotic standarddeviation of x′β1 − γ1 and x′β − γ . This expression sheds lighton the role of estimation precision. Note that whenever x′β −

γ (β, F) and x′β1 − γβ1, F

have different signs, the integrand

will be negative irrespective of the values of σ1 (x) and σ(x). Whenthey have the same sign, however, the integrand will tend to bepositive whenever σ1 (x) is small relative to σ(x). This suggeststhat the narrowermodel will yield higher welfare in finite samplesif covariates not included in it have small impact on the treatmenteffect and/or there is a large amount of variation in these covariatesin the sample (leading to smaller σ1(x)).Financial costs: Using a larger number of conditioning covariatescan yield better (at least, no worse) targeting if our estimatescome from large samples. However, measuring more covariates isalso financially expensive. These measurement costs are incurredfor everyone in the population during the actual treatmentassignment and not just for individuals getting the treatment.Furthermore, unlike statistical precision costs, the measurementcosts do not disappear with increasing sample size. Suppose weare choosing whether to condition treatment on a larger set X0or a smaller subset X1 where gathering information about Xjentails a per capita expenditure of vj, j = 1, 2. Suppose these

two conditioning sets yield expected welfare functions ρ0n (·) andρ1n(·), respectively. Also suppose b is the average welfare we wantto attain. Then the expected per capita cost of using X1 is the sumofthe treatment cost plus the cost of measuring X1, i.e., ρ−1

1n (b)+v1.6

Similarly for X0. Therefore, we should choose X0 if ρ−11n (b) + v1 is

large relative to ρ−10n (b)+ v0.

6.1. Feasible rules

Note, however, that since the ρjn’s are unknown, theycannot be used directly in choosing j and feasible sample-basedrules are warranted. This problem is a somewhat complicatedversion of classic model selection problems, on which thereis a large literature in statistics and time-series econometrics.The complication arises here because the underlying constrainedoptimization problemmakes the welfare function depend on boththe conditional mean of Y1, Y0, and the marginal distributionfunction of the covariates in a complicated way.7

In the frequentist approaches to model selection in simplersettings, one aims to estimate the risk arising from the use ofestimates in place of true values, i.e., in the present context,estimate ρjn. See, for instance, Claeskens and Hjort (2008, CHhereafter) for a comprehensive textbook treatment. For reasonsof brevity, we concentrate on the method cross-validation forthe present paper but note that other frequentist approaches,e.g., resampling based methods and Bayesian approaches could bepotentially used here, as well.Cross-validation (CV): The CV idea is to split the sample into atraining part (containing n − n1 observations) and a validationpart (containing n1 observations). For a given covariate choice j,for each such split, the parameters β and γ are estimated usingthe training part — call these βjt and γjt . Then these estimates areused tomake allocations for the validation part and the welfare forthat particular split is calculated as:

ρjn :=1n1

n1i=1

YiDi

1n1

n1i=1

Di

1x′

iβjt ≥ γjt

+Yi (1 − Di)

1n1

n1i=1(1 − Di)

1x′

iβjt < γjt

. (9)

The overall welfare is calculated by averaging the per-split welfareacross all possible splits of the sample. A particularly popular splitin statistics is the leave-one-out version. The idea here is to predictwelfare from using estimates from samples of size n − 1 to designthe allocation for the remaining observation. Averaging across thesample observations then mimics the average welfare of the CESrule across samples drawn from the population.8 Amodification ofthe above is provided by V -fold CV (c.f., Geisser (1975)) — wherethe sample is split into V disjoint subsamples randomly; then eachof the V subsamples is used for validation and the result averagedacross them.

6 Notice that in this case, v and ρ−1 are bothmeasured inmonetary units and arethus directly comparable.7 Constructing Manski (2004)-style bounds on exact finite sample risk for

different covariate choice is difficult here owing to the complicated nature of theprobabilities which involve, among other things, finite sample Lorenz shares for thefinite sample conditional mean function.8 It is possible to show that ρjn is first-order equivalent to the naive ρjn minus

a penalty term that is related to the variance of βj (see Appendix part 4, last sub-section).

Page 7: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

174 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

In order to compare the accuracies of the cross-validated andthe naive sample-based estimates of welfare in finite samples, weperformed a small Monte Carlo exercise based on our sample data,which is reported in the next section. In the actual application,we complement our results based on the asymptotic approach ofSection 4 with a leave-one-out CV approach (which gave the bestresults in the Monte Carlo for small samples) to see how covariatechoice in the parametric specification may be dictated by samplesize.

One issue we do not address here, though, is that of postmodel-selection inference, which may be relevant in practice,especially when one model is chosen over many competing ones.In the Appendix A.4, we briefly outline an (local-to-zero type)asymptotic framework that seems appropriate for the covariatechoice problem for finite samples and describe how that wouldrelate to the post-model selection inference issue.

7. Application to bednet provision

7.1. Background

We now consider a substantive application — viz, the alloca-tion of subsidies for long-lasting insecticide-treated nets (ITNs) tohouseholds, using experimental evidence from Kenya. Our treat-ment of interest is making subsidized ITNs available to a section ofthis population.

ITNs have been shown to reduce overall child mortality by upto 38% in regions of Africa where malaria is the leading cause ofdeath among children under 5.9 In addition, ITN use can help avertsome of the substantial direct costs of treatment and the indirectcosts of malaria infection on lost income.10 Lucas (2007) estimatesthat, alone, the gains to education of a malaria-free environmentmore than compensate for the cost of an ITN. Costing $5–$7 anet, however, ITNs are not affordable to most families (Cohenand Dupas, 2010; Dupas, 2010). For this reason, there is a largeconsensus that ITNs should be fully subsidized (WHO, 2007; Sachs,2005).

Teklehaimanot (2007) estimate that providing one free long-lasting ITN for every two at-risk persons in sub-Saharan Africawould amount to 2.5 billion dollars. The funds committed bygovernments and donor agencies for ITNs have not yet reachedthat amount, however. For example, the Government of Kenyaestimates that around 1 million pregnant women are in need ofan ITN every year, but their budget will allow them to provide only0.5 million nets per year to pregnant women over the next 5 years(Kenya Round 7 Proposal, 2007).

Under such a budget constraint, the question of how to allocatethe available ITNs among households becomes an important policyquestion. If the treatment effect (the health impact of getting asubsidized ITN) is exactly the same for everyone in the population,then all possible allocations will lead to the same overall gains.However, when there is heterogeneity in the treatment effect(e.g. the health impact of getting a subsidized ITN varies withobserved covariates, such as socioeconomic status, presence ofchildren in the household, etc.), the gains can be maximized bya covariate-based allocation. While the health impact of using anITN might be homogeneous, the health impact of getting a highlysubsidized ITN might vary across covariates since usage rates(conditional on having a net) are likely to vary across covariates.

9 See Lengeler (2004) for a review. Earlier estimates of ITN use on reductions inchild mortality from a randomized trial in Gambia were as high as 60%, but mostestimates from randomized trials in Africa are closer to 20%.10 Ettling et al. (1994) find that poor households in a malaria-endemic area ofMalawi spend roughly 28% of their cash income treating malaria episodes.

Table 1Summary statistics.

Sample mean

Treatment 0.16(0.37)

Outcome = 1 (All) 0.16(0.37)

Outcome = 1 (treatment group) 0.60(0.49)

Outcome = 1 (control group) 0.08(0.27)

Has a child under 10 years of age 0.54(0.50)

Household size 6.94(2.65)

Household’s wealth in US$, per capita 47(34)

Owns a bank account 0.13(0.34)

Observations (households) 1008

Standard Deviations in parentheses. Household-level data collected in WesternKenya in 2007. ‘‘Treatment’’ is a dummy equal to 1 if the household received acoupon for a bed net to be purchased at a low price ($0 or $0.50), and 0 if thehousehold received a coupon for a bed net to be purchased at a price of $2 orabove. Outcome = 1 only if (1) the household has redeemed the coupon and (2)the household had started using the bed net at the time of the follow-up visit.

For example, households who can afford to purchase an ITN inthe absence of any subsidy (because they have access to creditor are wealthy enough) will not benefit from the treatment verymuch (i.e. their β0(x)will be large and thus for them the differenceβ1(x)− β0(x) is likely to be small). Likewise, since young childrenare the most vulnerable to the disease, households without youngchildren might not benefit much from the treatment (i.e. theirβ1(x) will be small and thus the difference β1(x) − β0(x) is likelyto be small). For these reasons, the treatment effect is likely tovary across observed covariates such as wealth, access to financialservices, and the presence of young children. An allocation rule thattakes into account such heterogeneity could potentially generateimportant welfare gains.

7.2. Design

For this applicationwe use data from a randomized experimentconducted with rural households in Western Kenya in 2007(Dupas, 2009). The price at which a household could purchase anITN varied from $0 (a free ITN) to $4, in steps of $0.50. Householdswere randomly assigned to a price. People had three months toredeem the voucher entitling them to an ITN at the assigned price.In this application, we consider two groups: households that faceda very low (highly subsidized) price ($0 or $0.50) and householdsthat faced a high price of $2 or more. Table 1 presents summarystatistics on the 1007 households that form the sample used inthe analysis. The take-up rate of the ITN subsidy was 84% in thelow price group and 13% in the high price group. Conditional ontake-up, the usage rate was slightly higher in the low price groupthan in the high price group (70% versus 58%), leading to coveragerates of 60% and 8%, respectively. In what follows, we consider thelow price group as the treatment group and the high price groupconstitutes the control. The treatment is thus ‘‘having access to alow-price ITN’’.11

11 One may note that the short-run take-up in the low price group was not 100%,since some of the ‘‘treated’’ had to pay a small fee (i.e., $0.50) to access the net. Inthe analysis below, however, we assume that take-up was 100% in the treatmentgroup. In other words, we consider that those who did not take-up the subsidy costas much to the government as those who took-up the subsidy but did not use theirnet. This is because the take-up would potentially have reached 100% if people hadmore than three months to redeem their voucher.

Page 8: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 175

The welfare measure we consider is the fraction of householdscovered by an ITN (i.e., the fraction that owns and uses an ITN).While incidence ofmalaria, health levels,wages lost due tomalaria,or even consumption foregone due to wages lost due to malaria,could be the ‘‘ultimate’’ outcomes of interest, we concentrateon the ITN coverage rate for two reasons. First, as discussedabove, multiple trials have established that ITN coverage leadsto significant reductions in morbidity and mortality, especiallyamong young children (see Lengeler (2004) for a review), andtherefore the coverage rate is a good proxy for those ultimateoutcomes. Second, in our data, coverage was observed directly(through surprise home visits during which enumerators checkedwhether the ITN was hung above a bed), and therefore is notsubject to reporting biases, in contrast with self-reported healthmeasures.

To test for heterogeneity in the treatment effect, we run anOLS regression of ITN coverage on the treatment, three covariates,and the interactions between the treatment and the covariates.The covariates are: a binary variable equal to 1 if the householdincludes at least one child under 10; the natural log of the valueof the household’s wealth per capita; and a binary variable equalto 1 if the household owns a bank account. The first covariate(presence of a child) was chosen as an indicator of the privatereturns to using a bed net (since young children are the mostvulnerable to malaria). The two other covariates were chosenas proxies for socioeconomic status and ability to pay. Theywere measured through a baseline survey administered throughhousehold visits. In particular, wealth per capita was measured asfollows: households were asked to list all their assets (includinganimal assets) and to estimate their resale value. The combinedvalue of all assets was then divided by household size to obtainthe ‘‘wealth per capita’’ indicator. Such wealth is typically heldin relatively illiquid assets, however. For this reason, we include‘‘ownership of a bank account’’ as a second ability-to-paymeasure.Savings held in a bank are perfectly liquid, and in our studyarea, access to formal financial services has been shown to bean important determinant of households’ ability to save andacquire lumpy durables (Dupas and Robinson, 2009). Of the threecovariates we use, the most costly one to measure is wealth,which typically requires a lengthy home visit by an enumerator.Conditional on making this visit, asking respondents about theirhousehold composition and their ownership of a bank accountwould not be onerous. One issue though is that of the reliabilityof such self-reported information, once people know the allocationrule. This is not a concern in our dataset since the respondentswereexplicitly told (through the informed consent process) that theiranswers would have no bearing on what treatment they wouldbe eligible for in the future, but in an actual program, once theallocation rule is well-known, collecting reliable information onthe covariates that determine eligibility might be quite costly. Thiscost would then need to be accounted for in choosing covariates,as discussed above.

The results are presented in Table 2. The treatment wasrandomized at the household-level so no clustering correction isneeded. We find that the treatment effect appears significantlyhigher for householdswith a child under 10 and significantly lowerfor households that own a bank account. The treatment effect isalso lower for those with greater wealth, but the standard error isalso large and the interaction term is not significant. An F-test ofthe joint significance of the three interaction terms is significant at5%.12

12 Two commonly used covariates are age and education level. They do not appearto affect the treatment effect in our sample, possibly because the sample is relativelyhomogeneous in terms of those two characteristics.

Table 2Treatment effects.

Outcome

Treatment 0.601(0.293)**

Has a child under 10 years of age 0.011(0.022)

Treatment X has a child under 10 years of age 0.104(0.054)*

Log wealth per capita 0.034(0.016)**

Treatment X log wealth per capita −0.013(0.037)

Has a bank account 0.038(0.031)

Treatment X has a bank account −0.214(0.103)**

Constant −0.2(0.126)

Observations 1008R-Squared 0.29Joint F-test for three interaction terms 2.78Prob> F 0.040

OLS estimates. Standard errors in parentheses. ***, **, * indicate significance at 1%,5%, and 10% levels, respectively.

7.3. Analysis

7.3.1. Non-parametric analysis: choice of kernels and bandwidthsFor bias-removal, we use the kernels13

K(t) = 0.5 ×3 − t2

× φ (t) ,

L(t) =1532

75t5 −

103

t3 + 3t +1615

×1 (−1 ≤ t ≤ 1)+ 1 (t > 1) ,

where φ(·) is the standard normal density. Two bandwidths areneeded for the non-parametric estimation: the bandwidth ωn inthe estimation of the conditional ATE β (X), and the bandwidth hnin the smoothing correction.

Fig. 1 illustrates how the estimated treatment threshold γ(Panel A) and value function ρ (Panel B) vary with hn for a rangeof possible ωn. We find that both estimates are insensitive tothe choice of hn. In Fig. 2, we present γ and ρ for two budgetconstraint levels: c = 0.5 (Panel A) and c = 0.25 (Panel B).The stability of ρ over a reasonable range of bandwidths suggeststhat the choice of bandwidths should have little effect on thenonparametric estimates of the value function.

Fig. 3 presents a leave-one-out cross validation criterionfunction for β(x). The function is plotted over the rangeωn ∈ [0.3, 0.4], which correspond roughly to n−1/6 and n−1/8,respectively. The function seems to dip around ωn = 0.33. Giventhe small sensitivity of our estimates of ρ and, to a certain extent,γ to the choice of ωn, we show the results for both ωn = 0.3 andωn = 0.4. We use hn = 0.35; recall that the results seem veryinsensitive to the choice of hn for a given choice of ωn.

7.3.2. Conditional ATEThe nonparametric estimate of the CATE β(x) was computed

corresponding to two bandwidths ωn = 0.3 and ωn = 0.4.The parametric estimator of β(X) was computed as β(x) =π0 + x′π01

, where π0 and π01 are OLS estimates in the regression

13 To see how our results are affected by choice of a higher order kernel, werepeated the analysis for a standard normal kernel instead of K (t). The results arenumerically very similar and do not imply any substantively different conclusion(results available upon request).

Page 9: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

176 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

Fig. 1. Graphs illustrate how the estimated treatment threshold γ (Panel A) and value function ρ (Panel B) vary with hn for a range of possible ωn . Both estimates appearinsensitive to the choice of hn .

Fig. 2. γ and ρ for two budget constraint levels: c = 0.5 (Panel A) and c = 0.25 (Panel B) are presented. The stability of ρ over a reasonable range of bandwidths suggeststhat the choice of bandwidths should have little effect on the nonparametric estimates of the value function.

Fig. 3. The graph illustrates a leave-one-out cross validation criterion function forβ(x). The function is plotted over the range ωn ∈ [0.3, 0.4], which correspondsroughly to n−1/6 and n−1/8 , respectively. The function seems to dip around ωn =

0.33.

(presented in Table 2):

yi = θ0 + X ′

i θ1 + δ0Di + X ′

i δ1 × Di + εi. (10)

Fig. 4, Panel A illustrates the kernel density of the conditionalATE β(X) computed with the two proposed bandwidths, as wellas the parametric estimate for comparison purposes. Observationswith X such that β(X) is below −0.2 or above 0.9 were discardedin accordance with Appendix condition A2. These cutoffs were the1 and 99 percentile values in the empirical distribution of β(X).

Fig. 4, Panel B presents the c.d.f. of the conditional ATEβ (X) computed both parametrically and nonparametrically. The

stepwise shape for the c.d.f. in the parametric model is essentiallydue to the binary nature of two of the three covariates since theinteraction of the treatment with wealth appears to be close tozero in the parametric case. Finally, Panel C of Fig. 4 presentsthe estimates of β1(x) and β0(x) as a function of the continuousregressor (wealth), for each cell defined by the discrete regressors.

7.3.3. Unrestricted and restricted value functionsIn what follows, we compare the ‘‘first best’’ allocation (the

unrestricted case, in which the allocation is based on all threecovariates) with two ‘‘restricted’’ cases: (i) means-testing wherethe allocation is based only on wealth — which is extremelycommon in both developed and developing countries, and(ii) purely random allocation which is not covariate-based at all(this is totally uncommon as far as we know, but makes for aconvenient benchmark). Notice that in the random allocation case,the estimated value function is linear in c:

ρ(c) =1n

ni=1

c × β1 (Xi)+ (1 − c)× β0 (Xi)

.

Fig. 5 shows the parametric and nonparametric estimates forthe treatment threshold γ (c) and the value function ρ (c) in theunrestricted case. The nonparametric estimates seem very stableover the two choices of bandwidth. The nonparametric estimatesof the unrestricted value function are higher than the parametricestimates.

7.3.4. Welfare lossesFig. 6 combines the estimates of the value function ρ (c)

for all three cases: allocation based on all three covariates(called unrestricted), allocation based on wealth only, and random

Page 10: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 177

Panel A presents the kernel density of the conditional ATE β(X) computedwith the two proposed bandwidths, as well as the parametric estimate forcomparison purposes.

Panel B presents the c.d.f. of the conditional ATE β(X) computed bothparametrically and nonparametrically. The stepwise shape for the c.d.f. inthe parametric model is essentially due to the binary nature of two of thethree covariates.

Panel C presents the non parametric estimates of β1(x) and β0(x), for x = logwealth per capita, holding the discrete covariate constants. (Estimated withbandwidth omega = 0.3).

Fig. 4. Kernel-based estimates.

Fig. 5. The parametric and nonparametric estimates for the treatment threshold γ (c) and the value function ρ(c) in the unrestricted case are presented. The nonparametricestimates seem very stable over the two choices of bandwidth.

allocation. Panel A presents the parametric estimates and PanelB presents the nonparametric estimates. Presenting all threecases on the same graph helps visualize the welfare loss whenthe optimal allocation is not implementable. In contrast to theparametric estimates, the non-parametric estimates suggest thatmeans-testing is a clear ‘‘second best’’, generating a higher meanoutcome than random allocation does. The standard errors of thewelfare losses generated by the two suboptimal allocations are

shown in Table A.1 for two budget constraints (c = 25% and c =

50%).

7.3.5. Dual problemIn Table 3 we report the minimum resources needed to

attain a certain expected outcome: we compute the share of thepopulation that needs to be treated in order to achieve a given

Page 11: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

178 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

Fig. 6. The parametric estimates of the value function ρ(c) for all four cases in Panel A and the nonparametric estimates in Panel B are combined. All three cases (unrestrictedallocation, allocation based on wealth only, and random allocation) represented on the same graph helps visualize the welfare loss when the optimal allocation is notimplementable.

Table 3Dual problem: cost of reaching a target outcome.

(1) (2) (3) (4)

Objective function: target ρ(c) Unrestricted case: share of population thatneeds to be treated to reach this target

Restricted cases: additional share that needs to be treatedto achieve the target when conditioning on:Wealth only Nothing (random assignment)

Panel A: parametric computation

0.250 0.287 0.041 0.043(0.027) *** (0.016) *** (0.019) ***

0.400 0.552 0.062 0.068(0.053) *** (0.029) *** (0.028) ***

Panel B: non-parametric computation

B1. Bandwidth ω = 0.30.250 0.232 0.035 0.098

(0.018) *** (0.023) *** (0.025) ***0.400 0.460 0.084 0.160

(0.042) *** (0.027) *** (0.036) ***B2. Bandwidth ω = 0.40.250 0.242 0.030 0.088

(0.020) *** (0.021) *** (0.023) ***0.400 0.483 *** 0.065 0.137

(0.047) *** (0.027) *** (0.035) ***

Standard errors in parentheses, significant at 1% (***), 5%(**), 10% (*) levels. The table reads as follows: (Panel B1, row 1): to reach a target value function of 0.250, a share 0.232of the population needs to be treated if the allocation can be based on all covariates (unrestricted case, column 2). In the presence of restrictions on what the conditioningcan be based on, the efficiency of targeting decreases. An additional 0.035 of the population needs to be treated if the conditioning is based on wealth only (column 3). If theallocation is purely random, an additional 0.098 of the population needs to be treated to achieve the 0.250 target value function (column 4).

target value function by allocating treatment based on all threecovariates (column 2). We then calculate the additional resourcesthat are needed when the optimal, unrestricted allocation isnot possible, and the allocation is instead based on wealth only(column 3) or the allocation is purely random (column 4). Thenonparametric estimates with the bandwidth ω = 0.4 suggestthat, compared to the unrestricted allocation, an allocation basedon wealth only requires treating an additional 6.5% points of thepopulation compared to the optimal allocation (Panel B2, column3). The additional spending is higher when the allocation is purelyrandom: an extra 13.7% points of the population need to be treatedto reach the target usage rate, compared to the optimal allocation(Panel B2, column 4).

7.4. Estimating welfare in finite samples

The application above was based on the large sample approx-imation to welfare ρn generated from using a fixed set of co-variates. The approximation is provided by ρn, the sample-basedestimate of the welfare. This approximation may be inaccurate infinite samples if the number of conditioning covariates is large,relative to the sample size. In order to compare the accuracies of

ρn, the naive sample-based estimate with that of ρn, the cross-validated estimate of finite sample risk, we perform a small MonteCarlo exercise on a parametric specification, as follows.

7.4.1. Monte-CarloMaking randomdraws from the sample of interest, we create an

artificial dataset of 1008 observations and remove the outcomes.We generate outcomes Y1 and Y0 through the probit equation

YT = 1b0 + b1T + b′

2X + b′

3T ⊗ X + e > 0,

where the b coefficients are chosen equal to the estimates froma probit regression of Y on the regressors, the treatment and theinteraction of the regressors and the treatment, in the originalsample. The error e is chosen to be 3 times a standard normal. Thisis used as our population. From this population, we draw samplesof varying size. For each sample, we estimate β and γ througha linear (as opposed to probit) regression, calculate the impliedallocations and, by averaging over many samples draws from thepopulation, compute for each sample size (i) the actual welfare(ρn), (ii) the estimatedwelfare (ρn) and the cross-validatedwelfare

Page 12: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 179

Table 4Estimating welfare in finite samples: Monte-Carlo simulations.

4A. Simulated data: estimated welfare

Conditioning covariate True welfare Leave-one-out CV 2-fold CV Naive

c = 0.50, 5% sampleWealth only 0.4572 0.4601 0.4544 0.4838All 3 covariates 0.4511 0.4508 0.4569 0.5179

c = 0.50, 10% sampleWealth only 0.4574 0.4626 0.4577 0.4877All 3 covariates 0.4588 0.458 0.4595 0.5049

c = 0.50, 25% sampleWealth only 0.4593 0.4601 0.4508 0.4694All 3 covariates 0.4594 0.4375 0.4485 0.4778

4B. Simulated data: bias and standard deviation for estimates of welfare.

Conditioning covariate True welfare Leave-one-out CV 2-fold CV NaiveBias Std. dev. Bias Std. dev. Bias Std. dev.

c = 0.50, 5% sampleWealth only 0.4572 0.0029 0.0765 −0.0028 0.0924 0.0266 0.0824All 3 covariates 0.4511 −0.0003 0.0984 0.0058 0.1034 0.0668 0.0763

c = 0.50, 10% sampleWealth only 0.4574 0.0052 0.0566 0.0003 0.0736 0.0303 0.0571All 3 covariates 0.4588 −0.0008 0.0739 0.0007 0.07 0.0461 0.0573

c = 0.50, 25% sampleWealth only 0.4593 0.0008 0.0291 −0.0085 0.0348 0.0101 0.0263All 3 covariates 0.4594 −0.0219 0.0659 −0.01091 0.0385 0.0184 0.0369

4C. Simulated data: mean squared error for estimates of welfare

Leave-one-out CV 2-fold CV Naive

c = 0.50, 5% sampleWealth only 0.0059 0.0085 0.0075All 3 covariates 0.0097 0.0107 0.0103

c = 0.50, 10% sampleWealth only 0.0032 0.0054 0.0042All 3 covariates 0.0055 0.0049 0.0054

c = 0.50, 25% sampleWealth only 0.0008 0.0013 0.0008All 3 covariates 0.0048 0.0016 0.0017

Notes: See text, section for 7.4.1.

(ρn) where we use both a leave-one-out and a 2-fold CV.14 Theresults are reported in Table 4A (estimated and true welfare), B(Bias and standard deviation of welfare estimator) and C (MSE ofwelfare estimator). We see in Table 4A that as the sample sizeincreases, ρns are generally increasing as expected but ρn and ρnare not. Table 4A also shows that for smaller sample sizes, thewelfare from using fewer covariates is larger and in this case,the leave-one-out CV criterion favors the smaller model while ρnalways favors the larger model.

From Table 4B, we see that the bias of ρn in estimating ρn islarger in almost all cases than the bias in ρns and this is morepronounced as the sample size falls. This is to be expected, giventhat CV is generically equivalent to a higher-order bias removal. Itis also instructive that the bias of ρn is positive, which is consistentwith the point made just before Section 5. In terms of the MSE,reported in Table 4C, the leave-one-out CV performs the bestexcept when the sample size is relatively large and consequentlyρn is at least as good as a leave-one-out CV. However, given thatthe leave-one-out CV criterion performs best in small samples andalmost as well in larger samples as the naive estimator ρn, wepropose using it to choose between alternative subsets of targetingcovariates, in the application. Those results are reported in Table 5and discussed next.

14 Results are indistinguishable between the nonsmooth and the smoothedversion of (9) with hn varying between n−1/5 and n−1/8 .

Table 5Predicted welfare (ρ), leave-one-out cross-validation.

Y1 Y0

Wealth only All 3 covariates

Panel A: c = 0.50, sample size = 500Wealth only 0.3634 (0.3752) 0.3540 (0.3991)All 3 covariates 0.3019 (0.3353) 0.3010 (0.3374)

Panel B: c = 0.50, sample size = 800Wealth only 0.3599 (0.3735) 0.3537 (0.3796)All 3 covariates 0.3296 (0.3425) 0.3296 (0.3621)

Panel C: c = 0.50, full sampleWealth only 0.3385 (0.3565) 0.347 (0.3771)All 3 covariates 0.362 (0.3599) 0.3683 (0.3603)

Notes: See text, Section 7.4.2.

7.4.2. Cross-validation results for applicationWe consider the parametric model (10) and two subsets of

covariates – (i) household wealth – the most popular targetingcriterion in both developed and developing countries and (ii) thecombination of wealth, number of children under 10 andpossession of a bank account. We calculate the leave-one-outwelfare estimate using three samples of different sizes. Thesesampleswere chosen by randomly sorting the original data of 1008observations (with a specific seed) and choosing the top 500, 800and 1008 observations. The welfare from using different subsets ofregressors and different sample sizes are displayed in Table 5.

The table can be read as follows. For the first panel with c =

0.50, sample size = 500, the number 0.3540 was obtained as fol-

Page 13: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

180 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

lows. Choose the first 500 households, as described above. Then(i) regress the outcome for individuals randomized into treatmenton wealth only and (ii) regress the outcome for individuals ran-domized out of treatment on wealth, bank account and child. Foreach individual in the data, predict their outcomeswith the subsidy(Y1) using the first set of estimated coefficients and their outcomewithout the subsidy (Y0) using the second set of estimates. Calcu-late the difference and its median in the sample which is γ . Thencalculate the cross-validated welfare estimate as described in theprevious section. Repeat for the other sample sizes. The highlightedvalues indicate the highest number among the four cells. The naiveestimates (ρ) are reported in parentheses.

From Table 5 it is apparent that CV recommends using wealthonly to target the subsidy when the sample size is less than 800.For sample size 800, the two conditioning sets yield similar levelsof welfare and for the full sample, it is advisable to use all threecovariates for both Y0 and Y1. Further, for the smaller sample sizes,it is nearly as effective to predict Y1 using wealth only and Y0 usingall three covariates as it is to use wealth only for both Y1 andY0. The reverse conditioning produces significantly lower welfare.This asymmetry arises most likely because there are many moreobservations with no treatment and so the additional covariates– children and bank account – have relatively low explanatorypower for treatment effects when the household actually receivesthe subsidy and relatively high explanatory power when it doesnot. In contrast, the naive estimate, which ignores the finite sampleinaccuracy, suggests using all 3 covariates for Y0 for all samplesizes.

8. Conclusion

In this paper, we have considered a social planner’s problemof allocating a binary treatment among a target population basedon observed covariates in the presence of budget constraints.The paper proposes a simple covariate-conditioned allocationrule based on sample data from a randomized experiment. Thepaper then derives and uses large-sample frequentist propertiesof these rules to infer the expected welfare from the rule andthe minimum cost of attaining a specific average welfare —i.e., the dual problem. These methods are applied to data onthe provision of anti-malaria bed nets in Western Kenya. Theempirical findings are that a government which can afford todistribute bed net subsidies to only 50% of its target populationcan, if using an allocation rule based on multiple covariates,increase actual bed-net coverage by 17%–20% points relative torandom allocation. These conclusions are based on large sampleapproximations and we investigate, using leave-one-out and 2-fold cross-validation and in both a Monte Carlo and the actualapplication data, their robustness to small sample inaccuracy.Our findings suggest that under parametric specifications andwhen using evidence from small experimental samples (under800 in our application), it is better to condition the subsidysolely on household wealth; however, for our full sample of size1008, the expected subsidy use increases by about 9% when wecondition on two discrete household characteristics, viz. presenceof small children and possession of a bank account, in addition towealth.

This paper has left several topics to future research. A formalanalysis of covariate-selection and post model-selection inferencein treatment choice problems is being pursued by the presentauthors. This analysis uses tools from the frequentist literature onmodel selection, paying specific attention to bias-variance trade-offs implicit in various forms of cross-validation or resampling-based techniques. A particularly challenging task is to considerthis problem when treatment responses are not assumed tohave specific parametric forms. On the substantive front, one

possible extension is to the design of conditional cash-transferprograms, which have gained popularity in a large number ofcentral and south American countries. A second extension wouldbe to incorporate treatment externalities into an analysis ofefficient treatment assignment. Extension to multiple treatmentswould also be a theoretically interesting and practically relevantexercise.

A caveat to our analysis is the implicit condition that thecovariate distributions are not affected by the targeting strategyused. This may be violated if the population composition changesin response to changes in the targeting rule, e.g., switching subsidyeligibility toward families with children in a district may see aninflux of families with children from neighboring districts, therebyaltering the marginal distribution of covariates. Such migration isplausible only when the size of the transfer is high enough relativeto migration costs and thus quite unlikely at the usual scale ofin-kind transfer schemes present in the developing world today.But for larger sized transfers, this caveat can potentially be animportant one.

The methods proposed here have wider applicability, beyondsubsidy targeting in developing countries, to nearly any situationof constrained treatment assignment such as deciding eligibilityrules for access to credit under aggregate fund constraints orallocating the unemployed to subsidized job-training programswhen subsidy totals are limited by the government’s budgetoutlay.

Wewould like to endwith the observation that in developmentcircles, there has been a recent push for more experimentalevidence on the impact of social programs, as part of a generaleffort to improve the effectiveness of aid (Duflo et al., forthcoming).For example, theWorld Bank recently launched theDIME initiative,an effort to increase the number of Bank-funded projects withimpact evaluation components. We believe that as randomizedtrials of social programs, e.g., Oportunidades (PROGRESA) inMexico, become more common in both developed and developingcountries, our methodology will become increasingly relevant inhelping governments and aid-agencies roll out positive-impactprograms via efficient allocation rules.

Appendix

A.1. Proofs of theorems

In the proofs below, CMT will denote continuous mappingtheorem and DCT the Lebesgue dominated convergence theorem.The discrete regressors will not play any substantive roles in ouranalysis; so we will drop them in our proofs and put them backinto our final results at the end. In our proofs, the notation β(x)and γ will be used to denote values intermediate between β(x)and β(x) and γ and γ , respectively; c,M1 and M(x) will denotea finite number, a bounded positive constant and a uniformlybounded positive function respectively, whose actual values maybe different in different places. The latter will be used in theexpressions for upper bounds for various quantities which appearin the proof. We first state a set of conditions under which thefollowing results will hold.

A0(i) (Yi, Xi,Di) i = 1, 2, . . . , n is a random sample; A0(ii) D israndomly allocated;

A1. Conditional on every value xd assumed by the discreteregressors, Xc is a p-dimensional compact subset of the supportof the continuous components X c ; the density of X c satisfies thatFX (x) ≥ δ > 0 for all x ∈ Xc ; furthermore, µ1(·), µ0(·) andthus β(·) and their qth order derivatives are uniformly boundedon Xc for some q > p. Also, E

Yjm and E

Yjm |X = x

× FX (x)

are bounded for somem ≥ 4.

Page 14: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 181

Table A.1Allocation efficiency (with higher order kernels).

(1) (2) (3) (4)

Population share c that the program can afford to treat Value function ρ(c): unrestricted case Restricted cases: efficiency loss as a share of ρ(c)when conditioning on:Wealth only Nothing (random assignment)

Panel A: parametric estimates

0.00 0.080.25 0.23 0.09 0.09

(0.02)*** (0.03)*** (0.04)**0.50 0.37 0.09 0.09

(0.03)*** (0.04)** (0.04)**

Panel B: non-parametric estimates

B1. Bandwidth ω = 0.30.00 0.080.25 0.26 0.08 0.21

(0.01)*** (0.04)* (0.04)***0.50 0.42 0.11 0.20

(0.03)*** (0.03)*** (0.03)***B2. Bandwidth ω = 0.40.00 0.080.25 0.26 0.07 0.18

(0.01)*** (0.04)* (0.04)***0.50 0.41 0.09 0.17

(0.03)*** (0.03)*** (0.03)***

Panel C: differences between non-parametric (bandwidth ω = 0.3 and parametric estimates

0.25 0.03 0.00 0.12(0.01)*** (0.05) (0.04)***

0.50 0.05 0.02 0.11(0.02)*** (0.04) (0.03)***

Unrestricted case: conditioning on all 3 covariates available (presence of a child under 10, bank account ownership and normal log of value of household’s wealth per capita.)Standard errors in parentheses, significant at 1% (***), 5%(**), 10% (*) levels. The table reads as follows: (Panel B1, second row): by treating a share 0.25 of the population, avalue function of 0.26 will be reached if the allocation can be based on all covariates (unrestricted case, column 2). In the presence of restrictions on what the conditioningcan be based on, the efficiency of targeting decreases. The value function will be 8% lower than in the unrestricted case if it conditions only on wealth (column 3), and 21%lower if the allocation is random (column 4).

A2. For someM > 0, β(x) ∈ [−M,M] for every x ∈ X; A3. K(·)is a qth order p-dimensional bounded kernel, with q > p and thebandwidth sequence ωn satisfies (i) ωn → 0 (ii)

√nωq

n → 0; A4.The kernel L (·) is uniformly bounded with a bandwidth sequencehn → 0 and nhn → ∞.

A1 and A2 are standard. Xc corresponds to the fixed trimming.condition A3 part (i) is standard. condition A3 part (ii) is an‘‘undersmoothing’’ requirement, which is commonly used insemiparametric problems for bias removal; it is also a keycondition for condition B11 below (c.f. NM, Lemma 8.10).

Let fβ (u) and fβ (u) denote respectively the estimated and thetrue density ofβ (X) atu. Under conditionsA1 andA2,we can applyLemma B.1 of Newey (1994) to conclude

B1. supx∈X

β(x)− β(x) = Op

ln nnωp

n

1/2+ ω

qn

and ap-

ply Theorem 7 of Hansen (2008) specialized to bounded cn(given fixed trimming), iid data and r = 0, to conclude:B2. supu∈[−M,M]

fβ (u)− fβ (u) = op (1). Although these are con-

sequences, rather than primitive conditions, we refer to these asconditions B1 and B2 for easy reference in subsequent use.

Introduce the following additional conditions:B3(i) The first derivative of kernel L(·), denoted by L, is

uniformly bounded; additionally L(·) satisfies a uniform Lipschitzcondition:

L (u)− Lu′ ≤

u − u′Λ for some Λ < ∞

for all u, u′; B4(i) hn → 0, nh4n → α ∈ (0,∞] and n1/4

ln nnωp

n

1/2+ ω

qn

→ 0; B5. The density of β(X) is strictly positive

on an open set containing γ .

Most standard kernels with polynomial structure and withbounded support will satisfy B3.

Let Fβ (t) ≡1n

ni=1 L

t−β(Xi)

hn

.

Lemma 1. Under conditions A0–A3, A4(i), B1–B3(i) and B4(i),

supt∈[−M,M]

Fβ (t)− Fβ (t) P−→ 0.

Proof. Observe that

Fβ (t)− Fβ (t) =1n

ni=1

L

t − β (Xi)

hn

1n

ni=1

Lt − β (Xi)

hn

+1n

ni=1

Lt − β (Xi)

hn

− 1 (β (Xi) ≤ t)

+

1n

ni=1

1 (β (Xi) ≤ t)− Fβ (t)

=

1nhn

ni=1

L

t − β (Xi)

hn

β (Xi)− β (Xi)

+

1n

ni=1

Lt − β (Xi)

hn

− 1 (β (Xi) ≤ t)

+

1n

ni=1

1 (β (Xi) ≤ t)− Fβ (t)

.

Page 15: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

182 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

Therefore,

supt∈[−M,M]

Fβ (t)− Fβ (t)

≤ supt∈[−M,M]

1nn

i=1

1 (β (Xi) ≤ t)− Fβ (t)

+ sup

t∈[−M,M]

1nn

i=1

Lt − β (Xi)

hn

− 1 (β (Xi) ≤ t)

+

1n1/4hn

1n

ni=1

supt∈[−M,M]

Lt − β (Xi)

hn

×

n1/4 sup

a

β (a)− β (a) .

By condition B3(i) (i.e. L(·) is uniformly bounded), condition B4(i)and condition B1, the third term is op (1). The first term is op (1)by the standard Glivenko–Cantelli theorem. Under the statedconditions on L and that β (X) has a Lebesgue density uniformlybounded above, we can apply Lemma 4 of Horowitz (1992) toconclude that the second term in the previous display is op (1).(This is analogous to Horowitz’s proof that limα→0 Pr

b′x < α

;

here we have that

limα→0

Pr (|t − β(X)| < α) = limα→0

Fβ (t + α)− Fβ (t − α)

≤ 2 lim

α→0

α × sup

t∈R

fβ (t)

= 0,

and the rest of the proof is identical to Horowitz Lemma 4). �

Theorem 1 (Consistency of γ ):Proof. Fix ε > 0. Then Fβ (γ + ε) − 1 + c > 0 and 1 − c −

Fβ (γ − ε) > 0, by condition (B5). Therefore, we have that

Prγ − γ

> ε

≤ Prγ > γ + ε

+ Pr

γ < γ − ε

≤ Pr

Fβγ> Fβ (γ + ε)

+ Pr

Fβγ< Fβ (γ − ε)

= Pr

1 − c > Fβ (γ + ε)

+ Pr

1 − c < Fβ (γ − ε)

≤ Pr

Fβ (γ + ε)− 1 + c < Fβ (γ + ε)− Fβ (γ + ε)

+ Pr

1 − c − Fβ (γ − ε) < Fβ (γ − ε)− Fβ (γ − ε)

≤ Pr

Fβ (γ + ε)− 1 + c < sup

t∈[−M,M]

Fβ (t)− Fβ (t)

+ Pr1 − c − Fβ (γ − ε) < sup

t∈[−M,M]

Fβ (t)− Fβ (t)

both of which converge to zero by Lemma 1. �

Now, define

fβ (t) =1

nhn

ni=1

L

t − β (Xi)

hn

.

The following lemma shows that fβ(·) converges to fβ(·) inprobability, uniformly on the support of β(X).

Lemma 2. Under conditions A0–A4, and B3–B5,

supu∈[−M,M]

fβ (u)− fβ (u) = op (1) .

Proof. Observe that

fβ (u)− fβ (u) =1

nhn

ni=1

L

u − β (Xi)

hn

− fβ (u) .

By triangle inequality,

supu∈[−M,M]

fβ (u)− fβ (u) ≤ sup

u∈[−M,M]

fβ (u)− fβ (u)

+ supu∈[−M,M]

fβ (u)− fβ (u) .

The first term is op (1) due to B2. As for the second term, notice thatfβ (t)− fβ (t) =

1nhn

ni=1

L

t − β (Xi)

hn

− L

t − β (Xi)

hn

supx

β(x)− β (x)

hnΛ, by B3,

= Op

1hn

×

ln nnωp

n

1/2

+ ωqn

,

by conclusion B2 and Assumption B3. Therefore by condition B4,we get the conclusion. �

Additional conditions:A4(ii) The kernel L(·) has two derivatives which are also

uniformly bounded. For some r ≥ 2, we have B7. The densityof β(X) is (r − 1) times continuously differentiable, the (r − 1)thderivative f (r−1)

β (·) is bounded and Lipschitz in a neighborhood ofγ ; B8. nh2r+1

n → λ < ∞; B9. L(·) is symmetric around zero andhas bounded support [−1, 1], is of order r and

1−1 L

2 (u) du < ∞.Also assume that B10. σ 2

j (x) = Var(Yj | X = x) for j = 1, 2 arefinite; B11. For j = 0, 1,√n sup

x∈X

µj (x)− µj(x) πj(x)− πj(x)

= op (1) ,

√n sup

x

πj(x)− πj(x)2 = op (1) .

ConditionB11 is basically the sameas B4withµ andπ replacingβ and hold under exactly the same conditions as B4. These arewell-known requirement for

√n-normality for semiparametric

estimators (c.f. Newey and McFadden (1994, Section 8.3))Theorem 1 (Distribution of γ ):

To derive the distribution theory for γ , wewill use the followingfirst-order approximation

Fβ (γ ) = 1 − c = Fβγ

= Fβ (γ )+ (γ − γ )fβ (γ ) ,

where γ is between γ and γ . This gives us the following first-orderexpansion for γ :

γ − γ

=

fβ (γ )

−1Fβ (γ )−

1n

ni=1

Lγ − β (Xi)

hn

T1n

+

fβ (γ )

−11n

ni=1

Lγ − β (Xi)

hn

− L

γ − β (Xi)

hn

T2n

.

(11)

The proof will proceed in three steps: step 1 is that the multiplier{fβ (γ )}

−1 converges in probability to {fβ(γ )}−1. Step 2 is that theterm T1n will be Op(

1√nhn). Finally in step 3 we will show, using U-

statistic type decompositions, that the term T2n will be Op(1

√nhn).

Thesewill eventually lead to the result that√nhn(γ−γ ) converges

to a normal distribution, as required.

Page 16: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 183

Proof. Step 1. We first show that

fβ (γ )− fβ (γ )P−→ 0. (12)fβ (γ )− fβ (γ ) ≤

fβ (γ )− fβ (γ )+ fβ (γ )− fβ (γ )

≤ sup

a∈[−M,M]

fβ (a)− fβ (a)

op(1),by Lemma 2

+fβ (γ )− fβ (γ )

op(1)by CMT and Theorem 1

= op (1) .

Step 2: We will show thatnhn

Fβ (γ )− L

γ − β (Xi)

hn

= κ + op (1) . (13)

Observe that

Tn ≡1n

ni=1

Fβ (γ )− L

γ − β (Xi)

hn

=

1n

ni=1

Fβ (γ )− 1 (β (Xi) ≤ γ )

+

1n

ni=1

1 (β (Xi) ≤ γ )− L

γ − β (Xi)

hn

≡ T2n − T1n. (14)

Now,nhnT2n =

hn ×

1√n

ni=1

Fβ (γ )− 1 (β (Xi) ≤ γ )

Op(1)

= op (1) .

We will show that

E

nhnT1n − κ2

= hnVar√

nT1n

+

E

nhnT1n − κ2

→ 0 (15)

and thusnhnT1n − κ = op (1) . (16)

Now,

Var√

nT1n

= E1 (β (Xi) ≤ γ )− L

γ − β (Xi)

hn

2

Fβ (γ )− E

Lγ − β (Xi)

hn

2

. (17)

Observe that

E1 (β (Xi) ≤ γ )− L

γ − β (Xi)

hn

2

=

γ

−M

Lγ − thn

− 1

2

fβ (t) dt

+

M

γ

Lγ − thn

2

fβ (t) dt

and both of the terms in the previous display converge to zero bythe DCT since lima→∞ L (a) = 1 = 1 − lima→−∞ L (a).

Next,

Fβ (γ )− ELγ − β (Xi)

hn

=

γ

−M

1 (t ≤ γ )− L

γ − thn

fβ (t) dt

M

γ

Lγ − thn

fβ (t) dt → 0, by the DCT.

Thus, from (17), we have that

Var√

nT1n

→ 0 as n → ∞. (18)

Next, consider

E (T1n) = E1 (β (Xi) ≤ γ )− L

γ − β (Xi)

hn

=

Fβ (γ )−

M

−MLγ − thn

fβ (t) dt

=

Fβ (γ )− L

γ − thn

Fβ (t) |M−M

−1hn

M

−MFβ (t) L

γ − thn

dt

=

Fβ (γ )−

γ+Mhn

γ−Mhn

Fβ (γ − uhn) L (u) du

= (−1)r+1 hrn

r!× f (r−1)

β (γ )×

1

−1urL (u) du

+ ohrn

, by condition B7.

This implies that

E

nhnT1n

= (−1)r+1√nhr+1/2

n

r!× f (r−1)

β (γ )

×

1

−1urL (u) du + o

hrn

→ κ, by conditions B7, B8. (19)

Now, (18) and (19) imply (15) and thus (16).Step 3: We will now analyze the second term in (11):

Sn =1n

ni=1

Lγ − β (Xi)

hn

− L

γ − β (Xi)

hn

du,

using U-statistic type decompositions to show thatnhnSn =

√hn

√n

nj=1

λ1n

Zj− E

λ1n

Zj

−λ2n

Zj− E

λ2n

Zj

+ op (1)d−→ N

0, η2

, (20)

where the triangular arrays λ1nZj, λ2n

Zjand the constant η2 >

0, will be specified below.To that end observe thatnhnSn =

√hn

√n

ni=1

Lγ − β (Xi)

hn

− L

γ − β (Xi)

hn

=

√hn

√nhn

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

+

√hn

2√nh2

n

ni=1

β (Xi)− β (Xi)

2L′

β (Xi)− γ

hn

.

Page 17: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

184 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

3)

1√n

ni=1

π1 (Xi) µ1 (Xi)− µ1 (Xi) π1 (Xi)

π21 (Xi)

1hn

Lβ (Xi)− γ

hn

=√n

1n (n − 1)

ni=1

j=i

1

π21 (Xi)

π1 (Xi) YjDj − µ1 (Xi)Dj

pnKXj − Xi

ωn

×

1hn

Lβ (Xi)− γ

hn

=wn(Zi,Zj),say

=1

√n (n − 1)

ni=1

j=i

wnZi, Zj

− E

wnZi, Zj

|Zi− E

wnZi, Zj

|Zj+ E

wnZi, Zj

U1n

+1

√n

nj=1

EwnZi, Zj

|Zj− E

wnZi, Zj

U2n

+1

√n

ni=1

EwnZi, Zj

|Zi

U3n

(2

Box I.

The second term in absolute value has an expectation which is ofthe order of

supx∈X

β(x)− β (x)2 √

n

h3/2n

→ 0, by condition B4.

Next, note that

β (Xi)− β (Xi) =

µ1 (Xi)

π1 (Xi)−µ1(Xi)

π1 (Xi)

µ0 (Xi)

π0(Xi)−µ0 (Xi)

π0 (Xi)

. (21)

Wewill simplyworkwith the first termbecause theproof is exactlyanalogous for the second term and show that

1√nhn

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

= Op (1) .

Step 3A: Now,

1√n

ni=1

µ1 (Xi)

π1 (Xi)−µ1(Xi)

π1 (Xi)

1hn

Lβ (Xi)− γ

hn

=

1√n

ni=1

µ1 (Xi)− µ1(Xi)

π (Xi)

1hn

Lβ (Xi)− γ

hn

1√n

ni=1

µ (Xi)

π (Xi)

π1 (Xi)− π1(Xi)

π1 (Xi)

1hn

Lβ (Xi)− γ

hn

1√n

ni=1

µ1 (Xi)− µ(Xi)

π1 (Xi)− π1(Xi)

π1 (Xi) π1 (Xi)

×1hn

Lβ (Xi)− γ

hn

+1

√n

ni=1

µ(Xi)π1 (Xi)− π1(Xi)

2π21 (Xi) π1 (Xi)

1hn

Lβ (Xi)− γ

hn

. (22)

The last two terms in absolute value have expectations that arebounded above by a positive scalar times

√n supx

µ1(x) −

µ1(x)π1x

− π1(x) and

√n supx

π1(x)− π(x)2. Condi-

tion B11 above then implies that these are both op (1).Now, the first two terms in (22) add up to the equation in Box I).

Step 3B: We first show that

U3n = op (1) . (24)

Notice that we obtain the equation in Box II) for some uniformlybounded function H by condition. Therefore,

U3n = Oωq

n

×

1√n

ni=1

H (Xi)×1hn

Lβ (Xi)− γ

hn

= Op

√nωq

n

= op (1) ,

by condition B4.Step 3C: The term

U1n =1

√n (n − 1)

ni=1

j=i

wnZi, Zj

− E

wnZi, Zj

|Zi− E

wnZi, Zj

|Zj

+ EwnZi, Zj

can be analyzed using essentially the steps of Powell et al. (1989),Lemma 3.1, whence one can conclude that

EU21n

= o (1) . (25)

The key step is to show that

Ew2

n

Zi, Zj

= o (n) .

Observe that we have the equation in Box III).Step 3D: Now consider the term

U2n =1

√n

nj=1

EwnZi, Zj

| Zj− E

wnZi, Zj

.

Observe that we have the equation in Box IV).Notice that

EWZjVnβXj

= E

VnβXj

EWZj

| Xj

=0

= 0.

Page 18: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 185

E

1hnLβ(Xi)−γ

hn

π21 (Xi)

×π1 (Xi) YjDj − µ1 (Xi)Dj

pnKXj − Xi

ωn

Zi

L.I.E.= E

1hnLβ(Xi)−γ

hn

π21 (Xi)

×

π1 (Xi)

µ1Xj

fXj − µ1 (Xi)

π1Xj

fXj 1

ωpnKXj − Xi

ωn

Zi

=

1hnLβ(Xi)−γ

hn

π21 (Xi)

×

π1 (Xi) µ1(x)− µ1 (Xi) π1(x)

f (x)1ω

pnKx − Xi

ωn

f (x)dx

=

1hnLβ(Xi)−γ

hn

π21 (Xi)

×

[π1 (Xi) µ1 (Xi + uωn)− µ1 (Xi) π1 (Xi + uωn)] K (u) du

A1= H (Xi)×

1hn

Lβ (Xi)− γ

hn

× O

ωq

n

Box II.

n−1Ew2

n

Zi, Zj

= n−1E

1

π41 (Xi)

π1 (Xi) YjDj − µ1 (Xi)Dj

2 1

ω2pn

K 2Xj − Xi

ωn

×

1hn

Lβ (Xi)− γ

hn

2

= n−1E

1

π41 (Xi)

1

ω2pn

K 2Xj − Xi

ωn

×

1hn

Lβ (Xi)− γ

hn

2×π21 (Xi) E

Y 2D | Xj

+ µ2

1(Xi)ED | Xj

− 2π1 (Xi) µ1 (Xi) E

YD | Xj

=1

nωpnh2

n

1π41 (X)

K 2 (u)×

Lβ(x)− γ

hn

π21 (x)E

Y 2D | X = x + uωn

+µ2

1(x)E (D | X = x + uωn)−2π1(x)µ1(x)E (YD | X = x + uωn)

fX (x)fX (x + uωn) dudx

= O

1nωp

nh2n

→ 0 which is implied by B4

Box III.

EwnZi, Zj

| Zj

= E

1π21 (Xi)

π1 (Xi) YjDj − µ1 (Xi)Dj

pnK

−Xj + Xi

ωn

×

1hn

Lβ (Xi)− γ

hn

Yj,Dj, Xj

=

1

π21 (x)

π1 (x) YjDj − µ1(x)Dj

pnK

−Xj + xωn

×

1hn

Lβ(x)− γ

hn

f (x)dx

=

π1Xj + uωn

YjDj − µ1

Xj + uωn

Dj

π21

Xj + uωn

K (u)×

1hn

L

βXj + uωn

− γ

hn

fXj + uωn

du

=

1

π21

Xj π1

XjYjDj − µ1

XjDjfXj×

1hn

L

βXj− γ

hn

K (u) du + O

ωq

n

=

1π21

Xj π1

XjYjDj − µ1

XjDjfXj

W(Zj)

×

1hn

L

βXj− γ

hn

Vn(β(Xj))

+ Oωq

n

Box IV.

Page 19: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

186 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

Now

Var (U2n) = Var

1√n

nj=1

EwnZi, Zj

| Zj

− EwnZi, Zj

= Var

EwnZi, Zj

| Zj

= EWZjV1n

βXj

+ Oωq

n

2− O

ω2q

n

= E

W 2 Zj V 2

1n

βXj

+ Oω2q

n

= O

EW 2 Zj V 2

1n

βXj

.

Now, let ξ 2 (t) = EW 2

Zj

| βXj

= t. Then

EW 2 Zj V 2

1n

βXj

=

M

−Mξ 2 (t)

1hn

Lt − γ

hn

2fβ (t) dt

=1hn

M−γhn

−M−γhn

ξ 2 (γ + uhn) L2 (u) fβ (γ + uhn) du

=1hnξ 2 (γ ) fβ (γ )

M−γhn

−M−γhn

L2 (u) du + terms of smaller order.

This implies that

Var

hnU2n

= Var

hn

n

nj=1

EwnZi, Zj

| Zj

− EwnZi, Zj

→ ξ 2 (γ ) fβ (γ )

−∞

L2 (u) du. (26)

Now we will apply the Liapunov condition and use the LindebergCLT for triangular arrays. Consider the array

Rnj =

√hn

√n

EwnZi, Zj

| Zj− E

wnZi, Zj

,

which is independent across j and ERnj

= 0. Let Un =n

j=1 Rnj.Then

EU2n

=

nj=1

ER2nj

=

hn

n

nj=1

EEwnZi, Zj

|Zj− E

wnZi, Zj

2=

hn

n

nj=1

VarWZjV1n

βXj

+ o (1)

=hn

hnξ 2 (γ ) fβ (γ )

−∞

L2 (u) du + o (1)

→ ξ 2 (γ ) fβ (γ )

−∞

L2 (u) du,

by (26). To apply the Liapunov condition, observe that for anyε > 0,

nj=1

ERnj

2+ε = nhn

n

2+ε2

EW

ZjV1n

βXj2+ε

= O

nhn

n

2+ε2 1

h1+εn

= O

(hnn)−ε/2

→ 0.

Thus the Liapunov condition holds and applying the Lindeberg CLT,we get that

Un =

√hn

√n

nj=1

EwnZi, Zj

| Zj− E

wnZi, Zj

d−→ N

0, ξ 2 (γ ) fβ (γ )

−∞

L2 (u) du. (27)

Putting together (24)–(27), we get thathn

n

ni=1

µ1 (Xi)

π1 (Xi)−µ1(Xi)

π1 (Xi)

1hn

Lβ (Xi)− γ

hn

=

√hn

√n

nj=1

λ1nZj+ op (1)

d−→ N

0, ξ 2 (γ ) fβ (γ )

−∞

L2 (u) du,

where

λ1nZj

=

fXj

π21

Xj π1

XjYjDj − µ1

XjDj

×1hn

L

βXj− γ

hn

,

ξ 2 (t) = E

π1(X)YD − µ1(X)D

π21 (X)

f (X)2

| β(X) = t

. (28)

Similarly, we will get thathn

n

ni=1

µ0 (Xi)

π0(Xi)−µ0 (Xi)

π0 (Xi)

1hn

Lβ (Xi)− γ

hn

=

√hn

√n

nj=1

λ2nZj+ op (1)

d−→ N

0, τ 2 (γ ) fβ (γ )

−∞

L2 (u) du,

where

λ2nZj

=

fXj

π20

Xj π0

XjYj1 − Dj

− ν

Xj

1 − Dj

×1hn

Lβ (Xi)− γ

hn

,

τ 2 (t) = E

π0(X)Y (1 − D)− ν(X) (1 − D)

π20 (X)

f (X)2β (X) = t

.

(29)

Thus we get thatnhnSn =

√hn

√n

nj=1

λ1n

Zj− λ2n

Zj

+ op (1)d−→ N

0, η2

,

which establishes (20).

Page 20: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 187

To get the expression for η2, note by direct multiplication thatsince Dj

1 − Dj

= 0 for every j, E

λ1n

Zjλ2n

Zj

= 0.Moreover, E

λ1n

Zj

= 0. Therefore, covλ1n

Zj, λ2n

Zj

= 0.This implies that

η2 =τ 2 (γ )+ ξ 2 (γ )

× fβ (γ )

−∞

L2 (u) du. (30)

Using the definitions of µ1(·), π0 (·) , ν(·) and π1(·), the aboveexpressions simply to

τ 2 (γ ) = E

σ 20 (X)

1 − Pr (D = 1 | X)

β(X) = γ

ξ 2 (γ ) = E

σ 21 (X)

Pr (D = 1 | X)

β(X) = γ

.

Nowput together (12), (13), (20) and (30) to conclude from (11)thatnhn

γ − γ

= κ +

1fβ (γ )

√hn

√n

nj=1

λ1n

Zj− λ2n

Zj

+ op (1)

d−→ N

κ,τ 2 (γ )+ ξ 2 (γ )

fβ (γ )

−∞

L2 (u) du. �

Lemma 3. Under conditions A0–A4 and B1–B11,

1√n

nj=1

β1Xj− β1

Xj

=1

√n

nj=1

DjY1j − β1

Xj

PrD = 1 | Xj

+ op (1) .

Proof. Note that

1√n

nj=1

β1Xj− β1

Xj

=1

√n

ni=1

µ1 (Xi)− µ1(Xi)

π1 (Xi)

1√n

ni=1

µ1 (Xi)

π1 (Xi)

π1 (Xi)− π(Xi)

π1 (Xi)

+

1√n

ni=1

µ1 (Xi)− µ1(Xi)

π1 (Xi)− π(Xi)

π1 (Xi) π1 (Xi)

−1

√n

ni=1

µ1(Xi)π1 (Xi)− π(Xi)

2π21 (Xi) π1 (Xi)

. (31)

The last two terms are both op (1) under condition B11 above.Now, the first two terms in (31) add up to the equation in BoxV).

We will show that

E (U1n)2

= o (1) , (32)

U2n =1

√n

nj=1

ED | Xj

× YjDj − E

DY | Xj

× Dj

+ op (1) , (33)

U3n = op (1) . (34)

Observe that

Eπ1 (Xi) YjDj − µ1 (Xi)Dj

π21 (Xi)

pnKXj − Xi

ωn

ZiL.I.E.= E

1

π21 (Xi)

π1 (Xi)

µ1Xj

fXj − µ1 (Xi)

π1Xj

fXj

×1ω

pnKXj − Xi

ωn

Zi

=1

π21 (Xi)

[π1 (Xi) µ1(x)− µ1 (Xi) π1(x)]

×1ω

pnKx − Xi

ωn

dx

=1

π21 (Xi)

[π1 (Xi) µ1 (Xi + uωn)

− µ1 (Xi) π1 (Xi + uωn)] K (u) duA1= H (Xi)× O

ωq

n

,

for someuniformly bounded functionH by conditionA1. Therefore,U3n = Op

√nωq

n

= op (1) by condition A3 and this establishes(34).

Next observe that

E

1π21 (Xi)

π1 (Xi) YjDj − µ1 (Xi)Dj

pnKXj − Xi

ωn

Zj=

1

π21 (x)

π1 (x) YjDj − µ1(x)Dj

pnKXj − xωn

f (x)dx

=

1

π21

Xj + uωn

π1Xj + uωn

YjDj

−µ1Xj + uωn

DjK (u) f

Xj + uωn

du

=π1XjYjDj − µ1

XjDjfXj 1

π21

Xj + uωn

K (u) du+ωn

π ′

1

XjYjDj − µ′

1

XjDjfXj

×

1

π21

Xj + uωn

K (u) udu + · · · + ωqn

π(q)1

XjYjDj

−µ(q)1

XjDj

fXj 1

π21

Xj + uωn

K (u) uqdu

=π1XjYjDj − µ1

XjDj f

Xj

π21

Xj + O

ωq

n

=

1ED | Xj

2 ED | Xj× YjDj − E

DY | Xj

× Dj

+ O

ωq

n

,

by a dominated convergence theorem, given the uniform bound-edness of π1(·). Together with condition A3, we get (33).

One can establish (32) by essentially repeating the proof ofPowell et al. (1989), Lemma 3.1.

Combining (32)–(34), we get that

1√n

ni=1

µ1 (Xi)

π1 (Xi)−µ1(Xi)

π1 (Xi)

=

1√n

nj=1

π1XjYjDj − µ1

XjDjfXj 1π21

Xj

+ op (1)

=1

√n

nj=1

DjY1j − β1

Xj

PrD = 1 | Xj

+ op (1) . �

Page 21: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

188 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

1√n

ni=1

π1 (Xi) µ1 (Xi)− µ1 (Xi) π1 (Xi)

=

√n

1n (n − 1)

ni=1

j=i

π1 (Xi) YjDj − µ1 (Xi)Dj

π2 (Xi)

pnKXj − Xi

ωn

≡√n

1n (n − 1)

ni=1

j=i

wZi, Zj, ωn

=

1√n (n − 1)

ni=1

j=i

wZi, Zj, ωn

− E

wZi, Zj, ωn

| Zi− E

wZi, Zj, ωn

| Zj+ E

wZi, Zj, ωn

U1n

+1

√n

nj=1

EwZi, Zj, ωn

| Zj− E

wZi, Zj, ωn

U2n

+1

√n

ni=1

EwZi, Zj, ωn

| Zi

U3n

Box V.

|T1n| =

1nn

i=1

β (Xi) L

γ − β (Xi)

hn

− β (Xi) L

γ − β (Xi)

hn

1n

ni=1

β (Xi)− β (Xi)

hn

hnL

γ − β (Xi)

hn

− β (Xi) L

γ − β (Xi)

hn

+γ − γ

hn

×1n

ni=1

β (Xi) L

γ − β (Xi)

hn

supβ(x)− β (x)

hn

1n

ni=1

hnL

γ − β (Xi)

hn

− β (Xi) L

γ − β (Xi)

hn

+

nhn

γ − γ

2nh3

n

1/2

×1n

ni=1

β (Xi) L

γ − β (Xi)

hn

Box VI.

Theorem 2 (consistency of ρ)

Proof. Define

ζ = E {β (Xi)× 1 (γ ≥ β (Xi))} ,

ζ =1n

ni=1

β (Xi)× L

γ − β (Xi)

hn

,

so that

ρ =1n

ni=1

β1 (Xi)− ζ , ρ = E {β1(X)} − ζ .

Now,

ζ − ζ =1n

ni=1

β (Xi) L

γ − β (Xi)

hn

− ζ

=1n

ni=1

β (Xi) L

γ − β (Xi)

hn

1n

ni=1

β (Xi) Lγ − β (Xi)

hn

T1n

+1n

ni=1

β (Xi)

Lγ − β (Xi)

hn

− 1 {β (Xi) ≤ γ }

T2n

+1n

ni=1

{β (Xi) 1 {β (Xi) ≤ γ } − ζ } =op(1), by standard WLLN

.

Now, the equation in Box VI

Since L, L are uniformly bounded, the above display is of the form

supx∈X

β(x)− β(x)

hn× Op (1)+

nhn

γ − γ

2nh3

n

1/2

× Op (1) .

Now, Theorem 2 implies that nhnγ − γ

2= Op (1), condition B1

and B4(i) imply thatsup

β(x)−β(x)hn

= op (1) and that nh3n → ∞. Thus

we have that T1n = op (1).As for T2n, observe that since β(·) is uniformly bounded, by using

steps exactly analogous to step 2 in the proof of Theorem 2 (leadingto (13)), we will get by the DCT that T2n = op (1).

Now combine with Lemma 3 to conclude thatρ − ρ

= op (1).The proof of |ρn − ρ| = o (1) is just a simpler version of (45) whichis proved below. �

Additional conditions:The following additional conditions will be used to prove

Theorem 2.

B4(ii)√n

h2n×

ln nnωp

n

1/2+ ω

qn

2

→ 0; B12. nh6n → ∞, r of

condition B7 is at least 4 and nh2rn → 0.

Theorem 2 (distribution of ρ)

Proof. We will work with the expansion given in Box VIIStep 4:Under condition B1 and condition B4, the fourth term in (35)will be op

1

√n

since L′ (·) is assumed to be uniformly bounded

in absolute value. As for the fifth term, observe by the previous

theorem, that (γ−γ )2

h2n= Op

1

nh3n

= op

1

√n

by condition B12. So

Page 22: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 189

the fifth term in (35)will be op

1√n

. That the sixth term is op

1

√n

follows from combining the two previous results.Step 5: The multiplier for the third term in (35) equals

1nhn

ni=1

β (Xi) Lβ (Xi)− γ

hn

→ E

β (Xi)

1hn

Lγ − β (Xi)

hn

= γ fβ (γ )+ O

hrn

→ γ fβ (γ ) ,

which follows from the standard consistency proof for e.g. kerneldensity estimates.

Combining steps 4 and 5, and replacing the asymptoticexpansion of

γ − γ

from Theorem 1, we get from (35) that

√nζ − ζ

=

1√n

ni=1

β (Xi) L

γ − β (Xi)

hn

− ζ

γ√n

ni=1

1 (β (Xi) ≤ γ )− Fβ (γ )

+

1√n

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

+ γ

12√nh2

n

ni=1

β (Xi)− β (Xi)

2L′

β (Xi)− γ

hn

+ op (1) . (36)

The third term in (36) in absolute value is dominated by

γ

ln nnωp

n

1/2

+ ωqn

2

×

√n

h2n

× Op (1) = op (1) ,

by condition B4(iii).Step 6A: Consider the first term in (36)

=1

√n

ni=1

β (Xi) L

γ − β (Xi)

hn

− β (Xi)× 1 {β (Xi) < γ }

T4n

+1

√n

ni=1

[β (Xi)× 1 {β (Xi) < γ } − ζ ] T5n=Op(1), by CLT.

(37)

We will show that T4n = op (1) using the arguments similar to theones used for showing (16).

Define gβ (a) = afβ (a) and Gβ (a) = a−M gβ (t) dt . Then

E (T4n) =√nEβ (Xi) L

γ − β (Xi)

hn

− Gβ (γ )

=

√n M

−MLγ − thn

gβ (t) dt − Gβ (γ )

=

√n

Lγ − thn

Gβ (t)

M−M

+

M

−M

1hn

Lγ − thn

Gβ (t) dt − Gβ (γ )

=

√n

γ+Mhn

γ−Mhn

L (u)Gβ (γ − uhn) du − Gβ (γ )

= O

√nhr

n

→ 0, by B12. (38)

Next, define gβ (t) = t2fβ (t) and Gβ (t) = t−M gβ (a) da. Then

Eβ (Xi) L

γ − β (Xi)

hn

− β (Xi)× 1 {β (Xi) < γ }

2

=

M

−M

Lγ − thn

− 1 {t < γ }

2

gβ (t) dt

=

M

−ML2γ − thn

gβ (t) dt +

γ

−M1 {t < γ } gβ (t) dt

− 2 M

−MLγ − thn

× 1 {t < γ } × gβ (t) dt.

Using the DCT repeatedly (c.f. the steps leading to (14), we get that

Eβ (Xi) L

γ − β (Xi)

hn

− β (Xi)× 1 {β (Xi) < γ }

2

=

M

−ML2γ − thn

gβ (t) dt +

γ

−M1 {t < γ } gβ (t) dt

− 2 M

−MLγ − thn

× 1 {t < γ } × gβ (t) dt

→ Gβ (γ )+ Gβ (γ )− 2Gβ (γ ) = 0.

This implies that for T4n defined in (37),

Var (T4n) = Var

1√n

ni=1

β (Xi) L

γ − β (Xi)

hn

−β (Xi)× 1 {β (Xi) < γ }

= Var

β (Xi)× L

γ − β (Xi)

hn

−β (Xi)× 1 {β (Xi) < γ }

≤ E

β (Xi) L

γ − β (Xi)

hn

− β (Xi)

× 1 {β (Xi) < γ }

2

→ 0. (39)

From (38) and (39), we get that E (T4n)2 → 0 and thus T4n = op (1).Replacing in (36), we get that

√nζ − ζ

=

1√n

ni=1

[β (Xi)× 1 {β (Xi) < γ } − ζ ]

−γ

√n

ni=1

1 (β (Xi) ≤ γ )− Fβ (γ )

+

1√n

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

+ {γ − β (Xi)}1hn

Lγ − β (Xi)

hn

+ op

1

√n

. (40)

The final step is to analyze the third term in (40), using U-statistic type decompositions. First, notice that analogous to (23),see (Box I), we have that up to op (1) terms the equation in BoxVIII).It is straightforward (replace the kernel involving terms) to verify

Page 23: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

190 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

5)

ζ − ζ =1n

ni=1

β (Xi) Lγ − β (Xi)

hn

− ζ

T1n

+1n

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

1hnβ (Xi) L

γ − β (Xi)

hn

T2n

+γ − γ

1nhn

ni=1

β (Xi) Lγ − β (Xi)

hn

T3n

−1

4nh2n

ni=1

β (Xi)− β (Xi)

2 −2hnL

γ − β (Xi)

hn

+ β (Xi) L′

γ − β (Xi)

hn

T4n

+γ − γ

14nh2

n

ni=1

β (Xi) L′

γ − β (Xi)

hn

T5n

+γ − γ

12nh2

n

ni=1

β (Xi)− β (Xi)

hnL

γ − β (Xi)

hn

− β (Xi) L′

γ − β (Xi)

hn

T6n

(3

Box VII.

that we will get the same conclusion as (25) and (24) here. So weonly perform the analysis for U2n.

Using steps similar to the case for γ , one gets the equation inBox IX.

Therefore,

1√n

nj=1

EwnZi, Zj

|Zj− E

wnZi, Zj

=

1√n

nj=1

EwnZi, Zj

|Zj− W

Zj× 1

βXj

≤ γ

+1

√n

nj=1

WZj× 1

βXj

≤ γ

=1

√n

nj=1

WZj× 1

βXj

≤ γ

+1

√n

nj=1

WZjL

γ − β

Xj

hn

− 1

βXj

≤ γ

+γ − β

Xj 1

hnL

γ − β

Xj

hn

Tnj

.

(41)

Now, we will show that the second term in the previous displayis op (1). Letting ω2 (t) = E

W 2

Zj

| βXj

= t, we have

that

ET 2nj

=

M

−Mω2 (t)

Lγ − thn

− 1 (t ≤ γ )

2

fβ (t) dt

+

M

−Mω2 (t)

γ − thn

2

L2γ − thn

fβ (t) dt

+ 2 M

−Mω2 (t) L

γ − thn

γ − thn

Lγ − thn

fβ (t) dt. (42)

The first term in (42) equals M

−Mω2 (t)

Lγ − thn

− 1 (t ≤ γ )

2

fβ (t) dt

=

γ

−Mω2 (t)

Lγ − thn

− 1

2

fβ (t) dt

+

M

γ

ω2 (t)Lγ − thn

2

fβ (t) dt

and both of the terms in the previous display converge to zeroby the DCT since lima→∞ L (u) = 1 = 1 − lima→−∞ L (u).The second integral in (42) converges to zero by the DCT sincelimu→±∞ u2L2 (u) = 0. The third integral in (42) also convergesto zero by limu→±∞ uL (u) = 0 and the DCT. This implies thatET 2nj

→ 0 and thus

0 < Var

1

√n

nj=1

Tnj

= Var

Tnj

≤ ET 2nj

→ 0.

Next,

√nE

WZjL

γ − β

Xj

hn

− 1

βXj

≤ γ

+γ − β

Xj 1

hnL

γ − β

Xj

hn

=√nEXj

EWZj|Xj

=0

×

L

γ − β

Xj

hn

− 1

βXj

≤ γ

+γ − β

Xj 1

hnL

γ − β

Xj

hn

= 0.

So it follows that the second term in (41) is op (1).

Page 24: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 191

Thus we have that

1√n

ni=1

µ1 (Xi)

π1 (Xi)−µ1(Xi)

π (Xi)

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

=

1√n

nj=1

π1XjYjDj − µ1

XjDj

π2Xj

× fXXj× 1

βXj

≤ γ

+ op (1)

d−→ N

0, γ

−Mω2 (t) fβ (t) dt

,

where

ω2 (a) = E

π1XjYjDj − µ1

XjDj

π2Xj

× fXXj2

|βXj

= a.

Using exactly analogous steps, we will also get that

1√n

ni=1

µ0 (Xi)

π0(Xi)−µ0 (Xi)

π0 (Xi)

L

γ − β

Xj

hn

+γ − β

Xj 1

hnL

γ − β

Xj

hn

=1

√n

nj=1

π0XjYj1 − Dj

− µ0

Xj

1 − Dj

π20 (X)

× fXXj1βXj

≤ γ

+ op (1) .

d−→ N

0, γ

−Mτ 2 (t) fβ (t) dt

,

where

τ 2 (a) = Eπ0(X)Y (1 − D)− µ0(X) (1 − D)

π20 (X)

× f (X)2

|β (X) = a.

Finally, we get that

1√n

ni=1

β (Xi)− β (Xi)

L

γ − β

Xj

hn

+γ − β

Xj 1

hnL

γ − β

Xj

hn

d−→ N

0, γ

−M

ω2 (t)+ τ 2 (t)

fβ (t) dt

, (43)

since the covariances will be zero (as can be easily seen from theasymptotic linear expansions because D (1 − D) = 0).

Replacing in (40), and noticing that

π1 (Xi) YiDi − µ1 (Xi)Di

π2 (Xi)× fX (Xi)

=Di

Pr (D = 1 | Xi){Y1i − E (Y1 | Xi)} ,

andπ0 (Xi) Yi (1 − Di)− µ0 (Xi) (1 − Di)

π20 (Xi)

× fX (Xi)

=1 − Di

1 − Pr (D = 1 | Xi){Y0i − E (Y0 | Xi)} ,

we finally arrive at

√nζ − ζ

=

1√n

nj=1

Di

Pr (D = 1 | Xi)

× {Y1i − E (Y1 | Xi)} 1βXj

≤ γ

−1

√n

nj=1

1 − Di

1 − Pr (D = 1 | Xi){Y0i − E (Y0 | Xi)}

× 1βXj

≤ γ

+ γ ×1

√n

nj=1

Fβ (γ )− 1

βXj

≤ γ

+1

√n

ni=1

{β (Xi)× 1 {β (Xi) < γ } − ζ } + op (1) . (44)

Now, observe that

ρ − ρ =1n

nj=1

β1Xj− β1

Xj

D1n

+1n

nj=1

β1Xj− E

β1Xj

D2n

ζ − ζ

.

Lemma 3 describes the behavior of D1n, and D2n is a standardempirical process which yields that

√nρ − ρ

tends to a zero-

mean normal. The final step is to show

ρn − ρ = o

1√n

. (45)

Derivation of (45): Let σ(·) denote the asymptotic variance ofnαβ(x)− γ − β(x)+ γ

, where α < 1

2 (by Theorem 1 andstandard results on asymptotic distribution of Nadaraya–Watsonestimator) and let the support of γ−β(X)

σ (X) be (−A, A) for A > 0.Observe that

rn :=

X

(γ − β(x)) Prβ(x) ≥ γ

1 (β(x) < γ ) dFX (x)

=

X

(γ − β(x)) Pr

nαβ(x)− γ − β (x)+ γ

≥ nα {γ − β(x)}

1 (β(x) < γ ) dFX (x)

Edgeworth=

X

γ − β (x)σ (x)

Φ

−nα

γ − β(x)σ (x)

× 1 (β(x) < γ ) dFX (x)+ smaller order terms

≤ c A

0zΦ (−nαz) f γ−β(X)

σ (X)(z) dz + smaller order terms

=1n2α

Anα

0tΦ (−t) f γ−β(X)

σ (X)=r1n, say

tnα

dt + smaller order terms.

Page 25: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

192 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

1√n

ni=1

β (Xi)− β (Xi)

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

=√n

1n (n − 1)

ni=1

j=i

1

π2 (Xi)

π (Xi) YjDj − µ1 (Xi)Dj

pnKXj − Xi

ωn

×

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

≡√n

1n (n − 1)

ni=1

j=i

wnZi, Zj

=

1√n (n − 1)

ni=1

j=i

wnZi, Zj

− E

wnZi, Zj

|Zi− E

wnZi, Zj

|Zj+ E

wnZi, Zj

U1n

+1

√n

nj=1

EwnZi, Zj

|Zj− E

wnZi, Zj

U2n

+1

√n

ni=1

EwnZi, Zj

|Zi

U3n

Box VIII.

Therefore,

√nr1n =

√n

n2α

Anα

0tΦ (−t) f γ−β(X)

σ (X)

tnα

dt

≤ c1

n2α−1/2

Anα

0tΦ (−t) dt

L′Hospital−−−−−−→ c

A2nαΦ (−Anα) αnα−1

(2α − 1/2) n2α−3/2

= cA2n1/2Φ (−Anα) α(2α − 1/2)

→ 0 as n → ∞.

Now observe that

ρn − ρ

=

X

{β(x)− γ } Prβ (x) ≥ γ

1 (β(x) < γ )

+ {γ − β(x)} Prβ(x) < γ

1 (β(x) ≥ γ )

dFX (x)

+ γ

X

Prβ(x) ≥ γ

1 (β(x) < γ )

− Prβ(x) < γ

1 (β(x) ≥ γ )

dFX (x).

The first term is o

1√n

by the previous calculation. Finally, the 2nd

term is proportional to the equation in Box X)

The leading term in (46) multiplied by√n equals

√n A

−A

[1 − Φ (nαz)] 1 (z > 0)

−Φ {nαz} 1 (z < 0)

f γ−β(X)σ (X)

(z) dz.

Now, observe that

√n A

−A[1 − Φ (nαz)] 1 (z > 0) f γ−β(X)

σ (X)(z) dz

≤ c√n A

0Φ (−nαz) dz

= c√n

Anα

0Φ (−t) dt,

L′Hospital−−−−−−→

Aαnα−1Φ (−Anα)(α − 1/2) nα−3/2

= c√nΦ (−Anα) → 0.

An exactly symmetric argument applies to√n A−AΦ {nαz}

1 (z < 0) f γ−β(X)σ (X)

(z) dz.

Consequently, ρn − ρ = o

1√n

and hence a

√n-consistent CI

for ρ is also a valid√n-consistent CI for ρn.

Outline of proof for parametric β(·): Suppose β(x) is parametricallyspecified asG (x, β), whereG(·) is known; typicallyβ (the so-called‘‘pseudo-true value’’) can be estimated at parametric rates using,say, GMM. For estimation of γ and ρ resulting from the plug-inapproach, we will still use smoothing with the c.d.f. kernel L(·) tohandle the nonsmoothness, since smoothing-based methods aremore generally applicable.

The key result is that both γ and ρ can be estimated at the√n-

rate. To see this, recall the asymptotic expansion for γ :√nγ − γ

=

fβ (γ )

−1 1√n

ni=1

Fβ (γ )− L

γ − G (Xi, β)

hn

+

fβ (γ )

−1

1√n

ni=1

Lγ − G (Xi, β)

hn

− L

γ − GXi, β

hn

.Using similar steps as in the proof of Theorem 2 in the Appendix,the first term is asymptotically normal with mean equal to

limn→∞

√nhr

n ×

(−1)r+1f (r−1)

β (γ )× 1−1 u

rL (u) du

r!

+ o

√nhr

n

which is finite if limn→∞

√nhr

n < ∞.As for the second term, (and this is what makes γ a

√n-

consistent estimator in the parametric case) notice that

1√n

ni=1

Lγ − G (Xi, β)

hn

− L

γ − GXi, β

hn

=

√nβ − β

′ 1nhn

ni=1

∇G (Xi, β) Lγ − G (Xi, β)

hn

+ Tn,

Page 26: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 193

EwnZi, Zj

|Zj

= E

1π21 (Xi)

π1 (Xi) YjDj − µ1 (Xi)Dj

pnK

−Xj + Xi

ωn

×

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

Yj,Dj, Xj

=

1

π21 (X)

π1(x)YjDj − µ1(x)Dj

pnK

−Xj + xωn

×

Lγ − β (Xi)

hn

+ {γ − β (Xi)}

1hn

Lγ − β (Xi)

hn

f (x)dx

=

1

π21

Xj + uωn

π1Xj + uωn

YjDj − µ1

Xj + uωn

DjK (u)

×

L

γ − β

Xj + uωn

hn

+γ − β

Xj + uωn

1hn

L

γ − β

Xj + uωn

hn

f

Xj + uωn

du

=

1

π21

Xj π1

XjYjDj − µ1

XjDjfXj

×

L

γ − β

Xj

hn

+γ − β

Xj 1

hnL

γ − β

Xj

hn

K (u) du + Oωq

n

=

π1XjYjDj − µ1

XjDj

π21

Xj f

Xj

W(Zj)

×

L

γ − β

Xj

hn

+γ − β

Xj 1

hnL

γ − β

Xj

hn

V2n

βXj

+ O

ωq

n

Box IX.

where

|Tn| ≤ Mnβ − β

22√nh2

n

1n

ni=1

M1 (Xi) L′

GXi, β

− γ

hn

,with M a fixed positive constant and M1(X) a uniformly boundedfunction. Since

√nβ − β

= Op (1), by condition B4(i) and A4

(ii), the RHS of the previous display goes to zero if nh4n → ∞. Then

we have that

1√n

ni=1

Lγ − G (Xi, β)

hn

− L

γ − GXi, β

hn

=

√nβ − β

∇G (Xi, β)

× fG(X,β) (γ )+ op (1) .

This implies that√nγ − γ

will converge to a zero mean normal

if nh2rn → 0 and n3h4

n → ∞ and when the density of G (X, β) hasuniformly bounded derivatives up to order (r − 1) where r ≥ 3.The result for ρ will follow.

A.2. Part 2: Split sample simulation

Here we describe and perform a simulation exercise, using asplit sample method in order to address the potential positivebias alluded to in Section 4. Divide the sample randomly into 2parts: call the first group j = 1, . . . ,m and call 2nd group i =

m + 1, . . . , n. Then for each i = m + 1, . . . , n, calculate β(1) (xi)using only observations j = 1, . . . ,m. Then use the second half tocalculate a second independent estimate β(2) (xi) at the same set

of values xi as before. Then β(1) (xi) and β(2) (xi) are functions ofdistinct Y observations which are independent. Finally, calculate γand ρ via:

1 − c =1n

ni=m+1

L

γ − β(2) (xi)

hn

ρ =1n

ni=1

β1 (Xi)−1n

ni=m+1

β(1) (xi)× L

γ − β(2) (xi)

hn

.

Such split-sample methods were previously suggested in Altonjiand Segal (1996) to reduce finite sample bias of GMM estimates.While this split-sample method does not change the asymptoticdistribution of ρ or γ , we investigate its potential finite samplesuperiority in a small Monte Carlo study as follows.

Take X = wealth and take the population to be our sample.Generate ε∼N (0, 1), Y∼

0 N (0, 1), Y1 = Y0 + X + X ∗ ε. ThenE (Y1 | x) = β(x) = x. Set c = 0.5, find median of wealth whichequals γ and then

ρ =1N

Ni=1

Xi × 1 (Xi > γ ) .

This γ and ρ are the ‘‘true’’ parameter values. Finally, generate Das a random Bernoulli variable and set Y = DY1 + (1 − D) Y0.

Now we will do the simulations as follows. For each drawfrom this population, first estimate ρ and γ with the full sampleand then split the sample into two equal halves and repeat theestimation in the split-sample way, as described above. In bothcases, we vary hn between n−1/6 and n−1/8 and ωn between n−5/24

and n−6/24. We present the results graphically in Fig. A.1. From thegraphs, it appears that the full-sample estimates of ρ are generally

Page 27: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

194 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

6)

X

Prβ(x) ≥ γ

1 (β(x) < γ )

− Prβ(x) < γ

1 (β(x) ≥ γ )

dF(x)

=

X

Prβ(x)− γ − β(x)+ γ ≥ γ − β(x)

1 (β (x) < γ )

− Prβ(x)− γ − β(x)+ γ < γ − β(x)

1 (β (x) ≥ γ )

dF(x)

=

X

Prnαβ(x)− γ − β(x)+ γ

≥ nα {γ − β (x)}

1 (β(x) < γ )

− Prnαβ(x)− γ − β(x)+ γ

< nα {γ − β(x)}

1 (β(x) ≥ γ )

dF(x)

=

X

1 − Φ

nαγ − β(x)σ (x)

1 (β(x) < γ )

−Φ

nαγ − β(x)σ (x)

1 (β(x) ≥ γ )

dF(x)+ smaller order terms. � (4

Box X.

closer to the true value and the bias for hn = n−1/8 is in factnegative. Given these results, we use the full-sample estimation inour application.

A.3. Part 3: Discussion of condition AM

Part A: We first present a very simple model of bednet use,which would imply AM(ii). Suppose a representative householdhas utility function defined on consumption and bednet adoptionU (c, d), d ∈ {0, 1} where U (·, d) is continuous, strictly increasingand strictly concave for d = 0, 1. Suppose nonsubsidizedprice of bednet is p in terms of the consumption good and thesubsidized bednet costs 0. Let ε denote unobserved heterogeneityin household preference, distributed with c.d.f. F . Ignore othercovariates (or condition on them), suppose household budget isX and assume for simplicity that ε ⊥ X . Then one can showthat

β(x) = Pr (use net with subsidy | X = x)− Pr (use net without subsidy | X = x)

= Pr (U (x, 1)− U (x, 0) > ε)

− Pr (U (x − p, 1)− U (x, 0) > ε)

= F {U (x, 1)− U (x, 0)} − F {U (x − p, 1)− U (x, 0)} .

If F is uniform, we get β(x) = U (x, 1)−U (x − p, 1). For fixed p >0, β ′(x) =

∂∂xU (x, 1) −

∂∂xU (x − p, 1) < 0 if U is strictly concave

in its first argument. If X is continuously distributed on a boundedsupport, then β (X) – a strictly monotone continuous function ofX – will also be continuously distributed with a bounded support.For other choices of F , we can get piecewise monotone β(·).An intuitive interpretation of β ′ < 0 is simply that individualdemand for bednets is less price-sensitive when individuals arewealthier.

Part B: If we do not assume AM, then the following generalizationsare appropriate. Define

F−1β(X) (1 − c) = sup

a :

x∈X

1 (β(x) ≥ a) dF(x) ≤ c,

δ1 = 1

x∈X

1β(x) ≥ F−1

β(X) (1 − c)dF (x) = c

γ = max

F−1β(X) (1 − c) , 0

δ2 = 1

F−1β(X) (1 − c) > 0

,

and write

ρ = δ1δ2 ×

E {β1 (X)} −

x∈X

[β(x)1 (β(x) < γ )] dF(x)

+ (1 − δ2)× E {β1(X)}+ (1 − δ1) δ2 × [E {β1(X)1 {β(X) > γ }}

+ E {β1(X)1 {β(X)} = γ } × (c − Pr (β(X) ≥ γ ))] .

When δ1 = 1 and δ2 = 1, then we have the case discussed inthe paper. When δ2 = 0, everyone with positive treatment effectwill be treated and hence the second term in the previous display.When δ2 = 1 and δ1 = 0, everyone with CATE above γ is treatedbut this leaves a surplus equal to (c − Pr (β (X) > γ )), which isthen randomly distributed among those with the ‘‘next highest’’value of β(X), which is the third term. So if β(X) has a point massat a with Fβ(X) (a−) > 1 − c > Fβ(X) (a), for some a, then thesurplus may then be randomly distributed among the those withβ(X) = a.

As is clear from the expressions, the sample analog of ρ willhave discontinuities in its asymptotic distribution, depending onwhat values the nuisance parameters δ1 and δ2 take and methodsanalogous to Andrews and Guggenberger would be warrantedfor constructing uniformly valid confidence intervals for ρ. Theseissues are outside the scope of the present paper and we leavethem to future research. condition AM guarantees that δ1 =

1 = δ2. Strengthening AM (i) to ln (n) × {Pr (β(X) > 0)− c} >a for some a > 0 would let us assume away AG typesituations.

A.4. Part 4: Asymptotic framework for covariate choice

Suppose X = (X1, X2) and we want to decide whether tocondition allocation on X2 in addition to conditioning on X1. Fora fixed β = (β1, β2), as n → ∞, the wider model is obviouslybetter. Therefore, to keep the covariate choice problem non-trivialin the asymptotics, we consider a sequence of models Pn with

E (Y1 − Y0 | X = x) := x′βn := x′

1β1 + x′

2δ/√n,

for some δ ∈ Rk. Let β and β1 be the MLE of β in the unrestricted(labeled ur) model and of β1 in the restricted (labeled r) modelrespectively with γ ur and γ r the respective estimates of γ . Withinthis ‘‘limit of experiment’’ framework, Le Cam type convergencetheorems can be invoked under appropriate regularity conditionsto show that the difference in welfare functions resulting from thelarger and smaller set of covariates satisfies

Page 28: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196 195

Fig. A.1. The figures show the results of a simulation exercise comparing our fullsamplemethod to a split samplemethod suggested in order to address the potentialpositive bias alluded to in section 3. See Appendix Part 2 for details.

limn→∞

√nρurn − ρr

n

= lim

n→∞

√nx′

1β1 + x′

×

Φ

√nx′

1β1 + x′

2δ − γ urn

σ ur(x)

− Φ

√nx′

1β1 + x′

1S−111 S12δ − γ r

n

σ r(x)

dF(x),

where σ r(x) and σ ur(x) represent the asymptotic std deviation ofβ ′x− γ and β ′x− γ respectively, S−1

11 S12 represents the projectionmatrix from regressing X2 on X1 and γ r

n and γ urn represent the

true thresholds corresponding to the wider and narrower models,respectively. S−1

11 S12δ corresponds to the familiar ‘‘omitted variablebias’’ formula.

Since these parameters are unknown, potential feasible rulesmay be based on:

dif1n :=1n

ni=1

x′

iβ(−i) ×

1x′

iβ(−i) ≥ γ ur(−i)

− 1

x′

1iβ(−i) ≤ γ r(−i)

, or

dif2n :=1n

ni=1

x′

iβ(−i) ×

L

x′

iβ(−i) − γ ur(−i)

hn

− L

x′

1iβ(−i) − γ r(−i)

hn

.

where the subscript (−i) denotes the leave-one-out estimate. Theformal task is to show that under the sequence of models Pn,the sequences

√ndifn −

ρurn − ρr

n

converge to an appropriate

limiting random variable (see below for how to handle CV basedestimates in the asymptotics). The quantiles of this latter randomvariable can be used to decide on the tolerance threshold c fordifn above which the wider model will be preferred. Two otherissues of interest are those of post-model selection inference, i.e.,estimating

ρurn Pr

difn >

c√n

+ ρr

n Prdifn ≤

c√n

;

and generalizing the above analysis to the case where E(Y1 − Y0 |

X = x) is not necessarily linear in x and yet we are considering twolinear models to make treatment allocation. In this case, β and β1can be interpreted as pseudo-MLEs and dif1n replaced by

1n

ni=1

YiDi

π−

Yi (1 − Di)

1 − π

×

1x′

iβ(i) ≥ γ ur(i)

− 1

x′

1iβ(i) ≤ γ r(i)

,

where π =1n

ni=1 Di. An appropriate local to zero asymptotic

analysis in this case seems to be an interesting taskworth pursuing.Cross-validation (heuristics for the claim in footnote 8): Using thesmoothed estimates and the parametric version of Theorems 1 and2 in the main text, it follows that the difference between the naiveestimate and the CV estimate of welfare is given by

ρjn − ρjn :=1n

ni=1

x′

iβj(−i)L

x′

iβj(−i) − γj(−i)

hn

−1n

ni=1

x′

iβjL

x′

iβj − γj

hn

=1n

ni=1

υxi, hn, βj

′ βj(−i) − βj

+O

βj(−i) − βj

2,

for some vector υxi, hn, βj

. Now, using an influence functions

approach, it can be shown (c.f., CH page 52) that when β is theMLEwith −J the Hessian and u (·) the score, then

βj(−i) − βj = −n−1 J−1u (zi)+ o1n

and the term J−1 is related to the estimated asymptotic variance ofβ . Replacing βj(−i) − βj in the previous display, we get the feasibleform of the penalization term.

Page 29: Inferring welfare maximizing treatment assignment under ...web.stanford.edu/~pdupas/TreatmentAssignment.pdf · JournalofEconometrics167(2012)168–196 Contents lists available atSciVerse

196 D. Bhattacharya, P. Dupas / Journal of Econometrics 167 (2012) 168–196

References

Altonji, J., Segal, L., 1996. Small sample bias in GMM estimation of covariancestructures. Journal of Business and Economic Statistics 14 (3), 353–366.

Andrews, Donald W.K., 1999. Estimation when a parameter is on a boundary.Econometrica 67, 1341–1383.

Attanasio, O., Meghir, Costas, Santiago, Ana, 2011. Education choices in Mexico:using a structural model and a randomized experiment to evaluate Progresa,Review of Economic Studies (in press).

Behnke, Stephanie, Frölich, Markus, Lechner, Michael, 2009. Targeting labourmarket programmes — results from a randomized experiment. swiss. Journalof Economics and Statistics 145, 221–268.

Berger, Mark, Black, Dan, Smith, Jeffrey, 2001. Evaluating profiling as a means ofallocating government services. In: Lechner, Michael, Pfeiffer, Friedhelm (Eds.),Econometric Evaluation of Active Labour Market Policies. Physica, Heidelberg,pp. 59–84.

Bhattacharya, D., 2009. Inferring optimal peer allocation using experimental data.Journal of the American Statistical Association 104 (486), 486–500.

Birens, H, 1994. Topics in Advanced Econometrics: Estimation, Testing, andSpecification of Cross-Section and Time Series Models. Cambridge UniversityPress, (xii + 258 pages).

Chen, X., Linton, O., Van Keilegom, I., 2003. Estimation of semiparametric modelswhen the criterion is not smooth. Econometrica 71, 1591–1608.

Claeskens,Hjort, 2008.Model Selection andModel Averaging. CambridgeUniversityPress.

Cohen, Jessica, Dupas, Pascaline, 2010. Free distribution or cost-sharing? evidencefrom a randomized malaria experiment in Kenya. Quarterly Journal ofEconomics 125 (1), 1–45.

Dehejia, Rajeev H., 2005. Program evaluation as a decision problem. Journal ofEconometrics 125 (1–2), 141–173.

Delgado, M., Mora, J., 1995. Nonparametric and semiparametric estimation withdiscrete regressors. Econometrica 63 (6), 1477–1484.

Duflo, Esther, Hanna, Rema, Ryan, Stephen, 2007. Monitoring works: gettingteachers to come to school, American Economic Review (forthcoming).

Dupas, Pascaline, 2009. What matters (and what does not) in households’ decisionto invest in malaria prevention? American Economic Review Papers andProceedings 99 (2), 224–230.

Dupas, Pascaline, 2010. Short-run subsidies and long-run adoption of new healthproducts: evidence from a field experiment, mimeo, UCLA.

Dupas, Pascaline, Robinson, Jonathan, 2009. Savings Constraints and Microenter-prise Development: Experimental Evidence from Kenya. NBER WP #14693.

Eberts, Randall W., O’Leary, Christopher J., Wandner, Stephen A. (Eds.), 2002.Targeting Employment Services. W.E. Upjohn Institute for EmploymentResearch, Kalamazoo, MI.

Ettling, M., et al., 1994. Economic impact of malaria in malawian households.Tropical Medicine and Parasitology 45, 74–79.

Frolich, M., 2008. Statistical treatment choice: an application to active labourmarket programmes. Journal of the American Statistical Association 103 (482),547–558.

Geisser, S., 1975. The predictive sample reuse method with applications. Journal ofthe American Statistical Association 70, 320–328.

Hall, Peter, Wolff, Rodney C.L., Yao, Qiwei, 1999. Methods for estimating aconditional distribution function. Journal of theAmerican Statistical Association94 (445), 154–163.

Hansen, B., 2008. Uniform convergence rates for kernel estimation with dependentdata. Econometric Theory 24, 726–748.

Hirano, K., Porter, J., 2009. Asymptotics for statistical treatment rules. Econometrica77 (5), 1683–1701.

Horowitz, J., 1992. A smoothed maximum score estimator for the binary responsemodel. Econometrica 60 (3), 505–531.

Imbens, G., Ridder, G., 2009. Estimation and Inference for Generalized Full andPartial Means and Average Derivatives, working paper Harvard University.

Kenya Round 7 Proposal in response to the Global Fund to fight AIDS,Tuberculosis andMalaria 7th call for proposals. Available online at: http://www.theglobalfund.org/en/files/apply/call7/notapproved/7KENM_1527_0_full.pdf.

Lechner, M., Smith, J., 2007. What is the value added by case workers? LabourEconomics 14, 135–151.

Lengeler, Christian, 2004. Insecticide-treated bed nets and curtains for preventingmalaria. Cochrane Database Syst Rev 2004; 2:CD000363.

Li, Qi, Racine, Jeffrey Scott, 2007. Nonparametric Econometrics: Theory and Practice.Princeton University Press, Princeton, NJ.

Lucas, Adrienne, 2007. Economic effects of malaria eradication: evidence from themalarial periphery, Mimeo, Wellesley College.

Mahajan, A, Tarozzi, A., Yoong, J., Blackburn, B., 2009. Bednets, Information andMalaria in Orissa, mimeo.

Manski, C., 2004. Statistical treatment rules for heterogeneous populations.Econometrica 72 (4), 1221–1246.

Manski, C., 2005. Social Choice With Partial Knowledge of Treatment Response.Princeton University Press.

Newey, W.K., 1994. Kernel estimation of partial means and a general varianceestimator. Econometric Theory 10, 233–253.

(NM) Newey, W., McFadden, D., 1994. Large sample estimation and hypothesistesting. In: Handbook of Econometrics, vol. IV. Elsevier Science, B.V. Amsterdam,pp. 2113–2245.

Powell, J.L., Stock, J.H., Stoker, T.M., 1989. Semiparametric estimation of weightedaverage derivatives. Econometrica 57, 1403–1430.

Sachs, Jeffrey, 2005. The End of Poverty: Economic Possibilities for Our Time.Penguin Press, New York.

Stoye, J., 2009. Minimax regret treatment choice with finite samples. Journal ofEconometrics 151, 70–81.

Teklehaimanot, Awash, et al., 2007. Scaling upmalaria control in africa: an economicand epidemiological assessment. American Journal of Tropical Medicine andHygiene 77 (Suppl 6), 138–144.

Tetenov, A., 2012. Statistical treatment choice based on asymmetricminimax regretcriteria. Journal of Econometrics 166, 157–165.

Todd, Petra, Wolpin, Kenneth, 2006. Assessing the impact of a school subsidyprogram in mexico: using experimental data to validate a dynamic behavioralmodel of child schooling and fertility. American Economic Review 96 (5),1384–1417.

WHO, 2007. WHO Global Malaria Programme: Position Statement on ITNs.http://www.who.int/malaria/docs/itn/ITNspospaperfinal.pdf.