Article by Simar and Wilson (2007): algorithm for the two-stage DEA



Statistical Inference in Nonparametric Frontier Models: Recent Developments and Perspectives

Leopold Simar¹
Institut de Statistique, Université Catholique de Louvain
Louvain-la-Neuve, [email protected]

Paul W. Wilson
The John E. Walker Department of Economics, 222 Sirrine Hall
Clemson University, Clemson, South Carolina 29634, USA
[email protected]

¹Research support from the Inter-university Attraction Pole, Phase V (No. P5/24) sponsored by the Belgian Government (Belgian Science Policy) is gratefully acknowledged.


Contents

4 Statistical Inference in Nonparametric Frontier Models: Recent Developments and Perspectives
    4.1 Frontier Analysis and The Statistical Paradigm
        4.1.1 The Frontier Model: Economic Theory
        4.1.2 The Statistical Paradigm
    4.2 The Nonparametric Envelopment Estimators
        4.2.1 The Data Generating Process (DGP)
        4.2.2 The FDH Estimator
        4.2.3 The DEA Estimators
        4.2.4 An Alternative Probabilistic Formulation of the DGP
        4.2.5 Properties of FDH/DEA Estimators
    4.3 Bootstrapping DEA and FDH Efficiency Scores
        4.3.1 General principles
        4.3.2 Bootstrap confidence intervals
        4.3.3 Bootstrap bias corrections
        4.3.4 Bootstrapping DEA in action
        4.3.5 Practical Considerations for the Bootstrap
        4.3.6 Bootstrapping FDH efficiency scores
        4.3.7 Extensions of the bootstrap ideas
    4.4 Improving the FDH estimator
        4.4.1 Bias-corrected FDH
        4.4.2 Linearly interpolated FDH: LFDH
    4.5 Robust Nonparametric Frontier Estimators
        4.5.1 Order-m frontiers
        4.5.2 Order-α quantile frontiers
        4.5.3 Outlier Detection
    4.6 Explaining Efficiencies
        4.6.1 Two-stage regression approach
        4.6.2 Conditional Efficiency Measures
    4.7 Parametric Approximations of Nonparametric Frontier
        4.7.1 Parametric v. nonparametric models
        4.7.2 Formalization of the problem and two-stage procedure
    4.8 Open Issues and Conclusions
        4.8.1 Allowing for noise in DEA/FDH
        4.8.2 Inference with DEA/FDH estimators
        4.8.3 Results for CRS case
        4.8.4 Tools for dimension reduction
        4.8.5 Final conclusions


    Chapter 4

Statistical Inference in Nonparametric Frontier Models: Recent Developments and Perspectives

4.1 Frontier Analysis and The Statistical Paradigm

    4.1.1 The Frontier Model: Economic Theory

As discussed in Chapter 1, the economic theory underlying efficiency analysis dates to the work of Koopmans (1951), Debreu (1951), and Farrell (1957), who made the first attempt at empirical estimation of efficiencies for a set of observed production units.

In this section, the basic concepts and notation used in this chapter are introduced. The production process is constrained by the production set Ψ, which is the set of physically attainable points (x, y); i.e.,

\Psi = \{(x, y) \in \mathbb{R}_+^{N+M} \mid x \text{ can produce } y\},   (4.1)

where x ∈ ℝ₊^N is the input vector and y ∈ ℝ₊^M is the output vector.

For purposes of efficiency measurement, the upper boundary of Ψ is of interest. The efficient boundary (frontier) of Ψ is the locus of optimal production plans (e.g., minimal achievable input level for a given output, or maximal achievable output given the level of the inputs). The boundary of Ψ,

\partial\Psi = \{(x, y) \in \Psi \mid (\gamma x, \gamma^{-1} y) \notin \Psi \ \forall\ 0 < \gamma < 1\},   (4.2)

is sometimes referred to as the technology or the production frontier, and is given by the intersection of Ψ and the closure of its complement. Firms that are technically inefficient operate at points in the interior of Ψ, while those that are technically efficient operate somewhere along the technology defined by ∂Ψ.

It is often useful to describe the production set Ψ by its sections. For instance, the input requirement set is defined for all y ∈ ℝ₊^M by

X(y) = \{x \in \mathbb{R}_+^N \mid (x, y) \in \Psi\}.   (4.3)

The (input-oriented) efficiency boundary ∂X(y) is defined for a given y ∈ ℝ₊^M by

\partial X(y) = \{x \mid x \in X(y),\ \theta x \notin X(y) \ \forall\ 0 < \theta < 1\},   (4.4)

and the Debreu-Farrell input measure of efficiency for a production unit located at (x, y) ∈ ℝ₊^{N+M} is

\theta(x, y) = \inf\{\theta \mid (\theta x, y) \in \Psi\} = \inf\{\theta \mid \theta x \in X(y)\}.   (4.5)

Here θ(x, y) is the proportionate, feasible reduction of inputs (holding output levels fixed) that would achieve technical efficiency; by construction θ(x, y) ≤ 1 for all (x, y) ∈ Ψ, with θ(x, y) = 1 if and only if (x, y) is technically efficient. The efficient level of input, for the output level y and the input direction determined by x, is given by

x^{\partial}(y) = \theta(x, y)\, x.   (4.6)

Similarly, the set of outputs producible from a given input vector x ∈ ℝ₊^N is

Y(x) = \{y \in \mathbb{R}_+^M \mid (x, y) \in \Psi\}.   (4.7)


Then the (output-oriented) efficiency boundary ∂Y(x) is defined for a given x ∈ ℝ₊^N as

\partial Y(x) = \{y \mid y \in Y(x),\ \lambda y \notin Y(x) \ \forall\ \lambda > 1\},   (4.8)

and the Debreu-Farrell output measure of efficiency for a production unit located at (x, y) ∈ ℝ₊^{N+M} is

\lambda(x, y) = \sup\{\lambda \mid (x, \lambda y) \in \Psi\}.   (4.9)

Analogous to the input-oriented case described above, λ(x, y) is the proportionate, feasible increase in outputs for a unit located at (x, y) that would achieve technical efficiency. By construction, λ(x, y) ≥ 1 for all (x, y) ∈ Ψ, and (x, y) is technically efficient if and only if λ(x, y) = 1. The output efficiency measure λ(x, y) is the reciprocal of the Shephard (1970) output distance function. The efficient level of output, for the input level x and for the direction of the output vector determined by y, is given by

y^{\partial}(x) = \lambda(x, y)\, y.   (4.10)

Thus the efficient boundary of Ψ can be described in either of two ways, either in terms of the input direction or in terms of the output direction. In other words, the unit at (x, y) in the interior of Ψ can achieve technical efficiency either (i) by moving from (x, y) to (x^∂(y), y), or (ii) by moving from (x, y) to (x, y^∂(x)). Note, however, that there is only one well-defined efficient boundary of Ψ.

A variety of assumptions on Ψ are found in the literature (e.g., free disposability, convexity, etc.; see Shephard (1970) for examples). The assumptions about Ψ determine the appropriate estimator that should be used to estimate Ψ, θ(x, y), or λ(x, y). This issue will be discussed below in detail.

    4.1.2 The Statistical Paradigm

In any interesting application, the attainable set Ψ as well as X(y), ∂X(y), Y(x), and ∂Y(x) are unknown. Consequently, the efficiency scores θ(x, y) and λ(x, y) of a particular unit operating at input, output levels (x, y) are also unknown.

    Typically, the only information available to the analyst is a sample

\mathcal{X}_n = \{(x_i, y_i),\ i = 1, \ldots, n\}   (4.11)


of observations on input and output levels for a set of production units engaged in the activity of interest.¹ The statistical paradigm poses the following question that must be answered: what can be learned by observing 𝒳ₙ? In other words, how can the information in 𝒳ₙ be used to estimate θ(x, y) and λ(x, y), or Ψ and hence X(y), ∂X(y), Y(x), and ∂Y(x)?

Answering these questions involves much more than writing a linear program and throwing the data in 𝒳ₙ into a computer to compute a solution to the linear program. Indeed, one could ask, what is learned from an estimate of θ(x, y) or λ(x, y) (i.e., numbers computed from 𝒳ₙ by solving a linear program)? The answer is clear: almost nothing. One might learn, for example, that unit A uses smaller input quantities while producing greater output quantities than unit B, but little else can be learned from estimates of θ(x, y) and λ(x, y) alone.

Before anything can be learned about θ(x, y) or λ(x, y), or by extension about Ψ and its various characterizations, one must use methods of statistical analysis to understand the properties of whatever estimators have been used to obtain estimates of the things of interest.² This raises the following questions:

• is the estimator consistent?

• is the estimator biased?

• if the estimator is biased, does the bias disappear as the sample size tends toward infinity?

• if the estimator is biased, can the bias be corrected, and at what cost?

• can confidence intervals for the values of interest be estimated?

• can interesting hypotheses about the production process be tested, and if so, how?

¹In order to simplify notation, a random sample is denoted by the lower-case letters (x_i, y_i) and not, as in standard textbooks, by the upper-case (X_i, Y_i). The context will determine whether the (x_i, y_i) are random variables or their realized, observed values, which are real numbers.

²Note that an estimator is a random variable, while an estimate is a realization of an estimator (random variable). An estimator can take perhaps infinitely many values with different probabilities, while an estimate is merely a known, non-random value.


    Notions of statistical consistency, etc. are discussed below in Section 4.2.5.

Before these questions can be answered, indeed before it can be known what is estimated, a statistical model must be defined. Statistical models consist of two parts: (i) a probability model, which in the present case includes assumptions on the production set Ψ and the distribution of input, output vectors (x, y) over Ψ; and (ii) a sampling process. The statistical model describes the process that yields the data in the sample 𝒳ₙ, and is sometimes called the data-generating process (DGP).

In cases where a group of productive units are observed at the same point in time, i.e., where cross-sectional data are observed, it is convenient and often reasonable to assume the sampling process involves independent draws from the probability distribution defined in the DGP's probability model. With regard to the probability model, one must attempt reasonable assumptions. Of course, there are trade-offs here; the assumptions on the probability model must be strong enough to permit estimation using estimators that have desirable properties, and to allow those properties to be deduced, yet not so strong as to impose conditions on the DGP that do not reflect reality. The goal should be, in all cases, to make flexible, minimal assumptions in order to let the data reveal as much as possible about the underlying DGP, rather than making strong, untested assumptions that have potential to influence results of estimation and inference in perhaps large and misleading ways. The assumptions defining the statistical model are of crucial importance, since any inference that might be made will typically be valid only if the assumptions are in fact true.

The above considerations apply equally to the parametric approach described in Chapter 2 as well as to the nonparametric approaches to estimation discussed in this chapter. It is useful to imagine a spectrum of estimation approaches, ranging from fully parametric (most restrictive) to fully nonparametric (least restrictive). Fully parametric estimation strategies necessarily involve stronger assumptions on the probability model, which is completely specified in terms of a specific probability distribution function, structural equations, etc. Semi-parametric strategies are less restrictive; in these approaches, some (but not all) features of the probability model are left unspecified (for example, in a regression setting one might specify parametric forms for some, but not all, of the moments of a distribution function in the probability model). Fully nonparametric approaches assume no parametric forms for any features of the probability model. Instead, only (relatively) mild assumptions on broad features of the probability distribution are


made, usually involving assumptions of various types of continuity, degrees of smoothness, etc.

With nonparametric approaches to efficiency estimation, no specific analytical function describing the frontier is assumed. In addition, (too) restrictive assumptions on the stochastic part of the model, describing the probabilistic behavior of the observations in the sample with respect to the efficient boundary of Ψ, are also avoided.

The most general nonparametric approach would assume that observations (x_i, y_i) on input, output vectors are drawn randomly, independently from a population of firms whose input, output vectors are distributed on the attainable set Ψ according to some unknown probability law described by a probability density function f(x, y) or the corresponding distribution function F(x, y) = Prob(X ≤ x, Y ≤ y). This chapter focuses on deterministic frontier models where all observations are assumed to be technically attainable.

    Formally,

\text{Prob}((x_i, y_i) \in \Psi) = 1.   (4.12)

The most popular nonparametric estimators are based on the idea of

estimating the attainable set Ψ by the smallest set within some class of sets that envelop the observed data. Depending on assumptions made on Ψ, this idea leads to the Free Disposal Hull (FDH) estimator of Deprins et al. (1984), which relies only on an assumption of free disposability, and the Data Envelopment Analysis (DEA) estimators, which incorporate additional assumptions. Farrell (1957) was the first to use a DEA estimator in an empirical application, but the idea remained obscure until it was popularized by Charnes et al. (1978) and Banker et al. (1984). Charnes et al. (1978) estimated Ψ by the convex cone of the FDH estimator of Ψ, thus imposing constant returns to scale, while Banker et al. (1984) used the convex hull of the FDH estimator of Ψ, thereby allowing for variable returns to scale.

Among deterministic frontier models, the primary advantage of nonparametric models and estimators lies in their great flexibility (as opposed to parametric, deterministic frontier models). In addition, the nonparametric estimators are easy to compute, and today most of their statistical properties are well-established. As will be discussed below, inference is available using bootstrap methods.

The main drawbacks of deterministic frontier models, both nonparametric and parametric, are that they are very sensitive to


outliers and extreme values, and that noisy data are not allowed. As discussed later in this chapter (see Section 4.8.1), a fully nonparametric approach when noise is introduced into the DGP leads to identification problems. Some alternative approaches will be described later.

It should be noted that allowing for noise in frontier models presents difficult problems, even in a fully parametric framework where one can rely on the assumed parametric structure. In fully parametric models where the DGP involves a one-sided error process reflecting inefficiency and a two-sided error process reflecting statistical noise, numerical identification of the statistical model's features is sometimes highly problematic even with large (but finite) samples; see Ritter and Simar (1997) for examples.

Apart from the issue of numerical identification, fully parametric, stochastic frontier models present other difficulties. Efficiency estimates in these models are based on residual terms that are unidentified. Researchers instead base efficiency estimates on an expectation, conditional on a composite residual; estimating an expected inefficiency is rather different from estimating actual inefficiency. An additional problem arises from the fact that, even if the fully parametric, stochastic frontier model is correctly specified, there is typically a non-trivial probability of drawing samples with the "wrong" skewness (e.g., when estimating cost functions, one would expect composite residuals with right-skewness, but it is certainly possible to draw finite samples with left-skewness; the probability of doing so depends on the sample size and the mean of the composite errors). Since there are apparently no published studies, and also apparently no working papers in circulation, where researchers report composite residuals with the "wrong" skewness when fully parametric, stochastic frontier models are estimated, it appears that estimates are sometimes, perhaps often, conditioned (i) on drawing observations until the desired skewness is obtained or (ii) on model specifications that result in the desired skewness. This raises formidable questions for inference.

The remainder of this chapter is organized as follows. Section 4.2 presents in a unified notation the basic assumptions needed to define the DGP and shows how the nonparametric estimators (FDH and DEA) can be described easily in this framework. Section 4.2.4 is particularly appealing: it shows how the Debreu-Farrell concepts of efficiency can be formalized in an intuitive probabilistic framework. All the available statistical properties of FDH/DEA estimators are then summarized, and the basic ideas for performing consistent inference using bootstrap methods are described. Section 4.3 discusses bootstrap methods for inference based on DEA and FDH estimates.


Section 4.3.7 illustrates how the bootstrap can be used to solve relevant testing issues: comparison of groups of firms, testing returns to scale, and testing restrictions (specification tests).

Section 4.4 discusses two ways FDH estimators can be improved, using bias-corrections and interpolation. As noted above, the envelopment estimators are very sensitive to outliers and extreme values; Section 4.5 proposes a way of defining robust nonparametric estimators of the frontier, based on a concept of partial frontiers (order-m frontiers or order-α quantile frontiers). These robust estimators are particularly easy to compute, and are useful for detecting outliers.

An important issue in efficiency analysis is the explanation of the observed inefficiency. Often researchers seek to explain (in)efficiency in terms of some environmental factors. Section 4.6 surveys the most recent techniques allowing investigation of the effects of these external factors on efficiency. Parametric and nonparametric methods have often been viewed as presenting paradoxes in the literature. In Section 4.7, the two approaches are reconciled with each other, and a nonparametric method is shown to be particularly useful even if, in the end, a parametric model is desired. This mixed "semi-parametric" approach seems to outperform the usual parametric approaches based on regression ideas. Section 4.8 concludes with a discussion of still-important, open issues and questions for future research.

4.2 The Nonparametric Envelopment Estimators

    4.2.1 The Data Generating Process (DGP)

The assumptions listed below are adapted from Kneip et al. (1998), Park et al. (2000) and Kneip et al. (2003). These assumptions define a statistical model (i.e., a DGP), are very flexible, and seem quite reasonable in many practical situations. The first assumption reflects the deterministic frontier model defined in (4.12). In addition, for the sake of simplicity, the standard independence hypothesis is assumed, meaning that the observed firms are considered as being drawn randomly and independently from a population of firms.

Assumption 4.2.1. The sample observations (x_i, y_i) in 𝒳ₙ are realizations


of identically, independently distributed (iid) random variables (X, Y) with probability density function f(x, y), which has support over Ψ ⊂ ℝ₊^{N+M}, the production set as defined in (4.1); i.e., Prob((X, Y) ∈ Ψ) = 1.

The next assumption is a regularity condition sufficient for proving the consistency of all the nonparametric estimators described in this chapter. It says that the probability of observing firms in any open neighborhood of the frontier is strictly positive; this is quite a reasonable property, since microeconomic theory indicates that with competitive input and output markets, firms which are inefficient will, in the long run, be driven from the market.

Assumption 4.2.2. The density f(x, y) is strictly positive on the boundary ∂Ψ of the production set and is continuous in any direction toward the interior of Ψ.

The next assumptions regarding Ψ are standard in the microeconomic theory of the firm; see, for example, Shephard (1970) and Färe (1988).

Assumption 4.2.3. All production requires use of some inputs: (x, y) ∉ Ψ if x = 0 and y ≥ 0, y ≠ 0.

Assumption 4.2.4. Both inputs and outputs are freely disposable: if (x, y) ∈ Ψ, then for any (x′, y′) such that x′ ≥ x and y′ ≤ y, (x′, y′) ∈ Ψ.

Assumption 4.2.3 means that there are no "free lunches." The disposability assumption in Assumption 4.2.4 is sometimes called strong disposability and is equivalent to an assumption of monotonicity of the technology. This property also characterizes the technical possibility of wasting resources (i.e., the possibility of producing less with more resources).

Assumption 4.2.5. Ψ is convex: if (x₁, y₁), (x₂, y₂) ∈ Ψ, then (x̄, ȳ) ∈ Ψ for (x̄, ȳ) = γ(x₁, y₁) + (1 − γ)(x₂, y₂), for all γ ∈ [0, 1].

This convexity assumption may be questionable in many situations. Several recent studies focus on the convexity assumption in frontier models (e.g., Bogetoft, 1996; Bogetoft et al., 2000; Briec et al., 2004). Assumption 4.2.5 will be relaxed at various points below.

Assumption 4.2.6. Ψ is closed.


Closedness of the attainable set Ψ is a technical condition, avoiding mathematical problems for infinite production plans.

Finally, in order to prove consistency of the estimators, the production frontier must be sufficiently smooth.

Assumption 4.2.7. For all (x, y) in the interior of Ψ, the functions θ(x, y) and λ(x, y) are differentiable in both their arguments.

The characterization of smoothness in Assumption 4.2.7 is stronger than required for the consistency of the nonparametric estimators. Kneip et al. (1998) require only Lipschitz continuity of the efficiency scores, which is implied by the simpler, but stronger, requirement presented here. However, derivation of limiting distributions of the nonparametric estimators has been obtained only with the stronger assumption made here.

    4.2.2 The FDH Estimator

The FDH estimator was first proposed by Deprins et al. (1984) and has been discussed in Chapter 1. It relies only on the free disposability assumption in Assumption 4.2.4, and does not require Assumption 4.2.5. The Deprins et al. estimator Ψ̂_FDH of the attainable set Ψ is simply the free disposal hull of the observed sample 𝒳ₙ, given by

\widehat{\Psi}_{FDH} = \{(x, y) \in \mathbb{R}_+^{N+M} \mid y \le y_i,\ x \ge x_i,\ (x_i, y_i) \in \mathcal{X}_n\}
                    = \bigcup_{(x_i, y_i) \in \mathcal{X}_n} \{(x, y) \in \mathbb{R}_+^{N+M} \mid y \le y_i,\ x \ge x_i\},   (4.13)

and is the union of all the southeast (SE) orthants with vertices (x_i, y_i). Figure 4.1 illustrates the idea when N = M = 1.

A nonparametric estimator of the input efficiency for a given point (x, y) is obtained by replacing the true production set Ψ in the definition of θ(x, y) given by (4.5) with the estimator Ψ̂_FDH, yielding

\widehat{\theta}_{FDH}(x, y) = \inf\{\theta \mid (\theta x, y) \in \widehat{\Psi}_{FDH}\}.   (4.14)

Estimates can be computed in two steps: first, identify the set D(x, y) of observed points dominating (x, y):

D(x, y) = \{i \mid (x_i, y_i) \in \mathcal{X}_n,\ x_i \le x,\ y_i \ge y\};


Figure 4.1: FDH estimator Ψ̂_FDH of the production set (one input x, one output y; the free-disposal hull spanned by the observed points (x_i, y_i)).

Then θ̂_FDH(x, y) is computed simply by evaluating

\widehat{\theta}_{FDH}(x, y) = \min_{i \in D(x, y)} \ \max_{j = 1, \ldots, N} \ \frac{x_i^j}{x^j},   (4.15)

where for a vector a, a^j denotes the jth element of a. The estimator in (4.15) can be computed quickly and easily, since it involves only simple sorting algorithms, as the sketch below illustrates.
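To make the two-step computation in (4.15) concrete, here is a minimal Python sketch; it is an added illustration, and the function name fdh_input_score and the artificial data are assumptions of the sketch, not part of the chapter. The analogous output-oriented computation uses (4.18) below.

    import numpy as np

    def fdh_input_score(x0, y0, X, Y):
        """Input-oriented FDH efficiency score, following (4.15).

        x0: (N,) inputs and y0: (M,) outputs of the evaluated unit;
        X: (n, N) observed inputs; Y: (n, M) observed outputs.
        """
        # Step 1: find D(x0, y0), the observed units dominating (x0, y0).
        dominating = np.all(X <= x0, axis=1) & np.all(Y >= y0, axis=1)
        if not dominating.any():
            return np.inf  # (x0, y0) lies outside the FDH hull of the sample
        # Step 2: min over dominating units of max_j x_i^j / x0^j.
        return np.min(np.max(X[dominating] / x0, axis=1))

    # Tiny illustration with artificial data (two inputs, one output).
    rng = np.random.default_rng(0)
    X = rng.uniform(1.0, 10.0, size=(100, 2))
    Y = (X.min(axis=1) * rng.uniform(0.5, 1.0, size=100)).reshape(-1, 1)
    print(fdh_input_score(X[0], Y[0], X, Y))  # <= 1: the unit dominates itself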

An estimate of the efficient level of inputs, for a given output level y and a given input direction determined by the input vector x, is obtained by

\widehat{x}^{\,\partial}(y) = \widehat{\theta}_{FDH}(x, y)\, x.   (4.16)

By construction, Ψ̂_FDH ⊆ Ψ, and so x̂^∂(y) is an upward-biased estimator


of x^∂(y). Therefore, for the efficiency scores, θ̂_FDH(x, y) is an upward-biased estimator of θ(x, y); i.e., θ̂_FDH(x, y) ≥ θ(x, y).

Things work similarly in the output orientation. The FDH estimator of λ(x, y) is defined by

\widehat{\lambda}_{FDH}(x, y) = \sup\{\lambda \mid (x, \lambda y) \in \widehat{\Psi}_{FDH}\}.   (4.17)

This is computed quickly and easily by evaluating

\widehat{\lambda}_{FDH}(x, y) = \max_{i \in D(x, y)} \ \min_{j = 1, \ldots, M} \ \frac{y_i^j}{y^j}.   (4.18)

Efficient output levels, for given input levels x and an output mix (direction) described by the vector y, are estimated by

\widehat{y}^{\,\partial}(x) = \widehat{\lambda}_{FDH}(x, y)\, y.   (4.19)

By construction, λ̂_FDH(x, y) is a downward-biased estimator of λ(x, y); for all (x, y) ∈ Ψ, λ̂_FDH(x, y) ≤ λ(x, y).

4.2.3 The DEA Estimators

Although DEA estimators were first used by Farrell (1957) to measure technical efficiency for a set of observed firms, the idea did not gain wide acceptance until the paper by Charnes et al. (1978) appeared 21 years later. Charnes et al. used the convex cone (rather than the convex hull) of Ψ̂_FDH to estimate Ψ, which would be appropriate if returns to scale are everywhere constant. Later, Banker et al. (1984) used the convex hull of Ψ̂_FDH to estimate Ψ, thus allowing variable returns to scale. Here, "DEA" refers to both of these approaches, as well as others that involve using linear programs to define a convex set enveloping the FDH estimator Ψ̂_FDH.

The most general DEA estimator of the attainable set is simply the convex hull of the FDH estimator, i.e.,

\widehat{\Psi}_{VRS} = \Big\{(x, y) \in \mathbb{R}^{N+M} \ \Big|\ y \le \sum_{i=1}^{n} \gamma_i y_i;\ x \ge \sum_{i=1}^{n} \gamma_i x_i \text{ for } (\gamma_1, \ldots, \gamma_n) \text{ such that } \sum_{i=1}^{n} \gamma_i = 1;\ \gamma_i \ge 0,\ i = 1, \ldots, n\Big\}.   (4.20)


Alternatively, the conical hull of the FDH estimator, used by Charnes et al. (1978), is obtained by dropping the constraint in (4.20) requiring the γ's to sum to one:

\widehat{\Psi}_{CRS} = \Big\{(x, y) \in \mathbb{R}^{N+M} \ \Big|\ y \le \sum_{i=1}^{n} \gamma_i y_i;\ x \ge \sum_{i=1}^{n} \gamma_i x_i \text{ for } (\gamma_1, \ldots, \gamma_n) \text{ such that } \gamma_i \ge 0,\ i = 1, \ldots, n\Big\}.   (4.21)

Other estimators can be defined by modifying the constraint on the sum of the γ's in (4.20). For example, the estimator

\widehat{\Psi}_{NIRS} = \Big\{(x, y) \in \mathbb{R}^{N+M} \ \Big|\ y \le \sum_{i=1}^{n} \gamma_i y_i;\ x \ge \sum_{i=1}^{n} \gamma_i x_i \text{ for } (\gamma_1, \ldots, \gamma_n) \text{ such that } \sum_{i=1}^{n} \gamma_i \le 1;\ \gamma_i \ge 0,\ i = 1, \ldots, n\Big\}   (4.22)

incorporates an assumption of non-increasing returns to scale. In other words, returns to scale along the boundary of Ψ̂_NIRS are either constant or decreasing, but not increasing. By contrast, returns to scale along the boundary of Ψ̂_VRS are either increasing, constant, or decreasing, while returns to scale along the boundary of Ψ̂_CRS are constant everywhere.

Figure 4.2 illustrates the DEA estimator Ψ̂_VRS for the case of one input and one output (N = M = 1).

As with the FDH estimators, DEA estimators of the efficiency scores θ(x, y) and λ(x, y) defined in (4.5) and (4.9) can be obtained by replacing the true, but unknown, production set Ψ with one of the estimators Ψ̂_VRS, Ψ̂_CRS, or Ψ̂_NIRS. For example, in the input orientation, with varying returns to scale, using Ψ̂_VRS to replace Ψ in (4.5) leads to the estimator

\widehat{\theta}_{VRS}(x, y) = \inf\{\theta \mid (\theta x, y) \in \widehat{\Psi}_{VRS}\}.   (4.23)

As a practical matter, θ̂_VRS(x, y) can be computed by solving the linear program

\widehat{\theta}_{VRS}(x, y) = \min\Big\{\theta > 0 \ \Big|\ y \le \sum_{i=1}^{n} \gamma_i y_i;\ \theta x \ge \sum_{i=1}^{n} \gamma_i x_i;\ \sum_{i=1}^{n} \gamma_i = 1;\ \gamma_i \ge 0,\ i = 1, \ldots, n\Big\}.   (4.24)


Figure 4.2: DEA estimator Ψ̂_VRS of the production set (one input x, one output y; the plot marks an observed point P = (x_i, y_i) and points Q and R relative to the DEA frontier).

Today, a number of algorithms exist to solve linear programs such as the one in (4.24), and so in principle, solutions are obtained easily. However, the computational burden represented by (4.24) is typically greater than that posed by the FDH estimators.

These ideas extend naturally to the output orientation. For example, replacing Ψ with Ψ̂_VRS in (4.9) yields

\widehat{\lambda}_{VRS}(x, y) = \sup\{\lambda \mid (x, \lambda y) \in \widehat{\Psi}_{VRS}\},   (4.25)


which can be computed by solving the linear program

\widehat{\lambda}_{VRS}(x, y) = \sup\Big\{\lambda \ \Big|\ \lambda y \le \sum_{i=1}^{n} \gamma_i y_i;\ x \ge \sum_{i=1}^{n} \gamma_i x_i;\ \sum_{i=1}^{n} \gamma_i = 1;\ \gamma_i \ge 0,\ i = 1, \ldots, n\Big\}.   (4.26)

In the input orientation, the technically efficient level of inputs, for a given level of outputs y, is estimated by (θ̂_VRS(x, y) x, y). Similarly, in the output orientation, the technically efficient level of outputs for a given level of inputs x is estimated by (x, λ̂_VRS(x, y) y). A sketch of how these linear programs can be set up in practice is given below.
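As an illustration of how the linear program (4.24) can be assembled, here is a minimal sketch using scipy.optimize.linprog; the function name dea_input_score and the artificial data are assumptions of the sketch, not part of the chapter.

    import numpy as np
    from scipy.optimize import linprog

    def dea_input_score(x0, y0, X, Y):
        """Input-oriented VRS-DEA score via the linear program (4.24).

        Decision vector z = (theta, gamma_1, ..., gamma_n).
        """
        n, N = X.shape
        M = Y.shape[1]
        c = np.r_[1.0, np.zeros(n)]                    # minimize theta
        A_out = np.hstack([np.zeros((M, 1)), -Y.T])    # y0 <= sum_i gamma_i y_i
        A_in = np.hstack([-x0.reshape(-1, 1), X.T])    # sum_i gamma_i x_i <= theta x0
        A_ub = np.vstack([A_out, A_in])
        b_ub = np.r_[-y0, np.zeros(N)]
        A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1)   # sum_i gamma_i = 1 (VRS)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (n + 1))
        return res.fun if res.success else np.nan

    # Dropping the equality constraint gives the CRS version (4.21); replacing
    # it with sum(gamma) <= 1 gives NIRS (4.22); the output-oriented program
    # (4.26) maximizes lambda analogously.
    rng = np.random.default_rng(1)
    X = rng.uniform(1.0, 10.0, size=(50, 2))
    Y = np.sqrt(X.prod(axis=1, keepdims=True)) * rng.uniform(0.5, 1.0, (50, 1))
    print(dea_input_score(X[0], Y[0], X, Y))  # value in (0, 1]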

All of the FDH and DEA estimators are biased by construction, since Ψ̂_FDH ⊆ Ψ̂_VRS ⊆ Ψ. Moreover, Ψ̂_VRS ⊆ Ψ̂_NIRS ⊆ Ψ̂_CRS. If the technology exhibits constant returns to scale everywhere, then Ψ̂_CRS ⊆ Ψ; otherwise, Ψ̂_CRS will not be a statistically consistent estimator of Ψ (of course, if Ψ is not convex, then Ψ̂_VRS will also be inconsistent). These relations further imply that θ̂_FDH(x, y) ≥ θ̂_VRS(x, y) ≥ θ(x, y) and θ̂_VRS(x, y) ≥ θ̂_NIRS(x, y) ≥ θ̂_CRS(x, y). Alternatively, in the output orientation, λ̂_FDH(x, y) ≤ λ̂_VRS(x, y) ≤ λ(x, y) and λ̂_VRS(x, y) ≤ λ̂_NIRS(x, y) ≤ λ̂_CRS(x, y).

4.2.4 An Alternative Probabilistic Formulation of the DGP

The presentation of the DGP in Section 4.2.1 is traditional. However, it is also possible to present the DGP in a way that allows a probabilistic interpretation of the Debreu-Farrell efficiency scores, providing a new way of describing the nonparametric FDH estimators. This new formulation will be useful for introducing extensions of the FDH and DEA estimators described above. The presentation here follows that of Daraio and Simar (2005a), who extend the ideas of Cazals et al. (2002).

The stochastic part of the DGP introduced in Section 4.2.1 through the probability density function f(x, y) (or the corresponding distribution function F(x, y)) is completely characterized by the following probability function:

H_{XY}(x, y) = \text{Prob}(X \le x,\ Y \ge y).   (4.27)


Note that this is not a standard distribution function, since the cumulative form is used for the inputs x and the survival form is used for the outputs y. The function has a nice interpretation and interesting properties:

• H_XY(x, y) gives the probability that a unit operating at input, output levels (x, y) is dominated, i.e., that another unit produces at least as much output while using no more of any input than the unit operating at (x, y).

• H_XY(x, y) is monotone, non-decreasing in x and monotone, non-increasing in y.

• The support of H_XY(·, ·) is the attainable set Ψ; i.e.,

H_{XY}(x, y) = 0 \quad \forall\, (x, y) \notin \Psi.   (4.28)

The joint probability H_XY(x, y) can be decomposed using Bayes' rule by writing

H_{XY}(x, y) = \underbrace{\text{Prob}(X \le x \mid Y \ge y)}_{=\,F_{X|Y}(x|y)}\ \underbrace{\text{Prob}(Y \ge y)}_{=\,S_Y(y)}   (4.29)
             = \underbrace{\text{Prob}(Y \ge y \mid X \le x)}_{=\,S_{Y|X}(y|x)}\ \underbrace{\text{Prob}(X \le x)}_{=\,F_X(x)},   (4.30)

where S_Y(y) = Prob(Y ≥ y) denotes the survivor function of Y, S_{Y|X}(y|x) = Prob(Y ≥ y | X ≤ x) denotes the conditional survivor function of Y, and the conditional distribution and survivor functions are assumed to exist whenever used (i.e., when needed, S_Y(y) > 0 and F_X(x) > 0). Since the support of the joint distribution is the attainable set Ψ, boundaries of Ψ can be defined in terms of the conditional distributions defined in (4.29) and (4.30). This allows definition of some new concepts of efficiency.

For the input-oriented case, assuming S_Y(y) > 0, define

\theta(x, y) = \inf\{\theta \mid F_{X|Y}(\theta x \mid y) > 0\} = \inf\{\theta \mid H_{XY}(\theta x, y) > 0\}.   (4.31)

Similarly, for the output-oriented case, assuming F_X(x) > 0, define

\lambda(x, y) = \sup\{\lambda \mid S_{Y|X}(\lambda y \mid x) > 0\} = \sup\{\lambda \mid H_{XY}(x, \lambda y) > 0\}.   (4.32)


The input efficiency score θ(x, y) may be interpreted as the proportionate reduction of inputs (holding output levels fixed) required for a unit operating at (x, y) to achieve zero probability of being dominated. Analogously, the output efficiency score λ(x, y) gives the proportionate increase in outputs required for the same unit to have zero probability of being dominated, holding input levels fixed. Note that in a multivariate framework, the radial nature of the Debreu-Farrell measures is preserved.

From the properties of the distribution function H_XY(x, y), it is clear that the new efficiency scores defined in (4.31) and (4.32) have some interesting (and reasonable) properties. In particular,

• θ(x, y) is monotone, non-increasing in x and monotone, non-decreasing in y;

• λ(x, y) is monotone, non-decreasing in x and monotone, non-increasing in y.

Most importantly, if Ψ satisfies free disposability (an assumption that will be maintained throughout this chapter), it is trivial to show that the score defined in (4.31) coincides with the Debreu-Farrell input measure defined in (4.5), and the score defined in (4.32) coincides with the output measure defined in (4.9). Therefore, under the assumption of free disposability of inputs and outputs, the probabilistic formulation presented here leads to a new representation of the traditional Debreu-Farrell efficiency scores.

    For a given y, the efficient frontier of can be characterized as notedabove byx(y) defined in (4.6). Ifx is univariate, x(y) determines a frontierfunction (y) as a function ofy , where

    (y) = inf{x|FX(x|y)> 0} y RM+. (4.33)

    (this was called a factor requirements set in Chapter 1). This function isillustrated in Figure 4.3. The intersection of the horizontal line at output

    level y0 and the curve representing (y) gives the minimum input level (y0)than can produce output level y0. Similarly, working in the output direction,a production function is obtained ify is univariate.

Nonparametric estimators of the efficiency scores θ(x, y) and λ(x, y) (and hence also of x^∂(y) and y^∂(x)) can be obtained using the same plug-in approach that was used to define the DEA estimators. In the present case, this


Figure 4.3: The frontier function φ(y) for univariate input x; the horizontal line at output level y₀ crosses φ(y) at the minimal input level φ(y₀) that can produce y₀.

involves replacing H_XY(x, y) in (4.31) and (4.32) by its empirical analogue

\widehat{H}_{XY,n}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \le x,\ y_i \ge y),   (4.34)

where 𝟙(·) denotes the indicator function. For the input-oriented case this yields

\widehat{\theta}_n(x, y) = \inf\{\theta \mid \widehat{H}_{XY,n}(\theta x, y) > 0\},   (4.35)
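The following minimal numpy sketch (the names empirical_H and theta_hat are illustrative assumptions, not from the chapter) evaluates the empirical analogue of H_XY and the plug-in score; under free disposability the plug-in score reproduces the FDH estimator.

    import numpy as np

    def empirical_H(x, y, X, Y):
        """Empirical analogue of H_XY: share of units with x_i <= x and y_i >= y."""
        return np.mean(np.all(X <= x, axis=1) & np.all(Y >= y, axis=1))

    def theta_hat(x0, y0, X, Y):
        """Plug-in input score: smallest theta with empirical_H(theta*x0, y0) > 0.

        For each unit i with y_i >= y0, the smallest theta putting x_i below
        theta*x0 is max_j x_i^j / x0^j; minimizing over such units gives the
        infimum, which coincides with the FDH formula (4.15) whenever some
        unit dominates (x0, y0).
        """
        candidates = np.all(Y >= y0, axis=1)
        if not candidates.any():
            return np.inf
        return np.min(np.max(X[candidates] / x0, axis=1))

    rng = np.random.default_rng(2)
    X = rng.uniform(1.0, 10.0, size=(200, 1))
    Y = np.sqrt(X) * rng.uniform(0.5, 1.0, size=(200, 1))
    print(empirical_H(X[0], Y[0], X, Y))  # > 0: the unit dominates itself
    print(theta_hat(X[0], Y[0], X, Y))    # <= 1, the FDH input score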


4.2.5 Properties of FDH/DEA Estimators

Point estimates by themselves reveal little about the underlying quantity that one wishes to estimate. An understanding of the properties of an estimator is necessary in order to make inference. Earlier, the FDH and DEA estimators were shown to be biased. Here, we consider whether, and to what extent, these estimators can reveal useful information about efficiency, and under what conditions. Simar and Wilson (2000a) surveyed the statistical properties of FDH and DEA estimators, but more recent results have been obtained.

    Stochastic convergence and rates of convergence

Perhaps the most fundamental property that an estimator should possess is that of consistency. Loosely speaking, if an estimator θ̂ₙ of an unknown parameter θ is consistent, then the estimator converges (in some sense) to θ as the sample size n increases toward infinity (a subscript n is often attached to notation for estimators to remind the reader that one can think of an infinite sequence of estimators, each based on a different sample size). In other words, if an estimator is consistent, then more data should be helpful; this is quite a sensible property. If an estimator is inconsistent, then in general even an infinite amount of data would offer no particular hope or guarantee of getting close to the true value that one wishes to estimate. It is in this sense that consistency is the most fundamental property that an estimator might have; if an estimator is not consistent, there is no reason to consider what other properties the estimator might have, nor is there typically any reason to use such an estimator.

To be more precise, first consider the notion of convergence in probability, denoted θ̂ₙ →p θ. Convergence in probability occurs whenever

\lim_{n \to \infty} \text{Prob}(|\widehat{\theta}_n - \theta| < \varepsilon) = 1 \quad \text{for any } \varepsilon > 0.

An estimator that converges in probability (to the quantity of interest) is said to be weakly consistent; other types of consistency can also be defined (e.g., see Serfling, 1980). Convergence in probability means that, for any arbitrarily small (but strictly positive) ε, the probability of obtaining an estimate different from θ by more than ε in either direction tends to 0 as n → ∞.

Note that consistency does not mean that it is impossible to obtain an estimate very different from θ using a consistent estimator with a very large sample size. Rather, consistency is an asymptotic property; it only describes


what happens in the limit. Although consistency is a fundamental property, it is also a minimal property in this sense. Depending on the rate, or speed, with which θ̂ₙ converges to θ, a particular sample size may or may not offer much hope of obtaining an accurate, useful estimate.

In nonparametric statistics, it is often difficult to prove convergence of an estimator and to obtain its rate of convergence. Often, convergence and its rate are expressed in terms of the stochastic order of the error of estimation. The weakest notion of stochastic order is related to the notion of "bounded in probability." A sequence of random variables Aₙ is said to be bounded in probability if for each ε > 0 there exist B_ε and n_ε such that Prob(|Aₙ| > B_ε) < ε for all n > n_ε; such cases are denoted by writing Aₙ = O_p(1). This means that when n is large, the random variable Aₙ is bounded, with probability tending to one. The notation Aₙ = O_p(n^{-β}), where β > 0, denotes that the sequence Aₙ/n^{-β} = n^β Aₙ is O_p(1). In this case, Aₙ is said to converge to a small quantity of order n^{-β}. With some abuse of language, one can say also that Aₙ converges at the rate n^{-β}, because even multiplied by n^β (which can be rather large if n is large and if β is not too small), the sequence n^β Aₙ remains bounded in probability. Consequently, if β is small (near zero), the rate of convergence is very slow, because n^{-β} is not so small even when n is large.

This type of convergence is weaker than convergence in probability. If Aₙ converges in probability at the rate n^{-β} (β > 0), then Aₙ/n^{-β} = n^β Aₙ →p 0, which can be denoted by writing Aₙ = o_p(n^{-β}) or n^β Aₙ = o_p(1) (this is sometimes called "big-O, little-o" notation). Writing Aₙ = O_p(n^{-β}) means that n^β Aₙ remains bounded when n → ∞, but writing Aₙ = o_p(n^{-β}) means that n^β Aₙ →p 0 when n → ∞.

In terms of the convergence of an estimator θ̂ₙ of θ, writing θ̂ₙ − θ = o_p(n^{-β}) means that θ̂ₙ converges in probability at rate n^{-β}. Writing θ̂ₙ − θ = O_p(n^{-β}) implies the weaker form of convergence, but the rate is still said to be n^{-β}. Standard, parametric estimation problems usually yield estimators that converge in probability at the rate n^{-1/2} (corresponding to β = 1/2) and are said to be root-n consistent in such cases; this provides a familiar benchmark to which the rates of convergence of other, nonparametric estimators can be compared.

In all cases, the value of β plays a crucial role by indicating the stochastic order of the error of estimation. Since in many nonparametric problems β will be small (i.e., smaller than 1/2), it is illuminating to look at Table 4.1,


where the values of n^{-β} are displayed for some values of n and of β. Table 4.1 illustrates that as β diminishes, the sample size n must increase exponentially to maintain the same order of estimation error. For example, to achieve the same order of estimation error that one attains with n = 10 when β = 1, one needs n = 100 observations when β = 1/2, n = 1,000 observations when β = 1/3, and n = 10,000 observations when β = 1/4.

  n \ β     2/2      2/3      2/4      2/5      2/6      2/7      2/8      2/9      2/10
     10   0.1000   0.2154   0.3162   0.3981   0.4642   0.5179   0.5623   0.5995   0.6310
     50   0.0200   0.0737   0.1414   0.2091   0.2714   0.3270   0.3761   0.4192   0.4573
    100   0.0100   0.0464   0.1000   0.1585   0.2154   0.2683   0.3162   0.3594   0.3981
    500   0.0020   0.0159   0.0447   0.0833   0.1260   0.1694   0.2115   0.2513   0.2885
   1000   0.0010   0.0100   0.0316   0.0631   0.1000   0.1389   0.1778   0.2154   0.2512
   5000   0.0002   0.0034   0.0141   0.0331   0.0585   0.0877   0.1189   0.1507   0.1821
  10000   0.0001   0.0022   0.0100   0.0251   0.0464   0.0720   0.1000   0.1292   0.1585

Table 4.1: Values of the order n^{-β} for various n and β.
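The pattern in Table 4.1 can be made explicit by equating the orders of estimation error for two pairs (n₁, β₁) and (n₂, β₂); this short derivation is an added illustration, not part of the original text:

n_2^{-\beta_2} = n_1^{-\beta_1} \quad\Longleftrightarrow\quad n_2 = n_1^{\beta_1/\beta_2}.

For example, matching the error attained at n₁ = 10 with β₁ = 1 using an estimator with β₂ = 1/4 requires n₂ = 10^{1/(1/4)} = 10⁴ = 10,000 observations, exactly as stated above.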

Parametric estimators, such as those described in Chapter 2, typically achieve a convergence rate of n^{-1/2}, while many nonparametric estimators achieve only a (sometimes far) slower rate of convergence. The tradeoffs are clear: parametric estimators offer fast convergence, and hence it is possible to obtain meaningful estimates with smaller amounts of data than would be required by nonparametric estimators with slower convergence rates. But this is valid only if the parametric model that is estimated accurately reflects the true DGP; if not, there is specification error, calling into question consistency (and perhaps other properties) of the parametric estimators. By contrast, the nonparametric estimators discussed in this chapter largely avoid the risk of specification error, but (in some, though not all, cases) at the cost of slower convergence rates and hence larger data requirements. What might constitute a "large" sample varies, depending on the stochastic order of the estimation error for the estimator that one chooses. Unfortunately, as will be shown below, many published applications of DEA estimators have used far fewer data than what might reasonably be required to obtain statistically meaningful results.

In an effort to simplify notation, the subscript n on the estimators in subsequent sections will be omitted.


    Consistency of DEA/FDH estimators

This section summarizes the consistency results for the nonparametric envelopment estimators under Assumptions 4.2.1 to 4.2.7 given above. For the DEA estimators, convexity of Ψ (Assumption 4.2.5) is needed, but for the FDH case, the results are valid with or without this assumption.

Research on consistency and rates of convergence of efficiency estimators first examined the simpler cases where either inputs or outputs were unidimensional. For N = 1 and M ≥ 1, Banker (1993) showed consistency of the input efficiency estimator θ̂_VRS for convex Ψ, but obtained no information on the rate of convergence.

The first systematic analysis of the convergence properties of the envelopment estimators (Ψ̂_FDH and Ψ̂_VRS) appeared in Korostelev et al. (1995a, 1995b). For the case N = 1 and M ≥ 1, they found that when Ψ satisfies free disposability, but not convexity,

d_{\Delta}(\widehat{\Psi}_{FDH}, \Psi) = O_p\big(n^{-\frac{1}{M+1}}\big),   (4.41)

and when Ψ satisfies both free disposability and convexity,

d_{\Delta}(\widehat{\Psi}_{VRS}, \Psi) = O_p\big(n^{-\frac{2}{M+2}}\big),   (4.42)

where d_Δ(·, ·) is the Lebesgue measure (giving the volume) of the difference between the two sets.

The rates of convergence for Ψ̂_FDH and Ψ̂_VRS are not very different;

however, for all M ≥ 1, n^{-1/(M+1)} > n^{-2/(M+2)}. Moreover, the difference is larger for small values of M, as revealed by Table 4.1. Hence, incorporation of the convexity assumption into the estimator of Ψ (as done by Ψ̂_VRS) improves the rate of convergence. But, as noted earlier, if the true set Ψ is non-convex, the DEA estimator is not consistent and hence does not converge, whereas the FDH estimator converges regardless of whether Ψ is convex. The rates of convergence depend on the dimensionality of the problem, i.e., on the number of outputs M (when N = 1). This is yet another manifestation of the "curse of dimensionality" shared by most nonparametric approaches in statistics and econometrics; additional discussion of this issue is given below.

Despite the curse of dimensionality, Korostelev et al. (1995a, 1995b) show that the FDH and DEA estimators share some optimality properties. In particular, under free disposability (but not convexity), Ψ̂_FDH is the most


efficient estimator of Ψ (in terms of the minimax risk over the set of estimators sharing the free disposability assumption, where the loss function is d_Δ). Where Ψ is both free-disposal and convex, Ψ̂_VRS becomes the most efficient estimator over the class of all estimators sharing the free disposal and convexity assumptions. These are quite important results and suggest the inherent quality of the envelopment estimators, even if they are imprecise in small samples where M is large.

The full multivariate case, where both N and M are greater than one, was investigated later; results were established in terms of the efficiency scores themselves. Kneip et al. (1998) obtained results for DEA estimators, and Park et al. (2000) obtained results for FDH estimators. To summarize,

\widehat{\theta}_{FDH}(x, y) - \theta(x, y) = O_p\big(n^{-\frac{1}{N+M}}\big)   (4.43)

and

\widehat{\theta}_{VRS}(x, y) - \theta(x, y) = O_p\big(n^{-\frac{2}{N+M+1}}\big).   (4.44)

The rates obtained above when N = 1 are a special case of the results here. The curse of dimensionality acts symmetrically in both input and output spaces; i.e., the curse of dimensionality is exacerbated, to the same degree, regardless of whether N or M is increased. The same rates of convergence can be derived for the output-oriented efficiency scores.

The convergence rate for θ̂_VRS(x, y) is slightly faster than for θ̂_FDH(x, y), provided Assumption 4.2.5 is satisfied; otherwise, θ̂_VRS(x, y) is inconsistent. The faster convergence rate for the DEA estimator is due to the fact that Ψ̂_FDH ⊆ Ψ̂_DEA. To date, the convergence rate for θ̂_CRS(x, y) has not been established, but for similar reasons one might expect its convergence rate to be faster than the rate achieved by θ̂_VRS(x, y) if Ψ displays globally constant returns to scale. A concrete instance of these rates is worked out below.
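As a concrete instance of (4.43) and (4.44) (an added illustration, not part of the original text), take N = M = 2, so that N + M = 4:

\widehat{\theta}_{FDH}(x, y) - \theta(x, y) = O_p\big(n^{-1/4}\big), \qquad \widehat{\theta}_{VRS}(x, y) - \theta(x, y) = O_p\big(n^{-2/5}\big).

Halving the order of the FDH estimation error then requires multiplying the sample size by 2⁴ = 16, while halving the DEA error requires a factor of 2^{5/2} ≈ 5.7.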

    Curse of dimensionality: parametric v. nonparametric inference

Returning to Table 4.1, it is clear that if M + N increases, a much larger sample size is needed to reach the precision obtained in the simplest case where M = N = 1, where the parametric rate n^{-1/2} is obtained by the FDH estimator, or where an even better rate n^{-2/3} is obtained by the DEA estimator. When M + N is large, unless a very large quantity of data are


available, the resulting imprecision will manifest itself in the form of large bias, large variance, and very wide confidence intervals.

This has been confirmed in Monte Carlo experiments. In fact, as the number of outputs is increased, the number of observations must increase at an exponential rate to maintain a given mean-square error with the nonparametric estimators of Ψ. General statements on the number of observations required to achieve a given level of mean-square error are not possible, since the exact convergence of the nonparametric estimators depends on unknown constants related to the features of the unobserved Ψ. Nonetheless, for estimation purposes, it is always true that more data are better than fewer data (recall that this is a consequence of consistency). In the case of nonparametric estimators such as Ψ̂_FDH and Ψ̂_VRS, this statement is more than doubly true; it is exponentially true! To illustrate this fact, from Table 4.1 it can be seen that, ceteris paribus, inference with n = 10,000 observations when the number of inputs and outputs is greater than or equal to 9 is less accurate than with n = 10 observations with one input and one output! In fact, with N + M = 9, one would need not 10,000 observations, but 10 × 10,000 = 100,000 observations to achieve the same estimation error as with n = 10 observations when N = M = 1. A number of applied papers using relatively small numbers of observations with many dimensions have appeared in the literature, but hopefully no more will appear.

The curse of dimensionality results from the fact that as a given set of n observations is projected in an increasing number of orthogonal directions, the Euclidean distance between the observations necessarily must increase. Moreover, for a given sample size, increasing the number of dimensions will result in more observations lying on the boundaries of the estimators Ψ̂_FDH and Ψ̂_VRS. The FDH estimator is particularly affected by this problem. The DEA estimator is also affected, though often to a lesser degree due to its incorporation of the convexity assumption (recall that Ψ̂_VRS is merely the convex hull of Ψ̂_FDH; consequently, fewer points will lie on the boundary of Ψ̂_VRS than on the boundary of Ψ̂_FDH). Wheelock and Wilson (2003) and Wilson (2004) have found cases where all or nearly all observations in samples of several thousand observations lie on the boundary of Ψ̂_FDH, while relatively few observations lie on the boundary of Ψ̂_VRS. Both papers argue that the FDH estimator should be used as a diagnostic to check whether it might be reasonable to employ the DEA estimator in a given application; large numbers of observations falling on the boundary of Ψ̂_FDH may indicate


problems due to the curse of dimensionality.

Parametric estimators suffer little from this phenomenon in the sense that their rate of convergence typically does not depend on the dimensionality of the problem. The parametric structure incorporates information from all of the observations in a sample, regardless of the dimensionality of Ψ. This also explains why parametric estimators are usually (but not always) more efficient in a statistical sense than their nonparametric counterparts: the parametric estimators extract more information from the data, assuming, of course, that the parametric assumptions that have been made are correct. This is frequently a "big" assumption.

Additional insight is provided by comparing parametric maximum likelihood estimators and nonparametric FDH and DEA estimators. Parametric maximum likelihood estimation is the most frequently used parametric estimation method. Under some regularity conditions that do not involve the dimensionality of the problem at hand (see, for example, Spanos, 1999), maximum likelihood estimators are root-n consistent. Maximum likelihood estimation involves maximizing a likelihood function in which each observation is weighted equally. Hence, maximum likelihood estimators are global estimators, as opposed to FDH and DEA estimators, which are local estimators. With FDH or DEA estimators of the frontier, only observations near the point where the frontier is being estimated contribute to the estimate at that point; far-away observations contribute little or nothing to estimation at the point of interest.

It remains true, however, that when data are projected in an increasing number of orthogonal directions, the Euclidean distance between the observations necessarily increases. This is problematic for FDH and DEA estimators, because it means that increasing dimensionality results in fewer nearby observations that can impart information about the frontier at a particular point of interest. But increasing distance between observations also means that parametric maximum likelihood estimators are combining information, and weighting it equally, from observations that are increasingly far apart (with increasing dimensionality). Hence, increasing dimensionality means that the researcher must rely increasingly on the parametric assumptions of the model. Again, these are often big assumptions, and should be tested, though often they are not.

The points made here go to the heart of the tradeoff between nonparametric and parametric estimators. Parametric estimators incur the risk of misspecification, which typically results in inconsistency, but are almost always statistically more efficient than nonparametric estimators if properly specified. Nonparametric estimators avoid the risk of misspecification, but usually involve more noise than parametric estimators. Lunch is not free, and the world is full of tradeoffs, which creates employment opportunities for economists and statisticians.

    Asymptotic sampling distributions

As discussed above, consistency is an essential property for any estimator. However, consistency is a minimal theoretical property. The preceding discussion indicates that DEA or FDH efficiency estimators converge as the sample size increases (although at perhaps a slow rate), but by themselves, these results have little practical use other than to confirm that the DEA or FDH estimators are possibly reasonable to use for efficiency estimation.

For empirical applications, more is needed; in particular, the applied researcher must have some knowledge of the sampling distributions in order to make inferences about the true levels of efficiency or inefficiency (correction for the bias and construction of confidence intervals, for instance). This is particularly important in situations where point estimates of efficiency might be highly variable due to the curse of dimensionality or other problems. In the nonparametric framework of FDH and DEA, as is often the case, only asymptotic results are available. For FDH estimators, a rather general result is available, but for DEA estimators, useful results have been obtained only for the case of one input and one output (M = N = 1). A general result is also available, but it is of little practical use.

• FDH with N, M ≥ 1: Park et al. (2000) obtained a well-known limiting distribution for the error of estimation in this case:

n^{\frac{1}{N+M}} \big(\widehat{\theta}_{FDH}(x, y) - \theta(x, y)\big) \xrightarrow{d} \mathrm{Weibull}(\mu_{x,y},\ N + M),   (4.45)

where μ_{x,y} is a constant depending on the DGP. The constant is proportional to the probability of observing a firm dominating a point on the ray defined by x, in a neighborhood of the frontier point x^∂(y). This constant μ_{x,y} is larger (smaller) as the density f(x, y) provides more (less) mass in the neighborhood of the frontier point. The bias and standard deviation of the FDH estimator are of the order n^{-1/(N+M)}, and are proportional to μ_{x,y}^{-1/(N+M)}. Also, the ratio of the mean to the standard deviation of θ̂_FDH(x, y) − θ(x, y) does not depend on μ_{x,y} or on n, and


increases as the dimensionality M + N increases. Thus, the curse of dimensionality here is two-fold: not only does the rate of convergence worsen with increasing dimensionality, but bias also worsens. These results suggest a strong need for a bias-corrected version of the FDH estimator; such an improvement is proposed later in this chapter.

Park et al. propose a consistent estimator of μ_{x,y} in order to obtain a bias-corrected estimator and asymptotic confidence intervals. However, Monte Carlo studies indicate that the noise introduced by estimating μ_{x,y} reduces the quality of inference when N + M is large with moderate sample sizes (say, n ≤ 1000 with M + N ≥ 5). Here again the bootstrap might provide a useful alternative, but some smoothing of the FDH estimator might be even more appropriate than it is in the case of the bootstrap with DEA estimators (Jeong and Simar, 2006). This point is briefly discussed below, in Sections 4.3.6 and 4.4.2. A small simulation sketch of (4.45) follows.
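The following small Monte Carlo sketch is an added illustration with an artificial DGP (the frontier y = x and all names are assumptions of the sketch, not the chapter's); it shows the scaled FDH estimation error of (4.45) for N = M = 1:

    import numpy as np

    # Artificial DGP: Psi = {(x, y) : 0 <= y <= x <= 1}, so theta(x0, y0) = y0/x0.
    rng = np.random.default_rng(3)
    n, reps = 500, 2000
    x0, y0 = 0.5, 0.4
    theta_true = y0 / x0                 # = 0.8 for this frontier
    errors = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        y = x * rng.uniform(0.0, 1.0, n)  # observations below the frontier y = x
        feasible = y >= y0                # units producing at least y0
        errors[r] = x[feasible].min() / x0 - theta_true  # FDH error, always >= 0
    scaled = n ** 0.5 * errors            # rate n^{1/(N+M)} = n^{1/2} here
    print(scaled.mean(), scaled.std())    # roughly stable across n, per (4.45)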

DEA with $M = N = 1$: In this case, the efficient boundary can be represented by a function, and Gijbels et al. (1999) obtain the asymptotic result
$$n^{2/3}\left(\widehat{\theta}_{VRS}(x,y) - \theta(x,y)\right) \xrightarrow{d} Q_1(\cdot) \qquad (4.46)$$
as well as an analytical form for the limiting distribution $Q_1(\cdot)$. The limiting distribution is a regular distribution function known up to some constants. These constants depend on features of the DGP involving the curvature of the frontier and the magnitude of the density $f(x,y)$ at the true frontier point $(\theta(x,y)x,\, y)$. Expressions for the asymptotic bias and variance are also provided. As expected for the DEA estimator, the bias is of the order $n^{-2/3}$ and is larger for greater curvature of the true frontier (DEA is a piecewise linear estimator, and so increasing curvature of the true frontier makes estimation more difficult) and decreases when the density at the true frontier point increases (a large density at $(\theta(x,y)x,\, y)$ implies greater chances of observing data points near the point of interest that can impart useful information; recall again that DEA is a local estimator). The variance of the DEA estimator behaves similarly. It appears also that in most cases, the bias will be much larger than the variance.

Using simple estimators of the two constants in $Q_1(\cdot)$, Gijbels et al. provide a bias-corrected estimator and a procedure for building confidence intervals for $\theta(x,y)$. Monte-Carlo experiments indicate that even for moderate sample sizes ($n = 100$), the procedure works reasonably well.

Although the results of Gijbels et al. are limited to the case where $N = M = 1$, their use extends beyond this simple case. The results provide insight for inference by identifying which features of the DGP affect the quality of inference-making. One would expect that in a more general multivariate setting, the same features should play similar roles, but perhaps in a more complicated fashion.

DEA with $N, M \geq 1$: Deriving analytical results for DEA estimators in multivariate settings is necessarily more difficult, and consequently the results are less satisfying. Kneip et al. (2003) have obtained an important key result, namely
$$n^{2/(N+M+1)}\left(\frac{\widehat{\theta}_{VRS}(x,y)}{\theta(x,y)} - 1\right) \xrightarrow{d} Q_2(\cdot) \qquad (4.47)$$
where $Q_2(\cdot)$ is a regular distribution function known up to some constants. However, in this case, no closed analytical form for $Q_2(\cdot)$ is available, and hence the result is of little practical use for inference. In particular, the moments and the quantiles of $Q_2(\cdot)$ are not available, so that neither bias-correction nor confidence intervals can be provided easily using only the result in (4.47). Rather, the importance of this result lies in the fact that it is needed for proving the consistency of the bootstrap approximation (see below). The bootstrap remains the only practical tool for inference here.

To summarize, where limiting distributions are available and tractable, one can estimate the bias of FDH or DEA estimators and build confidence intervals. However, additional noise is introduced by estimating unknown constants appearing in the limiting distributions. Hence the bootstrap remains an attractive alternative and is, to date, the only practical way of making inference in the multivariate DEA case.


4.3 Bootstrapping DEA and FDH Efficiency Scores

The bootstrap provides an attractive alternative to the theoretical results discussed in the previous section. The essence of the bootstrap idea (Efron, 1979, 1982; Efron and Tibshirani, 1993) is to approximate the sampling distributions of interest by simulating, or mimicking, the DGP. The first use of the bootstrap in frontier models dates to Simar (1992). Its use for nonparametric envelopment estimators was developed by Simar and Wilson (1998, 2000a). Theoretical properties of the bootstrap with DEA estimators are provided in Kneip et al. (2003), and for FDH estimators in Jeong and Simar (2006).

The presentation below is in terms of the input oriented case, but it can easily be translated to the output oriented case. The bootstrap is intended to provide approximations of the sampling distributions of $\widehat{\theta}(x,y) - \theta(x,y)$ or $\widehat{\theta}(x,y)/\theta(x,y)$, which may be used as an alternative to the asymptotic results described above.

    4.3.1 General principles

The presentation in this section follows Simar and Wilson (2000a). The original data $\mathcal{X}_n$ are generated from the DGP, which is completely characterized by knowledge of $\Psi$ and of the probability density function $f(x,y)$. Let $\mathcal{P}$ denote the DGP. Then $\mathcal{P} = \mathcal{P}(\Psi, f(\cdot,\cdot))$. Let $\widehat{\mathcal{P}}(\mathcal{X}_n)$ be a consistent estimator of the DGP $\mathcal{P}$, where
$$\widehat{\mathcal{P}}(\mathcal{X}_n) = \mathcal{P}\left(\widehat{\Psi}, \widehat{f}(\cdot,\cdot)\right). \qquad (4.48)$$
In the true world, $\mathcal{P}$, $\Psi$, and $\theta(x,y)$ are unknown ($(x,y)$ is a given, fixed point of interest). Only the data $\mathcal{X}_n$ are observed, and these must be used to construct the estimates $\widehat{\mathcal{P}}$, $\widehat{\Psi}$, and $\widehat{\theta}_{VRS}(x,y)$.

Now consider a virtual, simulated world, i.e., the bootstrap world. This bootstrap world is analogous to the true world, but in the bootstrap world, the estimates $\widehat{\mathcal{P}}$, $\widehat{\Psi}$, and $\widehat{\theta}_{VRS}(x,y)$ take the place of $\mathcal{P}$, $\Psi$, and $\theta(x,y)$ in the true world. In other words, in the true world $\mathcal{P}$ is the true DGP while $\widehat{\mathcal{P}}$ is an estimate of $\mathcal{P}$, but in the bootstrap world, $\widehat{\mathcal{P}}$ is the true DGP.


In the bootstrap world, a new dataset $\mathcal{X}^*_n = \{(x^*_i, y^*_i),\ i = 1,\ldots,n\}$ can be drawn from $\widehat{\mathcal{P}}$, since $\widehat{\mathcal{P}}$ is a known estimate. Within the bootstrap world, $\widehat{\Psi}$ is the true attainable set, and the union of the free disposal and convex hulls of $\mathcal{X}^*_n$ gives an estimator of $\widehat{\Psi}$, namely
$$\widehat{\Psi}^* = \widehat{\Psi}(\mathcal{X}^*_n) = \Big\{(x,y) \in \mathbb{R}^{N+M}_+ \;\Big|\; y \leq \sum_{i=1}^{n} \gamma_i y^*_i,\; x \geq \sum_{i=1}^{n} \gamma_i x^*_i,\; \sum_{i=1}^{n} \gamma_i = 1,\; \gamma_i \geq 0 \ \forall\, i = 1,\ldots,n\Big\}. \qquad (4.49)$$

For the fixed point $(x,y)$, an estimator of $\widehat{\theta}_{VRS}(x,y)$ is provided by
$$\widehat{\theta}^*_{VRS}(x,y) = \inf\left\{\theta \mid (\theta x, y) \in \widehat{\Psi}^*\right\} \qquad (4.50)$$
(recall that in the bootstrap world, $\widehat{\theta}_{VRS}(x,y)$ is the quantity estimated, analogous to $\theta(x,y)$ in the true world). The estimator $\widehat{\theta}^*_{VRS}(x,y)$ may be computed by solving the linear program
$$\widehat{\theta}^*_{VRS}(x,y) = \min\Big\{\theta > 0 \;\Big|\; y \leq \sum_{i=1}^{n} \gamma_i y^*_i,\; \theta x \geq \sum_{i=1}^{n} \gamma_i x^*_i,\; \sum_{i=1}^{n} \gamma_i = 1,\; \gamma_i \geq 0 \ \forall\, i = 1,\ldots,n\Big\}. \qquad (4.51)$$
The key relation here is that within the true world, $\widehat{\theta}_{VRS}(x,y)$ is an estimator of $\theta(x,y)$, based on the sample $\mathcal{X}_n$ generated from $\mathcal{P}$; whereas in the bootstrap world, $\widehat{\theta}^*_{VRS}(x,y)$ is an estimator of $\widehat{\theta}_{VRS}(x,y)$, based on the pseudo-sample $\mathcal{X}^*_n$ generated from $\widehat{\mathcal{P}}(\mathcal{X}_n)$. If the bootstrap is consistent, then

$$\left(\widehat{\theta}^*_{VRS}(x,y) - \widehat{\theta}_{VRS}(x,y)\right) \,\Big|\, \widehat{\mathcal{P}}(\mathcal{X}_n) \;\overset{\text{approx.}}{\sim}\; \left(\widehat{\theta}_{VRS}(x,y) - \theta(x,y)\right) \,\Big|\, \mathcal{P}, \qquad (4.52)$$
or equivalently,
$$\frac{\widehat{\theta}^*_{VRS}(x,y)}{\widehat{\theta}_{VRS}(x,y)} \,\Big|\, \widehat{\mathcal{P}}(\mathcal{X}_n) \;\overset{\text{approx.}}{\sim}\; \frac{\widehat{\theta}_{VRS}(x,y)}{\theta(x,y)} \,\Big|\, \mathcal{P}. \qquad (4.53)$$
Within the bootstrap world and conditional on the observed data $\mathcal{X}_n$, the sampling distribution of $\widehat{\theta}^*_{VRS}(x,y)$ is (in principle) completely known


since $\widehat{\mathcal{P}}(\mathcal{X}_n)$ is known. However, in practice, it is impossible to compute this analytically. Hence Monte-Carlo simulations are necessary to approximate the left-hand sides of (4.52) or (4.53).

Using $\widehat{\mathcal{P}}(\mathcal{X}_n)$ to generate $B$ samples $\mathcal{X}^*_{n,b}$ of size $n$, $b = 1,\ldots,B$, and applying the original estimator to these pseudo-samples, yields a set of $B$ pseudo-estimates $\widehat{\theta}^*_{VRS,b}(x,y)$, $b = 1,\ldots,B$. The empirical distribution of these bootstrap values gives a Monte Carlo approximation of the sampling distribution of $\widehat{\theta}^*_{VRS}(x,y)$, conditional on $\widehat{\mathcal{P}}(\mathcal{X}_n)$, i.e., the left-hand side of (4.52) (or of (4.53) if the ratio formulation is used). The quality of the approximation relies in part on the value of $B$: by the law of large numbers, when $B \rightarrow \infty$, the error of this approximation due to the bootstrap resampling (i.e., drawing from $\widehat{\mathcal{P}}$) tends to zero. The practical choice of $B$ is limited by the speed of one's computer; for confidence intervals, values of 2000 or more may be needed to give a reasonable approximation.
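To make the mechanics concrete, here is a minimal Python sketch of this Monte-Carlo loop. The function names and data layout are ours; the DEA program (4.51) is solved with scipy's linprog, and draw_pseudo_sample stands for whichever consistent simulation of $\widehat{\mathcal{P}}(\mathcal{X}_n)$ is adopted (e.g., the smoothed procedures of section 4.3.4; a naive resampling scheme would not do, as explained below).

```python
import numpy as np
from scipy.optimize import linprog

def dea_input_efficiency(x0, y0, X, Y):
    """Input-oriented VRS DEA score of (x0, y0) relative to the
    reference sample X (n x N inputs), Y (n x M outputs); this is
    the linear program in (4.51) with variables (theta, gamma)."""
    n = X.shape[0]
    c = np.r_[1.0, np.zeros(n)]                          # minimize theta
    A_out = np.hstack([np.zeros((Y.shape[1], 1)), -Y.T])  # Y'gamma >= y0
    A_in = np.hstack([-x0.reshape(-1, 1), X.T])           # theta*x0 >= X'gamma
    b_ub = np.r_[-np.asarray(y0, dtype=float), np.zeros(X.shape[1])]
    A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1)          # sum gamma_i = 1 (VRS)
    res = linprog(c, A_ub=np.vstack([A_out, A_in]), b_ub=b_ub,
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0.0, None)] * n, method="highs")
    return res.x[0] if res.success else np.nan           # nan if infeasible

def bootstrap_dea(x0, y0, X, Y, draw_pseudo_sample, B=2000):
    """Monte-Carlo approximation of the conditional sampling
    distribution of theta*_VRS(x0, y0), the left-hand side of (4.52)."""
    theta_star = np.empty(B)
    for b in range(B):
        Xs, Ys = draw_pseudo_sample(X, Y)       # a draw from P-hat(X_n)
        theta_star[b] = dea_input_efficiency(x0, y0, Xs, Ys)
    return theta_star
```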

The issue of computational constraints continues to diminish in importance as computing technology advances. For many problems, however, a single desktop system may be sufficient. Given the independence of each of the $B$ replications in the Monte Carlo approximation, the bootstrap algorithm is also easily adapted to parallel computing environments or to a series of networked personal computers; the price of these machines continues to decline while processor speeds increase.

The bootstrap is an asymptotic procedure, as indicated by the conditioning on the left-hand side of (4.52) or of (4.53). Thus the quality of the bootstrap approximation depends on both the number of replications $B$ and the sample size $n$. When the bootstrap is consistent, the approximation becomes exact as $B \rightarrow \infty$ and $n \rightarrow \infty$.

    4.3.2 Bootstrap confidence intervals

The procedure described in Simar and Wilson (1998) for constructing the confidence intervals depends on using bootstrap estimates of bias to correct for the bias of the DEA estimators; in addition, the procedure described there requires using these bias estimates to shift the obtained bootstrap distributions appropriately. Use of bias estimates introduces additional noise into the procedure.

Simar and Wilson (1999c) propose an improved procedure, outlined below, which automatically corrects for bias without explicit use of a noisy bias estimator. The procedure is presented below in terms of the difference $\widehat{\delta}_{VRS}(x,y) - \delta(x,y)$, where $\delta(x,y) = 1/\theta(x,y)$ denotes the Shephard (1970) input distance function of (4.54) and (4.55).


Let $\widehat{c}_a$ denote the $a$th sample quantile of the empirical distribution of $\left(\widehat{\delta}^*_{VRS,b}(x,y) - \widehat{\delta}_{VRS}(x,y)\right)$, $b = 1,\ldots,B$; then
$$\mathrm{Prob}\left(\widehat{c}_{\alpha/2} \leq \widehat{\delta}^*_{VRS}(x,y) - \widehat{\delta}_{VRS}(x,y) \leq \widehat{c}_{1-\alpha/2} \,\Big|\, \widehat{\mathcal{P}}(\mathcal{X}_n)\right) = 1 - \alpha. \qquad (4.59)$$

In practice, finding $\widehat{c}_{\alpha/2}$ and $\widehat{c}_{1-\alpha/2}$ involves sorting the values $\left(\widehat{\delta}^*_{VRS,b}(x,y) - \widehat{\delta}_{VRS}(x,y)\right)$, $b = 1,\ldots,B$, in increasing order, and then deleting $\left(\frac{\alpha}{2} \times 100\right)$-percent of the elements at either end of the sorted list. Then set $\widehat{c}_{\alpha/2}$ and $\widehat{c}_{1-\alpha/2}$ equal to the endpoints of the truncated, sorted array.

The bootstrap approximation of (4.56) is then
$$\mathrm{Prob}\left(\widehat{c}_{\alpha/2} \leq \widehat{\delta}_{VRS}(x,y) - \delta(x,y) \leq \widehat{c}_{1-\alpha/2}\right) \approx 1 - \alpha, \qquad (4.60)$$
and the estimated $(1-\alpha) \times 100$-percent confidence interval for $\delta(x,y)$ is
$$\widehat{\delta}_{VRS}(x,y) - \widehat{c}_{1-\alpha/2} \leq \delta(x,y) \leq \widehat{\delta}_{VRS}(x,y) - \widehat{c}_{\alpha/2}, \qquad (4.61)$$
or
$$\frac{1}{\widehat{\delta}_{VRS}(x,y) - \widehat{c}_{\alpha/2}} \leq \theta(x,y) \leq \frac{1}{\widehat{\delta}_{VRS}(x,y) - \widehat{c}_{1-\alpha/2}}. \qquad (4.62)$$

Since $\widehat{c}_{\alpha/2} \leq \widehat{c}_{1-\alpha/2} \leq 0$, the estimated confidence interval will include the original estimate $\widehat{\theta}_{VRS}(x,y)$ on its upper boundary only if $\widehat{c}_{1-\alpha/2} = 0$. This may seem strange until one recalls that the boundary of support of $f(x,y)$ is being estimated; under the assumptions, the distance from $(x,y)$ to the frontier can be no less than the distance indicated by $\widehat{\delta}_{VRS}(x,y)$, but is likely more.

The procedure outlined here can be used for any point $(x,y) \in \mathbb{R}^{N+M}_+$ for which $\widehat{\theta}_{VRS}(x,y)$ exists. Typically, the applied researcher is interested in the efficiency scores of the observed units themselves; in this case, the above procedure can be repeated $n$ times, with $(x,y)$ taking values $(x_i, y_i)$, $i = 1,\ldots,n$, producing a set of $n$ confidence intervals of the form (4.62), one for each firm in the sample. Alternatively, in cases where the computational burden is large due to large numbers of observations, one might select a few representative firms, or perhaps a small number of hypothetical firms. For example, rather than reporting results for several thousand firms, one might report results for a few hypothetical firms represented by the means (or medians) of input and output levels for observed firms in quintiles or deciles determined by size or some other feature of the observed firms.
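In code, constructing the interval (4.62) from the $B$ bootstrap values amounts to a sort and two order statistics. The sketch below is ours (the function name and the simple order-statistic indexing are our choices), assuming the bootstrap values of the Shephard input distance function are already in hand:

```python
import numpy as np

def bootstrap_ci_theta(delta_hat, delta_star, alpha=0.05):
    """Interval (4.62) for theta(x,y) = 1/delta(x,y), given the original
    estimate delta_hat = delta-hat_VRS(x,y) and the B bootstrap values
    delta_star[b] = delta-hat*_VRS,b(x,y)."""
    diffs = np.sort(delta_star - delta_hat)            # empirical differences
    B = diffs.size
    c_lo = diffs[int(np.floor(B * alpha / 2.0))]                     # c_{alpha/2}
    c_hi = diffs[min(int(np.ceil(B * (1.0 - alpha / 2.0))), B) - 1]  # c_{1-alpha/2}
    return 1.0 / (delta_hat - c_lo), 1.0 / (delta_hat - c_hi)        # (4.62)
```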


4.3.3 Bootstrap bias corrections

One should avoid the bias correction in (4.65) unless
$$\frac{\left|\widehat{\mathrm{BIAS}}_B\left[\widehat{\theta}_{VRS}(x,y)\right]\right|}{\widehat{\sigma}} > \frac{1}{3}. \qquad (4.67)$$
Alternatively, Efron and Tibshirani (1993) propose a less conservative rule, suggesting that the bias correction be avoided unless
$$\frac{\left|\widehat{\mathrm{BIAS}}_B\left[\widehat{\theta}_{VRS}(x,y)\right]\right|}{\widehat{\sigma}} > \frac{1}{4}, \qquad (4.68)$$
where $\widehat{\sigma}$ denotes the sample standard deviation of the bootstrap values $\widehat{\theta}^*_{VRS,b}(x,y)$.
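In code, the rule amounts to comparing the bootstrap bias estimate with the bootstrap standard deviation before deciding whether to correct; the following is a minimal sketch (our function name, with the Efron and Tibshirani threshold of 1/4 as the default):

```python
import numpy as np

def maybe_bias_correct(theta_hat, theta_star, threshold=0.25):
    """Return the bias-corrected DEA score only when the estimated bias
    is large relative to the bootstrap standard deviation; otherwise
    the correction adds more noise than it removes."""
    bias = np.mean(theta_star) - theta_hat     # bootstrap bias estimate
    sigma = np.std(theta_star, ddof=1)         # bootstrap std. deviation
    if abs(bias) / sigma > threshold:          # rule (4.67)/(4.68)
        return theta_hat - bias
    return theta_hat
```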

4.3.4 Bootstrapping DEA in action

The remaining question concerns how a bootstrap sample $\mathcal{X}^*_n$ from the consistent estimator of $\mathcal{P}$ might be simulated. The standard, and simplest, nonparametric bootstrap technique, called the naive bootstrap, consists of drawing pseudo-observations $(x^*_i, y^*_i)$, $i = 1,\ldots,n$ independently, uniformly, and with replacement from the set $\mathcal{X}_n$ of original observations. Unfortunately, however, this naive bootstrap is inconsistent in the context of boundary estimation. In other words, even if $B \rightarrow \infty$ and $n \rightarrow \infty$, the Monte-Carlo empirical distribution of the $\widehat{\theta}^*_{VRS,b}(x,y)$ will not approximate the sampling distribution of $\widehat{\theta}_{VRS}(x,y)$.

The bootstrap literature contains numerous univariate examples of this problem; see Bickel and Freedman (1981) and Efron and Tibshirani (1993) for examples. Simar and Wilson (1999a, 1999b) discuss this issue in the context of multivariate frontier estimation. As illustrated below, the problem comes from the fact that in the naive bootstrap, the efficient facet determining the value of $\widehat{\theta}_{VRS}(x,y)$ appears too often, and with a fixed probability, in the pseudo-samples $\mathcal{X}^*_n$. This fixed probability does not depend on the real DGP (other than the number $N$ of inputs and the number $M$ of outputs), and does not vanish, even when $n \rightarrow \infty$.

There are two solutions to this problem: either subsampling techniques (drawing pseudo-samples of size $m = n^{\kappa}$, where $\kappa < 1$) or smoothing techniques (where a smooth estimate of the joint density $f(x,y)$ is employed to simulate $\mathcal{X}^*_n$) must be used. Kneip et al. (2003) analyze the properties of these techniques for strictly convex sets, and in both approaches prove that the bootstrap provides a consistent approximation of the sampling distribution of $\widehat{\theta}_{VRS}(x,y) - \theta(x,y)$ (or of the ratio $\widehat{\theta}_{VRS}(x,y)/\theta(x,y)$ if preferred). The ideas are summarized below, again for the input-oriented case; the procedure can easily be translated for the output-oriented case.

    Subsampling DEA scores

Of the two solutions mentioned above, this one is the easiest to implement, since it is similar to the naive bootstrap except that pseudo-samples of size $m = n^{\kappa}$ for some $\kappa \in (0,1)$ are drawn. For bootstrap replications $b = 1,\ldots,B$, let $\mathcal{X}^*_{m,b}$ denote a random subsample of size $m$ drawn from $\mathcal{X}_n$. Let $\widehat{\theta}^*_{VRS,m,b}(x,y)$ denote the DEA efficiency estimate for the fixed point $(x,y)$ computed using the reference sample $\mathcal{X}^*_{m,b}$. Kneip et al. (2003) prove that the Monte-Carlo empirical distribution of $m^{2/(N+M+1)}\left(\widehat{\theta}^*_{VRS,m,b}(x,y) - \widehat{\theta}_{VRS}(x,y)\right)$ given $\mathcal{X}_n$ approximates the exact sampling distribution of $n^{2/(N+M+1)}\left(\widehat{\theta}_{VRS}(x,y) - \theta(x,y)\right)$ as $B \rightarrow \infty$.

Using this approach, a bootstrap bias estimate of the DEA estimator is obtained by computing
$$\widehat{\mathrm{BIAS}}_B\left[\widehat{\theta}_{VRS}(x,y)\right] = \left(\frac{m}{n}\right)^{2/(N+M+1)} \frac{1}{B} \sum_{b=1}^{B}\left(\widehat{\theta}^*_{VRS,m,b}(x,y) - \widehat{\theta}_{VRS}(x,y)\right), \qquad (4.69)$$
and a bias-corrected estimator of the DEA efficiency score is given by
$$\widetilde{\theta}_{VRS}(x,y) = \widehat{\theta}_{VRS}(x,y) - \widehat{\mathrm{BIAS}}_B\left[\widehat{\theta}_{VRS}(x,y)\right]. \qquad (4.70)$$
Note that (4.69) differs from (4.64) by inclusion of the factor $(m/n)^{2/(N+M+1)}$, which is necessary to correct for the effects of different sample sizes in the true world and the bootstrap world.

Following the same arguments as in Section 4.3.2, but again adjusting for the different sample sizes in the true and bootstrap worlds, the following $(1-\alpha) \times 100$-percent bootstrap confidence interval is obtained for $\theta(x,y)$:
$$\frac{1}{\widehat{\delta}_{VRS}(x,y) - \tau\,\widehat{c}_{\alpha/2}} \leq \theta(x,y) \leq \frac{1}{\widehat{\delta}_{VRS}(x,y) - \tau\,\widehat{c}_{1-\alpha/2}}, \qquad (4.71)$$
where $\tau = \left(\frac{m}{n}\right)^{2/(N+M+1)}$ and $\widehat{c}_a$ is the $a$th empirical quantile of $\left(\widehat{\delta}^*_{VRS,m,b}(x,y) - \widehat{\delta}_{VRS}(x,y)\right)$, $b = 1,\ldots,B$.


In practice, with finite samples, some subsamples $\mathcal{X}^*_{m,b}$ may yield no feasible solution for the linear program used to compute $\widehat{\theta}^*_{VRS,m,b}(x,y)$. This will occur more frequently when $y$ is large compared to the original output values $y_i$ in $\mathcal{X}_n$. In fact, this problem arises if $y \gneq \max\{y_j \mid (x_j, y_j) \in \mathcal{X}^*_{m,b}\}$.³ In such cases, one could add the point of interest $(x,y)$ to the drawn subsample $\mathcal{X}^*_{m,b}$, resulting in $\widehat{\theta}^*_{VRS,m,b}(x,y) = 1$. This solves the numerical difficulties and does not affect the asymptotic properties of the bootstrap, since under the assumptions of the DGP, for all $(x,y) \in \Psi$, the probability of this event tends to zero. As shown in Kneip et al. (2003), any value of $\kappa < 1$ produces a consistent bootstrap, but in finite samples the choice of $\kappa$ seems to be critical. More work is needed on this problem.

³The expression $a \gneq b$ indicates that the elements of $a$ are weakly greater than the corresponding elements of $b$, with at least one element of $a$ strictly greater than the corresponding element of $b$.
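A minimal sketch of the subsampling scheme follows, reusing the dea_input_efficiency function from section 4.3.1; the default kappa = 0.7 is purely illustrative since, as just noted, no simple rule for $\kappa$ is available, and the floor on $m$ is a practical guard of our own:

```python
import numpy as np

def subsample_bootstrap(x0, y0, X, Y, kappa=0.7, B=2000, seed=None):
    """Subsampling bootstrap for the DEA score of (x0, y0): draw
    subsamples of size m = n**kappa, rescale by (m/n)^(2/(N+M+1)) as
    in (4.69), and return the bias-corrected score (4.70)."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    M = Y.shape[1]
    m = max(int(n ** kappa), N + M + 1)          # subsample size, kept workable
    theta_hat = dea_input_efficiency(x0, y0, X, Y)
    theta_star = np.empty(B)
    for b in range(B):
        idx = rng.choice(n, size=m, replace=True)      # draw the subsample
        theta_star[b] = dea_input_efficiency(x0, y0, X[idx], Y[idx])
    scale = (m / n) ** (2.0 / (N + M + 1))             # sample-size adjustment
    bias = scale * (np.nanmean(theta_star) - theta_hat)  # (4.69); nanmean skips
    return theta_hat - bias                              # infeasible draws; (4.70)
```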

    Smoothing techniques

As an alternative to the subsampling version of the bootstrap method, pseudo-samples can be drawn from a smooth, nonparametric estimate of the unknown density $f(x,y)$. The problem is complicated by the fact that the support of this density, $\Psi$, is unknown. There are possibly several ways of addressing this problem, but the procedures proposed by Simar and Wilson (1998, 2000b) are straightforward and less complicated than other methods that might be used.

A number of nonparametric density estimators are available. Histogram estimators (see Scott, 1992, for discussion) are easy to apply and do not suffer degenerate behavior near support boundaries, but are not smooth. The lack of smoothness can be eliminated, while preserving the non-degenerate behavior near support boundaries, by using local polynomial regression methods to smooth a histogram estimate of a density, as proposed by Cheng et al. (1997). However, it is difficult to simulate draws from such a density estimate. Kernel estimators are perhaps the most commonly used nonparametric density estimators; these provide smooth density estimates, and it is easy to take draws from a kernel density estimate. However, in their standard form and with finite samples, kernel density estimators are severely biased near boundaries of support; the existence of support boundaries also affects their rates



of convergence. Fortunately, several methods exist for repairing these problems near support boundaries; here, a reflection method similar to the idea of Schuster (1985) is used.⁴

⁴One could also use asymmetric boundary kernels, which change shape as one approaches the boundary of support; see Scott (1992) for examples. This would introduce additional complications, however, and the choice of methods is of second-order importance for the bootstrap procedures described here.

In univariate density estimation problems, the reflection method is very simple; one simply reflects the $n$ observations around a known boundary, estimates the density of the original and reflected data ($2n$ observations), and then truncates this density estimate at the boundary point. In the present context, however, the method is more complicated due to three factors:

    (i) the problem involves multiple (N+M) dimensions;

(ii) the boundary of support, $\Psi^{\partial}$, is unknown;

    (iii) the boundary is non-linear.

These problems can be solved by exploiting the radial nature of the efficiencies $\theta(x,y)$ and $\widehat{\theta}(x,y)$.

The third problem listed in the previous paragraph can be dealt with by transforming the problem from Cartesian coordinates to spherical coordinates, which are a combination of Cartesian and polar coordinates. In the input-orientation, the Cartesian coordinates for the input space are transformed to polar coordinates, while in the output-orientation, the Cartesian coordinates for the output space would be transformed to polar coordinates. The method is illustrated here for the input-orientation; once again, it is easy to adapt the method to the output-orientation.

For the input-orientation, the Cartesian coordinates $(x,y)$ are transformed to $(\omega, \eta, y)$, where $(\omega, \eta)$ are the polar coordinates of $x$ in $\mathbb{R}^N_+$. The modulus $\omega = \omega(x) \in \mathbb{R}^1_+$ of $x$ is given by the square root of the sum of the squared elements of $x$, and the $j$th element of the corresponding angle $\eta = \eta(x) \in \left[0, \frac{\pi}{2}\right]^{N-1}$ of $x$ is given by $\arctan(x_{j+1}/x_1)$ for $x_1 \neq 0$; if $x_1 = 0$, then all elements of $\eta(x)$ equal $\pi/2$.
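The transformation itself is a one-liner in each direction; the following sketch (our own function name, with $\omega$ and $\eta$ as just defined) makes the convention concrete:

```python
import numpy as np

def to_polar(x):
    """Map an input vector x in R^N_+ to its modulus and angles:
    omega = sqrt(sum(x_j^2)), eta_j = arctan(x_{j+1} / x_1)."""
    x = np.asarray(x, dtype=float)
    omega = np.sqrt(np.sum(x ** 2))               # modulus of x
    if x[0] == 0.0:
        eta = np.full(x.size - 1, np.pi / 2.0)    # convention when x_1 = 0
    else:
        eta = np.arctan(x[1:] / x[0])             # N-1 angles in [0, pi/2]
    return omega, eta
```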

The DGP is completely determined by $f(x,y)$, and hence is also completely determined by the density $f(\omega, \eta, y)$ after transforming from Cartesian coordinates to spherical coordinates. In order to generate sample points


in the spherical coordinate representation, the density $f(\omega, \eta, y)$ can be decomposed (using Bayes' rule) to obtain
$$f(\omega, \eta, y) = f(\omega \mid \eta, y)\, f(\eta \mid y)\, f(y), \qquad (4.72)$$
where each of the conditional densities is assumed to be well-defined. For a given $(y, \eta)$, the corresponding frontier point $x^{\partial}(y)$ is defined by (4.6); this point has modulus
$$\omega\left(x^{\partial}(y)\right) = \inf\left\{\omega \in \mathbb{R}_+ \mid f(\omega \mid y, \eta) > 0\right\}. \qquad (4.73)$$
Assumption 4.2.2 implies that for all $y$ and $\eta$, $f\left(\omega(x^{\partial}(y)) \mid y, \eta\right) > 0$. From the definition of the input efficiency score,
$$0 \leq \theta(x,y) = \frac{\omega\left(x^{\partial}(y)\right)}{\omega(x)} \leq 1. \qquad (4.74)$$
The density $f(\omega \mid y, \eta)$ on $\left[\omega(x^{\partial}(y)), \infty\right)$ induces a density $f(\theta \mid y, \eta)$ on $[0,1]$. This decomposition of the DGP is illustrated in Figure 4.4: the DGP generates an output level, according to the marginal $f(y)$; then, conditional on $y$, $f(\eta \mid y)$ generates an input mix in the input-requirement set $X(y)$; finally, the observation $P$ is randomly generated inside $\Psi$, from the frontier point, along the ray defined by $\eta$ and according to $f(\theta \mid \eta, y)$, the conditional density of the efficiency on $[0,1]$.

Once again, for practical purposes, it is easier to parameterize the input efficiency scores in terms of the Shephard (1970) input distance function as in (4.54) and (4.55). Working in terms of $\delta(x,y)$ and $\widehat{\delta}_{VRS}(x,y)$ has the advantage that $\delta(x,y) \geq 1$ for all $(x,y) \in \Psi$; consequently, when simulating the bootstrap values, there will be only one boundary condition for $\delta$ to deal with, not two as in the case of $\theta$ (in the output-orientation, efficiency scores $\lambda(x,y)$ are greater than one for all $(x,y) \in \Psi$, and so the reciprocal transformation is not needed). Using (4.54), (4.74) can be rewritten as
$$\delta(x,y) = \frac{\omega(x)}{\omega\left(x^{\partial}(y)\right)} \geq 1. \qquad (4.75)$$
Here, the density $f(\omega \mid y, \eta)$ on $\left[\omega(x^{\partial}(y)), \infty\right)$ induces a density $f(\delta \mid y, \eta)$ on $[1, \infty)$.

The idea of the smooth bootstrap in this general, multivariate framework is to generate sample points $(\delta^*_i, \eta^*_i, y^*_i)$, $i = 1,\ldots,n$ from a smooth estimate of the density $f(\delta, \eta, y)$. To accomplish this, first the original data


[Figure 4.4: Polar coordinates in the input space for a particular section $X(y)$. The plot of isoquants in the input space $(x_1, x_2)$ marks a point $P = (x,y) = (\omega, \eta, y)$, its frontier projection $Q = (x^{\partial}(x,y), y)$, and $\theta(x,y) = |OQ|/|OP| = \omega(x^{\partial}(x,y))/\omega(x)$.]

$(x_i, y_i) \in \mathcal{X}_n$ are transformed to polar coordinates $(\delta_i, \eta_i, y_i)$. Since the true frontier is unknown, $\delta_i$ is unknown, but $\delta_i$ can be replaced by its consistent DEA estimate $\widehat{\delta}_i = \widehat{\delta}_{VRS}(x_i, y_i)$. Estimation of the density is then based on the sample values $(\widehat{\delta}_i, \eta_i, y_i)$, $i = 1,\ldots,n$. Kernel smoothing techniques can be used, along with the reflection method to account for the boundary conditions (see Silverman, 1986). This step involves the choice of a smoothing parameter, i.e., the bandwidth (this issue is discussed below). Then, as a final step, the simulated observations $(\delta^*_i, \eta^*_i, y^*_i)$ must be transformed back to Cartesian coordinates $(x^*_i, y^*_i)$, $i = 1,\ldots,n$. This yields a bootstrap sample $\mathcal{X}^*_n$.

This last step itself involves several operations, including solution of an additional linear program:


(i) Let $\tilde{x}$ be any point in the $x$ space on the ray with angle $\eta^*_i$ (for instance, take $\tilde{x}_1 = 1$ and $\tilde{x}_{j+1} = \tan(\eta^*_{ij})$ for $j = 1,\ldots,N-1$; here the subscripts on $\tilde{x}$ denote individual elements of the vector $\tilde{x}$).

(ii) For the point $(\tilde{x}, y^*_i)$, compute the DEA estimator $\widehat{\delta}_{VRS}(\tilde{x}, y^*_i)$ using the reference set $\mathcal{X}_n$.

(iii) An estimator of the input-efficient level of inputs, given the output $y^*_i$ and the direction given by the input vector $\tilde{x}$, is obtained by adapting (4.6) to the bootstrap world:
$$\tilde{x}^{\partial}(y^*_i) = \frac{\tilde{x}}{\widehat{\delta}_{VRS}(\tilde{x}, y^*_i)}.$$

(iv) Compute
$$x^*_i = \delta^*_i\, \tilde{x}^{\partial}(y^*_i)$$
to obtain the Cartesian coordinates $x^*_i$; a sketch of these steps in code follows.
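The sketch below is our own helper, built on the dea_input_efficiency function of section 4.3.1 (whose value is $\widehat{\theta} = 1/\widehat{\delta}$), stringing the four steps together for a single simulated triple:

```python
import numpy as np

def polar_to_cartesian_input(delta_star, eta_star, y_star, X, Y):
    """Steps (i)-(iv): recover the Cartesian input vector x*_i from a
    simulated (delta*_i, eta*_i, y*_i), using the DEA frontier of the
    original sample (X, Y) along the ray with angle eta*_i."""
    x_tilde = np.r_[1.0, np.tan(eta_star)]               # (i) a point on the ray
    theta = dea_input_efficiency(x_tilde, y_star, X, Y)  # (ii) theta = 1/delta-hat
    x_frontier = theta * x_tilde                         # (iii) x-tilde / delta-hat
    return delta_star * x_frontier                       # (iv) project off frontier
```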

In the procedure above, the main difficulty lies in estimating $f(\delta, \eta, y)$ because of the bound on $\delta$. Simar and Wilson (2000b) reflected the points $(\widehat{\delta}_i, \eta_i, y_i)$ about the boundary characterized by the values $\delta_i = 1$; i.e., Simar and Wilson add the points $(2 - \widehat{\delta}_i, \eta_i, y_i)$, $i = 1,\ldots,n$ to the original observations in $\mathcal{X}_n$. This was also proposed by Kneip et al. (2003), where in addition the values $\widehat{\delta}_i$ are adjusted to smooth the DEA estimator of $\Psi$. These procedures are not easy to implement. So far, there seems to be no simple way to avoid the considerable complexity required in such a general framework.

    The homogeneous smoothed bootstrap

The bootstrap procedure described above can be substantially simplified if one is willing to make additional assumptions on the DGP. In particular, assume that the distribution of efficiency is homogeneous over the input-output space (this is analogous to an assumption of homoskedasticity in linear regression problems). In other words, assume that
$$f(\delta \mid \eta, y) = f(\delta). \qquad (4.76)$$

This may be a reasonable assumption in many practical situations. Wilson (2003) surveys tests for independence that may be used to check whether data might satisfy the assumption in (4.76).

Adoption of the homogeneity assumption in (4.76) makes the problem similar to the bootstrap one would use in homoskedastic regression models (see Bickel and Freedman, 1981), i.e., where the bootstrap is based on the residuals. In the present context, the residuals correspond to the distances $\widehat{\delta}_i = \widehat{\delta}_{VRS}(x_i, y_i)$, $i = 1,\ldots,n$, and so the problem becomes essentially univariate, avoiding the complexity of estimating a multivariate density for $(\widehat{\delta}_i, \eta_i, y_i)$. The idea is to create a bootstrap sample by projecting each observation $(x_i, y_i)$ onto the estimated frontier, and then projecting this point away from the frontier randomly, resulting in points $(\delta^*_i x_i / \widehat{\delta}_i,\; y_i)$, where $\delta^*_i$ is a draw from a smooth estimate of the marginal density $f(\delta)$ obtained from the sample of estimates $\widehat{\delta}_1, \ldots, \widehat{\delta}_n$. Hence standard, univariate kernel density estimation, combined with the reflection method, can be used.

The steps are summarized as follows (a code sketch follows the list):

(i) Let $\widehat{f}(\delta)$ be a smooth estimate obtained from the observed $\{\widehat{\delta}_i \mid i = 1,\ldots,n\}$, and draw a bootstrap sample $(\delta^*_i,\ i = 1,\ldots,n)$ from this density estimate.

(ii) Define the bootstrap sample $\mathcal{X}^*_n = \{(x^*_i, y_i);\ i = 1,\ldots,n\}$, where
$$x^*_i = \delta^*_i\, \widehat{x}^{\partial}(y_i) = \frac{\delta^*_i}{\widehat{\delta}_i}\, x_i.$$
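The following is a minimal sketch of these two steps under assumption (4.76), drawing from a Gaussian-kernel smooth of the $\widehat{\delta}_i$ with reflection at $\delta = 1$; the bandwidth h is taken as given here, and the variance-rescaling refinement of Simar and Wilson (1998) is omitted for brevity:

```python
import numpy as np

def homogeneous_bootstrap_sample(X, delta_hat, h, seed=None):
    """Generate X*_n = {(x*_i, y_i)} under the homogeneity assumption
    (4.76): draw delta*_i from a reflected kernel estimate of f(delta),
    then set x*_i = (delta*_i / delta_hat_i) * x_i."""
    rng = np.random.default_rng(seed)
    n = delta_hat.size
    base = rng.choice(delta_hat, size=n, replace=True)    # resample the delta-hats
    draw = base + h * rng.standard_normal(n)              # add kernel noise
    delta_star = np.where(draw >= 1.0, draw, 2.0 - draw)  # reflect at delta = 1
    X_star = (delta_star / delta_hat)[:, None] * X        # move along each ray
    return X_star, delta_star                             # outputs y_i unchanged
```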

The idea is illustrated in Figure 4.5, where the solid points are the original observations $(x_i, y_i)$, and the asterisks are the pseudo-observations $(x^*_i, y_i)$. For the point $P = (x,y)$, $\delta(x,y) = |OP|/|OQ|$, $\widehat{\delta}(x,y) = |OP|/|OQ_{DEA}|$, and $\widehat{\delta}^*(x,y) = |OP|/|OQ^*_{DEA}|$. The hope is that $\widehat{\delta}^*(x,y) - \widehat{\delta}(x,y)$ is distributed approximately as $\widehat{\delta}(x,y) - \delta(x,y)$.

It is easy to understand why a naive bootstrap technique, where the $\delta^*_i$ are drawn identically, uniformly, and with replacement from the set $\{\widehat{\delta}_i \mid i = 1,\ldots,n\}$, would be inconsistent. As observed by Simar and Wilson (1999a), this would yield
$$\mathrm{Prob}\left(\widehat{\delta}^*(x,y) = \widehat{\delta}(x,y) \mid \mathcal{X}_n\right) = 1 - (1 - n^{-1})^n > 0,$$
with
$$\lim_{n \rightarrow \infty} \mathrm{Prob}\left(\widehat{\delta}^*(x,y) = \widehat{\delta}(x,y) \mid \mathcal{X}_n\right) = 1 - e^{-1} \approx 0.632. \qquad (4.77)$$
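The limit in (4.77) is easy to verify numerically; the following illustration (ours) tabulates $1 - (1 - 1/n)^n$ for increasing $n$:

```python
import numpy as np

# 1 - (1 - 1/n)^n approaches 1 - exp(-1) ~ 0.6321 as n grows
for n in (10, 100, 1000, 10**6):
    print(n, 1.0 - (1.0 - 1.0 / n) ** n)
print("limit:", 1.0 - np.exp(-1.0))
```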


[Figure 4.5: The bootstrap idea. The plot of isoquants in the input space $(x_1, x_2)$ shows the section $X(y)$ and its DEA estimate $X_{DEA}(y)$, the bootstrap estimate $X^*_{DEA}(y)$, original observations $(x_i, y_i)$ and pseudo-observations (*), a point $P = (x,y)$, and its projections $Q$, $Q_{DEA}$, and $Q^*_{DEA}$.]

The naive bootstrap is inconsistent since there is absolutely no reason to believe that this probability should equal approximately 0.632, independently of any features of the DGP. In fact, if $f(\delta)$ is continuous over $[1, \infty)$, the probability in (4.77) should be zero, since in this case
$$\mathrm{Prob}\left(\widehat{\delta}^*(x,y) = \widehat{\delta}(x,y)\right) = 0.$$
Even if the true probability density function $f(\delta)$ had a mass at the boundary, there is no reason to believe this mass would be equal to 0.632.

The problem with the naive bootstrap arises from the fact that a continuous density $f(\delta)$ is approximated by a discrete density putting a mass $1/n$ at each observed $\widehat{\delta}_i$. The problem is avoided by estimating $f(\delta)$ by using a smooth kernel estimator $\widehat{f}_h(\delta)$ while accounting for the boundary condition


$\delta \geq 1$, i.e., by using a nonparametric kernel estimator with the reflection method. A description of how the procedure works follows.

    4.3.5 Practical Considerations for the Bootstrap

Estimation of $f(\delta)$:

The standard kernel density estimator of a density $f(\delta)$ evaluated at an arbitrary point $\delta$ is given by
$$\widehat{f}_h(\delta) = (nh)^{-1} \sum_{i=1}^{n} K\left(\frac{\delta - \widehat{\delta}_i}{h}\right), \qquad (4.78)$$
where $h$ is a smoothing parameter, or bandwidth, and $K(\cdot)$ is a kernel function satisfying $K(t) = K(-t)$, $\int K(t)\,dt = 1$, and $\int t K(t)\,dt = 0$. Any symmetric probability density function with mean zero satisfies these conditions. If a probability density function is used for $K(\cdot)$, then $\widehat{f}_h(\delta)$ can be viewed as the average of $n$ different densities $K$ centered on the observed points $\widehat{\delta}_i$, with $h$ playing the role of a scaling parameter. The bandwidth $h$ is used as a tuning parameter to control the dispersion of the $n$ densities.
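Combining (4.78) with the reflection device described above gives a boundary-corrected estimate of $f(\delta)$; the following is a minimal sketch with a Gaussian kernel (the function name and grid-based evaluation are our choices):

```python
import numpy as np

def f_h_reflected(grid, delta_hat, h):
    """Evaluate the reflected kernel estimate of f(delta) on a grid:
    pool the data with its reflection 2 - delta_hat about the
    boundary delta = 1, smooth as in (4.78), truncate below 1."""
    data = np.r_[delta_hat, 2.0 - delta_hat]          # original + reflected points
    u = (grid[:, None] - data[None, :]) / h           # standardized deviations
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel values
    f = 2.0 * K.sum(axis=1) / (data.size * h)         # = (nh)^{-1} sum of 2n terms
    return np.where(grid >= 1.0, f, 0.0)              # zero below the boundary
```

Because the reflected estimate is truncated at $\delta = 1$, the factor 2 folds the mass below the boundary back onto $[1, \infty)$, so the estimate still integrates to one.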

Two choices must be made in order to implement kernel density estimation: a kernel function and the bandwidth must be selected. The standard normal (Gaussian) probability density function is frequently used, although the Epanechnikov kernel given by
$$K(t) = \frac{3}{4}\left(1 - t^2\right) 1\!\mathrm{I}(|t| \leq 1) \qquad (4.79)$$
is optimal in the sense of minimizing the asymptotic mean integrated square error (AMISE).⁵

The choice of kernel function is of relatively minor importance; the choice of bandwidth has a far greater effect on the quality of the resulting estimator in terms of AMISE. A sensible choice of $h$ is determined, in part, by the following considerations:

• as $h$ becomes small, fewer observations (only those closest to the point where the density is estimated) influence the estimate $\widehat{f}_h(\delta)$;

⁵See Scott (1992) for details.
