
Bootstrap-Based Inference for Cube Root Asymptotics∗

Matias D. Cattaneo† Michael Jansson‡ Kenichi Nagasawa§

June 26, 2019

Abstract

This paper proposes a consistent bootstrap-based distributional approximation for cube root consistent and related estimators exhibiting a Chernoff (1964)-type limiting distribution. For estimators of this kind, the standard nonparametric bootstrap is inconsistent. Our method restores consistency of the nonparametric bootstrap by altering the shape of the criterion function defining the estimator whose distribution we seek to approximate. This modification leads to a generic and easy-to-implement resampling method for inference that is conceptually distinct from other available distributional approximations. We illustrate the applicability of our core idea with six canonical examples in statistics, machine learning, econometrics, and biostatistics. Simulation evidence is also provided.

Keywords: cube root asymptotics, bootstrapping, maximum score, empirical risk minimization, monotone density, monotone regression, current status model, shape restrictions.

∗A previous version of this paper circulated under the title “Bootstrap-Based Inference for Cube Root Consistent Estimators”. For comments and suggestions, we are grateful to Mehmet Caner, Andreas Hagemann, Kei Hirano, Bo Honoré, Guido Kuersteiner, Whitney Newey, and participants at various conferences, workshops, and seminars. Cattaneo gratefully acknowledges financial support from the National Science Foundation through grant SES-1459931, and Jansson gratefully acknowledges financial support from the National Science Foundation through grant SES-1459967 and the research support of CREATES (funded by the Danish National Research Foundation under grant no. DNRF78).
†Department of Operations Research and Financial Engineering, Princeton University.
‡Department of Economics, University of California at Berkeley and CREATES.
§Department of Economics, University of Warwick.


1 Introduction

In a seminal paper, Kim and Pollard (1990) studied estimators exhibiting “cube root asymptotics”. These estimators not only have a non-standard rate of convergence, but also have the property that rather than being Gaussian their limiting distributions are of Chernoff (1964) type; i.e., the limiting distribution is that of the maximizer of a Gaussian process. Estimators covered by Kim and Pollard's results include not only celebrated estimators such as the isotonic density estimator of Grenander (1956) in statistics and the maximum score estimator of Manski (1975) in econometrics, but also more contemporary estimators arising in examples related to classification problems in machine learning (Mohammadi and van de Geer, 2005), nonparametric inference under shape restrictions in statistics and biostatistics (Groeneboom and Jongbloed, 2018), the massive data M-estimation framework (Shi, Lu, and Song, 2018), and maximum score estimation in high-dimensional settings (Mukherjee, Banerjee, and Ritov, 2019). Moreover, Seo and Otsu (2018) recently generalized Kim and Pollard (1990) to allow for n-varying objective functions (n denotes the sample size), further widening the applicability of cube-root-type asymptotics in statistics, machine learning, econometrics, biostatistics, and other disciplines. For example, their results cover the conditional maximum score estimator of Honoré and Kyriazidou (2000) in panel data econometrics.

An important feature of Chernoff-type asymptotic distributional approximations is that the covariance kernel of the Gaussian process characterizing the limiting distribution often depends on an infinite-dimensional nuisance parameter. From the perspective of inference, this feature of the limiting distribution represents a nontrivial complication relative to the conventional asymptotically normal case, where the limiting distribution is known up to the value of a finite-dimensional nuisance parameter (namely, the covariance matrix of the limiting distribution). The dependence of the limiting distribution on an infinite-dimensional nuisance parameter implies that resampling-based distributional approximations seem to offer the most attractive approach to inference in estimation problems exhibiting cube root asymptotics. Unfortunately, however, the standard nonparametric bootstrap is well known to be invalid in this setting (e.g., Abrevaya and Huang, 2005; Léger and MacGibbon, 2006; Kosorok, 2008; Sen, Banerjee, and Woodroofe, 2010). The purpose of this paper is to propose a generic and easy-to-implement bootstrap-based distributional approximation applicable in the context of cube root asymptotics.


As does the familiar nonparametric bootstrap, the method proposed herein employs bootstrap samples of size n from the empirical distribution function. But unlike the nonparametric bootstrap, which is inconsistent, our method offers a consistent distributional approximation for estimators exhibiting cube root asymptotics, and therefore can be used to construct asymptotically valid inference procedures in such settings. Consistency is achieved by altering the shape of the criterion function defining the estimator whose distribution we seek to approximate. Heuristically, the method is designed to ensure that the bootstrap version of a certain empirical process has a mean resembling the large sample version of its population counterpart. The latter is quadratic in the problems we study, and known up to the value of a certain matrix. As a consequence, the only ingredient needed to implement the proposed “reshapement” of the objective function is a consistent estimator of the unknown matrix entering the quadratic mean of the empirical process. Such estimators turn out to be generically available and easy to compute.

This paper is not the first to propose a consistent resampling-based distributional approximation for cube-root-type estimators. For canonical cube root asymptotic problems, the best known consistent alternative to the nonparametric bootstrap is probably subsampling (Politis and Romano, 1994), whose applicability was pointed out by Delgado, Rodriguez-Poo, and Wolf (2001). Related methods are the rescaled bootstrap (Dümbgen, 1993) and the numerical bootstrap (Hong and Li, 2019), which can also be applied to this type of estimator. In addition, case-specific (smooth or non-standard) bootstrap methods have been proposed for leading examples such as monotone density estimation (Kosorok, 2008; Sen, Banerjee, and Woodroofe, 2010), maximum score estimation (Patra, Seijo, and Sen, 2018), and the current status model (Groeneboom and Hendrickx, 2018). For the more generic cube-root-type estimators analyzed in Seo and Otsu (2018), subsampling appears to be the only alternative available, and indeed the authors discuss in their concluding remarks the need for (and importance of) developing resampling methods based on the standard nonparametric bootstrap. Our paper appears to be the first to provide one such method.

Conceptually, each of the other resampling methods mentioned above (including ours) can be viewed as offering a “robust” alternative to the standard nonparametric bootstrap but, unlike ours, all other methods achieve consistency by modifying the distribution used to generate the bootstrap counterpart of the estimator whose distribution is being approximated. In contrast, our bootstrap-based method achieves consistency by means of an analytic modification to the objective function used to construct the bootstrap-based distributional approximation. Furthermore, this modification only requires estimating a finite-dimensional matrix, which can be interpreted as an analogue of a “standard error” estimator for the cube-root-type estimator. We discuss in detail how this matrix can be easily estimated in general using numerical derivatives, including consistency and an approximate mean squared error expansion of our proposed estimator. In addition, we illustrate how, exploiting example-specific features, other estimators can be constructed.

The paper proceeds as follows. Section 2 is heuristic in nature and serves the purpose of outlining the main idea underlying our approach in the M-estimation setting of Kim and Pollard (1990). Section 3 then makes the heuristics of Section 2 rigorous in a more general setting where the M-estimation problem is formed using a possibly n-varying objective function, as recently studied by Seo and Otsu (2018). We then illustrate our novel bootstrap-based inference methods with six distinct examples in statistics, machine learning, econometrics, and biostatistics settings. Specifically, Section 4 discusses three examples directly covered by our general results: the maximum score estimator of Manski (1975), the empirical risk minimization problem of Mohammadi and van de Geer (2005), and the conditional maximum score panel data estimator of Honoré and Kyriazidou (2000). Then, Section 5 presents three other examples that, while not in M-estimation form, can nonetheless be analyzed using our core ideas via the well-known “switching technique” (see Groeneboom and Jongbloed, 2014, 2018, for overviews): the monotone density estimator of Grenander (1956), the monotone regression estimator of Brunk (1958), and the distribution estimator proposed by Ayer, Brunk, Ewing, Reid, and Silverman (1955) for the current status model. Section 6 offers numerical evidence on our inference methods for two canonical examples: maximum score and monotone density estimation. To conclude, a methodological discussion of our results, including some potential avenues for future research, is provided in Section 7. Detailed derivations and proofs are collected in the online supplemental appendix.

2 Heuristics

Suppose θ0 ∈ Θ ⊆ R^d is an estimand admitting the characterization

θ0 = argmax_{θ∈Θ} M0(θ),  M0(θ) = E[m0(z, θ)],  (1)

where m0 is a known function, and where z is a random vector of which a random sample z1, . . . , zn is available. Studying estimation problems of this kind for non-smooth m0, Kim and Pollard (1990) gave conditions under which the M-estimator

θ̂n = argmax_{θ∈Θ} M̂n(θ),  M̂n(θ) = (1/n) ∑_{i=1}^n m0(zi, θ),

exhibits cube root asymptotics:

n^{1/3}(θ̂n − θ0) ⇝ argmax_{s∈R^d} {G0(s) + Q0(s)},  (2)

where ⇝ denotes weak convergence, G0 is a non-degenerate zero-mean Gaussian process with G0(0) = 0, and Q0(s) is a quadratic form given by

Q0(s) = −(1/2) s′V0 s,  V0 = −∂²M0(θ0)/∂θ∂θ′.

Whereas the matrix V0 governing the shape of Q0 is finite-dimensional, the covariance kernel of G0 in (2) typically involves infinite-dimensional unknown quantities. As a consequence, the limiting distribution of θ̂n tends to be more difficult to approximate than Gaussian distributions, implying in turn that basing inference on θ̂n is more challenging under cube root asymptotics than in the more familiar case where θ̂n is (√n-consistent and) asymptotically normally distributed.

As a candidate method of approximating the distribution of θ̂n, consider the nonparametric bootstrap. To describe it, let z∗1,n, . . . , z∗n,n denote a random sample from the empirical distribution of z1, . . . , zn and let the natural bootstrap analogue of θ̂n be denoted by

θ̂∗n = argmax_{θ∈Θ} M̂∗n(θ),  M̂∗n(θ) = (1/n) ∑_{i=1}^n m0(z∗i,n, θ).

Then, the nonparametric bootstrap estimator of P[θ̂n − θ0 ≤ ·] is given by P∗n[θ̂∗n − θ̂n ≤ ·], where P∗n denotes a probability computed under the bootstrap distribution conditional on the data. As is well documented, however, this estimator is inconsistent under cube root asymptotics (Abrevaya and Huang, 2005; Léger and MacGibbon, 2006; Kosorok, 2008; Sen, Banerjee, and Woodroofe, 2010).

For the purpose of giving a heuristic, yet constructive, explanation of the inconsistency of the nonparametric bootstrap, it is helpful to recall that a proof of (2) can be based on the representation

n^{1/3}(θ̂n − θ0) = argmax_{s∈R^d} {Ĝn(s) + Qn(s)},  (3)

where, for s such that θ0 + s n^{−1/3} ∈ Θ,

Ĝn(s) = n^{2/3}[M̂n(θ0 + s n^{−1/3}) − M̂n(θ0) − M0(θ0 + s n^{−1/3}) + M0(θ0)]  (4)

is a zero-mean random process, while

Qn(s) = n^{2/3}[M0(θ0 + s n^{−1/3}) − M0(θ0)]  (5)

is a non-random function that is correctly centered in the sense that argmax_{s∈R^d} Qn(s) = 0. In cases where m0 is non-smooth but M0 is smooth, Ĝn and Qn are usually asymptotically Gaussian and asymptotically quadratic, respectively, in the sense that

Ĝn(s) ⇝ G0(s)  (6)

and

Qn(s) → Q0(s).  (7)

Under regularity conditions ensuring among other things that the convergence in (6) and (7) is suitably uniform in s, (2) then follows from an application of a continuous mapping-type theorem for the argmax functional to the representation in (3).

Similarly to (3), the bootstrap analogue of θ̂n admits a representation of the form

n^{1/3}(θ̂∗n − θ̂n) = argmax_{s∈R^d} {Ĝ∗n(s) + Q̂n(s)},

where, for s such that θ̂n + s n^{−1/3} ∈ Θ,

Ĝ∗n(s) = n^{2/3}[M̂∗n(θ̂n + s n^{−1/3}) − M̂∗n(θ̂n) − M̂n(θ̂n + s n^{−1/3}) + M̂n(θ̂n)]

and

Q̂n(s) = n^{2/3}[M̂n(θ̂n + s n^{−1/3}) − M̂n(θ̂n)].

Under mild conditions, Ĝ∗n satisfies the following bootstrap counterpart of (6):

Ĝ∗n(s) ⇝P G0(s),  (8)

where ⇝P denotes conditional weak convergence in probability (defined as in van der Vaart and Wellner, 1996, Section 2.9). On the other hand, although Q̂n is non-random under the bootstrap distribution and satisfies argmax_{s∈R^d} Q̂n(s) = 0, it turns out that Q̂n(s) ↛P Q0(s) in general. In other words, the natural bootstrap counterpart of (7) typically fails and, as a partial consequence, so does the natural bootstrap counterpart of (2); that is, n^{1/3}(θ̂∗n − θ̂n) does not converge weakly (in probability) to argmax_{s∈R^d} {G0(s) + Q0(s)} in general.

To the extent that the implied inconsistency of the bootstrap can be attributed to the fact that the shape of Q̂n fails to replicate that of Qn, it seems plausible that a consistent bootstrap-based distributional approximation can be obtained by basing the approximation on

θ̃∗n = argmax_{θ∈Θ} M̃∗n(θ),  M̃∗n(θ) = (1/n) ∑_{i=1}^n m̃n(z∗i,n, θ),

where m̃n is a suitably “reshaped” version of m0 satisfying two properties. First, G̃∗n should be asymptotically equivalent to Ĝ∗n, where G̃∗n is the counterpart of Ĝ∗n associated with m̃n:

G̃∗n(s) = n^{2/3}[M̃∗n(θ̂n + s n^{−1/3}) − M̃∗n(θ̂n) − M̃n(θ̂n + s n^{−1/3}) + M̃n(θ̂n)].

Second, and most importantly, Q̃n should be asymptotically quadratic, where Q̃n is the counterpart of Q̂n associated with m̃n:

Q̃n(s) = n^{2/3}[M̃n(θ̂n + s n^{−1/3}) − M̃n(θ̂n)],  M̃n(θ) = (1/n) ∑_{i=1}^n m̃n(zi, θ).

Accordingly, let

m̃n(z, θ) = m0(z, θ) − M̂n(θ) − (1/2)(θ − θ̂n)′V̂n(θ − θ̂n),

where V̂n is an estimator of V0. Then

n^{1/3}(θ̃∗n − θ̂n) = argmax_{s∈R^d} {G̃∗n(s) + Q̃n(s)},

where, by construction, G̃∗n(s) = Ĝ∗n(s) and Q̃n(s) = −s′V̂n s/2. Because G̃∗n = Ĝ∗n, we have G̃∗n(s) ⇝P G0(s) whenever (8) holds. In addition, Q̃n(s) →P Q0(s) provided V̂n →P V0. As a consequence, it seems plausible that if V̂n →P V0, then

n^{1/3}(θ̃∗n − θ̂n) ⇝P argmax_{s∈R^d} {G0(s) + Q0(s)},

implying in particular that P∗n[θ̃∗n − θ̂n ≤ ·] is a consistent estimator of P[θ̂n − θ0 ≤ ·].

3 Main Result

When making the heuristics of Section 2 precise, we consider the more general situation where the estimator θ̂n is an approximate maximizer (with respect to θ ∈ Θ ⊆ R^d) of

M̂n(θ) = (1/n) ∑_{i=1}^n mn(zi, θ),

where mn is a known function, and where z1, . . . , zn is a random sample of a random vector z. This formulation of M̂n, which reduces to that of Section 2 when mn does not depend on n, is adopted in order to cover certain estimation problems where, rather than admitting a characterization of the form (1), the estimand θ0 admits the characterization

θ0 = argmax_{θ∈Θ} M0(θ),  M0(θ) = lim_{n→∞} Mn(θ),  Mn(θ) = E[mn(z, θ)].

In other words, the setting considered in this section is one where θ̂n approximately maximizes a function M̂n whose population counterpart Mn can be interpreted as a regularization (in the sense of Bickel and Li, 2006) of a function M0 whose maximizer θ0 is the object of interest. The additional flexibility (relative to the more traditional M-estimation setting of Section 2) afforded by the present setting is attractive because it allows us to formulate results that cover local M-estimators such as the conditional maximum score estimator of Honoré and Kyriazidou (2000). Studying this type of setting, Seo and Otsu (2018) gave conditions under which θ̂n converges at a rate equal to the cube root of the “effective” sample size and has a limiting distribution of Chernoff (1964) type. Analogous conclusions will be drawn below, albeit under slightly different conditions.

For any n and any δ > 0, define Mn = {mn(·, θ) : θ ∈ Θ}, m̄n(z) = sup_{m∈Mn} |m(z)|, Θ0^δ = {θ ∈ Θ : ‖θ − θ0‖ ≤ δ}, Dn^δ = {mn(·, θ) − mn(·, θ0) : θ ∈ Θ0^δ}, and d̄n^δ(z) = sup_{d∈Dn^δ} |d(z)|.

Condition CRA (Cube Root Asymptotics) For a positive qn with rn = (n qn)^{1/3} → ∞, the following are satisfied:

(i) {Mn : n ≥ 1} is uniformly manageable for the envelopes m̄n and qn E[m̄n(z)²] = O(1). Also, sup_{θ∈Θ} |Mn(θ) − M0(θ)| = o(1) and, for every δ > 0, sup_{θ∈Θ\Θ0^δ} M0(θ) < M0(θ0).

(ii) θ0 is an interior point of Θ and, for some δ > 0, M0 and Mn are twice continuously differentiable on Θ0^δ and sup_{θ∈Θ0^δ} ‖∂²[Mn(θ) − M0(θ)]/∂θ∂θ′‖ = o(1). Also, rn ‖∂Mn(θ0)/∂θ‖ = o(1) and V0 = −∂²M0(θ0)/∂θ∂θ′ is positive definite.

(iii) For some δ > 0, {Dn^{δ′} : n ≥ 1, 0 < δ′ ≤ δ} is uniformly manageable for the envelopes d̄n^{δ′} and qn sup_{0<δ′≤δ} E[d̄n^{δ′}(z)²/δ′] = O(1).

(iv) For every positive δn with δn = O(rn^{−1}), qn² E[d̄n^{δn}(z)³] + qn³ rn^{−1} E[d̄n^{δn}(z)⁴] = o(1), and, for all s, t ∈ R^d and for some C0 with C0(s, s) + C0(t, t) − 2C0(s, t) > 0 for s ≠ t,

sup_{θ∈Θ0^{δn}} |qn E[{mn(z, θ + δn s) − mn(z, θ)}{mn(z, θ + δn t) − mn(z, θ)}/δn] − C0(s, t)| = o(1).

(v) For every positive δn with δn = O(rn^{−1}),

lim_{C→∞} lim sup_{n→∞} sup_{0<δ≤δn} qn E[1{qn d̄n^δ(z) > C} d̄n^δ(z)²/δ] = 0

and sup_{θ,θ′∈Θ0^{δn}} E[|mn(z, θ) − mn(z, θ′)|]/‖θ − θ′‖ = O(1).

To interpret Condition CRA, consider first the benchmark case where mn = m0 and qn = 1. In this case, the condition is similar to, but slightly stronger than, assumptions (ii)-(vii) of the main theorem of Kim and Pollard (1990), to which the reader is referred for a definition of the term (uniformly) manageable. The most notable differences between Condition CRA and the assumptions employed by Kim and Pollard (1990) are probably that part (iv) contains assumptions about moments of orders three and four, that the displayed part of (iv) is a locally uniform (with respect to θ near θ0) version of its counterpart in Kim and Pollard (1990), and that part (i) can be thought of as replacing the high-level condition θ̂n →P θ0 of Kim and Pollard (1990) with more primitive conditions that imply it for approximate M-estimators. In all three cases, the purpose of strengthening the assumptions of Kim and Pollard (1990) is to be able to analyze the bootstrap.

More generally, Condition CRA can be interpreted as an n-varying version of a suitably (for the purpose of analyzing the bootstrap) strengthened version of the assumptions of Kim and Pollard (1990). The differences between Condition CRA and the i.i.d. version of the conditions in Seo and Otsu (2018) seem mostly technical in nature, but for completeness we highlight two here. First, to handle dependent data Seo and Otsu (2018) control the complexity of various function classes using the concept of bracketing entropy. In contrast, because we assume random sampling we can follow Kim and Pollard (1990) and obtain maximal inequalities using bounds on uniform entropy numbers implied by the concept of (uniform) manageability. Second, whereas Seo and Otsu (2018) control the bias of θ̂n through a locally uniform bound on Mn − M0, Condition CRA controls the bias through the first and second derivatives of Mn − M0.

Under Condition CRA, the effective sample size is given by n qn = rn³. In perfect agreement with Seo and Otsu (2018), if θ̂n is an approximate maximizer of M̂n, then

rn(θ̂n − θ0) ⇝ argmax_{s∈R^d} {G0(s) + Q0(s)},  (9)

where G0 is a zero-mean Gaussian process with G0(0) = 0 and covariance kernel C0 and where Q0(s) = −s′V0 s/2. The heuristics of the previous section are rate-adaptive, so once again it stands to reason that if V̂n is a consistent estimator of V0, then a consistent distributional approximation can be based on an approximate maximizer θ̃∗n of

M̃∗n(θ) = (1/n) ∑_{i=1}^n m̃n(z∗i,n, θ),  m̃n(z, θ) = mn(z, θ) − M̂n(θ) − (1/2)(θ − θ̂n)′V̂n(θ − θ̂n),

where z∗1,n, . . . , z∗n,n is a random sample from the empirical distribution of z1, . . . , zn.

Following van der Vaart (1998, Chapter 23), we say that our bootstrap-based estimator of the distribution of rn(θ̂n − θ0) is consistent if

sup_{t∈R^d} |P∗n[rn(θ̃∗n − θ̂n) ≤ t] − P[rn(θ̂n − θ0) ≤ t]| →P 0.  (10)

Because the limiting distribution in (9) is continuous, this consistency property implies consistency of bootstrap-based confidence intervals (e.g., van der Vaart, 1998, Lemma 23.3). Moreover, continuity of the limiting distribution implies that (10) holds provided the estimator θ̃∗n satisfies the following bootstrap counterpart of (9):

rn(θ̃∗n − θ̂n) ⇝P argmax_{s∈R^d} {G0(s) + Q0(s)}.

Theorem 1, our main result, gives sufficient conditions for this to occur.

Theorem 1 Suppose Condition CRA holds. If V̂n →P V0 and if

M̂n(θ̂n) ≥ sup_{θ∈Θ} M̂n(θ) − oP(rn^{−2}),  M̃∗n(θ̃∗n) ≥ sup_{θ∈Θ} M̃∗n(θ) − oP(rn^{−2}),

then (10) holds.

Using the notation and conditions in Theorem 1, the algorithm for our proposed bootstrap-based distributional approximation is as follows:

Step 1. Using the sample z1, . . . , zn, compute M̂n(θ) and let θ̂n be an approximate maximizer thereof.

Step 2. Using θ̂n and z1, . . . , zn, compute V̂n. (A generic estimator V̂n is given in the next subsection, and example-specific estimators are discussed in Section 4.)

Step 3. Using θ̂n, V̂n, and the bootstrap sample z∗1,n, . . . , z∗n,n, compute M̃∗n(θ) and let θ̃∗n be an approximate maximizer thereof. (θ̂n and V̂n are not recomputed at this step.)

Step 4. Repeat Step 3 B times and compute the empirical distribution of rn(θ̃∗n − θ̂n) using the B bootstrap replications.
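To fix ideas, the following Python sketch implements Steps 3 and 4 in the benchmark case qn = 1 (so that rn = n^{1/3}). It is a minimal illustration rather than part of our formal results: the names m0, maximize, and V_hat are placeholders for the problem-specific criterion, a user-supplied approximate maximizer, and the Step 2 estimate, respectively.

import numpy as np

def reshaped_bootstrap(z, m0, theta_hat, V_hat, maximize, B=2000, seed=None):
    # Steps 3-4: draw B bootstrap samples, maximize the reshaped criterion,
    # and return the B rescaled differences r_n(theta-tilde*_n - theta-hat_n).
    rng = np.random.default_rng(seed)
    n = z.shape[0]

    def M_hat(theta):  # sample criterion: average of m0(z_i, theta)
        return np.mean(m0(z, theta))

    draws = np.empty((B, np.size(theta_hat)))
    for b in range(B):
        zb = z[rng.integers(n, size=n)]  # bootstrap sample z*_{1,n}, ..., z*_{n,n}

        def M_tilde_star(theta):  # reshaped criterion from Section 2
            d = theta - theta_hat
            return np.mean(m0(zb, theta)) - M_hat(theta) - 0.5 * d @ V_hat @ d

        draws[b] = maximize(M_tilde_star)  # approximate maximizer (Step 3)
    return n ** (1 / 3) * (draws - theta_hat)

Quantiles of the returned draws can then be used in the usual way, for example to form percentile-type confidence intervals for θ0.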


3.1 Generic Estimation of V0

To implement the bootstrap-based approximation to the distribution of rn(θ̂n − θ0), only a consistent estimator of V0 is needed. A generic estimator based on numerical derivatives is the matrix V̂n^ND with element (k, l) given by

V̂n,kl^ND = −(1/(4εn²))[M̂n(θ̂n + ek εn + el εn) − M̂n(θ̂n + ek εn − el εn) − M̂n(θ̂n − ek εn + el εn) + M̂n(θ̂n − ek εn − el εn)],

where ek is the kth unit vector in R^d and where εn is a positive (vanishing) tuning parameter. Conditions under which this estimator is consistent are given in the following lemma.

Lemma 1 Suppose Condition CRA holds and that rn(θ̂n − θ0) = OP(1). If εn → 0 and if rn εn → ∞, then V̂n^ND →P V0.

Plausibility of the high-level condition rn(θ̂n − θ0) = OP(1) follows from (9). More generally, if only consistency is assumed on the part of θ̂n, then V̂n^ND →P V0 holds provided εn → 0 and ‖θ̂n − θ0‖/εn →P 0.
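For concreteness, a minimal Python sketch of V̂n^ND follows; the sample criterion M_hat and the step size eps are assumed supplied by the user.

import numpy as np

def V_nd(M_hat, theta_hat, eps):
    # Element (k, l) is the four-point second difference of M_hat at theta_hat,
    # exactly as in the display defining the numerical derivative estimator.
    d = np.size(theta_hat)
    e = np.eye(d)
    V = np.empty((d, d))
    for k in range(d):
        for l in range(d):
            V[k, l] = -(M_hat(theta_hat + eps * (e[k] + e[l]))
                        - M_hat(theta_hat + eps * (e[k] - e[l]))
                        - M_hat(theta_hat - eps * (e[k] - e[l]))
                        + M_hat(theta_hat - eps * (e[k] + e[l]))) / (4 * eps ** 2)
    return V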

For practical implementation, we also develop a Nagar-type mean squared error (MSE) expansion for V̂n^ND under additional, mild regularity conditions.

Lemma 2 Suppose the conditions of Lemma 1 hold and that, for some δ > 0, M0 and Mn are four times continuously differentiable on Θ0^δ with sup_{θ∈Θ0^δ} |∂⁴[Mn(θ) − M0(θ)]/∂θj1∂θj2∂θj3∂θj4| = o(1) for all j1, j2, j3, j4 ∈ {1, 2, · · · , d}. If C0(s, −s) = 0 and if C0(s, t) = C0(−s, −t) for all s, t ∈ R^d, then V̂n^ND admits an approximation V̄n^ND satisfying

V̂n^ND − V̄n^ND = oP(rn^{−3/2} εn^{−3/2}) + o(εn²) + O(rn^{−1})

and

E[‖V̄n^ND − Vn‖²] = εn⁴ (∑_{k=1}^d ∑_{l=1}^d Bkl²) + rn^{−3} εn^{−3} (∑_{k=1}^d ∑_{l=1}^d Vkl) + o(εn⁴ + rn^{−3} εn^{−3}),

where Vn = −∂²Mn(θ0)/∂θ∂θ′ and where

Bkl = −(1/6)(∂⁴M0(θ0)/∂θk³∂θl + ∂⁴M0(θ0)/∂θk∂θl³),

Vkl = (1/8)[C0(ek + el, ek + el) + C0(ek − el, ek − el) − 2C0(ek + el, ek − el) − 2C0(ek + el, −ek + el)].

Lemma 2 goes beyond consistency (Lemma 1) and develops an approximate MSE expansion for V̂n^ND, which can be used to construct optimal choices for the numerical derivative step size εn. Specifically, the approximate MSE can be minimized by choosing εn proportional to rn^{−3/7}, the optimal factor of proportionality being a functional of the covariance kernel C0 and the fourth-order derivatives of M0 evaluated at θ0:

εn^AMSE = ((3 ∑_{k=1}^d ∑_{l=1}^d Vkl)/(4 ∑_{k=1}^d ∑_{l=1}^d Bkl²))^{1/7} rn^{−3/7}.

As the notation suggests, the constants Bkl and Vkl correspond to the bias and variance of element (k, l) of V̂n^ND, so the element-wise asymptotic MSE-optimal tuning parameter choice for V̂n,kl^ND is

εn,kl^AMSE = (3Vkl/(4Bkl²))^{1/7} rn^{−3/7}.

In the supplemental appendix, Section A.7 discusses plug-in implementations of εn^AMSE and εn,kl^AMSE, while Section A.3 discusses alternative generic estimators of V0 based on kernel smoothing. Finally, Section 6 below explores the numerical performance of the feasible implementation of the MSE-optimal choice of εn in the context of two distinct examples.
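In code, the two rules are one-liners once estimates of the constants are available; a sketch follows, with B and V the d × d arrays of estimated constants B̂kl and V̂kl (obtained, for instance, by plug-in as discussed in Section A.7).

import numpy as np

def eps_amse(B, V, r_n):
    # Common AMSE-optimal step size from the first display above.
    return (3 * np.sum(V) / (4 * np.sum(B ** 2))) ** (1 / 7) * r_n ** (-3 / 7)

def eps_amse_kl(B, V, r_n):
    # Element-wise AMSE-optimal step sizes from the second display above.
    return (3 * V / (4 * B ** 2)) ** (1 / 7) * r_n ** (-3 / 7)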

4 Examples

This section illustrates our proposed bootstrap-based distributional approximation with three examples. The first two are M-estimation problems where mn does not change with n, while the last example corresponds to a situation where mn depends on n.

4.1 Maximum Score

To describe a version of the maximum score estimator of Manski (1975), suppose z1, . . . , zn is a random sample of z = (y, x′)′ generated by the binary response model

y = 1(β0′x + u ≥ 0),  Median(u|x) = 0,

where β0 ∈ R^{d+1} is an unknown parameter of interest, x ∈ R^{d+1} is a vector of covariates, and u is an unobserved error term. Following Abrevaya and Huang (2005), we employ the parameterization β0 = (1, θ0′)′, where θ0 ∈ R^d is unknown. In other words, we assume that the first element of β0 is positive and then normalize the (unidentified) scale of β0 by setting its first element equal to unity. Partitioning x conformably with β0 as x = (x1, x2′)′, a maximum score estimator of θ0 ∈ Θ ⊆ R^d is any θ̂n^MS satisfying

M̂n^MS(θ̂n^MS) ≥ sup_{θ∈Θ} M̂n^MS(θ) − oP(n^{−2/3}),  M̂n^MS(θ) = (1/n) ∑_{i=1}^n m^MS(zi, θ),

where m^MS(z, θ) = (2y − 1) 1(x1 + θ′x2 ≥ 0).

Regarded as a member of the class of M-estimators exhibiting cube root asymptotics, the maximum score estimator is representative in a couple of respects. First, under easy-to-interpret primitive conditions the estimator is covered by the results of Section 3. Second, in addition to the generic estimator V̂n^ND discussed above, the maximum score estimator admits example-specific consistent estimators of the V0 associated with it. In what follows, we briefly illustrate both points; for omitted details, see Section A.4.1 of the supplemental appendix.

To state primitive conditions for Condition CRA, let F_{u|x1,x2} and F_{x1|x2} denote the conditional distribution function of u given (x1, x2) and of x1 given x2, respectively.

Condition MS The following are satisfied:

(i) 0 < P(y = 1|x) < 1 almost surely and F_{u|x1,x2}(u|x1, x2) is differentiable in u and x1 with bounded and continuous derivatives.

(ii) The support of x is not contained in any proper linear subspace of R^{d+1}, E[‖x2‖²] < ∞, and, conditional on x2, x1 has everywhere positive Lebesgue density. Also, F_{x1|x2}(x1|x2) is twice differentiable in x1 with bounded and continuous derivatives.

(iii) The set Θ is compact and θ0 is an interior point of Θ.

(iv) M^MS is twice continuously differentiable near θ0 and V^MS = −∂²M^MS(θ0)/∂θ∂θ′ is positive definite, where M^MS(θ) = E[m^MS(z, θ)].

Under Condition MS, θ̂n^MS satisfies (2) with V0 = V^MS and

C0(s, t) = C^MS(s, t) = E[f_{x1|x2}(−x2′θ0|x2) min{|x2′s|, |x2′t|} 1(sgn(x2′s) = sgn(x2′t))],

where f_{x1|x2} denotes the conditional density of x1 given x2. Indeed, Condition CRA is satisfied (with qn = 1), so Theorem 1 is applicable. To state a maximum score version of that result, let z∗1,n, . . . , z∗n,n be a random sample from the empirical distribution of z1, . . . , zn and suppose θ̃n^{MS,∗} satisfies

M̃n^{MS,∗}(θ̃n^{MS,∗}) ≥ sup_{θ∈Θ} M̃n^{MS,∗}(θ) − oP(n^{−2/3}),  M̃n^{MS,∗}(θ) = (1/n) ∑_{i=1}^n m̃n^MS(z∗i,n, θ),

where, for some estimator V̂n of V^MS,

m̃n^MS(z, θ) = m^MS(z, θ) − M̂n^MS(θ) − (1/2)(θ − θ̂n^MS)′V̂n(θ − θ̂n^MS).

Corollary MS If Condition MS holds and if V̂n →P V^MS, then

sup_{t∈R^d} |P∗n[n^{1/3}(θ̃n^{MS,∗} − θ̂n^MS) ≤ t] − P[n^{1/3}(θ̂n^MS − θ0) ≤ t]| →P 0.

Because Condition CRA is satisfied, it follows from Lemma 1 that the consistency requirement of the corollary is satisfied by the numerical derivative estimator V̂n^ND discussed in Section 3.1 provided the tuning parameter εn satisfies εn → 0 and nεn³ → ∞. Moreover, the other assumptions of Lemma 2 are mild additional regularity conditions in this example, so an MSE-optimal tuning parameter choice is feasible. In addition to the numerical derivative estimator, alternative consistent estimators of V^MS can be constructed by exploiting the specific structure of this example. One option is to employ a “plug-in” estimator of

V^MS = 2E[f_{u|x1,x2}(0| − x2′θ0, x2) f_{x1|x2}(−x2′θ0|x2) x2x2′],

where f_{u|x1,x2} denotes the conditional density of u given (x1, x2). To implement this estimator, nonparametric estimators of the conditional densities f_{u|x1,x2} and f_{x1|x2} are required. A simpler example-specific estimator is

V̂n^MS = −(1/n) ∑_{i=1}^n (2yi − 1) K̇n(x1i + θ′x2i) x2i x2i′ |_{θ=θ̂n^MS},

where, for a kernel function K and a bandwidth hn, K̇n(u) = dKn(u)/du and Kn(u) = K(u/hn)/hn. In words, V̂n^MS is constructed by “smoothing out” the indicator function entering m^MS(z, θ) and then differentiating the corresponding (regularized) objective function twice. Like the numerical derivative estimator, the estimator V̂n^MS is consistent under mild conditions on its tuning parameters (hn and K); for details, see the supplemental appendix (Section A.4.1, Lemma MS), which also reports a valid approximate MSE expansion useful for selecting the bandwidth in an optimal way.
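For instance, with a Gaussian kernel K = φ (an illustrative choice; any kernel permitted by Lemma MS would do), V̂n^MS can be computed in a few lines of Python, where y takes values in {0, 1}, x1 is the n-vector of first covariates, and x2 is the n × d matrix of remaining covariates:

import numpy as np

def V_ms(y, x1, x2, theta_hat, h):
    # Kernel-smoothed estimator: -(1/n) sum (2y_i - 1) Kdot_n(x1_i + theta'x2_i) x2_i x2_i',
    # with Kdot_n(u) = K'(u/h)/h^2 and K the standard Gaussian density.
    u = (x1 + x2 @ theta_hat) / h
    Kdot = -u * np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi) / h ** 2
    w = -(2 * y - 1) * Kdot
    return (x2 * w[:, None]).T @ x2 / len(y)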

Section 6 below reports results of a Monte Carlo experiment evaluating the performance of our proposed bootstrap-based inference procedure for this example.

4.2 Empirical Risk Minimization

Mohammadi and van de Geer (2005) consider two-category classification problems in machine learning. Specifically, given a binary outcome y ∈ {−1, 1} and a vector of features x ∈ X, the goal is to estimate the θ0 that minimizes the misclassification error (or risk) P[hθ(x) ≠ y] with respect to θ ∈ Θ ⊆ R^d, where {hθ : θ ∈ Θ} is a collection of classifiers. For simplicity, here we follow Mohammadi and van de Geer (2005, Section 2.1) and consider the case where the feature is univariate with support X = [0, 1] and the classifiers are of the form

hθ(x) = ∑_{ℓ=1}^d (−1)^ℓ 1(θ_{ℓ−1} ≤ x < θ_ℓ),  θ = (θ1, θ2, · · · , θd)′,

where

Θ = {(θ1, θ2, · · · , θd)′ ∈ [0, 1]^d : 0 = θ0 ≤ θ1 ≤ · · · ≤ θd ≤ θ_{d+1} = 1}.

In Section A.4.2 of the supplemental appendix we also analyze a case where x is vector-valued and the collection H includes non-linear classifiers, and we present additional results for this example.

As an estimator of θ0, we consider the empirical risk minimizer θ̂n^ERM satisfying

M̂n^ERM(θ̂n^ERM) ≥ sup_{θ∈Θ} M̂n^ERM(θ) − oP(n^{−2/3}),  M̂n^ERM(θ) = (1/n) ∑_{i=1}^n m^ERM(zi, θ),

where m^ERM(z, θ) = −1(hθ(x) ≠ y) and z1, . . . , zn is a random sample of z = (y, x)′. When analyzing θ̂n^ERM, we follow Mohammadi and van de Geer (2005, Theorem 1) and impose the following primitive conditions.

Condition ERM The following are satisfied:

(i) P(0) < 1/2 and P admits a continuous derivative p in a neighborhood of each element of θ0, where P(x) = P[y = 1|x]. Also, P[hθ(x) ≠ y] is continuous with respect to θ.

(ii) The distribution function of x, denoted by F, is absolutely continuous and its Lebesgue density f is continuously differentiable in a neighborhood of each element of θ0.

(iii) θ0 is an interior point of Θ.

(iv) θ0 = (θ0,1, θ0,2, · · · , θ0,d)′ is the unique minimizer of P[hθ(x) ≠ y] and p(θ0,ℓ)f(θ0,ℓ) ≠ 0 for ℓ ∈ {1, · · · , d}.

Under Condition ERM, θ̂n^ERM satisfies (2) with

V0 = V^ERM = 2 diag(p(θ0,1)f(θ0,1), −p(θ0,2)f(θ0,2), . . . , (−1)^{d+1} p(θ0,d)f(θ0,d))

and

C0(s, t) = C^ERM(s, t) = ∑_{ℓ=1}^d f(θ0,ℓ) min{|sℓ|, |tℓ|} 1{sgn(sℓ) = sgn(tℓ)}

for s = (s1, · · · , sd)′ and t = (t1, · · · , td)′. Indeed, Condition CRA is satisfied (with qn = 1), so Theorem 1 is applicable. To state an empirical risk minimization version of that result, let z∗1,n, . . . , z∗n,n be a random sample from the empirical distribution of z1, . . . , zn and suppose θ̃n^{ERM,∗} satisfies

M̃n^{ERM,∗}(θ̃n^{ERM,∗}) ≥ sup_{θ∈Θ} M̃n^{ERM,∗}(θ) − oP(n^{−2/3}),  M̃n^{ERM,∗}(θ) = (1/n) ∑_{i=1}^n m̃n^ERM(z∗i,n, θ),

where, for some estimator V̂n of V^ERM,

m̃n^ERM(z, θ) = m^ERM(z, θ) − M̂n^ERM(θ) − (1/2)(θ − θ̂n^ERM)′V̂n(θ − θ̂n^ERM).

Corollary ERM If Condition ERM holds and if V̂n →P V^ERM, then

sup_{t∈R^d} |P∗n[n^{1/3}(θ̃n^{ERM,∗} − θ̂n^ERM) ≤ t] − P[n^{1/3}(θ̂n^ERM − θ0) ≤ t]| →P 0.

As in the maximum score example, the consistency requirement on V̂n can be met in various ways. First, the generic numerical derivative estimator of Section 3.1 can be used directly. Second, a “plug-in” estimator of V^ERM can be constructed with the help of nonparametric estimators of p and f. Finally, as explained in Section A.4.2 of the supplemental appendix, kernel smoothing ideas similar to those used to construct V̂n^MS can be used to obtain an estimator of V^ERM.
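To fix ideas, the classifier and the sample criterion in this example take only a few lines of Python; the sketch below evaluates the displayed sum literally (so that hθ vanishes on [θd, 1]):

import numpy as np

def h_theta(x, theta):
    # Piecewise-constant classifier: value (-1)^l on [theta_{l-1}, theta_l), l = 1, ..., d.
    knots = np.concatenate(([0.0], np.asarray(theta)))
    idx = np.searchsorted(knots, x, side="right")  # interval index l
    return np.where(idx <= len(theta), (-1.0) ** idx, 0.0)

def M_erm(theta, x, y):
    # Sample criterion: minus the empirical misclassification rate.
    return -np.mean(h_theta(x, theta) != y)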

4.3 Conditional Maximum Score

Consider the dynamic binary response model

Yt = 1(β0′Xt + γ0 Y_{t−1} + α + ut ≥ 0),  t = 1, 2, 3,

where β0 ∈ R^d and γ0 ∈ R are unknown parameters of interest, α is an unobserved (time-invariant) individual-specific effect, and ut is an unobserved error term. Honoré and Kyriazidou (2000) analyzed this model and gave conditions under which β0 and γ0 are identified up to a common scale factor. Assuming these conditions hold and assuming the first element of β0 is positive, we can normalize that element to unity and employ the parameterization (β0′, γ0)′ = (1, θ0′)′, where θ0 ∈ R^d is unknown.

To describe a version of the conditional maximum score estimator of Honoré and Kyriazidou (2000), partition Xt after the first element as Xt = (X1t, X2t′)′ and define z = (y, x1, x2′, w′)′, where y = Y2 − Y1, x1 = X12 − X11, x2 = ((X22 − X21)′, Y3 − Y0)′, and w = X2 − X3. Assuming z1, . . . , zn is a random sample of z, a conditional maximum score estimator of θ0 ∈ Θ ⊆ R^d is any θ̂n^CMS approximately maximizing

M̂n^CMS(θ) = (1/n) ∑_{i=1}^n mn^CMS(zi, θ),  mn^CMS(z, θ) = y 1(x1 + θ′x2 ≥ 0) κn(w),

where, for a kernel function κ and a bandwidth bn, κn(w) = κ(w/bn)/bn^d.
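A direct implementation of M̂n^CMS is again short; the Python sketch below uses a product Gaussian kernel for κ as an illustrative choice, with w stored as an array with one row per observation:

import numpy as np

def M_cms(theta, y, x1, x2, w, b_n):
    # kappa_n(w_i) = kappa(w_i / b_n) / b_n^d with a product Gaussian kappa.
    d_w = w.shape[1]
    kappa = np.exp(-0.5 * np.sum((w / b_n) ** 2, axis=1)) / (2 * np.pi) ** (d_w / 2)
    return np.mean(y * (x1 + x2 @ theta >= 0) * kappa / b_n ** d_w)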

Through its dependence on bn, the function mn^CMS depends on n in a non-negligible way. In particular, the effective sample size is n bn^d (rather than n) in the current setting, so to the extent that they exist one would expect primitive sufficient conditions for Condition CRA to include qn = bn^d in this example. Apart from this predictable change, the properties of the conditional maximum score estimator θ̂n^CMS turn out to be qualitatively similar to those of θ̂n^MS. To be specific, under regularity conditions, the conditional maximum score estimator is covered by the results of Section 3 and an example-specific alternative to the generic numerical derivative estimator is available. We outline all details in Section A.4.3 of the supplemental appendix to avoid a tediously long list of regularity conditions and other untidy notation.

5 Nonparametric Estimation under Shape Constraints

The monotone density estimator of Grenander (1956) is another well-known estimator that exhibits cube root asymptotics and has the feature that the standard bootstrap-based approximation to its distribution is known to be inconsistent (e.g., Kosorok, 2008; Sen, Banerjee, and Woodroofe, 2010). In turn, the Grenander estimator can be viewed as a prototypical example of a nonparametric estimator respecting a shape constraint (e.g., Groeneboom and Jongbloed, 2014, 2018), two other well-known examples being the isotonic regression estimator of Brunk (1958) and the Ayer, Brunk, Ewing, Reid, and Silverman (1955) estimator of the distribution function in the current status model. These estimators are not M-estimators and are therefore not covered by the results of the earlier sections. Nevertheless, we show in this section that in all three examples the idea of reshaping can be used to achieve consistency of bootstrap-based distributional approximations.

Before turning to the specific examples, we outline the general approach that we take. Suppose the (scalar) estimator θ̂n of the estimand θ0 satisfies a switch relation of the form

n^{1/3}(θ̂n − θ0) ≤ t ⇔ argmax_{s∈Ŝn} {Ĝn(s) + t L̂n(s) + Qn(s)} ≤ 0,  (11)

where the set Ŝn, the processes Ĝn, L̂n, and the function Qn satisfy

1(s ∈ Ŝn) →P 1(s ∈ R) = 1,  Ĝn(s) ⇝ γ0 W(s),  L̂n(s) →P −λ0 s,  (12)

and

Qn(s) → −(1/2) χ0 s²,  (13)

respectively, with W denoting a two-sided Wiener process with W(0) = 0 and with γ0, λ0, and χ0 being positive constants. It then stands to reason that

P[n^{1/3}(θ̂n − θ0) ≤ t] → P[argmax_{s∈R} {γ0 W(s) − t λ0 s − (1/2) χ0 s²} ≤ 0],

a result that can be stated more compactly as

n^{1/3}(θ̂n − θ0) ⇝ ω0 argmax_{s∈R} {W(s) − s²},  ω0 = (4 χ0 γ0²/λ0³)^{1/3},  (14)

where the equivalence of the two formulations follows from the fact that

P[argmax_{s∈R} {γ0 W(s) − t λ0 s − (1/2) χ0 s²} ≤ 0] = P[(4 χ0 γ0²/λ0³)^{1/3} argmax_{s∈R} {W(s) − s²} ≤ t].

See, for instance, van der Vaart and Wellner (1996, Problem 3.2.5).

Similarly to (11), suppose the bootstrap analogue of θ̂n admits a set Ŝ∗n and processes Ĝ∗n, L̂∗n, and Q̂n such that

n^{1/3}(θ̂∗n − θ̂n) ≤ t ⇔ argmax_{s∈Ŝ∗n} {Ĝ∗n(s) + t L̂∗n(s) + Q̂n(s)} ≤ 0,  (15)

where

1(s ∈ Ŝ∗n) →P 1(s ∈ R) = 1,  Ĝ∗n(s) ⇝P γ0 W(s),  L̂∗n(s) →P −λ0 s,  (16)

but where Q̂n(s) ↛P −(1/2) χ0 s². In that case, n^{1/3}(θ̂∗n − θ̂n) does not converge weakly (in probability) to ω0 argmax_{s∈R} {W(s) − s²}, the source of the bootstrap inconsistency being once again the inability of the process Q̂n to reproduce the shape of Qn.

Adapting the ideas of Section 2 to the present setting, it seems natural to attempt to construct a θ̃∗n satisfying

n^{1/3}(θ̃∗n − θ̂n) ≤ t ⇔ argmax_{s∈S̃∗n} {G̃∗n(s) + t L̃∗n(s) + Q̃n(s)} ≤ 0,  (17)

where (S̃∗n, G̃∗n, L̃∗n) ≈ (Ŝ∗n, Ĝ∗n, L̂∗n) and Q̃n(s) →P −(1/2) χ0 s², in an appropriate sense (see below), as it seems plausible that such a construction would restore bootstrap consistency by virtue of θ̃∗n satisfying

n^{1/3}(θ̃∗n − θ̂n) ⇝P ω0 argmax_{s∈R} {W(s) − s²}.  (18)

As we shall see, such a construction is feasible and achieves bootstrap consistency in all three shape-constrained estimation examples mentioned above. Indeed, in all three examples, the only additional ingredient needed in order to construct θ̃∗n is a consistent estimator χ̂n of the scalar χ0.

5.1 Monotone Density

Our first example is the celebrated n^{1/3}-consistent monotone density estimator of Grenander (1956). The asymptotic properties of this estimator have been studied by Prakasa Rao (1969), Groeneboom (1985), and Kim and Pollard (1990), among others. Inconsistency of the standard bootstrap-based approximation to the distribution of the Grenander estimator has been pointed out by Kosorok (2008) and Sen, Banerjee, and Woodroofe (2010), among others.

Suppose the object of interest is θ0 = f0(x), where f0 is the Lebesgue density of a continuously distributed nonnegative random variable x and where x ∈ (0, ∞) is a given evaluation point. Assuming x1, . . . , xn is a random sample from the distribution of x and assuming f0 is non-increasing on [0, ∞), a natural estimator is θ̂n = f̂n^MD(x), where f̂n^MD is the Grenander estimator of f0; that is, f̂n^MD maximizes the log-likelihood ∑_{i=1}^n log f(xi) over all non-increasing densities f on [0, ∞).

As is well known, f̂n^MD is the left derivative of the least concave majorant (LCM) of the empirical distribution function F̂n(·) = (1/n) ∑_{i=1}^n 1(xi ≤ ·). Using this fact, it can be shown that θ̂n satisfies (11) with

Ŝn = [−n^{1/3} x, ∞),

Ĝn(s) = n^{2/3}{F̂n(x + s n^{−1/3}) − F̂n(x) − F0(x + s n^{−1/3}) + F0(x)},

L̂n(s) = −s,

and

Qn(s) = n^{2/3}{F0(x + s n^{−1/3}) − F0(x) − f0(x) s n^{−1/3}},

where F0 is the distribution function of x. As a consequence, (12) holds with γ0 = √f0(x) and λ0 = 1. If also f0 admits a negative derivative f′0 at x, then (13) holds with χ0 = −f′0(x) and (14) holds with ω0 = (−4 f′0(x) f0(x))^{1/3}.

Letting x∗1,n, . . . , x∗n,n denote a random sample from the empirical distribution of x1, . . . , xn, a natural bootstrap analogue of θ̂n is given by θ̂∗n = f̂n^{MD,∗}(x), where f̂n^{MD,∗} is the left derivative of the LCM of F̂∗n(·) = (1/n) ∑_{i=1}^n 1(x∗i,n ≤ ·). This estimator satisfies (15) with

Ŝ∗n = [−n^{1/3} x, ∞),

Ĝ∗n(s) = n^{2/3}{F̂∗n(x + s n^{−1/3}) − F̂∗n(x) − F̂n(x + s n^{−1/3}) + F̂n(x)},

L̂∗n(s) = −s,

and

Q̂n(s) = n^{2/3}{F̂n(x + s n^{−1/3}) − F̂n(x) − f̂n^MD(x) s n^{−1/3}}.

Moreover, it can be shown that (16) holds with γ0 = √f0(x) and λ0 = 1. On the other hand, being non-smooth, the empirical distribution function F̂n does not admit a quadratic approximation around x, so the natural bootstrap analogue of (13) fails; that is, Q̂n(s) ↛P (1/2) f′0(x) s². Sen, Banerjee, and Woodroofe (2010, p. 1968) attribute the inconsistency of the bootstrap to this fact and show that bootstrap consistency can be restored by generating the bootstrap sample from a smoothed/regularized approximation to F̂n. A similar result was obtained by Kosorok (2008).

To define our alternative bootstrap procedure, let f̂′n(x) be an estimator of f′0(x) and let f̃n^{MD,∗} be the left derivative of the LCM of F̃∗n, where F̃∗n is the following reshaped version of F̂∗n:

F̃∗n(·) = F̂∗n(·) − F̂n(·) + F̂n(x) + f̂n^MD(x)(· − x) + (1/2) f̂′n(x)(· − x)².

Then θ̃∗n = f̃n^{MD,∗}(x) satisfies (17) with (S̃∗n, G̃∗n, L̃∗n) = (Ŝ∗n, Ĝ∗n, L̂∗n) and Q̃n(s) = (1/2) f̂′n(x) s². As one would hope, (18) is satisfied provided f̂′n(x) is consistent.

Theorem MD If f0 is differentiable at x with derivative f′0(x) < 0 and if f̂′n(x) →P f′0(x), then

sup_{t∈R} |P∗n[n^{1/3}(f̃n^{MD,∗}(x) − f̂n^MD(x)) ≤ t] − P[n^{1/3}(f̂n^MD(x) − f0(x)) ≤ t]| →P 0.

The (pointwise) consistency requirement f̂′n(x) →P f′0(x) can be met in various ways. One possibility is to apply a version of our generic numerical derivative estimator, but an arguably more natural option is to employ a standard kernel (derivative) density estimator with its corresponding plug-in MSE-optimal bandwidth selector; see Section A.5.1 of the supplemental appendix for details.
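Computationally, the reshaped bootstrap for this example only requires an LCM routine in addition to f̂′n(x). The Python sketch below computes one bootstrap draw of f̃n^{MD,∗}(x), evaluating the reshaped process at the jump points of the empirical distribution functions and dropping the additive constant F̂n(x), which does not affect the slopes of the LCM:

import numpy as np

def lcm_left_slope(t, F, x0):
    # Left derivative at x0 of the least concave majorant of the points (t_j, F_j);
    # t must be strictly increasing. The LCM vertices form an upper convex hull.
    hull = [0]
    for j in range(1, len(t)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # drop vertex i1 if it lies on or below the chord from i0 to j
            if (F[i1] - F[i0]) * (t[j] - t[i1]) <= (F[j] - F[i1]) * (t[i1] - t[i0]):
                hull.pop()
            else:
                break
        hull.append(j)
    k = int(np.searchsorted(t[hull], x0, side="left"))  # first vertex at or past x0
    k = min(max(k, 1), len(hull) - 1)
    i0, i1 = hull[k - 1], hull[k]
    return (F[i1] - F[i0]) / (t[i1] - t[i0])

def grenander_reshaped_draw(xs, x0, f_hat_x0, fprime_hat_x0, rng):
    # One draw: f_hat_x0 is the Grenander estimate at x0 and fprime_hat_x0 a
    # consistent estimate of f'_0(x0), e.g. from a kernel derivative estimator.
    n = len(xs)
    xs_sorted = np.sort(xs)
    xb = np.sort(rng.choice(xs, size=n, replace=True))
    t = np.concatenate(([0.0], xs_sorted))  # jump points of the empirical cdfs
    Fn = np.searchsorted(xs_sorted, t, side="right") / n
    Fb = np.searchsorted(xb, t, side="right") / n
    Ft = Fb - Fn + f_hat_x0 * (t - x0) + 0.5 * fprime_hat_x0 * (t - x0) ** 2
    return lcm_left_slope(t, Ft, x0)

Repeating grenander_reshaped_draw B times and rescaling the differences from f̂n^MD(x) by n^{1/3} yields the bootstrap distribution whose consistency is established in Theorem MD.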

Section 6 below evaluates the numerical performance of our new bootstrap-based inference procedure for the monotone density estimator in a simulation study.

5.2 Monotone Regression

Our second example is the monotone regression estimator of Brunk (1958). In this example, the object of interest is θ0 = μ0(x), where μ0 is the regression function μ0(x) = E(y|x) for a dependent variable y and a (scalar) continuously distributed regressor x, and x ∈ R is a given evaluation point. Assuming z1, . . . , zn is a random sample from the distribution of z = (y, x)′ and assuming μ0 is non-decreasing on X, a natural estimator is θ̂n = μ̂n^MR(x), where μ̂n^MR minimizes ∑_{i=1}^n {yi − μ(xi)}² over all non-decreasing functions μ.

For simplicity only, we do not incorporate interpolation in the construction of the estimator θ̂n; all our results continue to hold if interpolation is used. Therefore, we can take θ̂n to be the left derivative of the greatest convex minorant (GCM) of Ûn ∘ F̂n^{−1} at F̂n(x), where

Ûn(·) = (1/n) ∑_{i=1}^n yi 1(xi ≤ ·),  F̂n(·) = (1/n) ∑_{i=1}^n 1(xi ≤ ·).

Using this fact and defining

M̂n(·; θ) = −[Ûn(·) − θ F̂n(·)],  M0(·; θ) = E[M̂n(·; θ)],

it can be shown that θ̂n satisfies (11) with

Ŝn = [n^{1/3}(min_{1≤i≤n} xi − x), n^{1/3}(max_{1≤i≤n} xi − x)),

Ĝn(s) = n^{2/3}{M̂n(x + s n^{−1/3}; θ0) − M̂n(x; θ0) − M0(x + s n^{−1/3}; θ0) + M0(x; θ0)},

L̂n(s) = −n^{1/3}{F̂n(x + s n^{−1/3}) − F̂n(x)},

and

Qn(s) = n^{2/3}{M0(x + s n^{−1/3}; θ0) − M0(x; θ0)}.

Under the following condition, (12) holds with γ0 = √(σ0²(x) f0(x)) and λ0 = f0(x), (13) holds with χ0 = χ^MR = μ′0(x) f0(x), and (14) holds with ω0 = (4 σ0²(x) μ′0(x)/f0(x))^{1/3}.

Condition MR For some neighborhood X of x, the following are satisfied:

(i) E[{y − μ0(x)}⁴] < ∞, μ0 admits a continuous derivative μ′0 on X, and σ0² is continuous on X, where σ0²(x) = V[y|x].

(ii) The density f0 of x satisfies f0(x) > 0 and admits a bounded derivative f′0 on X.

Letting z∗1,n, . . . , z∗n,n denote a random sample from the empirical distribution of z1, . . . , zn, a natural bootstrap analogue of μ̂n^MR(x) is given by μ̂n^{MR,∗}(x), the left derivative of the GCM of Û∗n ∘ F̂∗n^{−1} at F̂∗n(x), where

Û∗n(·) = (1/n) ∑_{i=1}^n y∗i,n 1(x∗i,n ≤ ·),  F̂∗n(·) = (1/n) ∑_{i=1}^n 1(x∗i,n ≤ ·).

Defining

M̂∗n(·; θ) = −[Û∗n(·) − θ F̂∗n(·)],

it can be shown that θ̂∗n = μ̂n^{MR,∗}(x) satisfies (15) with

Ŝ∗n = [n^{1/3}(min_{1≤i≤n} x∗i,n − x), n^{1/3}(max_{1≤i≤n} x∗i,n − x)),

Ĝ∗n(s) = n^{2/3}{M̂∗n(x + s n^{−1/3}; θ̂n) − M̂∗n(x; θ̂n) − M̂n(x + s n^{−1/3}; θ̂n) + M̂n(x; θ̂n)},

L̂∗n(s) = −n^{1/3}{F̂∗n(x + s n^{−1/3}) − F̂∗n(x)},

and

Q̂n(s) = n^{2/3}{M̂n(x + s n^{−1/3}; θ̂n) − M̂n(x; θ̂n)}.

Moreover, it can be shown that (16) holds with γ0 = √(σ0²(x) f0(x)) and λ0 = f0(x). On the other hand, the process M̂n(·; θ̂n) does not admit a quadratic approximation around x, so the natural bootstrap analogue of (13) fails; that is, Q̂n(s) ↛P −(1/2) μ′0(x) f0(x) s².

Bootstrap consistency can be restored by reshaping Û∗n. Let χ̂n be an estimator of χ^MR and let μ̃n^{MR,∗}(x) be the left derivative of the GCM of Ũ∗n ∘ F̂∗n^{−1} at F̂∗n(x), where

Ũ∗n(·) = Û∗n(·) − Ûn(·) + Ûn(x) + μ̂n^MR(x){F̂n(·) − F̂n(x)} + (1/2) χ̂n(· − x)².

By construction, θ̃∗n = μ̃n^{MR,∗}(x) satisfies (17) with (S̃∗n, G̃∗n, L̃∗n) = (Ŝ∗n, Ĝ∗n, L̂∗n) and Q̃n(s) = −(1/2) χ̂n s². As one would hope, (18) is satisfied under mild conditions.

Theorem MR If Condition MR holds and if χ̂n →P χ^MR, then

sup_{t∈R} |P∗n[n^{1/3}(μ̃n^{MR,∗}(x) − μ̂n^MR(x)) ≤ t] − P[n^{1/3}(μ̂n^MR(x) − μ0(x)) ≤ t]| →P 0.

The consistency requirement χ̂n →P χ^MR = μ′0(x) f0(x) can be met in various ways. One possibility is to apply a version of our generic numerical derivative estimator, and another option is to use a plug-in estimator employing separate nonparametric estimators of μ′0(x) and f0(x); see Section A.5.2 of the supplemental appendix for details.
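The reshaping step itself is elementary. For completeness, a Python sketch of Ũ∗n evaluated on a grid t is given below; the GCM and left-derivative computations then proceed exactly as in the monotone density example, with a lower (convex) hull in place of the upper one:

import numpy as np

def reshaped_U_star(t, yb, xb, y, x, x0, mu_hat_x0, chi_hat):
    # (yb, xb): bootstrap sample; (y, x): original sample; mu_hat_x0 is the
    # monotone regression estimate at x0; chi_hat estimates mu'_0(x0) f_0(x0).
    n = len(y)
    Ub = np.array([np.sum(yb[xb <= s]) for s in t]) / n  # U*_n(t)
    Un = np.array([np.sum(y[x <= s]) for s in t]) / n    # U-hat_n(t)
    Fn = np.array([np.mean(x <= s) for s in t])          # F-hat_n(t)
    Un_x0, Fn_x0 = np.sum(y[x <= x0]) / n, np.mean(x <= x0)
    return (Ub - Un + Un_x0 + mu_hat_x0 * (Fn - Fn_x0)
            + 0.5 * chi_hat * (t - x0) ** 2)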


5.3 Current Status

Our final example corresponds to the basic current status model considered by Ayer, Brunk, Ewing, Reid, and Silverman (1955). In this example, the object of interest is F0(x), where x ∈ R is a given evaluation point and where F0 is the distribution function of an unobserved (scalar) continuously distributed random variable x. If y is independent of x and continuously distributed, then F0 is identified (on the support of y) from the distribution of z = (y, d)′, where d = 1(x ≤ y). Assuming z1, . . . , zn is a random sample from the distribution of z, a natural estimator of θ0 = F0(x) is θ̂n = F̂n^CS(x), where F̂n^CS maximizes the (partial) log-likelihood ∑_{i=1}^n {di log F(yi) + (1 − di) log[1 − F(yi)]} over all distribution functions F.

It can be shown that F̂n^CS minimizes ∑_{i=1}^n {di − F(yi)}² over all distribution functions F. In other words, F̂n^CS(x) is of the same form as the monotone regression estimator μ̂n^MR(x) analyzed in the previous subsection. In particular, θ̂n can be taken to be the left derivative of the GCM of V̂n ∘ Ĥn^{−1} at Ĥn(x), where

V̂n(·) = (1/n) ∑_{i=1}^n di 1(yi ≤ ·),  Ĥn(·) = (1/n) ∑_{i=1}^n 1(yi ≤ ·),

and it can be shown that (12) holds with γ0 = √(F0(x)[1 − F0(x)] g0(x)) and λ0 = g0(x), (13) holds with χ0 = χ^CS = f0(x) g0(x), and (14) holds with ω0 = (4 F0(x)[1 − F0(x)] f0(x)/g0(x))^{1/3} under the following condition.

Condition CS For some neighborhood X of x, the densities f0 and g0 of x and y are continuous on X with f0(x) > 0 and g0(x) > 0.

Letting z∗1,n, . . . , z∗n,n denote a random sample from the empirical distribution of z1, . . . , zn, a natural bootstrap analogue of F̂n^CS(x) is given by F̂n^{CS,∗}(x), the left derivative of the GCM of V̂∗n ∘ Ĥ∗n^{−1} at Ĥ∗n(x), where

V̂∗n(·) = (1/n) ∑_{i=1}^n d∗i,n 1(y∗i,n ≤ ·),  Ĥ∗n(·) = (1/n) ∑_{i=1}^n 1(y∗i,n ≤ ·).

Once again, the distributional approximation based on θ̂∗n is inconsistent, but bootstrap consistency can be restored by reshaping V̂∗n. Let χ̂n be an estimator of χ^CS and let F̃n^{CS,∗}(x) be the left derivative of the GCM of Ṽ∗n ∘ Ĥ∗n^{−1} at Ĥ∗n(x), where

Ṽ∗n(·) = V̂∗n(·) − V̂n(·) + V̂n(x) + F̂n^CS(x){Ĥn(·) − Ĥn(x)} + (1/2) χ̂n(· − x)².

Theorem CS If Condition CS holds and if χ̂n →P χ^CS, then

sup_{t∈R} |P∗n[n^{1/3}(F̃n^{CS,∗}(x) − F̂n^CS(x)) ≤ t] − P[n^{1/3}(F̂n^CS(x) − F0(x)) ≤ t]| →P 0.

The consistency requirement χ̂n →P χ^CS = f0(x) g0(x) can be met by using either our generic numerical derivative approach or an example-specific plug-in estimator; see Section A.5.3 of the supplemental appendix for details.

6 Numerical Results

We illustrate the numerical performance of our proposed bootstrap-based inference methods for the maximum score estimator and the monotone density estimator. In both cases we find that the resulting confidence intervals exhibit very good coverage and length properties in the simulation designs considered, outperforming other available resampling methods for cube root consistent estimators.

6.1 The Maximum Score Estimator

We consider the setup of Section 4.1, and generate data from a model with d = 1, θ0 = 1,

x = (x1, x2)′ ∼ N((0, 1)′, I2),

where I2 denotes the 2 × 2 identity matrix, and with u generated by three distinct distributions. Specifically, DGP 1 sets u ∼ Logistic(0, 1)/√(2π²/3), DGP 2 sets u ∼ T3/√3, where Tk denotes a Student's t distribution with k degrees of freedom, and DGP 3 sets u ∼ (1 + 2(x1 + x2)² + (x1 + x2)⁴) Logistic(0, 1)/√(π²/48).

The Monte Carlo experiment employs a sample size n = 1,000 with B = 2,000 bootstrap replications and S = 2,000 simulations. For each of the three DGPs, we implement the standard nonparametric bootstrap, the m-out-of-n bootstrap, and our proposed method using the two estimators V̂n^MS and V̂n^ND of V0. We report empirical coverage for nominal 95% confidence intervals and their average interval length. For the case of our proposed procedures, we investigate their performance using both (i) a grid of fixed tuning parameter values (bandwidth/derivative step) around the MSE-optimal choice and (ii) infeasible and feasible AMSE-optimal choices of the tuning parameter.

Table 1 presents the main results, which are consistent across all three simulation designs. First, as expected, the standard nonparametric bootstrap (labeled “Standard”) does not perform well, leading to confidence intervals with an average empirical coverage rate of about 64%. Second, the m-out-of-n bootstrap (labeled “m-out-of-n”) performs somewhat better for small subsamples, but leads to very large average interval length of the resulting confidence intervals. Our proposed methods, on the other hand, exhibit excellent finite sample performance in this Monte Carlo experiment. Results employing the example-specific plug-in estimator $\hat V^{\mathrm{MS}}_n$ are presented under the label “Plug-in”, while results employing the generic numerical derivative estimator $\hat V^{\mathrm{ND}}_n$ are reported under the label “Num Deriv”. Empirical coverage appears stable across different values of the tuning parameters for each method, with better performance in the case of $\hat V^{\mathrm{MS}}_n$. We conjecture that n = 1,000 is too small for the numerical derivative estimator $\hat V^{\mathrm{ND}}_n$ to deliver inference as accurate as that based on $\hat V^{\mathrm{MS}}_n$ (note, for example, that the MSE-optimal choice $\varepsilon_{\mathrm{MSE}}$ is greater than 1). Nevertheless, empirical coverage of confidence intervals constructed using our proposed bootstrap-based method is close to 95% in all cases except when $\hat V^{\mathrm{ND}}_n$ is used with either the infeasible asymptotic choice $\varepsilon_{\mathrm{AMSE}}$ or its estimated counterpart $\hat\varepsilon_{\mathrm{AMSE}}$, and with an average interval length of at most half that of any of the competing m-out-of-n confidence intervals. In particular, confidence intervals based on $\hat V^{\mathrm{MS}}_n$ implemented with the feasible bandwidth $\hat h_{\mathrm{AMSE}}$ perform quite well across the three DGPs considered.

6.2 The Monotone Density Estimator

We consider the setup of Section 5.1 and employ the DGPs and simulation settings previously considered in Sen, Banerjee, and Woodroofe (2010). We estimate $f_0(x)$ at the evaluation point $x = 1$ using a random sample of observations drawn from one of three distributions: DGP 1 sets $x \sim \mathrm{Exponential}(1)$, DGP 2 sets $x \sim |\mathrm{Normal}(0,1)|$, and DGP 3 sets $x \sim |T_3|$. As in the maximum score example, the Monte Carlo experiment employs a sample size


n = 1,000 with B = 2,000 bootstrap replications and S = 2,000 simulations, and compares three types of bootstrap-based inference procedures: the standard nonparametric bootstrap, the m-out-of-n bootstrap, and our proposed method using two distinct estimators of $f_0'(x)$: an off-the-shelf plug-in estimator and our generic numerical derivative estimator $\hat V^{\mathrm{ND}}_n$.
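The point estimator itself is simple to compute: for a decreasing density it is the left derivative, at the evaluation point, of the least concave majorant (LCM) of the empirical distribution function. A minimal sketch follows (the function name is hypothetical; it reuses gcm_left_deriv from the current-status sketch above via a sign flip, and assumes continuously distributed data so that the order statistics are distinct).

    import numpy as np

    def grenander_at(data, x_eval):
        """Grenander (decreasing-density) estimator at x_eval."""
        xs = np.sort(np.asarray(data))
        n = len(xs)
        u = np.r_[0.0, xs]                        # 0 and the order statistics
        v = np.r_[0.0, np.arange(1, n + 1) / n]   # empirical CDF at those points
        # the left derivative of the LCM of (u, v) equals minus the left
        # derivative of the GCM of (u, -v)
        return -gcm_left_deriv(u, -v, x_eval)

    # rng = np.random.default_rng(0)
    # f_hat = grenander_at(rng.exponential(1.0, 1000), 1.0)   # DGP 1, x = 1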

Table 2 presents the numerical results for this example. We continue to report empirical coverage for nominal 95% confidence intervals and their average interval length. For our proposed procedures, we again investigate performance using both (i) a grid of fixed tuning parameter values (derivative step/bandwidth) and (ii) infeasible and feasible AMSE-optimal choices of the tuning parameter. In this case, too, the numerical evidence is very encouraging. Our proposed bootstrap-based inference method leads to confidence intervals with excellent empirical coverage and average interval length, outperforming both the standard nonparametric bootstrap (which is inconsistent) and the m-out-of-n bootstrap (which is consistent). In this example, the plug-in method employs an off-the-shelf kernel derivative estimator, which leads to confidence intervals that are very robust (i.e., insensitive) to the choice of bandwidth. Furthermore, when the corresponding feasible off-the-shelf MSE-optimal bandwidth is used, the resulting confidence intervals continue to perform very well. Finally, the generic numerical derivative estimator also leads to very good performance of both infeasible and feasible bootstrap-based confidence intervals.

7 Conclusion

We developed a valid resampling procedure for cube root asymptotics based on the nonparametric bootstrap. While it is well known that the standard nonparametric bootstrap is invalid for cube root consistent (and related) estimators exhibiting a Chernoff (1964)-type limiting distribution, our approach restores bootstrap validity by applying a carefully tailored reshaping of the objective function defining the estimator. This reshaping is easy to implement both in general and in specific cases, as illustrated by the six distinct examples we considered.

From a more general perspective, our approach can be heuristically explained as follows. Restating the result in (9) as

$$r_n(\hat\theta_n - \theta_0) \rightsquigarrow S_0(G_0), \qquad S_0(G) = \operatorname*{argmax}_{s \in \mathbb R^d}\,\{G(s) + Q_0(s)\},$$


our procedure approximates the distribution of $S_0(G_0)$ by that of $\hat S_n(\hat G^*_n)$, where

$$\hat G^*_n(s) = r_n^2\big[\hat M^*_n(\hat\theta_n + s r_n^{-1}) - \hat M^*_n(\hat\theta_n) - \hat M_n(\hat\theta_n + s r_n^{-1}) + \hat M_n(\hat\theta_n)\big]$$

is a bootstrap process whose distribution approximates that of $G_0(s)$, and where

$$\hat S_n(G) = \operatorname*{argmax}_{s \in \mathbb R^d}\,\{G(s) + \hat Q_n(s)\}, \qquad \hat Q_n(s) = -\frac{1}{2}\,s'\hat V_n s,$$

is an estimator of $S_0(G)$. In other words, our procedure replaces the functional $S_0$ with a consistent estimator (namely, $\hat S_n$) and its random argument $G_0$ with a bootstrap approximation (namely, $\hat G^*_n$).
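In schematic code, the procedure amounts to resampling, recentering the bootstrap criterion by the original sample criterion, and penalizing by the estimated quadratic drift. The sketch below is purely illustrative, with hypothetical names; a practical implementation would exploit problem structure, such as the GCM representations used in the shape-restricted examples, rather than generic numerical optimization, and a more robust global search may be needed since the criterion is typically discontinuous.

    import numpy as np
    from scipy.optimize import minimize

    def reshaped_bootstrap_draws(M_hat, theta_hat, V_hat, data, B=2000, seed=0):
        """Generic reshaped nonparametric bootstrap (schematic).

        M_hat(theta, data) is the sample criterion, V_hat estimates the
        drift matrix V_0, and theta_hat is a 1-D array. Each draw maximizes
            M_hat(theta, data*) - M_hat(theta, data)
                - 0.5 * (theta - theta_hat)' V_hat (theta - theta_hat),
        which corresponds to maximizing G*_n(s) + Q_n(s) after centering
        at theta_hat and scaling by r_n.
        """
        rng = np.random.default_rng(seed)
        n = len(data)
        draws = np.empty((B, theta_hat.size))
        for b in range(B):
            star = data[rng.integers(0, n, n)]   # nonparametric bootstrap sample

            def neg_crit(theta):
                dev = theta - theta_hat
                return -(M_hat(theta, star) - M_hat(theta, data)
                         - 0.5 * dev @ V_hat @ dev)

            # derivative-free search, since the criterion may be discontinuous
            draws[b] = minimize(neg_crit, theta_hat, method="Nelder-Mead").x
        return draws  # r_n * (draws - theta_hat) approximates the limit law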

Furthermore, it is not hard to show that our bootstrap-based distributional approximation is consistent also in the more standard case where $m_n(z,\theta)$ is sufficiently smooth in $\theta$ to ensure that an approximate maximizer of $\hat M_n$ is asymptotically normal and that the nonparametric bootstrap is consistent. In fact, $\tilde\theta^*_n$ is asymptotically equivalent to $\hat\theta^*_n$ in that standard case, so our procedure can be interpreted as a modification of the nonparametric bootstrap that is designed to be “robust” to the types of non-smoothness that give rise to cube root asymptotics.

Finally, Seo and Otsu (2018) give conditions under which results of the form (9) can be obtained also when the data are weakly dependent; see also Bagchi, Banerjee, and Stoev (2016) and references therein. It seems plausible that a version of our procedure, implemented with a resampling scheme suitable for dependent data, can be shown to be consistent under similar conditions, but it is beyond the scope of this paper to substantiate that conjecture.

References

Abrevaya, J., and J. Huang (2005): “On the Bootstrap of the Maximum Score Estimator,” Econometrica, 73(4), 1175–1204.

Ayer, M., H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman (1955): “An Empirical Distribution Function for Sampling with Incomplete Information,” Annals of Mathematical Statistics, 26(4), 641–647.

Bagchi, P., M. Banerjee, and S. A. Stoev (2016): “Inference for Monotone Functions under Short- and Long-Range Dependence: Confidence Intervals and New Universal Limits,” Journal of the American Statistical Association, 111(516), 1634–1647.

Bickel, P. J., and B. Li (2006): “Regularization in Statistics,” Test, 15(2), 271–344.

Brunk, H. (1958): “On the Estimation of Parameters Restricted by Inequalities,” Annals of Mathematical Statistics, 29(2), 437–454.

Chernoff, H. (1964): “Estimation of the Mode,” Annals of the Institute of Statistical Mathematics, 16(1), 31–41.

Delgado, M. A., J. M. Rodriguez-Poo, and M. Wolf (2001): “Subsampling Inference in Cube Root Asymptotics with an Application to Manski’s Maximum Score Estimator,” Economics Letters, 73(2), 241–250.

Dümbgen, L. (1993): “On Nondifferentiable Functions and the Bootstrap,” Probability Theory and Related Fields, 95, 125–140.

Grenander, U. (1956): “On the Theory of Mortality Measurement: Part II,” Scandinavian Actuarial Journal, 39(2), 125–153.

Groeneboom, P. (1985): “Estimating a Monotone Density,” in Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pp. 539–555. Institute of Mathematical Statistics.

Groeneboom, P., and K. Hendrickx (2018): “Current Status Linear Regression,” Annals of Statistics, 46(4), 1415–1444.

Groeneboom, P., and G. Jongbloed (2014): Nonparametric Estimation under Shape Constraints. Cambridge University Press.

Groeneboom, P., and G. Jongbloed (2018): “Some Developments in the Theory of Shape Constrained Inference,” Statistical Science, 33(4), 473–492.

Hong, H., and J. Li (2019): “The Numerical Bootstrap,” Annals of Statistics, forthcoming.

Honoré, B. E., and E. Kyriazidou (2000): “Panel Data Discrete Choice Models with Lagged Dependent Variables,” Econometrica, 68(4), 839–874.

Kim, J., and D. Pollard (1990): “Cube Root Asymptotics,” Annals of Statistics, 18(1), 191–219.

Kosorok, M. R. (2008): “Bootstrapping the Grenander Estimator,” in Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, pp. 282–292. Institute of Mathematical Statistics.

Léger, C., and B. MacGibbon (2006): “On the Bootstrap in Cube Root Asymptotics,” Canadian Journal of Statistics, 34(1), 29–44.

Manski, C. F. (1975): “Maximum Score Estimation of the Stochastic Utility Model of Choice,” Journal of Econometrics, 3(3), 205–228.

Mohammadi, L., and S. van de Geer (2005): “Asymptotics in Empirical Risk Minimization,” Journal of Machine Learning Research, 6(Dec), 2027–2047.

Mukherjee, D., M. Banerjee, and Y. Ritov (2019): “Non-Standard Asymptotics in High Dimensions: Manski’s Maximum Score Estimator Revisited,” arXiv:1903.10063.

Patra, R. K., E. Seijo, and B. Sen (2018): “A Consistent Bootstrap Procedure for the Maximum Score Estimator,” Journal of Econometrics, 205(2), 488–507.

Politis, D. N., and J. P. Romano (1994): “Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions,” Annals of Statistics, 22(4), 2031–2050.

Prakasa Rao, B. (1969): “Estimation of a Unimodal Density,” Sankhyā: The Indian Journal of Statistics, Series A, 31(1), 23–36.

Sen, B., M. Banerjee, and M. Woodroofe (2010): “Inconsistency of Bootstrap: The Grenander Estimator,” Annals of Statistics, 38(4), 1953–1977.

Seo, M. H., and T. Otsu (2018): “Local M-Estimation with Discontinuous Criterion for Dependent and Limited Observations,” Annals of Statistics, 46(1), 344–369.

Shi, C., W. Lu, and R. Song (2018): “A Massive Data Framework for M-Estimators with Cubic Rate,” Journal of the American Statistical Association, 113(524), 1698–1709.

van der Vaart, A. W. (1998): Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer.


Table 1: Simulations, Maximum Score Estimator, 95% Confidence Intervals.

(a) n = 1,000, S = 2,000, B = 2,000

                            DGP 1                       DGP 2                       DGP 3
                   h, ε   Coverage  Length     h, ε   Coverage  Length     h, ε   Coverage  Length
Standard                  0.639     0.475              0.645     0.480              0.640     0.247
m-out-of-n
  m = ⌈n^(1/2)⌉           0.998     1.702              0.997     1.754              0.999     1.900
  m = ⌈n^(2/3)⌉           0.979     1.189              0.979     1.223              0.985     0.728
  m = ⌈n^(4/5)⌉           0.902     0.824              0.894     0.839              0.904     0.447
Plug-in: V^MS_n
  0.7·h_MSE        0.434  0.941     0.501      0.406  0.947     0.513      0.105  0.904     0.256
  0.8·h_MSE        0.496  0.946     0.503      0.464  0.952     0.516      0.120  0.917     0.260
  0.9·h_MSE        0.558  0.951     0.506      0.522  0.951     0.518      0.135  0.930     0.267
  1.0·h_MSE        0.620  0.954     0.510      0.580  0.952     0.522      0.150  0.941     0.273
  1.1·h_MSE        0.682  0.959     0.515      0.638  0.955     0.526      0.165  0.948     0.281
  1.2·h_MSE        0.744  0.961     0.522      0.696  0.959     0.532      0.180  0.958     0.288
  1.3·h_MSE        0.806  0.962     0.531      0.754  0.960     0.539      0.195  0.966     0.296
  h_AMSE           0.385  0.938     0.499      0.367  0.941     0.510      0.119  0.917     0.260
  h_AMSE (feas.)   0.446  0.947     0.509      0.415  0.949     0.518      0.155  0.941     0.275
Num Deriv: V^ND_n
  0.7·ε_MSE        0.980  0.912     0.431      0.904  0.891     0.422      0.203  0.864     0.216
  0.8·ε_MSE        1.120  0.922     0.442      1.033  0.897     0.432      0.232  0.888     0.228
  0.9·ε_MSE        1.260  0.929     0.460      1.163  0.909     0.448      0.261  0.904     0.238
  1.0·ε_MSE        1.400  0.939     0.484      1.292  0.919     0.469      0.290  0.917     0.248
  1.1·ε_MSE        1.540  0.943     0.514      1.421  0.928     0.497      0.319  0.928     0.257
  1.2·ε_MSE        1.680  0.948     0.549      1.550  0.932     0.531      0.348  0.939     0.265
  1.3·ε_MSE        1.820  0.955     0.590      1.679  0.935     0.568      0.377  0.947     0.274
  ε_AMSE           0.483  0.878     0.410      0.476  0.871     0.412      0.216  0.877     0.221
  ε_AMSE (feas.)   0.518  0.877     0.414      0.513  0.884     0.418      0.368  0.932     0.269

Notes:
(i) Panel “Standard” refers to the standard nonparametric bootstrap; panel “m-out-of-n” refers to the m-out-of-n nonparametric bootstrap with subsample size m; panel “Plug-in: V^MS_n” refers to our proposed bootstrap-based method implemented with the example-specific plug-in drift estimator; and panel “Num Deriv: V^ND_n” refers to our proposed bootstrap-based method implemented with the generic numerical derivative drift estimator.
(ii) Column “h, ε” reports the tuning parameter value used, or its average across simulations when estimated; columns “Coverage” and “Length” report the empirical coverage and average length of bootstrap-based 95% percentile confidence intervals, respectively.
(iii) h_MSE and ε_MSE correspond to the simulation MSE-optimal choices, h_AMSE and ε_AMSE correspond to the AMSE-optimal choices, and the rows marked “(feas.)” correspond to the ROT feasible implementations of h_AMSE and ε_AMSE described in the supplemental appendix.


Table 2: Simulations, Isotonic Density Estimator, 95% Confidence Intervals.

(a) n = 1,000, S = 2,000, B = 2,000

                            DGP 1                       DGP 2                       DGP 3
                   h, ε   Coverage  Length     h, ε   Coverage  Length     h, ε   Coverage  Length
Standard                  0.828     0.146              0.808     0.172              0.821     0.155
m-out-of-n
  m = ⌈n^(1/2)⌉           1.000     0.438              0.995     0.495              0.998     0.452
  m = ⌈n^(2/3)⌉           0.989     0.314              0.979     0.360              0.989     0.328
  m = ⌈n^(4/5)⌉           0.953     0.235              0.937     0.274              0.948     0.248
Plug-in: V^ID_n
  0.7·h_MSE        0.264  0.955     0.157      0.202  0.947     0.183      0.209  0.957     0.165
  0.8·h_MSE        0.302  0.954     0.157      0.231  0.946     0.182      0.239  0.952     0.165
  0.9·h_MSE        0.339  0.951     0.156      0.260  0.944     0.181      0.269  0.949     0.164
  1.0·h_MSE        0.377  0.949     0.154      0.289  0.941     0.180      0.299  0.948     0.163
  1.1·h_MSE        0.415  0.940     0.151      0.318  0.938     0.178      0.329  0.944     0.161
  1.2·h_MSE        0.452  0.934     0.147      0.347  0.934     0.176      0.359  0.939     0.158
  1.3·h_MSE        0.490  0.922     0.142      0.376  0.928     0.173      0.389  0.935     0.155
  h_AMSE           0.380  0.949     0.154      0.300  0.940     0.180      0.333  0.943     0.161
  h_AMSE (feas.)   0.364  0.950     0.155      0.290  0.941     0.180      0.401  0.930     0.154
Num Deriv: V^ND_n
  0.7·ε_MSE        0.726  0.954     0.158      0.527  0.947     0.183      0.554  0.952     0.165
  0.8·ε_MSE        0.830  0.956     0.159      0.602  0.947     0.182      0.633  0.950     0.164
  0.9·ε_MSE        0.933  0.956     0.160      0.678  0.944     0.181      0.712  0.949     0.163
  1.0·ε_MSE        1.037  0.956     0.159      0.753  0.942     0.180      0.791  0.948     0.162
  1.1·ε_MSE        1.141  0.955     0.159      0.828  0.940     0.179      0.870  0.946     0.161
  1.2·ε_MSE        1.244  0.956     0.160      0.904  0.936     0.177      0.949  0.943     0.160
  1.3·ε_MSE        1.348  0.960     0.163      0.979  0.935     0.176      1.028  0.940     0.159
  ε_AMSE           0.927  0.956     0.160      0.731  0.943     0.180      0.812  0.948     0.162
  ε_AMSE (feas.)   0.888  0.956     0.159      0.708  0.943     0.181      0.978  0.942     0.159

Notes:
(i) Panel “Standard” refers to the standard nonparametric bootstrap; panel “m-out-of-n” refers to the m-out-of-n nonparametric bootstrap with subsample size m; panel “Plug-in: V^ID_n” refers to our proposed bootstrap-based method implemented with the example-specific plug-in drift estimator; and panel “Num Deriv: V^ND_n” refers to our proposed bootstrap-based method implemented with the generic numerical derivative drift estimator.
(ii) Column “h, ε” reports the tuning parameter value used, or its average across simulations when estimated; columns “Coverage” and “Length” report the empirical coverage and average length of bootstrap-based 95% percentile confidence intervals, respectively.
(iii) h_MSE and ε_MSE correspond to the simulation MSE-optimal choices, h_AMSE and ε_AMSE correspond to the AMSE-optimal choices, and the rows marked “(feas.)” correspond to the ROT feasible implementations of h_AMSE and ε_AMSE described in the supplemental appendix.
