The Canadian Journal of Statistics, Vol. 34, No. 1, 2006, Pages 5-27 / La revue canadienne de statistique


Bootstrap diagnostics and remedies Angelo J. CANTY, Anthony C. DAVISON, David V. HINKLEY and Valerie VENTURA

Key words and phrases: Bootstrap recycling; importance sampling; inconsistency; jackknife-after-bootstrap; outlier; pivot; resampling; spatial data; Stein estimator; subsampling; superefficiency; time series.

MSC 2000: Primary 62G09; secondary 62F40.

Abstract: Bootstrap diagnostics are used to assess the reliability of bootstrap calculations and may suggest useful modified calculations when these are possible. Concern focuses on susceptibility to peculiarities in data, incorrectness of a resampling model, incorrect use of resampling simulation output, and inherent inaccuracy of the bootstrap approach. The last involves issues such as inconsistency of a bootstrap method, the order of correctness of a consistent bootstrap method, and approximate pivotality. The authors review here some of these problems, provide workable diagnostic methods where possible, and discuss fast and simple ways to effect the necessary computations.

Diagnostics et remèdes pour le bootstrap

Résumé (translated from the French): Bootstrap diagnostics are techniques for assessing the reliability of calculations performed by resampling, and they occasionally lead to useful modifications of the calculations. Common sources of trouble include sensitivity to idiosyncrasies of the data, incorrectness of a resampling model, faulty use of the results of the bootstrap method, and the inherent imprecision of this approach, namely the possible inconsistency of the procedure, its degree of accuracy when it is consistent, and its approximate pivotality. The authors review some of these questions, supply diagnostic methods where they can, and present simple and efficient computational procedures.

1. INTRODUCTION

Over the past quarter-century bootstrap procedures have become widely used in statistical applications, owing to their usefulness as general-purpose tools for assessing variability. As with any statistical notion, however, it is important to be clear about the assumptions from which the validity and reliability of such calculations are derived. This paper reviews procedures for checking some of these assumptions and outlines some ways of using the resulting information to modify the calculations.

One simple way to describe bootstrap procedures is to say that they are applications of the bootstrap substitution principle, which says that if we want to know the probability distribution of the quantity $U = u(Y, F)$ when $Y = (Y_1, \ldots, Y_n)$ is randomly sampled from $F$, then we replace $F$ in the probability calculation by a fitted model $\hat F$, making the approximation

$P\{u(Y, F) \le u \mid F\} \approx P\{u(Y^*, \hat F) \le u \mid \hat F\}.$   (1)

The superscript $*$ is used to indicate random variables and related quantities sampled from a probability model $\hat F$ (the resampling model) that has been fitted to the data. In the simplest applications $u(Y, F) = n^{1/2}(T - \theta)$, where $T$ is an estimator of the population parameter $\theta = t(F)$. More complicated situations will involve transformations of $T$ or studentized forms of $T$. Usually the estimator $T$ is the same function of the empirical distribution function (EDF) $\hat F$ as $\theta$ is of the true and unknown distribution $F$: for example, if $\theta$ is the mean $\int y \, dF(y)$, then the estimator $T$ is the sample mean $\int y \, d\hat F(y) = n^{-1} \sum Y_j = \bar Y$. Characteristics of $T$ such as bias and variance have approximations corresponding to (1), e.g., $\mathrm{var}(T \mid F) \approx \mathrm{var}(T^* \mid \hat F)$.

The resampling model $\hat F$ in (1) may correspond either to a parametric estimate $F_{\hat\psi}$, where the full parameter vector $\psi$ includes the parameter of interest $\theta$, or it may correspond to the


empirical distribution function $\hat F$, which in this simple setting assigns equal probabilities to each case. In more complicated situations, such as regression or time series, $\hat F$ may correspond to a semiparametric model, e.g., a parametric regression fit with additive errors whose distribution is estimated by the empirical distribution function of residuals. The more complicated the structure, the more choices there may be for $\hat F$.

This paper mainly discusses the nonparametric case, where samples $y^*$ drawn from $\hat F$ take the same values $y_1, \ldots, y_n$ as the data, but with variable frequencies $f^*_1, \ldots, f^*_n$ that sum to $n$. The right-hand side of (1) is usually evaluated by Monte Carlo simulation, which generates samples $y^*_r = (y^*_{r1}, \ldots, y^*_{rn})$, for $r = 1, \ldots, R$, and associated estimator values $t^*_r$. This notation extends in an obvious way to more complex data structures and semiparametric models. Extended accounts of resampling methods are given by Hall (1992), Efron & Tibshirani (1993), Shao & Tu (1995), and Davison & Hinkley (1997). The Special Issue of Statistical Science (vol. 18, no. 2, 2003) in celebration of the Silver Anniversary of the Bootstrap provides further recent discussion.

The practical interest in (1) arises mainly when $u(Y, F)$ is monotone decreasing in $\theta$. Then if $u_p$ denotes the $p$-quantile of $U$, the set of $\theta$ values such that $u_\alpha \le u(Y, F) \le u_{1-\beta}$ forms a $1 - (\alpha + \beta)$ confidence interval. The corresponding bootstrap confidence interval estimates the required $p$-quantiles by empirical quantiles of $U^*$, namely the $((R+1)p)$th ordered values $u^*_{((R+1)p)}$ derived from the set of $R$ independent resamples.

There are several situations in which standard bootstrap procedures do not give reliable answers. Often corrective action is possible by modifying either the resampling scheme or some other aspect of the method. Here we list some problem situations and comment on the factors involved:

1. Effect of data outliers. Data outliers may exert considerable influence not only on the estimator T but also on the resampling properties of T*. Depending upon the resampling model used, an outlier may occur with variable frequency in bootstrap samples, and the effect of this may be hard to anticipate even if T itself is unaffected.

2. Incorrect resampling model. For inhomogeneous data it is necessary to model the random variation, which in general means correctly identifying relevant strata for resampling. Examples of this occur with generalized linear models for count data, because of inhomogeneous residuals, and with temporal and spatial correlation, where the resampling scheme must respond appropriately to the range of dependence.

3. Nonpivotality. The key to the success of the bootstrap is the accuracy of the substitution approximation (1). Often the user can choose the working quantity $U = u(Y, F)$, e.g., in confidence interval and significance test applications. The substitution principle is completely successful if $U$ is a pivot, that is, if its distribution is the same for all $F$. This can be true exactly for parametric models, but not for nonparametric models; here the quality of the approximation can be investigated empirically.

4. Inconsistency of bootstrap method. The combination of model, statistic and resampling scheme may be such that bootstrap results fail to approximate the required properties, no matter how large the sample size.

An important aspect of all statistical methods is sensitivity to assumptions. A bootstrap procedure often involves choosing a combination of statistic and method (such as transformation, studentization, or acceleration and bias adjustment) as well as a resampling model. If there are several equally reasonable resampling models, it is important to check that the conclusion drawn is adequately robust to the choice of model.

Discreteness of the statistic $T^*$ sometimes gives difficulties, particularly for rough statistics such as the sample median and maximum. These are usually evident from plots of bootstrap output and can often be addressed using a smooth bootstrap (Hall, DiCiccio & Romano 1989; De Angelis, Hall & Young 1993; Brown, Hall & Young 2001) or subsampling (Politis, Romano & Wolf 1999), about which more will be said later.

The focus of the present paper is on diagnostics, i.e., calculations that can be done with bootstrap output to diagnose the presence of the difficulties listed above. We focus on the items in the above list, which we address in Sections 2-5 respectively, and close with a short discussion. Although this paper emphasizes empirical diagnostics, the importance of relevant theoretical work will be evident to the reader. Further theory and more diagnostics are necessary before bootstrap procedures become fully trustworthy additions to the statistical toolkit.
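The substitution principle (1) and the $(R+1)p$ ordering rule described above can be sketched in a few lines. This is a minimal Python/NumPy illustration, not code from the paper; the sample and the choice $u(Y, F) = T - \theta$ are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(size=20)   # invented sample; theta = t(F) is the mean
n, R = len(y), 999
t_hat = y.mean()               # T, the plug-in estimate t(F-hat)

# Monte Carlo evaluation of the right-hand side of (1): resample from
# F-hat (the EDF) and compute t*_r for r = 1, ..., R
t_star = np.array([rng.choice(y, size=n, replace=True).mean() for _ in range(R)])

# substitution approximations to characteristics of T,
# e.g. var(T | F) is approximated by var(T* | F-hat)
boot_var = t_star.var(ddof=1)

# confidence limits by inverting U = T - theta, with the p-quantile of U*
# estimated by the (R+1)p-th ordered value of u* = t* - t
alpha = 0.025
u_star = np.sort(t_star - t_hat)
k_lo = int((R + 1) * alpha) - 1          # 0-based index of the 25th ordered value
k_hi = int((R + 1) * (1 - alpha)) - 1
ci = (t_hat - u_star[k_hi], t_hat - u_star[k_lo])
```

With $R = 999$ and $\alpha = \beta = 0.025$, the required quantiles are the 25th and 975th ordered values, which is why $R$ is chosen so that $(R+1)p$ is an integer.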

2. DATA OUTLIERS: JACKKNIFE-AFTER-BOOTSTRAP

One of the earliest diagnostics for bootstrap output was the jackknife-after-bootstrap of Efron (1992), which was intended to assess the effect on bootstrap results of deleting individual observations or cases. Rather than rerun bootstrap calculations for the data set that results from deletion of a single case, in the nonparametric case one can simply take the full-data bootstrap output and remove those resamples in which the case appears; this leaves $R_{-j}$ resamples, roughly a fraction 0.368 of the original $R$ resamples. The calculations are easily done if the resample frequencies $f^*_{rj}$, the number of times case $j$ appears in the $r$th resample, were retained.
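The recycling computation can be sketched as follows (Python/NumPy; the data and the statistic, a sample mean, are invented stand-ins): the frequency matrix $f^*_{rj}$ is stored with the bootstrap output, and case-deletion quantiles are read off from the resamples in which case $j$ never appears.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=15)
n, R = len(y), 999

# store the resample frequency matrix f[r, j] alongside the statistics t*_r
idx = rng.integers(0, n, size=(R, n))
f = np.stack([np.bincount(row, minlength=n) for row in idx])
t_star = y[idx].mean(axis=1)            # statistic: the resample mean

# jackknife-after-bootstrap: reuse full-data output for case-deletion results
quantiles = {}
for j in range(n):
    keep = f[:, j] == 0                 # resamples in which case j never appears
    quantiles[j] = np.quantile(t_star[keep], [0.05, 0.5, 0.95])

# roughly a fraction (1 - 1/n)^n ~ 0.368 of resamples omit any given case
frac_kept = (f == 0).mean()
```

No new simulation is needed: the $n$ sets of case-deletion quantiles all come from the single full-data bootstrap run.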

In general, importance reweighting can be used to augment the full-data bootstrap output to obtain case-deletion results, so long as the case-deletion estimated model support for $y^*$ is contained in the full-data estimated model support for $y^*$. This condition is satisfied in the nonparametric case mentioned above and for most parametric models, but not for most semiparametric models, such as that in the regression example below.

When case $j$ is deleted from the data, for clarity we denote the value of $T$ by $t_{-j}$, and the corresponding $r$th resample value by $t^*_{r,-j}$.

FIGURE 1: Survival percentages of cells at a succession of radiation doses (rads) (Efron & Tibshirani 1993, §9.6), with least squares (dotted) and Huber Proposal 2 (solid) fits; the possible outlier, case 13, is shown as a hollow circle.

2.1. Example: Survival data regression.

Figure 1 shows data from Efron & Tibshirani (1993, §9.6). Each point corresponds to the proportion of cells which survived on one bacterial plate when exposed to a given radiation dose. The very large numbers of cells exposed on each plate are not available. The investigator was uncertain about the result from one of the plates, marked by an open circle on the plot, perhaps because


of aberrant experimental conditions. The plot suggests the possibility of two other moderate outliers relative to the linear regression shown. With $x$ and $y$ representing respectively dose and logarithm of survival rate, we fit the model $\mathrm{E}(Y \mid x) = \beta_0 + \beta_1 x$ to data $(x_1, y_1), \ldots, (x_n, y_n)$ by least squares, obtaining estimates $\hat\beta_0$ and $\hat\beta_1$. There are two main resampling methods here:

(i) Case resampling: randomly sample from the empirical bivariate distribution of $(x, y)$, taking $(x^*_i, y^*_i) = (x_{U_i}, y_{U_i})$ for $i = 1, \ldots, n$, with $U_1, \ldots, U_n$ randomly sampled from $\{1, \ldots, n\}$.

(ii) Model-based resampling: simulate random samples from the fitted regression model, taking $x^*_i = x_i$ and $y^*_i = \hat\beta_0 + \hat\beta_1 x_i + \varepsilon^*_i$ for $i = 1, \ldots, n$, with errors $\varepsilon^*_i$ randomly sampled from the residuals $e_1, \ldots, e_n$, where $e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$, or, preferably, from standardized versions of these.

For the semiparametric method (ii), the support of $y^*$ when case $j$ is omitted from the data is not included in the corresponding full-data support, so new resamples must be generated for each of the $n$ case-deletion model fits.
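The two schemes can be sketched as follows (Python/NumPy; the dose-response numbers below are invented stand-ins, not the data of Figure 1):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(100, 1400, 14)                      # invented doses
y = 1.0 - 0.007 * x + rng.normal(0, 0.3, size=14)   # invented log survival rates
n = len(x)

# least squares fit: y = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                               # raw residuals

def case_resample(rng):
    """(i) resample (x, y) pairs from their empirical bivariate distribution."""
    u = rng.integers(0, n, size=n)
    return x[u], y[u]

def model_resample(rng):
    """(ii) keep x fixed and add resampled residuals to the fitted line."""
    e_star = rng.choice(e, size=n, replace=True)
    return x, b0 + b1 * x + e_star

xs, ys = case_resample(rng)
xm, ym = model_resample(rng)
```

Scheme (i) lets the design vary between resamples, while scheme (ii) fixes the design and conditions on the fitted model; this difference drives the contrast between the two columns of Figure 2.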

The top row of Figure 2 shows jackknife-after-bootstrap plots for quantiles of $\hat\beta^*_1 - \hat\beta_1$ under both case and model-based resampling. The horizontal axis shows the standardized empirical influence values $\ell_j / (n^{-1} \sum \ell_j^2)^{1/2}$ for the slope of a simple linear regression, where $\ell_j$ is the exact empirical influence value for case $j$ and statistic as defined in the Appendix. The vertical axis in each plot shows $\hat\beta^*_1 - \hat\beta_1$ (scale multiplied by $10^4$); dotted lines show the 5, 50 and 95% empirical quantiles from $R = 999$ full-data resamples. The solid lines join points giving these quantiles for the case-deletion distributions, where $\bar\beta^*_{1,-j}$ is the average of the slopes obtained in those resamples in which case $j$ does not appear, for case resampling, and in those for which case $j$ was not used, for the model-based resampling scheme. The dashed lines are obtained by averaging the results from all the data sets for case resampling, and from the simulations based on the entire sample for the model-based bootstrap.

The distribution of $\hat\beta^*_1 - \hat\beta_1$ is much more dispersed, and skewed, under case resampling, but the dominant outlier, case 13, shows up clearly in both plots. The message is enhanced by the extreme influence for the outlying case. This would not happen if a comparable outlier were at dose $x = 470$ or 940, for example, but there would still be a large effect on model-based resampling in particular. A simple way to highlight the disruptive effect that an outlier may have on bootstrap calculations is to surround each full-data bootstrap quantile line (the dotted lines in Figure 2) with an uncertainty band for case-deletion quantiles that is computed using a robust estimate of standard deviation. See the bottom row of Figure 2, where the width of the band is $2 \times z$ times the standard deviation estimate based on the interquartile range of the $n = 14$ case-deletion quantiles, with $z = 1.96$ based on normal theory. There are other ways to calibrate the plot, such as using simulation envelopes for the quantiles, but the bands are both clear and simple. They are particularly useful when the number of observations $n$ is large, for then genuinely unusual points tend to be hidden by noise in the usual jackknife-after-bootstrap plot.

Here it seems that case 13 is a real outlier. Two main options for the data analysis are omission of case 13 and reapplication of the least squares fit, and replacement of the least squares fit by


a robust fit. Under the second it is important to consider what effect the outlier might still have on the bootstrap analysis, even if the fit itself is not affected. As Singh (1998) points out, the breakdown point for bootstrap analysis may be quite different to that for the data fit itself. For example, suppose that case $j$ is an outlier, and that the robust estimate $t$ is not sensitive to this one outlier. Bootstrap samples under case resampling will contain case $j$ with a frequency $f^*_j$ which is close to Poisson with unit mean. Thus if the estimator $T$ is vulnerable to two or more outliers, then approximately a quarter of the values of $t^*$ will be affected, because $P(f^*_j \ge 2) \approx 0.264$; this will corrupt estimates of both moments and tail quantiles.
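The quoted figure is easy to check: under case resampling the frequency of a fixed case is Binomial$(n, 1/n)$, close to Poisson with unit mean, and $P(f^*_j \ge 2) = 1 - 2e^{-1} \approx 0.264$ in the Poisson limit. A quick Python verification:

```python
import math

# Poisson(1) approximation: P(f >= 2) = 1 - P(0) - P(1) = 1 - 2/e
p_pois = 1 - 2 * math.exp(-1)

# exact Binomial(n, 1/n) value for a sample of n = 14 cases, as in the example
n = 14
p_binom = 1 - (1 - 1 / n) ** n - (1 - 1 / n) ** (n - 1)
```

Even at $n = 14$ the exact binomial probability agrees with the Poisson limit to three decimal places.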


FIGURE 2: Jackknife-after-bootstrap plots, with and without bands, for 5, 50 and 95% quantiles ($\times 10^4$) of slope estimation errors for the least squares regression fit to the data in Figure 1, under case (left) and model-based (right) resampling with $R = 999$. The dotted lines correspond to the full-data bootstrap quantiles. See text for details.

The left panel of Figure 3 shows a jackknife-after-bootstrap plot for the Huber Proposal 2 estimate of slope, as calculated by the function rlm of the R language (Venables & Ripley 2002) with 20 iterations from an initial least squares fit, and leaving out a few resamples in which convergence failed, under case resampling. The horizontal axis is the standardized jackknife influence value, equal to the scaled and centred version of $\bar t^*_{-j}$, where $\bar t^*_{-j}$ is the average of the Huber regression slopes for resamples in which case $j$ does not appear, and $\bar t^*$ is the overall average of these quantities. The vertical axis is the same as in Figure 2.


Case 13 has a big effect on bootstrap analysis of the robust estimate, despite having virtually no effect on the numerical value of the estimate itself. For the data, the Huber estimate (-0.0070) almost equals the least squares estimate with case 13 omitted (-0.0078), but the resampling distributions of these two estimators are very different: the central panel of Figure 3 shows the corresponding jackknife-after-bootstrap plot for least squares estimation with case 13 omitted, under case resampling.


FIGURE 3: Jackknife-after-bootstrap plots for 5, 50 and 95% quantiles ($\times 10^4$) of slope estimation errors, for the Huber estimate in the left and right plots and the least squares estimate not using case 13 in the middle plot; the resampling scheme on the right uses weights taken from the robust fit. The most influential observations are highlighted. Case resampling, $R = 999$. All vertical scales are the same.

This difficulty with resampling robust estimates in the presence of outliers may be circumvented by the use of weighted resampling, in which the case weights of the robust fit are used as the resampling weights. For our example, the final weights in the iterative weighted least squares fit of rlm all equal one, except for cases 11, 12, and 13, which have weights 0.40, 0.46, and 0.16, respectively. Under weighted case resampling using these weights, we obtain the jackknife-after-bootstrap plot in the right panel of Figure 3. This panel is very similar to the central one: clearly case 13 no longer has damaging effects on the resampling for the Huber estimate. This trick is not available for all robust estimators: for example, it cannot be used with least trimmed squares, although then the bootstrap analysis appears to be stable and reliable.
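Weighted case resampling is simple to sketch (Python/NumPy; the weights below mimic the rlm-style downweighting described in the text, applied to invented indices): cases are drawn with probability proportional to their robust-fit weights, so a heavily downweighted outlier rarely enters a resample.

```python
import numpy as np

rng = np.random.default_rng(3)
n, R = 14, 999
w = np.ones(n)
w[[10, 11, 12]] = [0.40, 0.46, 0.16]    # robust-fit weights for cases 11, 12, 13
p = w / w.sum()                          # resampling probabilities

# weighted case resampling: indices drawn with unequal probabilities
idx = rng.choice(n, size=(R, n), replace=True, p=p)
freq13 = (idx == 12).sum(axis=1)         # frequency of the downweighted case 13

# its expected frequency per resample drops from 1 to n * p[12] (about 0.19)
mean_freq = freq13.mean()
```

Under equal-probability resampling case 13 would appear about once per resample; with these weights it appears in well under a quarter of them, which is what defuses the breakdown problem noted above.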

3. VALIDITY OF RESAMPLING MODELS

A critical question in any bootstrap application is whether or not the resampling model is correct. In parametric cases this would usually be automatic, but exceptions include problems with large numbers of nuisance parameters. In nonparametric cases the situation will be less clear-cut.

Qualitatively, we want bootstrap samples $y^*$ to behave like real data samples $y$, and we refer to this as the resample-like-sample idea. For example, since the behaviour of an estimator $T$ will usually depend upon the variability of $Y$, we would want the variability of $Y^*$ to be comparable to that of $Y$. Here are three examples to show this idea in use.


TABLE 1: Numbers of AIDS reports in England and Wales to the end of 1992 (De Angelis & Gilks 1994). A † indicates a reporting delay of less than one month.

Rows: diagnosis period (year and quarter, 1983-1992). Columns: reporting-delay interval in quarters ($0^\dagger$, 1, 2, ..., 13, $\ge 14$) and total reports to the end of 1992.

[The individual counts in the table body were garbled in extraction and are omitted here; the data are available in the boot library for R and S-Plus.]


3.1. Example: Count data resampling.

Inhomogeneity is an important factor in Example 7.4 of Davison & Hinkley (1997), which involves modelling of an incomplete two-way table of counts of AIDS cases, rows corresponding to successive three-month diagnosis periods and columns corresponding to different lengths of delay in reporting diagnoses. The data, shown in Table 1 and available in the boot library of the statistical packages R and S-Plus, have missing (or incomplete) entries where delays extend the reporting time beyond the date at which the data were collated. The modelling is to enable prediction of these missing entries, so that the trend in numbers of diagnoses can be estimated. One simple possibility is a log-linear model in which the mean of the count in the $j$th row and $k$th column is $\exp(\alpha_j + \beta_k)$, and this appears to reflect the systematic variation adequately as judged by standard residual plots; this model involves 52 parameters estimated from 465 observations. There is strong overdispersion relative to Poisson variation, however, the residual deviance $D$ for the fitted log-linear model being 716 on 413 degrees of freedom. The residuals are very inhomogeneous, as illustrated in the left panel of Figure 4.

The overdispersion makes it inappropriate to resample table entries using the fitted Poisson model, and the strong inhomogeneity of the Pearson residuals casts doubt on the appropriateness of the simple nonparametric resampling scheme

$y^* = \max(\hat\mu + \hat\mu^{1/2} \varepsilon^*, 0),$   (2)

where $\varepsilon^*$ denotes a randomly sampled Pearson residual from the original fit, and $\hat\mu$ a fitted value. The first two boxplots of bootstrap residual deviances $D^*$ in the right panel of Figure 4 correspond to these two resampling schemes. Both distributions of $D^*$ are far from being centred on the data value $D$, confirming that the resampling schemes are inappropriate. The third boxplot corresponds to a moderately successful negative binomial replacement for the Poisson model. The final resampling scheme is a stratified version of the nonparametric resampling scheme, in which Pearson residuals are sampled from strata determined by fitted values, as marked by vertical lines in the left panel of Figure 4. The boxplot of $D^*$ for this scheme centres on the data value $D$, manifesting the "resample-like-sample" idea. Similar results are obtained for a variety of stratifications.
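Scheme (2) and its stratified variant can be sketched as follows (Python/NumPy; the fitted values, counts and quartile-based strata here are simulated stand-ins, not the AIDS data): Pearson residuals are drawn within strata defined by ranges of the fitted values, then new counts are formed as $y^* = \max(\hat\mu + \hat\mu^{1/2}\varepsilon^*, 0)$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = rng.gamma(2.0, 10.0, size=200)            # invented fitted values mu-hat
y = rng.poisson(mu * rng.gamma(4, 0.25, 200))  # invented overdispersed counts
e = (y - mu) / np.sqrt(mu)                     # Pearson residuals

# strata determined by fitted values (here: quartiles of mu-hat)
strata = np.digitize(mu, np.quantile(mu, [0.25, 0.5, 0.75]))

def resample_counts(stratified):
    """One bootstrap data set under scheme (2), plain or stratified."""
    e_star = np.empty_like(e)
    if stratified:
        for s in np.unique(strata):
            pool = e[strata == s]              # resample residuals within stratum
            e_star[strata == s] = rng.choice(pool, size=pool.size, replace=True)
    else:
        e_star = rng.choice(e, size=e.size, replace=True)
    return np.maximum(mu + np.sqrt(mu) * e_star, 0)   # y* = max(mu + mu^(1/2) e*, 0)

y_star = resample_counts(stratified=True)
```

The diagnostic in the text is then to refit the model to many such $y^*$ and check whether the resulting deviances $D^*$ centre on the observed $D$.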


FIGURE 4: Bootstrap analyses of AIDS data. Left panel: Pearson residuals $(y - \hat\mu)/\hat\mu^{1/2}$ against linear predictor $\log \hat\mu$ for the log-linear model, with possible strata for resampling separated by the vertical lines. Right panel: boxplots of residual deviances $D^*$ for 999 data sets generated under parametric simulation from the fitted Poisson model (Poisson), nonparametric resampling using (2) (Boot), parametric simulation from a negative binomial model (NegBin), and nonparametric bootstrapping from (2) with stratification (Strat). The horizontal lines show the original residual deviance (solid) and its degrees of freedom (dashes).


Stratification and conditioning seem widely applicable in resampling methods, the first as a preemptive way to make the entire bootstrap output relevant, and the second as a postsimulation device for selecting only the relevant bootstrap samples when irrelevant heterogeneity is intro- duced by the resampling scheme (Hinkley & Schechtman 1987). The clearest case for stratified resampling is where the data were themselves obtained by explicit stratified sampling.

3.2. Example: River heights at Manaus.

A general strategy for bootstrapping dependent data is block resampling, several variants of which have been proposed and studied for their ability to deal with stationary time series data. Recent accounts with extensive bibliographies are given by Lahiri (1999), Lahiri, Kaiser, Cressie & Hsu (1999) and Bühlmann (2002).

The simplest form of fixed-length block resampling patches together randomly selected blocks of length $\ell$; we ignore minor difficulties associated with noninteger $n/\ell$. A key question for an application is the choice of $\ell$. To assess whether or not a particular $\ell$ is suitable, it is useful to plot the sample correlogram, partial correlogram or cumulative periodogram, and then superimpose the corresponding plots for a small number of resampled series.
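Fixed-length block resampling can be sketched as follows (Python/NumPy; the series is an invented stand-in): blocks of length $\ell$ starting at random positions are concatenated until $n$ values are obtained.

```python
import numpy as np

rng = np.random.default_rng(5)
# invented dependent series: a damped random walk plus noise
x = np.cumsum(rng.normal(size=300)) * 0.1 + rng.normal(size=300)
n = len(x)

def block_resample(x, length, rng):
    """Patch together randomly chosen blocks of the given length."""
    n = len(x)
    starts = rng.integers(0, n - length + 1, size=int(np.ceil(n / length)))
    blocks = [x[s:s + length] for s in starts]
    return np.concatenate(blocks)[:n]          # trim any overshoot to length n

# one resampled series for each candidate block length, as in Figure 5
series = [block_resample(x, length, rng) for length in (1, 3, 5, 10)]
```

Superimposing the periodograms of a handful of such resampled series on that of the data gives the graphical check described above.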

Figure 5 illustrates this for the Manaus river height data (Davison & Hinkley 1997, p. 388) using the cumulative periodogram; the autoregressive process chosen by minimising the Akaike information criterion is of order $p = 8$. The upper panels correspond to block resampling of the entire series of length 1080 with blocks of length $\ell = 1$, 3, 5 and 10. For a suitable resampling model we would expect the resample curves to envelop the data curve, but evidently none of these captures the second-order properties of the original data.


FIGURE 5: Manaus data. Cumulative periodograms for data (bold curves) and for $R = 19$ resampled series taken in various ways. The upper panels show results for block resampling with block lengths $\ell = 1$, 3, 5 and 10, while the lower panels show results for postcolouring using a fitted AR(1) model. The diagonal lines are 95% confidence bands for a white noise process.

One strategy for inducing the appropriate serial dependence in the resampled time series is postcolouring, in which a model that removes strong systematic structure is fitted to the series, the corresponding residuals are resampled in blocks, and bootstrap time series are generated by applying the estimated model with the resampled residuals. The residuals are generally much


closer to white noise than the original observations, so shorter blocks of residuals can be used. The lower panels of Figure 5 suggest that if this strategy is applied using a fitted AR(1) model, then the resampling results depend much less on block length: postcolouring with $\ell \ge 3$ seems to reproduce the second-order structure of the original series.
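A minimal postcolouring sketch (Python/NumPy; the AR(1) series, its coefficient 0.7, and the block length 3 are invented for illustration): fit an AR(1) by least squares, block-resample its near-white residuals, then regenerate a series by running the fitted recursion on the resampled innovations.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = np.empty(n)                      # invented AR(1) series, coefficient 0.7
x[0] = rng.normal()
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + rng.normal()

# 1. fit AR(1) by least squares and extract residuals (close to white noise)
phi = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
resid = x[1:] - phi * x[:-1]

# 2. resample the residuals in short blocks (length 3 suffices here)
ell = 3
m = (n - 1) // ell + 1
starts = rng.integers(0, len(resid) - ell + 1, size=m)
e_star = np.concatenate([resid[s:s + ell] for s in starts])[:n - 1]

# 3. postcolour: rebuild a bootstrap series from the fitted recursion
x_star = np.empty(n)
x_star[0] = x[0]
for t in range(1, n):
    x_star[t] = phi * x_star[t - 1] + e_star[t - 1]
```

Because the strong serial structure is reinstated by the recursion rather than by the blocks, the result is far less sensitive to the block length than direct block resampling of the raw series.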

3.3. Example: Spatial distribution of caveolae.

The left panel of Figure 6 shows spatial locations of caveolae in a small square of muscle tissue. The resampling scheme we have in mind here is tile resampling (Hall 1985). Here the analogue of a randomly sampled observation is a randomly sampled observed square, called a tile, taken from the sample area. A number $m$ of these tiles are laid without overlap on a blank copy of the sample area. Each tile is a square with area a fraction $m^{-1}$ of the sample area and centre at a random point within that area. Tiles that overlap the boundaries are completed by toroidal wrapping over to the opposite edge of the sample area. The choice of $m$ is a critical aspect of tile resampling, analogous to the choice of block length when resampling time series.
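Tile resampling can be sketched like this (Python/NumPy; the point pattern is an invented uniform stand-in, and the sample area is taken to be the unit square): $m$ square tiles with random positions are cut from the pattern, wrapped toroidally at the boundaries, and laid down without overlap on an $m$-tile grid of a blank copy.

```python
import numpy as np

rng = np.random.default_rng(7)
pts = rng.uniform(size=(137, 2))         # invented point pattern in the unit square

def tile_resample(pts, m, rng):
    """Lay an m-tile grid of randomly cut, toroidally wrapped square tiles."""
    k = int(round(np.sqrt(m)))           # k x k grid; each tile has area 1/m
    s = 1.0 / k
    new_pts = []
    for gx in range(k):
        for gy in range(k):
            lo = rng.uniform(size=2)     # random lower-left corner of the cut tile
            # toroidal position of every point relative to the tile corner
            rel = (pts - lo) % 1.0
            inside = (rel[:, 0] < s) & (rel[:, 1] < s)
            # place the tile contents into grid cell (gx, gy) of the blank copy
            new_pts.append(rel[inside] + np.array([gx * s, gy * s]))
    return np.vstack(new_pts)

pattern_star = tile_resample(pts, m=25, rng=rng)
```

As $m$ grows the tiles shrink, so dependence at ranges longer than the tile side is destroyed; this is exactly the failure that the envelopes in Figure 6 expose.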

The non-Poisson character of the data is measured by a standardized version of the K-function of Ripley (1977),

K(d) = λ^{-1} E(number of events within distance d of an arbitrary event),

with λ the marginal density of points. The standardized form is Z(d) = {K(d)/π}^{1/2} − d, which for a Poisson process is exactly equal to 0 for all radii d. The empirical version Ẑ(d) of Z(d) is obtained from an edge-corrected empirical version of K(d) (Ripley 1981, ch. 8).
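A crude empirical Ẑ(d) can be sketched as below. For brevity this uses the naive K-function estimate that ignores edge effects; the paper uses Ripley's edge-corrected estimate instead, and the function name is ours.

```python
import numpy as np

def z_hat(points, d_grid, area):
    """Naive empirical Z(d) = {K(d)/pi}^{1/2} - d, with K estimated by the
    edge-effect-ignoring count of pairs within distance d (the paper uses
    an edge-corrected estimate of K instead)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    lam = n / area                                  # estimated intensity
    diff = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)                 # exclude self-pairs
    K = np.array([(dists <= d).sum() / (n * lam) for d in d_grid])
    return np.sqrt(K / np.pi) - d_grid
```

For a Poisson pattern this curve should hover near zero; a hard-core pattern such as the caveolae data dips below zero at small d.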

(Panel titles: 5×5 tiling, 10×10 tiling, 30×30 tiling, 50×50 tiling.)

FIGURE 6: Caveolae data (Appleyard et al. 1985). Left panel: locations of n = 137 points in muscle tissue, with nine random tiles ready for 3 × 3 tiling; note the toroidal wrapping of the tiles. Right panels: standardized K-function (heavy) with pointwise and overall 95% confidence bands, generated with 999 bootstrap replicates of tile resampling plans yielding tilings with m = 25, m = 100, m = 900, and m = 2500 tiles.

If a resampling scheme correctly mirrors the data, then corresponding resample estimates Ẑ*(d) should resemble Ẑ(d). We assess this by plotting Ẑ(d) surrounded by simulation envelopes based on R replicates of Ẑ*(d); it is useful also to plot a small number of the replicates


themselves. The right panels of Figure 6 show envelopes for four different tilings. As the number of tiles increases, we see that the observed Ẑ(d) becomes more and more extreme relative to the envelope, since the resampling of tiny tiles induces too much independence and so fails to reproduce the rather large minimum distance between points in the original data.
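The pointwise envelopes used above can be sketched in a few lines; the function name and the summary of how far the observed curve escapes the envelope are our own illustrative choices.

```python
import numpy as np

def simulation_envelope(z_obs, z_boot, level=0.95):
    """Pointwise simulation envelopes: at each distance, the empirical
    (1-level)/2 and (1+level)/2 quantiles of the R bootstrap curves.
    z_boot has shape (R, n_d); a large fraction of z_obs outside the
    envelope suggests the resampling scheme fails to mirror the data."""
    lo, hi = np.quantile(z_boot, [(1 - level) / 2, (1 + level) / 2], axis=0)
    outside = np.mean((z_obs < lo) | (z_obs > hi))
    return lo, hi, outside
```

An overall (simultaneous) band, as in Figure 6, would additionally adjust the pointwise level so the whole observed curve lies inside with the stated probability.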

4. PIVOTALITY

The key to success of the bootstrap is the accuracy of the substitution principle (1). The working quantity U = u(Y, F), although necessarily a contrast, need not be n^{1/2}(T − θ). For example, we could use the estimation error on a transformed scale U = n^{1/2}{a(T) − a(θ)} or the studentized form Z = (T − θ)/V^{1/2} with V an appropriate variance estimate. Suppose that G_n(u) is the cumulative distribution function of U. At least for regular problems where the bootstrap is consistent, we know that in the theoretical world Ĝ_n(u) = G_n(u) + O_p(n^{-1/2}). But this is not strong enough to make bootstrap methods accurate: we need at least Ĝ_n(u) = G_n(u) + O_p(n^{-1}), preferably Ĝ_n(u) = G_n(u){1 + O_p(n^{-1})}, and ideally Ĝ_n(u) = G_n(u) for exact pivotality. The choice U = Z will generally give improved accuracy, and sometimes U = n^{1/2}{a(T) − a(θ)} will do the same for some carefully chosen transformation a(·). But quite aside from what may be true asymptotically, for practical purposes we need to know just how close we are to the ideal before we can rely on quantiles of Ĝ_n as approximations to those of G_n.

In a parametric setting the notion of pivot is firm: the quantity U = u(Y, F) would have the same distribution for all parameter values. This could be checked by resampling from models F_θ with θ varying over a wide range that includes the estimated value t, and then plotting empirical characteristics such as the mean, variance and quantiles of U against θ to show whether or not these are effectively constant.
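This check can be sketched in a toy parametric case. We assume an exponential model with mean θ and take U = n^{1/2}(log T − log θ) for the sample mean T; these are our own illustrative choices, not an example from the text. Here U is exactly pivotal (T/θ is free of θ), so its simulated mean and variance should be flat in θ.

```python
import numpy as np

rng = np.random.default_rng(1)

def pivot_check(theta_grid, n=30, R=2000, rng=rng):
    """For each theta, simulate U = n^{1/2}(log T - log theta) under an
    exponential(mean theta) model and record its mean and variance;
    a pivot gives roughly constant values across theta."""
    rows = []
    for th in theta_grid:
        t = rng.exponential(th, size=(R, n)).mean(axis=1)
        u = np.sqrt(n) * (np.log(t) - np.log(th))
        rows.append((u.mean(), u.var()))
    return np.array(rows)
```

Plotting the two columns of the result against theta_grid gives the kind of constancy check described above.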

In the nonparametric case almost nothing useful will be an exact pivot, but we can think informally of approximate pivotality as the property that U = u(Y, F) will have roughly the same distribution for all F near the empirical distribution function F̂ and its generalizations. This we can assess by examining the distribution of U = u(Y, F) when F ranges over the special distributions F̂* that are empirical distribution functions of the R bootstrap samples, on the grounds that these distributions must encompass the same neighbourhood that almost certainly includes the true F itself. Taken literally, this idea involves double (or nested) bootstrapping, but as we shall see, this can often be avoided.

Consider first checking the stability of the bias and variance of T. The bootstrap estimates of these quantities, obtained by simple resampling R times from F̂, are

b = t̄* − t,   v = (R − 1)^{-1} Σ_{r=1}^R (t*_r − t̄*)².

These correspond to the parameter value θ = t of F̂. To calculate the analogous values at parameter value t*_r we resample from F̂*_r, the empirical distribution function of the rth bootstrap sample y*_{r1}, . . . , y*_{rn}. If we draw M samples from F̂*_r, and from these we calculate estimates t**_{rm}, m = 1, . . . , M, then estimates of the bias and variance are

b*_r = t̄**_r − t*_r,   v*_r = (M − 1)^{-1} Σ_{m=1}^M (t**_{rm} − t̄**_r)².

We would now plot these against values of t*_r as a surrogate for plotting the bias and variance of T against θ. Tibshirani (1988) and Ventura (2002) give examples of such plots.
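The brute-force version of this double bootstrap can be sketched as follows; function and variable names are ours, and the statistic is the sample mean purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def nested_bias_var(y, R=50, M=100, stat=np.mean, rng=rng):
    """For each first-level resample y*_r, draw M second-level resamples
    and estimate the bias b*_r and variance v*_r of the statistic;
    plotting these against t*_r mimics plotting bias and variance
    against theta."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t_r, b_r, v_r = [], [], []
    for _ in range(R):
        ystar = rng.choice(y, size=n, replace=True)
        t1 = stat(ystar)
        t2 = np.array([stat(rng.choice(ystar, size=n, replace=True))
                       for _ in range(M)])
        t_r.append(t1)
        b_r.append(t2.mean() - t1)
        v_r.append(t2.var(ddof=1))
    return np.array(t_r), np.array(b_r), np.array(v_r)
```

The RM + R statistic evaluations make this expensive, which is exactly the cost that the smoothing and recycling devices below are designed to avoid.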

Two problems with these calculations are that the total number of simulations (RM + R) may be large, and more seriously, that the resulting graph is typically noisy because of the discreteness of the F̂*_r and their lack of smoothness with respect to t*_r. We deal with these problems by


smoothing the F̂*_r and by using a recycling strategy that involves sampling from many fewer than all R of the F̂*_r.

The smoothed version of F̂* that corresponds to parameter value θ has probabilities on the data values y_1, . . . , y_n defined by (Davison, Hinkley & Worton 1995)

p_{θ,j} ∝ Σ_{r=1}^R f*_{rj} w{(t*_r − θ)/h},   j = 1, . . . , n,   (3)

where f*_{rj} is the frequency with which y_j appears in the rth bootstrap resample, h is a smoothing parameter, and w is a symmetric probability density with mean zero and unit variance; we use the normal density. The constant of proportionality in (3) is chosen so that p_{θ,1} + · · · + p_{θ,n} = 1. We denote the corresponding cumulative distribution function by F̃*_θ. The parameter value t(F̃*_θ) does not exactly equal θ, but for small values of h, such as the range 0.2v^{1/2} to 1.0v^{1/2} that we use, there will be negligible difference.

In principle we could use (3) with θ = t*_r to replace F̂*_r for every r, but in practice we take 20 to 100 values of θ equally spaced over the central 95% range of the t*_r values.
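Our reading of the frequency-smoothing step (3) can be sketched as below, with a normal kernel; the function name and the example setup (an exponential sample, θ at the lower quartile of the t*_r, h = 0.5v^{1/2}) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(11)

def smoothed_probs(freqs, t_star, theta, h):
    """Smoothed resampling probabilities p_{theta,j} on y_1,...,y_n,
    proportional to sum_r f*_{rj} w{(t*_r - theta)/h} with a normal
    kernel w, following our reading of equation (3)."""
    w = np.exp(-0.5 * ((t_star - theta) / h) ** 2)   # normal kernel weights
    p = freqs.T @ w                                   # length-n vector
    return p / p.sum()

# example: R = 999 resamples of an exponential sample of size n = 20
y = rng.exponential(size=20)
n, R = len(y), 999
idx = rng.integers(0, n, size=(R, n))
freqs = np.stack([np.bincount(row, minlength=n) for row in idx])
t_star = y[idx].mean(axis=1)
v = t_star.var(ddof=1)
p = smoothed_probs(freqs, t_star,
                   theta=np.quantile(t_star, 0.25), h=0.5 * np.sqrt(v))
```

The resulting distribution reweights the data so that resampling from it recentres T* near the chosen θ.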

Having fixed on the distributions F̃*_θ that replace the F̂*_r, we now introduce recycling to avoid the separate resampling from all F̃*_θ. The recycling idea is to make repeated use of the importance sampling identity

κ(H) ≡ E{k(X) | H} = E{k(X) dH(X)/dG(X) | G},   (4)

with a fixed G that assigns probabilities g_1, . . . , g_n respectively to the data values y_1, . . . , y_n and H ranging over all F̃*_θ of interest, X = (Y*_1, . . . , Y*_n). The expectations in (4) are replaced by averages over Q samples drawn from G, so that for all values of θ considered, E{k(Y*) | F̃*_θ} is approximated by

Q^{-1} Σ_{q=1}^Q k(y*_q) Π_{j=1}^n (p_{θ,j}/g_j)^{f*_{qj}},   (5)

with f*_{qj} equal to the frequency of y_j in the qth resample from G. The simplest choice for G is F̂, for which the resamples y*_q are just ordinary first-level bootstrap resamples, but more effective choices involve mixtures of a small number of the F̃*_θ.

The use of smoothed estimates F̃*_θ in place of the original resample empirical distribution functions F̂*_r has a dramatic effect on this recycling algorithm, which otherwise would be extremely unstable. For example, if n = 10 and we were to use unsmoothed F̂*_r with G = F̂, then 96% of the recycling weights w_{r,q} analogous to (6) would equal zero while having mean 1.

Further improvements are possible (Hesterberg 2001; Ventura 2002) and are used in our implementation of these ideas, as described in the Appendix.

4.1. Example: Insurance data.

The left panel of Figure 7 is an exponential probability plot for the 254 insurance claims y = x − 5 that exceed the threshold x = 5 in a larger set of data (Embrechts, Klüppelberg & Mikosch 1997); for commercial reasons the units are not given. Interest concerns the right tail of the claim distribution, for which a standard extreme-value model is the generalized Pareto distribution (Davison & Smith 1990; Coles 2001)

F(y) = 1 − exp(−y/σ),  κ = 0;   F(y) = 1 − (1 + κy/σ)^{−1/κ},  otherwise;   σ > 0, −∞ < κ < ∞.   (7)


The exponential distribution occupies the central position κ = 0, but the evident upward curvature in the probability plot suggests that the data have an appreciably heavier tail, with κ > 0. To estimate the p quantile θ = ξ_p = σ{(1 − p)^{−κ} − 1}/κ for the underlying distribution using this model, we substitute the maximum likelihood estimates κ̂ = 0.63 and σ̂ = 3.81 based on the data plotted in Figure 7. For example, for p = 0.99 we get t = ξ̂_{0.99} = 104.5. A confidence interval for θ can be obtained using a profile likelihood, but can we trust model (7), which is a high-threshold approximation and which yields estimates that are very sensitive to the largest values in the sample? Was the chosen threshold high enough for this approach to work? A more robust approach assesses the uncertainty in T using the nonparametric bootstrap.
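The plug-in quantile computation is a one-liner; the sketch below uses the rounded estimates quoted in the text, so it reproduces the quoted value only approximately, and the function name is ours.

```python
import numpy as np

def gpd_quantile(p, kappa, sigma):
    """Quantile xi_p = sigma{(1 - p)^{-kappa} - 1}/kappa of the
    generalized Pareto model (7), with the exponential limit at kappa = 0."""
    if kappa == 0.0:
        return -sigma * np.log(1.0 - p)
    return sigma * ((1.0 - p) ** (-kappa) - 1.0) / kappa

# plug-in 0.99 quantile with the quoted (rounded) estimates
t = gpd_quantile(0.99, kappa=0.63, sigma=3.81)
# with the rounded estimates this gives roughly 104, close to the
# value 104.5 quoted in the text
```

The strong dependence of (1 − p)^{−κ} on κ for small 1 − p is what makes this estimate so sensitive to the largest claims.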


FIGURE 7: Insurance claim data. Left panel: plot of ordered data y against exponential order statistics. Right panel: plot of v*^{1/2} against t* for the 0.99 quantile estimate t = ξ̂_{0.99} of the claim distribution, based on R = 999 bootstrap replicates.

A confidence interval for θ = ξ_p can be based on T* − t, a studentized form Z* of this, or a(T*) − a(t) for some monotone transformation a. That raw T* − t might not work is strongly suggested by the right panel of Figure 7, in which the standard error v*^{1/2} corresponds to the nonparametric delta method standard error v^{1/2}, where v = n^{-2}(l_1² + · · · + l_n²) with l_j the exact empirical influence value for case j; this is easily determined by the maximum likelihood estimating equations for κ and σ and the definition of ξ_p (Davison & Hinkley 1997, p. 63). For a clearer and more constructive analysis, we consider the question of whether or not T* − t is pivotal; and if not, does a simple transformation cure the problem?

Figure 8 shows pivot plots resulting from a nonparametric bootstrap with 2000 resamples; see the Appendix for details. The panels strongly suggest that T* − t is not pivotal. The corresponding plot for log T* − log t, however, suggests that although not perfect, the logarithmic transformation stabilizes the distribution quite well: the relative variation in the lower panel of the left column, which is around 120%, is reduced to around 20% in the corresponding panel of the third column, with the second and fourth columns showing substantial reductions in the relative variation of the quantiles as well. In order to indicate error due to variability of the bootstrap samples, the panels include bands drawn at ±2 standard errors of the central bootstrap value. Illustrations of the accuracies of these plots are provided in Figure 9, whose right panel shows ten independent replicates of the quantile part of a pivot plot. The left panel shows five independent replicates of the corresponding full double bootstrap version using R = 500 and M = 1000 second-level resamples for each first-level resample.


" 6 i o ir ._________-

.fcZ

.............. . . . . . . . . . . . . . . . .

:e: ........

. . .

i 60 80 100 140 60 80 100 140

1 t

0

I 4

B

, I ,

60 80 100 140

t

60 80 1w 140

t

60 80 100 140 €0 80 100 140

1 1

FIGURE 8: Insurance claim data. Left panels: pivot plots for the dependence on θ = ξ_{0.99} of the bias and standard error of T = ξ̂_{0.99}, and of the 5, 25, 50, 75, 95% quantiles of T − θ = ξ̂_{0.99} − ξ_{0.99}. Right panels: similar output for log T − log θ.


FIGURE 9: Insurance claim data. As in Figure 8, but for ten independent replicates of pivot plots (second and fourth panels from left), and ten independent replicates of the double bootstrap analogue, each with R = 500 and M = 1000 (other panels).

It is important to check carefully the vertical scales and the error bands on these plots, so as to avoid reading significance into practically unimportant differences.


5. INCONSISTENCY OF BOOTSTRAP METHOD

The primary meaning of the term “bootstrap consistency” is that the substitution approximation (1) has error tending to zero as sample size n (or equivalent) increases indefinitely. This “absolute” consistency may fail under the most obvious choice of F̂ in (1), as it does for several useful estimators, such as the sample maximum, kernel estimators of densities and regression curves, and superefficient estimators. There is also a sort of “reverse” inconsistency, which we discuss at the end of this section, where the bootstrap fails because it is being used to estimate an asymptotic distribution that does not exist.

5.1. Absolute inconsistency.

When data Y are randomly sampled from a distribution F, and the distribution of u(Y, F) converges in an appropriate way as n → ∞, then the convergence of the resampling model (be it the empirical distribution function F̂ or a fitted parametric model) will usually imply that the left- and right-hand sides of (1) converge to the same thing: this is absolute consistency. Failure is often a quite subtle phenomenon, but penetrating theoretical contributions have been made by Putter & van Zwet (1996) and Beran (1997). It is as yet unclear to what extent theoretical results can be translated into reliable empirical diagnostics, though Samworth (2003) suggests that at least in simple cases the consequences of attempting to correct for inconsistency can be worse than those of the inconsistency itself. We examine one potentially important application below.

The result of Beran (1997) can be summarized in part as follows. If T is a maximum likelihood estimator of θ and T̃ is a corresponding superefficient estimator, with superefficiency restricted to the parameter subset Θ_S (whose measure must be zero), then the distribution of U* = n^{1/2}(T̃* − θ*) will not consistently estimate the distribution of U = n^{1/2}(T̃ − θ) for θ ∈ Θ_S if resampling is done using the model with parameter value θ* equal to either T or T̃. Also, Δ* = n^{1/2}(T̃* − T*) and D* = n^{1/2}(T* − θ*) will be asymptotically independent if and only if the bootstrap distribution of Δ* converges in probability to the limiting distribution of Δ = n^{1/2}(T̃ − T). This latter convergence, which we refer to as asymptotic consistency of the bootstrap, literally means that

E{P(Δ* ≤ u | Y)} → lim_{n→∞} P(Δ ≤ u).   (8)

A potential diagnostic of inconsistency, therefore, is a scatter plot of simulated values of (δ*, d*) to check for dependence; for vector T, scalar functions of δ* and d* should be chosen for plotting.

Strictly speaking, this theory justifies the diagnostic plot only for θ ∈ Θ_S, which is usually not the focus of practical interest. But the discontinuity in the asymptotic behaviour of the bootstrap at the boundary of Θ_S translates into degraded finite-sample performance of the bootstrap for θ close to Θ_S as well as inside Θ_S. One might hope that the corresponding dependence between Δ* and D* will also extend to θ near Θ_S, so as to provide an empirical diagnostic of whether or not the intended bootstrap calculations can produce an accurate approximation.

Beran shows that the diagnostic scatter plot of δ* versus d* appears to work effectively with the superefficient Hodges estimator for the mean of a normal distribution. A potentially important example that we discuss briefly here is the Stein (1981) shrinkage estimator for a multivariate mean, which may be viewed as a prototype for several nonparametric smoothers.

5.2. Example: Stein estimator of normal means.

Suppose that Y_1, . . . , Y_m are independent normal random variables each with the known variance σ², but with potentially different and unknown means θ_1, . . . , θ_m. The vector T = (Y_1, . . . , Y_m) is the maximum likelihood estimator for θ = (θ_1, . . . , θ_m). A potentially large improvement on T is offered by the Stein estimator

T̃ = Ȳ1 + s(Y)(T − Ȳ1),   s(Y) = max{0, 1 − (m − 3)σ²/Σ_{i=1}^m (Y_i − Ȳ)²},

with components

T̃_j = Ȳ + s(Y)(Y_j − Ȳ),   j = 1, . . . , m.


In Beran's work, each Y_j is the average of n independent replicates, and n → ∞ is equivalent to σ → 0. The superefficiency set for T̃ is Θ_S = {θ : θ_1 = · · · = θ_m}, but T̃ dominates T in terms of mean squared error for all θ, and strongly so for θ near Θ_S when m ≫ 3. One might be optimistic that the diagnostic scatter plot of δ* versus d* would show whether or not the resampling distribution of T̃* − θ* is close to the distribution of T̃ − θ, for σ > 0 and θ either in, or sufficiently close to, Θ_S so that T̃ is still a significant improvement on T. Unfortunately this does not happen.

The scatter plot does quite successfully diagnose when the resampling model with θ* = T does not work, but not when the use of θ* = T̃ works or does not work. To illustrate this, we selected four data sets of size m = 10 simulated from the model with θ = 0 ∈ Θ_S and σ² = 1, those sets corresponding to the 0.25, 0.50, 0.75 and 0.95 quantiles of the χ²_9 distribution of Σ(Y_i − Ȳ)². The values of s(y) are respectively 0, 0.16, 0.39 and 0.59. For simplicity of presentation, we consider the distribution of ℓ = ‖T̃ − θ‖² and its resampling analogue ℓ* = ‖T̃* − θ*‖², and we take scalar versions of the diagnostic statistics,

a* = Σ_j (T̃*_j − T*_j)²   and   d* = Σ_j (T*_j − θ*_j)².
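The quoted shrinkage values can be checked numerically. The sketch below assumes the positive-part factor s = max{0, 1 − (m − 3)σ²/S} with S = Σ(Y_i − Ȳ)² and σ² = 1 (our reading of the estimator above) and uses SciPy's χ² quantile function.

```python
import numpy as np
from scipy.stats import chi2

# shrinkage factors s(y) for samples whose sum of squares S sits at the
# 0.25, 0.50, 0.75 and 0.95 quantiles of the chi-squared distribution
# with m - 1 = 9 degrees of freedom (sigma^2 = 1 assumed)
m = 10
S = chi2.ppf([0.25, 0.50, 0.75, 0.95], df=m - 1)
s = np.maximum(0.0, 1.0 - (m - 3) / S)
# rounds to 0, 0.16, 0.39, 0.59, the values quoted in the text
```

The first value is truncated at zero by the positive part, which is why the first simulated data set satisfies θ* ∈ Θ_S exactly.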

Figure 10 shows Q-Q plots of ℓ* versus ℓ (top row) and the corresponding scatter plots of a* versus d* (bottom row). For the first data set, θ* ∈ Θ_S because s(y) = 0, and ℓ* has the same distribution as ℓ, and their distributions are just as close for the second data set. But a* and d* show strong dependence for all data sets. This does not conflict with the theory: the dependence matches the failure of the aggregate property (8), which is evident from the rightmost two upper panels of the figure. Such behaviour clearly weakens the value of the diagnostic plot for any particular data set.

Similar behaviour occurs in the more practical cases where θ is near Θ_S, and for large m it extends to the situation where θ*_j = Ȳ + {s(Y)}^{1/2}(Y_j − Ȳ), which Beran (1995) shows will produce accurate resampling results.


FIGURE 10: Diagnostic plots for parametric bootstrap of the Stein estimator, for four data sets of size m = 10 from the normal model with θ = 0. Top row: Q-Q plots of ℓ* = ‖T̃* − θ*‖² versus ℓ = ‖T̃ − θ‖²; resampling with θ* = T̃, R = 999. Bottom row: corresponding diagnostic plots of d* versus a*.


Questions of inconsistency must perhaps be settled theoretically, with results used as warnings that finite-sample behaviour of potential resampling schemes must be checked with especial care.

5.3. Relative inconsistency.

A more subtle type of inconsistency is illustrated in the work of Lee & Young (1995) on nonparametric bootstrap inference for the mean μ = E(Y) when data y_1, . . . , y_n are unknowingly sampled from a log-normal distribution. Expansions for the coverage of the main bootstrap confidence interval methods suggest strange behaviour, which is confirmed in simulation studies. For example, their two-sided calibrated percentile interval I_1 emerges from their study as one of the best nonparametric confidence interval methods in general; it is second-order correct. However, the nominal 90% version of I_1 has coverage expansion under the log-normal model

P(μ ∈ I_1) ≈ 0.90 − 2.5 × 10²⁰ n^{-2},

which is wildly inaccurate as a finite-sample approximation. Correspondingly, actual finite-sample coverages for this distribution are not close to 90%: for n = 20, 35, 100 the coverages are 53%, 60%, 70%, respectively. Jackknife-after-bootstrap plots (Section 2) typically suggest difficulty with bootstrapping the estimated mean and variance for log-normal data, under both parametric and nonparametric resampling.

Wood (2000) relates the inaccuracy of consistent methods to subexponentiality of the data distribution F, a characteristic possessed by the log-normal model. The essence of the problem is as follows. Consistency in the usual absolute sense provides that if P{n^{1/2}(T − θ) ≥ u_{1−α,n}} = α, with α fixed, then the bootstrap quantile estimate û_{1−α,n} converges in probability to the same limit as does u_{1−α,n}, as n → ∞. But more relevant to finite-sample coverage of upper confidence limits for θ may be the limiting behaviour of the ratio

ρ(α_n; F̂, F) = P{n^{1/2}(T − θ) ≥ û_{1−α_n,n}} / α_n

when n → ∞ with α_n → 0: finite-sample accuracy may require that this ratio tends to 1. Wood obtains theoretical results connecting the behaviour of α_n to the failure of the ratio ρ(α_n; F̂, F) to converge to 1.

Unfortunately no empirical nonparametric mimic of ρ(α_n; F̂, F) has been found to serve as a diagnostic of this harmful condition.

5.4. Reverse inconsistency.

In a 'typical' simple problem, one expects that U_n = n^{1/2}(T_n − θ) has distribution G_n with a proper limit as n → ∞ and such that U*_n = n^{1/2}(T*_n − t_n) has distribution Ĝ_n converging to the same limit as n → ∞. Most often the limit is a normal distribution. This being the case, one would expect that if we took a subsample of size m < n from the original sample of size n, and calculated T*_{n,m} from that subsample, then U*_{n,m} = m^{1/2}(T*_{n,m} − t_n) would have distribution Ĝ_{n,m} close to Ĝ_n. So to check whether or not the standardization n^{1/2} is correct, it would seem appropriate to plot empirical quantiles of U*_n against corresponding quantiles of U*_{n,m} for several m appreciably smaller than n; the quantiles should not change systematically with m if all is well. Problems to which this might apply, but with n^{1/2} replaced by n^a for some known or in principle knowable a, include extreme-value statistics, nonparametric density and regression estimates, and averages of heavy-tailed distributions. Other examples including the mode are discussed by Léger & MacGibbon (2006).

A more constructive development due to Bertail, Politis & Romano (1999) is an empirical subsampling method for determining the rate of convergence of T_n − θ, the quantity a above. Their approach is designed for general dependence structures and so uses subsampling without


replacement, and we shall follow this below. Suppose that for some a < 0, T_n − θ = O_p(n^a). Then for any probability p, the pth quantile of T_n − θ is O(n^a). This suggests that, at least for large n, a reasonable estimate of a can be obtained as the slope from a plot of the logarithm of the pth quantile of T*_{n,m} − t against log m.
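The slope estimate just described can be sketched as follows; the function name, the choice p = 0.75, and the use of absolute deviations are our own illustrative choices, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(5)

def rate_estimate(y, stat, ms, p=0.75, R=199, rng=rng):
    """Subsampling rate estimation in the spirit of Bertail, Politis &
    Romano (1999): the slope of log |p-th quantile of T*_{n,m} - t|
    against log m, over R without-replacement subsamples of each size m."""
    y = np.asarray(y, dtype=float)
    t = stat(y)
    qs = []
    for m in ms:
        t_sub = np.array([stat(rng.choice(y, size=m, replace=False))
                          for _ in range(R)])
        qs.append(np.quantile(np.abs(t_sub - t), p))
    # slope of the log-log plot estimates the exponent a
    return np.polyfit(np.log(ms), np.log(qs), 1)[0]

y = rng.normal(size=1000)
a_hat = rate_estimate(y, np.mean, ms=[10, 20, 40, 80])
# for the sample mean of a normal sample, a_hat should be near -1/2
```

Keeping the subsample sizes well below n limits the finite-population distortion discussed below.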

This idea can be used to build a diagnostic that compares the behaviour of a desirable but potentially unreliable estimator with that of a less desirable but more reliable one. Suppose that T_n is an estimator of θ and that it is hoped that the data generation mechanism is such that the distribution G_n of n^a(T_n − θ) has a nondegenerate limit G as n → ∞, for some known a. Let T′_n be an estimator of a parameter θ′ which will often equal θ, and suppose that it is known that the distribution H_n of n^a(T′_n − θ′) has a nondegenerate limit H under broader conditions than those applying to T_n, and that these broader conditions are known to hold. If in addition the hoped-for properties of T_n hold, then ratios of measures of the dispersion of T_n − θ and T′_n − θ′, such as their interquartile ranges, will tend to nonzero constants. If however the rates of convergence of T_n and T′_n differ, then these ratios will tend either to zero or to infinity as n → ∞. Of course n is fixed and finite, so a bootstrap analogue must be based on comparison of the dispersions of the empirical distributions of subsampled quantities T*_{n,m} − t_n and T′*_{n,m} − t′_n for a variety of subsample sizes m < n. If sampling without replacement is to be used, then we should take m ≪ n to avoid being misled by the effect of finite population sampling.

5.5. Example: Mean and median for long-tailed distributions.

To assess the potential of this diagnostic tool, we used it to compare the behaviour of the mean and median with data close to t_ν. The normalising rate for the median is n^{-1/2} for all ν, but this rate applies to the mean only if ν ≥ 3, and we would hope that this distinction would be made by the diagnostic. As a more stringent test we used a truncated t_ν distribution: for small ν enormous observations may appear, so those greater than 40 in absolute value were deleted; thus, although they possess moments of all orders, the data are almost indistinguishable from t_ν samples. This is intended to mimic what might happen in practice, where data often have tails longer than normal but really huge observations tend to be discarded.


FIGURE 11: Subsampling diagnostic for bootstrap failure of the mean. Left panel: log IQR for means (•) and medians (+) of 199 subsamples of size m from a single sample of size 100 from the Cauchy distribution. Right panel: empirical distribution of a diagnostic for failure of the bootstrap of the mean for 200 samples of size n = 100 from the truncated t distribution with ν degrees of freedom; horizontal lines are at 0, ±2.


Given a sample from this distribution, we applied the following procedure. We took from it R = 199 subsamples of size m without replacement and calculated the mean and median for each subsample. We computed the interquartile ranges of the R means and medians, thus giving us a pair (IQR*_mean, IQR*_med). This was repeated for m = 3, 5, . . . , 19. Comparison of the slopes of graphs of log IQR*_mean and log IQR*_med against log m should be informative about whether it is safe to bootstrap the mean for that sample; a natural way to assess this is through the t statistic for slope, Z_sub, when log(IQR*_mean/IQR*_med) is regressed on log m.
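The procedure above can be sketched directly; the function names are our own, and the ordinary least-squares t statistic is computed by hand so the sketch stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(13)

def iqr(a):
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25

def zsub_diagnostic(y, ms, R=199, rng=rng):
    """t statistic Z_sub for the slope when log(IQR*_mean/IQR*_med),
    computed over R without-replacement subsamples of each size m,
    is regressed on log m."""
    y = np.asarray(y, dtype=float)
    ratios = []
    for m in ms:
        means = np.empty(R)
        meds = np.empty(R)
        for r in range(R):
            s = rng.choice(y, size=m, replace=False)
            means[r] = s.mean()
            meds[r] = np.median(s)
        ratios.append(iqr(means) / iqr(meds))
    x = np.log(ms)
    yy = np.log(ratios)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, yy, rcond=None)[0]
    resid = yy - X @ beta
    s2 = resid @ resid / (len(x) - 2)             # residual variance
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

# single sample of size 100 from a heavy-tailed (Cauchy) distribution
y = rng.standard_cauchy(100)
z = zsub_diagnostic(y, ms=range(3, 21, 2))
```

Large positive values of Z_sub indicate that the mean's subsampled dispersion shrinks more slowly than the median's, a warning against bootstrapping the mean.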

The left panel of Figure 11 shows how this works for a single sample of size n = 100 generated with ν = 1. In order to assess how large it is safe to take m, we show results for m = 25, 30, . . . , 50 as well as the values above. The slopes and standard errors of the fitted lines for the mean and median are respectively −0.3 (0.03) and −0.49 (0.03), and Z_sub = 4.63, giving apparently strong evidence of different rates of convergence. However the evidence would have been weaker if only the larger values of m, for instance m ≥ 25, had been used; for this subset Z_sub = 2.99. As m increases the dominant effect becomes that of sampling from a finite population, which of course has moments of all orders, so for some sets of data the left-hand panel of Figure 11 would show clear downward curvature in the subsampling results for the mean, with slopes for the mean and median essentially equal for large m. This suggests that m needs to be quite small compared to the sample size, and hence that very large samples are needed for without-replacement subsampling to work well, though it should perform better if with-replacement subsampling can be used. Our experience with this procedure is not as good as that of Bertail, Politis & Romano (1999), who considered samples with n = 100, 1000 and 10,000.

The right panel shows 200 replicates of Z_sub for various ν and n = 100. There is a noticeable upward shift when ν ≤ 2. About 60% of samples have Z_sub > 2 when ν ≤ 2, dropping to about 6% for larger ν. Thus a rule of thumb that the mean should not be bootstrapped if Z_sub > 2 would be conservative, but not badly so. The usual preliminary to a location analysis would be a normal scores plot of the sample. If this showed long tails it would be wise to replace the mean with a robust estimator, independently of the results of a subsampling analysis.

Subsampling has been proposed for use with time series, using blocks of m consecutive observations. There are then just n − m + 1 subsamples, and the corresponding values of the statistic are highly correlated. Hence quantities based on the subsampled statistics, such as the interquartile ranges, tend to be extremely variable, making it very difficult to extract reliable information from plots such as the left panel of Figure 11. Though limited, our experience suggests that huge samples of stationary data would be needed for this form of subsampling to work well. Subsampling is likely to be most valuable for more complex situations than location estimation, but using independent data.

6. DISCUSSION

Of genuine concern must be the fact, seen in many applications, that in the quite complex models to which bootstrap methods are often applied, either the standard bootstrap must be modified so as to produce reliable results, or there is a dauntingly large family of potential bootstrap schemes with little theoretical guidance available as to which is best suited to a particular circumstance. In this paper we have reviewed some of the problems that may arise and described some initial steps toward providing useful and practicable diagnostics to help identify problematic situations and to indicate what corrective action might be appropriate in some cases.

Diagnostic methods in applied statistics are rarely unambiguous. This remains true for bootstrap diagnostics, and we have not necessarily teased out the best ways to search for signs of the problems mentioned at the outset. Indeed, in one notable instance, relative inconsistency, a useful empirical diagnostic has proved elusive.

We hope that these remarks will act as a challenge to readers to add to our catalogue of problem applications and, much more importantly, to invent corresponding diagnostics.


APPENDIX

A.1. Influence values.

For a statistical estimator T = t(F̂), exact influence values l_j are empirical values of the influence function, which is defined as

l(u, F) = lim_{ε→0} [t{(1 − ε)F + εH_u} − t(F)] / ε,   (9)

with H_u(y) = I{y ≥ u}. Then l_j = l(y_j, F̂), and necessarily Σ_j l_j = 0. The Taylor approximation for T is

t(F̂) ≈ t(F) + ∫ l(y, F) dF̂(y) = t(F) + n^{-1} Σ_{j=1}^n l(y_j, F),

and from this flows the variance approximation var(T) ≈ v_L = n^{-2} Σ_j l_j². Standardized influence values are the l_j divided by (Σ_k l_k²)^{1/2}.

The jackknife approximation to l_j corresponds to setting u = y_j, F = F̂, and replacing the limit in (9) by setting ε = −1/(n − 1). This leads to the jackknife pseudo-value (n − 1)(t − t_{−j}) as an alternative to l_j, where t_{−j} denotes the estimator T calculated for the data omitting case j. If a plot based on bootstrap output is required, the jth pseudo-value can be approximated by (n − 1)(t̄* − t̄*_{−j}), with t̄*_{−j} the average of the t* values over those resamples that omit case j.
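A minimal sketch of the pseudo-value computation (names ours). For the sample mean the pseudo-values reduce exactly to y_j − ȳ, so the approximation v_L recovers the plug-in variance of the mean.

```python
import numpy as np

def jackknife_pseudovalues(y, stat):
    """Jackknife pseudo-values (n - 1)(t - t_{-j}), which approximate
    the exact empirical influence values l_j."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = stat(y)
    t_minus = np.array([stat(np.delete(y, j)) for j in range(n)])
    return (n - 1) * (t - t_minus)

y = np.array([1.0, 3.0, 4.0, 7.0, 10.0])
lj = jackknife_pseudovalues(y, np.mean)
# for the mean, the pseudo-values are exactly y_j - ybar, and
# v_L = n^{-2} sum l_j^2 is the plug-in variance of the mean
v_L = (lj ** 2).sum() / len(y) ** 2
```

For nonlinear statistics the pseudo-values only approximate the l_j, and, as noted below, they inherit the sensitivity of the t* values to outliers.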

Note that in the above outline, F̂ could be replaced by another estimate of F, and for multiple samples the extension to deal with vector F is straightforward (Davison & Hinkley 1997, §§2.7, 3.2).

The jackknife method is generally appealing because it requires no theoretical input and uses only bootstrap output. But it is vulnerable to the effects of outliers on the t* values.

A.2. Details of the pivot plots.

A pivot plot is intended to suggest how the distribution of a quantity such as T − θ would be affected by changes in the population parameter value θ. Following the bootstrap substitution principle, we instead examine how the distribution of T* − t varies with values t of T. Our pivot plots contain three panels showing aspects of this, namely estimates of the bias and standard error of T* and of certain quantiles of the distribution of T* − t. Let π(θ) denote the quantity of interest, be it bias, variance, or a specified quantile.

We estimate π(t) for n_t values of t, taken as equally spaced quantiles of the bootstrap output t*_1, . . . , t*_R between the 0.05 and 0.95 quantiles of the t*. We have found that n_t > 20 is needed to approximate the relationship between π(t) and t well. At each t we first construct a distribution on the data y_1, . . . , y_n for which T* is centred at t by smoothing the original bootstrap frequencies as in (3) and importance reweighting the y_j using (5) and (6). From this we calculate the importance sampling weights to estimate the value of π(t) that would have arisen had we sampled from a distribution centred at t. The importance sampling regression estimates (Hesterberg 1995) perform best for this purpose.

As described above, the technique induces a bias that stems from using the same set of bootstrap frequencies to find both the recentred distribution and the importance sampling weights. This bias can be reduced by splitting \(t^*_1, \ldots, t^*_R\) into two halves comprising their alternate ordered values, and using one half for the frequency smoothing and the other for the importance sampling and estimation. The roles of the halves are then reversed, giving two estimates of \(\pi(t)\).

Importance sampling estimates in the tails of a bootstrap distribution are poor unless some of the replicates are generated from a distribution centred in the tails (Ventura 2002), so we recommend that \(S \ge 100\) additional replicates be generated from each of the two distributions centred at the 0.05 and 0.95 quantiles of the original \(t^*\). We use an exponentially tilted bootstrap as described in Davison & Hinkley (1997, §9.4.1) or Efron & Tibshirani (1993, §23.7) to generate these extra replicates, which are used in the importance sampling estimation at each \(t\) but not in the frequency smoothing. The figures in the paper used \(R = 1000\) and \(S = 500\).

Once this process is completed, we have \(2n_t\) importance sampling estimates of \(\pi(t)\) near \(n_t\) values of \(t\). We smooth the \(2n_t\) estimates using a cubic smoothing spline with 3 degrees of freedom. We account for the changing variance of the estimates of \(\pi(t)\) by using the reciprocal of the variance of the importance sampling weights as a weight for the smoothing spline fit.

If the quantity of interest is truly pivotal, then all the smoothed curves in the plots should be roughly horizontal. To help assess this, we add horizontal lines showing the limits of an approximate 95% confidence interval for \(\pi(t)\) at the original value \(t\) of \(T\).

A.3. Importance weighted resampling. Suppose that the resampling model which generates full-data bootstrap samples \(y^* = (y^*_1, \ldots, y^*_n)\) is completely defined by the full-data estimate \(\hat\psi\). For example, in the one-sample nonparametric case \(\hat\psi\) would be the empirical distribution function \(\hat F\), whereas in the one-sample parametric case \(\hat\psi\) could be an estimate of the parameter \(\theta\) which identifies a model cumulative distribution function \(F(y \mid \theta)\) for each \(Y_j\). Then for any statistic \(s(Y)\) based on data vector \(Y\), its bootstrap mean under the resampling model defined by the case-deletion estimate \(\hat\psi_{-j}\) can be expressed as

\[
E^*\{s(Y^*) \mid \hat\psi_{-j}\}
= E^*\Bigl\{ s(Y^*)\,\frac{f(Y^* \mid \hat\psi_{-j})}{f(Y^* \mid \hat\psi)} \Bigm| \hat\psi \Bigr\}
= E^*\{s(Y^*)\,w_j(Y^*) \mid \hat\psi\}, \tag{10}
\]

say, provided the support of \(Y^*\) when \(\psi = \hat\psi_{-j}\) is included in the support when \(\psi = \hat\psi\). This is the basic importance reweighting result that allows us to use as estimates of the case-deletion expectations the full-data bootstrap output estimates

\[ \frac{1}{R}\sum_{r=1}^{R} s(y^*_r)\, w_j(y^*_r) \]

or, better,

\[ \frac{\sum_{r=1}^{R} s(y^*_r)\, w_j(y^*_r)}{\sum_{r=1}^{R} w_j(y^*_r)}. \tag{11} \]

In the nonparametric case, \(w_j(y^*_r)\) is either zero (if \(f^*_{rj} \ne 0\)) or \((1 - n^{-1})^{-n}\) (if \(f^*_{rj} = 0\)). Then \(R\) full-data bootstrap samples would provide the estimate

\[ R_{-j}^{-1} \sum_{r:\, f^*_{rj} = 0} s(y^*_r) \]

for the mean of \(s(Y^*)\) under the case-deletion empirical distribution function, where \(R_{-j}\) is the number of resamples in which case \(j\) is absent, so that \(f^*_{rj} = 0\). Note that \(E^*(R_{-j}) = R(1 - n^{-1})^n\), which is roughly \(0.368R\).

In parametric cases, with \(\psi\) a finite-dimensional parameter, the support condition on fitted models with estimates \(\hat\psi_{-j}\) and \(\hat\psi\) is usually satisfied, and often a simple recurrence relation between \(\hat\psi_{-j}\) and \(\hat\psi\) leads easily from (10). But in semiparametric cases, such as model-based resampling in the example of Section 2, the support condition will often not be satisfied, which precludes the use of equation (11).
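In the nonparametric case the ratio estimate reduces to averaging over the resamples that omit case \(j\), since the weight is constant on that set. A minimal sketch (names are ours, for illustration only):

```python
import numpy as np

def case_deletion_mean(s_star, freqs, j):
    """Estimate the bootstrap mean of s(Y*) under the case-deletion
    empirical distribution function, by importance reweighting of
    full-data bootstrap output.

    s_star : (R,) values s(y*_r) from the full-data bootstrap
    freqs  : (R, n) resampling frequencies f*_{rj}
    j      : index of the case to delete
    """
    s_star = np.asarray(s_star, float)
    freqs = np.asarray(freqs)
    R, n = freqs.shape
    absent = freqs[:, j] == 0                  # resamples with f*_{rj} = 0
    if not absent.any():
        raise ValueError(f"no resamples omit case {j}")
    # nonparametric weights: (1 - 1/n)^{-n} where case j is absent, else 0
    w = np.where(absent, (1.0 - 1.0 / n) ** (-n), 0.0)
    # ratio form of the estimate: equals the mean of s over the
    # R_{-j} resamples omitting case j
    return (s_star * w).sum() / w.sum()
```

Since roughly \(0.368R\) resamples omit any given case, each case-deletion mean is effectively based on about a third of the original bootstrap output.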


If we want to estimate the case-deletion bootstrap \(\alpha\) quantile of \(T^* - t\), say, using importance reweighting, then in effect we need to use equation (11) with

\[ s(y^*) = I\{ t(y^*) - t \le u \} \]

to estimate \(P^*\{t(Y^*) - t \le u\}\) for all \(u\) and then find the \(u\) for which the estimate equals \(\alpha\). For a direct approximation of the \(\alpha\) quantile \(S_\alpha\), we reindex the full-data bootstrap samples \(y^*_r\) so that \(t^*_1 \le \cdots \le t^*_R\), then set \(S_\alpha = t^*_{r(\alpha)} - t\), where \(r(\alpha)\) is determined by the cumulative importance sampling weights.
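A common rule takes \(r(\alpha)\) as the first ordered index at which the normalised cumulative importance weights reach \(\alpha\); the paper's exact defining condition may differ in detail. The following sketch (illustrative names) implements that choice:

```python
import numpy as np

def weighted_quantile_offset(t_star, w, t, alpha):
    """Direct estimate of the alpha quantile of T* - t under a
    reweighted bootstrap distribution: order the replicates,
    accumulate their importance weights, and take the first order
    statistic whose normalised cumulative weight reaches alpha.

    t_star : (R,) bootstrap replicates
    w      : (R,) importance weights for the target distribution
    t      : original estimate
    alpha  : desired quantile level in (0, 1)
    """
    t_star, w = np.asarray(t_star, float), np.asarray(w, float)
    order = np.argsort(t_star)
    cum = np.cumsum(w[order]) / w.sum()
    r_alpha = np.searchsorted(cum, alpha)       # first index with cum >= alpha
    r_alpha = min(r_alpha, len(t_star) - 1)     # guard against rounding
    return t_star[order][r_alpha] - t
```

With equal weights this reduces to reading off an ordinary order statistic of the \(t^*\); the importance weights shift which order statistic is selected.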

ACKNOWLEDGEMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada, the Swiss National Science Foundation, and the UK Engineering and Physical Sciences Research Council. We thank Richard Lockhart, Alastair Young, and various anonymous referees, whose constructive comments greatly improved the paper.

REFERENCES

S. T. Appleyard, J. A. Witkowski, B. D. Ripley, D. M. Shotton & V. Dubowicz (1985). A novel procedure for pattern analysis of features present on freeze fractured plasma membranes. Journal of Cell Science, 74, 105-117.

R. J. Beran (1995). Stein confidence sets and the bootstrap. Statistica Sinica, 5, 109-127.

R. J. Beran (1997). Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics, 49, 1-24.

P. Bertail, D. N. Politis & J. P. Romano (1999). On subsampling estimators with unknown rate of convergence. Journal of the American Statistical Association, 94, 569-579.

B. M. Brown, P. G. Hall & G. A. Young (2001). The smoothed median and the bootstrap. Biometrika, 88, 519-534.

P. Bühlmann (2002). Bootstraps for time series. Statistical Science, 17, 52-72.

S. G. Coles (2001). An Introduction to the Statistical Modeling of Extreme Values. Springer, New York.

A. C. Davison & D. V. Hinkley (1997). Bootstrap Methods and Their Application. Cambridge University Press.

A. C. Davison, D. V. Hinkley & B. J. Worton (1995). Accurate and efficient construction of bootstrap likelihoods. Statistics and Computing, 5, 257-264.

A. C. Davison & R. L. Smith (1990). Models for exceedances over high thresholds (with discussion). Journal of the Royal Statistical Society Series B, 52, 393-442.

D. De Angelis & W. R. Gilks (1994). Estimating acquired immune deficiency syndrome incidence accounting for reporting delay. Journal of the Royal Statistical Society Series A, 157, 31-40.

D. De Angelis, P. Hall & G. A. Young (1993). Analytical and bootstrap approximations to estimator distributions in L1 regression. Journal of the American Statistical Association, 88, 1310-1316.

B. Efron (1992). Jackknife-after-bootstrap standard errors and influence functions (with discussion). Journal of the Royal Statistical Society Series B, 54, 83-127.

B. Efron & R. J. Tibshirani (1993). An Introduction to the Bootstrap. Chapman & Hall, London.

P. Embrechts, C. Klüppelberg & T. Mikosch (1997). Modelling Extremal Events for Insurance and Finance. Springer, New York.

P. Hall (1985). Resampling a coverage pattern. Stochastic Processes and their Applications, 20, 231-246.

P. Hall (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.

P. Hall, T. J. DiCiccio & J. P. Romano (1989). On smoothing and the bootstrap. The Annals of Statistics, 17, 692-704.


T. C. Hesterberg (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37, 185-194.

T. C. Hesterberg (2001). Bootstrap tilting diagnostics. Proceedings of the Joint Statistical Meetings, ASA Statistical Computing Section, American Statistical Association, Alexandria, Virginia, CD-ROM: 4 pp. www.statsci.com/Hesterberg/articles/JSM01-diagnostics.pdf

D. V. Hinkley & E. Schechtman (1987). Conditional bootstrap methods in the mean-shift model. Biometrika, 74, 85-93.

S. N. Lahiri (1999). Theoretical comparisons of block bootstrap methods. The Annals of Statistics, 27, 386-404.

S. N. Lahiri, M. S. Kaiser, N. A. C. Cressie & N.-J. Hsu (1999). Prediction of spatial cumulative distribution functions using subsampling (with discussion). Journal of the American Statistical Association, 94, 86-110.

S. M. S. Lee & G. A. Young (1995). Asymptotic iterated bootstrap confidence intervals. The Annals of Statistics, 23, 1301-1330.

C. Léger & B. MacGibbon (2006). On the bootstrap in cube root asymptotics. The Canadian Journal of Statistics, 34, in press.

D. N. Politis, J. P. Romano & M. Wolf (1999). Subsampling. Springer, New York.

H. Putter & W. R. van Zwet (1996). Resampling: Consistency of substitution estimators. The Annals of Statistics, 24, 2297-2318.

B. D. Ripley (1977). Modelling spatial patterns (with discussion). Journal of the Royal Statistical Society Series B, 39, 172-212.

B. D. Ripley (1981). Spatial Statistics. Wiley, New York.

R. J. Samworth (2003). A note on methods of restoring consistency to the bootstrap. Biometrika, 90, 985-990.

J. Shao & D. Tu (1995). The Jackknife and Bootstrap. Springer, New York.

K. Singh (1998). Breakdown theory for bootstrap quantiles. The Annals of Statistics, 26, 1719-1732.

C. M. Stein (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9, 1135-1151.

R. J. Tibshirani (1988). Variance stabilization and the bootstrap. Biometrika, 75, 433-444.

W. N. Venables & B. D. Ripley (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York.

V. Ventura (2002). Non-parametric bootstrap recycling. Statistics and Computing, 12, 261-273.

A. T. A. Wood (2000). Bootstrap relative errors and sub-exponential distributions. Bernoulli, 6, 809-834.

Received 17 September 2003; accepted 22 May 2005.

Angelo J. CANTY: canty@math.mcmaster.ca
Department of Mathematics and Statistics
McMaster University, 1280 Main Street West
Hamilton, Ontario, Canada L8S 4K1

Anthony C. DAVISON: Anthony.Davison@epfl.ch
Institut de mathématiques, École Polytechnique Fédérale de Lausanne
EPFL-FSB-IMA-STAT, Station 8
CH-1015 Lausanne, Switzerland

David V. HINKLEY: hinkley@pstat.ucsb.edu
Department of Statistics & Applied Probability
University of California, Santa Barbara, CA 93106-3110, USA

Valérie VENTURA: vventura@stat.cmu.edu
Department of Statistics, Carnegie Mellon University
Pittsburgh, PA 15213, USA