Bootstrap methods and their application


Cambridge Series on Statistical and Probabilistic Mathematics

Editorial Board:
R. Gill (Utrecht)
B. D. Ripley (Oxford)
S. Ross (Berkeley)
M. Stein (Chicago)
D. Williams (Bath)

This series of high-quality upper-division textbooks and expository monographs covers all areas of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, mathematical programming, and optimization. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books contain important applications and discussions of new techniques made possible by advances in computational methods.


Bootstrap methods and their application

A. C. Davison
Professor of Statistics, Department of Mathematics, Swiss Federal Institute of Technology, Lausanne

D. V. Hinkley
Professor of Statistics, Department of Statistics and Applied Probability, University of California, Santa Barbara

CAMBRIDGE UNIVERSITY PRESS

PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, United Kingdom
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

© Cambridge University Press 1997

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1997

Printed in the United States of America

Typeset in TeX Monotype Times

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication data

Davison, A. C. (Anthony Christopher)
Bootstrap methods and their application / A. C. Davison, D. V. Hinkley.
p. cm.
Includes bibliographical references and index.
ISBN 0 521 57391 2 (hb). ISBN 0 521 57471 4 (pb)
1. Bootstrap (Statistics) I. Hinkley, D. V. II. Title.
QA276.8.D38 1997
519.5'44-dc21 96-30064 CIP

ISBN 0 521 57391 2 hardback
ISBN 0 521 57471 4 paperback

Contents

Preface

1 Introduction

2 The Basic Bootstraps
2.1 Introduction
2.2 Parametric Simulation
2.3 Nonparametric Simulation
2.4 Simple Confidence Intervals
2.5 Reducing Error
2.6 Statistical Issues
2.7 Nonparametric Approximations for Variance and Bias
2.8 Subsampling Methods
2.9 Bibliographic Notes
2.10 Problems
2.11 Practicals

3 Further Ideas
3.1 Introduction
3.2 Several Samples
3.3 Semiparametric Models
3.4 Smooth Estimates of F
3.5 Censoring
3.6 Missing Data
3.7 Finite Population Sampling
3.8 Hierarchical Data
3.9 Bootstrapping the Bootstrap
3.10 Bootstrap Diagnostics
3.11 Choice of Estimator from the Data
3.12 Bibliographic Notes
3.13 Problems
3.14 Practicals

4 Tests
4.1 Introduction
4.2 Resampling for Parametric Tests
4.3 Nonparametric Permutation Tests
4.4 Nonparametric Bootstrap Tests
4.5 Adjusted P-values
4.6 Estimating Properties of Tests
4.7 Bibliographic Notes
4.8 Problems
4.9 Practicals

5 Confidence Intervals
5.1 Introduction
5.2 Basic Confidence Limit Methods
5.3 Percentile Methods
5.4 Theoretical Comparison of Methods
5.5 Inversion of Significance Tests
5.6 Double Bootstrap Methods
5.7 Empirical Comparison of Bootstrap Methods
5.8 Multiparameter Methods
5.9 Conditional Confidence Regions
5.10 Prediction
5.11 Bibliographic Notes
5.12 Problems
5.13 Practicals

6 Linear Regression
6.1 Introduction
6.2 Least Squares Linear Regression
6.3 Multiple Linear Regression
6.4 Aggregate Prediction Error and Variable Selection
6.5 Robust Regression
6.6 Bibliographic Notes
6.7 Problems
6.8 Practicals

7 Further Topics in Regression
7.1 Introduction
7.2 Generalized Linear Models
7.3 Survival Data
7.4 Other Nonlinear Models
7.5 Misclassification Error
7.6 Nonparametric Regression
7.7 Bibliographic Notes
7.8 Problems
7.9 Practicals

8 Complex Dependence
8.1 Introduction
8.2 Time Series
8.3 Point Processes
8.4 Bibliographic Notes
8.5 Problems
8.6 Practicals

9 Improved Calculation
9.1 Introduction
9.2 Balanced Bootstraps
9.3 Control Methods
9.4 Importance Resampling
9.5 Saddlepoint Approximation
9.6 Bibliographic Notes
9.7 Problems
9.8 Practicals

10 Semiparametric Likelihood Inference
10.1 Likelihood
10.2 Multinomial-Based Likelihoods
10.3 Bootstrap Likelihood
10.4 Likelihood Based on Confidence Sets
10.5 Bayesian Bootstraps
10.6 Bibliographic Notes
10.7 Problems
10.8 Practicals

11 Computer Implementation
11.1 Introduction
11.2 Basic Bootstraps
11.3 Further Ideas
11.4 Tests
11.5 Confidence Intervals
11.6 Linear Regression
11.7 Further Topics in Regression
11.8 Time Series
11.9 Improved Simulation
11.10 Semiparametric Likelihoods

Appendix A. Cumulant Calculations

Bibliography
Name Index
Example Index
Subject Index


Preface

The publication in 1979 of Bradley Efron’s first article on bootstrap methods was a major event in Statistics, at once synthesizing some of the earlier resampling ideas and establishing a new framework for simulation-based statistical analysis. The idea of replacing complicated and often inaccurate approximations to biases, variances, and other measures of uncertainty by computer simulations caught the imagination of both theoretical researchers and users of statistical methods. Theoreticians sharpened their pencils and set about establishing mathematical conditions under which the idea could work. Once they had overcome their initial skepticism, applied workers sat down at their terminals and began to amass empirical evidence that the bootstrap often did work better than traditional methods. The early trickle of papers quickly became a torrent, with new additions to the literature appearing every month, and it was hard to see when would be a good moment to try to chart the waters. Then the organizers of COMPSTAT ’92 invited us to present a course on the topic, and shortly afterwards we began to write this book.

We decided to try to write a balanced account of resampling methods, to include basic aspects of the theory which underpinned the methods, and to show as many applications as we could in order to illustrate the full potential of the methods — warts and all. We quickly realized that in order for us and others to understand and use the bootstrap, we would need suitable software, and producing it led us further towards a practically oriented treatment. Our view was cemented by two further developments: the appearance of two excellent books, one by Peter Hall on the asymptotic theory and the other on basic methods by Bradley Efron and Robert Tibshirani; and the chance to give further courses that included practicals. Our experience has been that hands-on computing is essential in coming to grips with resampling ideas, so we have included practicals in this book, as well as more theoretical problems.

As the book expanded, we realized that a fully comprehensive treatment was beyond us, and that certain topics could be given only a cursory treatment because too little is known about them. So it is that the reader will find only brief accounts of bootstrap methods for hierarchical data, missing data problems, model selection, robust estimation, nonparametric regression, and complex data. But we do try to point the more ambitious reader in the right direction.

No project of this size is produced in a vacuum. The majority of work on the book was completed while we were at the University of Oxford, and we are very grateful to colleagues and students there, who have helped shape our work in various ways. The experience of trying to teach these methods in Oxford and elsewhere — at the Université de Toulouse I, Université de Neuchâtel, Università degli Studi di Padova, Queensland University of Technology, Universidade de São Paulo, and University of Umeå — has been vital, and we are grateful to participants in these courses for prompting us to think more deeply about the material. Readers will be grateful to these people also, for unwittingly debugging some of the problems and practicals. We are also grateful to the organizers of COMPSTAT ’92 and CLAPEM V for inviting us to give short courses on our work.

While writing this book we have asked many people for access to data, copies of their programs, papers or reprints; some have then been rewarded by our bombarding them with questions, to which the answers have invariably been courteous and informative. We cannot name all those who have helped in this way, but D. R. Brillinger, P. Hall, M. P. Jones, B. D. Ripley, H. O’R. Sternberg and G. A. Young have been especially generous. S. Hutchinson and B. D. Ripley have helped considerably with computing matters.

We are grateful to the mostly anonymous reviewers who commented on an early draft of the book, and to R. Gatto and G. A. Young, who later read various parts in detail. At Cambridge University Press, A. Woollatt and D. Tranah have helped greatly in producing the final version, and their patience has been commendable.

We are particularly indebted to two people. V. Ventura read large portions of the book, and helped with various aspects of the computation. A. J. Canty has turned our version of the bootstrap library functions into reliable working code, checked the book for mistakes, and has made numerous suggestions that have improved it enormously. Both of them have contributed greatly — though of course we take responsibility for any errors that remain in the book. We hope that readers will tell us about them, and we will do our best to correct any future versions of the book; see its WWW page, at URL

http://dmawww.epfl.ch/davison.mosaic/BMA/

The book could not have been completed without grants from the UK Engineering and Physical Sciences Research Council, which in addition to providing funding for equipment and research assistantships, supported the work of A. C. Davison through the award of an Advanced Research Fellowship. We also acknowledge support from the US National Science Foundation.

We must also mention the Friday evening sustenance provided at the Eagle and Child, the Lamb and Flag, and the Royal Oak. The projects of many authors have flourished in these amiable establishments.

Finally, we thank our families, friends and colleagues for their patience while this project absorbed our time and energy. Particular thanks are due to Claire Cullen Davison for keeping the Davison family going during the writing of this book.

A. C. Davison and D. V. Hinkley
Lausanne and Santa Barbara

May 1997


1 Introduction

The explicit recognition of uncertainty is central to the statistical sciences. Notions such as prior information, probability models, likelihood, standard errors and confidence limits are all intended to formalize uncertainty and thereby make allowance for it. In simple situations, the uncertainty of an estimate may be gauged by analytical calculation based on an assumed probability model for the available data. But in more complicated problems this approach can be tedious and difficult, and its results are potentially misleading if inappropriate assumptions or simplifications have been made.

For illustration, consider Table 1.1, which is taken from a larger tabulation (Table 7.4) of the numbers of AIDS reports in England and Wales from mid-1983 to the end of 1992. Reports are cross-classified by diagnosis period and length of reporting delay, in three-month intervals. A blank in the table corresponds to an unknown (as yet unreported) entry. The problem was to predict the states of the epidemic in 1991 and 1992, which depend heavily on the values missing at the bottom right of the table.

The data support the assumption that the reporting delay does not depend on the diagnosis period. In this case a simple model is that the number of reports in row j and column k of the table has a Poisson distribution with mean $\mu_{jk} = \exp(\alpha_j + \beta_k)$. If all the cells of the table are regarded as independent, then the total number of unreported diagnoses in period j has a Poisson distribution with mean

$$\sum_k \mu_{jk} = \exp(\alpha_j) \sum_k \exp(\beta_k),$$

where the sum is over columns with blanks in row j. The eventual total of as yet unreported diagnoses from period j can be estimated by replacing $\alpha_j$ and $\beta_k$ by estimates derived from the incomplete table, and thence we obtain the predicted total for period j. Such predictions are shown by the solid line in Figure 1.1, together with the observed total reports to the end of 1992. How good are these predictions?

Table 1.1 Numbers of AIDS reports in England and Wales to the end of 1992 (De Angelis and Gilks, 1994), extracted from Table 7.4. A † indicates a reporting delay less than one month; blank cells are as yet unreported; delay intervals between 6 and ≥14 quarters are elided (...).

Diagnosis period        Reporting-delay interval (quarters)           Total reports
Year   Quarter    0†    1    2    3    4    5    6   ...  ≥14         to end of 1992

1988   1          31   80   16    9    3    2    8   ...    6              174
       2          26   99   27    9    8   11    3   ...    3              211
       3          31   95   35   13   18    4    6   ...    3              224
       4          36   77   20   26   11    3    8   ...    2              205

1989   1          32   92   32   10   12   19   12   ...    2              224
       2          15   92   14   27   22   21   12   ...    1              219
       3          34  104   29   31   18    8    6   ...                   253
       4          38  101   34   18    9   15    6   ...                   233

1990   1          31  124   47   24   11   15    8   ...                   281
       2          32  132   36   10    9    7    6   ...                   245
       3          49  107   51   17   15    8    9   ...                   260
       4          44  153   41   16   11    6    5   ...                   285

1991   1          41  137   29   33    7   11    6                         271
       2          56  124   39   14   12    7   10                         263
       3          53  175   35   17   13   11                             306
       4          63  135   24   23   12                                  258

1992   1          71  161   48   25                                       310
       2          95  178   39                                            318
       3          76  181                                                 273
       4          67                                                      133
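The fit just described is a standard log-linear model, so it can be sketched in a few lines of R (a dialect of the S language used for this book's practicals). This is not the authors' code, and the data frame `aids`, with columns `period`, `delay` and `count` (NA for blank cells), is a hypothetical layout of Table 1.1:

```r
## Two-way Poisson model mu_jk = exp(alpha_j + beta_k) for the AIDS table
aids$period <- factor(aids$period)      # diagnosis period j
aids$delay  <- factor(aids$delay)       # reporting-delay interval k
obs   <- subset(aids, !is.na(count))    # reported cells
blank <- subset(aids, is.na(count))     # as yet unreported cells

fit <- glm(count ~ period + delay, family = poisson, data = obs)

## Predicted means for the blank cells, summed by diagnosis period
blank$mu <- predict(fit, newdata = blank, type = "response")
unrep <- tapply(blank$mu, blank$period, sum)
unrep <- unrep[!is.na(unrep)]           # keep only periods that have blanks

## Estimated eventual totals: reports to date plus predicted unreported
seen <- tapply(obs$count, obs$period, sum)
total <- seen
total[names(unrep)] <- total[names(unrep)] + unrep
```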

It would be tedious but possible to put pen to paper and estimate the prediction uncertainty through calculations based on the Poisson model. But in fact the data are much more variable than that model would suggest, and by failing to take this into account we would believe that the predictions are more accurate than they really are. Furthermore, a better approach would be to use a semiparametric model to smooth out the evident variability of the increase in diagnoses from quarter to quarter; the corresponding prediction is the dotted line in Figure 1.1. Analytical calculations for this model would be very unpleasant, and a more flexible line of attack is needed. While more than one approach is possible, the one that we shall develop based on computer simulation is both flexible and straightforward.

[Figure 1.1: Predicted quarterly diagnoses from a parametric model (solid) and a semiparametric model (dots) fitted to the AIDS data, together with the actual totals to the end of 1992 (+).]

Purpose of the Book

Our central goal is to describe how the computer can be harnessed to obtain reliable standard errors, confidence intervals, and other measures of uncertainty for a wide range of problems. The key idea is to resample from the original data — either directly or via a fitted model — to create replicate datasets, from which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation.


Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods. Another name for them is bootstrap methods, because to use the data to generate more data seems analogous to a trick used by the fictional Baron Munchausen, who when he found himself at the bottom of a lake got out by pulling himself up by his bootstraps. In the simplest nonparametric problems we do literally sample from the data, and a common initial reaction is that this is a fraud. In fact it is not. It turns out that a wide range of statistical problems can be tackled this way, liberating the investigator from the need to oversimplify complex problems. The approach can also be applied in simple problems, to check the adequacy of standard measures of uncertainty, to relax assumptions, and to give quick approximate solutions. An example of this is random sampling to estimate the permutation distribution of a nonparametric test statistic.

It is of course true that in many applications we can be fairly confident in a particular parametric model and the standard analysis based on that model. Even so, it can still be helpful to see what can be inferred without particular parametric model assumptions. This is in the spirit of robustness of validity of the statistical analysis performed. Nonparametric bootstrap analysis allows us to do this.


Despite its scope and usefulness, resampling must be carefully applied. Unless certain basic ideas are understood, it is all too easy to produce a solution to the wrong problem, or a bad solution to the right one. Bootstrap methods are intended to help avoid tedious calculations based on questionable assumptions, and this they do. But they cannot replace clear critical thought about the problem, appropriate design of the investigation and data analysis, and incisive presentation of conclusions.

In this book we describe how resampling methods can be used, and evaluate their performance, in a wide range of contexts. Our focus is on the methods and their practical application rather than on the underlying theory, accounts of which are available elsewhere. This book is intended to be useful to the many investigators who want to know how and when the methods can safely be applied, and how to tell when things have gone wrong. The mathematical level of the book reflects this: we have aimed for a clear account of the key ideas without an overload of technical detail.

Examples

Bootstrap methods can be applied both when there is a well-defined probability model for data and when there is not. In our initial development of the methods we shall make frequent use of two simple examples, one of each type, to illustrate the main points.

Example 1.1 (Air-conditioning data) Table 1.2 gives n = 12 times between failures of air-conditioning equipment, for which we wish to estimate the underlying mean or its reciprocal, the failure rate. A simple model for this problem is that the times are sampled from an exponential distribution.

Table 1.2 Service hours between failures of the air-conditioning equipment in a Boeing 720 jet aircraft (Proschan, 1963).

3  5  7  18  43  85  91  98  100  130  230  487

The dotted line in the left panel of Figure 1.2 is the cumulative distribution function (CDF)

$$F(y) = \begin{cases} 0, & y \le 0, \\ 1 - \exp(-y/\mu), & y > 0, \end{cases}$$

for the fitted exponential distribution with mean μ set equal to the sample average, $\bar y = 108.083$. The solid line on the same plot is the nonparametric equivalent, the empirical distribution function (EDF) for the data, which places equal probabilities $n^{-1} = 0.083$ at each sample value. Comparison of the two curves suggests that the exponential model fits reasonably well. An alternative view of this is shown in the right panel of the figure, which is an exponential Q-Q plot — a plot of the ordered data values $y_{(j)}$ against the standard exponential quantiles $-\log\left(1 - \frac{j}{n+1}\right)$, $j = 1, \dots, n$.

[Figure 1.2: Summary displays for the air-conditioning data. The left panel shows the EDF for the data, $\hat F$ (solid), and the CDF of a fitted exponential distribution (dots). The right panel shows a plot of the ordered failure times against exponential quantiles, with the fitted exponential model shown as the dotted line.]

Although these plots suggest reasonable agreement with the exponential model, the sample is rather too small to have much confidence in this. In the data source the more general gamma model with mean μ and index κ is used; its density is

$$f_{\mu,\kappa}(y) = \frac{1}{\Gamma(\kappa)} \left(\frac{\kappa}{\mu}\right)^{\kappa} y^{\kappa-1} \exp(-\kappa y / \mu), \qquad y > 0, \quad \mu, \kappa > 0. \tag{1.1}$$

For our sample the estimated index is $\hat\kappa = 0.71$, which does not differ significantly (P = 0.29) from the value κ = 1 that corresponds to the exponential model. Our reason for mentioning this will become apparent in Chapter 2.
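For readers who want to check such a fit numerically, here is a minimal sketch in R; it is not the authors' code. It fits the gamma model (1.1) by maximum likelihood, using the fact that for fixed κ the MLE of μ is the sample average, and tests κ = 1 with a likelihood ratio statistic; it should roughly reproduce the values quoted above.

```r
## Air-conditioning failure times (Table 1.2)
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)

## Gamma log likelihood (1.1) with mu profiled out: for fixed kappa the
## MLE of mu is mean(y), and mean mu corresponds to rate kappa/mu.
proflik <- function(kappa, y)
  sum(dgamma(y, shape = kappa, rate = kappa / mean(y), log = TRUE))

opt <- optimize(proflik, c(0.01, 20), y = y, maximum = TRUE)
opt$maximum                                  # kappa-hat, roughly 0.71

## Likelihood ratio test of kappa = 1 (the exponential submodel)
lr <- 2 * (opt$objective - proflik(1, y))
pchisq(lr, df = 1, lower.tail = FALSE)       # P-value, roughly 0.3
```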

Basic properties of the estimator $T = \bar Y$ for μ are easy to obtain theoretically under the exponential model. For example, it is easy to show that T is unbiased and has variance $\mu^2/n$. Approximate confidence intervals for μ can be calculated using these properties in conjunction with a normal approximation for the distribution of T, although this does not work very well: we can tell this because $n\bar Y/\mu$ has an exact gamma distribution, which leads to exact confidence limits. Things are more complicated under the more general gamma model, because the index κ is only estimated, and so in a traditional approach we would use approximations — such as a normal approximation for the distribution of T, or a chi-squared approximation for the log likelihood ratio statistic.


The parametric simulation methods of Section 2.2 can be used alongside these approximations, to diagnose problems with them, or to replace them entirely.

Example 1.2 (City population data) Table 1.3 reports n = 49 data pairs, each corresponding to a city in the United States of America, the pair being the 1920 and 1930 populations of the city, which we denote by u and x. The data are plotted in Figure 1.3. Interest here is in the ratio of means, because this would enable us to estimate the total population of the USA in 1930 from the 1920 figure. If the cities form a random sample with (U, X) denoting the pair of population values for a randomly selected city, then the total 1930 population is the product of the total 1920 population and the ratio of expectations $\theta = E(X)/E(U)$. This ratio is the parameter of interest.

In this case there is no obvious parametric model for the joint distribution of (U, X), so it is natural to estimate θ by its empirical analog, $T = \bar X/\bar U$, the ratio of sample averages. We are then concerned with the uncertainty in T. If we had a plausible parametric model — for example, that the pair (U, X) has a bivariate lognormal distribution — then theoretical calculations like those in Example 1.1 would lead to bias and variance estimates for use in a normal approximation, which in turn would provide approximate confidence intervals for θ. Without such a model we must use nonparametric analysis. It is still possible to estimate the bias and variance of T, as we shall see, and this makes normal approximation still feasible, as well as more complex approaches to setting confidence intervals. ■

Table 1.3 Populations in thousands of n = 49 large US cities in 1920 (u) and in 1930 (x) (Cochran, 1977, p. 152).

  u    x      u    x      u    x
138  143     76   80     67   67
 93  104    381  464    120  115
 61   69    387  459    172  183
179  260     78  106     66   86
 48   75     60   57     46   65
 37   63    507  634    121  113
 29   50     50   64     44   58
 23   48     77   89     64   63
 30  111     64   77     56  142
  2   50     40   60     40   64
 38   52    136  139    116  130
 46   53    243  291     87  105
 71   79    256  288     43   61
 25   57     94   85     43   50
298  317     36   46    161  232
 74   93     45   53     36   54
 50   58

[Figure 1.3: Populations of 49 large United States cities (in 1000s) in 1920 and 1930.]
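As a quick numerical aside (not from the book), the ratio estimate can be computed in R from the values in Table 1.3; the vectors below are transcribed column by column from the table.

```r
## 1920 (u) and 1930 (x) populations, in thousands, of 49 US cities (Table 1.3)
u <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2, 38, 46, 71, 25, 298, 74, 50,
       76, 381, 387, 78, 60, 507, 50, 77, 64, 40, 136, 243, 256, 94, 36, 45,
       67, 120, 172, 66, 46, 121, 44, 64, 56, 40, 116, 87, 43, 43, 161, 36)
x <- c(143, 104, 69, 260, 75, 63, 50, 48, 111, 50, 52, 53, 79, 57, 317, 93, 58,
       80, 464, 459, 106, 57, 634, 64, 89, 77, 60, 139, 291, 288, 85, 46, 53,
       67, 115, 183, 86, 65, 113, 58, 63, 142, 64, 130, 105, 61, 50, 232, 54)

theta.hat <- mean(x) / mean(u)   # the statistic t of the text; about 1.24
```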

Example 1.1 is special in that an exact distribution is available for the statistic of interest and can be used to calculate confidence limits, at least under the exponential model. But for parametric models in general this will not be true. In Section 2.2 we shall show how to use parametric simulation to obtain approximate distributions, either by approximating moments for use in normal approximations, or — when these are inaccurate — directly.

In Example 1.2 we make no assumptions about the form of the data distribution. But still, as we shall show in Section 2.3, simulation can be used to obtain properties of T, even to approximate its distribution. Much of Chapter 2 is devoted to this.

Layout of the Book

Chapter 2 describes the properties of resampling methods for use with single samples from parametric and nonparametric models, discusses practical matters such as the numbers of replicate datasets required, and outlines delta methods for variance approximation based on different forms of jackknife. It also contains a basic discussion of confidence intervals and of the ideas that underlie bootstrap methods.


Chapter 3 outlines how the basic ideas are extended to several samples, semiparametric and smooth models, simple cases where data have hierarchical structure or are sampled from a finite population, and to situations where data are incomplete because censored or missing. It goes on to discuss how the simulation output itself may be used to detect problems — so-called bootstrap diagnostics — and how it may be useful to bootstrap the bootstrap.

In Chapter 4 we review the basic principles of significance testing, and then describe Monte Carlo tests, including those using Markov chain simulation, and parametric bootstrap tests. This is followed by discussion of nonparametric permutation tests, and the more general methods of semi- and nonparametric bootstrap tests. A double bootstrap method is detailed for improved approximation of P-values.

Confidence intervals are the subject of Chapter 5. After outlining basic ideas, we describe how to construct simple confidence intervals based on simulations, and then go on to more complex methods, such as the studentized bootstrap, percentile methods, the double bootstrap and test inversion. The main methods are compared empirically in Section 5.7, then there are brief accounts of confidence regions for multivariate parameters, and of prediction intervals.

The three subsequent chapters deal with more complex problems. Chapter 6 describes how the basic resampling methods may be applied in linear regression problems, including tests for coefficients, prediction analysis, and variable selection. Chapter 7 deals with more complex regression situations: generalized linear models, other nonlinear models, semi- and nonparametric regression, survival analysis, and classification error. Chapter 8 details methods appropriate for time series, spatial data, and point processes.

Chapter 9 describes how variance reduction techniques such as balanced simulation, control variates, and importance sampling can be adapted to yield improved simulations, with the aim of reducing the amount of simulation needed for an answer of given accuracy. It also shows how saddlepoint methods can sometimes be used to avoid simulation entirely.

Chapter 10 describes various semiparametric versions of the likelihood function, the ideas underlying which are closely related to resampling methods. It also briefly outlines a Bayesian version of the bootstrap.

Chapters 2-10 contain problems intended to reinforce the reader's understanding of both methods and theory, and in some cases problems develop topics that could not be included in the text. Some of these demand a knowledge of moments and cumulants, basic facts about which are sketched in the Appendix.

The book also contains practicals that apply resampling routines written in the S language to sets of data. The practicals are intended to reinforce the ideas in each chapter, to supplement the more theoretical problems, and to give examples on which readers can base analyses of their own data.

It would be possible to give different sorts of course based on this book. One would be a "theoretical" course based on the problems and another an "applied" course based on the practicals; we prefer to blend the two.

Although a library of routines for use with the statistical package S-Plus is bundled with it, most of the book can be read without reference to particular software packages. Apart from the practicals, the exception to this is Chapter 11, which is a short introduction to the main resampling routines, arranged roughly in the order with which the corresponding ideas appear in earlier chapters. Readers intending to use the bundled routines will find it useful to work through the relevant sections of Chapter 11 before attempting the practicals.

Notation

Although we believe that our notation is largely standard, there are not enough letters in the English and Greek alphabets for us to be entirely consistent. Greek letters such as θ, β and ν generally denote parameters or other unknowns, while α is used for error rates in connection with significance tests and confidence sets. English letters X, Y, Z, and so forth are used for random variables, which take values x, y, z. Thus the estimator T has observed value t, which may be an estimate of the unknown parameter θ. The letter V is used for a variance estimate, and the letter p for a probability, except for regression models, where p is the number of covariates. Script letters such as $\mathcal{S}$ are used to denote sets.

Probability, expectation, variance and covariance are denoted Pr(·), E(·), var(·) and cov(·, ·), while the joint cumulant of $Y_1$, $Y_1 Y_2$ and $Y_3$ is denoted cum($Y_1$, $Y_1 Y_2$, $Y_3$). We use $I\{A\}$ to denote the indicator random variable, which takes values one if the event A is true and zero otherwise. A related function is the Heaviside function

$$H(u) = \begin{cases} 0, & u < 0, \\ 1, & u \ge 0. \end{cases}$$

We use $\#\{A\}$ to denote the number of elements in the set A, and $\#\{A_r\}$ for the number of events $A_r$ that occur in a sequence $A_1, A_2, \dots$. We use ≐ to mean "is approximately equal to", usually corresponding to asymptotic equivalence as sample sizes tend to infinity, ~ to mean "is distributed as" or "is distributed according to", $\dot\sim$ to mean "is distributed approximately as", $\overset{\text{iid}}{\sim}$ to mean "is a sample of independent identically distributed random variables from", while ≡ has its usual meaning of "is equivalent to".

The data values in a sample of size n are typically denoted by $y_1, \dots, y_n$, the observed values of the random variables $Y_1, \dots, Y_n$; their average is $\bar y = n^{-1} \sum y_j$.

We mostly reserve Z for random variables that are standard normal, at least approximately, and use Q for random variables with other (approximately) known distributions. As usual $N(\mu, \sigma^2)$ represents the normal distribution with mean μ and variance $\sigma^2$, while $z_\alpha$ is often the α quantile of the standard normal distribution, whose cumulative distribution function is Φ(·).

The letter R is reserved for the number of replicate simulations. Simulated copies of a statistic T are denoted $T_r^*$, $r = 1, \dots, R$, whose ordered values are $T_{(1)}^* \le \cdots \le T_{(R)}^*$. Expectation, variance and probability calculated with respect to the simulation distribution are written Pr*(·), E*(·) and var*(·).

Where possible we avoid boldface type, and rely on the context to make it plain when we are dealing with vectors or matrices; $a^T$ denotes the matrix transpose of a vector or matrix a.

We use PDF, CDF, and EDF as shorthand for "probability density function", "cumulative distribution function", and "empirical distribution function". The letters F and G are used for CDFs, and f and g are generally used for the corresponding PDFs. An exception to this is that $f_j^*$ denotes the frequency with which $y_j$ appears in the rth resample.

We use MLE as shorthand for "maximum likelihood estimate" or sometimes "maximum likelihood estimation".

The end of each example is marked ■, and the end of each algorithm is marked •.

2 The Basic Bootstraps

2.1 Introduction

In this chapter we discuss techniques which are applicable to a single, homogeneous sample of data, denoted by $y_1, \dots, y_n$. The sample values are thought of as the outcomes of independent and identically distributed random variables $Y_1, \dots, Y_n$ whose probability density function (PDF) and cumulative distribution function (CDF) we shall denote by f and F, respectively. The sample is to be used to make inferences about a population characteristic, generically denoted by θ, using a statistic T whose value in the sample is t. We assume for the moment that the choice of T has been made and that it is an estimate for θ, which we take to be a scalar.

Our attention is focused on questions concerning the probability distribution of T. For example, what are its bias, its standard error, or its quantiles? What are likely values under a certain null hypothesis of interest? How do we calculate confidence limits for θ using T?

There are two situations to distinguish, the parametric and the nonparametric. When there is a particular mathematical model, with adjustable constants or parameters ψ that fully determine f, such a model is called parametric and statistical methods based on this model are parametric methods. In this case the parameter of interest θ is a component of or a function of ψ. When no such mathematical model is used, the statistical analysis is nonparametric, and uses only the fact that the random variables $Y_j$ are independent and identically distributed. Even if there is a plausible parametric model, a nonparametric analysis can still be useful to assess the robustness of conclusions drawn from a parametric analysis.

An important role is played in nonparametric analysis by the empirical distribution, which puts equal probabilities $n^{-1}$ at each sample value $y_j$. The corresponding estimate of F is the empirical distribution function (EDF) $\hat F$, which is defined as the sample proportion

$$\hat F(y) = \frac{\#\{y_j \le y\}}{n},$$

where $\#\{A\}$ means the number of times the event A occurs. More formally,

$$\hat F(y) = \frac{1}{n} \sum_{j=1}^{n} H(y - y_j), \tag{2.1}$$

where $H(u)$ is the unit step function which jumps from 0 to 1 at u = 0. Notice that the values of the EDF are fixed $\left(0, \frac{1}{n}, \frac{2}{n}, \dots, 1\right)$, so the EDF is equivalent to its points of increase, the ordered values $y_{(1)} \le \cdots \le y_{(n)}$ of the data. An example of the EDF was shown in the left panel of Figure 1.2.

When there are repeat values in the sample, as would often occur with discrete data, the EDF assigns probabilities proportional to the sample frequencies at each distinct observed value y. The formal definition (2.1) still applies.

The EDF plays the role of fitted model when no mathematical form is assumed for F, analogous to a parametric CDF with parameters replaced by their estimates.
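As a concrete aside (not from the book), the EDF and the fitted exponential CDF of Example 1.1 can be compared in a few lines of R:

```r
## EDF of the air-conditioning data versus the fitted exponential CDF
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
Fhat <- ecdf(y)                                   # EDF: steps of 1/12 at each value
Fexp <- function(q) pexp(q, rate = 1 / mean(y))   # fitted exponential CDF

## Values of the two functions at the ordered data, as in Figure 1.2 (left)
cbind(y = sort(y), edf = Fhat(sort(y)), fitted = round(Fexp(sort(y)), 3))
```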

2.1.1 Statistical functions

Many simple statistics can be thought of in terms of properties of the EDF. For example, the sample average $\bar y = n^{-1} \sum y_j$ is the mean of the EDF; see Example 2.1 below. More generally, the statistic of interest t will be a symmetric function of $y_1, \dots, y_n$, meaning that t is unaffected by reordering the data. This implies that t depends only on the ordered values $y_{(1)} \le \cdots \le y_{(n)}$, or equivalently on the EDF $\hat F$. Often this can be expressed simply as $t = t(\hat F)$, where $t(\cdot)$ is a statistical function — essentially just a mathematical expression of the algorithm for computing t from $\hat F$. Such a statistical function is of central importance in the nonparametric case because it also defines the parameter of interest θ through the "algorithm" $\theta = t(F)$. This corresponds to the qualitative idea that θ is a characteristic of the population described by F. Simple examples of such functions are the mean and variance of Y, which are respectively defined as

$$t(F) = \int y \, dF(y), \qquad t(F) = \int y^2 \, dF(y) - \left\{ \int y \, dF(y) \right\}^2. \tag{2.2}$$

The same definition of θ applies in parametric problems, although then θ is more usually defined explicitly as one of the model parameters ψ.

The relationship between the estimate t and $\hat F$ can usually be expressed as $t = t(\hat F)$, corresponding to the relation $\theta = t(F)$ between the characteristic of interest and the underlying distribution. The statistical function $t(\cdot)$ defines both the parameter and its estimate, but we shall use $t(\cdot)$ to represent the function, and t to represent the estimate of θ based on the observed data $y_1, \dots, y_n$.

Example 2.1 (Average) The sample average, $\bar y$, estimates the population mean

$$\mu = \int y \, dF(y).$$

To show that $\bar y = t(\hat F)$, we substitute $\hat F$ for F in the defining function at (2.2) to obtain

$$t(\hat F) = \int y \, d\hat F(y) = \frac{1}{n} \sum_{j=1}^{n} y_j = \bar y,$$

because $\int a(y) \, dH(y - x) = a(x)$ for any continuous function $a(\cdot)$. ■

Example 2.2 (City population data) For the problem outlined in Example 1.2, the parameter of interest is the ratio of means $\theta = E(X)/E(U)$. In this case F is the bivariate CDF of $Y = (U, X)$, and the bivariate EDF $\hat F$ puts probability $n^{-1}$ at each of the data pairs $(u_j, x_j)$. The statistical function version of θ simply uses the definition of mean for both numerator and denominator, so that

$$\theta = t(F) = \frac{\int x \, dF(u, x)}{\int u \, dF(u, x)}.$$

The corresponding estimate of θ is

$$t = t(\hat F) = \frac{\int x \, d\hat F(u, x)}{\int u \, d\hat F(u, x)} = \frac{\bar x}{\bar u},$$

with $\bar x = n^{-1} \sum x_j$ and $\bar u = n^{-1} \sum u_j$. ■

It is quite straightforward to show that (2.1) implies convergence of $\hat F$ to F as $n \to \infty$ (Problem 2.1). Then if $t(\cdot)$ is continuous in an appropriate sense, the definition $T = t(\hat F)$ implies that T converges to θ as $n \to \infty$, which is the property of consistency.

Not all estimates are exactly of the form $t(\hat F)$. For example, if $t(F) = \operatorname{var}(Y)$ then the usual unbiased sample variance is $n t(\hat F)/(n-1)$. Also the sample median is not exactly $\hat F^{-1}(\frac{1}{2})$. Such small discrepancies are fairly unimportant as far as applying the bootstrap techniques discussed in this book. In a very formal development we could write $T = t_n(\hat F)$ and require that $t_n \to t$ as $n \to \infty$, possibly even that $t_n - t = O(n^{-1})$. But such formality would be excessive here, and we shall assume in general discussion that $T = t(\hat F)$. (One case that does require special treatment is nonparametric density estimation, which we discuss in Example 5.13.)

[Margin note: A quantity $A_n$ is said to be $O(n^d)$ if $\lim_{n\to\infty} n^{-d} A_n = a$ for some finite a, and $o(n^d)$ if $\lim_{n\to\infty} n^{-d} A_n = 0$.]

The representation $\theta = t(F)$ defines the parameter and its estimator T in a robust way, without any assumption about F, other than that θ exists. This guarantees that T estimates the right thing, no matter what F is. Thus the sample average $\bar y$ is the only statistic that is generally valid as an estimate of the population mean μ: only if Y is symmetrically distributed about μ will statistics such as trimmed averages also estimate μ. This property, which guarantees that the correct characteristic of the underlying distribution is estimated, whatever that distribution is, is sometimes called robustness of specification.

2.1.2 Objectives

Much of statistical theory is devoted to calculating approximate distributions for particular statistics T, on which to base inferences about their estimands θ. Suppose, for example, that we want to calculate a $(1 - 2\alpha)$ confidence interval for θ. It may be possible to show that T is approximately normal with mean $\theta + \beta$ and variance v; here β is the bias of T. If β and v are both known, then we can write

$$\Pr(T \le t \mid F) \doteq \Phi\left(\frac{t - \theta - \beta}{v^{1/2}}\right), \tag{2.3}$$

where Φ(·) is the standard normal integral. If the α quantile of the standard normal distribution is $z_\alpha = \Phi^{-1}(\alpha)$, then an approximate $(1 - 2\alpha)$ confidence interval for θ has limits

$$t - \beta - v^{1/2} z_{1-\alpha}, \qquad t - \beta - v^{1/2} z_\alpha, \tag{2.4}$$

as follows from

$$\Pr(\beta + v^{1/2} z_\alpha \le T - \theta \le \beta + v^{1/2} z_{1-\alpha}) \doteq 1 - 2\alpha.$$

There is a catch, however, which is that in practice the bias β and variance v will not be known. So to use the normal approximation we must replace β and v with estimates. To see how to do this, note that we can express β and v as

$$\beta = b(F) = E(T \mid F) - t(F), \qquad v = v(F) = \operatorname{var}(T \mid F), \tag{2.5}$$

thereby stressing their dependence on the underlying distribution. We use expressions such as $E(T \mid F)$ to mean that the random variables from which T is calculated have distribution F; here a pedantic equivalent would be $E\{t(\hat F) \mid Y_1, \dots, Y_n \sim F\}$. Suppose that F is estimated by $\hat F$, which might be the empirical distribution function, or a fitted parametric distribution. Then estimates of bias and variance are obtained simply by substituting $\hat F$ for F in (2.5), that is

$$B = b(\hat F) = E(T \mid \hat F) - t(\hat F), \qquad V = v(\hat F) = \operatorname{var}(T \mid \hat F). \tag{2.6}$$

These estimates B and V are used in place of β and v in equations such as (2.4).

Example 2.3 (Air-conditioning data) Under the exponential model for the data in Example 1.1, the mean failure time μ is estimated by the average $T = \bar Y$, which has a gamma distribution with mean μ and shape parameter κ = n. Therefore the bias and variance of T are $b(F) = 0$ and $v(F) = \mu^2/n$, and these are estimated by 0 and $\bar y^2/n$. Since n = 12, $\bar y = 108.083$, and $z_{0.025} = -1.96$, a 95% confidence interval for μ based on the normal approximation (2.3) is

$$\bar y \pm 1.96\, n^{-1/2}\, \bar y = (46.93, 169.24). \quad ■$$
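A minimal R check of this calculation, together with the exact interval based on the gamma distribution of $\bar Y$ mentioned in Example 1.1 (this sketch is not from the book):

```r
## Normal-approximation and exact 95% intervals for the mean failure time
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y); ybar <- mean(y)                  # 108.083

ybar + c(-1, 1) * 1.96 * ybar / sqrt(n)          # about (46.9, 169.2)

## Exact: n * Ybar / mu ~ Gamma(n), so invert its quantiles
n * ybar / qgamma(c(0.975, 0.025), shape = n)    # exact interval; asymmetric, unlike (2.4)
```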

Estimates such as those in (2.6) are bootstrap estimates. Here they have been used in conjunction with a normal approximation, which sometimes will be adequate. However, the bootstrap approach of substituting estimates can be applied more ambitiously to improve upon the normal approximation and other first-order theoretical approximations. The elaboration of the bootstrap approach is the purpose of this book.

2.2 Parametric Simulation

In the previous section we pointed out that theoretical properties of T might be hard to determine with sufficient accuracy. We now describe the sound practical alternative of repeated simulation of datasets from a fitted parametric model, and empirical calculation of relevant properties of T.

Suppose that we have a particular parametric model for the distribution of the data $y_1, \dots, y_n$. We shall use $F_\psi(y)$ and $f_\psi(y)$ to denote the CDF and PDF respectively. When ψ is estimated by $\hat\psi$ — often but not invariably its maximum likelihood estimate — its substitution in the model gives the fitted model, with CDF $\hat F(y) = F_{\hat\psi}(y)$, which can be used to calculate properties of T, sometimes exactly. We shall use $Y^*$ to denote the random variable distributed according to the fitted model $\hat F$, and the superscript * will be used with E, var and so forth when these moments are calculated according to the fitted distribution. Occasionally it will also be useful to write $\psi^*$ for $\hat\psi$, to emphasize that this is the parameter value for the simulation model.

Example 2.4 (Air-conditioning data) We have already calculated the mean and variance under the fitted exponential model for the estimator $T = \bar Y$ of Example 1.1. Our sample estimate for the mean μ is $t = \bar y$. So here $Y^*$ is exponential with mean $\bar y$. In the notation just introduced, we have by theoretical calculation with this exponential distribution that

$$E^*(\bar Y^*) = \bar y, \qquad \operatorname{var}^*(\bar Y^*) = \bar y^2 / n.$$

Note that the estimated bias of $\bar Y$ is zero, being the difference between $E^*(\bar Y^*)$ and the value $\mu = \bar y$ for the mean of the fitted distribution. These moments were used to calculate an approximate normal confidence interval in Example 2.3.

If, however, we wished to calculate the bias and variance of $T = \log \bar Y$ under the fitted model, i.e. $E^*(\log \bar Y^*) - \log \bar y$ and $\operatorname{var}^*(\log \bar Y^*)$, exact calculation is more difficult. The delta method of Section 2.7.1 would give approximate values $-(2n)^{-1}$ and $n^{-1}$. But more accurate approximations can be obtained using simulated samples of $Y^*$'s.

Similar results and comments would apply if instead we chose to use the more general gamma model (1.1) for this example. Then $Y^*$ would be a gamma random variable with mean $\bar y$ and index $\hat\kappa$. ■

2.2.1 Moment estimates

So now suppose that theoretical calculation with the fitted model is too complex. Approximations may not be available, or they may be untrustworthy, perhaps because the sample size is small. The alternative is to estimate the properties we require from simulated datasets. We write such a dataset as $Y_1^*, \dots, Y_n^*$, where the $Y_j^*$ are independently sampled from the fitted distribution $\hat F$. When the statistic of interest is calculated from a simulated dataset, we denote it by $T^*$. From R repetitions of the data simulation we obtain $T_1^*, \dots, T_R^*$. Properties of $T - \theta$ are then estimated from $T_1^*, \dots, T_R^*$. For example, the estimator of the bias $b(F) = E(T \mid F) - \theta$ of T is

$$B = b(\hat F) = E(T \mid \hat F) - t = E^*(T^*) - t,$$

and this in turn is estimated by

$$B_R = R^{-1} \sum_{r=1}^{R} T_r^* - t = \bar T^* - t. \tag{2.7}$$

Note that in the simulation t is the parameter value for the model, so that $T^* - t$ is the simulation analogue of $T - \theta$. The corresponding estimator of the variance of T is

$$V_R = \frac{1}{R-1} \sum_{r=1}^{R} (T_r^* - \bar T^*)^2, \tag{2.8}$$

with similar estimators for other moments.

These empirical approximations are justified by the law of large numbers. For example, $B_R$ converges to B, the exact value under the fitted model, as R increases.

[Figure 2.1: Empirical biases and variances of $\bar Y^*$ for the air-conditioning data from four repetitions of parametric simulation. Each line shows how the estimated bias and variance for R = 10 initial simulations change when further simulations are successively added. The variability decreases as the simulation size increases, and the simulated values converge to the exact values under the fitted exponential model, given by the horizontal dotted lines.]

We usually drop the subscript R from $B_R$, $V_R$, and so forth unless we are explicitly discussing the effect of R. How to choose R will be illustrated in the examples that follow, and discussed in Section 2.5.2.

It is important to recognize that we are not estimating absolute properties of T, but rather of T relative to θ. Usually this involves the estimation error $T - \theta$, but we should not ignore the possibility that $T/\theta$ (equivalently $\log T - \log \theta$) or some other relevant measure of estimation error might be more appropriate, depending upon the context. Bootstrap simulation methods will apply to any such measure.

Example 2.5 (Air-conditioning data) Consider Example 1.1 again. As we have seen, simulation is unnecessary in practice for this problem because the moments are easy to calculate theoretically, but the example is useful for illustration. Here the fitted model is an exponential distribution for the failure times, with mean estimated by the sample average $\bar y = 108.083$. All simulated failure times $Y^*$ are generated from this distribution.

Figure 2.1 shows the results from several simulations, four for each of eight values of R, in each of which the empirical biases and variances of $T^* = \bar Y^*$ have been calculated according to (2.7) and (2.8). On both panels the "correct" values, namely zero and $\bar y^2/n = (108.083)^2/12 = 973.5$, are indicated by horizontal dotted lines.
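An experiment of this kind is easy to mimic; the following R sketch (not the authors' code) computes $B_R$ and $V_R$ of (2.7) and (2.8) under the fitted exponential model:

```r
## Parametric simulation of the bias and variance of Ybar
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y); t0 <- mean(y)

R <- 999
tstar <- replicate(R, mean(rexp(n, rate = 1 / t0)))  # R copies of ybar*

B <- mean(tstar) - t0   # bias estimate (2.7); should be near zero
V <- var(tstar)         # variance estimate (2.8); should be near t0^2/n = 973.5
```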

Evidently the larger is R, the closer is the simulation calculation to the right answer. How large a value of R is needed? Figure 2.1 suggests that for some purposes R = 100 or 200 will be adequate, but that R = 10 will not be large enough. In this problem the accuracy of the empirical approximations is quite easy to determine from the fact that $n \bar Y^* / t$ has a gamma distribution with index n. The simulation variances of $B_R$ and $V_R$ are

$$\frac{t^2}{nR}, \qquad \frac{t^4}{n^2} \left( \frac{2}{R-1} + \frac{6}{nR} \right),$$

and we can use these to say how large R should be in order that the simulated values have a specified accuracy. For example, the coefficients of variation of $V_R$ at R = 100 and 1000 are respectively 0.16 and 0.05. However, for a complicated problem where simulation was really necessary, such calculations could not be done, and general rules are needed to suggest how large R should be. These are discussed in Section 2.5.2. ■

2.2.2 Distribution and quantile estimates

The simulation estimates of bias and variance will sometimes be of interest in their own right, but more usually would be used with normal approximations for T, particularly for large samples. For situations like those in Examples 1.1 and 1.2, however, the normal approximation is intrinsically inaccurate. This can be seen from a normal Q-Q plot of the simulated values $t_1^*, \dots, t_R^*$, that is, a plot of the ordered values $t_{(1)}^* \le \cdots \le t_{(R)}^*$ against expected normal order statistics. It is the empirical distribution of these simulated values which can provide a more accurate distributional approximation, as we shall now see.

If as is often the case we are approximating the distribution of $T - \theta$ by that of $T^* - t$, then cumulative probabilities are estimated simply by the empirical distribution function of the simulated values $t_r^* - t$. More formally, if $G(u) = \Pr(T - \theta \le u)$, then the simulation estimate of G(u) is

$$\hat G_R(u) = \frac{\#\{t_r^* - t \le u\}}{R} = \frac{1}{R} \sum_{r=1}^{R} I\{t_r^* - t \le u\},$$

where $I\{A\}$ is the indicator of the event A, equal to 1 if A is true and 0 otherwise. As R increases, this estimate will converge to $\hat G(u)$, the exact CDF of $T^* - t$ under sampling from the fitted model. Just as with the moment approximations discussed earlier, the approximation $\hat G_R$ to G contains two sources of error: that between $\hat G$ and G due to data variability, and that between $\hat G_R$ and $\hat G$ due to finite simulation.

We are often interested in quantiles of the distribution of $T - \theta$, and these are approximated using ordered values of $t^* - t$. The underlying result used here is that if $X_1, \dots, X_N$ are independently distributed with CDF K and if $X_{(j)}$ denotes the jth ordered value, then

$$E(X_{(j)}) \doteq K^{-1}\left(\frac{j}{N+1}\right).$$

This implies that a sensible estimate of $K^{-1}(p)$ is $X_{((N+1)p)}$, assuming that $(N+1)p$ is an integer.

So we estimate the p quantile of $T - \theta$ by the $(R+1)p$th ordered value of $t^* - t$, that is $t_{((R+1)p)}^* - t$. We assume that R is chosen so that $(R+1)p$ is an integer.

The simulation approximation $\hat G_R$ and the corresponding quantiles are in principle better than results obtained by normal approximation, provided that R is large enough, because they avoid the supposition that the distribution of $T^* - t$ has a particular form.

Example 2.6 (Air-conditioning data) The simulation experiments described in Example 2.5 can be used to study the simulation approximations to the distribution and quantiles of $\bar Y - \mu$. First, Figure 2.2 shows normal Q-Q plots of $t^*$ values for R = 99 (top left panel) and R = 999 (top right panel). Clearly a normal approximation would not be accurate in the tails, and this is already fairly clear with R = 99. For reference, the lower half of Figure 2.2 shows corresponding Q-Q plots with exact gamma quantiles.

[Figure 2.2: Normal (upper) and gamma (lower) Q-Q plots of $t^*$ values based on R = 99 (left) and R = 999 (right) simulations from the fitted exponential model for the air-conditioning data.]

The nonnormality of $T^*$ is also reasonably clear on histograms of $t^*$ values, shown in Figure 2.3, at least at the larger value R = 999. Corresponding density estimate plots provide smoother displays of the same information.

[Figure 2.3: Histograms of $t^*$ values based on R = 99 (left) and R = 999 (right) simulations from the fitted exponential model for the air-conditioning data.]

We look next at the estimated quantiles of $\bar Y - \mu$. The p quantile is approximated by $\bar y_{((R+1)p)}^* - \bar y$ for p = 0.05 and 0.95. The values of R are 19, 39, 99, 199, ..., 999, chosen to ensure that $(R+1)p$ is an integer throughout. Thus at R = 19 the 0.05 quantile is approximated by $\bar y_{(1)}^* - \bar y$, and so forth. In order to display the magnitude of simulation error, we ran four independent simulations at R = 19, 39, 99, ..., 999. The results are plotted in Figure 2.4. Also shown by dotted lines are the exact quantiles under the model, which the simulations approach as R increases. There is large variability in the approximate quantiles for R less than 100, and it appears that 500 or more simulations are required to get accurate results.

[Figure 2.4: Empirical quantiles (p = 0.05, 0.95) of $T^* - t$ under resampling from the fitted exponential model for the air-conditioning data. The horizontal dotted lines are the exact quantiles under the model.]

The same simulations can be used in other ways. For example, we might want to know about $\log \bar Y - \log \mu$, in which case the empirical properties of $\log \bar y^* - \log \bar y$ are relevant. ■
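In R, the quantile calculation of Example 2.6 looks like this (a sketch under the same exponential model, not the book's code):

```r
## Simulation quantiles of Ybar* - ybar for p = 0.05 and 0.95
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y); t0 <- mean(y)

R <- 999
tstar <- sort(replicate(R, mean(rexp(n, rate = 1 / t0))))
p <- c(0.05, 0.95)
tstar[(R + 1) * p] - t0          # the (R+1)p th ordered values of t* - t

## Exact quantiles under the fitted model: n * Ybar* / t0 ~ Gamma(n)
t0 * (qgamma(p, shape = n) / n - 1)
```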

The illustration used here is very simple, but essentially the same methods can be used in arbitrarily complicated parametric problems. For example, distributions of likelihood ratio statistics can be approximated when large-sample approximations are inaccurate or fail entirely. In Chapters 4 and 5 respectively we show how parametric bootstrap methods can be used to calculate significance tests and confidence sets.

It is sometimes useful to be able to look at the density of T, for example to see if it is multimodal, skewed, or otherwise differs appreciably from normality. A rough idea of the density $g(u)$ of $U = T - \theta$, say, can be had from a histogram of the values of $t^* - t$. A somewhat better picture is offered by a kernel density estimate, defined by

$$\hat g_h(u) = \frac{1}{Rh} \sum_{r=1}^{R} w\left(\frac{u - (t_r^* - t)}{h}\right),$$

where w is a symmetric PDF with zero mean and h is a positive bandwidth that determines the smoothness of $\hat g_h$. The estimate $\hat g_h$ is non-negative and has unit integral. It is insensitive to the choice of $w(\cdot)$, for which we use the standard normal density. The choice of h is more important. The key is to produce a smooth result, while not flattening out significant modes. If the choice of h is quite large, as it may be if R < 100, then one should rescale the density estimate to make its mean and variance agree with the estimated mean $b_R$ and variance $v_R$ of $T - \theta$; see Problem 3.8.
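In R a kernel estimate with a normal kernel is available through density(), whose bw argument plays the role of h (again a sketch, not the book's code):

```r
## Kernel density estimate of t* - t from a parametric simulation
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y); t0 <- mean(y)

tstar <- replicate(1000, mean(rexp(n, rate = 1 / t0)))
dens <- density(tstar - t0, kernel = "gaussian")   # default bandwidth choice
plot(dens, main = "Kernel estimate of the density of t* - t")
```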

As a general rule, good estimates of density require at least R = 1000: density estimation is usually harder than probability or quantile estimation.

Note that the same methods of estimating density, distribution function and quantiles can be applied to any transformation of T. We shall discuss this further in Section 2.5.

2.3 Nonparametric Simulation

Suppose that we have no parametric model, but that it is sensible to assume that Y_1, ..., Y_n are independent and identically distributed according to an unknown distribution function F. We use the EDF F̂ to estimate the unknown CDF F. We shall use F̂ just as we would a parametric model: theoretical calculation if possible, otherwise simulation of datasets and empirical calculation of required properties. In only very simple cases are exact theoretical calculations possible, but we shall see in Section 9.5 that good theoretical approximations can be obtained in many problems involving sample moments.

Example 2.7 (Average) In the case of the average, exact moments under sampling from the EDF are easily found. For example,

    E^*(\bar Y^*) = E^*(Y^*_j) = \frac{1}{n} \sum_{j=1}^{n} y_j = \bar y,

and similarly

    var^*(\bar Y^*) = \frac{1}{n} var^*(Y^*) = \frac{1}{n} E^*\{ Y^* - E^*(Y^*) \}^2 = \frac{1}{n} \times \frac{1}{n} \sum_{j=1}^{n} (y_j - \bar y)^2 = \frac{n-1}{n} \times \frac{s^2}{n},

where s^2 denotes the unbiased sample variance. Apart from the factor (n − 1)/n, this is the usual result for the estimated variance of Ȳ. ■
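The exact moments in Example 2.7 are easy to confirm by brute force. The sketch below (Python with numpy; the sample itself is a hypothetical one, since any sample will do) compares the closed-form values with a large Monte Carlo resampling run.

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.exponential(scale=100.0, size=12)    # hypothetical sample
    n, ybar = len(y), y.mean()

    exact_mean = ybar                            # E*(Ybar*) = ybar
    exact_var = np.sum((y - ybar) ** 2) / n**2   # var*(Ybar*) = (n-1)/n x s^2/n

    R = 200000
    ystar = rng.choice(y, size=(R, n), replace=True).mean(axis=1)
    print(exact_mean, ystar.mean())              # agree to Monte Carlo error
    print(exact_var, ystar.var())                # agree to Monte Carlo error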

Other simple statistics such as the sample variance and sample median are also easy to handle (Problems 2.3, 2.4).

To apply simulation with the EDF is very straightforward. Because the EDF puts equal probabilities on the original data values y_1, ..., y_n, each Y^* is independently sampled at random from those data values. Therefore the simulated sample Y^*_1, ..., Y^*_n is a random sample taken with replacement from the data. This simplicity is special to the case of a homogeneous sample, but many extensions are straightforward. This resampling procedure is called the nonparametric bootstrap.

Example 2.8 (City population data) Here we look at the ratio estimate for the problem described in Example 1.2. For convenience we consider a subset of the data in Table 1.3, comprising the first ten pairs. This is an application with no obvious parametric model, so nonparametric simulation makes good sense. Table 2.1 shows the data and the first simulated sample, which has been drawn by randomly selecting subscript j^* from the set \{1, \ldots, n\} with equal probability and taking (u^*, x^*) = (u_{j^*}, x_{j^*}). In this sample j^* = 1 never occurs


Table 2.1 The dataset for ratio estimation, and one synthetic sample. The values j^* are chosen randomly with equal probability from \{1, 2, \ldots, 10\} with replacement; the simulated pairs are (u_{j^*}, x_{j^*}).

Table 2.2 Frequencies with which each original data pair appears in each of R = 9 nonparametric bootstrap samples for the data on US cities.

[Table 2.1]
    j      1    2    3    4    5    6    7    8    9   10
    u    138   93   61  179   48   37   29   23   30    2
    x    143  104   69  260   75   63   50   48  111   50

    j*     6    7    2    2    3    3   10    7    2    9
    u*    37   29   93   93   61   61    2   29   93   30
    x*    63   50  104  104   69   69   50   50  104  111

[Table 2.2]
    j      1    2    3    4    5    6    7    8    9   10
    u    138   93   61  179   48   37   29   23   30    2
    x    143  104   69  260   75   63   50   48  111   50

                 Numbers of times each pair sampled        Statistic
    Data         1 1 1 1 1 1 1 1 1 1                       t = 1.520

    Replicate r (zero counts omitted)
    1            3 2 1 2 1 1                               t*_1 = 1.466
    2            1 1 2 2 1 2 1                             t*_2 = 1.761
    3            1 1 1 1 4 2                               t*_3 = 1.951
    4            1 2 1 1 2 2 1                             t*_4 = 1.542
    5            3 1 3 1 1 1                               t*_5 = 1.371
    6            1 1 2 1 1 1 3                             t*_6 = 1.686
    7            1 1 2 2 2 1 1                             t*_7 = 1.378
    8            2 1 3 1 1 1 1                             t*_8 = 1.420
    9            1 1 1 2 1 2 1 1                           t*_9 = 1.660

and j^* = 2 occurs three times, so that the first data pair is never selected, the second is selected three times, and so forth.

Table 2.2 shows the same simulated sample, plus eight more, expressed in terms of the frequencies of original data pairs. The ratio t^* for each simulated sample is recorded in the last column of the table. After the R sets of calculations, the bias and variance estimates are calculated according to (2.7) and (2.8). The results are, for the R = 9 replicates shown,

    b = 1.582 - 1.520 = 0.062,    v = 0.03907.

A simple approximate distribution for T − θ is N(b, v). With the results so far, this is N(0.062, 0.0391), but this is unlikely to be accurate enough and a larger value of R should be used. In a simulation with R = 999 we obtained b = 1.5755 - 1.5203 = 0.0552 and v = 0.0601. The latter is appreciably bigger than the value 0.0325 given by the delta method variance estimate

    v_L = n^{-2} \sum_{j=1}^{n} (x_j - t u_j)^2 / \bar u^2,


Figure 2.5 City population data. Histograms of t^* and z^* under nonparametric resampling for sample of size n = 10, R = 999 simulations. Note the skewness of both t^* and z^*.

which is based on an expansion that is explained in Section 2.7.2; see also Problem 2.9. The discrepancy between v and v_L is due partly to a few extreme values of t^*, an issue we discuss in Section 2.3.2.
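A sketch of these calculations for the n = 10 pairs of Table 2.1 (Python with numpy). The bias and variance estimates b and v depend on the random seed, so they will only be near the values 0.0552 and 0.0601 quoted above, while v_L = 0.0325 is exact.

    import numpy as np

    # The n = 10 pairs of Table 2.1
    u = np.array([138., 93., 61., 179., 48., 37., 29., 23., 30., 2.])
    x = np.array([143., 104., 69., 260., 75., 63., 50., 48., 111., 50.])
    n = len(u)
    t = x.mean() / u.mean()                       # observed ratio, 1.52

    rng = np.random.default_rng(4)
    R = 999
    idx = rng.integers(0, n, size=(R, n))         # resampled subscripts j*
    tstar = x[idx].mean(axis=1) / u[idx].mean(axis=1)

    b = tstar.mean() - t                          # bias estimate
    v = tstar.var(ddof=1)                         # variance estimate
    vL = np.sum((x - t * u) ** 2) / (n**2 * u.mean() ** 2)   # delta method
    print(b, v, vL)                               # vL = 0.0325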

The left panel of Figure 2.5 shows a histogram of t^*, whose skewness is evident: use of a normal approximation here would be very inaccurate.

We can use the same simulations to estimate distributions of related statistics, such as transformed estimates or studentized estimates. The right panel of Figure 2.5 shows a histogram of studentized values z^* = (t^* - t)/v_L^{*1/2}, where v_L^* is the delta method variance estimate based on a simulated sample. That is,

    v_L^* = n^{-2} \sum_{j=1}^{n} (x^*_j - t^* u^*_j)^2 / \bar u^{*2}.

The corresponding theoretical approximation for Z is the N(0,1) distribution, which we would judge also inaccurate in view of the strong skewness in the histogram. We shall discuss the rationale for the use of z^* in Section 2.4.

One natural question to ask here is what effect the small sample size has on the accuracy of normal approximations. This can be answered in part by plotting density estimates. The left panel of Figure 2.6 shows three estimated densities for T^* - t with our sample of n = 10: a kernel density estimate based on our simulations, the N(b, v) approximation with moments computed from the same simulations, and the N(0, v_L) approximation. The right panel shows corresponding density approximations for the full data with n = 49; the empirical bias and variance of T are b = 0.00118 and v = 0.001290, and the


Figure 2.6 Density estimates for T^* - t based on 999 nonparametric simulations for the city population data. The left panel is for the sample of size n = 10 in Table 2.1, and the right panel shows the corresponding estimates for the entire dataset of size n = 49. Each plot shows a kernel density estimate (solid), the N(b, v) approximation (dashes), with these moments computed from the same simulations, and the N(0, v_L) approximation (dots).

delta method variance approximation is v_L = 0.001166. At the larger sample size the normal approximations seem very accurate. ■

2.3.1 Comparison with parametric methods

A natural question to ask is how well the nonparametric resampling methods might compare to parametric methods, when the latter are appropriate. Equally important is the question as to which parametric model would produce results like those for nonparametric resampling: this is another way of asking just what the nonparametric bootstrap does. Some insight into these questions can be gained by revisiting Example 1.1.

Example 2.9 (Air-conditioning data) We now look at the results of applying nonparametric resampling to the air-conditioning data. One might naively expect to obtain results similar to those in Example 2.5, where exponential resampling was used, since we found in Example 1.1 that the data appear compatible with an exponential model.

Figure 2.7 is the nonparametric analogue of Figure 2.4, and shows quantiles of T^* - t. It appears that R = 500 or so is needed to get reliable quantile estimates; R = 100 is enough for the corresponding plot for bias and variance. Under nonparametric resampling there is no reason why the quantiles should approach the theoretical quantiles under the exponential model, and it seems that they do not do so. This suggestion is confirmed by the Q-Q plots in Figure 2.8. The first panel compares the ordered values of t^* from R = 999 nonparametric simulations with theoretical quantiles under the fitted exponential model, and the second panel compares the t^* with theoretical quantiles


Figure 2.7 Empirical quantiles (p = 0.05, 0.95) of T^* - t under nonparametric resampling from the air-conditioning data. The horizontal lines are the exact quantiles based on the fitted exponential model.

Figure 2.8 Q-Q plots of t^* under nonparametric resampling from the air-conditioning data, first against theoretical quantiles under the fitted exponential model (left panel) and then against theoretical quantiles under the fitted gamma model (right panel).

under the best-fitting gamma model with index \hat\kappa = 0.71. The agreement in the second panel is strikingly good. On reflection this is natural, because the EDF is closer to the larger gamma model than to the exponential model. ■

2.3.2 Effects of discreteness

For intrinsically continuous data, a major difference between parametric and nonparametric resampling lies in the discreteness of the latter. Under nonparametric resampling, T^* and related quantities will have discrete distributions, even though they may be approximating continuous distributions. This makes results somewhat "fuzzy" compared to their parametric counterparts.

Example 2.10 (Air-conditioning data) For the nonparametric simulation discussed in the previous example, the right panels of Figure 2.9 show the scatter plots of sample standard deviation versus sample average for R = 99 and R = 999 simulated datasets. Corresponding plots for the exponential simulation are shown in the left panels. The qualitative feature to be read from any one of these plots is that data standard deviation is proportional to data average. The discreteness of the nonparametric model (the EDF) adds noise whose peculiar banded structure is evident at R = 999, although the qualitative structure is still apparent. ■

For a statistic that is symmetric in the data values, there are up to

    m_n = \binom{2n-1}{n-1} = \frac{(2n-1)!}{n!\,(n-1)!}

possible values of t^*, depending upon the smoothness of the statistical function t(·). Even for moderately small samples the support of the distribution of T^* will often be fairly dense: values of m_n for n = 7 and 11 are 1716 and 352716 (Problem 2.5). It would therefore usually be harmless to think of there being a PDF for T^*, and to approximate it, either using simulation results as in Figure 2.6 or theoretically (Section 9.5). There are exceptions, however, most notably when T is a sample quantile. The case of the sample median is discussed in Example 2.16; see also Problem 2.4 and Example 2.15.
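The count m_n is immediate from the binomial coefficient; a two-line check (Python):

    from math import comb

    for n in (7, 11):
        print(n, comb(2 * n - 1, n - 1))   # 1716 and 352716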

For many practical applications of the simulation results, the effects of discreteness are likely to be fairly minimal. However, one possible problem is that outliers are more likely to occur in the simulation output. For example, in Example 2.8 there were three outliers in the simulation, and these inflated the estimate v of the variance of T^*. Such outliers should be evident on a normal Q-Q plot (or comparable relevant plot), and when found they should be omitted. More generally, a statistic that depends heavily on a few quantiles can be sensitive to the repeated values that occur under nonparametric sampling, and it can be useful to smooth the original data when dealing with such statistics; see Section 3.4.

2.4 Simple Confidence Intervals

The major application for distributions and quantiles of an estimator T is in the calculation of confidence limits. There are several ways of using bootstrap simulation results in this context, most of which will be explored in Chapter 5. Here we describe briefly two basic methods.



The simplest approach is to use a normal approximation to the distribution of T. As outlined in Section 2.1.2, this means estimating the limits (2.4), which require only bootstrap estimates of bias and variance. As we have seen in previous sections, a normal approximation will not always suffice. Then if we use the bootstrap estimates of quantiles for T − θ as described in Section 2.2.2, an equitailed (1 − 2α) confidence interval will have limits

    t - (t^*_{((R+1)(1-\alpha))} - t), \qquad t - (t^*_{((R+1)\alpha)} - t).    (2.10)

This is based on the probability implication

    \Pr(a \le T - \theta \le b) = 1 - 2\alpha \;\Longrightarrow\; \Pr(T - b \le \theta \le T - a) = 1 - 2\alpha.
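A minimal sketch of the limits (2.10) in Python, assuming (R+1)α is an integer so that no interpolation between order statistics is needed:

    import numpy as np

    def basic_ci(t, tstar, alpha=0.025):
        # Basic bootstrap limits (2.10); assumes (R+1)*alpha is an integer
        R = len(tstar)
        ts = np.sort(tstar)
        k = int(round((R + 1) * alpha))
        return 2 * t - ts[R - k], 2 * t - ts[k - 1]

    # e.g. basic_ci(1.52, tstar) with the R = 999 city ratio replicates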

Figure 2.9 Scatter plots of sample standard deviation versus sample average for samples generated by parametric simulation from the fitted exponential model (left panels) and by nonparametric resampling (right panels). Top line is for R = 99 and bottom line is for R = 999.


We shall refer to the limits (2.10) as the basic bootstrap confidence limits. Their accuracy depends upon R, of course, and one would typically take R ≥ 1000 to be safe. But accuracy also depends upon the extent to which the distribution of T^* - t agrees with that of T − θ. Complete agreement will occur if T − θ has a distribution not depending on any unknowns. This special property is enjoyed by quantities called pivots, which we discuss in more detail in Section 2.5.1.

If, as is usually the case, the distribution of T − θ does depend on unknowns, then we can try alternative expressions contrasting T and θ, such as differences of transformed quantities, or studentized comparisons. For the latter, we define the studentized version of T − θ as

    Z = (T - \theta)/V^{1/2},    (2.11)

where V is an estimate of var(T | F): we give a fairly general form for V in Section 2.7.2. The idea is to mimic the Student-t statistic, which has this form, and which eliminates the unknown standard deviation when making inference about a normal mean. Throughout this book we shall use Z to denote a studentized statistic.

Recall that the Student-t (1 − 2α) confidence interval for a normal mean μ has limits

    \bar y - v^{1/2} t_{n-1}(1-\alpha), \qquad \bar y - v^{1/2} t_{n-1}(\alpha),

where v is the estimated variance of the mean and t_{n-1}(\alpha), t_{n-1}(1-\alpha) are quantiles of the Student-t distribution with n − 1 degrees of freedom, the distribution of the pivot Z. More generally, when Z is defined by (2.11), the (1 − 2α) confidence interval limits for θ have the analogous form

    t - v^{1/2} z_{1-\alpha}, \qquad t - v^{1/2} z_{\alpha},    (2.12)

where z_p denotes the p quantile of Z. One simple approximation, which can often be justified for large sample size n, is to take Z as being N(0,1). The result would be no different in practical terms from using a normal approximation for T − θ, and we know that this is often inadequate. It is more accurate to estimate the quantiles of Z from replicates of the studentized bootstrap statistic, Z^* = (T^* - t)/V^{*1/2}, where T^* and V^* are based on a simulated random sample, Y^*_1, \ldots, Y^*_n.

If the model is parametric, the Y^*_j are generated from the fitted parametric distribution, and if the model is nonparametric, they are generated from the EDF F̂, as outlined in Section 2.3. In either case we use the (R+1)α-th order statistic of the simulated values z^*_1, \ldots, z^*_R, namely z^*_{((R+1)\alpha)}, to estimate z_\alpha. Then the studentized bootstrap confidence interval for θ has limits

    t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{1/2} z^*_{((R+1)\alpha)}.


This studentized bootstrap method is most likely to be of use in nonparametric problems. One reason for this is that with parametric models we can sometimes find "exact" solutions (as with the exponential model for Example 1.1), and otherwise we have available methods based on the likelihood function. This does not necessarily rule out the use of parametric simulation, of course, for approximating the distribution of the quantity used as basis for the confidence interval.

Example 2.11 (Air-conditioning data) Under the exponential model for the data of Example 1.1, we have T = Ȳ, and since var(T | F_\mu) = \mu^2/n, we would take V = \bar Y^2/n. This gives

    Z = (T - \mu)/V^{1/2} = n^{1/2}(1 - \mu/\bar Y),

which is an exact pivot because Q = \bar Y/\mu has the gamma distribution with index n and unit mean. Simulation to construct confidence intervals is unnecessary because the quantiles of the gamma distribution are available from tables. Parametric simulation would be based on Q^* = \bar Y^*/t, where \bar Y^* is the average of a random sample Y^*_1, \ldots, Y^*_n from the exponential distribution with mean t. Since Q^* has the same distribution as Q, the only error incurred by simulation would be due to the randomness of the simulated quantiles. For example, the estimates of the 0.025 and 0.975 quantiles of Q based on R = 999 simulations are 0.504 and 1.608, compared to the exact values 0.517 and 1.640; these lead to estimated and exact 95% confidence intervals (67.2, 214.6) and (65.9, 209.2) respectively. We shall discuss these intervals more fully in Chapter 5. ■

Example 2.12 (City population data) For the sample of n = 10 pairs analysed in Example 2.8, our estimate of the ratio θ is t = \bar x/\bar u = 1.52. The 0.025 and 0.975 quantiles of the 999 values of t^* are 1.236 and 2.059, so the 95% basic bootstrap confidence interval (2.10) for θ is (0.981, 1.804).

To apply the studentized interval, we use the delta method approximation to the variance of T, which is (Problem 2.9)

    v_L = n^{-2} \sum_{j=1}^{n} (x_j - t u_j)^2 / \bar u^2,

and base confidence intervals for θ on (T - \theta)/v_L^{1/2}, using simulated values of z^* = (t^* - t)/v_L^{*1/2}. The simulated values in the right panel of Figure 2.5 show that the density of the studentized bootstrap statistic Z^* is not close to normal. The 0.025 and 0.975 quantiles of the 499 simulated z^* values are -3.063 and 1.447, and since v_L = 0.0325, an approximate 95% equitailed confidence interval based on (2.12) is (1.260, 2.072). This is quite different from the interval above.
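The following sketch reproduces the logic of Examples 2.8 and 2.12 for the data of Table 2.1 (Python with numpy); because the resamples are random, the resulting limits will differ somewhat from the quantiles and intervals quoted above.

    import numpy as np

    u = np.array([138., 93., 61., 179., 48., 37., 29., 23., 30., 2.])
    x = np.array([143., 104., 69., 260., 75., 63., 50., 48., 111., 50.])
    n = len(u)
    t = x.mean() / u.mean()
    vL = np.sum((x - t * u) ** 2) / (n**2 * u.mean() ** 2)

    rng = np.random.default_rng(5)
    R, alpha = 999, 0.025
    idx = rng.integers(0, n, size=(R, n))
    us, xs = u[idx], x[idx]
    tstar = xs.mean(axis=1) / us.mean(axis=1)
    # delta method variance recomputed within each resample
    vLs = ((xs - tstar[:, None] * us) ** 2).sum(axis=1) / (n**2 * us.mean(axis=1) ** 2)
    zstar = np.sort((tstar - t) / np.sqrt(vLs))

    k = int(round((R + 1) * alpha))
    ts = np.sort(tstar)
    print(2 * t - ts[R - k], 2 * t - ts[k - 1])                   # basic (2.10)
    print(t - np.sqrt(vL) * zstar[R - k], t - np.sqrt(vL) * zstar[k - 1])  # studentized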

The usefulness of these confidence intervals will depend on how well F̂


estimates F and the extent to which the distributions of T − θ and of Z depend on F. We cannot judge the former, but we can check the latter using the methods outlined in Section 3.9.2; see Examples 3.20 and 9.11. ■

2.5 Reducing Error

The error in resampling methods is generally a combination of statistical error and simulation error. The first of these is due to the difference between F̂ and F, and the magnitude of the resulting error will depend upon the choice of T. The simulation error is wholly due to use of empirical estimates of properties under sampling from F̂, rather than exact properties.

Figure 2.7 illustrates these two sources of error in quantile estimation. The decreasing simulation error shows as reduced scatter of the quantile estimates for increased R. Statistical error due to an inappropriate model for T is reflected by the difference between the simulated nonparametric quantiles for large R and the dotted lines that indicate the quantiles under the exponential model. The further statistical error due to the difference between F̂ and F cannot be illustrated, because we do not know the true model underlying the data. However, other samples of the same size from that model would yield different estimates of the true quantiles, quite apart from the variability of the quantile estimates obtained from each specific dataset by simulation.

2.5.1 Statistical error

The basic bootstrap idea is to approximate a quantity c(F), such as var(T | F), by the estimate c(F̂), where F̂ is either a parametric or a nonparametric estimate of F based on data y_1, ..., y_n. The statistical error is then the difference between c(F̂) and c(F), and as far as possible we wish to minimize this or remove it entirely. This is sometimes possible by careful choice of c(·). For example, in Example 1.1 with the exponential model, we have seen that working with T/θ removes statistical error completely.

For both confidence interval and significance test calculation, we usually have a choice as to what T is and how to use it. Significance testing raises special issues, because we then have to deal with a null hypothesis sampling distribution, so here it is best to focus on confidence interval calculation. For simplicity we also assume that the estimate T is decided upon. Then the quantity c(F) will be a quantile or a moment of some quantity Q = q(F̂, F) derived from T, such as h(T) − h(θ) or (T - \theta)/V^{1/2} where V is an estimated variance, or something more complicated such as a likelihood ratio. The statistical problem is to choose among these possible quantities so that the resulting Q is as nearly pivotal as possible, that is it has (at least approximately) the same distribution under sampling from both F and F̂.


Provided that Q is a monotone function of θ, it will be straightforward to obtain confidence limits. For example, if Q = h(T) − h(θ) with h(t) increasing in t, and if a_\alpha is an approximate lower α quantile of h(T) − h(θ), then

    1 - \alpha = \Pr\{ h(T) - h(\theta) \ge a_\alpha \} = \Pr[ \theta \le h^{-1}\{ h(T) - a_\alpha \} ],    (2.13)

where h^{-1}(·) is the inverse transformation. So h^{-1}\{ h(T) - a_\alpha \} is an upper (1 − α) confidence limit for θ.

Parametric problems

In parametric problems F = F_\psi and F̂ = F_{\hat\psi} have the same form, differing only in parameter values. The notion of a pivot is quite simple here, meaning constant behaviour under all values of the model parameters. More formally, we define a pivot as a function Q = q(T, θ) whose distribution does not depend on the value of ψ: for all q,

    \Pr\{ q(T, \theta) \le q \mid \psi \}

is independent of ψ. (In general Q may also depend on other statistics, as when Q is the studentized form of T.) One can check, sometimes theoretically and always empirically, whether or not a particular quantity Q is exactly or nearly pivotal, by examining its behaviour under the model form with varying parameter values. For example, in the context of Example 1.1 we could simultaneously examine properties of T − θ, log T − log θ and the studentized version of the former, by simulation under several exponential models close to the fitted model. This might result in plots of variance or selected quantiles versus parameter values, from which we could diagnose the nonpivotal behaviour of T − θ and the pivotal behaviour of log T − log θ.

A special role for transformation h(T) arises because sometimes it is relatively easy to choose h(·) so that the variance of T is approximately or exactly independent of θ, and this stability is the primary feature of stability of distribution. Suppose that T has variance v(θ). Then provided the function h(·) is well behaved at θ, Taylor series expansion as described in Section 2.7.1 leads to

    var\{ h(T) \} \doteq \{ \dot h(\theta) \}^2 v(\theta),

where \dot h(\theta) is the first derivative dh(\theta)/d\theta. This in turn implies that the variance is made approximately constant (equal to 1) if

    h(t) = \int^{t} \frac{du}{\{ v(u) \}^{1/2}}.    (2.14)

This is known as the variance-stabilizing transformation. Any constant multiple of h(T) will be equally effective: often in one-sample problems where v(\theta) = n^{-1}\sigma^2(\theta), equation (2.14) would be applied with \sigma(u) in place of \{v(u)\}^{1/2}, in which case h(·) is independent of n and var\{h(T)\} \doteq n^{-1}.
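For instance, in the exponential model of Example 1.1, where T = Ȳ has v(\theta) = \theta^2/n, the recipe (2.14) gives

    h(t) = \int^{t} \frac{du}{\{ u^2/n \}^{1/2}} = n^{1/2} \int^{t} \frac{du}{u} = n^{1/2} \log t,

so that var(\log T) \doteq n^{-1} whatever the value of θ. This recovers the pivotal behaviour of log T − log θ noted above.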

For a problem where v{8) varies strongly with 8, use o f this transform ation

In general Q may also depend on other statistics, as when Q is the studentized form of T.

h(8) is the first derivative dh(6)/d6.

Page 45: Bootstrap Methods and Their Application

2.5 ■ Reducing Error 33

Figure 2.10 Log-log plot of estimated variance of Y against 6 for the air-conditioning data with an exponential model. The plot suggests strongly that var(Y | 0) oc 62.

<Doc(0•c(0>

ooo

50 60 70 90

theta

200

in conjunction with (2.13) will typically give m ore accurate confidence limits than would be obtained using direct approxim ations o f quantiles for T — 6.

If such use of the transformation is appropriate, it will sometimes be clear from theoretical considerations, as in the exponential case. Otherwise the transformation would have to be identified from a scatter plot of simulation-estimated variance of T versus θ for a range of values of θ.

Example 2.13 (Air-conditioning data) Figure 2.10 shows a log-log plot of the empirical variances of t^* = \bar y^* based on R = 50 simulations for each of a range of values of θ. That is, for each value of θ we generate R values t^*_r corresponding to samples y^*_1, \ldots, y^*_n from the exponential distribution with mean θ, and then plot \log\{ (R-1)^{-1} \sum_r (t^*_r - \bar t^*)^2 \} against \log\theta. The linearity and slope of the plot confirm that var(T | F) ∝ θ², where θ = E(T | F). ■
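A sketch of the simulation behind Figure 2.10 (Python with numpy; the sample size n = 12 and the grid of θ values are assumptions made for illustration). The fitted slope should come out close to 2.

    import numpy as np

    rng = np.random.default_rng(6)
    n, R = 12, 50
    thetas = np.linspace(50.0, 200.0, 16)
    logv = [np.log(rng.exponential(scale=th, size=(R, n)).mean(axis=1).var(ddof=1))
            for th in thetas]
    # slope near 2 confirms var(T | theta) proportional to theta^2
    print(np.polyfit(np.log(thetas), logv, 1)[0])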

Nonparametric problems

In nonparametric problems the situation is more complicated. It is now unlikely (but not strictly impossible) that any quantity can be exactly pivotal. Also we cannot simulate data from a distribution with the same form as F, because that form is unknown. However, we can simulate data from distributions near to and similar to F̂, and this may be enough since F̂ is near F. A rough idea of what is possible can be had from Example 2.10. In the right-hand panels of Figure 2.9 we plotted sample standard deviation versus sample average for a series of nonparametrically resampled datasets. If the EDFs of those datasets are thought of as models near both F and F̂, then although the pattern is obscured by the banding, the plots suggest that the true model has standard deviation proportional to its mean, which is indeed the case for the most likely true model. There are conceptual difficulties with this argument, but there is little question that the implication drawn is correct, namely that log Ȳ will have approximately the same variance under sampling from both F and F̂.

A more thorough discussion of these ideas for nonparametric problems will be given in Section 3.9.2.

A major focus of research on resampling methods has been the reduction of statistical error. This is reflected particularly in the development of accurate confidence limit methods, which are described in Chapter 5. In general it is best to remove as much of the statistical error as possible in the choice of procedure. However, it is possible to reduce statistical error by a bootstrap technique described in Section 3.9.1.

2.5.2 Simulation error

Simulation error arises when Monte Carlo simulations are performed and properties of statistics are approximated by their empirical properties in these simulations. For example, we approximate the estimate B = E^*(T^* | F̂) - t of the bias \beta = E(T) - \theta by the average B_R = R^{-1} \sum_{r=1}^{R} (T^*_r - t) = \bar T^* - t, using the independent replications T^*_1, \ldots, T^*_R, each based on a random sample from our data EDF F̂. The Monte Carlo variability in R^{-1} \sum T^*_r can only be removed entirely by an infinite simulation, which seems both impossible and unnecessary in practice. The practical question is, how large does R need to be to achieve reasonable accuracy, relative to the statistical accuracy of the quantity (bias, variance, etc.) being approximated by simulation? While it is not possible to give a completely general and firm answer, we can get a fairly good sense of what is required by considering the bias, variance and quantile estimates in simple cases. This we now do.

Suppose that we have a sample y_1, ..., y_n from the N(\mu, \sigma^2) distribution, and that the parameter of interest \theta = \mu is estimated by the sample average t = \bar y. Suppose that we use nonparametric simulation to approximate the bias, variance and the p quantile a_p of T - \theta = \bar Y - \mu. Then the first step, as described in Section 2.3, is to take R independent replicate samples from y_1, ..., y_n, and calculate their means \bar Y^*_1, \ldots, \bar Y^*_R. From these we calculate the bias, variance and quantile estimators as described earlier. Of course the problem is so simple that we know the real answers, namely 0, n^{-1}\sigma^2 and n^{-1/2}\sigma z_p, where z_p is the p quantile of the standard normal distribution. So the corresponding (infinite simulation) estimates of bias and variance are 0 and n^{-1}\hat\sigma^2, where \hat\sigma^2 = n^{-1} \sum (y_j - \bar y)^2. The corresponding estimate \hat a_p of the p quantile a_p is approximately n^{-1/2}\hat\sigma z_p under nonparametric resampling, ignoring O(n^{-1}) terms. We now compare the finite-simulation approximations to these estimates.


First consider the bias estimator

    B_R = R^{-1} \sum_{r=1}^{R} \bar Y^*_r - \bar y.

Conditional on the particular sample y_1, ..., y_n, or equivalently its EDF F̂, the mean and variance of the bias estimator across all possible simulations are

    E^*(B_R) = 0, \qquad var^*(B_R) = \frac{\hat\sigma^2}{nR},    (2.15)

because E^*(\bar Y^*_r) = \bar y and var^*(\bar Y^*_r) = n^{-1}\hat\sigma^2. The unconditional variance of B_R, taking into account the variability between different samples from the underlying distribution, is

    var(B_R) = var_Y\left\{ E^*\left( R^{-1} \sum_r \bar Y^*_r - \bar Y \right) \right\} + E_Y\left\{ var^*\left( R^{-1} \sum_r \bar Y^*_r - \bar Y \right) \right\},

where E_Y(·) and var_Y(·) denote the mean and variance taken with respect to the joint distribution of Y_1, ..., Y_n. From (2.15) this gives

    var(B_R) = var_Y(0) + E_Y\left( \frac{\hat\sigma^2}{nR} \right) = \frac{\sigma^2}{n} \times \frac{n-1}{nR}.    (2.16)

This result does not depend on normality of the data. A similar expression holds for any smooth statistic T with a linear approximation (Section 2.7.2), except for an O(n^{-2}) term.

Next consider the variance estimator V_R = (R-1)^{-1} \sum_r (\bar Y^*_r - \bar Y^*_\cdot)^2, where \bar Y^*_\cdot = R^{-1} \sum_r \bar Y^*_r. The mean and variance of V_R across all possible simulations, conditional on the data, are

    E^*(V_R) = \frac{\hat\sigma^2}{n}, \qquad var^*(V_R) \doteq \frac{\hat\sigma^4}{n^2 R} (2 + \hat\gamma_2),

where \hat\gamma_2 is the standardized fourth cumulant (the standardized kurtosis) of the data (Appendix A). Note that \hat\gamma_2 would be zero for a parametric simulation but not for our nonparametric simulation, although here \hat\gamma_2 = O_p(n^{-1/2}) because the data are normally distributed. The unconditional variance of V_R, averaging over all possible datasets, is

    var(V_R) = var_Y\{ E^*(V_R) \} + E_Y\{ var^*(V_R) \},

which reduces to

    var(V_R) \doteq \frac{2\sigma^4}{n^2} \left( \frac{1}{n} + \frac{1}{R} \right).    (2.17)

The first term on the right of (2.17) is due to data variation, the second to simulation variation. The implication is that to make the simulation variance as small as 10% of that due to data variation, one must take R = 10n.

The corresponding result for general data distributions would include an additional term from the kurtosis of the Y_j. A similar result holds for a general smooth statistic T.

Finally consider the estimator of the p quantile a_p of \bar Y - \mu, which is \hat a_{p,R} = \bar Y^*_{((R+1)p)} - \bar y, with \bar Y^*_{((R+1)p)} the (R+1)p-th order statistic of the simulated values \bar Y^*_1, \ldots, \bar Y^*_R. The general calculation of simulation properties of \hat a_{p,R} is complicated, so we make the simplifying assumption that the N(\bar y, n^{-1}\hat\sigma^2) approximation for \bar Y^* is exact. With this assumption, standard properties of order statistics give

    E^*(\hat a_{p,R}) \doteq \hat a_p = n^{-1/2} \hat\sigma z_p,

and

    var^*(\hat a_{p,R}) \doteq \frac{p(1-p)}{R g^2(\hat a_p)} = \frac{2\pi p(1-p) \hat\sigma^2 \exp(z_p^2)}{nR},    (2.18)

where g(·) is the density of \bar Y^* - \bar y conditional on F̂, here taken to be the N(0, n^{-1}\hat\sigma^2) density. (Note that the middle term of (2.18) applies for any T and any data distribution, with g(·) the density of T − θ.) The unconditional variance of \hat a_{p,R} over all datasets can then be reduced to

    var(\hat a_{p,R}) \doteq \frac{\sigma^2}{n} \left\{ \frac{z_p^2}{2n} + \frac{2\pi p(1-p) \exp(z_p^2)}{R} \right\}.    (2.19)

The implication of (2.19) is that the variance inflation due to simulation is approximately

    \frac{4\pi n p(1-p) \exp(z_p^2)}{z_p^2 R} = \frac{n \, d(p)}{R},

say. Some values of d(p) are as follows.

    p or 1-p    0.01    0.025    0.05    0.10    0.25
    d(p)        5.15    3.72     3.30    3.56    8.16

So to make the variance inflation factor 10% for the 0.025 quantile, for example, we would need R = 40n. Equation (2.19) may not be useful in the centre of the distribution, where d(p) is very large because z_p is small.
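The d(p) values are easy to reproduce (Python; NormalDist is in the standard library):

    from math import pi, exp
    from statistics import NormalDist

    def d(p):
        z = NormalDist().inv_cdf(p)
        return 4 * pi * p * (1 - p) * exp(z * z) / (z * z)

    for p in (0.01, 0.025, 0.05, 0.10, 0.25):
        print(p, round(d(p), 2))   # 5.15, 3.72, 3.30, 3.56, 8.16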

Example 2.14 (Air-conditioning data) To see how well this discussion applies in practice, we look briefly at results for the data in Example 1.1. The statistic of interest is T = log Ȳ, which estimates θ = log μ. The true model for Y is taken to be the gamma distribution with index κ = 0.71 and mean μ = 108.083; these are the data estimates. Effects due to simulation error are approximated


Table 2.3 Components of variance (×10^{-3}) in bootstrap estimation of the p quantile for log Ȳ − log μ, due to data variation and simulation variation, based on nonparametric simulation applied to the data of Example 1.1.

                                            p
    Source                  Type          0.01    0.99    0.05    0.95    0.10    0.90

    Data                    actual        31.0     6.9    14.0     3.6     8.3     2.2
                            theoretical   26.6    26.6    13.3    13.3     8.1     8.1

    Simulation, R = 100     actual        53.6     9.4     8.5     3.2     3.8     2.6
                            theoretical   32.9    32.9    10.5    10.5     6.9     6.9

    Simulation, R = 500     actual         4.3     2.4     2.0     0.6     1.2     0.4
                            theoretical    6.6     6.6     2.1     2.1     1.4     1.4

    Simulation, R = 1000    actual         2.2     0.8     1.5     0.1     0.8     0.2
                            theoretical    3.3     3.3     1.0     1.0     0.7     0.7

by taking sets of R simulations from one long nonparametric simulation of 9999 datasets. Table 2.3 shows the actual components of variation due to simulation and data variation, together with the theoretical components in (2.19), for estimates of quantiles of log Ȳ − log μ. On the whole the theory gives a fairly accurate prediction of performance. ■

It is not necessarily best to choose R solely on the basis of the variance inflation factor. For example, if we had been discussing the studentized statistic Z defined by (2.11) and its quantiles, then the component of variation due to data variance would be approximately zero to the accuracy used in (2.18), based on the N(0,1) approximation. So the variance inflation factor would be enormous. What really counts is the effect of the simulation on the final result, say the length and coverage of the confidence interval. This presents a much more delicate question (Problem 5.5).

Another way to estimate quantiles for T − θ is by normal approximation with bootstrap estimates of bias and variance. Similar calculations of simulation error are possible; see Problem 2.7. In general the normal approximation is suspect, although its applicability can be assessed by a normal Q-Q plot of the simulated t^* values.

2.6 Statistical Issues

2.6.1 When does the bootstrap work?

Consistency

There are two senses in which resampling methods might "work". First, do they give reliable results when used with the sort of data encountered in practice? This question is crucial in applications, and is a major focus of this book. It leads one to consider how the resamples themselves can be used to tell


when and how a bootstrap calculation might fail, and ideally how it should be amended to yield useful answers. This topic of bootstrap diagnostics is discussed more fully in Section 3.10.

A second question is: under what idealized conditions will a resampling procedure produce results that are in some sense mathematically correct? Answers to questions of this sort involve an asymptotic framework in which the sample size n → ∞. Although such asymptotics are ultimately intended to guide practical work, they often act only as a backstop, by removing from consideration procedures that do not have appropriate large-sample properties, and are usually not subtle enough to discriminate among competing procedures according to their finite-sample characteristics. Nevertheless it is essential to appreciate when a naive application of the bootstrap will fail.

To put the theoretical basis for the bootstrap in simple terms, suppose that we have a random sample Y_1, ..., Y_n, or equivalently its EDF F̂, from which we wish to estimate properties of a standardized quantity Q = q(Y_1, \ldots, Y_n; F). For example, we might take

    Q(Y_1, \ldots, Y_n; F) = n^{1/2}\left\{ \bar Y - \int y \, dF(y) \right\} = n^{1/2}(\bar Y - \theta),

say, and want to estimate the distribution function

    G_{F,n}(q) = \Pr\{ Q(Y_1, \ldots, Y_n; F) \le q \mid F \},    (2.20)

where the conditioning on F indicates that Y_1, ..., Y_n is a random sample from F. The bootstrap estimate of (2.20) is

    G_{\hat F,n}(q) = \Pr\{ Q(Y^*_1, \ldots, Y^*_n; \hat F) \le q \mid \hat F \},    (2.21)

where in this case Q(Y^*_1, \ldots, Y^*_n; \hat F) = n^{1/2}(\bar Y^* - \bar y). In order for G_{\hat F,n} to approach G_{F,n} as n → ∞, three conditions must hold. Suppose that the true distribution F is surrounded by a neighbourhood 𝒩 in a suitable space of distributions, and that as n → ∞, F̂ eventually falls into 𝒩 with probability one. Then the conditions are:

1 for any A ∈ 𝒩, G_{A,n} must converge weakly to a limit G_{A,∞};
2 this convergence must be uniform on 𝒩; and
3 the function mapping A → G_{A,∞} must be continuous.

Here weak convergence of G_{A,n} to G_{A,∞} means that as n → ∞,

    \int h(u) \, dG_{A,n}(u) \to \int h(u) \, dG_{A,\infty}(u)

for all integrable functions h(·). Under these conditions the bootstrap is consistent, meaning that for any q and ε > 0, \Pr\{ |G_{\hat F,n}(q) - G_{F,\infty}(q)| > \varepsilon \} \to 0 as n → ∞.


Here and below we say X_n = O_p(n^d) when \Pr(n^{-d}|X_n| > \varepsilon) \to p for some constant p as n → ∞, and X_n = o_p(n^d) when \Pr(n^{-d}|X_n| > \varepsilon) \to 0 as n → ∞, for any ε > 0.

The first condition ensures that there is a limit for G_{F̂,n} to converge to, and would be needed even in the happy situation where F̂ equalled F for every n > n', for some n'. Now as n increases, F̂ changes, so the second and third conditions are needed to ensure that G_{F̂,n} approaches G_{F,∞} along every possible sequence of F̂s. If any one of these conditions fails, the bootstrap can fail.

Example 2.15 (Sample maximum) Suppose that Y_1, ..., Y_n is a random sample from the uniform distribution on (0, θ). Then the maximum likelihood estimate of θ is the largest sample value, T = Y_{(n)}, where Y_{(1)} ≤ ··· ≤ Y_{(n)} are the sample order statistics. Consider nonparametric resampling. The limiting distribution of Q = n(\theta - T)/\theta is standard exponential, and this suggests that we take our standardized quantity to be Q^* = n(t - T^*)/t, where t is the observed value of T, and T^* is the maximum of a bootstrap sample of size n taken from y_1, ..., y_n. As n → ∞, however,

    \Pr(Q^* = 0 \mid \hat F) = \Pr(T^* = t \mid \hat F) = 1 - (1 - n^{-1})^n \to 1 - e^{-1},

and consequently the limiting distribution of Q^* cannot be standard exponential. The problem here is that the second condition fails: the distributional convergence is not uniform on useful neighbourhoods of F. Any fixed order statistic Y_{(k)} suffers from the same difficulty, but a statistic like a sample quantile, where we would take k = pn for some fixed 0 < p < 1, does not. ■
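A quick numerical illustration of this failure (Python with numpy; the uniform sample is hypothetical):

    import numpy as np

    # point mass at the observed maximum: 1 - (1 - 1/n)^n -> 1 - 1/e = 0.632...
    for n in (10, 100, 1000):
        print(n, 1 - (1 - 1 / n) ** n)

    rng = np.random.default_rng(7)
    n = 100
    y = rng.uniform(0.0, 1.0, size=n)        # hypothetical U(0, theta) sample
    t = y.max()
    tstar = rng.choice(y, size=(5000, n)).max(axis=1)
    print(np.mean(tstar == t))               # close to 0.634 for n = 100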

Asymptotic accuracy

Consistency is a weak property, for example guaranteeing only that the true probability coverage of a nominal (1 − 2α) confidence interval is 1 - 2\alpha + o_p(1). Standard normal approximation methods are consistent in this sense. Once consistency is established, meaning that the resampling method is "valid", we need to know whether the method is "good" relative to other possible methods. This involves looking at the rate of convergence to nominal properties. For example, does the coverage of the confidence interval deviate from (1 − 2α) by O_p(n^{-1/2}) or by O_p(n^{-1})? Some insight into this can be obtained by expansion methods, as we now outline. More detailed calculations are made in Section 5.4.

Suppose that the problem is one where the limiting distribution of Q is standard normal, and where an Edgeworth expansion applies. Then the distribution of Q can be written in the form

    \Pr(Q \le q \mid F) = \Phi(q) + n^{-1/2} a(q) \phi(q) + O(n^{-1}),    (2.22)

where Φ(·) and φ(·) are the CDF and PDF of the standard normal distribution, and a(·) is an even quadratic polynomial. For a wide range of problems it can be shown that the corresponding approximation for the bootstrap version of Q is

    \Pr(Q^* \le q \mid \hat F) = \Phi(q) + n^{-1/2} \hat a(q) \phi(q) + O_p(n^{-1}),    (2.23)


where \hat a(·) is obtained by replacing unknowns in a(·) by estimates. Now typically \hat a(q) = a(q) + O_p(n^{-1/2}), so

    \Pr(Q^* \le q \mid \hat F) - \Pr(Q \le q \mid F) = O_p(n^{-1}).    (2.24)

Thus the estimated distribution for Q differs from the true distribution by a term that is O_p(n^{-1}), provided that Q is constructed in such a way that it is asymptotically pivotal. A similar argument will typically hold when Q has a different limiting distribution, provided it does not depend on unknowns.

Suppose that we choose not to standardize Q, so that its limiting distribution is normal with variance v. An Edgeworth expansion still applies, now with form

    \Pr(Q \le q \mid F) = \Phi\left(\frac{q}{v^{1/2}}\right) + n^{-1/2} a'\left(\frac{q}{v^{1/2}}\right) \phi\left(\frac{q}{v^{1/2}}\right) + O(n^{-1}),    (2.25)

where a'(·) is a quadratic polynomial that is different from a(·). The corresponding expansion for Q^* is

    \Pr(Q^* \le q \mid \hat F) = \Phi\left(\frac{q}{\hat v^{1/2}}\right) + n^{-1/2} \hat a'\left(\frac{q}{\hat v^{1/2}}\right) \phi\left(\frac{q}{\hat v^{1/2}}\right) + O_p(n^{-1}).    (2.26)

Typically \hat v = v + O_p(n^{-1/2}), which would imply that

    \Pr(Q^* \le q \mid \hat F) - \Pr(Q \le q \mid F) = O_p(n^{-1/2}),    (2.27)

because the leading terms on the right-hand sides of (2.25) and (2.26) are different.

The difference between (2.24) and (2.27) explains our insistence on working with approximate pivots whenever possible: use of a pivot will mean that a bootstrap distribution function is an order of magnitude closer to its target. It also gives a cogent theoretical motivation for using the bootstrap to set confidence intervals, as we now outline.

We can obtain the α quantile of the distribution of Q by inverting (2.22), giving the Cornish-Fisher expansion

    q_\alpha = z_\alpha + n^{-1/2} a''(z_\alpha) + O(n^{-1}),

where z_\alpha is the α quantile of the standard normal distribution, and a''(·) is a further polynomial. The corresponding bootstrap quantile has the property that \hat q^*_\alpha - q_\alpha = O_p(n^{-1}). For simplicity take Q = (T - \theta)/V^{1/2}, where V estimates the variance of T. Then an exact one-sided confidence interval for θ based on Q would be I_\alpha = [T - V^{1/2} q_\alpha, \infty), and this contains the true θ with probability α. The corresponding bootstrap interval is \hat I_\alpha = [T - V^{1/2} \hat q^*_\alpha, \infty), where \hat q^*_\alpha is the α quantile of the distribution of Q^*, which would often be estimated by simulation, as we have seen. Since \hat q^*_\alpha - q_\alpha = O_p(n^{-1}), we have

    \Pr(\theta \in I_\alpha) = \alpha, \qquad \Pr(\theta \in \hat I_\alpha) = \alpha + O(n^{-1}),


so that the actual probability that \hat I_\alpha contains θ differs from the nominal probability by only O(n^{-1}). In contrast, intervals based on inverting (2.25) will contain θ with probability \alpha + O(n^{-1/2}). This interval is in principle no more accurate than using the interval [T - V^{1/2} z_\alpha, \infty) obtained by assuming that the distribution of Q is standard normal. Thus one-sided confidence intervals based on quantiles of Q^* have an asymptotic advantage over the use of a normal approximation. Similar comments apply to two-sided intervals.

The practical usefulness of such results will depend on the numerical value of the difference (2.24) at the values of q of interest, and it will always be wise to try to decrease this statistical error, as outlined in Section 2.5.1.

The results above based on Edgeworth expansions apply to many common statistics: smooth functions of sample moments, such as means, variances, and higher moments, eigenvalues and eigenvectors of covariance matrices; smooth functions of solutions to smooth estimating equations, such as most maximum likelihood estimators, estimators in linear and generalized linear models, and some robust estimators; and to many statistics calculated from time series.

2.6.2 Rough statistics: unsmooth and unstable

What typically validates the bootstrap is the existence of an Edgeworth expansion for the statistic of interest, as would be the case when that statistic is a differentiable function of sample moments. Some statistics, such as sample quantiles, depend on the sample in an unsmooth or unstable way such that standard expansion theory does not apply. Often the nonparametric resampling method will still be valid, in the sense that it is consistent, but for finite samples it may not work very well. Part of the reason for this is that the set of possible values for T^* may be very small, and very vulnerable to unusual data points. A case in point is that of sample quantiles, the most familiar of which, the sample median, is discussed in the next example. Example 2.15 gives a case where naive resampling fails completely.

Example 2.16 (Sample median) Suppose that the sample size is odd, n = 2m + 1, so that the sample median is \tilde y = y_{(m+1)}. In large samples the median is approximately normally distributed about the population median μ, but standard nonparametric methods of variance estimation (jackknife and delta method) do not work here (Example 2.19, Problem 2.17). Nonparametric resampling does work to some extent, provided the sample size is quite large and the data are not too dirty. Crucially, bootstrap confidence limits work quite well.

Note first that the bootstrap statistic \tilde Y^* is concentrated on the sample values y_{(k)}, which makes the estimated distribution of the median very discrete and very vulnerable to unusual observations. Problem 2.4 shows that the exact


                       Normal            t_3             Cauchy
    n                 11     21        11     21        11      21

    Theoretical      14.3    7.5      16.8    8.8      22.4    11.7
    Empirical        13.9    7.3      19.1    9.5      38.3    14.6
    Mean bootstrap   17.2    8.8      25.9   11.4     14000    22.8

    Effective df      4.3    5.4       3.2    4.9     0.002     0.5

distribution of \tilde Y^* is

    \Pr(\tilde Y^* = y_{(k)}) = \sum_{j=0}^{m} \binom{n}{j} \pi_{k-1}^j (1 - \pi_{k-1})^{n-j} - \sum_{j=0}^{m} \binom{n}{j} \pi_k^j (1 - \pi_k)^{n-j}    (2.28)

for k = 1, ..., n, where \pi_k = k/n; simulation is not needed in this case. The moments of this bootstrap distribution, including its mean and variance, converge to the correct values as n increases. However, the convergence can be very slow. To illustrate this, Table 2.4 compares the average bootstrap variance with the empirical variance of the median for data samples of sizes n = 11 and 21 from the standard normal distribution, the Student-t distribution with three degrees of freedom, and the Cauchy distribution; also shown are the theoretical variance approximations, which are incalculable when the true distribution F is unknown. We see that the bootstrap variance can be very poor for n = 11 when distributions are long-tailed. The value 1.4 × 10^4 for average bootstrap variance with Cauchy data is not a mistake: the bootstrap variance exceeds 100 for about 1% of datasets: for some samples the bootstrap variance is huge. The situation stabilizes when n reaches 40 or more.
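A minimal sketch of (2.28) in Python, assuming distinct data values so the support is exactly y_{(1)}, \ldots, y_{(n)}:

    from math import comb

    def median_pmf(n):
        # Exact distribution (2.28) of the resample median, n = 2m + 1 odd
        m = (n - 1) // 2
        cdf = lambda pi: sum(comb(n, j) * pi**j * (1 - pi) ** (n - j)
                             for j in range(m + 1))
        return [cdf((k - 1) / n) - cdf(k / n) for k in range(1, n + 1)]

    pmf = median_pmf(11)
    print(sum(pmf))   # 1.0: all mass on y_(1), ..., y_(n)
    print(pmf)        # concentrated around the sample median y_(6)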

The gross discreteness of \tilde y^* could also affect the simple confidence limit method described in Section 2.4. But provided the inequalities used to justify (2.10) are taken to be ≤ and ≥ rather than < and >, the method works well. For example, for Cauchy samples of size n = 11 the coverage of the 90% basic bootstrap confidence interval (2.10) is 90.8% in 1000 samples; see Problem 2.4. We suggest adopting the same practice for all problems where t^* is supported on a small number of values. ■

The statistic T will certainly behave wildly under resampling when t(F) does not exist, as happens for the mean when F is a Cauchy distribution. Quite naturally over repeated samples the bootstrap will produce silly and useless results in such cases. There are two points to make here. First, if data are taken from a real population, then such mathematical difficulties cannot arise. Secondly, the standard approaches to data analysis include careful screening of data for outliers, nonnormality, and so forth, which leads either to deletion of disruptive data elements or to sensible and reliable choices of estimators T.

Table 2.4 Theoretical, empirical and mean bootstrap estimates of variance (×10^{-2}) of sample median, based on 10000 datasets of sizes n = 11, 21. The effective degrees of freedom of bootstrap variances uses a χ² approximation to their distribution.


In short, the mathematical pathology of nonexistence is unlikely to be a practical problem.

2.6.3 Conditional properties

Resampling calculations are based on the observed data, and in that sense resampling methods are conditional on the data. This is especially so in the nonparametric case, where nothing but data is used. Because of this, the question is sometimes asked: "Are resampling methods therefore conditional in the inferential sense?" The short answer is: "No, at least not in any useful way, unless the relevant conditioning can be made explicit."

Conditional inference arises in parametric inference when the sufficient statistic includes an ancillary statistic A whose distribution is free of parameters. Then we argue that inferences about parameters (e.g. confidence intervals) should be based on sampling distributions conditional on the observed value of A; this brings inference more into line with Bayesian inference. Two examples are the configuration of residuals in location models, and the values of explanatory variables in regression models. The first cannot be accommodated in nonparametric bootstrap analysis because the effect depends upon the unknown F. The second can be accommodated (Chapter 6) because the effect does not depend upon the stochastic part of the model. It is certainly true that the bootstrap distribution of T^* will reflect ancillary features of the data, as in the case of the sample median (Example 2.16), but the reflection is pale to the point of uselessness.

There are situations where it is possible explicitly to condition the resampling so as to provide conditional inference. Largely these situations are those where there is an experimental ancillary statistic, as in regression. One other situation is discussed in Example 5.17.

2.6.4 When might the bootstrap fail?

Incomplete data

So far we have assumed that F is the distribution of interest and that the sample y_1, ..., y_n drawn from F has nothing removed before we see it. This might be important in several ways, not least in guaranteeing statistical consistency of our estimator T. But in some applications the observation that we get may not always be y itself. For example, with survival data the ys might be censored, meaning that we may only learn that y was greater than some cut-off c because observation of the subject ceased before the event which determines y. Or, with multiple measurements on a series of patients it may be that for some patients certain measurements could not be made because the patient did not consent, or the doctor forgot.


Under certain circumstances the resampling methods we have described will work, but in general it would be unwise to assume this without careful thought. Alternative methods will be described in Section 3.6.

Dependent data

In general the nonparametric resampling method that we have described will not work for dependent data. This can be illustrated quite easily in the case where the data form one realization of a correlated time series. For example, consider the sample average Ȳ and suppose that the data come from a stationary series \{Y_j\} whose marginal variance is \sigma^2 = var(Y_j) and whose autocorrelations are \rho_h = corr(Y_j, Y_{j+h}) for h = 1, 2, .... In Example 2.7 we showed that the nonparametric bootstrap estimate of the variance of Ȳ is approximately s^2/n, and for large n this will approach \sigma^2/n. But the actual variance of Ȳ is (a standard time-series result, restored here from context)

    var(\bar Y) = \frac{\sigma^2}{n} \left\{ 1 + 2 \sum_{h=1}^{n-1} \left( 1 - \frac{h}{n} \right) \rho_h \right\}.

The sum here would often differ considerably from one, and then the bootstrap estimate of variance would be badly wrong.

Similar problems arise with other forms of dependent data. The essence of the problem is that simple bootstrap sampling imposes mutual independence on the Y_j, effectively assuming that their joint CDF is F(y_1) × ··· × F(y_n) and thus sampling from its estimate F̂(y^*_1) × ··· × F̂(y^*_n). This is incorrect for dependent data. The difficulty is that there is no obvious way to estimate a general joint density for Y_1, ..., Y_n given one realization. We shall explore this important subject further in Chapter 8.

Weakly dependent da ta occur in the altogether different context o f finite population sampling. Here the basic nonparam etric resampling m ethods work reasonably well. M ore will be said about this in Section 3.7.

Dirty data

What if simulated resampling is used when there are outliers in the data? There is no substitute for careful data scrutiny in this or any other statistical context, and if obvious outliers are found, they should be removed or corrected. When there is a fitted parametric model, it provides a benchmark for plots of residuals and the panoply of statistical diagnostics, and this helps to detect poor model fit. When there is no parametric model, F is estimated by the EDF, and the benchmark is swept away because the data and the model are one and the same. It is then vital to look closely at the simulation output, in order to see whether the conclusions depend crucially on particular observations. We return to this question of sensitivity analysis in Section 3.10.


2.7 Nonparametric Approximations for Variance and Bias

2.7.1 Delta methods

In parametric analysis it is often possible to represent estimators T in terms of fundamental statistics U_1, \ldots, U_m, such as sample moments, for which exact or approximate distributional calculations are relatively easy. Then we can take advantage of the delta method to obtain distributional approximations for T itself.

Consider first the case of a scalar estimator T which is a function of the scalar statistic U based on a sample of size n, say T = g(U). Suppose that it is known that

    U \sim N(\zeta, n^{-1}\sigma^2(\zeta)),

where "∼" here means "is approximately distributed as". Two formal expressions are U = \zeta + o_p(1) and U = \zeta + n^{-1/2}\sigma(\zeta) Z + O_p(n^{-1}), where Z is a N(0,1) variable. (In some cases the O_p(n^{-1}) remainder term in the second expression would be o_p(n^{-1/2}), but this would not affect the principal result of the delta method below.) The first of these corresponds to a statement of the consistency property of U, and the second amplifies this to state both the rate of convergence and the normal approximation in an alternative form.

Now consider T = g(U), where g(·) is a smooth function. We shall see below that provided \dot g(\zeta) \ne 0,

    T \sim N(\theta, n^{-1}\{\dot g(\zeta)\}^2 \sigma^2(\zeta)),

where \theta = g(\zeta); the dot indicates differentiation with respect to ζ. This result is what is usually meant by the delta method result, the principal feature being the delta method variance approximation

    var\{ g(U) \} \doteq \{ \dot g(\zeta) \}^2 var(U).    (2.29)

To see why (2.29) should be true, note that if g(·) is smooth then T is consistent for \theta = g(\zeta), since

    g(U) = g(\zeta + o_p(1)) = g(\zeta) + o_p(1).

Further, by Taylor series expansion we can write

    T = g(U) = g(\zeta) + (U - \zeta)\dot g(\zeta) + \tfrac{1}{2}(U - \zeta)^2 \ddot g(\zeta) + O_p(n^{-3/2}),    (2.30)

since the remainder is proportional to (U - \zeta)^3. A truncated version of the series expansion is

    T = g(U) = g(\zeta) + (U - \zeta)\dot g(\zeta) + O_p(n^{-1}).    (2.31)

From the latter, we can see that the normal approximation for U implies that

    T = g(\zeta) + n^{-1/2}\dot g(\zeta)\sigma(\zeta) Z + o_p(n^{-1/2}),

which in turn entails (2.29).
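A Monte Carlo check of (2.29) for one smooth choice, T = \log U with U the average of n exponential variables with mean ζ (an assumption made purely for illustration): then \dot g(\zeta) = 1/\zeta and var(U) = \zeta^2/n, so var(\log U) \doteq 1/n regardless of ζ.

    import numpy as np

    rng = np.random.default_rng(8)
    n, zeta = 50, 100.0
    U = rng.exponential(scale=zeta, size=(20000, n)).mean(axis=1)
    print(np.log(U).var(), 1 / n)   # close, and free of zeta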


Nothing has yet been said about the bias of T, which would usually be hidden in the O_p(n^{-1}) term. If we take the larger expansion (2.30), ignore the remainder term, and take expectations, we obtain

    E(T) \doteq g(\zeta) + \dot g(\zeta) E(U - \zeta) + \tfrac{1}{2} \ddot g(\zeta) var(U);

or, if U is unbiased for ζ,

    E(T) \doteq \theta + \frac{1}{2n} \ddot g(\zeta) \sigma^2(\zeta).

These results extend quite easily to the case of vector U and vector T, as outlined in Problem 2.9. The extension includes the case where U is the set of observed frequencies f_1, \ldots, f_m when Y is discrete with probabilities \pi_1, \ldots, \pi_m on m possible values. Then the analogue of (2.31) is

    T = g(\hat\pi_1, \ldots, \hat\pi_m) = g(\pi_1, \ldots, \pi_m) + \sum_{j=1}^{m} (\hat\pi_j - \pi_j) \frac{\partial g(\pi_1, \ldots, \pi_m)}{\partial \pi_j} + O_p(n^{-1}),    (2.32)

where \hat\pi_j = f_j/n. In this case the normal approximation for f_1, \ldots, f_m is easy to derive, but is singular because of the constraint \sum f_j = n, i.e. \sum \hat\pi_j = 1. In effect (2.32) provides a version of the nonparametric delta method, restricted to discrete data problems. In the next subsection we extend the expansion method to the general nonparametric case.

2.7.2 Influence function and nonparametric delta method

There is a simple variance approxim ation for m any statistics T with the representation t(F). The key idea is an extension of the Taylor series expan­sion to statistical functions, which allows us to extend (2.32) to continuous distributions. The linear form o f the expansion is

j u r ,t{G) = t ( F ) + L t(y,F)dG(y) , (2.33)

where Lt, the first derivative o f f(-) at F, is defined by

t{(l - e)F + eHy} - t(F) _ 8t {(1 - e)F + eHy}L t(y;F) = lim

■o e de, (2.34)

E=0

with H y(u) = H(u — y) the Heaviside or unit step function jum ping from 0 to 1 at u = y. In this form the derivative satisfies / L t(y;F)dF(y) = 0, as seen on setting G = F in (2.33). Often the function L t(y) = L t(y;F) is called the influence function o f T and its empirical approxim ation l(y) = L t(y;F) is called the empirical influence function. The particular values lj = l(yj) are called the empirical influence values.

Page 59: Bootstrap Methods and Their Application

2.7 ■ Nonparametric Bias and Variance 47

The nonparametric delta method comes from applying the first-order approx­im ation (2.33) with G = F,

t(F) = r(F) + / L ((y ; F)dF(y) = t(F) + - £ L, ( y j ; F). (2.35)J H j - i

The right-hand side o f (2.35) is also known as the linear approximation. We apply the central limit theorem to the sum on the right-hand side o f (2.35) and obtain

T — 9 ~ N ( 0 , vl (F))

because f L , (y ;F)dF(y ) = 0, where

vl (F) = n - 'v a r jL ^ Y )} = n~l J L 2{y)dF{y).

In practice vL(F) is approxim ated by substituting F for F in the result, that is by using the sample version

n

vL = vL(F) = n - 2 Y , ‘j , (2-36)j = i

which is known as the nonparametric delta method variance estimate. N ote that (2.35) implies that

J L, (y ;F)dF{y) = n-1 ^ lj = 0.

In some cases it may be difficult to evaluate the derivative (2.34) theoretically. Then a num erical approxim ation to the derivative can be made, that is

,2.37)s

with a small value o f e such as (100n)-1 . The same m ethod can be used for empirical influence values lj = L,(yj;F). A lternative approxim ations to the empirical influence values lj, which are all tha t are needed in (2.36), are described in the following sections.

Example 2.17 (Average) Let t = y, corresponding to the statistical function t(F) = f ydF(y). To apply (2.34) we write

£{(1 — e)F + eHy} = (1 — e)fi + sy,

and differentiate to obtain

d{(l - e)n + ey}M y ) = de

= y - H .e=0

The empirical influence function is therefore /(y) = y — y, with lj = y j — y . Thus the delta m ethod variance approxim ation (2.36) is Vi = (n — 1 )s2/ n 2, where s2

Page 60: Bootstrap Methods and Their Application

48 2 ■ The Basic Bootstraps

is the unbiased sample variance o f the yj. This differs by the factor (n — 1 )/n from the m ore usual nonparam etric variance estimate for y. m

The m ean is an example o f a linear statistic, whose general form is / a(y) dF(y). As the term inology suggests, linear statistics have zero derivatives beyond the first; they have influence function a{y) — E{a(Y)}. This applies to all m om ents about zero; see Problem 2.10.

Com plicated statistics which are functions o f simple statistics can be dealt with using the chain rule. So if t(F) = a { t i (F ) , . . . , rm(F)}, then

m ~

= E <2-38)i= l o ti

This can also be used to find the influence function for a transform ed statistic, given the influence function for the statistic itself.

Example 2.18 (Correlation) The sample correlation is the sample version of the product m om ent correlation, which for the pair Y = ( U, X) can be defined in terms o f prs = E (UrX s) by

__ / r>\ ___ P n —PioPoi{ (^ 2 0 - P ?0 )(W>2 - / * o i ) } l / r

The influence functions o f means are given in Example 2.17. For second m om ents we have L ^ J u , x) = urx s — /xrs, when r + s = 2, because the [irs are linear statistics (Problem 2.10). The partial derivatives o f p(-) with respect to the ps are straightforward, and (2.38) leads to the influence function for the correlation coefficient,

L p(u, x) = usx s — \p{u] + x ] ), (2.40)

where us = (u — pio)/(P20 - Pi0)l/2> and xs = (x - pm )/(P02 - Poi)l/2 are standardized variates.

If we w anted to work with the standard transform ation £ = | log

whose derivative is ( l — p2)_ l , then another application o f (2.38) shows that the influence function would be

L d u>x ) = j Z T p i {MsXs ~ + xs)} •

Example 2.19 (Quantile) The p quantile qp o f a distribution F with density / is defined as the solution to the equation F{qp(F)} = p. I f we set Fe(x) = ( l — e)F(x) + eH(x — y), we have

p = Fe{qp(Fe)} = ( l - e)F{qp(Fe)} + sH{qp(Fe) - y}.

Page 61: Bootstrap Methods and Their Application

2.7 • Nonparametric Bias and Variance 49

Table 2.5 Exact empirical influence values and their regression estimates for the ratio applied to the city population data with n — 10.

Figure 2.11 Empirical influence values for city population example.The left panel shows the lj for the n = 10 cases; the line has slope t = 1.52. The right panels show 999 values o f t ‘ plotted against jittered values of for j = 1,2,9,4 (clockwise from top left); the lines have slope lj and pass through t when / ' = 0.

Case 1 2 3 4Exact -1 .04 -0.58 -0.37 -0.19Regression -1.11 -0.65 -0 .44 -0.38

5 6 7 8 9 100.03 0.11 0.09 0.20 1.02 0.73

-0.04 0.12 0.13 0.27 1.16 0.94

0 50 100 150 200 250 300

O n differentiating this with respect to e and setting e = 0, we find that

L q,{y;F) =H{qp - y ) ~ p

f(qP)

Evidently this has mean zero.The approxim ate variance of qp(F) is

v l { F ) = tT 1 / L U y ; F ) d F ( y ) =_ P(1 ~ P ) n f { q vip)

the empirical version o f which requires an estimate o f f ( q p). But since non­param etric density estimates converge much more slowly than estimates o f means, variances, and so forth, estim ation o f variance for quantile estimates is harder and requires much larger samples. ■

Example 2.20 (City population data) For the ratio estimate t = x /u , calcu­lations in Problem 2.16 lead to empirical influence values lj = (xj — tuj)/u. Num erical values for the city population da ta o f size 10 are given in Table 2.5; the regression estim ates are discussed in Example 2.23. The variance estimateis Vl = = 0.182.

Page 62: Bootstrap Methods and Their Application

50 2 - The Basic Bootstraps

The lj are plotted in the left panel o f Figure 2.11. Values o f yj = (uj, x j) close to the line x = tu have little effect on the ratio t. C hanging the data by giving m ore weight to those yj with negative influence values, for which (uj, Xj) lies below the line, would result in smaller values o f t than tha t actually observed, and conversely. We discuss the right panels in Example 2.23. ■

In some applications the estim ator T will be defined by an estim ating equation, the simplest form being ^2 c(yj, t) = 0 such that f c(y, 8)dF(y) = 0. Then the influence function for scalar t is (Problem 2.12)

L ( v ) _

,(y) E { - c ( y ,0 ) } ’

where c = dc/dd. The corresponding empirical influence values are therefore

—nc(yj, t)

j Z n y j J Yand the nonparam etric delta m ethod variance estimate is

„ _ E ( £ t o 0 1 11 {£«>■;. Of2'

A simple illustration is Example 2.20, where t is determ ined by the estimating function c(y, 6) = x — 6u.

For some purposes it is useful to go beyond the first derivative term in the expansion o f t(F) and obtain the quadratic approxim ation

t(F) = t (F) + j L t(y;F) dF(y) + \ j j Qt(y, 2; F) dF(y)dF(z), (2.41)

where the second derivative Qt(y , z ; F ) is defined by

d£ld£2 £,=82=0

This derivative satisfies / Qt( x ,y ,F ) d F ( x ) = / Qt(x ,y ;F)dF{y) = 0, but in general J Q,(x,x;F)dF(x) ^ 0. The values qjk = Qt(yj,yk' ,F) are empirical second derivatives o f t(-) analogous to the empirical influence values lj. In principle (2.41) will be more accurate than (2.35).

2.7.3 Jackknife estimatesA nother approach to approxim ating the influence function, but only a t the sample values y \ , . . . , y„ themselves, is the jackknife. Here lj is approxim ated by

ljackj = { n - W - t - j ) , (2.42)

where t - j is the estimate calculated with y; om itted from the data. In effect this corresponds to numerical approxim ation (2.37) using e = — (n — I)- 1 ; see Problem 2.18.

Page 63: Bootstrap Methods and Their Application

2.7 • Nonparametric Bias and Variance 51

The jackknife approxim ations to the bias and variance o f T are1 n j

bjack = ~ ~ I jack,j, Vjack = ackj ~ Htfack) ' (2-43)

It is reasonably straightforw ard to apply (2.33) with F - j and F in place o f G and F, respectively, to show that

IjackJ — lj 5

see Problem 2.15.

Example 2.21 (Average) For the sample average t = y and the case deletion values are = (ny — y j ) / ( n — 1) and so ljack,j = }’j ~ V- This is the same as the empirical influence function because t is linear. The variance approxim ation in (2.43) reduces to {n{n — l )}-1 ^2(yj — y)2 because bjack = 0; the denom inator n — 1 in the form ula for vjack was chosen to ensure that this happens. ■

One application o f (2.43) is to show tha t in large samples the jackknife bias approxim ation gives

n

bjack = E*(T") — t = \ n ~ 2 Qjj'ij=i

see Problem 2.15.So far we have seen two ways to approxim ate the bias and variance o f T

using approxim ations to the influence function, namely the nonparam etric delta m ethod and the jackknife m ethod. One can generalize the basic approxim ation by using alternative numerical derivatives in these two methods.

2.7.4 Empirical influence values via regressionThe approxim ation (2.35) can also be applied to the bootstrap estim ate T*. If the E D F o f the bootstrap sample is denoted by F*, then the analogue o f (2.35) is

t(F*) = t(F) + - V L t(y*;F), n J

7=1

or in simpler notation

= (2.44)j - 1

say, where /* is the num ber o f times that y* equals yj, for j = 1 , . . . ,n . The linear approxim ation (2.44) will be used several times in future chapters.

Under the nonparam etric bootstrap the jo in t distribution o f the /* is m ulti­nom ial (Problem 2.19). It is easy to see that var(T*) = n~2 = vl , showing

Page 64: Bootstrap Methods and Their Application

52 2 • The Basic Bootstraps

that the bootstrap estim ate o f variance should be similar to the nonparam etric delta m ethod approxim ation.

Example 2.22 (City population data) The right panels o f Figure 2.11 show how 999 resampled values o f f* depend on «-1 / j for four values o f j , for the data with n = 10. The lines with slope lj summarize fairly well how t ’ depends on /* , but the correspondence is not ideal.

A different way to see this is to plot t* against the corresponding t'L. Figure 2.12 shows this for 499 replicates. The line shows where the values for an exactly linear statistic would fall. The linear approxim ation is poor for n = 10, but it is more accurate for the full dataset, where n = 49. In Section 3.10 we outline how such plots may be used to find a suitable scale on which to set confidence limits. ■

Expression (2.44) suggests a way to approxim ate the /,-s using the results of a bootstrap simulation. Suppose that we have simulated R samples from F as described in Section 2.3. Define /*• to be the frequency with which the data value yj occurs in the rth bootstrap sample. Then (2.44) implies that

t; = t + ^ ] T r = l , . . . , R .j = i

This can be viewed as a linear regression equation for “ responses” t* with “covariate values” and “coefficients” lj. We should, however, adjust forthe facts that E*(7” ) =f= t in general, tha t J2j h = 0, and that J 2 j f r j = n- F ° r the first o f these we add a general intercept term, or equivalently replace t with T .

Figure 2.12 Plots of linear approximation t*L against r* for the ratio applied to the city population data, with n = 10 (left panel), and n = 49 (right panel).

Page 65: Bootstrap Methods and Their Application

2.7 • Nonparametric Bias and Variance 53

For the second two we drop the term resulting in the regression equation

where F* is the R x ( n — 1) m atrix with (r, j) element n-1/*;, and the rth row o f the R x 1 vector d* is t* — f*. In fact (2.45) is related to an alternative, orthogonal expansion o f T in which the “rem ainder” term is uncorrelated with the “linear” piece.

The several different versions o f influence produce different estimates of var(T). In general vl is an underestim ate, whereas use o f the jackknife values or the regression estimates o f the Is will typically produce an overestimate. We illustrate this in Section 2.7.5.

Example 2.23 (City population data) For the previous example o f the ratio estim ator, Table 2.5 gives regression estimates o f empirical influence values, obtained from R = 1000 samples. The exact estimate v l for var(T ) is 0.036, com pared to the value 0.043 obtained from the regression estimates. The bootstrap variance is 0.042. For n = 49 the corresponding values are 0.00119, 0.00125 and 0.00125.

O ur experience is that R m ust be in the hundreds to give a good regression approxim ation to the empirical influence values. ■

2.7.5 Variance estimatesIn previous sections we have outlined the merits o f studentized quantities

where V = v{F) is an estimate of var(T | F). One general way to obtain a value for V is to set

A A A _

So the vector I = ( / j ,___ i ) o f approxim ate values o f the lj is obtained withthe least-squares regression formula

/ = (F*TF*)_1F*r d*, (2.46)

Mv = (M - 1) 1 - 0 2>

m= 1

where t ] , . . . , t 'M are calculated by bootstrap sampling from F. Typically we would take M in the range 50-200. N ote that resampling is needed to produce a standard error for the original value t o f T.

Page 66: Bootstrap Methods and Their Application

54 2 • The Basic Bootstraps

Now suppose that we wish to estimate the quantiles o f Z , using empirical quantiles o f bootstrap simulations

we would require R ( M + 1) = 50949 samples in all, which seems prohibitively large for m any applications. This suggests that we should replace u1/2 with a standard error that involves no resampling, as follows.

W hen a linear approxim ation (2.44) applies, we have seen tha t var(T* | F) can be estim ated by v l = n~2 ^ l], where the lj = L ((y; ;F ) are the empirical influence values for t based on the E D F F o f y \ , . . . , y n- The corresponding variance estimate for v a r(T ’ | F ') is v‘Lr = ri~2 ^ L 2{yy,F') , based on the empirical influence values for t ’ at the E D F F ’ o f y ‘rl, . . . , y ' rn. A lthough this requires no further simulation, the L t(y’ \ F *) m ust be calculated for each of the R samples. If an analytical expression is known for the empirical influence values, it will typically be straightforw ard to calculate the VLr- If not, numerical differentiation can be used, though this is more time-consuming. I f neither of these is feasible, we can use the further approxim ation

which is exact for a linear statistic. In effect this uses the usual formula, with lj replaced by L t(y*j\F) — n-1 J 2 L t(y*k ;F) in the rth resample. However, the right-hand side o f (2.49) can badly underestim ate v'Lr if the statistic is not close to linear. An improved approxim ation is outlined in Problem 2.20.

Example 2.24 (City population data) Figure 2.13 com pares the variance ap ­proxim ations for n = 10. The top left panel shows v" with M = 50 plotted against the values

for R = 200 bootstrap samples. The top right panel shows the values o f the approxim ate variance on the right o f (2.49), also plotted against v'L. The lower

the horizontal axis. Plainly v’L underestim ates v' , though not so severely as to have a big effect on the studentized bootstrap statistic. But the right o f (2.49) underestim ates v'L to an extent tha t greatly changes the distribution o f the corresponding studentized bootstrap statistics.

r = (2-48)

Since M bootstrap samples from F were needed to obtain v, M bootstrap samples from F ' are needed to produce v". Thus with R = 999 and M = 50,

2

(2.49)

n

panels show Q-Q plots o f the corresponding z* values, with (t* — t ) / v ^ /2 on

Page 67: Bootstrap Methods and Their Application

2.8 ■ Subsampling Methods 55

Figure 2.13 Variance approximations for the city population data, n — 10. The top panels compare the bootstrap variance v* calculated with M = 50 and the right o f (2.49) with v*L for R = 200 samples. The bottom panels compare the corresponding studentized bootstrap statistics.

co>Q_22ooCO

vL*

The right-hand panels o f the corresponding plots for the full da ta show more nearly linear relationships, so it appears that (2.49) is a better approxim ation at sample size n = 49. In practice the sample size cannot be increased, and it is necessary to seek a transform ation o f t to attain approxim ate linearity. The transform ation outlined in Example 3.25 greatly increases the accuracy of(2.49), even with n = 10. ■

2.8 Subsampling MethodsBefore and after the development o f nonparam etric bootstrap m ethods, other m ethods based on subsamples were developed to deal with special problems.

Page 68: Bootstrap Methods and Their Application

56 2 ■ The Basic Bootstraps

We briefly review three such m ethods here. The first two are in principle superior to resam pling for certain applications, although their competitive merits in practice are largely untested. The third m ethod provides an alternative to the nonparam etric delta m ethod for variance approxim ation.

2.8.1 Jackknife methodsIn Section 2.7.3 we m entioned briefly the jacknife m ethod in connection with estim ating the variance o f T, using the values o f t obtained when each case is deleted in turn. Generalized versions o f the jackknife have also been proposed for estim ating the distribution of T — 0, as alternatives to the bootstrap. For this to work, the jackknife m ust be generalized to multiple case deletion. For example, suppose that we delete d observations rather than one, there being N = (j) ways o f doing this; this is the same thing as taking all subsets o f size n — d. The full set o f group-deletion estim ates is t{,. . . , t fN, say. The empirical distribution o f — t will approxim ate the distribution of T — 6 only if we renormalize to remove the discrepancy in sample sizes, n — d versus n. So if T — 6 = Op(n~a), we take the empirical distribution of

z f = (n - d)a{S - t) (2.50)

as the delete-^ jackknife approxim ation to the distribution o f Z = na( T — 6). In practice we would not use all N subsamples o f size n — d, but rather R random subsamples, just as with ordinary resampling.

In principle this m ethod will apply m uch more generally than bootstrap resampling. But to work in practice it is necessary to know a and to choose d so that n — d—>oo and d /n —> 1 as n increases. Therefore the m ethod will work only in rather special circumstances.

Note tha t if n — d is small relative to n, then the m ethod is not very different from a generalized bootstrap that takes samples o f size n — d ra ther than n.

Example 2.25 (Sample maximum) We referred earlier to the failure o f the bootstrap when applied to the largest order statistic t = y(n), which estimates the upper limit o f a distribution on [0,0]. The jackknife m ethod applies here with a = 1, as n(9— T) is approxim ately exponential with m ean 6 for uniformly distributed ys. However, empirical evidence suggests that the jackknife m ethod requires a very large sample size in order to give good results. For example, if we take samples o f n = 100 uniform variables, for values o f d in the range 80-95 the distribution o f (n — d)(t — T +) is close to exponential, but the mean is wrong by a factor tha t can vary from 0.6 to 2. ■

Page 69: Bootstrap Methods and Their Application

2.8 ■ Subsampling Methods 57

2.8.2 All-subsamples methodA different type of subsam pling consists o f taking all N = 2" — 1 non-empty subsets o f the data. This can be applied to a limited type o f problem, including M -estim ation where mean /i is estim ated by the solution t to the estim ating equation ^ c(yj — t) = 0. If the ordered estimates from subsets are denoted by tJ’j) ,. . . , f[N), then rem arkably fi is equally likely to be in any o f the N + 1 intervals

Hence confidence intervals for fi can be determined. In practice one would take a random selection o f R such subsets, and attach equal probability (R + I)-1 to the R + 1 intervals defined by the R ff values. It is unclear how efficient this m ethod is, and to w hat extent it can be generalized to o ther estim ation problems.

2.8.3 Half-sampling methodsThe jackknife m ethod for estim ating var(T ) can be extended to deal with estim ates based on m any samples, but in one special circum stance there is another, simpler subsam pling method. Originally this was proposed for sample- survey da ta consisting o f stratified samples o f size 2. To fix ideas, suppose that we have samples o f size 2 from each o f m strata, and that we estim ate the population m ean n by the weighted average t = Y27=i wifi^ these weights reflect stratum sizes. The usual estimate for var(T ) is v = J 2 wf sf with sj the sample variance for the ith stratum . The half-sam pling m ethod is designed to reproduce this variance estimate using only subsam ple values o f t, ju st as the jackknife does. Then the m ethod can be applied to m ore complex problems.

In the present context there are N = 2m half-samples formed by taking one element from each stratum sample. If ft denotes the estim ator calculated on such a half-sample, then clearly ft — t equals \ ~ya) c] , where cj = +1according to which o f yn and y,% is in the half-sample. Direct calculation shows tha t for a random half-sample E (T t — T )2 = jv a r(T ), so that an unbiased estim ate o f var(T ) is obtained by doubling the average o f (ft — t)2 over all N half-samples: this average equals the usual estimate given earlier. But it is unnecessary to use all N half-samples. If, say, we use R half-samples, then we require that

Page 70: Bootstrap Methods and Their Application

58 2 ■ The Basic Bootstraps

From the earlier representation for - 1 we see tha t this implies that

. R [ 1 m 1 m m

i s i E wf(yn - y a )1 + j(yn - ya ){yn - yj i )r= l I i= 1 i= l j= 1

equals

1 m4 E - >‘2)2-

i=l

For this to hold for all da ta values we m ust have = 0 for all i ± j.This is a standard problem arising in factorial design, and is solved by what are known as Plackett-Burm an designs. If the rth half-sample coefficients cfrj form the rth row o f the R x m m atrix C +, and if every observation occurs in exactly | R half-samples, then C +TC f = rnlmxm. In general the ith colum n o f C + can be expressed as ( cy, . — 1) with the first R — 1 elements obtained by i — 1 cyclic shifts o f c i j , . . . , For example, one solution for m = 7 with R = 8 is

( +l- 1 - 1 +1 - 1 +1

+ 1 ni+1 +1 - 1 - 1 +1 - 1 +1+1 +1 +1 - 1 - 1 +1 - 1- 1 +1 +1 +1 - 1 - 1 - 1+1 - 1 +1 +1 +1 - 1 -1- 1 +1 - 1 +1 +1 +1 - 1- 1 - 1 +1 - 1 +1 +1 +1

U i - 1 - 1 - 1 - 1 - 1 1)This solution requires that R be the first m ultiple o f 4 greater than or equal to m. The half-sample designs for m = 4 ,5 ,6 ,7 are the first in colum ns o f this C + matrix.

In practice it would be com m on to double the half-sam pling design by adding its com plem ent — C \ which adds further balance.

It is fairly clear that the half-sam pling m ethod extends to stratum sample sizes k larger than 2. The basic idea can be seen clearly for linear statistics of the form

m k m k

t = n + X k~l E = ^ + E a,>i= 1 7=1 i= l j= l

say. Suppose tha t in the rth subsample we take one observation from each stratum , as specified by the zero-one indicator c jy . Then

' ! - , = E E cl,,j(aU - a,),

which is a linear regression model w ithout error in which the atj — a, are coefficients and the are covariate values to be determined. If the ay — a,

Page 71: Bootstrap Methods and Their Application

2.9 ■ Bibliographic Notes 59

can be calculated, then the usual estim ate o f var(T ) can be calculated. The choice o f - values corresponds to selection o f a fractional factorial design, with only main effects to be calculated, and this is solved by a Plackett-Burm an design. Once the subsam pling design is obtained, the estim ate o f var(T ) is a form ula in the subsample values tj. The same form ula works for any statistic that is approxim ately linear.

The same principles apply for unequal stratum sizes, although then the solution is more com plicated and makes use o f orthogonal arrays.

2.9 Bibliographic NotesThere are two key aspects to the m ethods described in this chapter. The first is that in order for statistical inference to proceed, an unknown distribution F must be replaced by an estimate. In a param etric model, the estimate is a param etric distribution F$, whereas in a nonparam etric situation the estimate is the empirical distribution function or some modification o f it (Section 3.3). A lthough the use o f the E D F to estim ate F may seem novel a t first sight, it is a natural developm ent o f replacing F by a param etric estimate. We have seen tha t in essence the E D F will produce results similar to those for the “nearest” param etric model.

The second aspect is the use o f sim ulation to estimate quantities o f interest. The widespread availability o f fast cheap com puters has made this a practical alternative to analytical calculation in m any problems, because com puter time is increasingly plentiful relative to the num ber o f hours in a researcher’s day. Theoretical approxim ations based on large samples can be time-consuming to obtain for each new problem, and there may be doubt about their reliability in small samples. Contrariwise, sim ulations are tailored to the problem at hand and a large enough sim ulation makes the numerical error negligible relative to the statistical error due to the inescapable uncertainty about F.

M onte C arlo m ethods o f inference had already been used for m any years when Efron (1979) m ade the connection to standard m ethods o f param etric inference, drew the attention o f statisticians to their potential for nonparam etric inference, and originated the term “boo tstrap”. This work and subsequent developments such as his 1982 m onograph m ade strong connections with the jackknife, which had been introduced by Quenouille (1949) and Tukey (1958), and with o ther subsam pling m ethods (H artigan, 1969, 1971, 1975; M cCarthy, 1969). M iller (1974) gives a good review o f jackknife m ethods; see also G ray and Schucany (1972).

Young and Daniels (1990) discuss the bias in the nonparam etric bootstrap introduced by using the empirical distribution function in place of the true distribution.

Hall (1988a, 1992a) strongly advocates the use o f the studentized bootstrap

Page 72: Bootstrap Methods and Their Application

60 2 ■ The Basic Bootstraps

statistic for confidence intervals and significance tests, and makes the connec­tion to Edgeworth expansions for sm ooth statistics. The empirical choice o f scale for resampling calculations is discussed by C hapm an and Hinkley (1986) and Tibshirani (1988).

Hall (1986) analyses the effect o f discreteness on confidence intervals. Efron (1987) discusses the num bers o f sim ulations needed for bias and quantile estim ation, while Diaconis and Holmes (1994) describe how sim ulation can be avoided completely by complete enum eration o f bootstrap samples; see also the bibliographic notes for C hapter 9.

Bickel and Freedm an (1981) were am ong the first to discuss the conditions under which the bootstrap is consistent. Their work was followed by Bre- tagnolle (1983) and others, and there is a growing theoretical literature on modifications to ensure that the bootstrap is consistent for different classes of awkward statistics. The main m odifications are sm oothing o f the data (Sec­tion 3.4), which can improve m atters for nonsm ooth statistics such as quantiles (De Angelis and Young, 1992), subsam pling (Politis and Rom ano, 1994b), and reweighting (Barbe and Bertail, 1995). H all (1992a) is a key reference to Edge- w orth expansion theory for the bootstrap, while M am m en (1992) describes sim ulations intended to help show when the bootstrap works, and gives the­oretical results for various situations. Shao and Tu (1995) give an extensive theoretical overview o f the bootstrap and jackknife.

A threya (1987) has shown that the bootstrap can fail for long-tailed distri­butions. Some other examples o f failure are discussed by Bickel, Gotze and van Zwet (1996).

The use o f linear approxim ations and influence functions in the context o f robust statistical inference is discussed by H am pel et al. (1986). Fernholtz (1983) describes the expansion theory that underlies the use o f these approx­im ation m ethods. An alternative and orthogonal expansion, similar to that used in Section 2.7.4, is discussed by Efron and Stein (1981) and Efron (1982). Tail-specific approxim ations are described by H esterberg (1995a).

The use o f multiple-deletion jackknife m ethods is discussed by Hinkley (1977), Shao and Wu (1989), Wu (1990), and Politis and R om ano (1994b), the last with num erous theoretical examples. The m ethod based on all non-em pty subsamples is due to H artigan (1969), and is nicely put into context in C hapter 9 o f Efron (1982). Half-sample m ethods for survey sampling were developed by M cC arthy (1969) and extended by Wu (1991). The relevant factorial designs for half-sam pling were developed by Plackett and Burm an (1946).

2.10 Problems1 Let F denote the E D F (2.1). Show that E {f(y )} = F(y) and that var{F(y)} =

f (3'){l — F(y)}/n. Hence deduce that provided 0 < F(y) < 1, F(y) has a limiting

Page 73: Bootstrap Methods and Their Application

2.10 ■ Problems 61

normal distribution for large n, and that Pr(|F(y) — F(y)| < e)—>1 as n—too for any positive e. (In fact the much stronger property s u p ^ ^ ^ ^ |F(y) — F (y)|—>0 holds with probability one.)(Section 2.1)

2 Suppose that Y ],..., Y„ are independent exponential with mean their average is

(a) Show that Y has the gamma density (1.1) with k = n, so its mean and variance are n and fi2/n.(b) Show that log Y is approximately normal with mean log^i and variance n~'.(c) Compare the normal approximations for Y and for log Y in calculating 95% confidence intervals for /z. Use the exact confidence interval based on (a) as the baseline for the comparison, which can be illustrated with the data o f Example 1.1. (Sections 2.1, 2.5.1)

3 Under nonparametric simulation from a random sample y [ , . . . , y„ in which T = nr1 Yj — Y)2 takes value t, show that

E '(T ') = (n — l)t/n, var‘(7” ) = (n — l ) 2 [m4/n + (3 - n)t2/ {n(n — 1)}] / n2,

where w4 = n- 1 E /X ; - f ) 4- (Section 2.3; Appendix A)

4 Let t be the median o f a random sample o f size n = 2m + 1 with ordered values >>(i) < • • • < y(„); t = y(m+i).(a) Show that T" > if and only if fewer than m + 1 o f the Y ’ are less than or equal to y^ .(b) Hence show that

This specifies the exact resampling density (2.28) o f the sample median. (The result can be used to prove that the bootstrap estimate o f var(T) is consistent as n—>oo.)(c) Use the resampling distribution to show that for n = 11

and apply (2.10) to deduce that the basic bootstrap 90% confidence interval for the population median 6 is (2 y(6) — y(9 ) ,2 y(6) —(d) Examine the coverage o f the confidence interval in (c) for samples from normal and Cauchy distributions.(Sections 2.3, 2.4; Efron, 1979, 1982)

5 Consider nonparametric simulation o f Y* based on distinct linearly independent observations y i,. . . ,y „ .(a) Show that there are m„ = ( "T,1) ways that n — 1 red balls can be put in a line with n white balls. Explain the connection to the number o f distinct values taken by Y'.(b) Suppose that the value y" taken by Y* is n~l J2f jy j< where / ” can be one o f 0 and J 2 j f j ~ n- Find Pr(Y” = y), and deduce that the most likely value o f Y ” is y, with probability p„ = n'./n".(c) Use Stirling’s approximation, i.e. n \ ~ (27r)l/2e~"n"+1//2 as n—>oo, to find approx­imate formulae for m„ and p„.(d) For the correlation coefficient T calculated from distinct pairs («i, x j ) ,. . . , (u„,x„),

Y = n ~ ' E Yj .

P r * (r < y,3 j) = Pr’(T ‘ > y(9)) = 0.051,

Page 74: Bootstrap Methods and Their Application

62 2 ■ The Basic Bootstraps

show that T* is indeterminate with probability What is the probability that17” | = 1? Discuss the implications o f this when n < 10.(Section 2.3; Hall, 1992a, Appendix I)

Suppose that are independently distributed with a two-parameter densityfeAy)- What simulation experiment would you perform to check whether or not Q = q (Yu . . . , Y n;6) is a pivot?If / is the gamma density (1.1), let fi be the MLE o f n, let

tpin) = max Y l°g//vc(y; )j=i

be the profile log likelihood for n and let Q = 2 { /p(/i) — /?p(n)}. In theory Q should be approximately a x] variable for large n. Use simulation to examine whether or not Q is approximately pivotal for n = 10 when k is in the range (0.5,2).(Section 2.5.1)

7 The bootstrap normal approximation for T — 9 is N(bR,vR), so that the p quantile ap for T — 6 can be appro variance o f this estimate isap for T — 9 can be approximated by ap = bR + zpvR 2. Show that the simulation

i* \ ■ v°° I . , *3 , l 2 / t , k4K ) - R { ' +Z ' , ^ + i2' ( 2 + <

where k 3 and k4 are the third and fourth cumulants o f T" under bootstrap resampling. If T is asymptotically normal, k^/vU2 = 0 (n~ l/2) and k4/ v1 = 0 (n “’). Compare this variance to that o f the bootstrap quantile estimate — t inthe special case T = Y .(Sections 2.2.1, 2.5.2; Appendix A)

8 Suppose that estimator T has expectation equal to 0(1 + y ) , so that the bias is 9y. The bias factor y can be estimated by C = E’( T ' ) / T — 1. Show that in the case o f the variance estimate T = ri [ ^2(Yj — Y ) 2, C is exactly equal to y. If C were approximated from R resamples, what would be the simulation variance o f the approximation?(Section 2.5)

9 Suppose that the random variables U = (Ui , . . . ,Um) have means C i, . . . , (m and covariances cov(Uk,Ui) = n-1cow( 0 , and that Ti = g t ( U ) , . . . , T q = gq(U). Showthat

E(T,) = g , . ( 0 + i n - > f > w( 0 | ^ ,

cov(Tj, Tj) = / r ‘ f > w(

How are these estimated in practice? Show that

2 \ " (x i — tuj)2" -2£

i=i

is a variance estimate for t = x /u , based on independent pairs (ui, Xi) , . . . , ( « „ ,x n). (Section 2.7.1)

Page 75: Bootstrap Methods and Their Application

2.10 ■ Problems 63

10 (a) Show that the influence function for a linear statistic t(F) = / a(x) dF(x) is a(y ) — t(F). Hence obtain the influence functions for a sample moment fir — f x r dF(x), for the variance /12 (F) — {/ti(F)}2, and for the correlation coefficient (Example 2.18).(b) Show that the influence function for {t(F) — 6} /v (F) i/2 evaluated at 9 = t{F) is v(F)~l/2L, (y;F) . Hence obtain the empirical influence values lj for the studentized quantity {t{F) — t(F)}/vL(F) l/2, and show that they have the properties E O = 0 and n~2 E I2 = 1 .(Section 2.7.2; Hinkley and Wei, 1984)

11 The pairs (U[ ,X i ) , . . . , { U„ , Xn) are independent bivariate normal with correlation 9. Use the influence function o f Example 2.18 to show that the sample correlation T has approximate variance n~l { 1 — 92)2. Then apply the delta method to show that \ log ( jr £ ) , called Fisher’s z-transform, has approximate variance n~]. (Section 2.7.1; Appendix A)

12 Suppose that a parameter 0 = t(F) is determined implicitly through the estimating equation

J u{y,9)dF(y) = 0.

(a) Write the estimating equation as

J u{ y J ( F ) } dF(y) = 0,

replace F by (1 — e)F + eHy, and differentiate with respect to e to show that the u(x;0) = du(x-,6)/d8 influence function for f(-) is

,(-V’ * — f U(x;9)dF(x) '

Hence show that with 9 = t{F) the y'th empirical influence value is

t = u(y j ; 6)

1 - n ~ l E L i “(w ;

(b) Let {p be the maximum likelihood estimator o f the (possibly vector) parameter o f a regular parametric model / v(y) based on a random sample y u . . . ,y„. Show that the j \ h empirical influence value for \p at yj may be written as n I~ lSj, where

y-v g2 l o g /v-,(y; ) d \ o g j j i y j )dxpdip7 ’ J dxp

Hence show that the nonparametric delta method variance estimate for ip is the so-called sandwich estimator

/-> ( X sA r ) ' - ' -

Compare this to the usual parametric approximation when y \ , . . . , y„ is a random sample from the exponential distribution with mean tp .(Section 2.7.2; Royall, 1986)

Page 76: Bootstrap Methods and Their Application

13 The a trimmed average is defined by

t {F)=r h a [

computed at the E D F F. Express t(F) in terms o f order statistics, assuming that na is an integer. How would you extend this to deal with non-integer values o f not? Suppose that F is a distribution symmetric about its mean, p.. By rewriting t(F) as

) rii-«(f)-— — / udF(u),1 - 2 a

where qa(F) is the a quantile o f F, use the result o f Example 2.19 to show that the influence function o f t(F) is

l - 2 « r I, y < q « ( F ) ,L t ( y , F ) = l 1 — 2a) ', q„(F) < y < <ji_a(F ),

{ { q ' ( F ) - p } ( l - 2 * ) - \ q t - x( F ) < y .

Hence show that the variance o f t (F) is approximately

i r r<i\-AF)n(1 _ 2ay [ J {F) ^ “ >*)2 dF(y) + « te«(F) - n}2 + a {qi-x(F) - n}2 .

Evaluate this at F = F.(Section 2.7.2)

14 Let Y have a p-dimensional multivariate distribution with mean vector p and covariance matrix fi. Suppose that Q has eigenvalues Aj > • • • > Xp and corre­sponding orthogonal eigenvectors ej, where e j e y = 1. Let Fc = (1 — s)F + eHy. Show that the influence function for Q is

La(y ',F) = { y - n)(y - p )T — fl,

and by considering the identities

Q(Fc)ej(Fs) = Xj(Fe)ej(Fc), e j(F£)t e j(Fc) = 1,

or otherwise, show that the influence function for l j is {e j ( y — p)}2 — Xj.(Section 2.7.2)

15 Consider the biased sample variance t — n_ 1 J2(yj ~ J')2-(a) Show that the empirical influence values and second derivatives are

lj = (yj - y )2 - U qjk = - 2 ( y j - y)(yk - y).

(b) Show that the exact case-deletion values o f t are

Compare these with the result o f the general approximation

t - t - j = ( n - 1 y ' l j - - 1 )~2qjj,

which is obtained from (2.41) by substituting F for F and for F.(c) Calculate jackknife estimates o f the bias and variance o f T. Are these sensible estimates?(Section 2.7.3; Appendix A)

64 2 ■ The Basic Bootstraps

Page 77: Bootstrap Methods and Their Application

2.10 ■ Problems 65

16 The empirical influence values lj can also be defined in terms o f distributions supported on the data values. Suppose that the support o f F is restricted to y i , . . . , y n, with probabilities p = ( p i , . . . , pn) on those values. For such distributions t(F) can be re-expressed as t{p).(a) Show that

h = j Rt{(l - e)p + s l j}e=0

where P = ($ ,■-■>%) and 1 j is the vector with 1 in the y'th position and 0 elsewhere. Hence or otherwise show that

n

0 = Mp) - X Mp)> k=\

where 'tj(p) = 8t(p)/dpj.(b) Apply this result to derive the empirical influence values lj = (xj — tuj )/u for the estimate t = J2 Pjx j ! 5Z Pjuj o f the ratio o f two means.(c) The empirical second derivatives qtj can be defined similarly. Show that

d2qtj = g~ ^ t{(l - El - E 2)p + Ell, + E 2 ly}

£| =£2=0

Hence deduce that

<2.7 = 'iij(P) ~ n ‘ 5Z tik (P ) - n ' + n 2 Y1 iklk=l k=l k,l= 1

(Section 2.7.2)

17 Suppose that t = + }W i)) is the median o f a sample o f even size n = 2mfrom a distribution with continuous C DF F and PDF / whose median is p. Show that the case-deletion values are either y lmj or and that the jackknifevariance estimate is

" “ 1 ( 'i2 vjack = — (,y<m+i) - y(m)) ■

By writing Yu> = F '{1 — exp(—£y))}, where is the 7 th order statistic o f a random sample from the standard exponential distribution, and recallingproperties o f exponential order statistics, show that

nVJadl~ ( ‘X2 )J ___ / I ?\2

4 P i n )

as n—*oo. This confirms that the jackknife variance estimate is not consistent. (Section 2.7.3)

18 A generalized form o f jackknife can be defined by estimating the influence function at yj by

t { ( l - e ) F + eHyj} - t ( F )

e- 1for some value e. Discuss the effects o f (a) e—>0, (b) e = — ( n— l ) -1 , (c) e = ( n + 1)

which respectively give the infinitesimal jackknife, the ordinary jackknife, and the positive jackknife.

Page 78: Bootstrap Methods and Their Application

Show that in (b) and (c) the squared distance (dF — dFe)T(dF — dFc) from F to Fe = (1 — s)F + eH Vj is o f order 0 (n~2), but that if F* is generated by bootstrap

sampling, E* j(d F ‘ — dF)T {dF’ — dF) j = 0 (n ~ l ). Hence discuss the results you

would expect from the butcher knife, which uses e = n~l/2. How would you calculate it?(Section 2.7.3; Efron, 1982; Hesterberg, 1995a)

19 The cumulant generating function o f a multinomial random variable with denominator n and probability vector (7 1 1 , . . . , n„) is

K(£) = n log 7ty e x p (^ -) |,

where £ =(a) Show that with Kj = n~l, the first four cumulants o f the /* are

E‘(/D = 1,co v '( / ' , / * ) = d i j - n ~ \

' ( f i J ’j J k ) = n~2{n2Sijk -«<5ft[3] + 2 } ,

66 2 • The Basic Bootstraps

cum

cum‘ (/,',/* , f l J J ) = n }{n}dijki - n2 (c5ft<5y,[3] + SJkl[ 4]) + 2nSit [6 ] - 6 },

where S:J = 1 when i = j and zero otherwise, and so on, and d:k [3] = d,k + S,j + Sjk, and so forth.(b) Now consider t ’Q = f + n ~ ' J 2 f j h + \ n~2 H Show that E*(tg) =t + \ n~2 ^2 qjj and that t'g has variance

i £ 1j + ^ E { E 4 - 1 ( E « « ) * + E E • (2-51)

(Section 2.7.2; Appendix A; Davison, Hinkley and Schechtman, 1986; McCullagh, 1987)

20 Show that the difference between the second derivative Q, (x ,y ) and the first derivative o f L,(x) is equal to L,(y). Hence show that the empirical influence value can be written as

nlj = L t( y j ) + n~l ^ { 2 ,(y y ,y (c ) ~ L ,(yk)}-

k= 1

Use the resampling version o f this result to discuss the accuracy o f approximation(2.49) for v‘L.(Sections 2.7.2, 2.7.5)

2.11 Practicals1 Consider parametric simulation to estimate the distribution o f the ratio when a

bivariate lognormal distribution is fitted to the data in Table 2.1:

ml <- m e a n ( lo g (c ity $ u )) ; m2 < - m e a n ( lo g (c ity $ x ) )s i <- s q r t ( v a r ( l o g ( c i t y $ u ) ) ) ; s 2 <- s q r t ( v a r ( lo g ( c i t y $ x ) ) )rho <- c o r ( l o g ( c i t y ) ) [ 1 , 2 ]c it y .m le <- c (m l, m2 , s i , s 2 , rho)

Page 79: Bootstrap Methods and Their Application

2.11 ■ Practicals 67

city.sim <- function(city, mle){ n <- nrow(city)

zl <- rnorm(n) ; z2 <- rnonn(n) z2 <- mle[5]*zl+sqrt(1-mle[5]”2)*z2data, frame(u=exp(mle [1] +mle [3] *zl) , x=exp(mle[2]+mle[4]*z2)) }

city.fun <- function(data, i=l:nrow(data)){ d <- data[i,]

tstar <- sum(d$x)/sum(d$u) ubar <- mean(d$u)

c(tstar, sum((d$x-tstar*d$u)~2/(nrow(d)*ubar)"2)) } city.para <- boot (city, city. fun, 11=999,

sim="parametric",ran.gen=city. sim,mle=city.mle)

Are histograms o f t ’ and z ' similar to those for nonparametric simulation, shown in Figure 2.5?

tstar <- city.para$t[,l]zstar <- (tstar-city.para$tO[1])/sqrt(city.para$t[,2])split.screen(c(1,2))screen(l); hist(tstar)screen(2); hist(zstar)screen(l); qqnorm(tstar,pch=".")screen(2); qqnorm(zstar,pch="."); abline(0,1,lty=2)

Use (2.10) and (2.12) to give 95% confidence intervals for the true ratio under this model:

city.para$tO[l] - sort (tstar-city .para$tO [1] ) [c (975,25)] city.para$tO [1] - sqrt(city.para$t0[2])*sort(zstar) [c(975,25)]

Compare these intervals with those given in Example 2.12.Repeat this with R = 199 and R = 399.(Sections 2.2, 2.3, 2.4)

2 c o .t r a n s fe r contains data on the carbon monoxide transfer factor for seven smokers with chickenpox, measured on admission to hospital and after a stay of one week. The aim is to estimate the average change in the factor.To display the data:

attach(co.transfer)plot(0.5*(entry+week),week-entry)t . test(week-entry)

Are the differences normal? Is the Student-t confidence interval reliable?For a bootstrap approach:

co.fun <- function(data, i){ d <- data[i,]

y <- d$week-d$entry c(mean(y), var(y)/nrow(d)) }

co.boot <- boot(co.transfer, co.fun, R=999)

Compare the variance o f the bootstrap estimate t" with the estimated variance o f t, in c o .b o o t$ t 0 [ 2 ]. Compare normal-based and studentized bootstrap 95% confidence intervals.To display the bootstrap output:

Page 80: Bootstrap Methods and Their Application

2 • The Basic Bootstraps

split.screen(c(l,2)) screen(l); split.screen(c(2,1))screen(3); qqnonn(co,boot$t[,1],ylab="t*",pch=".") abline(co.boot$tO[l],sqrt(co.boot$t0[2]) ,lty=2) screen(2)

plot(co.boot$t[,1],sqrt(co.boot$t[,2]),xlab="t*",ylab="SE*",pch=".") screen(4); z <- (co,boot$t[,1]-co.boot$tO[1])/sqrt(co.boot$t[,2]) qqnorm(z); abline(0,1,lty=2)

What is going on here? Is the normal interval useful? What difference does dropping the simulation outliers make to the studentized bootstrap confidence interval?(Sections 2.3, 2.4; Hand et ai , 1994, p. 228)

cd4 contains the CD4 counts in hundreds for 20 HIV-positive patients at baseline and after one year o f treatment with an experimental anti-viral drug. We attempt to set a confidence interval for the correlation between the baseline and later counts, using the nonparametric bootstrap.

corr.fun <- function(d, w = rep(l, nrow(d))/nrow(d)){ w <- w/sum(w)

n <- nrow(d) ml <- sum(d[, 1] * w)m2 <- sum(d[, 2] * w)vl <- sum(d[, 1] “2 * w) - ml"2v2 <- sum(d[, 2] “2 * w) - m2~2rho <- (sum(d[, 1] * d[, 2] * w) - ml * m2)/sqrt(vl * v2) i <- rep(l:n,round(n*w)) us <- (d[i, 1] - ml)/sqrt(vl) xs <- (d[i, 2] - m2)/sqrt (v2)L <- us * xs - 0.5 * rho * (us~2 + xs'2) c(rho, sum(L"2)/n"2) >

cd4.boot <- boot(cd4, corr.fun, R=999, stype="w")

Is the variance independent o f t? Is z* pivotal? Should we transform the correlation coefficient?

t0 <- cd4.boot$t0[l]tstar <- cd4.boot$t[,1]vL <- cd4.boot$t[,2]zstar <- (tstar-tO)/sqrt(vL)

fisher <- function( r ) 0.5*log( (l+r)/(l-r) )split.screen(c(1,2))screen(l); plot(tstar,vL)screen(2); plot(fisher(tstar),vL/(l-tstar"2)~2)

For a studentized bootstrap confidence interval on transformed scale:

zstar <- (fisher(tstar)-fisher(tO))/sqrt(vL/(l-tstar~2)~2)vO <- cd4.boot$t0[2]/(l-t0“2)"2fisher(tO) - sqrt(v0)*sort(zstar) [c(975,25)]

What are these on the correlation scale? How do they compare to intervals obtained without the transformation?If there are simulation outliers, delete them and recalculate the intervals.(Sections 2.3, 2.4, 2.5; DiCiccio and Efron, 1996)

Page 81: Bootstrap Methods and Their Application

2.11 ■ Practicals 69

4 How many simulations are required for quantile estimation? To get some idea, we make four replicate plots with 39, 99, 399 and 999 simulations.

split.screen(c(4,4)) quantiles <- matrix(NA,16,4) n <- c (39,99,399,999) p <- c(0.025,0.05,0.95,0.975) for (i in 1:4){ y <- rnorm(999)

for (j in 1:4) {quantiles[(j-1)*4+i,] <- quantile(y [1 :n[j]] , probs=p) screen((i-1)*4+j)qqnorm(y [1 :n[j] ] ,ylab="y" ,main=paste("R = ",n[j])) abline(h=quantile(y[l :n[j]] ,p) ,lty=2) } }

Repeat the loop a few times. How large a simulation is required to get reasonable estimates o f the 0.05 and 0.95 quantiles? O f the 0.025 and 0.975 quantiles? (Section 2.5.2)

5 Following on from Practical 2.3, we compare variance approximations for the correlation in cd4:

L.inf <- empinf(data=cd4,statistic=corr.fun)L.jack <- empinf(data=cd4,statistic=corr.fun,type="jack")L.reg <- empinf(boot.out=cd4.boot,type="reg") split.screen(c(1,2))screen(l); plot(L.inf,L.jack); screen(2); plot(L.inf,L.reg)v.inf <- sum(L.inf“2)/nrow(cd4)'2v.jack <- var(L.jack)/nrow(cd4)v.reg <- sum(L.reg~2)/nrow(cd4)~2v.boot <- var(cd4.boot$t[,l])c(v.inf,v .reg,v .j ack,v .boot)

Discuss the different variance approximations in relation to the values o f the influence values. Compare with results for the transformed correlation coefficient. To see the accuracy o f the linear approximation:

close.screen(all=T);plot(tstar.linear.approx(cd4.boot,L.reg))

Find the correlation between t ’ and its linear approximation. Make the corre­sponding plots for the other empirical influence values. Are the plots better on the transformed scale?(Section 2.7)

Page 82: Bootstrap Methods and Their Application

3

Further Ideas

3.1 Introduction

In the previous chapter we laid out the basic elements o f resampling or bootstrap m ethods, in the context o f the analysis o f a single hom ogeneous sample o f data. This chapter deals with how those ideas are extended to some m ore complex situations, and then turns to uses for variations and elaborations o f simple bootstrap schemes.

In Section 3.2 we describe how to construct resampling algorithm s for several independent samples, and then in Section 3.3 we discuss briefly the use of partial modelling, either qualitative or sem iparam etric, a topic explored more fully in the later chapters on regression models (Chapters 6 and 7). Section 3.4 examines when it is worthwhile to modify the statistic by using a sm oothed empirical distribution function. In Sections 3.5 and 3.6 we turn to situations where data are censored or missing and therefore are incomplete. One relatively simple situation where the standard bootstrap m ust be modified to succeed is finite population sampling, which we consider in Section 3.7. In Section 3.8 we deal with simple situations o f hierarchical variation. Section 3.9 is an account o f nested bootstrapping, where we outline how to overcome some o f the shortcom ings o f a single bootstrap calculation by a further level of simulation. Section 3.10 describes bootstrap diagnostics, which are concerned with the assessment o f sensitivity o f resampling analysis to individual observations, as well as the use o f bootstrap ou tput to suggest modifications to the calculations. Finally, Section 3.11 describes the use o f nested bootstrapping in selecting an estim ator from the data.

70

Page 83: Bootstrap Methods and Their Application

3.2 ■ Several Samples 71

3.2 Several SamplesSuppose that we are interested in a param eter tha t depends on the populations F \ , . . . , F k , and that the data consist o f independent random samples from these populations. The ith sample is >'n, ■ ■ and arises from population F t, for i = 1 If there is no further inform ation about the populations, thenonparam etric estimate o f F t is the E D F o f the ith sample,

Since each o f the k populations is separate, nonparam etric sim ulation from their respective E D Fs F i , . . . , F k leads to datasets

where is generated by sampling n,- times with equal probabilities,n“ ', from the ith original sample, independently o f all other sim ulated samples. This am ounts to stratified sampling in which each of the original samples corresponds to a stratum , and nt observations are taken with equal probability from the ith stratum . W ith this extension of the resampling algorithm, we proceed as ou tlined in C hapter 2. For example, if v = v(Fi,. . . ,F/c) is an estim ated variance for t, confidence intervals for 6 could be based on simulated values o f z* = (t* — t ) / v ' ll2 ju st as described in Section 2.4, where now t* and v' are formed from samples generated by the sim ulation algorithm described above.

Example 3.1 (Difference of population means) Suppose we are interested in the difference o f two population means, 6 = t(Fi ,F2) = f ydFi (y ) — f The corresponding estimate o f f(F], F2) based on independent samples from the two distributions is the difference o f the two sample averages,

This differs slightly from the delta m ethod variance approxim ation, which we describe in Section 3.2.1.

A simulated value o f T would be f* = yj — y ’2> where yj is the average o f n\ observations generated with equal probability from the first sample, y ii , - - - , yim, and is the average o f n2 observations generated with equal

Recall that the Heaviside function H(u) jumps from 0 to 1 at u = 0.

for which the usual unbiased estimate o f variance is

Page 84: Bootstrap Methods and Their Application

72 3 ■ Further Ideas

1 2 3Series 4 5 6 7 8

76 87 105 95 76 78 82 8482 95 83 90 76 78 79 8683 98 76 76 78 78 81 8554 100 75 76 79 86 79 8235 109 51 87 72 87 77 7746 109 76 79 68 81 79 7687 100 93 77 75 73 79 7768 81 75 71 78 67 78 80

75 62 75 79 8368 82 82 8167 83 76 78

73 7864 78

probability from the second sample, y 2i , - - - , y 2n2- The corresponding unbiased estimate o f variance for t* based on these samples would be

1 ”> 1 «2

Example 3.2 (Gravity data) Between M ay 1934 and July 1935, a series of experiments to determine the acceleration due to gravity, g, was perform ed at the N ational Bureau o f Standards in W ashington DC. The experiments, made with a reversible pendulum , led to eight successive series o f measurements. The data are given in Table 3.1. Figure 3.1 suggests tha t the variance decreases from one series to the next, tha t there is a possible change in location, and tha t mild outliers may be present.

The m easurem ents for the later series seem m ore reliable, and although we would wish to estimate g from all the data, it seems inappropriate to pool the series. We suppose tha t each o f the series is taken from a separate population, F i, . . . ,F g , but that each population has m ean g; for a check on this see Example 4.14. Then the appropriate form o f estim ator is a weighted com bination

r = Ef=i V(Fi)/<r2(Fi) E li IM A) ’

where F, is the E D F o f the ith series, fi(Fi) is an estimate o f g from F„ and

Table 3.1 Eight series of measurements of the acceleration due to gravity, g, given as deviations from 980000 xlO-3 cm s"2, in units of cms-2 x 10-3. (Cressie, 1982)

Page 85: Bootstrap Methods and Their Application

3.2 ■ Several Samples 73

Figure 3.1 Gravity series box plots, showing a reduction in variance, a shift in location, and possible outliers.

O00

OCD

o

<j2(Fj) is an estim ated variance for n(Fi). The estim ated variance o f T is

v = j E 1/ ^1 1=1

If the da ta were thought to be norm ally distributed with mean g but different variances, we would take

KFi) = yh v 2(Fi) = {n,(n, - l)}-1 E ^ 'V “ ^ ) 2j

to be the average o f the ith series and its estim ated variance. The resulting estim ator T is then an empirical version o f the optim al weighted average. For our da ta t = 78.54 with standard error uI/2 = 0.59.

Figure 3.2 shows summary plots for R = 999 nonparam etric simulations from this model. The top panels show norm al plots for the replicates t ' and for the corresponding studentized bootstrap statistics z* = (f* — t ) / v ' l/2. Both are more dispersed than normal. There is one large negative value o f z*, and the lower panels show w hy: on the left we see tha t the u* for the smallest value o f t* is very small, which inflates the corresponding z*. We would certainly om it this value on the grounds that it is a sim ulation outlier.

The average and variance o f the £* are 78.51 and 0.371, so the bias estimate for t is 78.51 — 78.54 = —0.03, and a 95% confidence interval for g based on a norm al approxim ation is (77.37,79.76). The 0.025 x (R + 1) and 0.975 x (R + 1) order statistics o f the z* are -3.03 and 2.50, so the 95% studentized bootstrap confidence interval for g is (77.07,80.32), slightly wider than that based on the norm al approxim ation, as the top right panel o f Figure 3.2 would suggest.

Page 86: Bootstrap Methods and Their Application

74 3 • Further Ideas

o00

O)r

GO

h-r-.

10

/ ' y o

m . . . y* 1N

/ oV

/ in•

-2 0 2

Quantiles of standard normal

-2 0 2

Quantiles of standard normal

77 78 79

t*

80 81

tr

h-oCDoino

ocod

c\jo

A part from the resampling algorithm, this mimics exactly the studentized bootstrap procedure described in Section 2.4. ■

Other, constrained resam pling plans m ay be suggested by stronger assum p­tions about the populations, as discussed in Section 3.3. The advantage o f the resampling plan described here is that it is robust.

3.2.1 Influence functions and variance approximationsThe discussion in Section 2.7.2 generalizes quite easily to the case o f multiple independent samples, with separate influence functions corresponding to each

Figure 3.2 Summary plots for 999 nonparametric simulations of the weighted average £ for the gravity data and its estimated variance v.The top panels show normal quantile plots of t* and the studentized bootstrap statistic z* = (t* — t)/v*^2. The line on the top left has intercept t and slope vl/2, and on the top right the line has intercept zero and unit slope. The bottom panels show that the smallest t* also has the smallest v*, leading to an outlying value of z*.

Page 87: Bootstrap Methods and Their Application

population represented. W hen T has the representation f(FI ; . .. ,Fk), the ana­logue o f the linear approxim ation (2.35) is

    t(\hat F_1, \ldots, \hat F_k) \approx t(F_1, \ldots, F_k) + \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{n_i} L_{t,i}(Y_{ij}; F), \tag{3.1}

where the influence functions L_{t,i} are defined by

    L_{t,i}(y; F) = \left. \frac{\partial t(F_1, \ldots, (1-\varepsilon)F_i + \varepsilon H_y, \ldots, F_k)}{\partial \varepsilon} \right|_{\varepsilon = 0}, \tag{3.2}

and for brevity we write F = (F_1, ..., F_k). As in the single sample case, the influence functions have mean zero, E{L_{t,i}(Y; F)} = 0 for each i. Then the immediate consequence of (3.1) is the nonparametric delta method approximation

    T - \theta \sim N(0, v_L),

for large n_1, ..., n_k, where the variance approximation v_L is given by the variance of the second term on the right-hand side of (3.1), that is

    v_L = \sum_{i=1}^{k} \frac{1}{n_i} \mathrm{var}\{L_{t,i}(Y; F) \mid F\}. \tag{3.3}

By analogy with the single sample case, empirical influence values are obtained by substituting the EDFs F̂ = (F̂_1, ..., F̂_k) for the CDFs F in (3.2) to give

    l_{ij} = L_{t,i}(y_{ij}; \hat F).

These values satisfy \sum_{j=1}^{n_i} l_{ij} = 0 for each i. Substitution of empirical variances of the empirical influence values in (3.3) gives the variance approximation

    v_L = \sum_{i=1}^{k} \frac{1}{n_i^2} \sum_{j=1}^{n_i} l_{ij}^2, \tag{3.4}

which generalizes (2.36).

Example 3.3 (Difference of population means) For the difference between sample averages in Example 3.1, the first influence function is

    L_{t,1}(y; F) = \left. \frac{\partial}{\partial \varepsilon} \left[ \int x_1 \, d\{(1-\varepsilon)F_1(x_1) + \varepsilon H_y(x_1)\} - \int x_2 \, dF_2(x_2) \right] \right|_{\varepsilon = 0} = y - \mu_1,

just as in Example 2.17. Similarly L_{t,2}(y; F) = −(y − μ_2). In this case the linear approximation (3.1) is exact. The variance approximation formula (3.3) gives

    v_L = \frac{1}{n_1} \mathrm{var}(Y_1) + \frac{1}{n_2} \mathrm{var}(Y_2),


and the empirical version (3.4) is

    v_L = \frac{1}{n_1^2} \sum_{j=1}^{n_1} (y_{1j} - \bar y_1)^2 + \frac{1}{n_2^2} \sum_{j=1}^{n_2} (y_{2j} - \bar y_2)^2.

As usual this differs slightly from the unbiased variance approximation.

Note that if we could assume that the two population variances were equal, then it would be appropriate to replace v_L by

    \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \frac{1}{n_1 + n_2} \left\{ \sum_{j=1}^{n_1} (y_{1j} - \bar y_1)^2 + \sum_{j=1}^{n_2} (y_{2j} - \bar y_2)^2 \right\},

similar to the usual "pooled variance" formula. ■

The various comments made about calculation in Section 2.7 apply here with obvious modifications. Thus the empirical influence values can be approximated accurately by numerical differentiation, which here means

    l_{ij} \approx \frac{t(\hat F_1, \ldots, (1-\varepsilon)\hat F_i + \varepsilon H_{y_{ij}}, \ldots, \hat F_k) - t}{\varepsilon}

for small ε. We can also use the generalization of (2.44), namely

    t^* \approx t + \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{n_i} f_{ij}^* l_{ij}, \tag{3.5}

where f*_{ij} denotes the frequency of data value y_{ij} in the bootstrap sample. Then given simulated values t* we can approximate the l_{ij} by regression, generalizing the method outlined in Section 2.7.4. Alternative ways to calculate the l_{ij} and v_L are described in Problems 3.6 and 3.7.
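The numerical-differentiation recipe above is easy to implement directly. The sketch below (Python; the statistic, step size and toy data are illustrative choices of ours) perturbs the probability vector of one sample at a time to approximate the l_{ij}, then combines them via (3.4).

    import numpy as np

    def t_of(samples, probs):
        # statistic written as a function of weighted EDFs: difference of means
        means = [np.sum(p * y) for y, p in zip(samples, probs)]
        return means[0] - means[1]

    def influence_values(samples, eps=1e-4):
        probs = [np.full(len(y), 1 / len(y)) for y in samples]
        t0 = t_of(samples, probs)
        all_l = []
        for i in range(len(samples)):
            l_i = np.empty(len(samples[i]))
            for j in range(len(samples[i])):
                pert = [p.copy() for p in probs]
                pert[i] = (1 - eps) * pert[i]
                pert[i][j] += eps                  # (1-eps) F_i + eps H_{y_ij}
                l_i[j] = (t_of(samples, pert) - t0) / eps
            all_l.append(l_i)
        return all_l

    samples = [np.array([1.2, 0.7, 2.1, 1.6]), np.array([0.3, 1.1, 0.8])]
    l = influence_values(samples)
    v_L = sum(np.sum(l_i ** 2) / len(l_i) ** 2 for l_i in l)   # equation (3.4)

For the difference of means this recovers l_{1j} = y_{1j} − ȳ_1 and l_{2j} = −(y_{2j} − ȳ_2), here exactly, since the statistic is linear.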

The multisample analogue of the jackknife method of Section 2.7.3 involves the case deletion estimates

    l_{jack,ij} = (n_i - 1)(t - t_{-ij}),

where t_{-ij} is the estimate obtained by omitting the jth case in the ith sample. Then

    v_{jack} = \sum_{i=1}^{k} \frac{1}{n_i(n_i - 1)} \sum_{j=1}^{n_i} (l_{jack,ij} - \bar l_{jack,i})^2.

One can also generalize the discussion of bias approximation in Section 2.7.3. However, the extension of the quadratic approximation (2.41) is not straightforward, because there are "cross-population" terms.

The same approximation (3.1) could be used even when the samples, and hence the F̂_i, are correlated. But this would have to be taken into account in (3.3), which as stated assumes mutual independence of the samples. In general it would be safer to incorporate dependence through the use of appropriate multivariate EDFs.


3.3 Semiparametric Models

In a semiparametric model, some aspects of the data distribution are specified in terms of a small number of parameters, but other aspects are left arbitrary. A simple example would be the characterization Y = μ + σε, with no assumption on the distribution of ε except that it has centre zero and scale one. Usually a semiparametric model is useful only when we have nonhomogeneous data, with only the differences characterized by parameters, common elements being nonparametric.

In the context of Section 3.2, and especially Example 3.2, we might for example be fairly sure that the distributions F_i differ only in scale or, more cautiously, in scale and location. That is, Y_{ij} might be expressed as

    Y_{ij} = \mu_i + \sigma_i \varepsilon_{ij},

where the ε_{ij} are sampled from a common distribution with CDF F_0, say. The normal distribution is a parametric model of this form. The form can be checked to some extent by plotting standardized residuals such as

    e_{ij} = \frac{y_{ij} - \hat\mu_i}{\hat\sigma_i}

for appropriate estimates μ̂_i and σ̂_i, to verify homogeneity across samples. The common F_0 will be estimated by the EDF of all Σ_i n_i of the e_{ij}, or better by the EDF of the standardized residuals e_{ij}/(1 − n_i^{-1})^{1/2}. The resampling algorithm will then be

    Y_{ij}^* = \hat\mu_i + \hat\sigma_i \varepsilon_{ij}^*, \qquad j = 1, \ldots, n_i, \quad i = 1, \ldots, k,

where the ε*_{ij} are randomly sampled from this EDF, i.e. randomly sampled with replacement from the standardized residuals; see Problem 3.1.
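A minimal sketch of this scale-and-location resampling plan follows (Python; estimating μ_i and σ_i by the sample mean and standard deviation is our illustrative choice).

    import numpy as np

    def location_scale_resample(series, rng):
        mu = [y.mean() for y in series]
        sig = [y.std(ddof=1) for y in series]
        # standardized residuals e_ij/(1 - 1/n_i)^{1/2}, pooled across series
        pooled = np.concatenate([(y - m) / (s * np.sqrt(1 - 1 / len(y)))
                                 for y, m, s in zip(series, mu, sig)])
        # Y*_ij = mu_i + sigma_i * eps*_ij, eps* resampled from the pooled EDF
        return [m + s * rng.choice(pooled, len(y), replace=True)
                for y, m, s in zip(series, mu, sig)]

    rng = np.random.default_rng(7)
    series = [rng.normal(0, 2, 10), rng.normal(3, 1, 15)]
    boot = location_scale_resample(series, rng)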

In another context, with positive data such as lifetimes, it might be appropriate to think of distributions as differing only by multiplicative effects, i.e. Y_{ij} = μ_i ε_{ij}, where the ε_{ij} are randomly sampled from some baseline distribution with unit mean. The exponential distribution is a parametric model of this form. The principle here would be essentially the same: estimate the ε_{ij} by residuals such as e_{ij} = y_{ij}/μ̂_i, then define Y*_{ij} = μ̂_i ε*_{ij} with the ε*_{ij} randomly sampled with replacement from the e_{ij}.

Similar ideas apply in regression situations. The parametric part of the model concerns the systematic relationship between the response y and explanatory variables x, e.g. through the mean, and the nonparametric part concerns the random variation. We consider this in detail in Chapters 6 and 7.

Resampling plans such as those just outlined will give more accurate answers when their assumptions about the relationships between the F_i are correct, but they are not robust to failure of these assumptions. Some pooling of information


across samples may be essential in order to avoid difficulties when the samples are small, but otherwise it is usually unnecessary.

If we widen the meaning of semiparametric to include any partial modelling, then features less tangible than parameters come into play. The following two examples illustrate this.

Example 3.4 (Symmetric distribution) Suppose that with our simple random sample it was appropriate to assume that the distribution was symmetric about its mean or median. Using this assumption could be critical to correct statistical analysis; see Example 3.26. Without a parametric model it is hard to see a clear choice for F̂. But we can argue as follows: under F the distributions of Y − μ and −(Y − μ) are the same, so under F̂ the distributions of Y* − μ̂ and −(Y* − μ̂) should be the same. This will be true if we symmetrize the EDF about μ̂, meaning that we take F̂ to be the EDF of y_1, ..., y_n, 2μ̂ − y_1, ..., 2μ̂ − y_n. A robust choice for μ̂ would be the median. (For discrete distributions we could equivalently average sample proportions for appropriate pairs of data values.) The mean, median and other symmetrically defined location estimates of the resulting estimated distribution are all equal. ■
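In code the symmetrized EDF is just the original points together with their reflections, as in this short sketch (Python, toy data).

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.exponential(1, 25)                  # toy sample
    mu = np.median(y)                           # robust centre
    sym = np.concatenate([y, 2 * mu - y])       # EDF support of size 2n
    y_star = rng.choice(sym, len(y), replace=True)   # one symmetric bootstrap sample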

Example 3.5 (Equal marginal distributions) Suppose that Y is bivariate, say Y = (U, X), and that it is appropriate from the context to assume that U and X have the same marginal distribution. Then F̂ can be forced to have the same margins by defining it as the EDF of the 2n pairs (u_1, x_1), ..., (u_n, x_n), (x_1, u_1), ..., (x_n, u_n). ■

In both of these examples the resulting estimate will be more efficient than the EDF. This may be less important than producing a model which satisfies the practical assumptions and makes intuitive sense.

Example 3.6 (Mixed discrete-continuous distributions) There will be situations where the raw EDF is not suitable for resampling because it is not a credible model. Such a situation arises in classification, where we have a binary response y and covariates x which are used to predict y. If the observed covariate values x_1, ..., x_n are distinct, then the conditional probabilities π(x) = Pr(Y = 1 | x) estimated from the EDF are all 0 or 1. This is clearly not credible, so the EDF should not be used as a resampling model if the focus of interest is a property that depends critically on the conditional probabilities π(x). A natural modification of the EDF is to keep the marginal EDF of x, but to replace the 0-1 values of the conditional distribution by a smooth estimate of π(x). This is discussed further in Example 7.9. ■


3.4 Smooth Estimates of F

For nonparametric situations we have so far mostly assumed that the EDF F̂ is a suitable estimate of F. But F̂ is discrete, and it is natural to ask if a smooth estimate of F might be better. The most likely situation for improvement is where the effects of discreteness (Section 2.3.2) are severe, as in the case of the sample median (Example 2.16) or other sample quantiles.

When it is reasonable to suppose that F has a continuous PDF, one possibility is to use kernel density estimation. For scalar y we take

    \hat f_h(y) = \frac{1}{nh} \sum_{j=1}^{n} w\left( \frac{y - y_j}{h} \right), \tag{3.6}

where w(·) is a continuous and symmetric PDF with mean zero and unit variance, and do calculations or simulations based on the corresponding CDF F̂_h, rather than on the EDF F̂. This corresponds to simulation by setting

    Y_j^* = y_{I_j} + h\varepsilon_j, \qquad j = 1, \ldots, n,

where the I_j are independent and uniformly distributed on the integers 1, ..., n and the ε_j are a random sample from w(·), independent of the I_j. This is the smoothed bootstrap. Note that h = 0 recovers the EDF.

The variance of an observation generated from (3.6) is n^{-1} Σ_j (y_j − ȳ)² + h², and it may be preferable for the samples to have the same variance as for the unsmoothed bootstrap. This is implemented via the shrunk smoothed bootstrap, under which F̂_h smooths between F̂ and a model in which data are generated from density w(·) centred at the mean and rescaled to have the variance of F̂; see Problem 3.8.
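The two schemes can be sketched as follows (Python; the normal kernel and this particular shrinking step are our illustrative choices, one common way of matching the variance of F̂ as in Problem 3.8).

    import numpy as np

    def smoothed_sample(y, h, rng, shrink=False):
        n = len(y)
        idx = rng.integers(0, n, n)                   # I_j uniform on 1, ..., n
        ystar = y[idx] + h * rng.standard_normal(n)   # Y*_j = y_{I_j} + h eps_j
        if shrink:
            # rescale about the mean so the sampling variance matches that of the EDF
            v = y.var()                               # n^{-1} sum (y_j - ybar)^2
            ystar = y.mean() + (ystar - y.mean()) * np.sqrt(v / (v + h ** 2))
        return ystar

    rng = np.random.default_rng(11)
    y = rng.standard_t(3, 81)
    meds = [np.median(smoothed_sample(y, 0.25, rng, shrink=True)) for _ in range(200)]
    var_med = np.var(meds, ddof=1)    # smoothed bootstrap estimate of var(median)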

Having decided which smoothed bootstrap is to be used, we estimate the required property of F, a(F), by a(F̂_h) rather than a(F̂). So if T is an estimator of θ = t(F), and we intend to estimate a(F) = var(T | F) by simulation, we would obtain values t*_1, ..., t*_R calculated from samples generated from F̂_h, and then estimate a(F) by (R − 1)^{-1} Σ_r (t*_r − t̄*)². Notice that it is a(F), not t(F), that is estimated using smoothing.

To see when a(F̂_h) is better than a(F̂), suppose that a(F̂) has linear approximation (2.35). Then

    a(\hat F_h) - a(F) = n^{-1} \sum_{j=1}^{n} \int L_a(Y_j + h\varepsilon; F)\, w(\varepsilon)\, d\varepsilon + \cdots
                       = n^{-1} \sum_{j=1}^{n} L_a(Y_j; F) + \tfrac{1}{2} h^2 n^{-1} \sum_{j=1}^{n} L_a''(Y_j; F) + \cdots

for large n and small h, where L_a''(u; F) = ∂²L_a(u; F)/∂u².


Table 3.2 Root mean squared error (×10⁻²) for estimation of n^{1/2} times the standard deviation of the transformed correlation coefficient for bivariate normal data with correlation 0.7, for usual and smoothed bootstraps with R = 200 and smoothing parameter h.

          Usual            Smoothed, h
   n      h = 0      0.1    0.25     0.5     1.0
  20       18.9     18.6    16.6    11.9     6.6
  80       11.4     11.2    10.4     8.5     6.4

It follows that the mean squared error of a(F̂_h), MSE(h) = E[{a(F̂_h) − a(F)}²], roughly equals

    n^{-1} \int L_a(y; F)^2\, dF(y) + h^2 n^{-1} \int L_a(y; F)\, L_a''(y; F)\, dF(y) + \tfrac{1}{4} h^4 \left\{ \int L_a''(y; F)\, dF(y) \right\}^2. \tag{3.7}

Smoothing is not beneficial if the coefficient of h² is positive, but if it is negative (3.7) can be reduced by choosing a positive value of h that trades off the last two terms. The leading term in (3.7) is unaffected by the choice of h, which suggests that in large samples any effect of smoothing will be minor for such statistics.

Example 3.7 (Sample correlation) To illustrate the discussion above, we take a(F) to be the scaled standard deviation of T = ½ log{(1 + C)/(1 − C)}, where C is the correlation coefficient for bivariate normal data. We extend (3.6) to bivariate y by taking w(·) to be the bivariate normal density with mean zero and variance matrix equal to the sample variance matrix. For each of 200 samples, we applied the smoothed bootstrap with different values of h and R = 200 to estimate a(F).

Table 3.2 shows results for two sample sizes. For n = 20 there is a reduction in root mean squared error by a factor of about three, whereas for n = 80 the factor is about two. Results for the shrunk smoothed bootstrap are the same, because of the scale invariance of C and the form of w(·). ■

Smoothing is potentially more valuable when the quantity of interest depends on the local behaviour of F, as in the case of a sample quantile.

Example 3.8 (Sample median) Suppose that t is the sample median, and that we wish to estimate its variance a(F). In Example 2.16 we saw that the discreteness of the median posed problems for the ordinary, unsmoothed, bootstrap. Does smoothing improve matters?

Under regularity conditions on F and h, detailed calculations show that the mean squared error of na(F̂_h) is proportional to

    (nh)^{-1} c_1 + h^4 c_2, \tag{3.8}

where c_1 and c_2 depend on F and w(·) but not on n. Provided that c_1 and c_2 are non-zero, (3.8) is minimized at h ∝ n^{-1/5}, and (3.8) is then of order n^{-4/5},


Table 3.3 Root mean squared error for estimation of n times the variance of the median of samples of size n from the £3 and exponential densities, for usual, smoothed and shrunk smoothed bootstraps with R = 200 and smoothing parameter h.

n Usual h = 0

Sm oothed, h Shrunk sm oothed, h

0.1 0.25 0.5 1.0 0.1 0.25 0.5 1.0

£3 11 2.27 2.08 2.17 3.59 10.63 2.06 2.00 2.72 4.9181 0.97 0.76 0.77 1.81 6.07 0.75 0.67 1.17 2.30

Exp 11 1.32 1.15 1.02 1.18 7.53 1.13 0.92 0.76 0.9381 0.57 0.48 0.37 0.41 1.11 0.47 0.34 0.27 0.27

whereas it is O(n^{-1/2}) in the unsmoothed case. Thus there are advantages to smoothing here, at least in large samples. Similar results hold for other quantiles.

Table 3.3 shows results of simulation experiments where 1000 samples were taken from the exponential and t₃ distributions. For each sample smoothed and shrunk smoothed bootstraps were performed with R = 200 and several values of h. Unlike in Table 3.2, the advantage due to smoothing increases with n, and the shrunk smoothed bootstrap improves on the smoothed bootstrap, particularly at larger values of h.

As predicted by the theory, as n increases the root mean squared error decreases more rapidly for smoothed than for unsmoothed bootstraps; it decreases fastest for shrunk smoothing. For the t₃ data the root mean squared error is not much reduced. For the exponential data smoothing was performed on the log scale, leading to a reduction in root mean squared error by a factor of two or so. Too large a value of h can lead to large increases in root mean squared error, but the choice of h is less critical for shrunk smoothing. Overall, a small amount of shrunk smoothing seems worthwhile here, provided the data are well-behaved. But similar experiments with Cauchy data gave very poor results made worse by smoothing, so one must be sure that the data are not pathological. Furthermore, the gains in precision are not large enough to be critical, at least for these sample sizes. ■

The discussion above begs the important question of how to choose the smoothing parameter for use with a particular dataset. One possibility is to treat the problem as one of choosing among possible estimators a(F̂_h) and use the nested bootstrap, as in Example 3.26. However, the use of an estimated h is not sure to give improvement. When the rate of decrease of the optimal value of h is known, another possibility is to use subsampling, as in Example 8.6.


3.5 Censoring

3.5.1 Censored data

Censoring is present when data contain a lower or upper bound for an observation rather than the value itself. Such data often arise in medical and industrial reliability studies. In the medical context, the variable of interest might represent the time to death of a patient from a specific disease, with an indicator of whether the time recorded is exact or a lower bound due to the patient being lost to follow-up or to death from other causes.

The commonest form of censoring is right-censoring, in which case the value observed is Y = min(Y⁰, C), where C is a censoring value, and Y⁰ is a non-negative failure time, which is known only if Y⁰ < C. The data themselves are pairs (Y, D), where D is a censoring indicator, which equals one if Y⁰ is observed and equals zero if C is observed. Interest is usually focused on the distribution F⁰ of Y⁰, which is obscured if there is censoring.

The survivor function and the cumulative hazard function are central to the study of survival data. The survivor function corresponding to F⁰(y) is Pr(Y⁰ > y) = 1 − F⁰(y), and the cumulative hazard function is A⁰(y) = −log{1 − F⁰(y)}. The cumulative hazard function may be written as ∫₀^y dA⁰(u), where for continuous y the hazard function dA⁰(y)/dy measures the instantaneous rate of failure at time y, conditional on survival to that point. A constant hazard λ leads to an exponential distribution of failure times with survivor and cumulative hazard functions exp(−λy) and λy; departures from these simple forms are often of interest.

The simplest model for censoring is random censorship, under which C is a random variable with distribution function G, independent of Y⁰. In this case the observed variable Y has survivor function

    \Pr(Y > y) = \{1 - F^0(y)\}\{1 - G(y)\}.

Other forms of censoring also arise, and these are often more realistic for applications.

Suppose that the data available are a homogeneous random sample (y_1, d_1), ..., (y_n, d_n), and that censoring occurs at random. Let y_1 < ⋯ < y_n, so there are no tied observations. A standard estimate of the failure-time survivor function, the product-limit or Kaplan-Meier estimate, may then be written as

    1 - \hat F^0(y) = \prod_{j: y_j \le y} \left( \frac{n-j}{n-j+1} \right)^{d_j}. \tag{3.9}

If there is no censoring, all the d_j equal one, and F̂⁰(y) reduces to the EDF of y_1, ..., y_n (Problem 3.9). The product-limit estimate changes only at successive failures, by an amount that depends on the number of censored observations


H{u) is the Heaviside function, which equals zero if u < 0 and equals one otherwise.

between them. Ties between censored and uncensored data are resolved by assuming that censoring happens instantaneously after a failure might have occurred; the estimate is unaffected by other ties. A standard error for 1 − F̂⁰(y) is given by Greenwood's formula,

    \left[ \{1 - \hat F^0(y)\}^2 \sum_{j: y_j \le y} \frac{d_j}{(n-j)(n-j+1)} \right]^{1/2}. \tag{3.10}

In setting confidence intervals this is usually applied on a transformed scale. Both (3.9) and (3.10) are unreliable where the numbers at risk of failure are small.

Since 1 − d_j is an indicator of censoring, the product-limit estimate of the censoring survivor function 1 − G is

    1 - \hat G(y) = \prod_{j: y_j \le y} \left( \frac{n-j}{n-j+1} \right)^{1-d_j}. \tag{3.11}

The cumulative hazard function may be estimated by the Nelson-Aalen estimate

    \hat A^0(y) = \sum_{j=1}^{n} \frac{H(y - y_j)\, d_j}{n - j + 1}. \tag{3.12}

Since y_1 < ⋯ < y_n, the increase in Â⁰ at y_j is dÂ⁰(y_j) = d_j/(n − j + 1). The interpretation of (3.12) is that at each failure the hazard function is estimated by the number observed to fail, divided by the number of individuals at risk (i.e. available to fail) immediately before that time. In large samples the increments of Â⁰, the dÂ⁰(y_j), are approximately independent binomial variables with denominators (n + 1 − j) and probabilities d_j/(n − j + 1). The product-limit estimate may be expressed as

    1 - \hat F^0(y) = \prod_{j: y_j \le y} \{1 - d\hat A^0(y_j)\} \tag{3.13}

in terms o f the com ponents o f (3.12).

Example 3.9 (AML data) Table 3.4 contains data from a clinical trial conducted at Stanford University to assess the efficacy of maintenance chemotherapy for the remission of acute myelogeneous leukaemia (AML). After reaching a state of remission through treatment by chemotherapy, patients were divided randomly into two groups, one receiving maintenance chemotherapy and the other not. The objective of the study was to see if maintenance chemotherapy lengthened the time of remission, when the symptoms recur. The data in the table were gathered for preliminary analysis before the study ended.

Table 3.4 Remission times (weeks) for two groups of patients with acute myelogeneous leukaemia (AML), one receiving maintenance chemotherapy (Group 1) and the other not (Miller, 1981, p. 49). > indicates right-censoring.

  Group 1:  9  13  >13  18  23  >28  31  34  >45  48  >161
  Group 2:  5   5    8   8  12  >16  23  27   30  33   43   45

The left panel of Figure 3.3 shows the estimated survivor functions for the times of remission. A plus on one of the lines indicates a censored observation. There is some suggestion that maintenance prolongs the time to remission, but the samples are small and the evidence is not overwhelming. The right panel shows the estimated survivor functions for the censoring times. Only one observation in the non-maintained group is censored, but the censoring distributions seem similar for both groups.

The estimated probabilities that remission will last beyond 20 weeks are respectively 0.71 and 0.59 for the groups, with standard errors from (3.10) both equal to 0.14. ■
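These estimates are easily reproduced; the sketch below (Python) computes (3.9) and (3.10) for Group 1, with d_j = 1 marking an observed remission time. Small discrepancies from the quoted 0.71 can arise from the convention used to order tied failure and censoring times.

    import numpy as np

    y = np.array([9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161], dtype=float)
    d = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0])    # 0 = right-censored

    n = len(y)
    j = np.arange(1, n + 1)                            # data already ordered
    surv = np.cumprod(((n - j) / (n - j + 1.0)) ** d)  # 1 - F^0(y_j), eqn (3.9)
    inc = np.where(d == 1, 1.0 / (np.maximum(n - j, 1) * (n - j + 1.0)), 0.0)
    se = surv * np.sqrt(np.cumsum(inc))                # Greenwood, eqn (3.10)

    p20 = surv[y <= 20][-1]   # estimated Pr(remission beyond 20 weeks); s.e. about 0.14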

3.5.2 Resampling plans

Cases

When the data are a homogeneous sample subject to random censorship, the most direct way to bootstrap is to set Y*_j = min(Y⁰*_j, C*_j), where the Y⁰*_j and C*_j are independently generated from F̂⁰ and Ĝ respectively. This implies that

    \Pr(Y^* > y) = \{1 - \hat G(y)\}\{1 - \hat F^0(y)\} = \prod_{j: y_j \le y} \frac{n-j}{n-j+1},

which corresponds to the EDF that places mass n^{-1} on each of the n cases (y_j, d_j). That is, ordinary bootstrap sampling under the random censorship model is equivalent to resampling cases from the original data.

Conditional bootstrap

A second sampling scheme starts from the premise that since the censoring variable C is unrelated to Y⁰, knowledge of the quantities C_1, ..., C_n alone would tell us nothing about F⁰. They would in effect be ancillary statistics. This suggests that simulations should be conditional on the pattern of censorship, so far as practicable. To allow for the censoring pattern, we argue that although the only values of c_j known exactly are those with d_j = 0, the observed values of the remaining observations are lower bounds for the censoring variables, because C_j > y_j when d_j = 1. This suggests the following algorithm.


Figure 3.3 Product-limit survivor function estimates for two groups of patients with AML, one receiving maintenance chemotherapy (solid) and the other not (dots). The left panel shows estimates for the time to remission, and the right panel shows the estimates for the time to censoring. In the left panel, + indicates times of censored observations; in the right panel + indicates times of uncensored observations.


Algorithm 3.1 (Conditional bootstrap for censored data)

For r = 1, ..., R,

1. generate Y⁰*_1, ..., Y⁰*_n independently from F̂⁰;
2. for j = 1, ..., n, make simulated censoring variables by setting C*_j = y_j if d_j = 0, and, if d_j = 1, generating C*_j from {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}, which is the estimated distribution of C_j conditional on C_j > y_j; then
3. set Y*_j = min(Y⁰*_j, C*_j), for j = 1, ..., n.

If the largest observation is censored, it is given a notional failure time to the right of the observed value, and conversely if the largest observation is uncensored, it is given a notional censoring time to the right of the observed value. This ensures that the observation can appear in bootstrap resamples.
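A compact implementation of Algorithm 3.1 follows (Python). The helper km_masses, our own construction, converts a product-limit estimate into a discrete distribution, assigning any mass remaining after the largest observation to a notional point just beyond it, as described above.

    import numpy as np

    def km_masses(y, d):
        # discrete distribution implied by the product-limit estimate for (y, d)
        n = len(y)
        j = np.arange(1, n + 1)
        surv = np.cumprod(((n - j) / (n - j + 1.0)) ** d)
        prev = np.concatenate([[1.0], surv[:-1]])
        pts, mass = y[d == 1], (prev - surv)[d == 1]
        if surv[-1] > 0:                       # notional point beyond the data
            pts = np.append(pts, y[-1] + 1.0)
            mass = np.append(mass, surv[-1])
        return pts, mass

    def conditional_resample(y, d, rng):
        pts_f, m_f = km_masses(y, d)           # hat F^0
        pts_c, m_c = km_masses(y, 1 - d)       # hat G
        y0 = rng.choice(pts_f, len(y), p=m_f / m_f.sum())   # step 1
        c = np.empty(len(y))
        for j in range(len(y)):                # step 2
            if d[j] == 0:
                c[j] = y[j]                    # censoring time observed exactly
            else:                              # sample C_j given C_j > y_j
                keep = pts_c > y[j]
                c[j] = rng.choice(pts_c[keep], p=m_c[keep] / m_c[keep].sum())
        ystar = np.minimum(y0, c)              # step 3
        dstar = (y0 <= c).astype(int)
        return ystar, dstar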

Both the above sampling plans can accommodate more complicated patterns of censoring, provided it is uninformative. For example, it might be decided at the start of a reliability experiment on independent and identical components that if they have not already failed, items will be censored at fixed times c_1, ..., c_n. In this situation an appropriate resampling plan is to generate failure times Y⁰*_j from F̂⁰, and then to take Y*_j = min(Y⁰*_j, c_j), for j = 1, ..., n. This amounts to having separate censoring distributions for each item, with the jth putting mass one at c_j. Or in a medical study the jth individual might be subject to random censoring up to a time c⁰_j, corresponding to a fixed calendar date for the end of the study. In this situation, Y_j = min(Y⁰_j, C_j, c⁰_j), with the indicator D_j equalling zero, one, or two according to whether C_j, Y⁰_j, or c⁰_j was observed. Then an appropriate conditional sampling plan would generate Y⁰*_j and C*_j as in the conditional plan above, but take Y*_j = min(Y⁰*_j, C*_j, c⁰_j) and make D*_j accordingly.

Weird bootstrap

The sampling plans outlined above mimic how the data are thought to arise, by generating individual failure and censoring times. When interest is focused on the survival or hazard functions, a third and quite different approach uses direct simulation from the Nelson-Aalen estimate (3.12) of the cumulative hazard. The idea is to treat the numbers of failures at each observed failure time as independent binomial variables with denominators equal to the numbers of individuals at risk, and means equal to the numbers that actually failed. Thus when y_1 < ⋯ < y_n, we take the simulated number to fail at time y_j, N*_j, to be binomial with denominator n − j + 1 and probability of failure d_j/(n − j + 1). A simulated Nelson-Aalen estimate is then

    \hat A^{0*}(y) = \sum_{j=1}^{n} \frac{H(y - y_j)\, N_j^*}{n - j + 1}, \tag{3.14}

which can be used to estimate the uncertainty of the original estimate Â⁰(y). In this weird bootstrap the failures at different times are unrelated, the number at risk does not depend on previous failures, there are no individuals whose simulated failure times underlie Â⁰*(y), and no explicit assumption is made about the censoring mechanism. Indeed, under this scheme the censored individuals are held fixed, but the number of failures is a sum of binomial variables (Problem 3.10).

The simulated survivor function corresponding to (3.14) is obtained by substituting

    d\hat A^{0*}(y_j) = \frac{N_j^*}{n - j + 1}

into (3.13) in place of dÂ⁰(y_j).
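The weird bootstrap itself is only a few lines, as in this sketch (Python; the function name is ours).

    import numpy as np

    def weird_resample(y, d, rng):
        # y assumed ordered; d_j = 1 at failures, 0 at censored times
        n = len(y)
        atrisk = n - np.arange(1, n + 1) + 1             # n - j + 1
        nstar = rng.binomial(atrisk, d / atrisk)         # N*_j, with mean d_j
        dA = nstar / atrisk                              # increments dA^{0*}(y_j)
        return np.cumsum(dA), np.cumprod(1 - dA)         # (3.14), and survivor via (3.13)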

Example 3.10 (AML data) Figure 3.3 suggests that the censoring distributions for both groups of data in Table 3.4 are similar, but that the survival distributions themselves are not. To compare the resampling schemes described above, we consider estimates of two parameters, the probability of remission beyond 20 weeks and the median survival time, both for Group 1. These estimates are 1 − F̂⁰(20) = 0.71 and inf{t : F̂⁰(t) ≥ ½} = 31.

Table 3.5 compares results from 499 simulations using the ordinary, conditional, and weird bootstraps. For the survival probabilities, the ordinary and conditional bootstraps give similar results, and both standard errors are similar to that from Greenwood's formula; the weird bootstrap probabilities are significantly higher and are less variable. The schemes give infinite estimates of the median 21, 19, and 2 times respectively.


Table 3.5 Results for 499 replicates of censored data bootstraps of Group 1 of the AML data: average (standard deviation) for estimated probability of remission beyond 20 weeks, average (standard deviation) for estimated median survival time, and the number of resamples in which case 3 occurs 0, 1, 2 and 3 or more times.

                                           Frequency of case 3
                 Probability     Median      0     1     2    ≥3
  Cases          0.72 (0.14)  32.5 (8.5)   180   182    95    42
  Conditional    0.72 (0.14)  32.8 (8.5)    75   351    71     3
  Weird          0.73 (0.12)  33.3 (7.2)     0   499     0     0

Figure 3.4 Comparison of distributions of differences in median survival times for censored data bootstraps applied to the AML data. The dotted line is the line x = y.

The weird bootstrap results for the median are less variable than the others.

The last columns of the table show the numbers of samples in which the smallest censored observation appears 0, 1, 2, and 3 or more times. Under the conditional scheme the observation appears more often than under the ordinary bootstrap, and under the weird bootstrap it occurs once in each resample.

Figure 3.4 compares the distributions of the difference of median survival times between the two groups, under the three schemes. Results for the conditional and ordinary bootstraps are similar, but the weird bootstrap again gives results that are less variable than the others.

This set of data gives an extreme test of methods for censored data, because quantiles of the product-limit estimate are very discrete.

The weird bootstrap also gave results less variable than the other schemes for a larger set of data. In general it seems that case resampling and conditional resampling give quite similar and reliable results, both differing from the weird bootstrap. ■


3.6 Missing Data

The expression "missing data" relates to datasets of a standard form for which some entries are missing or incomplete. This happens in a variety of different ways. For example, censored data as described in Section 3.5 are incomplete when the censoring value c is reported instead of y⁰. Or in a factorial experiment a few factor combinations may not have been used. In such cases estimates and inferences would take a simple form if the dataset were "complete". But because part of the standard form is missing, we have two problems: how to estimate the quantities of interest, and how to make inferences about them. We have already discussed ways of dealing with censored data. Now we examine situations where each response has several components, some of which are missing for some cases.

Suppose, then, that the fictional or potential complete data are y⁰s and that corresponding observed data are ys, with some components taking the value NA to represent "not available".

Parametric problems

For parametric problems the situation is relatively straightforward, at least in principle. First, in defining estimators there is a general framework within which complete-data MLE methods can be applied using the iterative EM (expectation-maximization) algorithm, which is widely used in incomplete data problems and essentially works by estimating missing values. Formulae exist for computing approximate standard errors of estimators, but simulation will often be required to obtain accurate answers. One extra component that must be specified is the mechanism which takes complete data y⁰ into observed data y, i.e. f(y | y⁰). The methodology is simplest when data are missing at random.

The corresponding Bayesian methodology is also relatively straightforward in principle, and numerous general algorithms exist for using complete-data forms of posterior distribution. Such algorithms, although they involve simulation, are somewhat removed from the general context of bootstrap methods and will not be discussed here.

Nonparametric problems

Nonparametric analysis is somewhat more complicated, in part because of the difficulty of defining appropriate estimators. The following artificial example illustrates some of the key ideas.

Example 3.11 (Mean with missing data) Suppose that responses y⁰ had been obtained from n randomly chosen individuals, but that m randomly selected values were then lost. So the observed data are

    y_1, \ldots, y_n = y_1^0, \ldots, y_{n-m}^0, NA, \ldots, NA.


To estimate the population mean μ we should of course use the average response ȳ = (n − m)^{-1} Σ_j y⁰_j, whose variance we would estimate by

    v = (n - m)^{-2} \sum_{j=1}^{n-m} (y_j^0 - \bar y)^2.

But think of this as a prototype missing data problem, to which resampling methods are to be applied. Consider the following two approaches, both sketched in code after the list:

1. First estimate μ by t = ȳ, the average of the non-missing data. Then

   (a) simulate samples y*_1, ..., y*_n by sampling with replacement from the n observations y_1, ..., y_{n−m}, NA, ..., NA; then

   (b) calculate t* as the average of non-missing values.

2. First estimate the missing values y⁰_{n−m+1}, ..., y⁰_n by ŷ⁰_j = ȳ for j = n − m + 1, ..., n, and estimate μ as the mean of y⁰_1, ..., y⁰_{n−m}, ŷ⁰_{n−m+1}, ..., ŷ⁰_n. Then

   (a) sample with replacement from y⁰_1, ..., y⁰_{n−m}, ŷ⁰_{n−m+1}, ..., ŷ⁰_n to get y*⁰_1, ..., y*⁰_n; then

   (b) duplicate the data-loss procedure by replacing a randomly chosen m of the y*⁰_j with NA; finally

   (c) duplicate the data estimation of μ to get t*.
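Both approaches are easy to simulate, as in this sketch (Python, with NA represented by np.nan and normal toy data); comparing np.var(boot1) and np.var(boot2) with the true variance of the estimator exhibits the over- and under-dispersion discussed next.

    import numpy as np

    rng = np.random.default_rng(5)
    n, m = 30, 15
    y = np.concatenate([rng.normal(10, 2, n - m), np.full(m, np.nan)])

    # approach 1: resample all n cases, average the non-missing values
    # (resamples with every value missing would be discarded; negligible here)
    boot1 = [np.nanmean(rng.choice(y, n, replace=True)) for _ in range(999)]

    # approach 2: impute by the observed mean, resample the completed data,
    # re-impose the data-loss step, then apply the complete-data estimator
    obs = y[~np.isnan(y)]
    completed = np.concatenate([obs, np.full(m, obs.mean())])

    def second_approach(rng):
        full = rng.choice(completed, n, replace=True)
        full[rng.choice(n, m, replace=False)] = np.nan   # lose m values at random
        return np.nanmean(full)

    boot2 = [second_approach(rng) for _ in range(999)]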

In the first approach, we choose the form of t to take account of the missing data. Then in the resampling we get a random number of missing values, M* say, whose mean is m. The effect of this is to make the variance of T* somewhat larger than the variance of T. Assuming that we discard all resamples with m* = n (all data missing), the bootstrap variance will overestimate var(T) by a factor which ranges from 15% for n = 10, m = 5 to 4% for n = 30, m = 15.

In the second approach, the first step was to fix the data so that the complete-data estimation formula μ̂ = n^{-1} Σ_{j=1}^{n} y⁰_j for t could be used. Then we attempted to simulate data according to the two steps in the original data-generation process. Unfortunately the EDF of y⁰_1, ..., y⁰_{n−m}, ŷ⁰_{n−m+1}, ..., ŷ⁰_n is an underdispersed estimate of the true CDF F. Even though the estimate t is not affected in this particularly simple problem, the bootstrap distribution certainly is: in particular its variance is too small.

Both approaches can be repaired. In the first, we can stratify the sampling with complete and incomplete data as strata. In the second approach, we can add variability to the estimates of missing values. This device, called multiple


imputation, replaces the single estimate ŷ⁰_j = ȳ by the set ŷ⁰_j + e_1, ..., ŷ⁰_j + e_{n−m}, where e_k = y_k − ȳ for k = 1, ..., n − m. Where the estimate ŷ⁰_j was previously given weight 1, the n − m imputed values for the jth case are now given equal weights (n − m)^{-1}. The implication is that F̂ is modified to put mass n^{-1} on each complete-data value, and n^{-1} × (n − m)^{-1} on the m(n − m) values ŷ⁰_j + e_k. In this simple case ŷ⁰_j + e_k = y_k, so F̂ reduces to the EDF of the non-missing data y_1, ..., y_{n−m}, as a consequence of which t(F̂) = ȳ and the bootstrap distribution of T* is correct. ■

This example suggests two lessons. First, if the complete-data estimator can be modified to work for incomplete data, then resampling cases will work reasonably well provided the proportion of missing data is small: stratified resampling would reduce variation in the amount of missingness. Secondly, the complete-data estimator and full simulation of data observation (including the data-loss step) cannot be based on single imputation estimation of missing values, but may work if we use multiple imputation appropriately.

One further point concerns the data-loss mechanism, which in the example we assumed to be completely random. If data loss is dependent upon the response value y, then resampling cases should still be valid: this is somewhat similar to the censored-data problem. But the other approach via multiple imputation will become complicated because of the difficulty of defining appropriate multiple imputations.

Example 3.12 (Bivariate missing data) A more realistic example concerns the estimation of bivariate correlation when some cases are incomplete. Suppose that Y is bivariate with components U and X. The parameter of interest is θ = corr(U, X). A random sample of n cases is taken, such that m cases have x missing, but no cases have both u and x missing or just u missing. If it is safe to assume that X has a linear regression on U, then we can use fitted regression to make single imputations of missing values. That is, we estimate each missing x_j by

    \hat x_j = \bar x + b(u_j - \bar u),

where x̄, ū and b are the averages and the slope of the linear regression of x on u from the n − m complete pairs.

It is easy to see that it would be wrong to substitute these single imputations in the usual formula for sample correlation. The result would be biased away from zero if b ≠ 0. Only if we can modify the sample correlation formula to remove this effect will it be sensible to use simple resampling of cases.

The other strategy is to begin with multiple imputation to obtain a suitable bivariate F̂, next estimate θ with the usual sample correlation t(F̂), and then resample appropriately. Multiple imputation uses the regression residuals from

Figure 3.5 Scatter plot of bivariate sample and multiple imputation values. Left panel shows observed pairs (o) and cases where only u is observed (•). Right panel shows observed pairs (o) and multiple imputation values (+). Dotted line is the imputation regression line obtained from observed pairs.

complete pairs,

    e_j = x_j - \hat x_j = x_j - \{\bar x + b(u_j - \bar u)\},

for j = 1, ..., n − m. Then each missing x_j is imputed by x̂_j plus a randomly selected e_k. Our estimate F̂ is the bivariate distribution which puts weight n^{-1} on each complete pair, and weight n^{-1} × (n − m)^{-1} on each of the n − m multiple imputations for each incomplete case. There are two strong, implicit assumptions being made here. First, as throughout our discussion, it is assumed that values are missing at random. Secondly, homogeneity of conditional variances is being assumed, so that pooling of residuals makes sense.

As an illustration, the left panel of Figure 3.5 shows a scatter plot for a sample of n = 20 where m = 5 cases have x components missing. Complete cases appear as open circles, and incomplete cases as filled circles; only the u components are observed. In the right panel, the dotted line is the imputation line which gives x̂_j for j = 16, ..., 20, and the multiple imputation values are plotted with symbol +. The multiple imputation EDF will put probability 1/20 on each open circle, and probability 1/300 on each +.

The results in Table 3.6 illustrate the effectiveness of the multiple imputation EDF. The table shows simulation averages and standard deviations for estimates of the correlation θ and of σ²_X = var(X) using the standard complete-data forms of the estimators, when half of the x values are missing in a sample of size n = 20 from the bivariate normal distribution. In this problem there would be little gain from using incomplete cases, but in more complex situations there might be so few complete cases that multiple imputation would be highly effective or even essential.

Table 3.6 Average (standard deviation) of estimators for the variance σ²_X and correlation θ from bivariate normal data (u, x) with sample size n = 20 and m = 10 x values missing at random. True values σ²_X = 1 and θ = 0.7. Results from 1000 simulated datasets.

                Full data         Observed data estimates
                estimates    Complete case only   Single imputation   Multiple imputation
  σ²_X         1.00 (0.33)      1.01 (0.49)          0.79 (0.44)          0.96 (0.46)
  θ            0.69 (0.13)      0.68 (0.20)          0.79 (0.18)          0.70 (0.19)

Having set up an appropriate multiple imputation EDF F̂, resampling proceeds in an obvious way, first creating a full set of n pairs by random sampling from F̂, and then selecting m cases randomly without replacement for which the x values are "lost". The first stage is equivalent to random sampling with replacement from n − m copies of the complete data plus all m × (n − m) possible multiple imputation values. ■
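The sketch below (Python, with simulated bivariate normal data) builds the multiple-imputation pool and carries out this equivalent single-stage resampling; for brevity the resampled statistic is computed from the complete-data correlation of the cases retaining x, rather than by re-running the full imputation estimator on each resample.

    import numpy as np

    rng = np.random.default_rng(9)
    n, m = 20, 5
    u = rng.standard_normal(n)
    x = 0.7 * u + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal(n)
    u_obs, x_obs = u[:n - m], x[:n - m]           # complete pairs
    u_mis = u[n - m:]                              # x missing for these cases

    b = np.cov(u_obs, x_obs)[0, 1] / np.var(u_obs, ddof=1)    # regression slope
    xhat = x_obs.mean() + b * (u_mis - u_obs.mean())           # single imputations
    e = x_obs - (x_obs.mean() + b * (u_obs - u_obs.mean()))    # residuals

    # pool: n-m copies of each complete pair plus all m(n-m) imputation pairs,
    # so complete pairs get weight 1/n and each imputation 1/{n(n-m)}
    complete = list(zip(u_obs, x_obs))
    imputed = [(ui, xh + ek) for ui, xh in zip(u_mis, xhat) for ek in e]
    pool = np.array(complete * (n - m) + imputed)

    sample = pool[rng.integers(0, len(pool), n)]               # n pairs from F-hat
    keep = rng.choice(n, n - m, replace=False)                 # lose m x-values
    t_star = np.corrcoef(sample[keep, 0], sample[keep, 1])[0, 1]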


3.7 Finite Population Sampling

Basics

The simplest form of finite population sampling is when a sample is taken randomly without replacement from a population 𝒴 with values 𝒴_1, ..., 𝒴_N, with N > n known. The statistic t(y_1, ..., y_n) is used to estimate the corresponding population quantity θ = t(𝒴_1, ..., 𝒴_N). The data are one of the \binom{N}{n} possible samples Y_1, ..., Y_n from the population, and the without-replacement sampling means that the Y_j are exchangeable but not independent; the sampling fraction is defined to be f = n/N. If n ≪ N, f is very small and correlation among the Y_1, ..., Y_n will have little effect, but in practice f often lies in the range 0.1-0.5 and cannot be ignored. Dependence among the Y_j complicates inference for θ, as the following example indicates.

Example 3.13 (Sample average) Suppose that the y_j are scalar and that we want a confidence interval for the population average θ = N^{-1} Σ_j 𝒴_j. Although the sample average Ȳ = n^{-1} Σ_j Y_j is an unbiased estimator of θ, when sampling with and without replacement we find

    \mathrm{var}(\bar Y) = \begin{cases} (1 - N^{-1})\, n^{-1}\gamma, & \text{with replacement}, \\ (1 - f)\, n^{-1}\gamma, & \text{without replacement}, \end{cases} \tag{3.15}

where γ = (N − 1)^{-1} Σ_{j=1}^{N} (𝒴_j − θ)². The sample variance c = (n − 1)^{-1} Σ_j (y_j − ȳ)² is an unbiased estimate of γ, and the usual standard error for ȳ under without-replacement sampling is obtained from the second line of (3.15) by replacing γ with c. Normal approximation to the distribution of Ȳ then gives approximate (1 − 2α) confidence limits ȳ + (1 − f)^{1/2} c^{1/2} n^{-1/2} z_α for θ, where z_α is the α


quantile of the standard normal distribution. Such confidence intervals are a factor (1 − f)^{1/2} shorter than for sampling with replacement.

The lack of independence affects possible resampling plans, as is seen by applying the ordinary bootstrap to Ȳ. Suppose that Y*_1, ..., Y*_n is a random sample taken with replacement from y_1, ..., y_n. Their average Ȳ* has variance var*(Ȳ*) = n^{-2} Σ_j (y_j − ȳ)², and this has expected value n^{-2}(n − 1)γ over possible samples y_1, ..., y_n. This only matches the second line of (3.15) if f = n^{-1}. Thus for the larger values of f generally met in practice, ordinary bootstrap standard errors for ȳ are too large and the confidence intervals for θ are systematically too wide. ■

Modified sample size

The key difficulty with the ordinary bootstrap is that it involves with-replacement samples of size n and so does not capture the effect of the sampling fraction, which is to shrink the variance of an estimator. One way to deal with this is to take resamples of size n', resampling with or without replacement. The value of n' is chosen so that the estimator variance is matched, at least approximately.

For with-replacement resamples the average Ȳ* of Y*_1, ..., Y*_{n'} has variance var*(Ȳ*) = (n − 1)c/(n'n), which is only an unbiased estimate of (1 − f)γ/n when n' = (n − 1)/(1 − f); this usually exceeds n.

For without-replacement resampling, a similar argument implies that we should take n' = nf. One obvious difficulty with this is that if f ≪ 1, the resample size is much smaller than n, and then the resampled statistics may be much less stable than those based on samples of size n. This suggests that we mirror the dependence induced by sampling without replacement but try to match the original sample size, by resampling as follows. Suppose first that m = nf and k = n/m are both integers, and that to form our resample we concatenate k without-replacement samples of size m taken independently from y_1, ..., y_n. Then our resample has size n' = mk = n, and the same sampling fraction as the original data. This is known as the mirror-match bootstrap. When m and k are not integers we choose m to be the positive integer closest to nf and take k so that km ≤ n < (k + 1)m. We then select randomly either k or k + 1 without-replacement samples from y_1, ..., y_n, with probabilities chosen to match the original sampling fraction. If randomization is used it is important that it be incorporated correctly into the resampling scheme (Problem 3.15).

Population and superpopulation bootstraps

Suppose for the moment that N/n is an integer, k. Then one obvious idea is to form a fake population 𝒴̂ of size N by concatenating k copies of y_1, ..., y_n. The natural next step, which mimics how the data were sampled, is to generate a bootstrap replicate of y_1, ..., y_n by taking a sample of size n without replacement from 𝒴̂. So the bootstrap sample Y*_1, ..., Y*_n is one of the \binom{N}{n} possible without-replacement samples from 𝒴̂, and the corresponding bootstrap value is T* = t(Y*_1, ..., Y*_n).

If N/n is not an integer, we write N = kn + l, where 0 < l < n, and form 𝒴̂ by taking k copies of y_1, ..., y_n and adding to them a sample of size l taken without replacement from y_1, ..., y_n. Bootstrap samples are formed as when N = kn, but a different 𝒴̂ is used for each. We call this the population bootstrap. Under a superpopulation model, the members of the population 𝒴 are themselves a random sample from an underlying distribution 𝒫. The nonparametric maximum likelihood estimate of 𝒫 is the EDF of the sample, which suggests the following resampling plan.

Algorithm 3.2 (Superpopulation bootstrap)

For r = 1, ..., R,

1. generate a replicate population 𝒴* = (𝒴*_1, ..., 𝒴*_N) by sampling N times with replacement from y_1, ..., y_n; then
2. generate a bootstrap sample Y*_1, ..., Y*_n by sampling n times without replacement from 𝒴*, and set T* = t(Y*_1, ..., Y*_n).

As one would expect, this gives results similar to the population bootstrap.
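Both schemes are straightforward to code, as in this sketch (Python, with toy data), which also handles the N = kn + l adjustment for the population bootstrap; the function names are ours.

    import numpy as np

    def population_resample(y, N, rng):
        n = len(y)
        k, l = divmod(N, n)
        fake = np.concatenate([np.tile(y, k),
                               rng.choice(y, l, replace=False)])  # fake population, size N
        return rng.choice(fake, n, replace=False)     # n without replacement

    def superpopulation_resample(y, N, rng):
        fake = rng.choice(y, N, replace=True)         # step 1: replicate population
        return rng.choice(fake, len(y), replace=False)    # step 2: without replacement

    rng = np.random.default_rng(13)
    y = rng.poisson(100, 10).astype(float)
    tstar = np.array([population_resample(y, 49, rng).mean() for _ in range(999)])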

Example 3.14 (Sample average) Suppose that y_1, ..., y_n are scalars, that N = kn, and that interest focuses on θ = N^{-1} Σ_j 𝒴_j, as in Example 3.13. Then under the population bootstrap,

    \mathrm{var}^*(\bar Y^*) = \frac{N(n-1)}{(N-1)n} \times (1 - f)\, n^{-1} c,

and this is the correct formula apart from the first factor on the right, which is typically close to one. Under the superpopulation bootstrap a straightforward calculation establishes that the mean variance of Ȳ* is (n − 1)/n × (1 − f) n^{-1} c (Problem 3.12).

These sampling schemes make almost the right allowance for the sampling fraction, at least for the average.

For the mirror-match scheme we suppose that n = km for integer m, and write Ȳ* = n^{-1} Σ_{i=1}^{k} Σ_{j=1}^{m} Y*_{ij}, where (Y*_{i1}, ..., Y*_{im}) is the ith without-replacement resample, independent of the other without-replacement resamples. Then we can use (3.15) to establish that var*(Ȳ*) = (km)^{-1}(1 − m/n)c. Because our assumptions imply that f = m/n, this is an unbiased estimate of var(Ȳ), but it would be biased if m ≠ nf. ■


Studentized confidence intervals

Suppose that v = v(y_1, ..., y_n) is an estimated variance for the statistic t = t(y_1, ..., y_n), based on the without-replacement sample y_1, ..., y_n, and that some bootstrap scheme is used to form replicates t*_r and v*_r of t and v, for r = 1, ..., R. Then the studentized bootstrap can be used to form confidence intervals for θ, based on the values of z*_r = (t*_r − t)/v*_r^{1/2}. As outlined in Section 2.4, a (1 − 2α) confidence interval has limits

    t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{1/2} z^*_{((R+1)\alpha)},

where z*_{((R+1)p)} is the empirical p quantile of the z*_r. If the population or superpopulation bootstraps are used, and N, n → ∞ in such a way that f = n/N → π, where 0 < π < 1, these intervals can be shown to have the same good properties as when the Y_j are a random sample from an infinite population; see Section 5.4.1.

Example 3.15 (City population data) For a numerical assessment of the schemes outlined above, we consider again the data in Example 1.2, on 1920 and 1930 populations (in thousands) of N = 49 US cities. Table 2.1 contains populations y_j = (u_j, x_j) for a sample of n = 10 cities taken without replacement from the 49, and we use them to estimate the mean 1930 population θ = N^{-1} Σ_j x_j for the 49 cities.

Two standard estimators of θ are the ratio and regression estimators. The ratio estimate and its estimated variance are given by

    t_{rat} = \bar u_N \frac{\sum_{j=1}^{n} x_j}{\sum_{j=1}^{n} u_j}, \qquad v_{rat} = \frac{1-f}{n(n-1)} \sum_{j=1}^{n} \left( x_j - \frac{t_{rat}}{\bar u_N}\, u_j \right)^2, \tag{3.16}

where ū_N = N^{-1} Σ_{j=1}^{N} u_j is the known population average of u. For our data t_rat = 156.8 and v_rat = 10.85². The regression estimate is based on the straight-line regression x = β₀ + β₁u fitted to the data (u_1, x_1), ..., (u_n, x_n), using least squares estimates β̂₀ and β̂₁. The regression estimate of θ and its estimated variance are

    t_{reg} = \hat\beta_0 + \hat\beta_1 \bar u_N, \qquad v_{reg} = \frac{1-f}{n(n-1)} \sum_{j=1}^{n} (x_j - \hat\beta_0 - \hat\beta_1 u_j)^2; \tag{3.17}

for our data t_reg = 138.3 and v_reg = 8.32².

Table 3.7 contains 95% confidence intervals for θ based on normal approximations to t_rat and t_reg, and on the studentized bootstrap applied to (3.16) and (3.17). Normal approximations to the distributions of t_rat and t_reg are poor, and intervals based on them are considerably shorter than the other intervals. The population and superpopulation bootstraps give rather similar intervals.

The sampling fraction is f = 10/49, so the estimate of the distribution of T* using modified sample size and without-replacement resampling uses


Table 3.7 City population data: 95% confidence limits for the mean population per city in 1930 based on the ratio and regression estimates, using normal approximation and various resampling methods with R = 999.

  Scheme                      Ratio            Regression
  Normal                  137.8   174.7     123.7   152.0
  Modified size, n' = 2    58.9   298.6          —
  Modified size, n' = 11  111.9   196.2       …     258.2
  Mirror-match, m = 2     115.6   196.0     112.8   258.7
  Population              118.9   193.3     116.1   240.7
  Superpopulation         120.3   195.9     114.0   255.4

Table 3.8 City population data. Empirical coverages (%) and average and standard deviation of length of 90% confidence intervals based on the ratio estimate of the 1930 total, based on 1000 samples of size 10 from the population of size 49. The nominal lower, upper and overall coverages are 5, 95 and 90.

                              Coverage            Length
  Scheme                Lower  Upper  Overall  Average   SD
  Normal                   7     89      82       23     8.2
  Modified size, n' = 2    1     98      98      151     142
  Modified size, n' = 11   2     91      89       34     19
  Mirror-match, m = 2      3     91      88       33     19
  Population               2     91      89       36     21
  Superpopulation          1     92      91       41     24

samples of size n' = 2. Not surprisingly, without-replacement resamples of size n' = 2 from 10 observations give a very poor idea of what happens when samples of size 10 are taken without replacement from 49 observations, and the corresponding confidence interval is very wide. Studentized bootstrap confidence limits cannot be based on t_reg, because with n' = 2 we have v*_reg = 0. For with-replacement resampling, we take n' = (n − 1)/(1 − f) ≈ 11, giving intervals quite close to those for the mirror-match, population and superpopulation bootstraps.

Figure 3.6 shows why the upper endpoints of the ratio and regression confidence intervals differ so much. The variance estimate v*_reg is unstable because of resamples in which case 4 does not appear and case 9 appears just once or not at all; then z*_reg takes large negative values. The right panel of the figure explains this: the regression slope changes markedly when case 4 is deleted. Exclusion of case 9 further reduces the regression sum of squares and hence v*_reg. The ratio estimate is much less sensitive to case 4. If we insisted on using t_reg, one solution would be to exclude from the simulation samples in which case 4 does not appear. Then the 0.025 and 0.975 quantiles of z*_reg using the population bootstrap are −1.30 and 3.06, and the corresponding confidence interval is [112.9, 149.1].


Figure 3.6 Population bootstrap results for the regression estimator based on the city data with n = 10. The left panel shows values of z*_reg and v*_reg^{1/2} for resamples in which case 4 appears at least once (dots), and in which case 4 does not appear and case 9 appears zero times (0), once (1), or more times (+); the dotted line shows … The right panel shows the sample and the regression lines fitted to the data with case 4 (dashes) and without it (dots); the vertical line shows the value ū_N at which θ is estimated.


To compare the performances of the various methods in setting confidence intervals, we conducted a numerical experiment in which 1000 samples of size n = 10 were taken without replacement from the population of size N = 49. For each sample we calculated 90% confidence intervals [L, U] for θ using R = 999 bootstrap samples. Table 3.8 contains the empirical values of Pr(θ < L), Pr(θ < U), and Pr(L < θ < U). The normal intervals are short and their coverages are much too small, while the modified intervals with n' = 2 have the opposite problem. Coverages for the modified sample size with n' = 11 and for the population and superpopulation bootstrap are close to their nominal levels, though their endpoints seem to be slightly too far left. The 80% and 95% intervals and those for the regression estimator have similar properties. In line with other studies in the literature, we conclude that the population and superpopulation bootstraps are the best of those considered here. ■

Stratified sampling

In most applications the population is divided into k strata, the ith of which contains N_i individuals from which a sample of size n_i is taken without replacement, independent of other strata. The ith sampling fraction is f_i = n_i/N_i and the proportion of the population in the ith stratum is w_i = N_i/N, where N = N_1 + ⋯ + N_k. The estimate of θ and its standard error are found by combining quantities from each stratum.

Two different setups can be envisaged for mathematical discussion. In the first, the "small-k" case, there is a small number of large strata: the asymptotic regime takes k fixed and n_i, N_i → ∞ with f_i → π_i, where 0 < π_i < 1.


Apart from there being k strata, the same ideas and results will apply as above, with the chosen resampling scheme applied separately in each stratum. The second setup, the "large-k" case, is where there are many small strata; in mathematical terms we suppose that k → ∞ but that N_i and n_i are bounded. This situation is more complicated, because biases from each stratum can combine in such a way that a bootstrap fails completely.

Example 3.16 (Average) Suppose that the population 𝒴 comprises k strata, and that the jth item in the ith stratum is labelled 𝒴_{ij}; the average for that stratum is 𝒴̄_i. Then the population average is θ = Σ_i w_i 𝒴̄_i, which is estimated by T = Σ_i w_i Ȳ_i, where Ȳ_i is the average of the sample Y_{i1}, ..., Y_{in_i} from the ith stratum. The variance of T is

    V = \sum_{i=1}^{k} w_i^2 (1 - f_i) \times \frac{1}{n_i} \times \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (\mathcal{Y}_{ij} - \bar{\mathcal{Y}}_i)^2, \tag{3.18}

an unbiased estimate of which is

    v = \sum_{i=1}^{k} w_i^2 (1 - f_i) \times \frac{1}{n_i(n_i - 1)} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2. \tag{3.19}

Suppose for sake of simplicity that each N_i/n_i is an integer, and that the population bootstrap is applied to each stratum independently. Then the variance of the bootstrap version of T is

    \mathrm{var}^*(T^*) = \sum_{i=1}^{k} w_i^2 (1 - f_i) \times \frac{N_i(n_i - 1)}{(N_i - 1)n_i^2} \times c_i, \qquad c_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2, \tag{3.20}

the mean of which is obtained by replacing the last term on the right by (N_i − 1)^{-1} Σ_j (𝒴_{ij} − 𝒴̄_i)². If k is fixed and n_i, N_i → ∞ while f_i → π_i, (3.20) will converge to v, but this will not be the case if n_i, N_i are bounded and k → ∞. The bootstrap bias estimate also may fail for the same reason (Problem 3.12).

For setting confidence intervals using the studentized bootstrap the key issue is not the performance of bias and variance estimates, but the extent to which the distribution of the resampled quantity Z* = (T* − t)/v*^{1/2} matches that of Z = (T − θ)/v^{1/2}. Detailed calculations show that when the population and superpopulation bootstraps are used, Z and Z* have the same limiting distribution under both asymptotic regimes, and that under the fixed-k setup the approximation is better than that using the other resampling plans.

Example 3.17 (Stratified ratio)  For empirical comparison of the more promising of these finite population resampling schemes with stratified data, we generated a population with N pairs (u, x) divided into strata of sizes N_1, ..., N_k



Table 3.9  Empirical coverages (%) of nominal 90% confidence intervals using the ratio estimate for a population average, based on 1000 stratified samples from populations with k strata of size N_i, from each of which a sample of size n_i = N_i/3 was taken without replacement. The nominal lower (L), upper (U) and overall (O) coverages are 5, 95 and 90.

                     k = 20, N_i = 18    k = 5, N_i = 72    k = 3, N_i = 18
                      L    U    O         L    U    O         L    U    O
  Normal              5   93   88         4   94   90         7   93   86
  Modified size       6   94   89         4   94   90         6   96   90
  Mirror-match        9   92   83         8   90   82         6   94   88
  Population          6   95   89         5   95   90         6   95   89
  Superpopulation     3   97   95         2   98   96         3   98   96

according to the ordered values of u. The aim was to form 90% confidence intervals for

\theta = \frac{1}{N} \sum_{i=1}^k \sum_{j=1}^{N_i} x_{ij},

where x_ij is the value of x for the jth element of stratum i. We took independent samples (u_ij, x_ij) of sizes n_i without replacement from

the ith stratum, and used these to form the ratio estimate of θ and its estimated variance, given by

t = \sum_{i=1}^k w_i \bar{u}_i t_i, \qquad v = \sum_{i=1}^k w_i^2 (1 - f_i) \times \frac{1}{n_i(n_i - 1)} \sum_{j=1}^{n_i} \Bigl(\frac{\bar{u}_i}{\hat{u}_i}\Bigr)^2 (x_{ij} - t_i u_{ij})^2,

where

t_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{\sum_{j=1}^{n_i} u_{ij}}, \qquad \hat{u}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} u_{ij}, \qquad \bar{u}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} u_{ij};

these extend (3.16) to stratified sampling. We used bootstrap resamples with R = 199 to compute studentized bootstrap confidence intervals for θ based on 1000 different samples from simulated datasets. Table 3.9 shows the empirical coverages of these confidence intervals in three situations, a "large-k" case with k = 20, N_i = 18 and n_i = 6, a "small-k" case with k = 5, N_i = 72 and n_i = 24, and a "small-k" case with k = 3, N_i = 18 and n_i = 6. The modified sampling method used sampling with replacement, giving samples of size n′ = 7 when n_i = 6 and size n′ = 34 when n_i = 24, while the corresponding values of m for the mirror-match method were 3 and 8. Throughout f_i = 1/3.

In all three cases the coverages for normal, population and modified sample size intervals are close to nominal, while the mirror-match method does poorly. The superpopulation method also does poorly, perhaps because it was applied to separate strata rather than used to construct a new population to be stratified at each replicate. Similar results were obtained for nominal 80% and 95% confidence limits. Overall the population bootstrap and modified sample



size methods do best in this limited comparison, and coverage is not improved by using the more complicated mirror-match method. ■

3.8 Hierarchical Data

In some studies the variation in responses may be hierarchical or multilevel, as happens in repeated-measures experiments and the classical split-plot experiment. Depending upon the nature of the parameter being estimated, it may be important to take careful account of the two (or more) sources of variation when setting up a resampling scheme. In principle there should be no difficulty with parametric resampling: having fitted the model parameters, resample data will be generated according to a completely defined model. Nonparametric resampling is not straightforward: certainly it will not make sense to use simple nonparametric resampling, which treats all observations as independent. Here we discuss some of the basic points about nonparametric resampling in a relatively simple context.

Perhaps the most basic problem involving hierarchical variation can be formulated as follows. For each of a groups we obtain b responses y_ij such that

y_{ij} = x_i + z_{ij}, \qquad i = 1, \ldots, a, \quad j = 1, \ldots, b,    (3.21)

where the x_i are randomly sampled from F_x and independently the z_ij are randomly sampled from F_z, with E(Z) = 0 to force uniqueness of the model. Thus there is homogeneity of variation in Z between groups, and the structure is additive. The feature of this model that complicates resampling is the correlation between observations within a group,

\mathrm{var}(Y_{ij}) = \sigma_x^2 + \sigma_z^2, \qquad \mathrm{cov}(Y_{ij}, Y_{ik}) = \sigma_x^2, \quad j \neq k.    (3.22)

For data having this nested structure, one might be interested in parameters of F_x or F_z or some combination of both. For example, when testing for presence of variation in X the usual statistic of interest is the ratio of between-group and within-group sums of squares.

How should one resample nonparametrically for such a data structure? There are two simple strategies, for both of which the first stage is to randomly sample groups with replacement. At the second stage we randomly sample within the groups selected at the first stage, either without replacement (Strategy 1) or with replacement (Strategy 2). Note that Strategy 1 keeps selected groups intact. To see which strategy is likely to work better, we look at the second moments of resampled data y*_ij to see how well they match (3.22). Consider selecting y*_{i1}, ..., y*_{ib}. At the first stage we select a random integer I* from {1, 2, ..., a}. At the second stage, we select random integers from {1, 2, ..., b}, either without replacement (Strategy 1) or with replacement (Strategy 2): the sampling without replacement is equivalent to keeping the I*th group intact. Under both strategies

E^*(Y^*_{ij} \mid I^* = i') = \bar{y}_{i'}.

However,

E^*(Y^*_{ij} Y^*_{ik} \mid I^* = i') =
\begin{cases}
\{b(b-1)\}^{-1} \sum_{l \neq m} y_{i'l}\, y_{i'm}, & \text{Strategy 1}, \\
b^{-2} \sum_{l=1}^b \sum_{m=1}^b y_{i'l}\, y_{i'm}, & \text{Strategy 2}.
\end{cases}

Therefore

E^*(Y^*_{ij}) = \bar{y}, \qquad \mathrm{var}^*(Y^*_{ij}) = \frac{SS_B}{a} + \frac{SS_W}{ab},    (3.23)

and

\mathrm{cov}^*(Y^*_{ij}, Y^*_{ik}) =
\begin{cases}
SS_B/a - SS_W/\{ab(b-1)\}, & \text{Strategy 1}, \\
SS_B/a, & \text{Strategy 2},
\end{cases}    (3.24)

where ȳ = a^{-1} Σ ȳ_i, SS_B = Σ_{i=1}^a (ȳ_i − ȳ)² and SS_W = Σ_{i=1}^a Σ_{j=1}^b (y_ij − ȳ_i)². To see how well the resampling variation mimics (3.22), we calculate expectations of (3.23) and (3.24), using

E(SS_B) = (a-1)(\sigma_x^2 + b^{-1}\sigma_z^2), \qquad E(SS_W) = a(b-1)\sigma_z^2.

This gives

E\{\mathrm{var}^*(Y^*_{ij})\} = (1 - a^{-1})\sigma_x^2 + \{1 - (ab)^{-1}\}\sigma_z^2

and

E\{\mathrm{cov}^*(Y^*_{ij}, Y^*_{ik})\} =
\begin{cases}
(1 - a^{-1})\sigma_x^2 - (ab)^{-1}\sigma_z^2, & \text{Strategy 1}, \\
(1 - a^{-1})\sigma_x^2 + (1 - a^{-1}) b^{-1}\sigma_z^2, & \text{Strategy 2}.
\end{cases}

On balance, therefore, Strategy 1 more closely mimics the variation properties of the data, and so is the preferable strategy. Resampling should work well so long as a is moderately large, say at least 10, just as resampling homogeneous data works well if n is moderately large. Of course both strategies would work well if both a and b were very large, but this is rarely the case.

An application of these results is given in Example 6.9.

The preceding discussion would apply to balanced data structures, but not to more complex situations, for which a more general approach is required. A direct, model-based approach would involve resampling from suitable estimates of the two (or more) data distributions, generalizing the resampling from F̂ in Chapter 2. Here we outline how this might work for the data structure (3.21).
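In code, the two strategies differ only at the second stage. The following sketch is our illustration for a balanced a x b data array; the names are hypothetical.

    import numpy as np

    def resample_hierarchical(y, strategy=1, seed=None):
        """One bootstrap replicate of an a x b array of grouped data y.

        Stage 1: sample a group indices with replacement.
        Stage 2: take each selected group intact (Strategy 1; sampling all
        b indices without replacement is just a permutation of the group),
        or sample b values with replacement within the group (Strategy 2).
        """
        rng = np.random.default_rng(seed)
        a, b = y.shape
        groups = rng.integers(0, a, size=a)    # first-stage sampling
        y_star = np.empty_like(y)
        for i, g in enumerate(groups):
            if strategy == 1:
                y_star[i] = y[g]               # keep group g intact
            else:
                y_star[i] = y[g, rng.integers(0, b, size=b)]
        return y_star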



Estimates of the two CDFs F_x and F_z can be formed by first estimating the xs and zs, and then using their EDFs. A naive version of this, which parallels standard linear model theory, is to define

\hat{x}_i = \bar{y}_i, \qquad \hat{z}_{ij} = y_{ij} - \bar{y}_i.    (3.25)

The resulting way to obtain a resampled dataset is to

1  choose x*_1, ..., x*_a by randomly sampling with replacement from x̂_1, ..., x̂_a; then

2  choose z*_11, ..., z*_ab by randomly sampling ab times with replacement from ẑ_11, ..., ẑ_ab; and finally

3  set y*_ij = x*_i + z*_ij, i = 1, ..., a, j = 1, ..., b.

Straightforward calculations (Problem 3.17) show that this approach has the same second-moment properties as Strategy 2 earlier, shown in (3.23) and (3.24), which are not satisfactory. Somewhat predictably, Strategy 1 is mimicked by choosing z*_{i1}, ..., z*_{ib} randomly with replacement from one group of residuals ẑ_{k1}, ..., ẑ_{kb} — either a randomly selected group or the group corresponding to x*_i (Problem 3.17).

What has gone wrong here is that the estimates x̂_i in (3.25) have excess variation, namely (a − 1)^{-1}E(SS_B) = σ_x² + b^{-1}σ_z², relative to σ_x². The estimates ẑ_ij defined in (3.25) will be satisfactory provided b is reasonably large, although in principle they should be standardized to

\tilde{z}_{ij} = \frac{\hat{z}_{ij}}{(1 - b^{-1})^{1/2}}.    (3.26)

The excess variation in the x̂_i can be corrected by using the shrinkage estimate

\tilde{x}_i = c\bar{y}_{\cdot\cdot} + (1 - c)\bar{y}_{i\cdot},

where c is given by

(1 - c)^2 = 1 - \frac{SS_W}{b(b-1)SS_B},

or c = 1 if the right-hand side is negative. A straightforward calculation shows that this choice for c makes the variance of the x̃*_i equal to the components of variance estimator of σ_x²; see Problem 3.18. Note that the wisdom of matching first and second moments may depend upon θ being a function of such moments.
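The pieces above can be combined into a model-based resampling routine. The sketch below is our reading of the scheme, assuming the balanced setup of (3.21): it standardizes the residuals as in (3.26), shrinks the group effects, and draws each row's residuals from a single randomly chosen group so as to mimic Strategy 1.

    import numpy as np

    def resample_components(y, seed=None):
        """Model-based bootstrap replicate for y_ij = x_i + z_ij (balanced)."""
        rng = np.random.default_rng(seed)
        a, b = y.shape
        ybar_i = y.mean(axis=1)                    # group averages
        ybar = ybar_i.mean()                       # grand average
        ss_b = ((ybar_i - ybar) ** 2).sum()
        ss_w = ((y - ybar_i[:, None]) ** 2).sum()
        # residuals standardized as in (3.26)
        z = (y - ybar_i[:, None]) / np.sqrt(1 - 1 / b)
        # shrinkage constant: (1 - c)^2 = 1 - SS_W / {b(b-1) SS_B}, or c = 1
        rhs = 1 - ss_w / (b * (b - 1) * ss_b)
        c = 1.0 if rhs < 0 else 1 - np.sqrt(rhs)
        x = c * ybar + (1 - c) * ybar_i            # shrunken group effects
        x_star = rng.choice(x, size=a, replace=True)
        # draw each row's residuals from one randomly chosen group of
        # residuals, which mimics the preferable Strategy 1
        rows = rng.integers(0, a, size=a)
        z_star = np.stack([rng.choice(z[k], size=b, replace=True) for k in rows])
        return x_star[:, None] + z_star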



3.9 Bootstrapping the Bootstrap

3.9.1 Bias correction of bootstrap calculations

As with most statistical methods, the bootstrap does not provide exact answers. For example, the basic confidence interval methods outlined in Section 2.4 do not have coverage exactly equal to the target, or nominal, coverage. Similarly the bias and variance estimates B and V of Section 2.2.1 are typically biased. In many cases the discrepancies involved are not practically important, or there is some specific remedy — as with the improved confidence limit methods of Chapter 5. Nevertheless it is useful to have available a general technique for making a bias correction to a bootstrap calculation. That technique is the bootstrap itself. Here we describe how to apply the bootstrap to improve estimation of the bias of an estimator in the simple situation of a single random sample.

In the notation of Chapter 2, the estimator T = t(F̂) has bias

\beta = b(F) = E(T) - \theta = E\{t(\hat{F}) \mid F\} - t(F).

The bootstrap estimate of this bias is

B = b(\hat{F}) = E^*(T^*) - T = E^*\{t(\hat{F}^*) \mid \hat{F}\} - t(\hat{F}),    (3.27)

where F̂* denotes either the EDF of the bootstrap sample Y*_1, ..., Y*_n drawn from F̂ or the parametric model fitted to that sample. Thus the calculation applies to both parametric and nonparametric situations. There is both random variation and systematic bias in B in general: it is the bias with which we are concerned here.

As with T itself, so with B: the bias can be estimated using the bootstrap. If we write γ = c(F) = E(B | F) − b(F), then the simple bootstrap estimate according to the general principle laid out in Chapter 2 is C = c(F̂). From the definition of c(F) this implies

C = E^*(B^* \mid \hat{F}) - B,

the bootstrap estimate of the bias of B. To see just what C involves, we use the definition of B in (3.27) to obtain

C = E^*\bigl[E^{**}\{t(\hat{F}^{**}) \mid \hat{F}^*\} - t(\hat{F}^*) \mid \hat{F}\bigr] - \bigl[E^*\{t(\hat{F}^*) \mid \hat{F}\} - t(\hat{F})\bigr];    (3.28)

or more simply, after combining terms,

C = E^*\{E^{**}(T^{**})\} - 2E^*(T^* \mid \hat{F}) + T.    (3.29)

Here F̂** denotes the EDF of a sample drawn from F̂*, or the parametric model fitted to that sample; T** is the estimate computed with that sample; and E** denotes expectation over the distribution of that sample conditional on F̂*. There are two levels of bootstrapping in this procedure, which is therefore



called the nested or double bootstrap. In principle a nested bootstrap might involve more than two levels, but in practice the computational burden would ordinarily be too great for more than two levels to be worthwhile, and we shall assume that a nested bootstrap has just two levels.

The adjusted estimate of the bias of T is

B_{adj} = B - C.

Since typically bias is of order n^{-1}, the adjustment C is typically of order n^{-2}. The following example gives a simple illustration of the adjustment.

Example 3.18 (Sample variance)  Suppose that T = n^{-1} Σ(Y_j − Ȳ)² is used to estimate var(Y) = σ². Since E{Σ(Y_j − Ȳ)²} = (n − 1)σ², the bias of T is easily seen to be β = −n^{-1}σ², which the bootstrap estimates by B = −n^{-1}T. The bias of this bias estimate is E(B) − β = n^{-2}σ², which the bootstrap estimates by C = n^{-2}T. Therefore the adjusted bias estimate is

B - C = -n^{-1}T - n^{-2}T.

That this is an improvement can be checked by showing that it has expectation β(1 − n^{-2}), whereas B has expectation β(1 − n^{-1}). ■

In most applications bootstrap calculations are approximated by simulation. So, as explained in Chapter 2, for most estimators T we would approximate the bias B by R^{-1} Σ t*_r − t, using the resampled values t*_1, ..., t*_R and the data value t of the estimator. Likewise the expectations involved in the bias adjustment C will usually be approximated by simulation. The calculation is as follows.

Algorithm 3.3 (Double bootstrap for bias adjustment)

For r = 1, ..., R:

1  generate the rth original bootstrap sample y*_1, ..., y*_n and then t*_r, by

   • sampling at random from y_1, ..., y_n (nonparametric case) or
   • sampling parametrically from the fitted model (parametric case);

2  obtain M second-level bootstrap samples y**_1, ..., y**_n, either by

   • sampling with replacement from y*_1, ..., y*_n (nonparametric case) or
   • sampling from the model fitted to y*_1, ..., y*_n (parametric case);

3  evaluate the estimator T for each of the M second-level samples to give t**_{r1}, ..., t**_{rM}.

Then approximate the bias adjustment C in (3.29) by

C = \frac{1}{RM} \sum_{r=1}^R \sum_{m=1}^M t^{**}_{rm} - \frac{2}{R} \sum_{r=1}^R t^*_r + t.    (3.30)
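In the nonparametric case Algorithm 3.3 takes only a few lines. The following sketch is ours; stat is any function of a sample array, and the default R and M reflect the remark below about using M = 1 with a larger R.

    import numpy as np

    def double_bootstrap_bias(y, stat, R=500, M=1, seed=None):
        """Return the bias estimate B and its adjustment C of (3.30)."""
        rng = np.random.default_rng(seed)
        n, t = len(y), stat(y)
        t1 = np.empty(R)                  # first-level values t*_r
        t2 = np.empty((R, M))             # second-level values t**_rm
        for r in range(R):
            y1 = rng.choice(y, size=n, replace=True)
            t1[r] = stat(y1)
            for m in range(M):
                y2 = rng.choice(y1, size=n, replace=True)
                t2[r, m] = stat(y2)
        B = t1.mean() - t
        C = t2.mean() - 2 * t1.mean() + t
        return B, C

The adjusted bias estimate is then B − C.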



At first sight it would seem that to apply (3.30) successfully would involve a vast amount of computation. If a general rule is to use at least 100 samples when bootstrapping, this would imply a total of RM + R = 10100 simulated samples and evaluations of t. But this is unnecessary, because of theoretical and computational techniques that can be used, as explained in Chapter 9. For the case of the bias B discussed here, the simulation variance of B − C would be no greater than it was for B if we used M = 1 and increased R by a factor of about 5, so that a total of about 500 samples would seem reasonable; see Problem 3.19.

More complicated applications of the technique are discussed in Example 3.26 and in Chapters 4 and 5.

Theory

It may be intuitively clear that bootstrapping the bootstrap will reduce the order of bias in the original bootstrap calculation, at least in simple situations such as Example 3.18. However, in some situations the order of the reduction may not be clear. Here we outline a general calculation which provides the answer, so long as the quantity being estimated by the bootstrap can be expressed in terms of an estimating equation. For simplicity we focus on the single-sample case, but the calculations extend quite easily.

Suppose that the quantity β = b(F) being estimated by the bootstrap is defined by the estimating equation

E\{h(\hat{F}, F; \beta) \mid F\} = 0,    (3.31)

where h(G, F; β) is chosen to be of order one. The bootstrap solution is β̂ = b(F̂), which therefore solves

E^*\{h(\hat{F}^*, \hat{F}; \hat{\beta}) \mid \hat{F}\} = 0.

In general β̂ has a bias of order n^{-a}, say, where typically a is 1 or 3/2. Therefore, for some e(F) that is of order one, we can write

E\{h(\hat{F}, F; \hat{\beta}) \mid F\} = e(F)\, n^{-a}.    (3.32)

To correct for this bias we introduce the ideal perturbation γ = c_n(F) which modifies b(F) to b(F, γ) in order to achieve

E[h\{\hat{F}, F; b(F, \gamma)\} \mid F] = 0.    (3.33)

There is usually more than one way to define b(F, γ), but we shall assume that γ is defined to make b(F, 0) = b(F). The bootstrap estimate for γ is γ̂ = c_n(F̂), which is the solution to

E^*[h\{\hat{F}^*, \hat{F}; b(\hat{F}^*, \gamma)\} \mid \hat{F}] = 0,

and the adjusted value of β̂ is then β̂_adj = b(F̂, γ̂); it is b(F̂*, γ) that requires the second level of resampling.


What we want to see is the effect of substituting β̂_adj for β̂ in (3.32). First we approximate the solution to (3.33). Taylor expansion about γ = 0, together with (3.32), gives

E[h\{\hat{F}, F; b(F, \gamma)\} \mid F] = e(F)\, n^{-a} + d_n(F)\,\gamma,    (3.34)

where

d_n(F) = \frac{\partial}{\partial\gamma} E[h\{\hat{F}, F; b(F, \gamma)\} \mid F] \Big|_{\gamma=0}.

Typically d_n(F) → d(F) ≠ 0, so that if we write r(F) = e(F)/d(F) then (3.33) and (3.34) together imply that

\gamma = c_n(F) = -r(F)\, n^{-a}.

This, together with the corresponding approximation for γ̂ = c_n(F̂), gives

\hat{\gamma} - \gamma = -n^{-a}\{r(\hat{F}) - r(F)\} = -n^{-a-1/2} X_n,

say. The quantity

X_n = n^{1/2}\{r(\hat{F}) - r(F)\}

is O_p(1) because F̂ and F differ by O_p(n^{-1/2}). It follows that, because γ = O(n^{-a}),

h\{\hat{F}, F; b(\hat{F}, \hat{\gamma})\} = h\{\hat{F}, F; b(\hat{F}, \gamma)\} - n^{-a-1/2} X_n \frac{\partial}{\partial\gamma} h\{\hat{F}, F; b(\hat{F}, \gamma)\}\Big|_{\gamma=0}.    (3.35)

We can now assess the effect of the adjustment from β̂ to β̂_adj. Define the conditional quantity

k_n(X_n) = \frac{\partial}{\partial\gamma} E[h\{\hat{F}, F; b(\hat{F}, \gamma)\} \mid X_n, F]\Big|_{\gamma=0},

which is O_p(1). Then taking expectations in (3.35) we deduce that, because of (3.34),

E[h\{\hat{F}, F; b(\hat{F}, \hat{\gamma})\} \mid F] = -n^{-a-1/2} E\{X_n k_n(X_n) \mid F\}.    (3.36)

In most applications E{X_n k_n(X_n) | F} = O(n^{-b}) for b = 0 or 1/2, so comparing (3.36) with (3.32) we see that the adjustment reduces the order of the bias by a factor of at least n^{-1/2}.

Example 3.19 (Adjusted bias estimate)  In the case of the bias β = E(T | F) − θ, we take h(F̂, F; β) = t(F̂) − t(F) − β and b(F, γ) = b(F) − γ. In regular problems the bias and its estimate are of order n^{-1}, and in (3.32) a = 2. It is easy to check that d_n(F) = 1, so that X_n = n^{1/2}{e(F̂) − e(F)} and

k_n(X_n) = \frac{\partial}{\partial\gamma} E\{t(\hat{F}) - t(F) - (\hat{\beta} - \gamma) \mid e(\hat{F}), F\}\Big|_{\gamma=0} = 1.

Note that if the next term in expansion (3.34) were O(n^{-a-c}), then the right-hand side of (3.35) would strictly be O(n^{-a-1/2}) + O(n^{-a-c-1/2}). In almost all cases this will lead to the same conclusion.



This implies that

E\{X_n k_n(X_n) \mid F\} = n^{1/2} E\{e(\hat{F}) - e(F) \mid F\} = O(n^{-1/2}).

Equation (3.36) then becomes E{T − θ − (β̂ − γ̂)} = O(n^{-3}). This generalizes the conclusion of Example 3.18, that the adjusted bootstrap bias estimate β̂ − γ̂ is correct to second order. ■

Further applications of the double bootstrap to significance tests and confidence limits are described in Sections 4.5 and 5.6 respectively.

3.9.2 Variation of properties of T

A somewhat different application of bootstrapping the bootstrap concerns assessment of how the distribution of T depends on the parameters of F. Suppose, for example, that we want to know how the variance of T depends upon θ and other unknown model parameters, but that this variance cannot be calculated theoretically. One possible application is to the search for a variance-stabilizing transformation.

The parametric case does not require nested bootstrap calculations. However, it is useful to outline the approach in a form that can be mimicked in the nonparametric case. The basic idea is to approximate var(T | ψ) = v(ψ) from simulated samples for an appropriately broad range of parameter values. Thus we would select a set of parameter values ψ_1, ..., ψ_K, for each of which we would simulate R samples from the corresponding parametric model, and compute the corresponding R values of T. This would give t*_{k1}, ..., t*_{kR}, say, for the model with parameter value ψ_k. Then the variance v(ψ_k) = var(T | ψ_k) would be approximated by

\hat{v}(\psi_k) = \frac{1}{R-1} \sum_{r=1}^R (t^*_{kr} - \bar{t}^*_k)^2,    (3.37)

where t̄*_k = R^{-1} Σ_{r=1}^R t*_{kr}. Plots of v̂(ψ_k) against components of ψ_k can then be used to see how var(T) depends on ψ. Example 2.13 shows an application of this. The same simulation results can also be used to approximate other properties, such as the bias or quantiles of T, or the variance of transformed T.

As described here the number of simulated datasets will be RK, but in fact this number can be reduced considerably, as we shall show in Section 9.4.4. The simulation can be bypassed completely if we estimate v(ψ_k) by a delta-method variance approximation v_L(ψ_k), based on the variance of the influence function under the parametric model. However, this will often be impossible.

In the nonparametric case there appears to be a major obstacle to performing calculations analogous to (3.37), namely the unavailability of models corresponding to a series of parameter values ψ_1, ..., ψ_K. But this obstacle can



be overcome, at least partially. Suppose for simplicity that we have a single-sample problem, so that the EDF F̂ is the fitted model, and imagine that we have drawn R independent bootstrap samples from this model. These bootstrap samples can be represented by their EDFs F̂*_r, which can be thought of as the analogues of parametric models defined by R different values of parameter ψ. Indeed the corresponding values of θ = t(F) are simply t(F̂*_r) = t*_r, and other components of ψ can be defined similarly using the representation ψ = p(F). This gives us the same framework as in the parametric case above. For example, consider variance estimation. To approximate var(T) under parameter value ψ*_r = p(F̂*_r), we simulate M samples from the corresponding model F̂*_r; calculate the corresponding values of T, which we denote by t**_{rm}, m = 1, ..., M; and then calculate the analogue of (3.37),

v^*_r = \hat{v}(\psi^*_r) = \frac{1}{M} \sum_{m=1}^M (t^{**}_{rm} - \bar{t}^{**}_r)^2,    (3.38)

with t̄**_r = M^{-1} Σ_{m=1}^M t**_{rm}. The scatter plot of v*_r against t*_r will then be a proxy for the ideal plot of var(T | ψ) against θ, and similarly for other plots.
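In outline, and in our notation rather than anything from the text, the pairs (t*_r, v*_r) can be computed as follows; the plotting step is omitted.

    import numpy as np

    def nested_variance_pairs(y, stat, R=999, M=50, seed=None):
        """Bootstrap replicates t*_r with nested variance estimates v*_r (3.38)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        t_star, v_star = np.empty(R), np.empty(R)
        for r in range(R):
            y1 = rng.choice(y, size=n, replace=True)    # the "model" F*_r
            t_star[r] = stat(y1)
            t2 = np.array([stat(rng.choice(y1, size=n, replace=True))
                           for _ in range(M)])
            v_star[r] = ((t2 - t2.mean()) ** 2).mean()  # divisor M, as in (3.38)
        return t_star, v_star

A scatter plot of v_star (or its square root) against t_star, with a smoother added, is the proxy plot described above.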

Example 3.20 (City population data)  Figure 3.7 shows the results of the double bootstrap procedure outlined above, for the ratio estimator applied to the data in Table 2.1, with n = 10. The left panel shows the bias b*_r estimated using M = 50 second-level bootstrap samples from each of R = 999 first-level bootstrap samples. The right panel shows the corresponding standard errors v*_r^{1/2}. The lines from applying a locally weighted robust smoother confirm the clear increase with the ratio in each panel.

The implication of Figure 3.7 is that the bias and variance of the ratio are not stable with n = 10. Confidence intervals for the true ratio θ based on normal approximations to the distribution of T − θ will therefore be poor, as will basic bootstrap confidence intervals, and those based on related quantities such as the studentized bootstrap are suspect. A reasonable interpretation of the right panel is that var(T) ∝ θ², so that log T should be more stable. ■

The particular application of variance estimation can be handled in a simpler way, at least approximately. If the nonparametric delta method variance approximation v_L (Sections 2.7.2 and 3.2.1) is fairly accurate, which is to say if the linear approximation (2.35) or (3.1) is accurate, then v*_r = v̂(ψ*_r) can be estimated by v*_{L,r} = v_L(F̂*_r).

Example 3.21 (Transformed correlation)  An example where simple bootstrap methods tend to perform badly without the (explicit or implicit) use of transformation is the correlation coefficient. For a sample of size n = 20 from a bivariate normal distribution, with sample correlation t = 0.74, the left panel



Figure 3.7  Bias and standard error estimates for the ratio applied to the city population data, n = 10. For each of R = 999 bootstrap samples from the data, M = 50 second-level samples were drawn, and the resulting bias and standard error estimates b*_r and v*_r^{1/2} plotted against the bootstrapped ratio t*_r. The lines are from a robust nonparametric curve fit to the simulations.

Figure 3.8  Scatter plot of v*_L versus t* for nonparametric simulation from a bivariate normal sample of size n = 20 with R = 999. The left panel is for t the sample correlation, with dotted line showing the theoretical relationship. The right panel is for the transformed sample correlation.


of Figure 3.8 contains a scatter plot of v*_L versus t* from R = 999 nonparametric simulations: the dotted line is the approximate normal-theory relationship var(T) = n^{-1}(1 − θ²)². The plot correctly shows strong instability of variance. The right panel shows the corresponding plot for bootstrapping the transformed estimate ½ log{(1 + t)/(1 − t)}, whose variance is approximately n^{-1}: here v*_L is computed as in Example 2.18. The plot correctly suggests quite stable variance. ■



As presented here the selection of parameter values ψ* is completely random, and R would need to be moderately large (at least 50) to get a reasonable spread of values of ψ*. The total number of samples, RM + R, will then be very large. It is, however, possible to improve upon the algorithm; see Section 9.4.4. Another important problem is the roughness of variance estimates, apparent in both of the preceding examples. This is due not just to the size of M, but also to the noise in the EDFs F̂*_r being used as models.

Frequency smoothing

One major difference between the parametric and nonparametric cases is that the parametric models vary smoothly with parameter values. A simple way to inject such smoothness into the nonparametric "models" F̂*_r is to smooth them. For simplicity we consider the one-sample case.

Let w(·) be a symmetric density with mean zero and unit variance, and consider the smoothed frequencies

f_j(\theta, \varepsilon) \propto \sum_{r=1}^R f^*_{rj}\, w\!\left(\frac{\theta - t^*_r}{\varepsilon}\right), \qquad j = 1, \ldots, n,    (3.39)

Here ε > 0 is a smoothing parameter that determines the effective range of values of t* over which the frequencies are smoothed. As is common with kernel smoothing, the value of ε is more important than the choice of w(·), which we take to be the standard normal density. Numerical experimentation suggests that close to θ = t, values of ε in the range 0.2v^{1/2} to 1.0v^{1/2} are suitable, where v is an estimated variance for t. We choose the constant of proportionality in (3.39) to ensure that Σ_j f_j(θ, ε) = n. For a given ε, the relative frequencies n^{-1} f_j(θ, ε) determine a distribution F̂*_θ, for which the parameter value is θ* = t(F̂*_θ); in general θ* is not equal to θ, although it is usually very close.
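Given the R x n array of resampling frequencies and the replicates t*_r, the smoothed frequencies (3.39) are a short computation. The sketch below is ours, and uses the standard normal kernel; the kernel's normalizing constant cancels when the frequencies are rescaled.

    import numpy as np

    def smoothed_frequencies(freqs, t_star, theta, eps):
        """Smoothed frequencies f_j(theta, eps) of (3.39).

        freqs:  (R, n) array; freqs[r, j] counts how often y_j appears
                in the rth resample
        t_star: (R,) array of bootstrap values of the statistic
        """
        u = (theta - t_star) / eps
        w = np.exp(-0.5 * u ** 2)      # standard normal kernel, constants cancel
        f = w @ freqs                  # sum over r of w(.) * f*_rj
        n = freqs.shape[1]
        return n * f / f.sum()         # scaled so that the f_j sum to n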

Example 3.22 (City population data)  In continuation of Example 3.20, the top panels of Figure 3.9 show the frequencies f*_{rj} for four samples with values of t* very close to 1.6. The variation in the f*_{rj} leads to the variability in both b*_r and v*_r that shows so clearly in Figure 3.7.

The lower panels show the smoothed frequencies (3.39) for distributions F̂*_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v^{1/2}. The corresponding values of the ratio are θ* = 1.23, 1.51, 1.59, and 1.89. The observations with the smallest empirical influence values are more heavily weighted when θ is less than the original value of the statistic, t = 1.52, and conversely. The third panel, for θ = 1.6, results from averaging frequencies including those shown in the upper panels, and the distribution is much smoother than those. The results are not very sensitive to the value of ε, although the tilting of the frequencies is less marked for larger ε.

The smoothed frequencies can be used to assess how the bias and variance



Figure 3.9  Frequencies for city population data. The upper panels show frequencies f*_j for four samples with values of t* close to 1.6, plotted against empirical influence values l_j for the ratio. The lower panels show smoothed frequencies f_j(θ, ε) for distributions F̂*_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v^{1/2}.

of T depend on θ. For each of a range of values of θ, we generate samples from the multinomial distribution F̂*_θ with expected frequencies (3.39), and calculate the corresponding values of t*, t*_r(θ) say. We then estimate the bias for sampling from F̂*_θ by t̄*(θ) − θ*, where t̄*(θ) is the average of the t*_r(θ). The variance is estimated similarly.

The top panel of Figure 3.10 shows values of t*_r(θ) plotted against jittered values of θ for 100 samples generated from F̂*_θ at θ = 1.2, ..., 1.9; we took ε = 0.2. The lower panels show that the corresponding biases and standard deviations, which are connected by the rougher solid lines, compare well with the double bootstrap results. The amount of computation is much less, however. The smoothed estimates are based on 1000 samples to estimate the F̂*_θ, and then 100 samples at each of the eight chosen values of θ, whereas the double bootstrap required about 25 000 samples. ■

Other applications of (3.39) are described in Chapters 9 and 10.

Variance stabilization

Experience suggests that bootstrap methods for confidence limits and significance tests based on estimators T are most effective when θ is essentially a location parameter, which is approximately induced by a variance-stabilizing transformation. Ideally such a transformation would be derived theoretically from (2.14) with variance function v(θ) = var(T | F).

In a nonparametric setting a suitable transformation may sometimes be suggested by analogy with a parametric problem, as in Example 3.21. If not, a transformation can be obtained empirically using the double bootstrap estimates of variance discussed earlier in the section. Suppose that we have bootstrap samples F̂*_r = (y*_{r1}, ..., y*_{rn}) and the corresponding statistics t*_r, for




Figure 3.10  Use of smoothed nonparametric distributions to estimate bias and standard deviation functions for the ratio of the city population data. The top panel shows 100 bootstrapped ratios calculated from samples generated from F̂*_θ, for each of θ = 1.2, ..., 1.9; for clarity the θ values are jittered. The lower panels show 200 of the points from Figure 3.7 and the estimated bias and standard deviation functions from that figure (smooth curves), with the biases and standard deviations estimated from the top panel (rougher curves).

r = 1, ..., R. Without loss of generality, suppose that t*_1 < ··· < t*_R. One way to implement empirical variance-stabilization is to choose R_1 of the t*_r that are roughly evenly spaced and that include t*_1 and t*_R. For each of the corresponding F̂*_r we then generate M bootstrap values t**, from which we estimate the variance of t* to be v*_r as defined in (3.38). We now smooth a plot of the v*_r against the t*_r, giving an estimate v̂(θ) of the variance var(T | F) as a function of the parameter θ = t(F), and integrate numerically to obtain the estimated variance-stabilizing transformation

h(t) = \int^{t} \frac{du}{\{\hat{v}(u)\}^{1/2}}.    (3.40)



In general, but especially for small R_1, it will be better to fit a smooth curve to values of log v*_r, in part to avoid negative estimates v̂(θ). Provided that a suitable smoothing method is used, inclusion of t*_1 and t*_R in the set for which the v*_r are estimated implies that all the transformed values h(t*_r) can be calculated. The transformed estimator h(T) should have approximately unit variance.

Any of the common smoothers can be used to obtain v̂(θ), and simple integration algorithms can be used for the integral (3.40). If the nested bootstrap is used only to obtain the variances of R_1 of the t*_r, the total number of bootstrap samples required is R + MR_1. Values of R_1 and M in the ranges 50–100 and 25–50 will usually be adequate, so if R = 1000 the overall number of bootstrap samples required will be 2250–6000. If variance estimates for all the t*_r are available, for example nonparametric delta method estimates, then the delta method shows that approximate standard errors for the h(t*_r) will be v*_r^{1/2}/v̂(t*_r)^{1/2}; a plot of these against t*_r will provide a check on the adequacy of the transformation.

The same procedure can be applied with second-level resampling done from smoothed frequencies, as in Example 3.22.
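A bare-bones version of the construction of h in (3.40) might look as follows. This sketch is ours; linear interpolation of log v*_r stands in for a proper scatterplot smoother, which a real analysis should use.

    import numpy as np

    def stabilizing_transform(t_sub, v_sub, grid):
        """Empirical variance-stabilizing transformation (3.40).

        t_sub, v_sub: statistics and nested variance estimates for the R_1
                      chosen resamples; t_sub should span the range of grid
        grid:         increasing points at which to evaluate h
        """
        order = np.argsort(t_sub)
        t_sub, v_sub = t_sub[order], v_sub[order]
        # smooth log v against t (here a linear interpolant as a stand-in)
        logv = np.interp(grid, t_sub, np.log(v_sub))
        integrand = np.exp(-0.5 * logv)            # 1 / vhat(u)^{1/2}
        # cumulative trapezoidal integration gives h up to an additive constant
        steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid)
        return np.concatenate(([0.0], np.cumsum(steps)))

Evaluating this at grid = np.sort(t_star) gives the transformed values h(t*_r).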

Example 3.23 (City population data)  For the city population data of Example 2.8 the parameter of interest is the ratio θ, which is estimated by t = x̄/ū. Figure 3.7 shows that the variance of T depends strongly on θ. We used the procedure outlined above to estimate a transformation based on R = 999 bootstrap samples, with R_1 = 50 and M = 25. The transformation is shown in the left panel of Figure 3.11: the right panel shows the standard errors v*^{1/2}/v̂(t*)^{1/2} of the h(t*). The transformation has been largely successful in stabilizing the variance.

In this case the variances v*_{L,r} based on the linear approximation are readily calculated, and the transformation could have been estimated from them rather than from the nested bootstrap. ■

3.10 Bootstrap Diagnostics

3.10.1 Jackknife-after-bootstrap

Sensitivity analysis is important in understanding the implications of a statistical calculation. A conclusion that depended heavily on just a few observations would usually be regarded as more tentative than one supported by all the data. When a parametric model is fitted, difficulties can be detected by a wide range of diagnostics, careful scrutiny of which is part of a parametric bootstrap analysis, as of any parametric modelling. But if a nonparametric bootstrap is used, the EDF F̂ is in effect the model, and there is no baseline against which




Figure 3.11  Variance-stabilization for the city population ratio. The left panel shows the empirical transformation h(·), and the right panel shows the standard errors v*^{1/2}/{v̂(t*)}^{1/2} of the h(t*), with a smooth curve.

to compare outliers, for example. In this situation we must focus on the effect of individual observations on bootstrap calculations, to answer questions such as "would the confidence interval differ greatly if this point were removed?", or "what happens to the significance level when this observation is deleted?"

Nonparametric case

Once a nonparametric resampling calculation has been performed, a basic question is how it would have been different if an observation, y_j say, had been absent from the original data. For example, it might be wise to check whether or not a suspicious case has affected the quantiles used in a confidence interval calculation. The obvious way to assess this is to do a further simulation from the remaining observations, but this can be avoided. This is because a resample in which y_j does not appear can be thought of as a random sample from the data with y_j excluded. Expressed formally, if J* is sampled uniformly from {1, ..., n}, then the conditional distribution of J* given that J* ≠ j is the same as the distribution of I*, where I* is sampled uniformly from {1, ..., j − 1, j + 1, ..., n}. The probability that y_j is not included in a bootstrap sample is (1 − n^{-1})^n ≈ e^{-1}, so the number of simulations R_{-j} that do not include y_j is roughly equal to Re^{-1} = 0.368R.

So we can measure the effect of y_j on the calculations by comparing the full simulation with the subset of t*_1, ..., t*_R obtained from bootstrap samples where y_j does not occur. In terms of the frequencies f*_{rj} which count the number of times y_j appears in the rth simulation, we simply restrict attention to replicates with f*_{rj} = 0. For example, the effect of y_j on the bias estimate B can be



Table 3.10  Measurements on the head breadth and length of the first two adult sons in 25 families (Frets, 1921).

         First son      Second son              First son      Second son
         Len   Brea     Len   Brea              Len   Brea     Len   Brea
   1     191   155      179   145        14     190   159      195   157
   2     195   149      201   152        15     188   151      187   158
   3     181   148      185   149        16     163   137      161   130
   4     183   153      188   149        17     195   155      183   158
   5     176   144      171   142        18     186   153      173   148
   6     208   157      192   152        19     181   145      182   146
   7     189   150      190   149        20     175   140      165   137
   8     197   159      189   152        21     192   154      185   152
   9     188   152      197   159        22     174   143      178   147
  10     192   150      187   151        23     176   139      176   143
  11     179   158      186   148        24     197   167      200   158
  12     183   147      174   147        25     190   163      187   150
  13     174   150      185   152

measured by the scaled difference

n(B_{-j} - B) = n\left\{\frac{1}{R_{-j}}\sum_{r:\, f^*_{rj}=0} (t^*_r - t_{-j}) - \frac{1}{R}\sum_{r=1}^R (t^*_r - t)\right\},    (3.41)

where B_{-j} is the bias estimate from the resamples in which y_j does not appear, and t_{-j} is the value of t when y_j is excluded from the original data. Such calculations are applications of the jackknife method described in Section 2.7.3, so the technique applied to bootstrap results is called the jackknife-after-bootstrap. The scaling factor n in (3.41) is not essential.
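Because the required resamples are already in hand, the jackknife-after-bootstrap needs no further simulation. A sketch (ours) of the quantile version used in Example 3.24 below:

    import numpy as np

    def jackknife_after_bootstrap(t_star, freqs,
                                  probs=(0.05, 0.1, 0.16, 0.5, 0.84, 0.9, 0.95)):
        """Quantiles of t* - mean(t*) over resamples omitting each case.

        freqs: (R, n) array of resampling frequencies f*_rj
        Returns an (n, len(probs)) array, one row per omitted case.
        """
        R, n = freqs.shape
        out = np.empty((n, len(probs)))
        for j in range(n):
            keep = t_star[freqs[:, j] == 0]   # the roughly 0.368R resamples
            out[j] = np.quantile(keep - keep.mean(), probs)
        return out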

A useful diagnostic is the plot of jackknife-after-bootstrap measures such as (3.41) against empirical influence values, possibly standardized. For this purpose any of the approximations to empirical influence values described in Section 2.7 can be used. The next example illustrates a related plot that shows how the distribution of t* − t changes when each observation is excluded.

Example 3.24 (Frets' heads)  Table 3.10 contains data on the head breadth and length of the first two adult sons in 25 families.

The correlations among the log measurements are given below the diagonal in Table 3.11. The values above the diagonal are the partial correlations. For example, the value 0.13 in the second row is the correlation between the log head breadth of the first son, b_1, and the log head length of the second son, l_2, after allowing for the other variables. In effect, this is the correlation between the residuals from separate regressions of b_1 and l_2 on the other two variables. The correlations are all large, but four of the partial correlations are small, which suggests the simple interpretation that each of the four pairs of measurements for first and second sons is independent conditionally on the values of the other two measurements.



Table 3.11  Correlations (below diagonal) and partial correlations (above diagonal) for log measurements on the head breadth and length of the first two adult sons in 25 families.

                             First son           Second son
                           Length  Breadth     Length  Breadth
  First son    Length        —      0.43        0.21    0.17
               Breadth      0.75     —          0.13    0.22
  Second son   Length       0.72    0.70         —      0.64
               Breadth      0.72    0.72        0.85     —

We focus on the partial correlation t = 0.13 between log b_1 and log l_2. The top panel of Figure 3.12 shows a jackknife-after-bootstrap plot for t, based on 999 bootstrap samples. The points at the left-hand end show the empirical 0.05, 0.1, 0.16, 0.5, 0.84, 0.9, and 0.95 quantiles of the values of t* − t̄*_{-2} for the 368 bootstrap samples in which case 2 was not selected; t̄*_{-2} is the average of t* for those samples. The dotted lines are the corresponding quantiles for all 999 values of t* − t. The distribution is clearly much more peaked when case 2 is left out. The panel also contains the corresponding quantiles when other cases are excluded. The horizontal axis shows the empirical influence values for t: clearly putting more weight on case 2 sharply decreases the value of t.

The lower left panel of the figure shows that case 2 lies somewhat away from the rest, and the plot of residuals for the regressions of log b_1 and log l_2 on the other two variables in the lower right panel accounts for the jackknife-after-bootstrap results. Case 2 seems outlying relative to the others: deleting it will clearly increase t substantially. The overall average and standard deviation of the t* are 0.14 and 0.23, changing to 0.34 and 0.17 when case 2 is excluded. The evidence against zero partial correlation depends heavily on case 2. ■

Another version of the diagnostic plot uses case-deletion averages of the t*_r, i.e. t̄*_{-j} = R_{-j}^{-1} Σ_{r: f*_{rj}=0} t*_r, instead of the empirical influence values. This more clearly reveals how the quantity of interest varies with parameter values.

Parametric case

In the parametric case different calculations are needed, because random samples from a case-deletion model are not simply an unweighted subset of the original bootstrap samples. Nevertheless, those original bootstrap samples can still be used if we make use of the following identity relating expectations under two different parameter values:

E\{h(Y) \mid \psi'\} = E\left\{h(Y)\,\frac{f(Y \mid \psi')}{f(Y \mid \psi)} \,\Big|\, \psi\right\}.    (3.42)

Suppose that the full-data estimate (e.g. maximum likelihood estimate) of the model parameter is ψ̂, and that when case j is deleted the corresponding estimate is ψ̂_{-j}. The idea is to use (3.42) with ψ̂ and ψ̂_{-j} in place of ψ and ψ′,




Figure 3.12  Jackknife-after-bootstrap analysis for the partial correlation between log b_1 and log l_2 for Frets' heads data. The top panel shows 0.05, 0.1, 0.16, 0.5, 0.84, 0.9 and 0.95 empirical quantiles of t* − t̄*_{-j} when each of the cases is dropped from the bootstrap calculation in turn. The lower panels show scatter plots of the raw values of log b_1 and log l_2, and of their residuals when regressed on the other two variables.


respectively. For example,

E(T^* - t \mid \hat{\psi}_{-j}) = E\left\{(T^* - t)\,\frac{f(Y^* \mid \hat{\psi}_{-j})}{f(Y^* \mid \hat{\psi})} \,\Big|\, \hat{\psi}\right\}.

Therefore the parametric analogue of (3.41) is

n(B_{-j} - B) = n\left\{\frac{1}{R}\sum_{r=1}^R (t^*_r - t_{-j})\,\frac{f(y^*_r \mid \hat{\psi}_{-j})}{f(y^*_r \mid \hat{\psi})} - \frac{1}{R}\sum_{r=1}^R (t^*_r - t)\right\},

where the samples y* are drawn from the full-data fitted model, that is with parameter value ψ̂. Similar weighted calculations apply to other features of the



distribution of T* − t; see Problem 3.20. Other applications of the importance reweighting identity (3.42) will be discussed in Chapter 9.

3.10.2 Linearity

Statistical analysis is simplified when the statistic of interest T is close to linear. In this case the variance approximation v_L will be an accurate estimate of the bootstrap variance var(T* | F̂), and saddlepoint methods (Section 9.5) can be applied to obtain accurate estimates of the distribution of T* without recourse to simulation. A linear statistic is not necessarily close to normally distributed, as Example 2.3 illustrates. Nor does linearity guarantee that T is directly related to a pivot and therefore useful in finding confidence intervals. On the other hand, experience from other areas in statistics suggests that these three properties will often occur together.

This suggests that we aim to find a transformation h(·) such that h(T) is well described by the linear approximation that corresponds to (2.35) or (3.1). For simplicity we focus on the single-sample case here. The shape of h(·) would be revealed by a plot of h(t) against t, but of course this is not available because h(·) is unknown. However, using Taylor approximation and (2.44) we do have

h(t^*) \doteq h(t^*_L) = h\Bigl(t + \frac{1}{n}\sum_{j=1}^n f^*_j l_j\Bigr) \doteq h(t) + \dot{h}(t)\,\frac{1}{n}\sum_{j=1}^n f^*_j l_j = h(t) + \dot{h}(t)(t^*_L - t),

where ḣ(t) = dh(t)/dt, which shows that t*_L = c + d h(t*) with appropriate definitions of constants c and d. Therefore a plot of the values of t*_L = t + n^{-1} Σ f*_j l_j against the t* will look roughly like h(·), apart from a location and scale shift. We can now estimate h(·) from this plot, either by fitting a particular parametric form, or by nonparametric curve estimation.
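The computation behind such a plot is elementary once the empirical influence values l_j and the frequency array are available; the sketch below is ours.

    import numpy as np

    def linear_approximations(t, l, freqs):
        """Values t*_L = t + n^{-1} sum_j f*_rj l_j, one per resample.

        l:     (n,) array of empirical influence values
        freqs: (R, n) array of resampling frequencies f*_rj
        """
        n = freqs.shape[1]
        return t + freqs @ l / n

Plotting these against the t*_r, then fitting a Box-Cox curve or a spline, estimates h(·) up to location and scale, as in Example 3.25.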

Example 3.25 (City population data)  The top left panel of Figure 3.13 shows t*_L plotted against t* for 499 bootstrap replicates of the ratio t = x̄/ū for the data in Table 2.1. The plot is highly nonlinear, and the logarithmic transformation, or one even more extreme, seems appropriate. Note that the plot has shape similar to that for the empirical variance-stabilizing transformation in Figure 3.11.

For a parametric transformation, we try a Box-Cox transformation, h(t) = (t^λ − 1)/λ, with the value of λ estimated by maximizing the log likelihood for the regression of the h(t*_r) on the t*_{L,r}. This strongly suggests that we use λ = −2, for which the fitted curve is shown as the solid line on the plot. This is close to the result for a smoothing spline, shown as the dotted line. The top right panel shows the linear approximation for h(t*), i.e. h(t) + ḣ(t) n^{-1} Σ_{j=1}^n f*_j l_j, plotted against h(t*). This plot is close to the line with unit gradient, and confirms the results of the analysis of transformations.



Figure 3.13  Linearity transformation for the ratio applied to the city population data. The top left panel shows linear approximations t*_L plotted against bootstrap replicates t*, with the estimated parametric transformation (solid) and a transformation estimated by a smoothing spline (dots). The top right panel shows the same plot on the transformed scale. The lower left panel shows the plot for the studentized bootstrap statistic. The lower right panel shows a normal Q-Q plot of the studentized bootstrap statistic for the transformed values h(t*).


The lower panels show related plots for the studentized bootstrap statistics on the original scale and on the new scale,

z^* = \frac{t^* - t}{v_L^{*1/2}}, \qquad z^*_h = \frac{h(t^*) - h(t)}{\dot{h}(t)\, v_L^{*1/2}},

where v*_L = n^{-2} Σ f*_j l_j². The left panel shows that, like t*, z* is far from linear. The lower right panel shows that the distribution of z*_h is fairly close to standard normal, though there are some outlying values. The distribution of z* is far from normal, as shown by the right panel of Figure 2.5. It seems that, here, the transformation that gives approximate linearity of t* also



makes the corresponding studentized bootstrap statistic roughly normal. The transformation based on the smoothing spline would give similar results. ■

3.11 Choice of Estimator from the Data

In some applications we may want to choose an estimator or other procedure after looking at the data, especially if there is considerable prior uncertainty about the nature of random variation or of the form of relationship among variables. The simplest example with homogeneous data involves the choice of estimator for a population mean μ, when empirical evidence suggests that the underlying distribution F has long, non-normal tails.

Suppose that T(1), ..., T(K) can all be considered potentially suitable estimators for μ, and for the moment assume that all are unbiased, which means that the underlying data distribution is symmetric. Then one natural criterion for choice among these estimators is variance or, since their exact variances will be unknown, estimated variance. So if the estimated variance of T(i) is V(i), a natural procedure is to select as estimate for a given dataset that t(i) whose estimated variance is smallest. This defines the adaptive estimator T by

T = T(i) \quad \text{if } V(i) = \min_{1 \le k \le K} V(k).

For most simple estimators we can use the nonparametric delta method variance estimates. But in general, and for more complicated problems, we use the bootstrap to implement this procedure. Thus we generate R bootstrap samples, compute the estimates t*(1), ..., t*(K) for each sample, and then choose t to be that t(i) for which the bootstrap estimate of variance

v(i) = \frac{1}{R-1}\sum_{r=1}^R \{t^*_r(i) - \bar{t}^*(i)\}^2

is smallest; here t̄*(i) = R^{-1} Σ_r t*_r(i). How we generate the bootstrap samples is important here. Having assumed

symmetry of the data distribution, the resampling distribution should be symmetric so that the t*(i) are unbiased for μ. Otherwise selection based on variance alone is questionable. Further discussion of this is postponed to Example 3.26.
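A sketch of the basic selection step (ours; the sampler argument is whatever resampling scheme is appropriate, for example a symmetrized bootstrap):

    import numpy as np

    def choose_estimator(y, stats, sampler, R=1000):
        """Select the candidate statistic with smallest bootstrap variance.

        stats:   list of K functions, the candidate estimators T(1),...,T(K)
        sampler: function returning one resample of y
        """
        t_star = np.array([[s(sampler(y)) for s in stats] for _ in range(R)])
        v = t_star.var(axis=0, ddof=1)    # the v(i), i = 1,...,K
        i_min = int(np.argmin(v))
        return stats[i_min](y), i_min, v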

So far the procedure is straightforward. But now suppose that we want to estimate the variance of T, or quantiles of T − μ. For the variance, the minimum estimate v(i) used to select t = t(i) will tend to be too low: if I is the random index corresponding to the selected estimator, then E{V(I)} < var{T(I)} = var(T). Similarly the resampling distribution of T* = T*(I*) will be artificially concentrated relative to that of T, so that empirical quantiles of the t*(i) values will tend to be too close to t. Whether or not this selection bias



is serious depends on the context. However, the bias can be adjusted for by bootstrapping the whole procedure, as follows.

Let y*_1, ..., y*_n be one of the R simulated samples. Suppose that we apply the procedure for choosing among T(1), ..., T(K) to this bootstrap sample. That is, we generate M samples with equal probability from y*_1, ..., y*_n, and calculate the estimates t**_m(1), ..., t**_m(K) for the mth such sample. Then choose the estimator with the smallest estimated variance

v^*(i) = \frac{1}{M-1}\sum_{m=1}^M \{t^{**}_m(i) - \bar{t}^{**}(i)\}^2,

where t̄**(i) = M^{-1} Σ_{m=1}^M t**_m(i). That is,

t^* = t^*(i) \quad \text{if } v^*(i) = \min_{1 \le k \le K} v^*(k).

Doing this for each of the R samples y*_1, ..., y*_n gives t*_1, ..., t*_R, and the empirical distribution of the t*_r − t values approximates the distribution of T − μ. For example, v = (R − 1)^{-1} Σ(t*_r − t)² estimates the variance of T, and by accounting for the selection bias should be more accurate than v(i).

There are two byproducts of this double bootstrap procedure. One is information on how well-determined is the choice of estimator, if this is of interest, simply by examining the relative frequency with which each estimator is chosen. Secondly, the bias of v(i) can be approximated: on the log scale the bias is estimated by R^{-1} Σ log v*_r − log v, where v*_r is the smallest value of the v*(i)s in the rth bootstrap sample.

Example 3.26 (Gravity data)  Suppose that the data in Table 3.1 were only available as a combined sample of n = 81 measurements. The different dispersions of the ingredient series make the combined sample very non-normal, so that the simple average is a poor estimator of the underlying mean μ. One possible approach is to consider trimmed average estimates

T(k) = \frac{1}{n - 2k}\sum_{j=k+1}^{n-k} Y_{(j)},

which are averages after dropping the k smallest and k largest order statistics Y_{(j)}. The usual average and sample median correspond respectively to k = 0 and ½(n − 1). The left panel of Figure 3.14 plots the trimmed averages against k. The mild downward trend in the plot suggests slight asymmetry of the data distribution. Our aim is to use the bootstrap to choose among the trimmed averages.

The trimmed averages will all be unbiased if the underlying data distribution is symmetric, and estimator variance will then be a sensible criterion on which to base choice. The bootstrap procedure must build in the assumed symmetry,




Figure 3.14  Trimmed averages and their estimated variances and mean squared errors for the pooled gravity data, based on R = 1000 bootstrap samples, using the ordinary bootstrap (•) and the symmetric bootstrap (○).


and this can be done (cf. Example 3.4) by simulating samples from a symmetrized version of F̂ such as

\hat{F}_{sym}(y) = \tfrac{1}{2}\{\hat{F}(y) + 1 - \hat{F}(2\hat{\mu} - y - 0)\},

which is simply the EDF of y_1, ..., y_n, μ̂ − (y_1 − μ̂), ..., μ̂ − (y_n − μ̂), with μ̂ an estimate of μ which for this purpose we take to be the sample median. The centre panel of Figure 3.14 shows bootstrap estimates of variance for eleven trimmed averages based on R = 1000 samples drawn from F̂_sym. We conclude from this that k = 36 is best, but that there is little to choose among trimmed averages with k = 24, ..., 40. A similar conclusion emerges if we sample from F̂, although the bootstrap variances are noticeably higher for k ≥ 24.
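Sampling from F̂_sym simply means resampling the 2n values obtained by reflecting the data about μ̂. A sketch (ours):

    import numpy as np

    def symmetrized_sample(y, seed=None):
        """One resample of size n from the symmetrized EDF about the median."""
        rng = np.random.default_rng(seed)
        mu = np.median(y)                       # the centre used in the text
        pool = np.concatenate([y, 2 * mu - y])  # y_i and their reflections
        return rng.choice(pool, size=len(y), replace=True)

This can be passed as the sampler argument of the earlier selection sketch.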

If symmetry of the underlying distribution were in doubt, then we should take the biases of the estimators into account. One natural criterion then would be mean squared error. In this case our bootstrap samples would be drawn from F̂, and we would select among the trimmed averages on the basis of the bootstrap mean squared error

\mathrm{mse}(i) = \frac{1}{R}\sum_{r=1}^R \{t^*_r(i) - \bar{y}\}^2.

Note that mean squared error is measured relative to the mean ȳ of the bootstrap population. The right panel of Figure 3.14 shows the bootstrap mean squared errors for our trimmed averages, and we see that the estimated biases do have an effect: now a value of k nearer 20 would appear to be best. Under the symmetric bootstrap, when the mean of F̂_sym is the sample median because we symmetrized about this point, bootstrap mean squared error equals bootstrap variance.

To focus the rest of the discussion, we shall assume symmetry and therefore choose t to be the trimmed average with k = 36. The value of t is 78.33, and the minimum bootstrap variance based on 1000 simulations is 0.321.

We now use the double bootstrap procedure to estimate the variance for t, and to determine appropriate quantiles for t. First we generate R = 1000



samples y*_1, ..., y*_81 from F̂_sym. To each of these samples we then apply the original symmetric bootstrap procedure, generating M = 100 samples of size n = 81 from the symmetrized EDF of y*_1, ..., y*_81, and choosing t* to be that one of the 11 trimmed averages with smallest value of v*(i). The variance v of t*_1, ..., t*_R equals 0.356, which is 10% larger than the original minimum variance. If we use this variance with a normal approximation to calculate a 95% confidence interval centred on t, the interval is [77.16, 79.50]. This is very similar to the intervals obtained in Example 3.2.

The frequencies with which the different trimming proportions are chosen are:

  k           12   16   20   24   28   32   36   40
  Frequency    1   25   54   96  109  131  498   86

Thus when symmetry of the underlying distribution is assumed, a fairly heavy degree of trimming seems desirable for these data, and the value k = 36 actually chosen seems reasonably well-determined. ■

The general features of this discussion are as follows. We have a set of estimators T(a) = t(a, F̂) for a ∈ 𝒜, and for each estimator we have an estimated value C(a, F̂) for a criterion C(a, F) = E{c(T(a), θ) | F} such as variance or mean squared error. The adaptive estimator is T = t(â, F̂), where â = a(F̂) minimizes C(a, F̂) with respect to a. We want to know about the distribution of T, including for example its bias and variance. The distribution of T − θ = t(â, F̂) − t(F) under sampling from F will be approximated by evaluating it under sampling from F̂. That is, it will be approximated by the distribution of

T^* - t = t(\hat{F}^*) - t(\hat{F}) = t(\hat{a}^*, \hat{F}^*) - t(\hat{a}, \hat{F})

under sampling from F̂. Here F̂* is the analogue of F̂ based on y*_1, ..., y*_n: if F̂ is the EDF of the data, then F̂* is the EDF of y*_1, ..., y*_n sampled from F̂.

Whether or not the allowance for selection bias is numerically important will depend upon the density of a values and the variability of C(a, F̂).

3.12 Bibliographic Notes

The extension of bootstrap methods to several unrelated samples has been used by several authors, including Hayes, Perl and Efron (1989) for a special contrast-estimation problem in particle physics; the application is discussed also in Efron (1992) and in Practical 3.4.

A general theoretical account of estimation in semiparametric models is given in the book by Bickel et al. (1993). The majority of applications of semiparametric models are in regression; see the references for Chapters 6 and 7.



Efron (1979, 1982) suggested and studied empirically the use of smooth versions of the EDF, but the first systematic investigation of smoothed bootstraps was by Silverman and Young (1987). They studied the circumstances in which smoothing is beneficial for statistics for which there is a linear approximation. Hall, DiCiccio and Romano (1989) show that when the quantity of interest depends on a local property of the underlying CDF, as do quantiles, smoothing can give worthwhile theoretical reductions in the size of the mean squared error. Similar ideas apply to more complex situations such as L_1 regression (De Angelis, Hall and Young, 1993); see however the discussion in Section 6.5. De Angelis and Young (1992) give a useful review of bootstrap smoothing, and discuss the empirical choice of how much smoothing to apply. See also Wang (1995). Romano (1988) describes a problem — estimation of the mode of a density — where the estimator is undefined unless the EDF is smoothed; see also Silverman (1981). In a spatial data problem, Kendall and Kendall (1980) used a form of bootstrap that jitters the observed data, in order to keep the rough configuration of points constant over the simulations; this amounts to sampling without replacement when applying the smoothed bootstrap. Young (1990) concludes that although this approach can outperform the unsmoothed bootstrap, it does not perform so well as the smoothed bootstrap described in Section 3.4.

General discussions of survival data can be found in the books by Cox and Oakes (1984) and Kalbfleisch and Prentice (1980), while Fleming and Harrington (1991) and Andersen et al. (1993) give more mathematical accounts. The product-limit estimator was derived by Kaplan and Meier (1958); it and variants are widely used in practice.

Efron (1981a) proposed the first bootstrap methods for survival data, and discussed the relation between traditional and bootstrap standard errors for the product-limit estimator. Akritas (1986) compared variance estimates for the median survival time from Efron's sampling scheme and a different approach of Reid (1981), and concluded that Efron's scheme is superior. The conditional method outlined in Section 3.5 was suggested by Hjort (1985), and subsequently studied by Kim (1990), who concluded that it estimates the conditional variance of the product-limit estimator somewhat better than does resampling cases. Doss and Gill (1992) and Burr and Doss (1993) give weak convergence results leading to confidence bands for quantiles of the survival time distribution. The asymptotic behaviour of parametric and nonparametric bootstrap schemes for censored data is described by Hjort (1992), while Andersen et al. (1993) discuss theoretical aspects of the weird bootstrap.

The general approach to missing-data problems via the EM algorithm is discussed by Dempster, Laird and Rubin (1977). Bayesian methods using multiple imputation and data augmentation are described by Tanner and Wong (1987) and Tanner (1996). A detailed treatment of multiple imputation techniques for missing-data problems, with special emphasis on survey data, is given by Rubin (1987). The principal reference for resampling in missing-data problems is Efron (1994), together with the useful, cautionary discussion by D. B. Rubin. The account in Section 3.6 puts more emphasis on careful choice of estimators.

Cochran (1977) is a standard reference on finite population sampling. Variance estimation by balanced subsampling methods was discussed in this context as early as McCarthy (1969), but the first attempt to apply the bootstrap directly was by Gross (1980), who describes what we have termed the "population bootstrap", but restricted to cases where N/n is an integer. This approach was subsequently developed by Bickel and Freedman (1984), while Chao and Lo (1994) also make a case for this approach. Booth, Butler and Hall (1994) describe the construction of studentized bootstrap confidence limits in this context. Presnell and Booth (1994) give a critical discussion of earlier literature and describe the superpopulation bootstrap. The use of modified sample sizes was proposed by McCarthy and Snowden (1985) and the mirror-match method by Sitter (1992). A different approach based on rescaling was introduced by Rao and Wu (1988). A comprehensive theoretical discussion of the jackknife and bootstrap in sample surveys is given in Chapter 6 of Shao and Tu (1995), with later developments described by Presnell and Booth (1994) and Booth, Butler and Hall (1994), on which the account in Section 3.7 is largely based.

Little has been written about resampling hierarchical data, although two relevant references are given in the bibliographic notes for Chapter 7. Related methods for bootstrapping empirical Bayes estimates in hierarchical Bayes models are described by Laird and Louis (1987). Nonparametric estimation of the CDF for a random effect is discussed by Laird (1978).

Bootstrapping the bootstrap is described by Chapman and Hinkley (1986), and was applied to estimation of variance-stabilizing transformations by Tibshirani (1988). Theoretical aspects of adjustment of bootstrap calculations were developed by Hall and Martin (1988). See also the bibliographic notes for Chapters 4 and 5. Milan and Whittaker (1995) give a parametric bootstrap analysis of the data in Table 3.10, and discuss the difficulties that can arise when resampling in problems with a singular value decomposition.

Efron (1992) introduced the jackknife-after-bootstrap, and described a variety of ingenious uses for related calculations. Different graphical diagnostics for bootstrap reliability are developed in an asymptotic framework by Beran (1997). The linearity plot of Section 3.10.2 is due to Cook and Weisberg (1994).

Theoretical aspects of the empirical choice of estimator are discussed by Leger and Romano (1990a,b) and Leger, Politis and Romano (1992). Efron (1992) gives an example of choice of level of trimming of a robust estimator, without double bootstrapping. Some of the general issues, with examples, are discussed by Faraway (1992).



3.13 Problems

1  In a two-sample problem, with data y_ij, j = 1,...,n_i, i = 1, 2, giving sample averages ȳ_i and variances v_i, describe models for which it would be appropriate to resample the following quantities:
(a) e_ij = y_ij - ȳ_i,
(b) e_ij = (y_ij - ȳ_i)/(1 - n_i^{-1})^{1/2},
(c) e_ij = (y_ij - ȳ_i)/{v_i(1 - n_i^{-1})}^{1/2},
(d) e_ij = ±(y_ij - ȳ_i)/{v_i(1 - n_i^{-1})}^{1/2}, where the signs are allocated with equal probabilities,
(e) e_ij = y_ij/ȳ_i.
In each case say how a simulated dataset would be constructed.
What difficulties, if any, would arise from replacing ȳ_i and v_i by more robust estimates of location and scale?
(Sections 3.2, 3.3)

2  A slightly simplified version of the weighted mean of k samples, as used in Example 3.2, is defined by

    T = Σ_{i=1}^k w_i ȳ_i / Σ_{i=1}^k w_i,

where w_i = n_i/σ̂_i², with ȳ_i = n_i^{-1} Σ_j y_ij and σ̂_i² = n_i^{-1} Σ_j (y_ij - ȳ_i)² estimates of the mean μ_i and variance σ_i² of the ith distribution. Show that the influence functions for T are

    L_{t,i}(y_i; F) = (w̃_i / Σ_j w̃_j) [ y_i - μ_i - (μ_i - θ){(y_i - μ_i)²/σ_i² - 1} ],

where w̃_i = n_i/σ_i². Deduce that the first-order approximation under the constraint μ_1 = ··· = μ_k for the variance of T is v_L = 1/Σ w̃_i, with empirical analogue v̂_L = 1/Σ w_i. Compare this to the corresponding formula based on the unconstrained empirical influence values.
(Section 3.2.1)

3  Suppose that Y is bivariate with polar representation (X, ω), so that Y^T = (X cos ω, X sin ω). If it is known that ω has a uniform distribution on [0, 2π), independent of X, what would be an appropriate resampling algorithm based on the random sample y_1,...,y_n?
(Section 3.3)

4  Spherical data y_1,...,y_n are points on the sphere of unit radius. Suppose that it is assumed that these data come from a distribution that is symmetric about the unknown mean direction μ. In light of the symmetry assumption, what would be an appropriate resampling algorithm for simulating data y*_1,...,y*_n?
(Section 3.3; Ducharme et al., 1985)

5  Two independent random samples y_11,...,y_1n_1 and y_21,...,y_2n_2 of positive data are obtained, and the ratio of sample means t = ȳ_2/ȳ_1 is used to estimate the corresponding population ratio θ = μ_2/μ_1.
(a) Show that the influence functions for t are

    L_{t,1}(y_1; F) = -(y_1 - μ_1)θ/μ_1,    L_{t,2}(y_2; F) = (y_2 - μ_2)/μ_1.

Hence obtain the formula

    v_L = {t² n_1^{-2} Σ_j (y_1j - ȳ_1)² + n_2^{-2} Σ_j (y_2j - ȳ_2)²} / ȳ_1²



for the approximate variance of T.
(b) Describe an appropriate resampling algorithm. How could this be modified if one could assume a multiplicative model, i.e. Y_1j = μ_1 ε_1j and Y_2j = μ_2 ε_2j, with all εs sampled from a common distribution of positive random variables?
(c) Show that under the multiplicative model the approximate variance formula can be changed to v_L = t² Σ_i Σ_j (e_ij - 1)²/n_i², where e_ij = y_ij/ȳ_i.
(Section 3.2.1)

6  The empirical influence values can be calculated more directly as follows. Consider only distributions supported on the data values, with probabilities p_i = (p_i1,...,p_in_i) on the values in the ith sample, for i = 1,...,k. Then write T = t(p_1,...,p_k), so that t = t(p̄_1,...,p̄_k) with p̄_i = (1/n_i,...,1/n_i). Show that the empirical influence value l_ij corresponding to the jth case in sample i is given by

    l_ij = d/dε t{p̄_1,...,(1 - ε)p̄_i + ε1_j,...,p̄_k} |_{ε=0},

where 1_j is the vector with 1 in the jth position and zeroes elsewhere.
(Section 3.2.1)

7  Following on from the previous problem, re-express t(p_1,...,p_k) as a function u(π) of a single probability vector π = (π_11,...,π_1n_1,...,π_kn_k). For example, for the ratio of means of two independent samples, t = ȳ_2/ȳ_1,

    u(π) = (Σ_j π_2j y_2j / Σ_j π_2j) / (Σ_j π_1j y_1j / Σ_j π_1j).

The observed value t is then equal to u(π̄), where π̄ = (1/n,...,1/n) with n = Σ_{i=1}^k n_i. Show that

    l̃_ij = d/dε u{(1 - ε)π̄ + ε1_ij} |_{ε=0} = (n/n_i) l_ij,

where 1_ij is the vector with 1 in the (n_{i-1} + j)th position, with n_0 = 0, and zeroes elsewhere. One consequence of this is that v_L = n^{-2} Σ_i Σ_j l̃_ij².
Apply these calculations to the ratio t = ȳ_2/ȳ_1.
(Section 3.2.1)

8  If x_1,...,x_n is a random sample from some distribution G with density g, suppose that this density is estimated by

    ĝ_h(x) = (nh)^{-1} Σ_{j=1}^n w{(x - x_j)/h},

where w is a symmetric PDF with mean zero and variance τ².
(a) Show that this density estimate has mean x̄ and variance n^{-1} Σ (x_j - x̄)² + h²τ².
(b) Show that the random variable X* = x_J + hε has PDF ĝ_h, where J is uniformly distributed on {1,...,n} and ε has PDF w. Hence describe an algorithm for bootstrap simulation from a smoothed version of the EDF.
(c) Show that the rescaled density

    (nhb)^{-1} Σ_{j=1}^n w{(x - a - bx_j)/(hb)}

will have the same first two moments as the EDF if a = (1 - b)x̄ and b = {1 + nh²τ²/Σ(x_j - x̄)²}^{-1/2}. What algorithm simulates from this smoothed EDF?
(d) Discuss the special problems that arise from using ĝ_h(x) when the range of x is [0, ∞) rather than (-∞, ∞).
(e) Extend the algorithms in (b) and (c) to multivariate x.
(Section 3.4; Silverman and Young, 1987; Wand and Jones, 1995)
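As an illustration of the algorithms in (b) and (c), here is a minimal sketch in S, under the assumption that the kernel w is the standard normal density (so that τ² = 1); x is the data vector and h the smoothing parameter.

smooth.boot <- function(x, h)      # algorithm of (b): x* = x_J + h*eps
  x[sample(length(x), replace = TRUE)] + h * rnorm(length(x))
shrunk.boot <- function(x, h)      # algorithm of (c): same two moments as the EDF
{ b <- 1/sqrt(1 + length(x) * h^2/sum((x - mean(x))^2))
  (1 - b) * mean(x) + b * smooth.boot(x, h) }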

9  Consider resampling cases from censored data (y_1, d_1),...,(y_n, d_n), where y_1 < ··· < y_n. Let f*_j denote the number of times that (y_j, d_j) occurs in an ordinary bootstrap sample, and let S*_j = f*_j + ··· + f*_n.
(a) Show that when there is no censoring, the product-limit estimate puts mass n^{-1} on each observed failure y_1 < ··· < y_n, so that F̂° = F̂.
(b) Show that if B(m, p) denotes the binomial distribution with index m and probability p, then the conditional distribution of f*_j given S*_j = s is B{s, (n - j + 1)^{-1}}.
(c) Show that

    log{1 - F̂°*(y)} = Σ_{j: y_j ≤ y} d_j log(1 - f*_j/S*_j),

and deduce that this has variance Σ_{j: y_j ≤ y} d_j var{log(1 - f*_j/S*_j)}.
(d) Use the delta method to show that var[log{1 - F̂°*(y)}] ≈ Σ_{j: y_j ≤ y} d_j/(n - j + 1)², and infer that

    var{1 - F̂°*(y)} ≈ {1 - F̂°(y)}² Σ_{j: y_j ≤ y} d_j/(n - j + 1)².

This equals the variance from Greenwood's formula, (3.10), apart from replacement of (n - j + 1)² by (n - j)(n - j + 1).
(Section 3.5; Efron, 1981a; Cox and Oakes, 1984, Section 4.3)

10  Consider the weird bootstrap applied to a homogeneous sample of censored data (y_1, d_1),...,(y_n, d_n), in which y_1 < ··· < y_n. Let dÂ*(y_j) = N*_j/(n - j + 1), where the N*_j are independent binomial variables with denominators n - j + 1 and probabilities d_j/(n - j + 1).
(a) Show that the total number of failures under this resampling scheme is distributed as a sum of independent binomial observations.
(b) Show that E*{dÂ*(y_j)} = d_j/(n - j + 1), and that if d_n = 1 then dÂ*(y_n) always equals one.
(Section 3.5; Andersen et al., 1993)

11  Suppose that Y_j = (U_j, X_j), j = 1,...,n, are bivariate normal with mean vector μ and variance matrix Ω. When the Ys are observed, a random m cases have x missing. Obtain formulae for the maximum likelihood estimators of μ and Ω. Verify that these formulae agree with the multiple-imputation estimators constructed by the method of Section 3.6.

12  (a) Establish (3.15), and show that the sample variance c is an unbiased estimate of γ.



Here #{A} denotes the number of elements in the set A.

(b) Now suppose that N = kn for some integer k. Show that under the population bootstrap,

    E*(Ȳ*) = ȳ,    var*(Ȳ*) = {N(n - 1)/n(N - 1)} × (1 - f) n^{-1} c.

(c) In the context of Example 3.16, suppose that the parameter of interest is a nonlinear function of θ, say η = g(θ), which is estimated by g(T). Use the delta method to show that the bias of g(T) is roughly ½g″(θ)var(T), and that the bootstrap bias estimate is roughly ½g″(t)var*(T*). Under what conditions on n and N does the bootstrap bias estimate converge to the true bias?
(Section 3.7; Bickel and Freedman, 1984; Booth, Butler and Hall, 1994)

13  To model the superpopulation bootstrap, suppose that the original data are y_1,...,y_n and that 𝒴* contains M_1,...,M_n copies of y_1,...,y_n; the joint distribution of the M_j is multinomial with probabilities n^{-1} and denominator N. If Y*_1,...,Y*_n are sampled without replacement from 𝒴* and if Ȳ* = n^{-1} Σ Y*_j, show that

    E*(Ȳ*) = ȳ,    E_M{var*(Ȳ* | M)} = {(n - 1)/n} × (1 - f) n^{-1} c.

(Section 3.7; Presnell and Booth, 1994)

14  Suppose we wish to perform mirror-match resampling with k independent without-replacement samples of size m, but that k = {n(1 - m/n)}/{m(1 - f)} is not an integer. Let K* be the random variable such that

    Pr(K* = k′) = 1 - Pr(K* = k′ + 1) = k′(1 + k′ - k)/k,

where k′ = [k] is the integer part of k. Show that if the mirror-match algorithm is applied for an average Ȳ* with this distribution for K*, then var*(Ȳ*) = (1 - m/n)c/(mk). Show also that under mirror-match resampling with the simplifying assumption that randomization is not required because k is an integer,

    E*(C*) = c [ 1 - m(k - 1)/{n(mk - 1)} ],

where C* is the sample variance of the Y*_j.
What implications are there for variance estimation for more complex statistics?
(Section 3.7; Sitter, 1992)

15  Suppose that n is a large even integer and that N = 5n/2, and that instead of applying the population bootstrap we choose a population from which to resample according to

    𝒴* = { y_1,...,y_n, y_1,...,y_n,                with probability 2/5,
           y_1,...,y_n, y_1,...,y_n, y_1,...,y_n,   with probability 3/5.

Having selected 𝒴* we take a sample Y*_1,...,Y*_n from it without replacement and calculate Z* = (Ȳ* - ȳ){(1 - f)n^{-1}c}^{-1/2}. Show that if f = n/N the approximate distribution of Z* is the normal mixture (2/5)N(0, 5/6) + (3/5)N(0, 10/9), but that if f = n/#{𝒴*} the approximate distribution of Z* is N(0, 1). Check that in the first case, E*(Z*) = 0 and var*(Z*) = 1.
Comment on the implications for the use of randomization in finite population resampling.
(Section 3.7; Bickel and Freedman, 1984; Presnell and Booth, 1994)



16  Suppose that we have data y_1,...,y_n, and that the bootstrap sample is taken to be

    Y*_j = ȳ + d(y_{I_j} - ȳ),    j = 1,...,n′,

where I_1,...,I_{n′} are independently chosen at random from {1,...,n}. Show that when d = {n′(1 - f)/(n - 1)}^{1/2}, we have E*(Ȳ*) = ȳ and var*(Ȳ*) = (1 - f)n^{-1}c. How might the value of n′ be chosen?
Discuss critically this resampling scheme.
(Section 3.7; Rao and Wu, 1988)

17  Suppose that y_ij = x_i + z_ij, i = 1,...,a and j = 1,...,b, where the x_i are independent with mean μ and variance σ_x², and the z_ij are independent with mean 0 and variance σ_z². Consider the resampling schemes

    Y*_ij = ȳ_{I_i} + (y_{K_i J_j} - ȳ_{K_i}),

where I_1,...,I_a and K_1,...,K_a are randomly sampled with replacement from {1,...,a}, and J_1,...,J_b are randomly sampled from {1,...,b} either with or without replacement. Show that the second-moment properties of the Y*_ij are given by (3.23) and (3.24).
(Section 3.8)

18  For the model of Problem 3.17, define estimates of the x_i and z_ij by

    x̂_i = cȳ_i + (1 - c)ȳ,    ẑ_ij = d(y_ij - ȳ_i).

Show that the EDFs of the x̂_i and ẑ_ij have first two moments which are unbiased for the corresponding moments of the Xs and Zs if

    c² = σ_x² / {(1 - a^{-1})(σ_x² + b^{-1}σ_z²)},    d² = b/(b - 1).

(Section 3.8)

19  Consider the double bootstrap procedure for adjusting the estimated bias of T, as described in Section 3.9, when T is the average Ȳ. Show that the variance of the simulation error for the adjusted bias estimate B - C is approximately (M + 1)s²/(nRM), with s² the sample variance.
Hence deduce that for fixed RM the best choice for M is 1. How would the results change for a statistic other than the average?
Derive the corresponding result for the bias correction of the bootstrap estimate of var(T).
(Sections 2.5.2, 3.9)

20  Extend the discussion following (3.42) to jackknife-after-bootstrap calculations for parametric resampling. Describe the calculation in detail when parametric simulation is performed from the exponential density.
(Section 3.10; Efron, 1992)

21  Let t_p(F) denote the p × 100% trimmed average of distribution F, i.e.

    t_p(F) = (1 - 2p)^{-1} ∫_{F^{-1}(p)}^{F^{-1}(1-p)} y dF(y).

(a) If F_κ denotes the gamma distribution with index κ and unit mean, show that t_p(F_κ) = κ(1 - 2p)^{-1}{F_{κ+1}(y_{κ,1-p}) - F_{κ+1}(y_{κ,p})}, where y_{κ,p} is the p quantile of F_κ. Hence evaluate t_p(F_κ) for κ = 1, 2, 5, 10 and p = 0, 0.1, 0.2, 0.3, 0.4, 0.5.
(b) Suppose that the parameter of interest, θ = Σ_{i=1}^k c_i t_p(F_{i,κ_i}), depends on several gamma distributions F_{i,κ_i}. Let F̂_i denote the EDF of a sample of size n_i from F_{i,κ_i}. Under what circumstances is T = Σ_{i=1}^k c_i t_p(F̂_i) (i) unbiased, (ii) nearly unbiased, as an estimate of θ? Test your conclusions by a small simulation experiment.
(Section 3.11)

3.14 Practicals

1  To perform the analysis for the gravity data outlined in Example 3.2:

grav.fun <- function(data, i)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean)
  v <- tapply(d$g, d$series, var)
  n <- table(d$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
grav.boot <- boot(gravity, grav.fun, R=200, strata=gravity$series)

Plot the estimate and its variance. Is the simulation well-behaved? How normal are the bootstrapped estimates and studentized bootstrap statistics?
Now for a semiparametric analysis, as suggested in Section 3.3:

attach(gravity)
n <- table(series)
m <- rep(tapply(g, series, mean), n)
s <- rep(sqrt(tapply(g, series, var)), n)
res <- (g - m)/s
qqnorm(res); abline(0, 1, lty=2)
grav <- data.frame(m, s, series, res)
grav.fun <- function(data, i)
{ e <- data$res[i]
  y <- data$m + data$s*e
  m <- tapply(y, data$series, mean)
  v <- tapply(y, data$series, var)
  n <- table(data$series)
  c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
grav1.boot <- boot(grav, grav.fun, R=200)

Do the residuals res for the different series look similar? Compare the values of t* and v* for the two sampling schemes. Compare also 80% confidence intervals for g.
(Section 3.2)

2  Dataframe channing contains data on the survival of 97 men and 365 women in a retirement home in California. The variables are sex, ages in months at which individuals entered and left the home, the time in months they spent there, and a censoring indicator (0/1 denoting censored due to leaving the home/died there). For details see Hyde (1980). We compare the variability of the survival probabilities at 75 and 85 years (900 and 1020 months), and of the estimated 0.75 and 0.5 quantiles of the survival distribution.



chan <- channing[1:97,]    # men only
chan$age <- chan$entry + chan$time
attach(chan)
chan.F <- survfit(Surv(age, cens))
chan.F
max(chan.F$surv[chan.F$time > 900])
max(chan.F$surv[chan.F$time > 1020])
chan.G <- survfit(Surv(age - 0.01*cens, 1 - cens))
split.screen(c(2,1))
screen(1); plot(chan.F, xlim=c(760,1200), main="survival")
screen(2); plot(chan.G, xlim=c(760,1200), main="censoring")
chan.fun <- function(data)
{ s <- survfit(Surv(age, cens), data=data)
  c(max(s$surv[s$time > 900]), max(s$surv[s$time > 1020]),
    min(s$time[s$surv <= 0.75]), min(s$time[s$surv <= 0.5])) }
chan.boot1 <- censboot(chan, chan.fun, R=99, sim="ordinary")
chan.boot2 <- censboot(chan, chan.fun, R=99, F.surv=chan.F,
                       G.surv=chan.G, sim="cond", index=c(6,5))
chan.boot3 <- censboot(chan, chan.fun, R=99, F.surv=chan.F,
                       sim="weird", index=c(6,5))

Give normal-approximation confidence limits for each of the survival probabilities, transformed if necessary, and compare them with those from chan.F. How do the intervals for the different bootstraps compare?
(Section 3.5; Efron, 1981a)

3  To study the performance of censored data resampling schemes when the censoring pattern is fixed, we perform a small simulation study. We apply a fixed censoring pattern to samples of size 50 from the unit exponential distribution, and for each sample we calculate t = (t_1, t_2), where t_1 is the maximum likelihood estimate of the distribution mean and t_2 is the number of censored observations. We apply each bootstrap scheme to the sample, and record the mean and standard deviation of t from the bootstrap simulation. (This is quite time-consuming: take nreps and R as big as you dare.)

exp.fun <- function(d)
{ d.s <- survfit(Surv(y, cens), data=d)
  prob <- min(d.s$surv[d.s$time < 1])
  med <- min(d.s$time[(1 - d.s$surv) >= 0.5])
  c(sum(d$y)/sum(d$cens), sum(1 - d$cens)) }
results <- NULL; nreps <- 100; n <- 50; R <- 25
cens <- 3*runif(n)
for (i in 1:nreps)
{ y0 <- rexp(n)
  junk <- data.frame(y = pmin(y0, cens), cens = as.numeric(y0 <= cens))
  junk.F <- survfit(Surv(y, cens), data=junk)
  junk.G <- survfit(Surv(y, 1 - cens), data=junk)
  ord.boot <- censboot(junk, exp.fun, R=R)
  con.boot <- censboot(junk, exp.fun, R=R,
                       F.surv=junk.F, G.surv=junk.G, sim="cond")
  wei.boot <- censboot(junk, exp.fun, R=R,
                       F.surv=junk.F, sim="weird")
  res <- c(exp.fun(junk),
           apply(ord.boot$t, 2, mean),
           apply(con.boot$t, 2, mean),
           apply(wei.boot$t, 2, mean),
           sqrt(apply(ord.boot$t, 2, var)),
           sqrt(apply(con.boot$t, 2, var)),
           sqrt(apply(wei.boot$t, 2, var)))
  results <- rbind(results, res) }

The estimated bias and standard deviation of t_1, and the bootstrap bias estimates, are

mean(results[,1]) - 1
sqrt(var(results[,1]))
bias.o <- results[,3] - results[,1]
bias.c <- results[,5] - results[,1]
bias.w <- results[,7] - results[,1]

How do they compare? What about the estimated standard deviations? How do the numbers of censored observations vary under the schemes?
(Section 3.5; Efron, 1981a; Burr, 1994)

4  The tau particle is a heavy electron-like particle which decays into various collections of other charged particles shortly after its production. The decay usually involves one charged particle, in which case it can happen in a number of modes, the main four of which are labelled ρ, π, e, and μ. It takes a major research project to measure the rate of occurrence of single-particle decay, decay₁, or any of its component rates decay_ρ, decay_π, decay_e, and decay_μ, and just one of these can be measured in any one experiment. Thus the dataframe tau, containing decay rates for 60 experiments, represents several years of work. Here we use the data to estimate and form a confidence interval for the parameter

    θ = decay₁ - decay_ρ - decay_π - decay_e - decay_μ.

Suppose that we had thought of using the 0, 12.5, 25, 37.5 and 50% trimmed averages to estimate the difference. To calculate these and to obtain bootstrap confidence intervals for the estimates of θ:

tau.diff <- function(data)
{ y0 <- tapply(data[,1], data[,2], mean)
  y1 <- tapply(data[,1], data[,2], mean, trim=0.125)
  y2 <- tapply(data[,1], data[,2], mean, trim=0.25)
  y3 <- tapply(data[,1], data[,2], mean, trim=0.375)
  y4 <- tapply(data[,1], data[,2], median)
  y <- rbind(y0, y1, y2, y3, y4)
  y[,1] - apply(y[,-1], 1, sum) }
tau.diff(tau)
tau.fun <- function(data, i) tau.diff(data[i,])
tau.boot <- boot(tau, tau.fun, R=999, strata=tau$decay)
boot.ci(tau.boot, type=c("norm","basic"), index=1)
boot.ci(tau.boot, type=c("norm","basic"), index=2)

and so forth, with index=3, 4, 5 for the remaining degrees of trim. Does the degree of trimming affect the interval much?
To see the jackknife-after-bootstrap plot when θ is estimated using the average:

jack.after.boot(tau.boot, index=1)

How does the degree of trim affect the bootstrap distributions of the different estimators of θ?



Now suppose that we want to choose the estimator from the data, by taking the trimmed average with smallest variance. For the original data this is the 25% trimmed average, so the estimate is 16.87. Its variance can be estimated by a double bootstrap, which we can implement as follows:

tau.nest <- function(data, i)
{ d <- data[i,]
  d.trim <- tau.diff(d)
  v.trim <- apply(boot(d, tau.fun, R=25, strata=d$decay)$t, 2, var)
  c(d.trim, v.trim) }
tau.boot2 <- boot(tau, tau.nest, R=100, strata=tau$decay)

To see what degrees of trimming give the smallest variances, and to calculate the corresponding estimates and obtain their variance:

i <- matrix(1:5, 5, tau.boot2$R)
i <- i[t(tau.boot2$t[,6:10] == apply(tau.boot2$t[,6:10], 1, min))]
table(i)
t.best <- tau.boot2$t[cbind(1:tau.boot2$R, i)]
var(t.best)

Is the optimal degree of trimming well-determined?
How would you use the results of Problems 2.13 and 2.4 to avoid the second level of bootstrapping?
(Section 3.11; Efron, 1992)

5  We apply the jackknife-after-bootstrap to the correlation coefficient between plumage and behaviour in cross-bred ducks.

ducks.boot <- boot(ducks, corr, R=999, stype="w")
ducks.L <- empinf(data=ducks, statistic=corr)
split.screen(c(1,2))
screen(1)
split.screen(c(2,1))
screen(4)
attach(ducks)
plot(plumage, behaviour, type="n")
text(plumage, behaviour, round(ducks.L, 2))
screen(3)
plot(plumage, behaviour, type="n")
text(plumage, behaviour, 1:nrow(ducks))
screen(2)
jack.after.boot(boot.out=ducks.boot, useJ=F, stinf=F, L=ducks.L)

(a) The value of the correlation is t = 0.83. Will it increase or decrease if observation 7 is deleted from the sample? (Be careful.) What is the effect on t of deleting observation 6?
(b) What happens to the bootstrap distribution of t* - t when observation 8 is deleted from the sample? What about observation 6?
(c) Show that the probability that neither observation 5 nor observation 6 is in a bootstrap sample is (1 - 2/11)^{11} = 0.11. Now suppose that observation 5 is deleted, and calculate the probability that observation 6 is not in a bootstrap sample. Does this explain what happens in (b)?

6  Suppose that we are interested in the largest eigenvalue of the covariance matrix between the baseline and one-year CD4 counts in cd4; see Practical 2.3. To calculate this and its approximate variance using the nonparametric delta method (Problem 2.14), and to bootstrap it:

eigen.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
{ w <- w/sum(w)
  n <- nrow(d)
  m <- crossprod(w, d)
  m2 <- sweep(d, 2, m)
  v <- crossprod(diag(sqrt(w)) %*% m2)
  eig <- eigen(v, symmetric=T)
  stat <- eig$values[1]
  e <- eig$vectors[,1]
  i <- rep(1:n, round(n*w))
  ds <- sweep(d[i,], 2, m)
  L <- (ds %*% e)^2 - stat
  c(stat, sum(L^2)/n^2) }

cd4.boot <- boot(cd4, eigen.fun, R=999, stype="w")

Some diagnostic plots:

split.screen(c(1,2))
screen(1); split.screen(c(2,1))
screen(3)
plot(cd4.boot$t[,1], cd4.boot$t[,2], xlab="t*", ylab="vL*", pch=".")
screen(4)
plot(cd4[,1], cd4[,2], type="n", xlab="baseline",
     ylab="one year", xlim=c(1,7), ylim=c(1,7))
text(cd4[,1], cd4[,2], c(1:20), cex=0.7)
screen(2); jack.after.boot(cd4.boot, useJ=F, stinf=F)

What is going on here?
(Section 3.10.1; Canty, Davison and Hinkley, 1996)


4

Tests

4.1 Introduction

Many statistical applications involve significance tests to assess the plausibility of scientific hypotheses. Resampling methods are not new to significance testing, since randomization tests and permutation tests have long been used to provide nonparametric tests. Also Monte Carlo tests, which use simulated datasets, are quite commonly used in certain areas of application. In this chapter we describe how resampling methods can be used to produce significance tests, in both parametric and nonparametric settings. The range of ideas is somewhat wider than the direct bootstrap approach introduced in the preceding two chapters. To begin with, we summarize some of the key ideas of significance testing.

The simplest situation involves a simple null hypothesis H0 which completely specifies the probability distribution of the data. Thus, if we are dealing with a single sample y_1,...,y_n from a population with CDF F, then H0 specifies that F = F0, where F0 contains no unknown parameters. An example would be "exponential with mean 1". The more usual situation in practice is that H0 is a composite null hypothesis, which means that some aspects of F are not determined and remain unknown when H0 is true. An example would be "normal with mean 1", the variance of the normal distribution being unspecified.

P-values

A statistical test is based on a test statistic T which measures the discrepancy between the data and the null hypothesis. In general discussion we shall follow the convention that large values of T are evidence against H0. Suppose for the moment that this null hypothesis is simple. If the observed value of the test statistic is denoted by t, then the level of evidence against H0 is measured by




the significance probability

    p = Pr(T ≥ t | H0),    (4.1)

often called the P-value. A corresponding notion is that of a critical value t_p for t, associated with testing at level p: if t ≥ t_p then H0 is rejected at level p, or 100p%. Necessarily t_p is defined by Pr(T ≥ t_p | H0) = p. The level p is also called the error rate or the size of the test, and {(y_1,...,y_n) : t ≥ t_p} is called the level p critical region of the test. The distribution of T under H0 is called the null distribution of T.

Under H0 the P-value (4.1) has a uniform distribution on [0, 1] if T is continuous, so that the corresponding random variable P has distribution

    Pr(P ≤ p | H0) = p.    (4.2)

This yields the error rate interpretation of the P-value, namely that if the observed test statistic were regarded as just decisive against H0, then this is equivalent to following a procedure which rejects H0 with error rate p. The same is not exactly true if T is discrete, and for this reason modifications to (4.1) are sometimes suggested for discrete data problems: we shall not worry about the distinction here.

It is important in applications to give a clear idea of the degree of discrepancy between data and null hypothesis, if not giving the P-value itself then at least indicating how it compares to several levels, say p = 0.10, 0.05, 0.01, rather than just testing at the 0.05 level.

Choice of test statistic

In the parametric setting, we have an explicit form for the sampling distribution of the data with a finite number of unknown parameters. Often the null hypothesis specifies numerical values for, or relationships between, some or all of these parameters. There is also an alternative hypothesis H_A which describes what alternatives to H0 it is most important to detect, or what is thought likely to be true if H0 is not. This alternative hypothesis guides the specific choice of T, usually through use of the likelihood function

    L(θ) = f_{Y_1,...,Y_n}(y_1,...,y_n | θ),

i.e. the joint density of the observations. For example, when H0 and H_A are both simple, say H0 : θ = θ0 and H_A : θ = θ_A, then the best test statistic is the likelihood ratio

    T = L(θ_A)/L(θ0).    (4.3)

A rather different situation is where we wish to test the goodness of fit of the parametric model. Sometimes this can be done by embedding the model into a larger model, with one or a few additional parameters corresponding to departure from the original model. We would then test those additional parameters. Otherwise general purpose goodness-of-fit tests will be used, for example chi-squared tests.

In the nonparametric setting, no particular forms are specified for the distributions. Then the appropriate choice of T is less clear, but it should be based on at least a qualitative notion of what is of concern should H0 not be true. Usually T would be based on a statistical function s(F) that reflects the characteristic of physical interest and for which the null hypothesis specifies a value. For example, suppose that we wish to test the null hypothesis H0 that X and Y are independent, given the random sample (X_1, Y_1),...,(X_n, Y_n). The correlation s(F) = corr(X, Y) = ρ is a convenient measure of dependence, and ρ = 0 under H0. If the alternative hypothesis is positive dependence, then a natural test statistic is T = s(F̂), the raw sample correlation; if the alternative hypothesis is just "dependence", then the two-sided test statistic T = s²(F̂) could be used.

Conditional tests

In most parametric problems and all nonparametric problems, the null hypothesis H0 is composite, that is it leaves some parameters unknown and therefore does not completely specify F. Therefore the P-value (4.1) is not generally well-defined, because Pr(T ≥ t | F) may depend upon which F satisfying H0 is taken. There are two clean solutions to this difficulty. One is to choose T carefully so that its distribution is the same for all F satisfying H0: examples include the Student-t test for a normal mean with unknown variance, and rank tests for nonparametric problems. The second and more widely applicable solution is to eliminate the parameters which remain unknown when H0 is true by conditioning on the sufficient statistic under H0. If this sufficient statistic is denoted by S, then we define the conditional P-value by

    p = Pr(T ≥ t | S = s, H0).    (4.4)

Familiar examples include the Fisher exact test for a 2 × 2 table and the Student-t test mentioned earlier. Other examples will be given in the next two sections.

A less satisfactory approach, which can nevertheless give good approximations, is to estimate F by a CDF F̂0 which satisfies H0 and then calculate

    p = Pr(T ≥ t | F̂0).    (4.5)

Typically this value will not satisfy (4.2) exactly, but will deviate by an amount which may be practically negligible.

Pivot tests

When the null hypothesis concerns a particular parameter value, the equivalence between significance tests and confidence sets can be used. This equivalence is that if the value θ0 is outside a 1 - α confidence set for θ, then θ differs from θ0 with P-value less than α. The particular alternative hypothesis for which this applies is determined by the type of confidence set: for example, if the confidence set is all values to the right of a lower confidence limit, then the implied alternative is H_A : θ > θ0. A specific form of test based on this equivalence is the pivot test. For example, suppose that T is an estimator for scalar θ, with estimated variance V. Suppose further that the studentized form Z = (T - θ)/V^{1/2} is a pivot, meaning that its distribution is the same for all relevant F, and in particular for all θ. The Student-t statistic is a familiar instance of this. For the one-sided test of H0 : θ = θ0 versus H_A : θ > θ0, the P-value attached to the observed studentized test statistic z0 = (t - θ0)/v^{1/2} is

    p = Pr{(T - θ0)/V^{1/2} ≥ (t - θ0)/v^{1/2} | H0}.

But because Z is a pivot,

    Pr{Z ≥ (t - θ0)/v^{1/2} | H0} = Pr{Z ≥ (t - θ0)/v^{1/2} | F},

and therefore

    p = Pr(Z ≥ z0 | F).    (4.6)

The particular advantage of this, in the resampling context, is that we do not have to construct a special null hypothesis sampling distribution.
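For instance, with bootstrap replicates of (T, V) in hand, (4.6) might be approximated as in the following sketch; it assumes that the two columns of boot.out$t hold replicates t*_r and v*_r generated from the fitted model F̂, and that t, v and theta0 are the observed estimate, its variance estimate and the null value.

z0 <- (t - theta0)/sqrt(v)                          # observed studentized statistic
zstar <- (boot.out$t[,1] - t)/sqrt(boot.out$t[,2])  # replicates of the pivot Z
(1 + sum(zstar >= z0))/(1 + nrow(boot.out$t))       # estimate of Pr(Z >= z0 | F)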

In parametric problems it is usually possible to express the model in terms of the parameter of interest ψ and other (nuisance) parameters λ, so that the null hypothesis concerns only ψ. In the above discussion of conditional tests, (4.4) would be independent of λ. One general approach to construction of a test statistic T is to generalize the simple likelihood ratio (4.3), and to define

    LR = max_{H_A} L(ψ, λ) / max_{H_0} L(ψ, λ).

For testing H0 : ψ = ψ0 versus H_A : ψ ≠ ψ0, this generalized likelihood ratio is equivalent to the more convenient expression

    LR = L(ψ̂, λ̂)/L(ψ0, λ̂0) = max_{ψ,λ} L(ψ, λ) / max_λ L(ψ0, λ).    (4.7)

Of course this also applies when there is no nuisance parameter. For many models it is possible to show that T = 2 log LR has approximately the χ²_d distribution under H0, where d is the dimension of ψ, so that

    p = Pr(χ²_d ≥ t),    (4.8)

independently of λ. Thus the likelihood ratio LR is an approximate pivot.

There is a variety of related statistics, including the score statistic, and the signed likelihood ratio for one-parameter problems. With each likelihood-based statistic there is a simple approximation to the null distribution, and modifications to improve approximation in moderate-sized samples. The likelihood ratio method appears limited to parametric problems, but as we shall see in Chapter 10 it is possible to define analogues in the nonparametric case.

With all o f the P-value calculations introduced thus far, simple approxim a­tions for p exist in m any cases by appealing to limiting results as n increases. Part o f the purpose o f this chapter is to provide resam pling alternatives to such approxim ations when they either fail to give appropriate accuracy or do not exist a t all. Section 4.2 discusses ways in which resam pling and sim ulation can help with param etric tests, starting with exact M onte Carlo tests. Section 4.3 briefly reviews perm utation and random ization tests. This leads on to the wider topic of nonparam etric bootstrap tests in Section 4.4. Section 4.5 describes a simple m ethod for improving P-values when these are biased. M ost o f the examples in this chapter involve relatively simple applications. Chapters 6 and beyond contain more substantial applications.

4.2 Resampling for Parametric Tests

Broadly speaking, parametric resampling may be useful in any testing problem where either standard approximations do not apply or where the accuracy of such approximations is suspect. There is a wide range of such problems, including hypotheses with order constraints, hypotheses involving separate models, and graphical tests. In all of these problems, the basic method is to use a parametric resampling scheme as outlined in Section 2.2, except that here the simulation model must satisfy the relevant null hypothesis.

4.2.1 Monte Carlo tests

One special situation is when the null hypothesis distribution of T does not involve any nuisance parameters. Occasionally this happens directly, but more often it is induced, either by standardizing some initial statistic, or by conditioning on a sufficient statistic, as explained earlier. In the latter case the exact P-value is given by (4.4) rather than (4.1). In practice the exact P-value may be difficult or impossible to calculate, and Monte Carlo tests provide convenient approximations to the full tests. As we shall see, Monte Carlo tests are exact in their own right, and among bootstrap tests are special in this way.

The basic Monte Carlo test compares the observed statistic t to R independent values of T which are obtained from corresponding samples independently simulated under the null hypothesis model. If these simulated values are denoted by t*_1,...,t*_R, then under H0 all R + 1 values t, t*_1,...,t*_R are equally



#{A} means the number of times the event A occurs.

likely values of T. That is, assuming T is continuous,

    Pr(T < T*_(r) | H0) = r/(R + 1),    (4.9)

where as usual T*_(r) denotes the rth ordered value. If exactly k of the simulated t* values exceed t and none equal it, then

    p = Pr(T ≥ t | H0) ≈ p_mc = (k + 1)/(R + 1).    (4.10)

The right-hand side is referred to as the Monte Carlo P-value. If T is continuous, then it follows from (4.9) that under H0 the distribution of the corresponding random variable P_mc is uniform on {1/(R+1), 2/(R+1),...,1}. This result is the discrete analogue of (4.2), and guarantees that P_mc has the error rate interpretation. In this sense the Monte Carlo test is exact. It differs from the full test, which corresponds to R = ∞, by blurring the critical region of the full test for any attainable level.

If T is discrete, then repeat values of t* can occur. If exactly l of the t* values equal t, then it is sometimes advocated that one bounds the significance probability,

    (k + 1)/(R + 1) ≤ p_mc ≤ (k + l + 1)/(R + 1).

Our strict interpretation of (4.1) would have us use the upper bound, and so we adopt the general definition

    p_mc = (1 + #{t*_r ≥ t})/(R + 1).    (4.11)

Example 4.1 (Logistic regression)  Suppose that y_1,...,y_n are independent binary outcomes, with corresponding scalar covariate values x_1,...,x_n, and that we wish to test whether or not x influences y. If our chosen model is the logistic regression model

    log{Pr(Y_j = 1 | x_j)/Pr(Y_j = 0 | x_j)} = α + ψx_j,

then the null hypothesis is H0 : ψ = 0. Under H0 the sufficient statistic for the nuisance parameter α is S = Σ Y_j, and T = Σ x_j Y_j is the natural test statistic; T is in fact optimal for the logistic model, but is also effective for monotone transformations of the odds ratio other than the logarithm. The significance is to be calculated according to (4.4).

The null distribution of Y_1,...,Y_n given S = s is uniform over all n!/{s!(n - s)!} distinct permutations of y_1,...,y_n. Rather than generate all of these permutations to compute (4.4) exactly, we can generate R random permutations and apply (4.11). A simulated sample will then be (x_1, y*_1),...,(x_n, y*_n), where y*_1,...,y*_n is a random permutation of y_1,...,y_n, and the associated test statistic will be

    t* = Σ x_j y*_j.



Table 4.1  n = 50 counts of balsam-fir seedlings in five feet square quadrats.

    0 1 2 3 4 3 4 2 2 1
    0 2 0 2 4 2 3 3 4 2
    1 1 1 1 4 1 5 2 2 3
    4 1 2 5 2 0 3 2 1 1
    3 1 4 3 1 0 0 2 7 0

In some applications there will be repeats among the x values, or equivalently m_i binomial trials with a_i occurrences of y = 1 at the ith distinct value of x. If the data are expressed in the latter form, then the same random permutation procedure can be applied to the original expanded form of the data with n = Σ m_i individual ys. ■
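A minimal sketch of this permutation version of the Monte Carlo test, in the style of the practicals of earlier chapters; x is the covariate vector and y the binary response vector.

perm.test <- function(x, y, R = 999)
{ t0 <- sum(x * y)                                     # observed statistic t
  tstar <- sapply(1:R, function(r) sum(x * sample(y))) # t* for R random permutations
  (1 + sum(tstar >= t0))/(R + 1) }                     # Monte Carlo P-value (4.11)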

Example 4.2 (Overdispersed counts)  The data in Table 4.1 are n = 50 counts of fir seedlings in small quadrats, part of a larger dataset. The actual spatial layout is preserved, although we are not concerned with this here. Rather we wish to test the null hypothesis that these data are a random sample from a Poisson distribution with unknown mean. The concern is that the data are overdispersed relative to the Poisson distribution, which strongly suggests that we take as test statistic the dispersion index T = Σ(Y_j - Ȳ)²/Ȳ. Under the Poisson model S = Σ Y_j is sufficient for the common mean, so we carry out a conditional test and apply (4.4). For the data, t = 55.15 and s = 107.

Now under the null hypothesis Poisson model, the conditional distribution of Y_1,...,Y_n given Σ Y_j = s is multinomial with denominator s and n categories each having probability n^{-1}. It is easy to simulate from this distribution. In the first R = 99 simulated values t*, 24 are larger than t = 55.15, so the Monte Carlo P-value (4.11) is equal to 0.25, and we conclude that the data dispersion is consistent with Poisson dispersion. Increasing R to 999 makes little difference, giving p = 0.235. The left panel of Figure 4.1 shows a histogram of all 999 values of t* - t: the unshaded part of the histogram corresponds to values t* ≥ t which count toward significance.

For this simple problem the null distribution of T given S = s is approximately χ²_{n-1}. That this approximation is accurate for our data is illustrated in the right panel of Figure 4.1, which plots the ordered values of t* against quantiles of the χ²_49 distribution. The P-value obtained with this approximation is 0.253, close to the exact value. There are two points to make about this. First, the simulation results enable us to check on the accuracy of the theoretical approximation: if the approximation is good, then we can use it; but if it isn't, then we have the Monte Carlo P-value. Secondly, the Monte Carlo method does not require knowledge of a theoretical approximation, which may not even exist in more complicated problems, such as spatial analysis of these data. The Monte Carlo method applies very generally. ■
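The conditional simulation used here is easily coded; the sketch below (y is the vector of counts) allocates the total s = Σ y_j at random among the n quadrats to produce each multinomial sample, and applies (4.11).

disp.test <- function(y, R = 999)
{ n <- length(y); s <- sum(y)
  disp <- function(z) sum((z - mean(z))^2)/mean(z)    # dispersion index
  tstar <- sapply(1:R, function(r)
    disp(tabulate(sample(n, s, replace = TRUE), nbins = n)))
  (1 + sum(tstar >= disp(y)))/(R + 1) }               # Monte Carlo P-value (4.11)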




Figure 4.1  Simulation results for the dispersion test. Left panel: histogram of R = 999 values of the dispersion statistic t* obtained under multinomial sampling; the data value is t = 55.15 and p_mc = 0.235. Right panel: chi-squared plot of ordered values of t*, the dashed line corresponding to the χ²_49 approximation to the null conditional distribution.

It seems intuitively clear that the sensitivity of the Monte Carlo test increases with R. We shall discuss this issue later, but for now we note that it is advisable to take R to be at least 99.

There are two important aspects of the Monte Carlo test which make it widely useful. The first is that we only need to be able to simulate data under the null hypothesis, this being relatively simple even in some very complicated problems, such as those involving spatial processes (Chapter 8). Secondly, the simulated values t*_1,...,t*_R do not need to be independent outcomes: the method remains valid so long as they are exchangeable outcomes, which is to say that the joint density of T, T*_1,...,T*_R under H0 is invariant under permutation of its arguments. This allows us to apply Monte Carlo tests to quite complicated problems, as we see next.

4.2.2 Markov chain Monte Carlo tests

In some applications of the exact conditional test, with P-value given by (4.4), the conditional probability calculation is difficult or impossible to do directly. The Monte Carlo test is in principle appropriate here, since the null distribution (given s) does not depend upon unknown parameters. A practical obstacle is that in complicated problems it may be difficult to simulate independent samples directly from that conditional null distribution. However, as we observed before, the Monte Carlo test only requires exchangeable samples. This opens up a new possibility, the use of Markov chain Monte Carlo simulation, in which only the unconditional null distribution is needed.

The basic idea is to represent the data y = (y_1,...,y_n) as the result of N steps of a Markov chain with some initial state x = (x_1,...,x_n), and to generate each y* by an independent simulation of N steps with the same initial state x. If the Markov chain has equilibrium distribution equal to the null hypothesis distribution of Y = (Y_1,...,Y_n), then y and the R replicates of y* are exchangeable outcomes under H0 and (4.11) applies.

Suppose that under H0 the data have joint density f0(y) for y ∈ 𝒮, where both f0 and 𝒮 are conditioned on the sufficient statistic s if we are dealing with a conditional test. For simplicity suppose that 𝒮 has |𝒮| elements, which we now regard as possible states labelled (1, 2,...,|𝒮|) of a Markov chain {Z_t, t = ..., -1, 0, 1,...} in discrete time. Consider the data y to be one realization of Z_N. We then have to fix an appropriate value or state for Z_0, and with this initial state simulate the R independent values of Z_N which are the R values of Y*. The Markov chain is defined so that f0 is the equilibrium distribution, which can be enforced by appropriate choice of the one-step forward transition probability matrix Q, say, with elements

    q_uv = Pr(Z_{t+1} = v | Z_t = u),    u, v ∈ 𝒮.

For the moment suppose that Q is already known.

The first part of the simulation is to produce a value for Z_0. Starting from state y at time N, we simulate N backward steps of the Markov chain using the one-step backward transition probabilities

    Pr(Z_t = u | Z_{t+1} = v) = f0(u) q_uv / f0(v).    (4.12)

Let the final state, the realized value of Z_0, be x. Note that if H0 is true, so that y was indeed sampled from f0, then Pr(Z_0 = x) = f0(x). In the second part of the simulation, which we repeat independently R times, we simulate N forward steps of the Markov chain, starting in state x and ending up in state y* = (y*_1,...,y*_n). Since under H0 the chain starts in equilibrium,

    Pr(Y* = y* | H0) = Pr(Z_N = y*) = f0(y*).

That is, if H0 is true, then the R replicates y*_1,...,y*_R and the data y are all sampled from f0, as we require. Moreover, the R replicates of y* are jointly exchangeable with the data under H0. To see this, we have first that

    f(y, y*_1,...,y*_R | H0) = f0(y) Σ_x Pr(Z_0 = x | Z_N = y) Π_{r=1}^R Pr(Z_N = y*_r | Z_0 = x),

using the independence of the replicate simulations from x. But by the definition of the first part of the simulation, where (4.12) applies,

    f0(y) Pr(Z_0 = x | Z_N = y) = f0(x) Pr(Z_N = y | Z_0 = x),



and so

    f(y, y*_1,...,y*_R | H0) = Σ_x f0(x) { Pr(Z_N = y | Z_0 = x) Π_{r=1}^R Pr(Z_N = y*_r | Z_0 = x) },

which is a symmetric function of y, y*_1,...,y*_R, as required. Given that the data vector and simulated data vectors are exchangeable under H0, the associated test statistic values (t, t*_1,...,t*_R) are also exchangeable outcomes under H0. Therefore (4.11) applies for the P-value calculation.

To complete the description of the method, it remains to define the transition probability matrix Q so that the chain is irreducible with equilibrium distribution f0(y). There are several ways to do this, all of which use ratios f0(v)/f0(u). For example, the Metropolis algorithm starts with a carrier Markov chain on state space 𝒮 having any symmetric one-step forward transition probability matrix M, and defines the one-step forward transition from state u in the desired Markov chain as follows:

• given we are in state u, select state v with probability m_uv;
• accept the transition to v with probability min{1, f0(v)/f0(u)}, otherwise reject it and stay in state u.

It is easy to check that the induced Markov chain has transition probabilities

    q_uv = min{1, f0(v)/f0(u)} m_uv,    u ≠ v,

and

    q_uu = m_uu + Σ_{v≠u} max{0, 1 - f0(v)/f0(u)} m_uv,

and from this it follows that f0 is indeed the equilibrium distribution of the Markov chain, as required. In applications it is not necessary to calculate the probabilities m_uv explicitly, although the symmetry and irreducibility of the carrier chain must be checked. If the matrix M is not symmetric, then the acceptance probability in the Metropolis algorithm must be modified to min[1, f0(v)m_vu/{f0(u)m_uv}].

The crucial feature of the Markov chain method is that f0 itself is not needed, only ratios f0(v)/f0(u) being involved. This means that for conditional tests, where f0 is the conditional density for Y given S = s, only ratios of the unconditional null density for Y are needed:

    f0(v)/f0(u) = Pr(Y = v | S = s, H0)/Pr(Y = u | S = s, H0) = Pr(Y = v | H0)/Pr(Y = u | H0).

This greatly simplifies many applications.

The realizations of the Markov chain are symmetrically tied to the artificial starting value x, and this induces a symmetric correlation among (t, t*_1,...,t*_R). This correlation depends upon the particular construction of Q, and reduces to zero at a rate which depends upon Q as N increases. While the correlation does not affect the validity of the P-value calculation, it does affect the power of the test: the higher the correlation, the lower the power.

Example 4.3 (Logistic regression)  We return to the problem of Example 4.1, which provides a very simple if artificial illustration. The data y are a binary sequence of length n with s ones, and calculations are to be conditional on Σ Y_j = s. Recall that direct Monte Carlo simulation is possible, since all n!/{s!(n - s)!} possible data sequences are equally likely under the null hypothesis of constant probability of a unit response.

One simple Markov chain has one-step transitions which select a pair of subscripts i, j at random, and switch y_i and y_j. Clearly the chain is irreducible, since one can progress from any one binary sequence with s ones to any other. All ratios of null probabilities f0(v)/f0(u) are equal to one, since all binary sequences with s ones are equally probable. Therefore if we run the Metropolis algorithm, all switches are accepted. But note that this Markov chain, while simple to implement, is inefficient and will require a large number of steps to induce approximate independence of the t*'s. The most effective Markov chain would have one-step transitions which are random permutations, and for this only one step would be required. ■
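Since all acceptance ratios equal one here, N steps of the Metropolis algorithm reduce to N random transpositions of y, as in this sketch:

switch.chain <- function(y, N)
{ for (t in 1:N)
  { ij <- sample(length(y), 2)   # select a pair of subscripts at random
    y[ij] <- y[rev(ij)] }        # switch the two values; always accepted
  y }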

Example 4.4 (AML data)  For data such as those in Example 3.9, consider testing the null hypothesis of proportional hazard functions. Denote the failure times by z_1 < z_2 < ··· < z_n, assuming no ties for the moment, and define r_ij to be the number in group i who were at risk just prior to z_j. Further, let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2, and denote the hazard function at time z for group i by h_i(z). Then

    Pr(Y_j = 1) = r_2j h_2(z_j) / {r_1j h_1(z_j) + r_2j h_2(z_j)} = θ_j/(a_j + θ_j),

where a_j = r_1j/r_2j and θ_j = h_2(z_j)/h_1(z_j) for j = 1,...,n. The null hypothesis of proportional hazards implies the hypothesis H0 : θ_1 = ··· = θ_n.

For the data of Example 3.9, where n = 18, the values of y and a are given in Table 4.2; one tie has been randomly split. Note that censored data contribute only to the rs: the times are not used.

Of course the Y_j are not independent, because a_j depends upon the outcomes of Y_1,...,Y_{j-1}. However, for the purposes of illustration here we shall pretend that the a_j are fixed, as well as the survival times and censoring times. That is, we shall treat the Y_j as independent Bernoulli variables with probabilities as given above. Under this pretence the conditional likelihood for



Table 4.2  Ingredients of the conditional test for proportional hazards. Failure times as in Table 3.4; at time z = 23 the failure in group 2 is taken to occur first.

    z    5     5     8     8     9     12    13    18    23    23    27    30    31    33    34    43    45    48
    r1   11    11    11    11    11    10    10    8     7     7     6     5     5     4     4     3     3     2
    r2   12    11    10    9     8     8     7     6     6     5     5     4     3     3     2     2     1     0
    a    11/12 1     11/10 11/9  11/8  10/8  10/7  8/6   7/6   7/5   6/5   5/4   5/3   4/3   2     3/2   3     ∞
    y    1     1     1     1     0     1     0     0     1     0     1     1     0     1     0     1     1     0

θ_1,...,θ_18 is simply

    Π_{j=1}^{18} θ_j^{y_j} a_j^{1-y_j} / (a_j + θ_j).

Note that because a_18 = ∞, y_18 must be 0 whatever the value of θ_18, and so this final response is uninformative. We therefore drop y_18 from the analysis. Having done this, we see that under H0 the sufficient statistic for the common hazard ratio θ is S = Σ Y_j, whose observed value is s = 11.

Whatever the test statistic T, the exact conditional P-value (4.4) must be approximated. Direct simulation appears impossible, but a simple Markov chain simulation is possible. First, the state space of the chain is 𝒮 = {x = (x_1,...,x_17) : Σ x_j = s}, that is all permutations of y_1,...,y_17. For any two vectors x and x̃ in the state space, the ratio of null conditional joint probabilities is

    p(x̃ | s, θ) / p(x | s, θ) = Π_{j=1}^{17} a_j^{x_j - x̃_j}.

We take the carrier Markov chain to have one-step transitions which are random permutations: this guarantees fast movement over the state space. A step which moves from x to x̃ is then accepted with probability min{1, Π_{j=1}^{17} a_j^{x_j - x̃_j}}. By symmetry the reverse chain is defined in exactly the same way.

The test statistic must be chosen to match the particular alternative hypothesis thought relevant. Here we suppose that the alternative is a monotone ratio of hazards, for which T = Σ_{j=1}^{17} Y_j log(z_j) seems to be a reasonable choice. The Markov chain simulation is applied with N = 100 steps back to give the initial state x and 100 steps forward to state y*, the latter repeated R = 99 times. Of the resulting t* values, 48 are less than or equal to the observed value t = 17.75, so the P-value is (1 + 48)/(1 + 99) = 0.49. Thus there appears to be no evidence against the proportional hazards model.

The average acceptance probability in the Metropolis algorithm is approximately 0.7, and results for N = 10 and N = 1000 appear indistinguishable from those for N = 100. This indicates unusually fast convergence for applications of the Markov chain method. ■
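A minimal sketch of this conditional Metropolis test in Python follows (our own illustration, not the book's code); it assumes the uninformative final observation has already been dropped, so that all the a_j are finite, and uses T = Σ y_j log z_j as above.

    import numpy as np

    rng = np.random.default_rng(2)

    def metropolis_step(x, log_a, rng):
        # Propose a random permutation of x; accept with probability
        # min{1, prod_j a_j^(x_j - x'_j)} from the conditional null model.
        x_new = rng.permutation(x)
        if np.log(rng.uniform()) < np.sum((x - x_new) * log_a):
            return x_new
        return x

    def chain(x, log_a, n_steps, rng):
        for _ in range(n_steps):
            x = metropolis_step(x, log_a, rng)
        return x

    def conditional_p_value(y, z, a, N=100, R=99, rng=rng):
        # Parallel method: N steps "back" from the data to an initial state
        # (the reverse chain has the same kernel, by symmetry), then R
        # independent forward runs of N steps each.
        y, z, log_a = np.asarray(y), np.asarray(z), np.log(np.asarray(a))
        t = np.sum(y * np.log(z))
        x0 = chain(y, log_a, N, rng)
        t_star = np.array([np.sum(chain(x0, log_a, N, rng) * np.log(z))
                           for _ in range(R)])
        return (1 + np.sum(t_star <= t)) / (R + 1)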


The use of R conditionally independent realizations of the Markov chain is sometimes referred to as the parallel method. In contrast is the series method, where only one realization is used. Since the successive states of the chain are dependent, a randomization device is needed to induce exchangeability. For details see Problem 4.2.

4.2.3 Parametric bootstrap tests

In many problems of course the distribution of T under H_0 will depend upon nuisance parameters which cannot be conditioned away, so that the Monte Carlo test method does not apply exactly. Then the natural approach is to fit


the null model F̂_0 and use (4.5) to compute the P-value, i.e. p = Pr(T ≥ t | F̂_0). For example, for the parametric model where we are testing H_0 : ψ = ψ_0 with λ a nuisance parameter, F̂_0 would be the CDF of f(y | ψ_0, λ̂_0), with λ̂_0 the maximum likelihood estimator (MLE) of the nuisance parameter when ψ is fixed equal to ψ_0. Calculation of the P-value by (4.5) is referred to as a bootstrap test.

If (4.5) cannot be computed exactly, or if there is no satisfactory approximation (normal or otherwise), then we proceed by simulation. That is, R independent replicate samples y*_1, ..., y*_n are drawn from F̂_0, and for the rth such sample the test statistic value t*_r is calculated. Then the significance probability (4.5) will be approximated by

p_boot = {1 + #(t*_r ≥ t)} / (R + 1).    (4.13)

Ordinarily one would use a simple proportion here, but we have chosen to make the definition match that for the Monte Carlo test in (4.11).

Example 4.5 (Separate families test) Suppose that we wish to choose between the alternative model forms f_0(y | η) and f_1(y | ζ) for the PDF of the random sample y_1, ..., y_n. In some circumstances it may make sense to take one model, say f_0, as a null hypothesis, and to test this against the other model as alternative hypothesis. In the notation of Section 4.1, the nuisance parameter is λ = (η, ζ) and ψ is the binary indicator of model, with null value ψ_0 = 0 and alternative value ψ_A = 1. The likelihood ratio statistic (4.7) is equivalent to the more convenient form

T = n^{−1} log{L_1(ζ̂)/L_0(η̂)} = n^{−1} Σ_{j=1}^{n} log{f_1(y_j | ζ̂) / f_0(y_j | η̂)},    (4.14)

where η̂ and ζ̂ are the MLEs and L_0 and L_1 the likelihoods under f_0 and f_1 respectively. If the two families are strictly separate, then the chi-squared approximation (4.8) does not apply. There is a normal approximation for the



null distribution of T, but this is often quite unreliable except for very large n. The parametric bootstrap provides a more reliable and simple option.

The parametric bootstrap works as follows. We generate R samples of size n by random sampling from the fitted null model f_0(y | η̂). For each sample we calculate estimates η̂* and ζ̂* by maximizing the simulated log likelihoods

ℓ*_0(η) = Σ_j log f_0(y*_j | η),   ℓ*_1(ζ) = Σ_j log f_1(y*_j | ζ),

and compute the simulated log likelihood ratio statistic t* = n^{−1}{ℓ*_1(ζ̂*) − ℓ*_0(η̂*)}, as in (4.14).

Then we calculate p using (4.13).

As a particular illustration, consider the failure-time data in Table 1.2. Two

plausible models for this type of data are gamma and lognormal, that is

f_0(y | η) = κ(κy/μ)^{κ−1} exp(−κy/μ) / {μ Γ(κ)},   f_1(y | ζ) = (2πβ²)^{−1/2} y^{−1} exp{−(log y − α)² / (2β²)},   y > 0.

For these data the MLEs of the gamma mean and index are μ̂ = ȳ = 108.083 and κ̂ = 0.707, the latter being the solution to

log(κ) − h(κ) = log(ȳ) − \overline{log y}

with h(κ) = d log Γ(κ)/dκ, the digamma function. The MLEs of the mean and variance of the normal distribution for log Y are α̂ = \overline{log y} = 3.829 and β̂² = (n − 1)s²_{log y}/n = 2.339, where \overline{log y} and s²_{log y} are the average and sample variance of the log y_j. The test statistic (4.14) is

t = −κ̂ log(κ̂/ȳ) − κ̂ α̂ + κ̂ + log Γ(κ̂) − ½ log(2πβ̂²) − ½,

whose value for the data is t = −0.465. The left panel of Figure 4.2 shows a histogram of R = 999 values of t* under sampling from the fitted gamma model: of these, 619 are greater than t, and so p = 0.62.

Note that the histogram has a fairly non-normal shape in this case, suggesting that a normal approximation will not be very accurate. This is true also for the (rather complicated) studentized version Z of T: the right panel of Figure 4.2 shows the normal plot of bootstrap values z*. The observed value of z is 0.4954, for which the bootstrap P-value is 0.34, somewhat smaller than that computed for t, but not changing the conclusion that there is no evidence to change from a gamma to a lognormal model for these data. There are good general reasons to studentize test statistics; see Section 4.4.1.

It should perhaps be mentioned that significance tests of this kind are not always helpful in distinguishing between models, in the sense that we could find evidence against either both or neither of them. This is especially true with small samples such as we have here. In this case the reverse test shows no evidence against the lognormal model. ■
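A minimal sketch of this parametric bootstrap test in Python follows (our own illustration using scipy, assuming the gamma model is fitted with the location fixed at zero); it fits both models, computes the statistic (4.14), and resamples from the fitted gamma null model.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    def lik_ratio(y):
        # Statistic (4.14): mean log likelihood ratio of the fitted
        # lognormal (f_1) to the fitted gamma (f_0) model.
        kappa, _, scale = stats.gamma.fit(y, floc=0)          # gamma MLEs
        alpha, beta2 = np.mean(np.log(y)), np.var(np.log(y))  # lognormal MLEs
        l0 = stats.gamma.logpdf(y, kappa, scale=scale).sum()
        l1 = (stats.norm.logpdf(np.log(y), alpha, np.sqrt(beta2)) - np.log(y)).sum()
        return (l1 - l0) / len(y)

    def separate_families_test(y, R=999, rng=rng):
        kappa, _, scale = stats.gamma.fit(y, floc=0)
        t = lik_ratio(y)
        t_star = np.array([lik_ratio(stats.gamma.rvs(kappa, scale=scale,
                                                     size=len(y), random_state=rng))
                           for _ in range(R)])
        return t, (1 + np.sum(t_star >= t)) / (R + 1)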


4.2.4 Graphical tests

Graphical methods are popular in model checking: examples include normal and half-normal plots of residuals in regression, plots of Cook distance in regression, plots of nonparametric hazard function estimates, and plots of intensity functions in spatial analysis (Section 8.3). In many cases the nominal shape of the plot is a straight line, which aids the detection of deviation from a null model. Whatever the situation, informed interpretation of the plot requires some notion of its probable variation under the model being checked, unless the sample size is so large that deviation is obvious (cf. the plot of resampling results in Figure 4.2). The simplest and most common approach is to superimpose a "probable envelope", to which the original data plot is compared. This probable envelope is obtained by Monte Carlo or parametric resampling methods.

Graphical tests are not usually appropriate when a single specific alternative model is of interest. Rather they are used to suggest alternative models, depending upon the manner in which such a plot deviates from its null expected behaviour, or to find suspect data. (Indeed graphical tests are not tests in the usual sense, because there is usually no simple notion of "rejectable" behaviour: we comment more fully on this below.)

Suppose that the graph plots T(a) versus a for a ∈ 𝒜, a bounded set. The observed plot is {t(a) : a ∈ 𝒜}. For example, in a normal plot 𝒜 is a set of normal quantiles and the values of t(a) are the ordered values of a sample, possibly studentized. The idea of the plot is to compare t(a) with the probable behaviour of T(a) for all a ∈ 𝒜 when H_0 is true.

Figure 4.2 Null hypothesis resampling for failure data. Left panel shows histogram of t* under gamma sampling. Right panel shows normal plot of z*; R = 999 and gamma parameters μ̂ = 108.0833, κ̂ = 0.7065; dotted line is theoretical N(0,1) approximation.

Example 4.6 (Normal plot) Consider the data in Table 3.1, and suppose in


Figure 4.3 Normal plot of n = 13 studentized values for final sample in Table 3.1. [Figure: ordered studentized values against quantiles of standard normal, with dotted reference line.]

particular that we want to assess whether or not the last sample of n = 13 measurements can be assumed normal. A normal plot of the data is shown in Figure 4.3, which plots the ordered studentized values z_(i) = (y_(i) − ȳ)/s against the quantiles a_i = Φ^{−1}{i/(n + 1)} of the N(0,1) distribution. In the general notation, 𝒜 is the set of normal quantiles, and t(a_i) = z_(i). The dotted line is the expected pattern, approximately, and the question is whether or not the points deviate sufficiently from this to suggest that the sample is non-normal. ■

Assume for the moment that the null hypothesis joint distribution of {T(a) : a ∈ 𝒜} involves no unknown nuisance parameters. This is true for a normal plot if we use studentized sample values z_i, as in the previous example. Then for any fixed a we can subject t(a) to a Monte Carlo test. For each of R independent sets of data y*_1, ..., y*_n, which are obtained by sampling from the null model, we compute the simulated plot

t*(a),   a ∈ 𝒜.

Under the null hypothesis, T(a), T*_1(a), ..., T*_R(a) are independent and identically distributed for any fixed a, so that (4.9) applies with T = T(a). That is,

Pr(T(a) ≤ T*_(k)(a) | H_0) = k / (R + 1).    (4.15)

This leads to (4.11) as the one-sided P-value at the given value of a, if large values of t(a) are evidence against the null model. There are obvious


modifications if we want to test for small values of t(a), or if we want a two-sided test.

The test as described applies for any single value of a. However, the graphical test does not look just at one fixed a, but rather at all a ∈ 𝒜 simultaneously. In principle the Monte Carlo test could be applied at all values of a ∈ 𝒜, but this would be time-consuming and difficult to interpret. To simplify matters, at each value of a we compute lower and upper critical values corresponding to fixed one-sided levels p, and plot these critical values against a to provide critical curves against which to compare the whole data plot {t(a), a ∈ 𝒜}.

So the method is to choose integers R and k so that k/(R + 1) = p, the desired one-sided test level, and then compute the critical values

t*_(k)(a),   t*_(R+1−k)(a)

from the R simulated plots. If t(a) exceeds the upper value, or falls below the lower value, then the corresponding one-sided P-value is at most p; the two-sided test which rejects H_0 if t(a) falls outside the interval [t*_(k)(a), t*_(R+1−k)(a)] has level equal to 2p. The set of all upper and lower critical values defines the test envelope

𝒮^{1−2p} = {[t*_(k)(a), t*_(R+1−k)(a)] : a ∈ 𝒜}.    (4.16)

Excursions of t(a) outside 𝒮^{1−2p} are regarded as evidence against H_0, and this simultaneous comparison across all values of a is what is usually meant by the graphical test.

Example 4.7 (Normal plot, continued) For the normal plot of the previous example, suppose we set p = 0.05. The smallest simulation size that works is R = 19, and then we take k = 1 in (4.16). The test envelope will therefore be the lines connecting the maxima and the minima. Because we are plotting studentized sample values, which eliminates mean and variance parameters, the simulation can be done with the N(0,1) distribution. Each simulated sample y*_1, ..., y*_13 is studentized to give z*_i = (y*_i − ȳ*)/s*, i = 1, ..., 13, whose ordered values are then plotted against the same normal quantiles a_i = Φ^{−1}{i/(n + 1)}. The left panel of Figure 4.4 shows a set of R = 19 normal plots (plotted as connecting dashed lines) and their envelope (solid curves) for studentized values of simulated samples of n = 13 N(0,1) data. The right panel shows the envelope of these plots together with the original data plot. Note that one of the inner points falls just outside the envelope: this might be taken as mild evidence against normality of the data, but such an interpretation may be premature, in light of the discussion below. ■
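A minimal sketch of such an envelope computation in Python (our own illustration, not code from the book):

    import numpy as np

    rng = np.random.default_rng(4)

    def studentized_order_stats(y):
        return np.sort((y - y.mean()) / y.std(ddof=1))

    def normal_plot_envelope(n, R=19, rng=rng):
        # R simulated studentized normal plots; with k = 1 the envelope is
        # the pointwise minimum and maximum, giving one-sided level 1/(R+1).
        sims = np.array([studentized_order_stats(rng.standard_normal(n))
                         for _ in range(R)])
        return sims.min(axis=0), sims.max(axis=0)

    # The data plot compares studentized_order_stats(y) with this envelope
    # at the plotting positions a_i = Phi^{-1}{i/(n+1)}.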

The discussion so far assumes either that the null model involves no unknown parameters, or that it is possible to eliminate unknown parameters by standardization, as in the previous example. In the latter case simulated


Figure 4.4 Graphical test of normality. Left panel: normal plots (dashed lines) of studentized values for R = 19 samples of n = 13 simulated from the N(0, 1) distribution, together with their envelope (solid line). Right panel: envelope of the simulated plots superimposed on the original data plot.


samples can be generated from any null model F_0. When unknown model parameters cannot be eliminated, we would simulate from F̂_0: then (4.15) will be approximately true provided n is not too small.

There are two aspects of the graphical test which need careful thought, namely the choice of R and the interpretation of the resulting plot. It seems clear from earlier discussion that for p = 0.05, say, R = 19 is too small: the test envelope is too random. R = 99 would seem to be a more sensible choice, provided this is not computationally difficult. But we should consider how formal is to be the interpretation of the graph. As it stands the notional one-sided significance levels p hold pointwise, and certainly the chance that the envelope captures an entire plot will be far less than 1 − 2p. So it would not make sense to infer evidence against the null model if one arbitrarily placed point falls outside the envelope, as happened in Example 4.7. In fact in that example the chance is about 0.5 that some point will fall outside the simulation envelope, in contrast to the pointwise chance 0.1.

For some purposes it will be useful to know the overall error rate, i.e. the chance of a point falling outside the envelope, or even to control this rate. While this is difficult to do exactly, there is a simple empirical approach which works satisfactorily. Given the R simulated plots which were used to calculate the test envelope, we can simulate the graphical test by comparing {t*_r(a), a ∈ 𝒜} to the envelope 𝒮^{1−2p}_{−r} that is obtained from the other R − 1 simulated plots. If we repeat this simulated test for r = 1, ..., R, then we obtain a resample estimate of the overall two-sided error rate

#{r : {t*_r(a), a ∈ 𝒜} exits 𝒮^{1−2p}_{−r}} / R.    (4.17)


This is easy to calculate, since {t*_r(a), a ∈ 𝒜} exits 𝒮^{1−2p}_{−r} if and only if

rank{t*_r(a)} ≤ k   or   rank{t*_r(a)} ≥ R + 1 − k

for at least one value of a, where as before k = p(R + 1). Thus if the R plots are represented by an R × N array, we first compute columnwise ranks. Then we calculate the proportion of rows in which either the minimum rank is less than or equal to k, or the maximum rank is greater than or equal to R + 1 − k, or both. The corresponding one-sided error rates are estimated in the obvious way.
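A short Python sketch of this rank computation (our own illustration):

    import numpy as np

    def overall_error_rate(tstar, k):
        # tstar is an R x N array with tstar[r, j] = t*_r(a_j); a row "exits"
        # the envelope of the other plots iff its rank in some column is
        # <= k or >= R + 1 - k.
        R, N = tstar.shape
        ranks = tstar.argsort(axis=0).argsort(axis=0) + 1  # columnwise ranks 1..R
        exits = (ranks.min(axis=1) <= k) | (ranks.max(axis=1) >= R + 1 - k)
        return exits.mean()   # empirical two-sided error rate (4.17)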

Example 4.8 (Normal plot, continued) For the normal plot of Example 4.6, an overall two-sided error rate of approximately 0.1 requires R = 199. Figure 4.5 shows a graphical test plot for R = 199 with outer envelope corresponding to overall two-sided error rate 0.1 and inner envelope corresponding to pointwise two-sided error rate 0.1; the empirical error rate (4.17) for the outer envelope is 0.10. ■

In practice one might rather be looking for trends, manifested by sequences of points going outside the test envelope. Alternatively one might be focusing attention on particular regions of the plot, such as the tails of a probability plot. Because such plots may be used to detect several possible deviations from a hypothetical model, and hence be interpreted in several possible ways, it is not possible to make a single recommendation that will induce a controlled error rate. In the absence of a single criterion by which the plot is to be judged, it seems wise to plot envelopes corresponding to both pointwise one-sided error rate p and simultaneous one-sided error rate p, say with p = 0.05. This is relatively easy to do using (4.17). For a further illustration see Example 8.9.

Figure 4.5 Normal plot of n = 13 studentized values for final sample in Table 3.1, together with simultaneous (solid lines) and pointwise (dashed lines) two-sided 0.10 test envelopes; R = 199.


4.2.5 Choice of R

In any simulation-based test, relatively few samples could be used if it quickly became clear that p was so large as not to be regarded as evidence against H_0. For example, if the event t* ≥ t occurred 50 times in the first 100 samples, then it is reasonably certain that p will exceed 0.25, say, for much larger R, so there is little point in simulating further. On the other hand, if we observed t* ≥ t only five times, then it would be worth sampling further to determine the level of significance more accurately.

One effect of not computing p exactly is to weaken the power of the test, essentially because the critical region of a fixed-level test has been randomly displaced. The effect can be quantified approximately as follows. Consider testing at level α, which is to say reject H_0 if p ≤ α. If the integer k is chosen equal to (R + 1)α, then the test rejects H_0 when t*_(R+1−k) < t. For the alternative hypothesis H_A, the power of the test is

π_R(α, H_A) = Pr(reject H_0 | H_A) = Pr(T*_(R+1−k) < T | H_A).

To evaluate this probability, suppose for simplicity that T has a continuous distribution, with PDF g_0(t) and CDF G_0(t) under H_0, and density g_A(t) under H_A. Then from the standard result for the PDF of an order statistic we have

π_R(α, H_A) = ∫∫_{x<t} R \binom{R−1}{k−1} G_0(x)^{R−k} g_0(x) {1 − G_0(x)}^{k−1} g_A(t) dx dt.

After a change of variable and some rearrangement of the integral, this becomes

π_R(α, H_A) = ∫_0^1 π_∞(u, H_A) h_R(u; α) du,    (4.18)

where π_∞(u, H_A) is the power of the test using the exact P-value, and h_R(u; α) is the beta density on [0,1] with indices (R + 1)α and (R + 1)(1 − α).

The next part of the calculation relies on π_∞(α, H_A) being a concave function of α, as is usually the case. Then a lower bound for π_∞(u, H_A) is π̃_∞(u, H_A),

which equals u π_∞(α, H_A)/α for u ≤ α and π_∞(α, H_A) for u > α. It follows by applying (4.18) to π̃_∞(u, H_A), and some manipulation, that

π_∞(α, H_A) − π_R(α, H_A) ≤ {π_∞(α, H_A)/(2α)} ∫_0^1 |u − α| h_R(u; α) du
= π_∞(α, H_A) α^{(R+1)α} (1 − α)^{(R+1)(1−α)} Γ(R + 1) / {(R + 1)α Γ((R + 1)α) Γ((R + 1)(1 − α))}.

We apply Stirling's approximation Γ(x) ≈ (2π)^{1/2} x^{x−1/2} exp(−x) for large x to the right-hand side and obtain the approximate bound

π_R(α, H_A) / π_∞(α, H_A) ≥ 1 − {(1 − α) / (2π(R + 1)α)}^{1/2}.


The following table gives some numerical values of this approximate bound.

simulation size R              19    39    99    199   499   999   9999
power ratio for α = 0.05       0.61  0.73  0.83  0.88  0.92  0.95  0.98
power ratio for α = 0.01       —     —     0.60  0.72  0.82  0.87  0.96

These values suggest that the loss of power with R = 99 is not serious for α ≥ 0.05, and that R = 999 should generally be safe. In fact the values can be quite conservative. For example, for testing a normal mean the power ratios for α = 0.05 are usually above 0.85 and 0.97 for R = 19 and R = 99 respectively.
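The table entries can be reproduced from the approximate bound above; a short check in Python (our own illustration):

    import numpy as np

    # Approximate lower bound on the power ratio pi_R / pi_inf.
    ratio = lambda R, alpha: 1 - np.sqrt((1 - alpha) / (2 * np.pi * (R + 1) * alpha))

    for R in (19, 39, 99, 199, 499, 999, 9999):
        print(R, round(ratio(R, 0.05), 2))   # 0.61, 0.73, 0.83, 0.88, 0.92, 0.95, 0.98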

4.3 Nonparametric Permutation Tests

In many practical situations it is useful to have available statistical methods which do not depend upon specific parametric models, if only in order to provide backup to results of parametric methods. So, with significance testing, it is useful to have nonparametric tests such as the sign test and the signed-rank test for analysing paired data, either to confirm the results of applying the parametric paired t test, or to deal with evident non-normality of the paired differences.

Nonparametric tests in general compute significance without assuming forms for the data distributions. The choice of test statistic will usually be based firmly on the physical context of the problem, possibly reinforced by what we know would be a good choice if a plausible parametric model were applicable. So, in a comparison of two treatments where we believe that treatment effects are additive, it would be reasonable to choose as test statistic the difference of means, especially if we thought that the data distributions were not far from normal; for long-tailed data distributions the difference of medians would be more reasonable from a statistical point of view. If we are concerned about the nonrobustness of means, then we might first convert data values to relative ranks and then use an appropriate rank test.

There is a vast literature on various kinds of nonparametric tests, such as rank tests, U-statistic tests, and distance tests which compare EDFs in various ways. We shall not attempt to review these here. Rather our concern in this chapter is with resampling tests, and the simplest form of nonparametric resampling test is the permutation test.

Essentially a permutation test is a comparative test, where the test statistic involves some sort of comparison between EDFs. The special feature of the permutation test is that the null hypothesis implies a reduction of the nonparametric MLE of the data distributions to EDFs which play the role of the sufficient statistic S in equation (4.4). The conditional probability distribution


Figure 4.6 Scatter plot of n = 37 pairs of measurements in a study of handedness (provided by Dr Gordon Claridge, University of Oxford).


used in (4.4) is then a uniform distribution over a set of permutations of the data structure. The following example illustrates this.

Example 4.9 (Correlation test) Suppose that Y = (U, X) is a random pair and that n such pairs are observed. The objective is to see if U and X are independent, this being the null hypothesis H_0. An illustrative dataset is plotted in Figure 4.6, where u = dnan is a genetic measure and x = hand is an integer measure of left-handedness. The alternative hypothesis is that x tends to be larger when u is larger. These data are clearly non-normal.

One simple test statistic is the sample correlation, T = ρ(F̂) say. Note that here the EDF F̂ puts probabilities n^{−1} on each of the n data pairs (u_i, x_i). The correlation is zero for any distribution that satisfies H_0. The correlation coefficient for the data in Figure 4.6 is 0.509.

When the form of F is unspecified, F̂ is minimal sufficient for F. Under the null hypothesis, however, the minimal sufficient statistic comprises the ordered u's and ordered x's, s = (u_(1), ..., u_(n), x_(1), ..., x_(n)), equivalent to the two marginal EDFs. So here a conditional test can be applied, with (4.4) defining the P-value, which will therefore be independent of the underlying marginal distributions of U and X. Now when S is constrained to equal s, the random sample (U_1, X_1), ..., (U_n, X_n) is equivalent to (u_(1), X*_1), ..., (u_(n), X*_n) with (X*_1, ..., X*_n) a random permutation of x_(1), ..., x_(n). Further, when H_0 is true all such permutations are equally likely, and there are n! of them. Therefore the one-sided P-value is

p = #{permutations such that t* ≥ t} / n!.

In evaluating p, we can use the fact that all marginal sample moments


are constant across permutations. This implies that T* ≥ t is equivalent to

Σ_j x*_j u_j ≥ Σ_j x_j u_j. ■

As a practical matter, it is rarely possible or necessary to compute the permutation P-value exactly. Typically a very large number of permutations is involved, for example more than 3 million in Example 4.9 even when n = 10. In special cases involving linear statistics there will be theoretical approximations, such as normal approximations or improved versions of these: see Section 9.5. But for general use the most reliable approach is to make use of the Monte Carlo method of Section 4.2.1. That is, we take a large number R of random permutations, calculate the corresponding values t*_1, ..., t*_R of T, and approximate p by

p_mc = {1 + #(t*_r ≥ t)} / (R + 1).

At least 99 and at most 999 random permutations should suffice.

Example 4.10 (Correlation test, ctd) For the dataset shown in Figure 4.6, the test of Example 4.9 was implemented by simulation, that is, generating random permutations of the x-values, with R = 999. Figure 4.7 is a histogram of the correlation values. The unshaded part corresponds to the 4 t* values which are greater than the observed correlation t = 0.509: the P-value is p = (1 + 4)/(1 + 999) = 0.005. ■
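A minimal sketch of this Monte Carlo permutation test in Python (our own illustration):

    import numpy as np

    rng = np.random.default_rng(5)

    def perm_corr_test(u, x, R=999, rng=rng):
        # One-sided permutation test of independence against positive
        # association, using the sample correlation as test statistic.
        t = np.corrcoef(u, x)[0, 1]
        t_star = np.array([np.corrcoef(u, rng.permutation(x))[0, 1]
                           for _ in range(R)])
        return t, (1 + np.sum(t_star >= t)) / (R + 1)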

Figure 4.7 Histogram of correlation t* values for R = 999 random permutations of data in Figure 4.6.

One feature of permutation tests is that any test statistic is as easy to use as any other, at least in principle. So in the previous example it is just as easy to use the rank correlation (in which the u's and x's are replaced by their relative


ranks), a robust measure of correlation, or a complicated measure of distance between the bivariate EDF F̂ and its null hypothesis version F̂_0 which is the product of the EDFs of u and x. All that is required is that we be able to compute the test statistic for all permutations of the x's.

In the previous example the null hypothesis of independence led unambiguously to a sufficient statistic s and a permutation distribution. More generally the explicit null hypothesis may not be strong enough to do this, unless it can be taken to imply a stronger hypothesis. This depends upon the practical context, as we see in the following example.

Example 4.11 (Comparison of two means) Suppose that we want to compare the means of two populations, given random samples from each which are denoted by (y_{11}, ..., y_{1n_1}) and (y_{21}, ..., y_{2n_2}). The explicit null hypothesis is H_0 : μ_1 = μ_2, where μ_1 and μ_2 are the means for the respective populations. Now H_0 alone does not reduce the sufficient statistics from the two sets of ordered sample values. However, suppose we believe that the CDFs F_1 and F_2 have either of the special forms

F_1(y) = G(y − μ_1),   F_2(y) = G(y − μ_2)

or

F_1(y) = G(y/μ_1),   F_2(y) = G(y/μ_2),

for some unknown G. Then the null hypothesis implies a common CDF F for the two populations. In this case, the null hypothesis sufficient statistic s is the set of order statistics for the pooled sample

u_1 = y_{11}, ..., u_{n_1} = y_{1n_1}, u_{n_1+1} = y_{21}, ..., u_{n_1+n_2} = y_{2n_2},

that is s = (u_(1), ..., u_(n_1+n_2)).

Situations where the special forms for F_1 and F_2 apply would include

comparisons of two treatments which were both applied to a random selection of units from a common pool. The special forms would not necessarily apply to sets of physical measurements taken under different experimental conditions or using different apparatus, since then the samples could have unequal variability even though H_0 were true.

Suppose that we test H_0 by comparing the sample means using test statistic t = ȳ_2 − ȳ_1, and suppose that the one-sided alternative H_A : μ_2 > μ_1 is appropriate. If we assume that H_0 implies a common distribution for the Y_{1j} and Y_{2j}, then the exact significance probability is given by (4.4), i.e.

p = Pr(T ≥ t | S = s, H_0).

Now when S is constrained to equal s, the concatenation of the two random samples (Y_{11}, ..., Y_{1n_1}, Y_{21}, ..., Y_{2n_2}) must form a permutation of s. The first


n_1 components of a permutation will give the first sample and the last n_2 components will give the second sample. Further, when H_0 is true all such permutations are equally likely, and there are \binom{n_1+n_2}{n_1} of them. Therefore

p = #{permutations such that T* ≥ t} / \binom{n_1+n_2}{n_1}.    (4.21)

As in the previous example, this exact probability would usually be approximated by taking R random permutations of the type described, and applying (4.11). ■

A somewhat more complicated two-sample test problem is provided by the following example.

Example 4.12 (AML data) Figure 3.3 shows the product-limit estimates of the survivor function for times to remission of two groups of patients with acute myelogenous leukaemia (AML), with one of the groups receiving maintenance chemotherapy. Does this treatment make a difference to survival?

A common test for comparison of estimated survivor functions is based on the log-rank statistic, which compares the actual number of failures in group 1 with its expected value at each time a failure is observed, under the null hypothesis that the survival distributions of the two groups are equal. To be more explicit, suppose that we pool the two groups and obtain ordered failure times y_1 < ··· < y_m, with m < n if there is censoring. Let f_{1j} and r_{1j} be the number of failures and the number at risk of failure in group 1 at time y_j, and similarly for group 2. Then the log-rank statistic is

T = Σ_{j=1}^{m} (f_{1j} − m_{1j}) / {Σ_{j=1}^{m} v_{1j}}^{1/2},

where

m_{1j} = (f_{1j} + f_{2j}) r_{1j} / (r_{1j} + r_{2j}),   v_{1j} = (f_{1j} + f_{2j}) r_{1j} r_{2j} (r_{1j} + r_{2j} − f_{1j} − f_{2j}) / {(r_{1j} + r_{2j})² (r_{1j} + r_{2j} − 1)}

are the conditional mean and variance of the number in group 1 to fail at time y_j, given the values of f_{1j} + f_{2j}, r_{1j} and r_{2j}. For the AML data t = 1.84. Is this evidence that chemotherapy lengthens survival times?

For a suitable null distribution we simply treat the observations in the rows of Table 3.4 as a single group and permute them, effectively randomly allocating group labels to the observations. For each of R permutations, we recalculate t, obtaining t*_1, ..., t*_R. Figure 4.8 shows the t*_r plotted against order statistics from the N(0,1) distribution, which is the asymptotic null distribution of T. The asymptotic P-value is 0.033, in reasonable agreement with the P-value 26/(999 + 1) = 0.026 from the permutation test. ■
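A minimal sketch of this log-rank permutation test in Python (our own illustration; time, event and group are arrays encoding the censored survival data):

    import numpy as np

    rng = np.random.default_rng(6)

    def log_rank(time, event, group):
        # Standardized log-rank statistic comparing group 1 with group 2.
        num, var = 0.0, 0.0
        for tj in np.unique(time[event == 1]):
            at_risk = time >= tj
            r1 = np.sum(at_risk & (group == 1))
            r2 = np.sum(at_risk & (group == 2))
            fail = (time == tj) & (event == 1)
            f1, f2 = np.sum(fail & (group == 1)), np.sum(fail & (group == 2))
            r, f = r1 + r2, f1 + f2
            num += f1 - f * r1 / r
            if r > 1:
                var += f * r1 * r2 * (r - f) / (r**2 * (r - 1))
        return num / np.sqrt(var)

    def perm_log_rank(time, event, group, R=999, rng=rng):
        t = log_rank(time, event, group)
        t_star = np.array([log_rank(time, event, rng.permutation(group))
                           for _ in range(R)])
        return t, (1 + np.sum(t_star >= t)) / (R + 1)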


Figure 4.8 Results of a Monte Carlo permutation test for differences between the survivor functions for the two groups of AML data, R = 499. The dashed horizontal line shows the observed value of the statistic, and values of t* that exceed it are hollow. The dotted line is the line x = y.


4.4 Nonparametric Bootstrap Tests

The permutation tests described in the previous section are special nonparametric resampling tests, in which resampling is done without replacement. In this section we discuss the direct application of nonparametric resampling methods, as introduced in Chapters 2 and 3. For tightly structured problems such as those in the previous section, this means resampling with replacement rather than without, which makes little difference. But bootstrap tests apply to a much wider class of testing problems.

The special nature of significance tests requires that probability calculations be done under a null hypothesis model. In this way the bootstrap calculations must differ from those in earlier chapters. For example, where in Chapter 2 we introduced the idea of resampling from the EDF F̂, now we must resample from a distribution F̂_0, say, which satisfies the relevant null hypothesis H_0. This has been illustrated already for parametric bootstrap tests in Section 4.2.


Once the null resampling distribution F̂_0 is decided, the basic bootstrap test will be to compute the P-value as

p_boot = Pr*(T* ≥ t | F̂_0),

or to approximate this by

p = {1 + #(t*_r ≥ t)} / (R + 1)

using the results t*_1, ..., t*_R from R bootstrap samples.


Figure 4.9 Histogram of test statistic values t* = ȳ*_2 − ȳ*_1 from R = 999 resamples of the two samples in Example 4.13. The data value of the test statistic is t = 2.84.


Example 4.13 (Comparison of two means, continued) Consider the last two series of measurements in Example 3.1, which are reproduced here labelled samples 1 and 2:

sample 1:  82 79 81 79 77 79 79 78 79 82 76 73 64
sample 2:  84 86 85 82 77 76 77 80 83 81 78 78 78

Suppose that we want to compare the corresponding population means, μ_1 and μ_2, say with test statistic t = ȳ_2 − ȳ_1. If, as seems plausible, the shapes of the underlying distributions are identical, then under H_0 : μ_2 = μ_1 the two distributions are the same. It would then be sensible to choose for F̂_0 the pooled EDF of the two samples. The resampling test will be the same as the permutation test of Example 4.11, except that random permutations will be replaced by random samples of size n_1 + n_2 = 26 drawn with replacement from the pooled data.

Figure 4.9 shows the results from applying this procedure to our two samples with R = 999. The unshaded area of the histogram corresponds to the 48 values of t* larger than the observed value t = 80.38 − 77.54 = 2.84. The one-sided P-value for alternative H_A : μ_2 > μ_1 is p = (48 + 1)/(999 + 1) = 0.049. Application of the permutation test gave the same result.
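A minimal sketch of this pooled-EDF bootstrap test in Python (our own illustration):

    import numpy as np

    rng = np.random.default_rng(7)

    def pooled_boot_test(y1, y2, R=999, rng=rng):
        # Resample n1 + n2 values with replacement from the pooled EDF and
        # split them into two pseudo-samples, mimicking the permutation test.
        n1 = len(y1)
        pooled = np.concatenate([y1, y2])
        t = np.mean(y2) - np.mean(y1)
        t_star = np.empty(R)
        for r in range(R):
            y_star = rng.choice(pooled, size=len(pooled), replace=True)
            t_star[r] = y_star[n1:].mean() - y_star[:n1].mean()
        return t, (1 + np.sum(t_star >= t)) / (R + 1)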

It is worth stressing again that because the resampling method is wholly computational, any sensible test statistic is as easy to use as any other. So here, if outliers were present, it would be just as easy, and perhaps more sensible, to choose t to be the difference of trimmed means.



The question is: do we gain or lose anything by assuming that the two distributions have the same shape? ■

The particular null fitted model used in the previous example was suggested in part by the permutation test, and is clearly not the only possibility. Indeed, a more reasonable null model in the context would be one which allowed different variances for the two populations sampled: an analogous model is used in Example 4.14 below. So in general there can be many candidates for the null model in the nonparametric case, each corresponding to different restrictions imposed in addition to H_0. One must judge which is most appropriate on the basis of what makes sense in the practical context.

Semiparametric null models

If data are described by a semiparametric model, so that some features of the underlying distributions are described by parameters, then it may be relatively easy to specify a null model. The following example illustrates this.

Example 4.14 (Comparison of several means) For the gravity data in Example 3.2, one point that we might check before proceeding with an aggregate estimation is that the underlying means for all eight series are in fact the same. One plausible model for the data, as mentioned in Section 3.2, is

y_{ij} = μ_i + σ_i ε_{ij},   j = 1, ..., n_i,   i = 1, ..., 8,

where the ε_{ij} come from a single distribution G. The null hypothesis to be tested is H_0 : μ_1 = ··· = μ_8, with general alternative. For this an appropriate test statistic is given by

t = Σ_{i=1}^{8} w_i (ȳ_i − μ̂_0)²,   w_i = n_i / s_i²

(where ȳ_i and s_i² are the average and sample variance for the ith series),

with μ̂_0 = Σ w_i ȳ_i / Σ w_i the null estimate of the common mean. The null distribution of T would be approximately χ²_7 were it not for the effect of small sample sizes. So a bootstrap approach is sensible.

The null model fit includes μ̂_0 and the estimated variances

σ̂²_{i0} = (n_i − 1)s_i²/n_i + (ȳ_i − μ̂_0)².

The null model studentized residuals

e_{ij} = (y_{ij} − μ̂_0) / {σ̂²_{i0} − (Σ_i w_i)^{−1}}^{1/2},

when plotted against normal quantiles, suggest mild non-normality. So, to be safe, we apply a nonparametric bootstrap. Datasets are simulated under the null model

y*_{ij} = μ̂_0 + σ̂_{i0} ε*_{ij},


i    ȳ_i    s_i²    σ̂²_{i0}   w_i

1    66.4   370.6   474.4     0.022
2    89.9   233.9   339.9     0.047
3    77.3   248.3   222.3     0.036
4    81.4    68.8    67.8     0.116
5    75.3    13.4    23.1     0.599
6    78.9    34.1    31.1     0.323
7    77.5    22.4    21.9     0.579
8    80.4    11.3    13.5     1.155


with the ε*_{ij} randomly sampled from the pooled residuals {e_{ij}, i = 1, ..., 8, j = 1, ..., n_i}. For each such simulated dataset we calculate sample averages and variances, then weights, the pooled mean, and finally t*.

Table 4.3 contains a summary of the null model fit, from which we calculate μ̂_0 = 78.6 and t = 21.275.

A set of R = 999 bootstrap samples gave the histogram of t* values in the left panel of Figure 4.10. Only 29 values exceed t = 21.275, so p = 0.030. The right panel of the figure plots ordered t* values against quantiles of the χ²_7 approximation, which is off by a factor of about 1.24 and gives the distorted P-value 0.0034. A normal-error parametric bootstrap gives results very similar to the nonparametric bootstrap. ■
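A minimal sketch of this semiparametric null resampling in Python (our own illustration; samples is a list of eight arrays):

    import numpy as np

    rng = np.random.default_rng(8)

    def q_stat(samples):
        # Weighted squared deviations of the series means from the pooled mean.
        ybar = np.array([np.mean(y) for y in samples])
        w = np.array([len(y) / np.var(y, ddof=1) for y in samples])
        mu0 = np.sum(w * ybar) / np.sum(w)
        return np.sum(w * (ybar - mu0) ** 2), mu0, w

    def gravity_test(samples, R=999, rng=rng):
        t, mu0, w = q_stat(samples)
        # Null model fit: variances sigma^2_{i0} and pooled studentized residuals.
        sig2 = np.array([(len(y) - 1) * np.var(y, ddof=1) / len(y)
                         + (np.mean(y) - mu0) ** 2 for y in samples])
        resid = np.concatenate([(y - mu0) / np.sqrt(s2 - 1 / w.sum())
                                for y, s2 in zip(samples, sig2)])
        sizes = [len(y) for y in samples]
        t_star = np.empty(R)
        for r in range(R):
            sim = [mu0 + np.sqrt(s2) * rng.choice(resid, n, replace=True)
                   for s2, n in zip(sig2, sizes)]
            t_star[r] = q_stat(sim)[0]
        return t, (1 + np.sum(t_star >= t)) / (R + 1)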

Table 4.3 Summary statistics for eight samples in gravity data, plus ingredients for significance test. The weighted mean is μ̂_0 = 78.6.

Figure 4.10 Resampling results for comparison of the means of the eight series of gravity data. Left panel: histogram of R = 999 values of t* under nonparametric resampling from the null model with pooled studentized residuals; the unshaded area to the right of the observed value t = 21.275 gives p = 0.029. Right panel: ordered t* values versus χ²_7 quantiles; the dotted line is the theoretical approximation.


Example 4.15 (Ratio test) Suppose that, as in Example 1.2, each observation y is a pair (u, x), and that we are interested in the ratio of means θ = E(X)/E(U). In particular suppose that we wish to test the null hypothesis H_0 : θ = θ_0. This problem could arise in a variety of contexts, and the context would help to determine the relevant null model. For example, we might have a paired-comparison experiment where the multiplicative effect θ is to be tested. Here θ_0 would be 1, and the marginal distributions of U and X should be the same under H_0. One natural null model F̂_0 would then be the symmetrized EDF, i.e. the EDF of the expanded data (u_1, x_1), ..., (u_n, x_n), (x_1, u_1), ..., (x_n, u_n). ■

Fully nonparametric null models

In those few situations where the context of the problem does not help identify a suitable semiparametric null model, it is in principle possible to form a wholly nonparametric null model F̂_0. Here we look at one general way to do this.

Suppose the test involves k distributions F_1, ..., F_k for which the null hypothesis imposes a constraint, H_0 : τ(F_1, ..., F_k) = 0. Then we can obtain a null model by nonparametric maximum likelihood, or a similar method, by adding the constraint to the usual derivation of the EDFs as MLEs. To be specific, suppose that we force the estimates of F_1, ..., F_k to be supported on the corresponding sample values, as the EDFs are. Then the estimate for F_i will attach probabilities p_i = (p_{i1}, ..., p_{in_i}) to sample values y_{i1}, ..., y_{in_i}; the unconstrained EDF F̂_i corresponds to p̂_i = n_i^{−1}(1, ..., 1). Now measure the discrepancy between a possible F_i and the EDF F̂_i by d(p̂_i, p_i), say, such that the EDF probabilities p̂_i minimize this when no constraints other than Σ_{j=1}^{n_i} p_{ij} = 1 are imposed. Then a nonparametric null model is given by the probabilities which minimize the aggregate discrepancy subject to τ(F_1, ..., F_k) = 0. That is, the null model minimizes the Lagrange expression

Σ_{i=1}^{k} d(p̂_i, p_i) + λ t(p_1, ..., p_k) − Σ_{i=1}^{k} a_i (Σ_{j=1}^{n_i} p_{ij} − 1),    (4.22)

where t(p_1, ..., p_k) is a re-expression of the original constraint function τ(F_1, ..., F_k). We denote the solutions of this constrained minimization problem by p_i^0, i = 1, ..., k.

The choice of discrepancy function d(·, ·) that corresponds to maximum likelihood estimation is the aggregate information distance

Σ_{i=1}^{k} Σ_{j=1}^{n_i} p̂_{ij} log(p̂_{ij} / p_{ij}),    (4.23)



and a useful alternative is the reverse information distance

Σ_{i=1}^{k} Σ_{j=1}^{n_i} p_{ij} log(p_{ij} / p̂_{ij}).    (4.24)

Both are minimized by the set of EDFs when no constraints are imposed. The second measure has the advantage of automatically providing non-negative solutions. The following example illustrates the method and some of its implications.

Example 4.16 (Comparison of two means, continued) For the two-sample problem considered in Examples 4.11 and 4.13, we apply (4.22) with the discrepancy measure (4.24). The null hypothesis constraint is that the two means are equal, that is Σ_j y_{1j} p_{1j} = μ_1 = μ_2 = Σ_j y_{2j} p_{2j}, so that (4.22) becomes

Σ_{i=1}^{2} Σ_{j=1}^{n_i} p_{ij} log(n_i p_{ij}) + λ (Σ_{j=1}^{n_1} y_{1j} p_{1j} − Σ_{j=1}^{n_2} y_{2j} p_{2j}) − Σ_{i=1}^{2} a_i (Σ_{j=1}^{n_i} p_{ij} − 1).

Setting derivatives with respect to p_{ij} equal to zero gives the equations

1 + log p_{1j} − a_1 − λ y_{1j} = 0,   1 + log p_{2j} − a_2 + λ y_{2j} = 0,

which together with the initial constraints gives the solutions

p_{1j,0} = exp(λ y_{1j}) / Σ_{k=1}^{n_1} exp(λ y_{1k}),   p_{2j,0} = exp(−λ y_{2j}) / Σ_{k=1}^{n_2} exp(−λ y_{2k}).    (4.25)

The specific value of λ is uniquely determined by the null hypothesis constraint, which becomes

Σ_j y_{1j} exp(λ y_{1j}) / Σ_k exp(λ y_{1k}) = Σ_j y_{2j} exp(−λ y_{2j}) / Σ_k exp(−λ y_{2k}),

whose solution must be determined numerically. Distributions of the form (4.25) are usually called exponential tilts of the EDFs.

For our data λ = 0.130. The resulting null model probabilities are shown in the left panel of Figure 4.11. The right panel will be discussed later.
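A minimal sketch of the numerical solution for λ in Python (our own illustration; the bracketing interval for the root is an assumption chosen to suit these data):

    import numpy as np
    from scipy.optimize import brentq

    def tilt_probs(y, lam):
        # Exponential tilt (4.25) of an EDF: p_j proportional to exp(lam * y_j);
        # centring y improves numerical stability without changing the weights.
        y = np.asarray(y, dtype=float)
        w = np.exp(lam * (y - y.mean()))
        return w / w.sum()

    def solve_lambda(y1, y2):
        # Root of: tilted mean of sample 1 minus tilted mean of sample 2.
        y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
        gap = lambda lam: tilt_probs(y1, lam) @ y1 - tilt_probs(y2, -lam) @ y2
        return brentq(gap, -1.0, 1.0)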

Having determined these null probabilities, the bootstrap test algorithm is as follows:

Algorithm 4.1 (Tilted bootstrap two-sample comparison) For r = 1, ..., R,

1 Generate (y*_{11}, ..., y*_{1n_1}) by randomly sampling n_1 times from (y_{11}, ..., y_{1n_1}) with weights (p_{11,0}, ..., p_{1n_1,0}).

2 Generate (y*_{21}, ..., y*_{2n_2}) by randomly sampling n_2 times from (y_{21}, ..., y_{2n_2}) with weights (p_{21,0}, ..., p_{2n_2,0}).

3 Calculate the test statistic t*_r = ȳ*_2 − ȳ*_1.


Figure 4.11 Null distributions for comparison of two means. Left panel: null probability distributions p_{10} (1) and p_{20} (2) with equal means (λ = 0.130); observations are marked +. Right panel: smooth densities corresponding to null probability distributions for population 1 (dotted curve) and population 2 (dashed curve), and smooth density corresponding to pooled EDF (solid curve).


Table 4.4 Resampling P-values for one-sided comparison of two means. The entries are explained in Examples 4.11, 4.13, 4.16, 4.19 and 4.20.

Null model         Statistic   P-value
pooled EDF         t and z     0.045
null variances     t           0.053
exponential tilt   t           0.006
                   z           0.025
MLE                t           0.019
                   z           0.017
(pivot)            z           0.015

4 Finally, calculate

p = {1 + #(t*_r ≥ t)} / (R + 1).
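A minimal sketch of Algorithm 4.1 in Python (our own illustration), using tilted probabilities of the form (4.25):

    import numpy as np

    rng = np.random.default_rng(9)

    def tilted_boot_test(y1, y2, p10, p20, R=999, rng=rng):
        # Resample each sample from its own tilted null distribution and
        # compare t* = ybar2* - ybar1* with the observed t.
        t = np.mean(y2) - np.mean(y1)
        t_star = np.empty(R)
        for r in range(R):
            y1_star = rng.choice(y1, size=len(y1), replace=True, p=p10)
            y2_star = rng.choice(y2, size=len(y2), replace=True, p=p20)
            t_star[r] = y2_star.mean() - y1_star.mean()
        return (1 + np.sum(t_star >= t)) / (R + 1)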

Numerical results for R = 999 are given in Table 4.4 in the line labelled "exponential tilt, t". Results for other resampling tests are also given for comparison: z refers to a studentized version of t, "MLE" refers to use of constrained maximum likelihood (see Problem 4.8), and "null variances" refers to the semiparametric method of Example 4.14. Clearly the choice of null model can have a strong effect on the P-value, as one might expect. The studentized test statistics z are discussed in Section 4.4.1. ■

The method as illustrated here has a strong similarity to the use of empirical likelihood methods, as described in Chapter 10. In practice it seems wise to


check the null model produced by the method, since the resulting P-values are generally sensitive to the model. Thus, in the previous example, we should look at Figure 4.11 to see if it makes practical sense. The smoothed versions of the null distributions in the right panel, which are obtained by kernel smoothing, are perhaps easier to interpret. One might well judge in this case that the two null distributions are more different than seems plausible. Despite this reservation about this example, the general method is a valuable tool to have in case of need.

There are, of course, situations where even this quite general approach will not work. Nevertheless the basic idea behind the approach can still be applied, as the following examples show.

Example 4.17 (Test for unimodality) One of the difficulties with nonparametric curve estimation is knowing whether particular features are "real". For example, suppose that we compute a density estimate f̂(y) and find that it has two modes. How do we tell if the minor mode is real? Bootstrap methods can be helpful in such problems. Suppose that a kernel density estimate is used, so that

f̂(y; h) = (nh)^{−1} Σ_{j=1}^{n} φ{(y − y_j)/h},    (4.26)

where φ is the standard normal density. It is possible to show that the number of modes of f̂ decreases as h increases. So one way to test unimodality is to see if an unusually large h is needed to make f̂ unimodal. This suggests that we take as test statistic

t = min{h : f̂(y; h) is unimodal}.

A natural candidate for the null sampling distribution is f̂(y; t), since this is the least smoothed version of the EDF which satisfies the null hypothesis of unimodality. By the convolution property of f̂, random sample values from f̂(y; t) are given by

y*_j = y_{I_j} + t ε_j,    (4.27)

where the ε_j are independent N(0,1) variates and the I_j are random integers from {1, 2, ..., n}. On general grounds it seems wise to modify f̂ so as to have its first two moments agree with the data (Problem 3.8), but this modification would have no effect here.

For any such sample y*_1, ..., y*_n generated from the null distribution, we can check whether or not t* > t by checking whether or not the particular density estimate f̂*(y*; t) is unimodal. ■
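A minimal sketch of this test in Python (our own illustration; the evaluation grid, the bisection tolerance and the search bounds for the critical bandwidth are assumptions):

    import numpy as np

    rng = np.random.default_rng(10)

    def n_modes(y, h, grid):
        # Number of local maxima of the kernel estimate (4.26) on a grid.
        f = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
        d = np.diff(f)
        return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

    def critical_bandwidth(y, grid, lo=1e-3, hi=None, tol=1e-3):
        # Smallest h making f(.; h) unimodal; bisection is valid because the
        # number of modes is monotone decreasing in h for the normal kernel.
        hi = hi if hi is not None else 4 * y.std()
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (lo, mid) if n_modes(y, mid, grid) <= 1 else (mid, hi)
        return hi

    def unimodality_test(y, R=999, rng=rng):
        grid = np.linspace(y.min() - y.std(), y.max() + y.std(), 512)
        t = critical_bandwidth(y, grid)
        exceed = 0
        for _ in range(R):
            y_star = (y[rng.integers(len(y), size=len(y))]
                      + t * rng.standard_normal(len(y)))   # sampling (4.27)
            exceed += n_modes(y_star, t, grid) > 1         # t* > t iff multimodal at t
        return t, (1 + exceed) / (R + 1)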

The next example applies a variation of this test.


Table 4.5 Perpendicular distances (miles) from an aerial line transect to schools of Southern Bluefin Tuna in the Great Australian Bight (Chen, 1996).

0.19  0.28  0.29  0.45  0.64  0.65  0.78  0.85
1.00  1.16  1.17  1.29  1.31  1.34  1.55  1.60
1.83  1.91  1.97  2.05  2.10  2.17  2.28  2.41
2.46  2.51  2.89  2.89  2.90  2.92  3.03  3.19
3.48  3.79  3.83  3.94  3.95  4.11  4.14  4.19
4.36  4.53  4.97  5.02  5.13  5.75  6.03  6.19
6.19  6.45  7.13  7.35  7.77  7.80  8.81  9.22
9.29  9.78  10.15 11.32 13.21 13.27 14.39 16.26

Example 4.18 (Tuna density estimate) One method for estimating the abundance of a species in a region is to traverse a straight line of length L through the region, and to record the perpendicular distances from the line to positions where there are sightings. If there are n independent sightings and their (unsigned) distances y_1, ..., y_n are presumed to have PDF f(y), y ≥ 0, the abundance density can be estimated by n f̂(0)/(2L), where f̂(0) is an estimate of the density at distance y = 0. The PDF f(y) is proportional to a detection function that is assumed to decline monotonically with increasing distance, with non-monotonic decline suggesting that the assumptions that underlie line transect sampling must be questioned.

Table 4.5 gives data from an aerial survey of schools of Southern Bluefin Tuna in the Great Australian Bight. Figure 4.12 shows a histogram of the data. The figure also shows kernel density estimates

f̂(y; h) = (nh)^{−1} Σ_{j=1}^{n} [φ{(y − y_j)/h} + φ{(y + y_j)/h}],   y ≥ 0,    (4.28)

with h = 0.75, 1.5125, and 3. This seemingly unusual density estimate is used because the probability of detection, and hence the distribution of signed distances, should be symmetric about the transect. The estimate is obtained by first calculating the EDF of the reflected distances ±y_1, ..., ±y_n, then applying the kernel smoother, and finally folding the result about the origin.

Although the estimated density falls monotonically for h greater than 1.5125, the estimate for smaller values suggests non-monotonic decline. Since we consider f̂(y; h) for positive values of y only, we are interested in whether the underlying density falls monotonically or not. We take the smallest h such that f̂(y; h) is unimodal to be the value of our test statistic t. This corresponds to monotonic decline of f̂(y; h) for y > 0, giving no modes for y > 0. The observed value of the test statistic is t = 1.5125, and we are interested in the significance probability

Pr(T ≥ t | F̂_0),

for data arising from F̂_0, an estimate of F that satisfies the null hypothesis of


Figure 4.12 Histogram of the tuna data, and kernel density estimates (4.28) with bandwidths h = 1.5125 (solid), 0.75 (dashes), and 3 (dots).


monotone decline but is otherwise as close to the data as possible. That is, the null model is f̂(y; t).

To generate replicate datasets from the null model we use the convolution property of (4.28), which implies

y*_j = |± y_{I_j} + t ε_j|,   j = 1, ..., n,

where the signs ± are assigned randomly, the I_j are random integers from {1, 2, ..., n}, and the ε_j are independent N(0,1) variates; cf. (4.27). The kernel density estimate based on the y*_j is f̂*(y; h). We now calculate the test statistic as outlined in the previous example, and repeat the process R = 999 times to obtain an approximate significance probability. We restrict the hunt for modes to 0 < y < 10, because it does not seem sensible to use so small a smoothing parameter in the density tails.
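A minimal sketch of the null sampling step in Python (our own illustration, reusing the mode-counting ideas above):

    import numpy as np

    rng = np.random.default_rng(11)

    def null_sample(y, t, rng):
        # Replicate data from the fitted null model f(.; t) of (4.28):
        # y*_j = |sign_j * y_{I_j} + t * eps_j| with random signs, random
        # indices I_j and independent N(0,1) eps_j.
        n = len(y)
        signs = rng.choice([-1.0, 1.0], size=n)
        return np.abs(signs * y[rng.integers(n, size=n)]
                      + t * rng.standard_normal(n))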

When the simulations were performed for these data, the frequencies of the number of modes of f̂*(y; t) for 0 < y < 10 were as follows.

Modes      0    1    2   3
Frequency  536  411  50  2

Like the fitted null distribution, a replicate where the full f̂*(y; t) is unimodal will have no modes for y > 0. If we assume that the event t* = t is impossible, bootstrap datasets with no modes have t* < t, so the significance probability is (411 + 50 + 2 + 1)/(999 + 1) = 0.464. There is no evidence against monotonic decline, giving no cause to doubt the assumptions underlying line transect methods. ■


4.4.1 Studentized bootstrap method

For testing problems which involve parameter values, it is possible to obtain more stable significance tests by studentizing comparisons. One version of this is analogous to calculating a 1 − p confidence set by the studentized bootstrap method (Section 5.2.1), and concluding that the P-value is less than p if the null hypothesis parameter value is outside the confidence set. Section 4.1 outlined the application of this idea. Here we describe two possible resampling implementations.

For simplicity suppose first that θ is a scalar with estimator T, and that we want to test H_0 : θ = θ_0 versus H_A : θ > θ_0. The method suggested in Section 4.1 applies when

Z = (T − θ) / V^{1/2}

is approximately a pivot, meaning that its distribution is approximately independent of unknown parameters. Then, with z_0 = (t − θ_0)/v^{1/2} denoting the observed studentized test statistic, the resampling analogue of (4.6) is

p = Pr*(Z* ≥ z_0 | F̂),    (4.29)

which we can approximate by simulation without having to decide on a null model F̂_0. The usual choice for v would be the nonparametric delta method estimate v_L of Section 2.7.2. The theoretical support for the use of Z is given in Section 5.4; in certain cases it will be advantageous to studentize a transformed estimate (Sections 5.2.2 and 5.7). In practice it would be appropriate to check on whether or not Z is approximately pivotal, using techniques described in Section 3.10.

Applications of this method are described in Sections 6.2.5 and 6.3.2. The modifications for the other one-sided alternative and for the two-sided alternative are simply p = Pr*(Z* ≤ z_0 | F̂) and p = Pr*(Z*² ≥ z_0² | F̂).

Example 4.19 (Comparison of two means, continued) For the application considered in Examples 4.11, 4.13 and 4.16, where we compared two means using t = ȳ_2 − ȳ_1, it would be reasonable to suppose that the usual two-sample t statistic

Z = {Ȳ_2 − Ȳ_1 − (μ_2 − μ_1)} / (S_2²/n_2 + S_1²/n_1)^{1/2}

is approximately pivotal. Here F̂ in (4.29) represents the EDFs of the two samples, given that no assumptions are made connecting the two distributions.

We calculate the observed value of the test statistic,

z_0 = (ȳ_2 − ȳ_1) / (s_2²/n_2 + s_1²/n_1)^{1/2},


whose value for these data is 2.846/1.610 = 1.768. Then R values of

z* = {ȳ*_2 − ȳ*_1 − (ȳ_2 − ȳ_1)} / (s*_2²/n_2 + s*_1²/n_1)^{1/2}

are generated, with each simulated dataset containing n_1 values sampled with replacement from sample 1 and n_2 values sampled with replacement from sample 2.

In R = 999 simulations we found 14 values in excess of 1.768, so the P-value is 0.015. This is entered in Table 4.4 in the row labelled "(pivot)". ■
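A minimal sketch of this studentized (pivot) bootstrap test in Python (our own illustration):

    import numpy as np

    rng = np.random.default_rng(12)

    def z_stat(y1, y2, delta=0.0):
        v = np.var(y2, ddof=1) / len(y2) + np.var(y1, ddof=1) / len(y1)
        return (np.mean(y2) - np.mean(y1) - delta) / np.sqrt(v)

    def pivot_test(y1, y2, R=999, rng=rng):
        # Resample each sample from its own EDF; centre z* at the observed
        # mean difference, since no null model is imposed on the data.
        z0 = z_stat(y1, y2)
        d = np.mean(y2) - np.mean(y1)
        z_star = np.array([z_stat(rng.choice(y1, len(y1), replace=True),
                                  rng.choice(y2, len(y2), replace=True), d)
                           for _ in range(R)])
        return z0, (1 + np.sum(z_star >= z0)) / (R + 1)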

If θ is a vector with estimator T, and the null hypothesis is simple, H_0 : θ = θ_0, with general alternative H_A : θ ≠ θ_0, then the analogous pivot is

Q = (T − θ)^T V^{−1} (T − θ),

with observed test statistic value

q_0 = (t − θ_0)^T v^{−1} (t − θ_0).

Again v_L is a standard choice for v, and again it may be beneficial first to transform T (Section 5.8). Test statistics for more complicated alternatives can be defined similarly; see Problem 4.10.

Studentized test statistics can also be used when Z or Q is not a pivot. The definitions will be slightly different,

Z_0 = (T − θ_0) / V_0^{1/2}    (4.30)

for the scalar case and

Q_0 = (T − θ_0)^T V_0^{−1} (T − θ_0)

for the vector case, where V_0 is an estimated variance under the null model. If Z_0 is used the bootstrap P-value will simply be

p = Pr*(Z*_0 ≥ z_0 | F̂_0),    (4.31)

with the obvious changes for a test based on Q_0. Even though the statistic is not pivotal, its use is likely to reduce the effects of nuisance parameters, and to give a P-value that is more nearly uniformly distributed under the null hypothesis than that calculated from T alone.

Example 4.20 (Comparison of two means, continued) In Table 4.4 all the entries for z, except for the row labelled "(pivot)", were obtained using (4.30) with t = ȳ_2 − ȳ_1 and v_0 depending on the null model. For example, for the null models discussed in Example 4.16,

v_0 = Σ_{i=1}^{2} n_i^{−1} Σ_{j=1}^{n_i} (y_{ij} − μ̂_{i0})² p_{ij,0},


where μ̂_{i0} = Σ_{j=1}^{n_i} y_{ij} p_{ij,0}. For the two samples in question, under the exponential tilt null model both means equal 79.17 and v_0 = 1.195, the latter differing considerably from the variance estimate 2.59 used in the pivot method (Example 4.19).

The associated P-values computed from (4.31) are shown in Table 4.4 for all null models. These P-values are less dependent upon the particular null model than those obtained with t unstudentized. ■

4.4.2 Conditional bootstrap tests

In parametric testing, conditioning plays an important role both in eliminating nuisance parameters and in fixing the information content of the data. In nonparametric testing the situation is less clear, because of the absence of a full model. Some aspects of conditioning are illustrated in Examples 5.16 and 5.17.

One simple example which does illustrate the possibility and effect of conditioning is the nonparametric bootstrap test for independence. In Example 4.9 we described an exact permutation test for this problem. The analogous bootstrap test would set the null model F̂_0 to be the product of the marginal EDFs. Simulation under this model is equivalent to creating u*'s by random sampling with replacement from the u's, and independently creating x*'s by random sampling with replacement from the x's. However, we could view the marginal CDFs G and H as nuisance parameters and attempt to remove them from the analysis by conditioning on Ĝ* = Ĝ and Ĥ* = Ĥ. This turns out to be exactly equivalent to using the permutation test, which does indeed completely eliminate G and H.

Adaptive tests

Conditioning occurs in a somewhat different way in the adaptive choice of a test statistic. Suppose that we have possible test statistics T_1, ..., T_k for which efficiency measures can be defined and estimated by e_1, ..., e_k: for example, if the T_i are alternative estimators for a scalar parameter θ and H_0 concerns θ, then e_i might be the reciprocal of the estimated variance of T_i. The idea of the adaptive test is to use that T_i which is estimated to be most efficient for the observed data, and to condition on this fact.

We first partition the set 𝒴 of all possible null model resamples y*_1, ..., y*_n into 𝒴_1, ..., 𝒴_k such that

𝒴_i = {(y*_1, ..., y*_n) : e*_i = max_{1≤j≤k} e*_j}.

Then if y_1, ..., y_n is in 𝒴_i, so that t_i is preferred, the adaptive test computes the P-value as

p = Pr*(T*_i ≥ t_i | (y*_1, ..., y*_n) ∈ 𝒴_i).


For an example of this, see Problem 4.13. In the case of exact tests, such as permutation tests, the adaptive test is also exact.

4.4.3 Multiple testing

In some applications multiple tests of a hypothesis are based on a single set of data. This happens, for example, when pairwise comparisons of means are carried out for a several-sample analysis where the null hypothesis is equality of all means. In such situations the smallest of all test P-values is used, and it is clearly incorrect to interpret this smallest value in the usual way. Bootstrapping can be used to find the true significance level of the smallest P-value, as follows.

Departing from our general notation, suppose that the test statistics are S₁, ..., S_k, with observed values s₁, ..., s_k, and that the null distribution of S_i is known to be G_i(·). Then the observed significance levels are 1 − G_i(s_i). The incorrect procedure would be to treat the smallest P-value min{1 − G₁(s₁), ..., 1 − G_k(s_k)} as uniform on the interval [0,1]. If the tests were exact and independent, the corresponding random variable would have distribution 1 − (1 − u)^k on [0,1], but in general we should take into account their (unknown) dependence. We can allow for the multiple testing by taking t = min{1 − G₁(s₁), ..., 1 − G_k(s_k)} to be the test statistic, and then the procedure is as follows. We generate data from the null hypothesis distribution, calculate the bootstrap statistics s₁*, ..., s_k* and then take t* = min{1 − G₁(s₁*), ..., 1 − G_k(s_k*)}. We repeat this R times to get t₁*, ..., t_R*, and then obtain the P-value in the usual way. Notice that if all the G_i(·) equal G(·), say, the test is tantamount to bootstrapping t = max(s₁, ..., s_k), and then G(·) need not be known. If the G_i(·) are unequal, the procedure requires them to be known, in order to put the test statistics on a scale where they can be compared. If the G_i(·) are unknown, they can be estimated, but then a nested bootstrap (Section 3.9) is needed to obtain the P-value. The algorithm is the following.

Algorithm 4.2 (Multiple testing)

For r = 1, ..., R:

1. Generate y₁*, ..., yₙ* independently from the fitted null distribution F̂₀, and from them calculate s₁*, ..., s_k*.

2. Fit the null distribution F̂₀* to y₁*, ..., yₙ*.

3. For m = 1, ..., M, generate y₁**, ..., yₙ** independently from the fitted null distribution F̂₀*, and from them calculate s₁m**, ..., s_km**.

4. Calculate

$$t_r^* = \min_{1 \le i \le k} \frac{1 + \#\{s_{im}^{**} \ge s_i^*\}}{M + 1}.$$

Calculate p = (1 + #{t_r* ≤ t})/(R + 1).
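A schematic R implementation of Algorithm 4.2 might look as follows. The arguments fit.null, gen.null and stats.fn are hypothetical placeholders for the application-specific pieces: fitting the null model, sampling a dataset from a fitted null model, and computing the k test statistics. The observed t is estimated from the same first-level resamples, an assumption made for compactness.

multiple.test <- function(y, R, M, fit.null, gen.null, stats.fn)
{ s0 <- stats.fn(y); k <- length(s0)
  f0 <- fit.null(y)
  tR <- numeric(R); sR <- matrix(0, R, k)
  for (r in 1:R) {
    ystar <- gen.null(f0)                                 # step 1
    sR[r, ] <- sstar <- stats.fn(ystar)
    fstar <- fit.null(ystar)                              # step 2
    smm <- matrix(replicate(M, stats.fn(gen.null(fstar))), nrow = k)  # step 3
    tR[r] <- min((1 + rowSums(smm >= sstar))/(M + 1))     # step 4
  }
  t0 <- min((1 + colSums(sR >= matrix(s0, R, k, byrow = T)))/(R + 1))
  (1 + sum(tR <= t0))/(R + 1)                             # significance level
}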


The procedure is analogous to that used in Section 4.5, but in this case adjustment would require three levels of nested bootstrapping.

4.5 Adjusted P-values

So far we have described tests which compute P-values as p = Pr*(T* ≥ t | F̂₀) with F̂₀ the working null sampling model. Ideally P should be uniformly distributed on [0,1] if the usual error rate interpretation is to be valid. This will be exactly or approximately correct for permutation and permutation-like bootstrap tests, but for other tests it can be far from correct. Preventive measures we can take are to transform t or studentize it, or both. However, these are not guaranteed to work. Here we describe a general method of adjustment, simple in principle but potentially very computer-intensive.

The idea behind adjusting P-values is simply to treat p as the observed test statistic: it is after all just a transformation of t. We estimate the distribution of the corresponding random variable P by resampling, under the null model of course. Since small values of p are of interest, the adjusted P-value is defined by

$$p_{\mathrm{adj}} = \Pr{}^*(P^* \le p \mid \hat F_0), \qquad\qquad (4.32)$$

where p is the observed P-value defined above. This requires bootstrapping the algorithm for computing P-values, another instance of increasing the accuracy of a bootstrap method by bootstrapping it, an idea introduced in Section 3.9.

The problem can be explained theoretically in either of two ways, perturbing the critical value of t for a fixed nominal error rate α, or adjusting for the bias in the P-value. We take the second approach, and since we are dealing with statistical error rather than simulation error (Section 2.5), we ignore the latter. The P-value computed for the data is written p₀(F̂), where the function p₀(·) depends on the method used to obtain F̂₀ from F̂. When the null hypothesis is true, suppose that the particular null distribution F₀ obtains. Then the null distribution function for the P-value is

$$G(u, F_0) = \Pr\{p_0(\hat F) \le u \mid F_0\}, \qquad\qquad (4.33)$$

which with u = α is the true error rate corresponding to nominal error rate α. Now (4.33) implies that

$$\Pr\{G(p_0(\hat F), F_0) \le \alpha \mid F_0\} = \alpha,$$

and so G(p₀(F̂), F₀) would be the ideal adjusted P-value, having actual error rate equal to the nominal error rate. Next notice that by substituting F̂₀ for F₀ in (4.33) we can estimate G(u, F₀) by

$$\Pr{}^*\{p_0(\hat F^*) \le u \mid \hat F_0\}.$$


Finally, setting u = p₀(F̂) we obtain

$$\hat G(p_0(\hat F), \hat F_0) = \Pr{}^*\{p_0(\hat F^*) \le p_0(\hat F) \mid \hat F_0\}.$$

This we define to be the adjusted P-value, so when p₀(F̂) = p,

$$p_{\mathrm{adj}} = \Pr{}^*\{p_0(\hat F^*) \le p \mid \hat F_0\}, \qquad\qquad (4.34)$$

which is a more precise version of (4.32).

One must be careful to interpret P* = p₀(F̂*) properly in (4.34). Since the outer probability relates to sampling from F̂₀, F̂* in (4.34) denotes the EDF of a sample drawn from F̂₀.

The adjusted P-value can be applied to advantage in both parametric and nonparametric testing, the key point being that it is more nearly uniformly distributed than the unadjusted P-value. Before discussing simulation implementation of the adjustment, we look at a simple example which illustrates the basic method.

Example 4.21 (Comparison of exponential means) Suppose that x₁, ..., x_m and y₁, ..., yₙ are respectively random samples from exponential distributions with means μ₁ and μ₂, and that we wish to test H₀: μ₁ = μ₂. For this problem there is an exact test based on U = X̄/Ȳ, but we consider instead the test statistic T = X̄ − Ȳ, for which we show that the adjusted P-value automatically produces the P-value for the exact test.

For the parametric bootstrap test the null model sets the two sampling distributions equal to a common fitted exponential distribution with pooled mean

$$\hat\mu = \frac{m\bar x + n\bar y}{m + n}.$$

If X̄* and Ȳ* denote averages of random samples of sizes m and n respectively from this exponential distribution, then the bootstrap P-value is p = Pr*(X̄* − Ȳ* ≥ x̄ − ȳ). This can be rewritten as

$$p = \Pr\left\{ m^{-1}G_m - n^{-1}G_n \ge \frac{(m+n)(u-1)}{mu+n} \right\}, \qquad\qquad (4.35)$$

where u = x̄/ȳ, and G_m and G_n are independent gamma random variables with indices m and n respectively and unit scale parameters.

The bootstrap P-value (4.35) does not have a uniform distribution under the null hypothesis, so P = p does not correspond to error rate p. This is fully corrected using the adjustment (4.34). To see this, write (4.35) as p = h(u), so that p₀(F̂*) equals

$$\Pr{}^{**}(T^{**} \ge T^* \mid \hat F_0^*) = h(U^*),$$

where U* = X̄*/Ȳ*. Since h(·) is decreasing, it follows that

$$p_{\mathrm{adj}} = \Pr{}^*\{h(U^*) \le h(u) \mid x, y\} = \Pr{}^*(U^* \ge u \mid x, y) = \Pr(F_{2m,2n} \ge u),$$


which is the P-value of the exact test. Therefore p_adj is exactly uniform and the adjustment is perfectly successful. ■

In the previous example, the same result for p_adj would be achieved if the bootstrap distribution of T were replaced by a normal approximation. This might suggest that bootstrap calculation of p could be replaced by a rough theoretical approximation, thus removing one level of bootstrap sampling from calculation of p_adj. Unfortunately this is not always true, as is clear from the fact that if an approximate null distribution of T is used which does not depend upon F̂ at all, then p_adj is just the ordinary bootstrap P-value.

In most applications it will be necessary to use simulation to approximate the adjusted P-value (4.34). Suppose that we have drawn R resamples from the null model F̂₀, with corresponding test statistic values t₁*, ..., t_R*. The rth resample has EDF F̂_r* (possibly a vector of EDFs), to which we fit the null model F̂_{r0}*. Resampling M times from F̂_{r0}* gives samples from which we calculate t_{rm}**, m = 1, ..., M. Then the Monte Carlo approximation for the adjusted P-value is

$$p_{\mathrm{adj}} = \frac{1 + \#\{p_r^* \le p\}}{R + 1}, \qquad\qquad (4.36)$$

where for each r

$$p_r^* = \frac{1 + \#\{t_{rm}^{**} \ge t_r^*\}}{M + 1}. \qquad\qquad (4.37)$$

If p is calculated from the same R resamples, then a total of RM samples is generated. We can summarize the algorithm as follows:

Algorithm 4.3 (Double bootstrap test)

For r = 1, ..., R:

1. Generate y₁*, ..., yₙ* independently from the fitted null distribution F̂₀ and calculate the test statistic t_r* from them.

2. Fit the null distribution to y₁*, ..., yₙ*, thereby obtaining F̂_{r0}*.

3. For m = 1, ..., M,

   (a) generate y₁**, ..., yₙ** independently from the fitted null distribution F̂_{r0}*; and

   (b) calculate from them the test statistic t_{rm}**.

4. Calculate p_r* as in (4.37).

Finally, calculate p_adj as in (4.36). •
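A corresponding schematic implementation of Algorithm 4.3, under the same assumptions about the placeholder functions fit.null, gen.null and stat.fn as in the sketch of Algorithm 4.2 above:

double.boot.test <- function(y, R, M, fit.null, gen.null, stat.fn)
{ t0 <- stat.fn(y)
  f0 <- fit.null(y)
  tR <- pR <- numeric(R)
  for (r in 1:R) {
    ystar <- gen.null(f0)                          # step 1
    tR[r] <- stat.fn(ystar)
    fstar <- fit.null(ystar)                       # step 2
    tmm <- replicate(M, stat.fn(gen.null(fstar)))  # step 3
    pR[r] <- (1 + sum(tmm >= tR[r]))/(M + 1)       # step 4, equation (4.37)
  }
  p <- (1 + sum(tR >= t0))/(R + 1)                 # unadjusted bootstrap P-value
  c(p = p, padj = (1 + sum(pR <= p))/(R + 1))      # equation (4.36)
}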

We discuss the choice of M after the following example.

Example 4.22 (Two-way table) Table 4.6 contains a set of observed multinomial counts, for which we wish to test the null hypothesis of row-column independence, or additive loglinear model.


Table 4.6 Two-way table of counts (Newton and Geyer, 1994):

1 2 2 1 1 0 1
2 0 0 2 3 0 0
0 1 1 1 2 7 3
1 1 2 0 0 0 1
0 1 1 1 1 0 0

If the count in row i and column j is y_{ij}, then the null fitted values are μ̂_{ij,0} = y_{i+} y_{+j} / y_{++}, where y_{i+} = Σ_j y_{ij} and so forth. The log likelihood ratio test statistic is

$$t = 2 \sum_{i,j} y_{ij} \log(y_{ij} / \hat\mu_{ij,0}).$$

According to standard theory, T is approximately distributed as χ²_d under the null hypothesis with d = (7 − 1) × (5 − 1) = 24. Since t = 38.52, the approximate P-value is Pr(χ²₂₄ ≥ 38.52) = 0.031. However, the chi-squared approximation is known to be quite poor for such a sparse table, so we apply the parametric bootstrap.

The model F̂₀ is the fitted multinomial model, with sample size n = y₊₊ and (i, j)th cell probability μ̂_{ij,0}/n. We generate R tables from this model and calculate the corresponding log likelihood ratio statistics t₁*, ..., t_R*. With R = 999 we obtain 47 statistics larger than the observed value t = 38.52, so the bootstrap P-value is (1 + 47)/(1 + 999) = 0.048. The inaccuracy of the chi-squared approximation is illustrated by Figure 4.13, which is a plot of ordered values of Pr(χ²₂₄ ≥ t*) versus expected uniform order statistics: the straight line corresponds to the theoretical chi-squared approximation for T.
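The simulation just described can be sketched in R as follows, using the table as reconstructed above; the helper names are illustrative.

lr.stat <- function(y)
{ mu0 <- outer(rowSums(y), colSums(y))/sum(y)   # fitted values under independence
  2*sum(y*log(y/mu0), na.rm = T) }              # zero cells contribute zero
y <- matrix(c(1,2,2,1,1,0,1, 2,0,0,2,3,0,0, 0,1,1,1,2,7,3,
              1,1,2,0,0,0,1, 0,1,1,1,1,0,0), nrow = 5, byrow = T)
t0 <- lr.stat(y)                                # observed statistic (38.52 in the text)
n <- sum(y); p0 <- outer(rowSums(y), colSums(y))/n^2
R <- 999
ts <- replicate(R, lr.stat(matrix(rmultinom(1, n, p0), nrow = 5)))
(1 + sum(ts >= t0))/(R + 1)                     # bootstrap P-value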

The bootstrap P-value turns out to be quite non-uniform. A double bootstrap calculation with R = M = 999 gives p_adj = 0.076.

Note that the test applied here conditions only on the total y₊₊, whereas in principle one would prefer to condition on all row and column sums, which are sufficient statistics under the null hypothesis: this would require more complex simulation methods, such as those of Section 4.2.1; see Problem 4.3. ■

Choice of M

The general application of the double bootstrap algorithm involves simulation at two levels, with a total of RM samples. If we follow the suggestion to use as many as 1000 samples for calculation of probabilities, then here we would need as many as 10⁶ samples, which seems impractical for other than simple problems. As in Section 3.9, we can determine approximately what a sensible choice for M would be. The calculation below of simulation mean squared error suggests that M = 99 would generally be satisfactory, and M = 249 would be safe. There are also ways of reducing considerably the total size of the simulation, as we shall show in Chapter 9.



Figure 4.13 Ordered values of Pr(χ²₂₄ ≥ t*) versus expected uniform order statistics from R = 999 bootstrap simulations under the null fitted model for the two-way table. Dotted line is theoretical approximation.

I{A} is the indicator function of the event A.

To calculate the simulation mean squared error, we begin with equation (4.37), which we rewrite in the form

$$p_r^* = \frac{1 + \sum_{m=1}^{M} I\{t_{rm}^{**} \ge t_r^*\}}{M + 1}.$$

In order to simplify the calculations, we suppose that, as M → ∞, p_r* → u_r such that the u_r's are a random sample from the uniform distribution on [0,1]. In this case there is no need to adjust the bootstrap P-value, so p_adj = p. Under this assumption (M + 1)p_r* is almost a Binom(M, u_r) random variable, so that equation (4.36) can be approximated by

$$\hat p_{\mathrm{adj}} = \frac{1 + \sum_{r=1}^{R} X_r}{R + 1},$$

where X_r = I{Binom(M, u_r) ≤ (M + 1)p}. We can now calculate the simulation mean and variance of p̂_adj by using the fact that

$$E(X_r^k \mid u_r) = \Pr\{\mathrm{Binom}(M, u_r) \le (M + 1)p\}$$

for k = 1, 2. First we have that for all r

$$E(X_r^k) = \int_0^1 \sum_{j=0}^{[(M+1)p]} \binom{M}{j} u^j (1-u)^{M-j}\, du = \frac{[(M+1)p] + 1}{M + 1},$$

where [z] is the integer part of z. Since p̂_adj is proportional to the average of independent X_r's, it follows that

$$E(\hat p_{\mathrm{adj}}) = \frac{(M+1) + R\{[(M+1)p] + 1\}}{(R+1)(M+1)},$$


which tends to the correct answer p as R, M → ∞, and

$$\mathrm{var}(\hat p_{\mathrm{adj}}) = \frac{R[(M+1)p]\{M + 1 - [(M+1)p]\}}{(R+1)^2(M+1)^2}.$$

A simple aggregate measure of simulation error is the mean squared error relative to p,

$$\mathrm{MSE}(\hat p_{\mathrm{adj}}) = \frac{[(M+1)p]\{M + 1 - [(M+1)p]\}}{R(M+1)^2}.$$

Numerical evaluations of this result suggest that M = 249 would be a safe choice. If 0.01 < p < 0.10 then M = 99 would be satisfactory, while M = 49 would be adequate for larger p. Note that two assumptions were made in the calculation, both of which are harmless. First, we assumed that p was independent of the t_r*, whereas in fact it would likely be calculated from the same values. Secondly, our main interest is in cases where P-values are not exactly uniformly distributed. Problem 4.12 suggests a more flexible calculation, from which very similar conclusions emerge.
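The numerical evaluations referred to above are easy to reproduce; the following sketch computes the simulation standard error, the square root of the MSE formula just given, with R = 999 as an illustrative choice.

mse.padj <- function(p, R, M)
{ k <- floor((M + 1)*p)
  k*(M + 1 - k)/(R*(M + 1)^2) }     # MSE formula above
for (M in c(49, 99, 249))
  print(round(sqrt(mse.padj(c(0.01, 0.05, 0.10), 999, M)), 4))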

4.6 Estimating Properties of Tests

A statistical test involves two steps, collection of data and application of a particular test statistic to those data. Both steps involve choice, and resampling methods can have a role to play in such choices by providing estimates of test power.

Estimation of power

As regards collection of data, in simple problems of the kind under discussion in this chapter, the statistical contribution lies in recommendation of sample sizes via considerations of test power. If it is proposed to use test statistic T, and if the particular alternative H_A to the null hypothesis H₀ is of primary interest, then the power of the test is

$$\pi(p, H_A) = \Pr(T > t_p \mid H_A),$$

where t_p is defined by Pr(T > t_p | H₀) = p. In the simplified language of testing theory, if we fix p and decide to reject H₀ when t > t_p, then π(p, H_A) is the chance of rejection when H_A is true. An alternative specification is in terms of E(P | H_A), the expected P-value. In many problems hypotheses are expressed in terms of parameters, and then power can be evaluated for arbitrary parameter values to give a power function. What is of interest to us here is the use of resampling to assess the power of a test, either as an aid to determination of appropriate sample sizes for a particular test, or as a way to choose from a set of possible tests.



Suppose, then, that a pilot set of data y₁, ..., yₙ is in hand, and that the model description is semiparametric (Section 3.3). The pilot data can be used to estimate the nonparametric component of the model, and to this can be added arbitrary values of the parametric component. This provides a family of alternative hypothesis models from which to simulate data and test statistic values. From these simulations we obtain approximations of test power, provided we have critical values t_p for the test statistic. This last condition will not always be met, but in many problems there will at least be a simple approximation, for example N(0,1) if we are using a studentized statistic. For many nonparametric tests, such as those based on ranks, critical values are distribution-free, and so are available. The following example illustrates this idea.

Example 4.23 (Maize height data) The EDFs plotted in the left panel of Figure 4.14 are for heights of maize plants growing in two adjacent rows, and differing only in a pollen sterility factor. The two samples can be modelled approximately by a semiparametric model with an unspecified baseline distribution F and one median-shift parameter θ. For analysis of such data it is proposed to test H₀: θ = 0 using the Wilcoxon test. Whether or not there are enough data can be assessed by estimating the power of this test, which does depend upon F.

Denote the observations in sample i by y_{ij}, j = 1, ..., n_i. The underlying distributions are assumed to have the forms F(y) and F(y − θ), where θ is estimated by the difference in sample medians θ̂. To estimate F we subtract θ̂ from the second sample to give ỹ₂ⱼ = y₂ⱼ − θ̂. Then F̂ is the pooled EDF of the y₁ⱼs and ỹ₂ⱼs. For these data n₁ = n₂ = 12 and θ̂ = −4.5. The right panel of Figure 4.14 plots EDFs of the y₁ⱼs and ỹ₂ⱼs.

The next step is to simulate data for selected values of θ and selected sample sizes N₁ and N₂ as follows. For group 1, sample data y₁₁*, ..., y₁N₁* from F̂(y), i.e. randomly with replacement from

y₁₁, ..., y₁ₙ₁, ỹ₂₁, ..., ỹ₂ₙ₂,

and for group 2, sample data y₂₁*, ..., y₂N₂* from F̂(y − θ), i.e. randomly with replacement from

y₁₁ + θ, ..., y₁ₙ₁ + θ, ỹ₂₁ + θ, ..., ỹ₂ₙ₂ + θ.

Then calculate test statistic t*. With R repetitions of this, the power of the test at level p is the proportion of times that t* > t_p, where t_p is the critical value of the Wilcoxon test for specified N₁ and N₂.

Figure 4.14 Power comparison for maize height data (Hand et al., 1994, p. 130). Left panel: EDFs of plant height for two groups. Right panel: EDFs for group 1 (unadjusted) and group 2 (adjusted by estimated median-shift θ̂ = −4.5). Horizontal axes show the data values.

In this particular case, the simulations show that the Wilcoxon test at level p = 0.01 has power 0.26 for θ = 8 and the observed sample sizes. Additional calculations show that both sample sizes need to be increased from 12 to at least 33 to have power 0.8 for θ = 9. ■
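A sketch of the power simulation for this example is given below; y1 and y2 would hold the two samples of maize heights (not listed in the text), and the use of the Wilcoxon test's P-value in place of tabulated critical values is an assumption made for compactness.

power.wilcox <- function(y1, y2, theta, N1, N2, R = 999, p = 0.01)
{ theta.hat <- median(y2) - median(y1)
  pool <- c(y1, y2 - theta.hat)                   # pooled estimate of baseline F
  reject <- logical(R)
  for (r in 1:R) {
    y1s <- sample(pool, N1, replace = T)          # group 1 from F-hat(y)
    y2s <- sample(pool, N2, replace = T) + theta  # group 2 from F-hat(y - theta)
    reject[r] <- wilcox.test(y2s, y1s, alternative = "greater")$p.value <= p
  }
  mean(reject) }                                  # estimated power at level p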

If the proposed test uses the pivot method of Section 4.4.1, then calculations of sample size can be done more simply. For example, for a scalar θ consider a two-sided test of H₀: θ = θ₀ with level 2α based on the pivot Z. The power function can be written

$$\pi(2\alpha, \theta) = 1 - \Pr\left\{ z_{\alpha,N} + \frac{\theta_0 - \theta}{v_N^{1/2}} \le Z_N \le z_{1-\alpha,N} + \frac{\theta_0 - \theta}{v_N^{1/2}} \right\}, \qquad (4.39)$$

where the subscript N indicates sample size. A rough approximation to this power function can be obtained as follows. First simulate R samples of size N from F̂, and use these to approximate the quantiles z_{α,N} and z_{1−α,N}. Next set v_N^{1/2} = n^{1/2} v_n^{1/2} / N^{1/2}, where v_n is the variance estimate calculated from the pilot data. Finally, approximate the probability (4.39) using the same R bootstrap samples.

Sequential tests

Similar sorts of calculations can be done for sequential tests, where one important criterion is terminal sample size. In this context simulation can also be used to assess the likely eventual sample size, given data y₁, ..., yₙ at an interim stage of a test, with a specified protocol for termination. This can be done by simulating data continuations y*ₙ₊₁, y*ₙ₊₂, ... up to termination, by sampling from fitted models or EDFs, as appropriate. From repetitions of this simulation one obtains an approximate distribution for terminal sample size N.
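As a rough sketch of this idea, the following hypothetical function resamples continuations from the EDF of the interim data until a user-supplied stopping rule (an unspecified placeholder here) is met, or a cap is reached.

terminal.n <- function(y, stop.rule, R = 999, nmax = 1000)
{ Ns <- numeric(R)
  for (r in 1:R) {
    ys <- y                                  # start from the interim data
    while (!stop.rule(ys) && length(ys) < nmax)
      ys <- c(ys, sample(y, 1, replace = T)) # extend by resampling from the EDF
    Ns[r] <- length(ys) }
  Ns }                                       # approximate distribution of N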


4.7 Bibliographic Notes

The standard theory of significance tests is described in Chapters 3-5 and 9 of Cox and Hinkley (1974). For detailed treatment of the mathematical theory see Lehmann (1986). In recent years much work has been done on obtaining improved distributional approximations for likelihood-based statistics, and most of this is covered by Barndorff-Nielsen and Cox (1994).

Randomization and permutation tests have long histories. R. A. Fisher (1935) introduced randomization tests as a device for explaining and justifying significance tests, both in simple cases and for complicated experimental designs: the randomization used in selecting a design can be used as the basis for inference, without appeal to specific error models. For a recent account see Manly (1991). A general discussion of how to apply randomization in complex problems is given by Welch (1990).

Permutation tests, which are superficially similar to randomization tests, are specifically nonparametric tests designed to condition out the unknown sampling distribution. The theory was developed by Pitman (1937a,b,c), and is summarized by Lehmann (1986). More recently Romano (1989, 1990) has examined properties of permutation tests and their relation to bootstrap tests for a variety of problems.

Monte Carlo tests were first suggested by Barnard (1963) and are particularly popular in spatial statistics, as described by Ripley (1977, 1981, 1987) and Besag and Diggle (1977). Graphical tests for regression diagnostics are described by Atkinson (1985), and Ripley (1981) applies them to model-checking in spatial statistics. Markov chain Monte Carlo methods for conditional tests were introduced by Besag and Clifford (1989); applications to contingency table analysis are given by Forster, McDonald and Smith (1996) and Smith, Forster and McDonald (1996), who give additional references. Gilks et al. (1996) is a good general reference on Markov chain Monte Carlo methods, including design of simulation.

The effect of simulation size R on power for Monte Carlo tests (with independent simulations) has been considered by Marriott (1979), Jöckel (1986) and by Hall and Titterington (1989); the discussion in Section 4.2.5 follows Jöckel. Sequential calculation of P-values is described by Besag and Clifford (1991) and Jennison (1992).

The use of tilted EDFs was introduced by Efron (1981b), and has subsequently had a strong impact on confidence interval methods; see Chapters 5 and 10.

Double bootstrap adjustment of P-values is discussed by Beran (1988), Loh (1987), Hinkley and Shi (1989), and Hall and Martin (1988). Applications are described by Newton and Geyer (1994). Geyer (1995) discusses tests for inequality-constrained hypotheses, which sheds light on possible inconsistency


of bootstrap tests and suggests remedies. For references to discussions of improved simulation methods, see Chapter 9.

A variety of methods and applications for resampling in multiple testing are covered in the books by Noreen (1989) and Westfall and Young (1993).

Various aspects of resampling in the choice of test are covered in papers by Collings and Hamilton (1988), Hamilton and Collings (1991), and Samawi (1994). A general theoretical treatment of power estimation is given by Beran (1986). The brief discussion of adaptive tests in Section 4.4.2 is based on Donegani (1991), who refers to previous work on the topic.

4.8 Problems

1 For the dispersion test of Example 4.2, y₁, ..., yₙ are hypothetically sampled from a Poisson distribution. In the Monte Carlo test we simulate samples from the conditional distribution of Y₁, ..., Yₙ given ΣYⱼ = s, with s = Σyⱼ. If the exact multinomial simulation were not available, a Markov chain method could be used. Construct a Markov chain Monte Carlo algorithm based on one-step transitions from (u₁, ..., uₙ) to (v₁, ..., vₙ) which involve only adding and subtracting 1 from two randomly selected u's. (Note that zero counts must not be reduced.) Such an algorithm might be slow. Suggest a faster alternative.
(Section 4.2)

2 Suppose that X₁, ..., Xₙ are continuous and have the same marginal CDF F, although they are not independent. Let I be a random integer between 1 and n. Show that rank(X_I) has a uniform distribution on {1, 2, ..., n}.
Explain how to apply this result to obtain an exact Monte Carlo test using one realization of a suitable Markov chain.
(Section 4.2.2; Besag and Clifford, 1989)

3 Suppose that we have an m × m contingency table with entries y_{ij} which are counts.
(a) Consider the null hypothesis of row-column independence. Show that the sufficient statistic S₀ under this hypothesis is the set of row and column marginal totals. To assess the significance of the likelihood ratio test statistic conditional on these totals, a Markov chain Monte Carlo simulation is used. Develop a Metropolis-type algorithm using one-step transitions which modify the contents of a randomly selected tetrad y_{ik}, y_{il}, y_{jk}, y_{jl}, where i ≠ j, k ≠ l.
(b) Now consider the null hypothesis of quasi-symmetry, which implies that in the loglinear model for mean cell counts, log E(Y_{ij}) = μ + α_i + β_j + γ_{ij}, the interaction parameters satisfy γ_{ij} = γ_{ji} for all i, j. Show that the sufficient statistic S₀ under this hypothesis is the set of totals y_{ij} + y_{ji}, i ≠ j, together with the row and column totals and the diagonal entries. Again a conditional test is to be applied. Develop a Metropolis-type algorithm for Markov chain Monte Carlo simulation using one-step transitions which involve pairs of symmetrically placed tetrads.
(Section 4.2.2; Smith et al., 1996)

4 Suppose that a one-sided bootstrap test at level α is to be applied with R simulated samples. Then the null hypothesis will be rejected if and only if the number of t*s exceeding t is less than k = (R + 1)α − 1. If k_r is the number of t*s exceeding t in the first r simulations, for what values of k_r would it be unnecessary to continue simulation?
(Section 4.2.5; Jennison, 1992)


5 (a) Consider the following rule for choosing the number of simulations in a Monte Carlo test. Choose k, and generate simulations t₁*, t₂*, ..., t_l* until the first l for which k of the t* exceed the observed value t; then declare P-value p = (k + 1)/(l + 1). Let the random variables corresponding to l and p be L and P. Show that

$$\Pr\{P \le (k+1)/(l+1)\} = \Pr(L > l - 1) = k/l, \qquad l = k, k+1, \ldots,$$

and deduce that L has infinite mean. Show that P has the distribution of a U(0,1) random variable rounded to the nearest achievable significance level 1, k/(k+1), k/(k+2), ..., and deduce that the test is exact.
(b) Consider instead stopping immediately if k of the t* exceed t at any l < R, and anyway stopping when l = R, at which point m values exceed t. Show that this rule gives achievable significance levels

$$p = \begin{cases} (k+1)/(l+1), & m = k, \\ (m+1)/(R+1), & m < k. \end{cases}$$

Show that under this rule the null expected value of L is

$$E(L) = \sum_{l=1}^{R} \Pr(L \ge l) = k + k \sum_{l=k+1}^{R} l^{-1},$$

and evaluate this with k = 49 and 9 for R = 999.
(Section 4.2.5; Besag and Clifford, 1991)

6 Suppose that n subjects are allocated randomly to each of two treatments, A and B. In fact each subject falls in one of two relevant groups, such as gender, and the treatment allocation frequencies differ between groups. The response y_{ij} for the jth subject in the ith group is modelled as y_{ij} = γ_i + τ_{k(i,j)} + ε_{ij}, where τ_A and τ_B are treatment effects and k(i, j) is A or B according to which treatment was allocated to the subject. Our interest is in testing H₀: τ_A = τ_B with alternative that τ_A < τ_B, and the test statistic chosen is

$$T = \sum_{i,j:\, k(i,j)=B} R_{ij} - \sum_{i,j:\, k(i,j)=A} R_{ij},$$

where R_{ij} is the residual from regression of the ys on the group indicators.
(a) Describe how to calculate a permutation P-value for the observed value t using the method described above Example 4.12.
(b) A different calculation of the P-value is possible which conditions on the observed covariates, i.e. on the treatment allocation frequencies in the two groups. The idea is to first eliminate the group effects by reducing the data to differences d_{ij} = y_{ij} − y_{i,j+1}, and then to note that the joint probability of these differences under H₀ is constant under permutations of data within groups. That is, the minimal sufficient statistic S₀ under H₀ is the set of differences Y_{i(j)} − Y_{i(j+1)}, where Y_{i(1)} ≤ Y_{i(2)} ≤ ··· are the ordered values within the ith group. Show carefully how to calculate the P-value for t conditional on s₀.
(c) Apply the unconditional and conditional permutation tests to the following data:

Group 1   Group 2
A  3 5 4 4 1
B  0 2 1

(Sections 4.3, 6.3.2; Welch and Fahey, 1994)


7 A randomized matched-pair experiment to compare two treatments produces paired responses from which the paired differences dⱼ = y₁ⱼ − y₂ⱼ are calculated for j = 1, ..., n. The null hypothesis H₀ of no treatment difference implies that the dⱼs are sampled from a distribution that is symmetric with mean zero, whereas the alternative hypothesis implies a positive mean difference. For any test statistic t, such as d̄, the exact randomization P-value Pr(T* ≥ t | H₀) is calculated under the null resampling model

$$d_j^* = S_j d_j, \qquad j = 1, \ldots, n,$$

where the Sⱼ are independent and equally likely to be +1 and −1. What would be the corresponding nonparametric bootstrap sampling model F̂₀? Would the resulting bootstrap P-value differ much from the randomization P-value?
See Practical 4.4 to apply the randomization and bootstrap tests to the following data, which are differences of measurements in eighths of an inch on cross- and self-fertilized plants grown in the same pot (taken from R. A. Fisher's famous discussion of Darwin's experiment).

49  −67  8  16  6  23  28  41  14  29  56  24  75  60  −48

(Sections 4.3, 4.4; Fisher, 1935, Table 3)

8 For the two-sample problem of Example 4.16, consider fitting the null model by maximum likelihood. Show that the solution probabilities are given by

$$p_{1j,0} = \frac{1}{n_1(\alpha + \lambda y_{1j})}, \qquad p_{2j,0} = \frac{1}{n_2(\beta - \lambda y_{2j})},$$

where α, β and λ are the solutions to the equations Σ p_{1j,0} = 1, Σ p_{2j,0} = 1, and Σ y_{1j} p_{1j,0} = Σ y_{2j} p_{2j,0}. Under what conditions does this solution not exist, or give negative probabilities? Compare this null model with the one used in Example 4.16.

9 For the ratio-testing problem of Example 4.15, obtain the nonparametric MLE of the joint distribution of (U, X). That is, if pⱼ is the probability attached to the data pair (uⱼ, xⱼ), maximize Π pⱼ subject to Σ pⱼ(xⱼ − θ₀uⱼ) = 0. Verify that the resulting distribution is the EDF of (U, X) when θ₀ = x̄/ū. Hence develop a numerical algorithm for calculating the pⱼs for general θ₀.
Now choose probabilities p₁, ..., pₙ to minimize the distance

$$d(p, q) = \sum p_j \log p_j - \sum p_j \log q_j,$$

with q = (1/n, ..., 1/n), subject to Σ(xⱼ − θ₀uⱼ)pⱼ = 0. Show that the solution is the exponential tilted EDF

$$p_j \propto \exp\{\eta(x_j - \theta_0 u_j)\}.$$

Verify that for small values of θ₀ − x̄/ū these pⱼs are approximately the same as those obtained by the MLE method.
(Section 4.4; Efron, 1981b)

10 Suppose that we wish to test the reduced-rank model H₀: g(θ) = 0, where g(·) is a p₁-dimensional reduction of p-dimensional θ. For the studentized pivot method we take Q = {g(T) − g(θ)}ᵀ V_g⁻¹ {g(T) − g(θ)}, with data test value q₀ = g(t)ᵀ v_g⁻¹ g(t), where v_g estimates var{g(T)}. Use the nonparametric delta method to show that var{g(T)} ≈ ġ(t) v_L ġ(t)ᵀ, where ġ(θ) = ∂g(θ)/∂θᵀ.
Show how the method can be applied to test equality of p means given p independent samples, assuming equal population variances.
(Section 4.4.1)


11 In a parametric situation, suppose that an exact test is available with test statistic U, that S is sufficient under the null hypothesis, but that a parametric bootstrap test is carried out using T rather than U. Will the adjusted P-value p_adj always produce the exact test?
(Section 4.5)

12 In calculating the mean squared error for the simulation approximation to the adjusted P-value, it might be more reasonable to assume that the P-values u_r follow a Beta distribution with parameters a and b which are close to, but not equal to, one. Show that in this case

$$E(X_r^k) = \sum_{j=0}^{[(M+1)p]} \frac{\Gamma(M+1)\Gamma(a+j)\Gamma(b+M-j)\Gamma(a+b)}{\Gamma(j+1)\Gamma(M-j+1)\Gamma(a+b+M)\Gamma(a)\Gamma(b)},$$

where X_r = I{Binom(M, u_r) ≤ (M + 1)p}. Use this result to investigate numerically the choice of M.
(Section 4.5)
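For the numerical investigation, the expression above can be evaluated directly; the following sketch uses log-gamma functions for numerical stability, and the parameter values shown are illustrative.

EX <- function(p, M, a, b)
{ j <- 0:floor((M + 1)*p)
  sum(exp(lgamma(M + 1) + lgamma(a + j) + lgamma(b + M - j) + lgamma(a + b) -
          lgamma(j + 1) - lgamma(M - j + 1) - lgamma(a + b + M) -
          lgamma(a) - lgamma(b))) }
EX(0.05, 99, a = 1, b = 1)      # equals ([(M+1)p]+1)/(M+1) = 0.06 when a = b = 1
EX(0.05, 99, a = 1.2, b = 0.9)  # changes when the P-values are non-uniform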

13 For the matched-pair experiment of Problem 4.7, suppose that we choose between the two test statistics t₁ = d̄ and t₂ = (n − 2m)⁻¹ Σ_{j=m+1}^{n−m} d₍ⱼ₎, for some m in the range 2, ..., [½n], on the basis of their estimated variances v₁ and v₂, where

$$v_1 = \frac{\sum (d_j - t_1)^2}{n^2},$$

$$v_2 = \frac{\sum_{j=m+1}^{n-m}(d_{(j)} - t_2)^2 + m(d_{(m+1)} - t_2)^2 + m(d_{(n-m)} - t_2)^2}{n(n - 2m)}.$$

Give a detailed description of the adaptive test as outlined in Section 4.4.2. To apply it to the data of Problem 4.7 with m = 2, see Practical 4.4.
(Section 4.4.2; Donegani, 1991)

14 Suppose that we want critical values for a size α one-sided test of H₀: θ = θ₀ versus H_A: θ > θ₀. The ideal value is the 1 − α quantile t_{0,1−α} of the distribution of T under H₀, and this is estimated by the solution t̂_{0,1−α} to Pr*(T* ≥ t̂_{0,1−α} | F̂₀) = α. Typically t̂_{0,1−α} is biased. Consider an adjusted critical value t̂_{0,1−α−γ}. Obtain the double bootstrap algorithm for choosing γ, and compare the resulting test to use of the adjusted P-value (4.34).
(Sections 4.5, 3.9.1; Beran, 1988)

4.9 Practicals

1 The data in dataframe dogs are from a pharmacological experiment. The two variables are cardiac oxygen consumption (MVO) and left ventricular pressure (LVP). Data for n = 7 dogs are

MVO  78  92  116  90  106  78  99
LVP  32  33   45  30   38  24  44

Apply a bootstrap test for the hypothesis of zero correlation between MVO and LVP. Use R = 499 simulations.
(Sections 4.3, 4.4)
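No code is given in the text for this practical; one possible starting point, in the style of the later practicals, is the permutation version below, where the construction of the dataframe and the column names mvo and lvp are assumptions.

library(boot)
dogs <- data.frame(mvo = c(78, 92, 116, 90, 106, 78, 99),
                   lvp = c(32, 33, 45, 30, 38, 24, 44))
dogs.fun <- function(data, i) cor(data$mvo, data$lvp[i])  # permute one column
dogs.perm <- boot(dogs, dogs.fun, R = 499, sim = "permutation")
(1 + sum(dogs.perm$t >= dogs.perm$t0))/(1 + dogs.perm$R)

The bootstrap analogue would instead resample the two columns independently with replacement, as in Section 4.4.2.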

2 For the permutation test outlined in Example 4.12,


aml.fun <- function(data, i)
{ d <- data[i, ]
  temp <- survdiff(Surv(d$time, d$cens) ~ data$group)
  s <- sign(temp$obs[2] - temp$exp[2])
  s*sqrt(temp$chisq) }
aml.perm <- boot(aml, aml.fun, R = 499, sim = "permutation")
(1 + sum(aml.perm$t0 < aml.perm$t))/(1 + aml.perm$R)
o <- rank(aml.perm$t)
less <- (1:aml.perm$R)[aml.perm$t < aml.perm$t0]
o <- o/(1 + aml.perm$R)
qqnorm(aml.perm$t, ylab = "Log-rank statistic", type = "n")
points(qnorm(o[less]), aml.perm$t[less])
points(qnorm(o[-less]), aml.perm$t[-less], pch = 1)
abline(0, 1, lty = 2); abline(h = aml.perm$t0, lty = 3)

Compare this with the corresponding bootstrap test.
(Section 4.3)

3 For a graphical test of suitability of the exponential model for the data in Table 1.2, we generate data from the exponential distribution, and plot an envelope.

expqq.fun <- function(data, q) sort(data)/mean(data)
exp.gen <- function(data, mle) rexp(length(data), mle)
n <- nrow(aircondit)
qq <- qexp((1:n)/(n + 1))
exp.boot <- boot(aircondit$hours, expqq.fun, R = 999, sim = "parametric",
                 ran.gen = exp.gen, mle = 1/mean(aircondit$hours), q = qq)
env <- envelope(exp.boot$t)
plot(qq, exp.boot$t0, xlab = "Exponential quantiles",
     ylab = "Scaled order statistics", xlim = c(0, max(qq)),
     ylim = c(0, max(c(exp.boot$t0, env$overall[2,]))), pch = 1)
lines(qq, env$overall[1,]); lines(qq, env$overall[2,])
lines(qq, env$point[1,], lty = 2); lines(qq, env$point[2,], lty = 2)

Discuss the adequacy of the model. Check whether the gamma model is a better fit.
(Section 4.2.4)

4 To apply the permutation test outlined in Problem 4.7,

darwin.gen <- function(data, mle)
{ sign <- sample(c(-1, 1), mle, replace = T)
  data*sign }
darwin.rand <- boot(darwin$y, mean, R = 999, sim = "parametric",
                    ran.gen = darwin.gen, mle = nrow(darwin))
(1 + sum(darwin.rand$t > darwin.rand$t0))/(1 + darwin.rand$R)

Can you see how to modify darwin.gen to produce the bootstrap test?
To implement the adaptive test described in Problem 4.13, with m = 2:

darwin.f <- function(d)
{ n <- length(d); m <- 2
  t1 <- mean(d)
  v1 <- sum((d - t1)^2)/n^2
  d <- sort(d)[(m + 1):(n - m)]
  t2 <- mean(d)
  v2 <- (sum((d - t2)^2) + m*(min(d) - t2)^2 + m*(max(d) - t2)^2)/(n*(n - 2*m))
  c(t1, v1, t2, v2) }
darwin.ad <- boot(darwin$y, darwin.f, R = 999, sim = "parametric",
                  ran.gen = darwin.gen, mle = nrow(darwin))
darwin.ad$t0
i <- c(1:999)[darwin.ad$t[,2] > darwin.ad$t[,4]]
(1 + sum(darwin.ad$t[i, 3] > darwin.ad$t0[3]))/(1 + length(i))

Is a different result obtained with the adaptive version of the bootstrap test?
(Sections 4.3, 4.4)

5 Dataframe paulsen contains data collected as part of an investigation into the quantal nature of neurotransmission in the brain, by Dr O. Paulsen of the Department of Pharmacology, University of Oxford, in collaboration with Professor P. Heggelund of the Department of Neurophysiology, University of Oslo. Two models have been proposed to explain such data. The first model suggests that the data are drawn from an underlying skewed unimodal distribution. The alternative model suggests that the data are drawn from a series of distributions with modes equal to integer multiples of a unit size. To distinguish between the two models, a bootstrap test of multimodality may be carried out, with the null hypothesis that the underlying distribution is unimodal.
To plot the data and a kernel density estimate with a Gaussian kernel and bandwidth h = 1.5, and to count its local maxima:

h <- 1.5
hist(paulsen$y, probability = T, breaks = c(0:30))
lines(density(paulsen$y, width = 4*h, from = 0, to = 30))
peak.test <- function(y, h)
{ dens <- density(y, width = 4*h, n = 100)
  sum(peaks(dens$y[(dens$x >= 0) & (dens$x <= 20)])) }
peak.test(paulsen$y, h)

Check that h = 1.87 is the smallest value giving just one peak.
For bootstrap analysis,

peak.gen <- function(d, mle)
{ n <- mle[1]; h <- mle[2]
  i <- sample(n, n, replace = T)
  d[i] + h*rnorm(n) }
paulsen.boot <- boot(paulsen$y, peak.test, R = 999, sim = "parametric",
                     ran.gen = peak.gen, mle = c(nrow(paulsen), 1.87), h = 1.87)

What is the significance level?
To repeat with a shrunk smoothed density estimate:

shrunk.gen <- function(d, mle)
{ n <- mle[1]; h <- mle[2]
  v <- var(d)
  (d[sample(n, n, replace = T)] + h*rnorm(n))/sqrt(1 + h^2/v) }
paulsen.boot <- boot(paulsen$y, peak.test, R = 999, sim = "parametric",
                     ran.gen = shrunk.gen, mle = c(nrow(paulsen), 1.87), h = 1.87)

Bootstrap to obtain the P-value. Discuss your results.
(Section 4.4; Paulsen and Heggelund, 1994; Silverman, 1981)


6 For the cd4 data of Practicals 2.3 and 3.6, test the hypothesis that the distribution of CD4 counts after one year is the same as the baseline distribution. Test also whether the treatment affects the counts for each individual. Discuss your conclusions.
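Again no code is given in the text; one possible sketch for the paired comparison, mirroring Practical 4.4 and assuming cd4 has columns baseline and oneyear as in the earlier practicals, is:

library(boot)
d <- cd4$oneyear - cd4$baseline             # paired differences
cd4.gen <- function(data, mle) data*sample(c(-1, 1), mle, replace = T)
cd4.rand <- boot(d, mean, R = 999, sim = "parametric",
                 ran.gen = cd4.gen, mle = length(d))
(1 + sum(cd4.rand$t >= cd4.rand$t0))/(1 + cd4.rand$R)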


5

Confidence Intervals

5.1 Introduction

The assessment of uncertainty about parameter values is made using confidence intervals or regions. Section 2.4 gave a brief introduction to the ways in which resampling can be applied to the calculation of confidence limits. In this chapter we undertake a more thorough discussion of such methods, including more sophisticated ideas that are potentially more accurate than those mentioned previously.

Confidence region methods all focus on the same target properties. The first is that a confidence region with specified coverage probability γ should be a set C_γ(y) of parameter values which depends only upon the data y and which satisfies

$$\Pr\{\theta \in C_\gamma(Y)\} = \gamma. \qquad\qquad (5.1)$$

Implicit in this definition is that the probability does not depend upon any nuisance parameters that might be in the model. The confidence coefficient, or coverage probability, γ, is the relative frequency with which the confidence region would include, or cover, the true parameter value θ in repetitions of the process that produced the data y. In principle the coverage probability should be conditional on the information content of y as measured by ancillary statistics, but this may be difficult in practice without a parametric model; see Section 5.9.

The second important property of a confidence region is its shape. The general principle is that any value in C_γ should be more likely than all values outside C_γ, where "likely" is measured by a likelihood or similar function. This is difficult to apply in nonparametric problems, where strictly a likelihood function is not available; see, however, Chapter 10. In practice the difficulty is


not serious for scalar θ, which is the major focus in this chapter, because in most applications the confidence region will be a single interval.

A confidence interval will be defined by limits θ̂_{α₁} and θ̂_{1−α₂}, such that for any α

$$\Pr(\theta < \hat\theta_\alpha) = \alpha.$$

The coverage of the interval [θ̂_{α₁}, θ̂_{1−α₂}] is γ = 1 − (α₁ + α₂), and α₁ and α₂ are respectively the left- and right-tail error probabilities. For some applications only one limit is required, either a lower confidence limit θ̂_α or an upper confidence limit θ̂_{1−α}, these both having coverage 1 − α. If a closed interval is required, then in principle we can choose α₁ and α₂, so long as they sum to the overall error probability 2α. The simplest way to do this, which we adopt for general discussion, is to set α₁ = α₂ = α. Then the interval is equi-tailed with coverage probability 1 − 2α. In particular applications, however, one might well want to choose α₁ and α₂ to give approximately the shortest interval: this would be analogous to having the likelihood property mentioned earlier.

A single confidence region cannot give an adequate summary of the uncertainty about θ, so in practice one should give regions for three or four confidence levels between 0.50 and 0.99, say, together with the point estimate for θ. One benefit from this is that any asymmetry in the uncertainty about θ will be fairly clear.

So far we have assumed that a confidence region can be found to satisfy (5.1) exactly, but this is not possible except in a few special parametric models. The methods developed in this chapter are based on approximate probability calculations, and therefore involve a discrepancy between the nominal or target coverage, and the actual coverage probability.

In Section 5.2 we review briefly the standard approximate methods for parametric and nonparametric models, including the basic bootstrap methods already described in Section 2.4. More sophisticated methods, based on what is known as the percentile method, are the subject of Section 5.3. Section 5.4 compares the various methods from a theoretical viewpoint, using asymptotic expansions, and introduces the ABC method as an alternative to simulation methods. The use of significance tests to obtain confidence limits is outlined in Section 5.5. A nested bootstrap algorithm is introduced in Section 5.6. Empirical comparisons between methods are made in Section 5.7.

Confidence regions for vector parameters are described in Section 5.8. The possibility of conditional confidence regions is explored in Section 5.9 through discussion of two examples. Prediction intervals are discussed briefly in Section 5.10.

The discussion in this chapter is about how to use the results of bootstrap simulation algorithms to obtain confidence regions, irrespective of what the resampling algorithm is. The presentation supposes for the most part that we


are in the simple situation of Chapter 2, where we have a single, complete homogeneous sample. Most of the methods described can be applied to more complex data structures, provided that appropriate resampling algorithms are used, but for most sorts of highly dependent data the theoretical properties of the methods are largely unknown.

5.2 Basic Confidence Limit Methods

5.2.1 Parametric models

One general approach to calculating a confidence interval is to make it surround a good point estimate of the parameter, which for parametric models will often be taken to be the maximum likelihood estimator. We begin by discussing various simple ways in which this approach can be applied.

Suppose that T estimates a scalar θ and that we want an interval with left- and right-tail errors both equal to α. For simplicity we assume that T is continuous. If the quantiles of T − θ are denoted by a_p, then

$$\Pr(T - \theta \le a_\alpha) = \alpha = \Pr(T - \theta > a_{1-\alpha}). \qquad\qquad (5.2)$$

Rewriting the events T − θ ≤ a_α and T − θ > a_{1−α} as θ ≥ T − a_α and θ < T − a_{1−α} respectively, we see that the 1 − 2α equi-tailed interval has limits

$$\hat\theta_\alpha = t - a_{1-\alpha}, \qquad \hat\theta_{1-\alpha} = t - a_\alpha. \qquad\qquad (5.3)$$

This ideal solution rarely applies, because the distribution of T − θ is usually unknown. This leads us to consider various approximate methods, most of which are based on approximating the quantiles of T − θ.

Normal approximation

The simplest approach is to apply a N(0, v) approximation for T − θ. This gives the approximate confidence limits

$$\hat\theta_\alpha,\ \hat\theta_{1-\alpha} = t + v^{1/2} z_\alpha,\ t + v^{1/2} z_{1-\alpha}, \qquad\qquad (5.4)$$

where as usual z_{1−α} = Φ⁻¹(1 − α). If T is a maximum likelihood estimator, then the approximate variance v can be computed directly from the log likelihood function ℓ(θ). If there are no nuisance parameters, then we can use the reciprocal of either the observed Fisher information, $v = -1/\ddot\ell(\hat\theta)$, or the estimated expected Fisher information $v = 1/i(\hat\theta)$, where $i(\theta) = E\{-\ddot\ell(\theta)\} = \mathrm{var}\{\dot\ell(\theta)\}$; here $\dot\ell(\theta) = d\ell(\theta)/d\theta$ and $\ddot\ell(\theta) = d^2\ell(\theta)/d\theta\, d\theta^T$. The former is usually preferable. When there are nuisance parameters, we use the relevant element of the inverse of either $-\ddot\ell(\hat\theta)$ or $i(\hat\theta)$. More generally, if T is given by an estimating equation, then v can be calculated by the delta method; see Section 2.7.2. Equation (5.4) is the standard form for normal approximation confidence limits, although it is sometimes augmented by a bias correction which is based on the third derivative of the log likelihood function.


In problems where the variance approximation v is hard to obtain theoretically, or is thought to be unreliable, the parametric bootstrap of Section 2.2 can be used. This requires simulation from the fitted model with parameter value θ̂. If the resampling estimates of bias and variance are denoted by b_R and v_R, then (5.4) is replaced by

$$\hat\theta_\alpha,\ \hat\theta_{1-\alpha} = t - b_R + v_R^{1/2} z_\alpha,\ t - b_R + v_R^{1/2} z_{1-\alpha}. \qquad\qquad (5.5)$$

Whether or not a normal approximation method will work can be assessed by making a normal Q-Q plot of the simulated estimates t₁*, ..., t_R*, as illustrated in Section 2.2.2. If such a plot suggests that normal approximation is poor, then we can either try to improve the normal approximation in some way, or replace it completely. The basic resampling confidence interval methods of Section 2.4 do the latter, and we review them first.

Basic and studentized bootstrap methods

If we start again at the general confidence limit formula (5.3), we can estimate the quantiles a_α and a_{1−α} by the corresponding quantiles of T* − t. Assuming that these are approximated by simulation, as in Section 2.2.2, the argument given in Section 2.4 leads to the confidence limits

$$\hat\theta_\alpha = 2t - t^*_{((R+1)(1-\alpha))}, \qquad \hat\theta_{1-\alpha} = 2t - t^*_{((R+1)\alpha)}. \qquad\qquad (5.6)$$

These we refer to as the basic bootstrap confidence limits for θ.

A modification of this is to use the form of the normal approximation confidence limit, but to replace the N(0,1) approximation for Z = (T − θ)/V^{1/2} by a bootstrap approximation. Each simulated sample is used to calculate t*, the variance estimate v*, and hence the bootstrap version z* = (t* − t)/v*^{1/2} of Z. The R simulated values of z* are ordered, and the p quantile of Z is estimated by the (R + 1)p-th of these. Then the confidence limits (5.4) are replaced by

$$\hat\theta_\alpha = t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad \hat\theta_{1-\alpha} = t - v^{1/2} z^*_{((R+1)\alpha)}. \qquad\qquad (5.7)$$

These we refer to as studentized bootstrap confidence limits. They are also known as bootstrap-t limits, by analogy with the Student-t confidence limits for the mean of a normal distribution, to which they are equal under infinite simulation in that problem. In principle this method is superior to the previous basic method, for reasons outlined in Section 2.6.1 and discussed further in Section 5.4.

An empirical bias adjustment could be incorporated into the numerator of Z, but this is often difficult to calculate and is usually not worthwhile, because the effect is implicitly adjusted for in the bootstrap distribution.

For both (5.6) and (5.7) to apply exactly it is necessary that (R + 1)α be an integer. This can usually be arranged: with R = 999 we can handle most


conventional values of α. But if for some reason (R + 1)α is not an integer, then interpolation can be used. A simple method that works well for approximately normal estimators is linear interpolation on the normal quantile scale. For example, if we are trying to apply (5.6) and the integer part of (R + 1)α is k, then we define

$$t^*_{((R+1)\alpha)} = t^*_{(k)} + \frac{\Phi^{-1}(\alpha) - \Phi^{-1}\left(\frac{k}{R+1}\right)}{\Phi^{-1}\left(\frac{k+1}{R+1}\right) - \Phi^{-1}\left(\frac{k}{R+1}\right)}\left(t^*_{(k+1)} - t^*_{(k)}\right), \qquad k = [(R+1)\alpha]. \qquad (5.8)$$

The same interpolation can be applied to the z*s. Clearly such interpolations fail if k = 0, R or R + 1.

Parameter transformation

The normal approximation method may fail to work well because it is being applied on the wrong scale, in which case it should help to apply the approximation on an appropriately transformed scale. Skewness in the distribution of T is often associated with var(T) varying with θ. For this reason the accuracy of normal approximation is often improved by transforming the parameter scale to stabilize the variance of the estimator, especially if the transformed scale is the whole real line. The accuracy of the basic bootstrap confidence limits (5.6) will also tend to be improved by use of such a transformation.

Suppose that we make a monotone increasing transformation of the parameter scale from θ to η = h(θ), and then transform t correspondingly to u = h(t). Any confidence limit method can be applied for η, and untransforming the results will give confidence limits for θ. For example, consider applying the normal approximation limits (5.4) for η. By the delta method (Section 2.7.1) the variance approximation v for T transforms to

$$\mathrm{var}(U) = \mathrm{var}\{h(T)\} \doteq \{\dot h(t)\}^2 v = v_U,$$

say, where $\dot h(\theta) = dh(\theta)/d\theta$. Then the confidence limits for η are $h(t) + v_U^{1/2} z_\alpha$ and $h(t) + v_U^{1/2} z_{1-\alpha}$, which transform back to the limits

$$\hat\theta_\alpha,\ \hat\theta_{1-\alpha} = h^{-1}\{h(t) + v_U^{1/2} z_\alpha\},\ h^{-1}\{h(t) + v_U^{1/2} z_{1-\alpha}\}. \qquad\qquad (5.9)$$

Similarly the basic bootstrap confidence limits (5.6) become

$$\hat\theta_\alpha = h^{-1}\{2h(t) - h(t^*_{((R+1)(1-\alpha))})\}, \qquad \hat\theta_{1-\alpha} = h^{-1}\{2h(t) - h(t^*_{((R+1)\alpha)})\}. \qquad (5.10)$$

Whether or not the normal approximation is improved by transformation can be judged from a normal Q-Q plot of simulated h(t*) values.

How do we determine an appropriate transformation h(·)? If var(T) is exactly or approximately equal to the known function v(θ), then the variance-stabilizing transformation is defined by (2.14); see Problem 5.2 for an example. If no theoretical approximation exists for var(T), then we can apply the empirical method outlined in Section 3.9.2. A simpler empirical approach


which sometimes works is to make normal Q-Q plots of h(t*) for candidate transformations.

It is important to stress that the use of transformation can improve the basic bootstrap method considerably. Nevertheless it may still be beneficial to use the studentized method, after transformation. Indeed there is strong empirical evidence that the studentized method is improved by working on a scale with stable approximate variance. The studentized transformed estimator is

$$Z = \frac{h(T) - h(\theta)}{|\dot h(T)|\, V^{1/2}}.$$

Given R values of the bootstrap quantity $z^* = \{h(t^*) - h(t)\}/\{|\dot h(t^*)|\, v^{*1/2}\}$, the analogue of (5.10) is given by

$$\hat\theta_\alpha = h^{-1}\{h(t) - |\dot h(t)|\, v^{1/2} z^*_{((R+1)(1-\alpha))}\}, \quad \hat\theta_{1-\alpha} = h^{-1}\{h(t) - |\dot h(t)|\, v^{1/2} z^*_{((R+1)\alpha)}\}. \qquad (5.11)$$

Note that if h(·) is given by (2.14) with no constant multiplier and V = v(T), then the denominator of z* and the multiplier $|\dot h(t)|\, v^{1/2}$ in (5.11) are both unity.

Likelihood ratio methods

When likelihood estimation is used, in principle the normal approximation confidence limits (5.4) are inferior to likelihood ratio limits. Suppose that the scalar θ is the only unknown parameter in the model, and define the log likelihood ratio statistic

$$W(\theta) = 2\{\ell(\hat\theta) - \ell(\theta)\}.$$

Quite generally the distribution of W(θ) is approximately chi-squared, with one degree of freedom since θ is a scalar. So a 1 − 2α approximate confidence region is

$$C_{1-2\alpha} = \{\theta : w(\theta) \le c_{1,1-2\alpha}\}, \qquad\qquad (5.12)$$

where c_{1,p} is the p quantile of the χ²₁ distribution. This confidence region need not be a single interval, although usually it will be, and the left- and right-tail errors need not be even approximately equal. Separate lower and upper confidence limits can be defined using

$$z(\theta) = \mathrm{sgn}(\hat\theta - \theta)\sqrt{w(\theta)},$$

which is approximately N(0,1); here sgn(u) = u/|u| is the sign function. The resulting confidence limits are defined implicitly by

$$z(\hat\theta_\alpha) = z_\alpha, \qquad z(\hat\theta_{1-\alpha}) = z_{1-\alpha}. \qquad\qquad (5.13)$$

When the model includes other unknown parameters λ, also estimated by maximum likelihood, w(θ) is calculated by replacing ℓ(θ) with the profile log likelihood ℓ_prof(θ) = sup_λ ℓ(θ, λ).

These methods are invariant with respect to use of parameter transformation.



In most applications the accuracy will be very good, provided the model is correct, but it may nevertheless be sensible to consider replacing the theoretical quantiles by bootstrap approximations. Whether or not this is worthwhile can be judged from a chi-squared Q-Q plot of simulated values of

$$w^*(\hat\theta) = 2\{\ell^*(\hat\theta^*) - \ell^*(\hat\theta)\},$$

where ℓ* is the log likelihood for a set of data simulated using θ̂, for which the MLE is θ̂*, or from a normal Q-Q plot of the corresponding values of z*(θ̂).

Example 5.1 (Air-conditioning data) The data of Example 1.1 were used to illustrate various features of parametric resampling in Chapter 2. Here we look at confidence limit calculations for the underlying mean failure time μ under the exponential model for these data. The example is convenient in that there is an exact solution against which to compare the various approximations.

For the normal approximation method we use an estimate of the exact variance of the estimator T = Ȳ, v = n⁻¹ȳ². The observed value of ȳ is 108.083 and n = 12, so v = (31.20)². Then the 95% confidence interval limits given by (5.4) with α = 0.025 are

108.083 ± 31.20 × 1.96 = 46.9 and 169.2.

These contrast sharply with the exact limits 65.9 and 209.2.

Transformation to the variance-stabilizing logarithmic scale does improve the normal approximation. Application of (2.14) with v(μ) = n⁻¹μ² gives h(t) = log(t), if we drop the multiplier n^{1/2}, and the approximate variance transforms to n⁻¹. The 95% confidence interval limits given by (5.9) are

exp{log(108.083) ± 12^{−1/2} × 1.96} = 61.4 and 190.3.

While a considerable improvement, the results are still not very close to the exact solution. A partial explanation for this is that there is a bias in log(T) and the variance approximation is no longer equal to the exact variance. Use of bootstrap estimates for the bias and variance of log(T), with R = 999, gives limits 58.1 and 228.8.

For the basic bootstrap confidence limits we use R = 999 simulations under the fitted exponential model, samples of size n = 12 being generated from the exponential distribution with mean 108.083; see Example 2.6. The relevant ordered values of ȳ* are the (999 + 1) × 0.025th and (999 + 1) × 0.975th, i.e. the 25th and 975th, which in our simulation were 53.3 and 176.4. The 95% confidence limits obtained from (5.6) are therefore

2 × 108.083 − 176.4 = 39.8, 2 × 108.083 − 53.3 = 162.9.

These are no better than the normal approximation limits. However, application of the same method on the logarithmic scale gives much better results:


using the same ordered values of ȳ* in (5.10) we obtain the limits

exp{2 log(108.083) − log(176.4)} = 66.2, exp{2 log(108.083) − log(53.4)} = 218.8.

In fact these are simulation approximations to the exact limits, which are based on the exact gamma distribution of Ȳ/μ. The same results are obtained using the studentized bootstrap limits (5.7) in this case, because z = n^{1/2}(ȳ − μ)/ȳ is a monotone function of log(ȳ) − log(μ) = log(ȳ/μ). Equation (5.11) also gives these results.

Note that if we had used R = 99, then the bootstrap confidence limits would have required interpolation, because (99 + 1) × 0.025 = 2.5, which is not an integer. The application of (5.8) would be

$$t^*_{(2.5)} = t^*_{(2)} + \frac{\Phi^{-1}(0.025) - \Phi^{-1}(0.020)}{\Phi^{-1}(0.030) - \Phi^{-1}(0.020)}\left(t^*_{(3)} - t^*_{(2)}\right).$$

This involves quite extreme ordered values and so is somewhat unstable.

The likelihood ratio method gives good results here, even using the chi-squared approximation.

Broadly similar comparisons among the methods apply under the more complicated gamma model for these data. As the comparisons made in Example 2.9 would predict, results for the gamma model are similar to those for nonparametric resampling, which are discussed in the next example. ■

Normal approximationThe simplest m ethod is again to use a norm al approxim ation, now with a nonparam etric estimate o f variance such as that provided by the nonparam etric delta m ethod described in Section 2.7.2. If lj represents the empirical influence value for the ;'th case yj, then the approxim ate variance is vL = n~2 J2 lj, so the nonparam etric analogue o f (5.4) for the limits o f a 1 — 2a confidence interval for 6 is

Section 2.7 outlines various ways o f calculating or approxim ating the influence values.

If a small nonparam etric bootstrap has been run to produce bias and

*(2.5) - f (2) +(D- 1 (0.025)- 0 ) - 1 (0.020)(D- 1 (0.030) - 0*-1 (0.020) ( (3) {2])'

* - r 1/2 t + V £ Z 1—a. (5.14)

Page 211: Bootstrap Methods and Their Application

5.2 • Basic Confidence Limit Methods 199

variance estim ates bR and vR, as described in Section 2.3, then the corresponding approxim ate 1 — 2a confidence interval is

t - bR + 4 /2zi-a- (5.15)

In general we should expect this to be m ore accurate, provided R was large enough.

Basic and studentized bootstrap methods

For the basic bootstrap m ethod, the only change from the param etric case is tha t the sim ulation model is the E D F P. Otherwise equation (5.6) still applies. W hether or no t the bootstrap m ethod is likely to give im provement over the norm al approxim ation m ethod can again be judged from a norm al Q-Q plot o f the t ’ values. Sim ulated resample values do give estim ates of bias and variance which provide the more accurate norm al approxim ation limits (5.15).

The studentized bootstrap m ethod with confidence limits (5.7) likewise ap­plies here. If the nonparam etric delta m ethod variance estim ate vl is used for v, then those confidence limits become

_ . 1/2 * /J _ , 1/2 * / r 1vx — t — VL t 'l—a ~ ~ VL z((R+l)a)> (D.10J

where now z* = (t* — t ) / v ' ^ 2. N ote that the influence values m ust be recom ­puted for each bootstrap sample, because in expanded notation lj = l(yj;P) depends upon the E D F o f the sample. Therefore

» i = « - 2 £ / v ; ; n7=1

where P ' is the E D F o f the bootstrap sample. A simple approxim ation to v] can be m ade by substituting the approxim ation

n

Ky' j ; F " ) = K y ) ; f ) - n ~ l £ i ( y 'k ; P) ,k=l

bu t this is unreliable unless t is approxim ately linear; see Section 2.7.5 and Problem 2.20.

As in the param etric case, one m ight consider m aking a bias adjustm ent in the num erator o f z, for example based on the empirical second derivatives o f t. However, this rarely seems effective, and in any event an approxim ate adjustm ent is implicitly m ade in the bootstrap distribution o f Z*.

Example 5.2 (Air-conditioning data, continued) For the data o f Example 1.1, confidence limits for the m ean were calculated under an exponential m odel in Example 5.1. Here we apply nonparam etric methods, sim ulated datasets being obtained by sampling with replacem ent from the data.

For the norm al approxim ation, we use the nonparam etric delta m ethod

Page 212: Bootstrap Methods and Their Application

200 5 ■ Confidence Intervals

estimate vL = n 2 E (> '; — whose da ta value is 1417.715 = (37.65)2. So the approxim ate 95% confidence interval is

108.083 ± 37.65 x 1.96 = 34.3 and 181.9.

This, as with m ost o f the numerical results here, is very similar to w hat is obtained under param etric analysis with the best-fitting gam m a m odel; see Example 2.9.

For the basic bootstrap m ethod with R = 999 simulated datasets, the 25th and 975th ordered values o f y * are 43.92 and 192.08, so the limits o f the 95% confidence interval are

2(108.083) - 192.08 = 24.1 and 2(108.083) - 43.92 = 172.3.

This is not obviously a poor result, unless com pared with results for the gam m a model (likelihood ratio limits 57 and 243), but the corresponding 99% interval has lower limit —27.3, which is clearly very bad! The studentized bootstrap fares better: the 25th and 975th ordered values o f z* are —5.21 and 1.66, so that application o f (5.7) gives 95% interval limits

108.083 - 37.65 x 1.66 = 45.7 and 108.083 - 37.65 x (-5 .2 1 ) = 304.2.

But are these last results adequate, and how can we tell? The first part o f this question we can answer both by com parison with the gam m a model results, and by applying m ethods on the logarithm ic scale, which we know is appropriate here. The basic bootstrap m ethod gives 95% limits 66.2 and 218.8 when the log scale is used. So it would appear that the studentized bootstrap m ethod limits are too wide here, but otherwise are adequate. If the studentized bootstrap m ethod is applied in conjunction with the logarithm ic transform ation, the limits become 50.5 and 346.9.

How would we know in practice tha t the logarithm ic transform ation o f T is appropriate, o ther than from experience with similar data? One way to answer this is to plot v’L versus t*, as a surrogate for a “variance-param eter” plot, as suggested in Section 3.9.2. For this particular dataset, the equivalent plot o f standard errors vL is shown in the left panel o f Figure 5.1 and strongly suggests that variance is approxim ately proportional to squared param eter, as it is under the param etric model. From this we would deduce, using (2.14), that the logarithm ic transform ation should approxim ately stabilize the variance. The right panel o f the figure, which gives the corresponding plot for log- transform ed estimates, shows that the transform ation is quite successful. ■

Parameter transformation

For suitably sm ooth statistics, the consistency o f the studentized bootstrap m ethod is essentially guaranteed by the consistency o f the variance estimate V. In principle the m ethod is m ore accurate than the basic bootstrap method,

Page 213: Bootstrap Methods and Their Application

5.2 ■ Basic Confidence Limit Methods 201

Figure 5.1Air-conditioning data: nonparametric delta method standard errors for t = y (left panel) and for log(t) (right panel) in R = 999 nonparametric bootstrap samples.

csj<"*_i>

oCO

oin

o

oco

oC\J

50 100 150 200 250

t*

CNJ

*>

log t*

as we shall see in Section 5.4. However, variance approxim ations such as vL can be somewhat unstable for small n, as in the previous example with n = 12. Experience suggests tha t the m ethod is m ost effective when 6 is essentially a location param eter, which is approxim ately induced by variance-stabilizing transform ation (2.14). However, this requires knowing the variance function v(9) = var(T | F), which is never available in the nonparam etric case.

A suitable transform ation may sometimes be suggested by analogy with a param etric problem, as in the previous example. Then equations (5.10) and(5.11) will apply w ithout change. Otherwise, a transform ation can be obtained empirically using the technique described in Section 3.9.2, using either nested bootstrap estimates v* or delta m ethod estimates v*L with which to estimate values o f the variance function v(6). Equation (5.10) will then apply with estim ated transform ation h( ) in place o f h( ). For the studentized bootstrap interval (5.11), if the transform ation is determ ined empirically by (3.40), then studentized values o f the transform ed estimates h(t'r) are

K = v{K) l/2{ k O - M O }/”! 1/ 2-

On the original scale the (1 — 2a) studentized interval has endpoints

h~l { h(t) - v1/ ' 2tS(0 _ 1/2Z('(i-o<)(«+i))} ’ {MO - ®l : ( 0 “ 1/2Z(*a(K+i))} -(5.17)

In general it is wise to use the studentized interval even after transform ation.

Example 5.3 (City population data) For the data o f Example 2.8, with ratio6 estim ated by t = x /u, we discussed empirical choice o f transform ation in Example 3.23. A pplication o f the empirical transform ation illustrated in

Page 214: Bootstrap Methods and Their Application

202 5 ■ Confidence Intervals

Figure 3.11 with the studentized bootstrap limits (5.17) leads to the 95% interval [1.23, 2.25], This is similar to the 95% interval based on the h(t*) — h(t), [1.27, 2.21], while the studentized bootstrap interval on the original scale is [1.12, 1.88]. The effect o f the transform ation is to make the interval more like those from the percentile m ethods described in the following section.

To com pare the studentized methods, we took 500 samples o f size 10 without replacem ent from the full city population data in Table 1.3. Then for each sample we calculated 90% studentized bootstrap intervals on the original scale, and on the transform ed scale with and w ithout using the transform ed standard error; this last interval is the basic bootstrap interval on the transform ed scale. The coverages were respectively 90.4, 88.2, and 86.4%, to be com pared to the ideal 90% . The first two are not significantly different, but the last is rather smaller, suggesting tha t it can be worthwhile to studentize on the transform ed scale, when this is possible. The draw back is that studentized intervals tha t use the transform ed scale tend to be longer than on the original scale, and their lengths are m ore variable. ■

5.2.3 Choice of RW hat has been said about sim ulation size in earlier chapters, especially in Section 4.2.5, applies here. In particular, if confidence levels 0.95 and 0.99 are to be used, then it is advisable to have R = 999 or more, if practically feasible. Problem 5.5 outlines some relevant theoretical calculations.

5.3 Percentile MethodsWe have seen in Section 5.2 that simple confidence limit m ethods can be made m ore accurate by working on a transform ed scale. In m any cases it is possible to use sim ulation results to get a reasonable idea o f w hat a sensible transform ation m ight be. A quite different approach is to find a m ethod which implicitly uses the existence o f a good transform ation, but does no t require that the transform ation be found. This is w hat the percentile m ethod and its m odifications try to do.

5.3.1 Basic percentile methodSuppose tha t there is some unknow n transform ation o f T, say U = h(T), which has a symmetric distribution. Im agine tha t we knew h and calculated a 1 — 2a confidence interval for cf) = h(6) by applying the basic bootstrap m ethod (5.6), except that we first use the symmetry to write ax = —fli_a in the basic equation (5.3) as it applies to U = h(T). This would m ean tha t in applying (5.3) we would take u - u{iR+m_a)) instead o f u{{R+m - u, and u - u{{R+l)a)

Page 215: Bootstrap Methods and Their Application

5.3 ■ Percentile Methods 203

instead o f — u, to estimate the a and 1 — a quantiles o f U. Thisswap would change the confidence interval limits (5.6) to

* *

U((K+l)a)> U((R +l)(l-a))>

whose transform ation back to the 9 scale is

f ((R+l)<x)’ f ((R+1)(1—ot))- ( 5 .1 8 )

R em arkably this 1 — 2a interval for 9 does not involve h at all, and so can be com puted w ithout knowing h. The interval (5.18) is known as the bootstrap percentile interval, and was initially recom m ended in place o f (5.6).

As with m ost bootstrap methods, the percentile m ethod applies for both param etric and nonparam etric bootstrap sampling. Perhaps surprisingly, the m ethod turns out not to work very well with the nonparam etric bootstrap even when a suitable transform ation h does exist. However, adjustm ents to the percentile m ethod described below are successful for m any statistics.

Example 5.4 (Air-conditioning data, continued) For the air-conditioning data discussed in Examples 5.1 and 5.2, the percentile m ethod gives 95% intervals [70.8, 148.4] under the exponential model and [43.9, 192.1] under the nonpara­metric model. N either is satisfactory, com pared to accurate intervals such as the basic bootstrap interval using logarithm ic transform ation. ■

5.3.2 Adjusted percentile methodFor the percentile m ethod to work well, it would be necessary that T be unbiased on the transform ed scale, so tha t the swap o f quantile estimates be correct. This does not usually happen. Also the m ethod carries the defect o f the basic bootstrap m ethod, that the shape o f the distribution of T changes as the sampling distribution changes from F to F, even after transform ation. In particular, the implied symmetrizing transform ation often will not be quite the same as the variance-stabilizing transform ation — this is the cause o f the poor perform ance of the percentile m ethod in Example 5.4. These difficulties need to be overcome if the percentile m ethod is to be made accurate.

Parametric case with no nuisance parametersWe assume to begin with tha t the data are described by a param etric model with just the single unknow n param eter 9, which is estim ated by the maximum likelihood estim ate t = 9. In order to develop the adjusted percentile m ethod we make the simplifying assum ption tha t for some unknow n transform ation h( ), unknow n bias correction factor w and unknow n skewness correction factor a, the transform ed estim ator U = h(T) for (j> = h(9) is normally distributed,

U ~ N (</> — wer(</>), <t2(0)) with a(<j)) = 1 + a<>. (5.19)

Page 216: Bootstrap Methods and Their Application

204 5 ■ Confidence Intervals

In fact this is an improved norm al approxim ation, after applying the (unknown) norm alizing transform ation which eliminates the leading term in a skewness approxim ation. The usual factor n-1 has been taken out o f the variance by scaling h(-) appropriately, so that both a and w will typically be o f order n~x/2. The use o f a and w is analogous to the use o f Bartlett correction factors in likelihood inference for param etric models.

The essence o f the m ethod is to calculate confidence limits for <j) and then transform these back to the 6 scale using the bootstrap distribution o f T. To begin with, suppose that a and w are known, and write

U = (j) + (I + acj))(Z - w ) ,

where Z has the N ( 0,1) distribution with a quantile z«. It follows that

log(l + aJJ) = log(l + a<j>) + log{ 1 + a(Z — w)},

which is m onotone increasing in e/>. Therefore substitution o f za for Z and u for U in this equation identifies the a confidence limit for cj), which is

1 i r \ w + z«</>a = U + f f ( u h ---------;-------------- r .1 — a(w + za)

Now the a confidence limit for d is dx = but h( ) is unknown. However,if we denote the distribution function o f T ' by G, then

G(0a) = Pr*(T* < e a \ t ) = ?x '{U' <4>a \u) = <6\ ° w

I w + z“= <D I w + ■1 - a(w + za)

which is known. Therefore the a confidence limit for 0 is

< 5 - 2 0 >

which expressed in terms o f sim ulation values is

0* = ?m + l w , a = 0 ) ( w + 1 _ ^ + ^ ^ ) ) . (5.21)

These limits are usually referred to as B C a confidence limits. N ote that they share the transform ation invariance property o f percentile confidence limits.

The use o f G overcomes lack o f knowledge o f the transform ation h. The values o f a and w are unknown, o f course, but they can be easily estimated. For w we can use the initial norm al approxim ation (5.19) for U to write

Pr*(T* < t | t) = Pr*([/* < u | u) = Pr([7 < (f> \ (j)) = <I>(w),

so that

w = 0>-1{G(0}. (5.22)

Page 217: Bootstrap Methods and Their Application

5.3 ■ Percentile Methods 205

# denotes the number of times the event occurs.

In terms o f sim ulation values

'# { t ; < t}R + 1

The value o f a can be determ ined informally using (5.19). Thus if /(</>) denotes the log likelihood defined by (5.19), with derivative ?(<f>), then it is easy to show that

e { m 3}var{<f(</>)}3/2

= 6a,

ignoring term s o f order n~l . But the ratio on the left o f this equation is invariant under param eter transform ation. So we transform back from (j) to 6 and deduce that, still ignoring terms o f order n-1 ,

Em > }

var{ /(0)}3/2

To calculate a we approxim ate the m om ents o f f (6) by those o f / (d) under the fitted model with param eter value 9, so tha t the skewness correction factor is

1 E*{/*(0)3} a = T -------- :— *—1— > (5.23)

6 v a r * { n 0)3/2}

where ( ' is the log likelihood of a set o f da ta sim ulated from the fitted model. M ore generally a is one-sixth the standardized skewness o f the linear approxim ation to T.

One potential problem with the BCa m ethod is that if a in (5.21) is much closer to 0 or 1 than a, then (R + l)a could be less than 1 or greater than R, so that even with interpolation the relevant quantile cannot be calculated. If this happens, and if R cannot be increased, then it would be appropriate to quote the extreme value o f t ' and the implied value of a. For example, if ( R + l)a > JR, then the upper confidence limit t'Rj would be given with implied right-tail error a2 equal to one minus the solution to a = R / ( R + 1).

Example 5.5 (Air-conditioning data, continued) Returning to the problem o f Example 5.4 and the exponential bootstrap results for R = 999, we find that the num ber of y * values below y = 108.083 is 535, so by (5.22) w = <P_1 (0.535) = 0.0878. The log likelihood function is tC(fi) = —nlogf i — f i^1 Y,yj> whose derivative is

i w = ^ - n- fiz fi

The second and third m om ents of if(fi) are nfi~2 and 2n/i~3, so by (5.23)

a = In ” 1/2 = 0.0962.

Page 218: Bootstrap Methods and Their Application

206 5 ■ Confidence Intervals

a z„ = w + za $ = ®(w + -p rjj;) r = (/?-(- 1)5 ‘w

0.025 -1 .8 7 2 0.067 67.00 65.260.975 2.048 0.996 995.83 199.410.050 -1 .5 5 7 0.103 102.71 71.190.950 1.733 0.985 984.89 182.42

The calculation o f the adjusted percentile limits (5.21) is illustrated in Table 5.1. The values o f r = (K + l)a are not integers, so we have applied the interpolation form ula (5.8).

H ad we tried to calculate a 99% interval, we should have had to calculate the 999.88th ordered value o f t ‘, which does not exist. The implied right-tail error for t'gg9j is the value a2 which solves

9" = d> (0 .0 8 7 8 + 0-0878 + ^1000 V 1 -0 .0 9 6 2 (0 .0 8 7 8+ Z !_ a2)

namely a2 = 0.0125. ■

Parametric case with nuisance parametersW hen 9 is one o f several unknow n param eters, the previous development applies to a derived distribution called the least-favourable family. As usual we denote the nuisance param eters by X and write ip = (9, a ). I f the log likelihood function for ip based on all the data is <f(ip), then the expected Fisher inform ation m atrix is i(ip) = E{—£(ip)}. Now define (5 = i- 1( $ ) ( l ,0 , . . . ,0 ) r . Then the least-favourable family o f distributions is the one-param eter family obtained from the original model by restricting ip to the curve ip + £S. With this restriction, the log likelihood is

fL F (0 = f ( < P + t h

akin to the profile log likelihood for 9. The M LE o f £ is £ = 0.The bias-corrected percentile m ethod is now applied to the least-favourable

family. Equations (5.21) and (5.22) still apply. The only change in the calcula­tions is to the skewness correction factor a, which becomes

1 E - f c ( 0 ) 3 }6var'K lf (0))W' ‘ 1

In this expression the param eter estim ates ip are regarded as fixed, and the m om ents are calculated under the fitted model.

A som ewhat simpler expression for a can be obtained by noting that i?LF{0) is proportional to the influence function for t. The result in Problem 2.12 shows that

Lt(yj' ,Fv ) = m l(ipy{ip,yj),

Table 5.1 Calculation of adjusted percentile bootstrap confidence limits for fi with the data of Example 1.1, under the parametric exponential model with R = 999;a = 0.0962, w = 0.0878.

j{y>) is d2f(\p)/dy>d\pT.

Page 219: Bootstrap Methods and Their Application

5 J • Percentile Methods 207

Table 5.2 Calculation of adjusted percentile bootstrap confidence

a za = w + z a 5 = <I,(w + i i ; ) r = ( R + l)a £(r>

limits for ^ with thedata of Example 1.1, 0.025 -1 .8 2 3 0.085 85.20 62.97under the gamma 0.975 2.097 0.998 998.11 226.00parametric model with 0.050 -1 .5 0 8 0.125 125.36 67.25p. = 108.0833,/c = 0.7065 and 0.950 1.782 0.991 991.25 208.00

a = 0.1145, w = 0.1372.

where i-1(tp) is the first row o f the inverse o f i(ip) and /( \p,yj) is the contribution to /(tp) from the _/th case. We can then rewrite (5.24) as

where

L* = nil (xpy(xp,Y’ )

and Y ' follows the fitted distribution with param eter value ip. As before, to first order a is one-sixth the estim ated standardized skewness o f the linear approxim ation to t. In the form given, (5.25) will apply also to nonhom ogeneous data.

The B C a m ethod can be extended to any sm ooth function o f the original model param eters \p; see Problem 5.7.

Example 5.6 (Air-conditioning data, continued) We now replace the exponen­tial model used in the previous example, for the data o f Example 2.3, with the tw o-param eter gam m a model. The param eters are 6 = fi and A = k , the first still being the param eter o f interest. The log likelihood function is

/(fi, k ) = n K \og(K/fi) + ( k - 1) £ log y j - k £ y j / f i - n log T(k).

The inform ation m atrix is diagonal, so tha t the least-favourable family is the original gam m a family with k fixed at k = 0.7065. It follows quite easily that

? l f (°) - y ,

and so a is one-sixth o f the skewness o f the sample average under the fitted gam m a model, that is a = The same result is obtained somewhatm ore easily via (5.25), since we know tha t the influence function for the mean is L t(y \F) = y - fi.

The num erical values o f a and w for these data are 0.1145 and 0.1372 respectively, the latter from R = 999 simulated samples. Using these we com pute the adjusted percentile bootstrap confidence limits as in Table 5.2.

Page 220: Bootstrap Methods and Their Application

2 0 8 5 ■ Confidence Intervals

Just how flexible is the BCa m ethod? The following example presents a difficult challenge for all bootstrap m ethods, and illustrates how well the studentized bootstrap and B C a m ethods can com pensate for weaknesses in the more primitive methods.

Example 5.7 (Normal variance estimation) Suppose that we have independent samples ( y n , . . . , y i m), i = l , . . . , /c , from norm al distributions with different m eans A, but com m on variance 8, the latter being the param eter o f interest. The m axim um likelihood estim ator o f the variance is t = n~l Y%=i Y^j=t(yij ~ yi)2, where n = mk. In practice the m ore usual estim ate would be the pooled mean square, with denom inator d = k(m — 1) ra ther than n, but here we leave the bias o f T in tact to see how well the bootstrap m ethods can cope.

The distribution o f T is n~l6xj. This exact result allows us both to avoid the use o f simulation, and to calculate exact coverages for all the confidence limit methods. D enote the a quantile o f the Xd distribution by cd . Using the fact tha t T* = n~l txd we see that the upper a confidence limits for 8 under the basic bootstrap and percentile m ethods are respectively

21 — rT1 tcj'i-a, n~l tcd,a.

The coverages o f these limits are calculated using the exact distribution o f T. For example, for the basic bootstrap confidence limit

Pr (0 < 2T — T c4 ,_.) - Pr > J T T T T ^ ) '

For the B C a m ethod, (5.22) gives w = ®_ 1{ P r(^ < n)\ and (5.24) gives a = \ 2 l?2n - V 2. The upper a confidence limit, calculated by (5.20) with the exact distribution for T ‘, is n~l tcd,«. The exact coverage o f this limit is P r ( ^ > n2/cd,$).

Finally, for the studentized bootstrap upper a confidence limit (5.7), we first calculate the variance approxim ation v = 2n~l t2 from the expected Fisher inform ation m atrix and then the confidence limit is nt/cd,i-a. The coverage of this limit is exactly a.

Table 5.3 shows num erical values o f coverages for the four m ethods in the case k = 10 and m = 2, where d = |n = 10. The results show quite dram atically first how bad the basic and percentile m ethods can be if used w ithout careful thought, and secondly how well studentized and adjusted percentile m ethods can do in a m oderately difficult situation. O f course use o f a logarithm ic transform ation would improve the basic bootstrap m ethod, which would then give correct answers. ■

yi is the average ofyn>---,ymr

Page 221: Bootstrap Methods and Their Application

5.3 ■ Percentile Methods 209

Table 5.3 Exact coverages (%) of confidence limits for normal variance based on maximum likelihood estimator for 10 samples each of size two.

fj($) is the first derivative drj(6)/dd.

N om inal Basic S tudentized Percentile BCa

1.0 0.8 1.0 0.0 1.02.5 2.5 2.5 0.0 2.55.0 4.8 5.0 0.0 5.0

95.0 35.0 95.0 1.6 91.597.5 36.7 97.5 4.4 100.099.0 38.3 99.0 6.9 100.0

Nonparametric case: single sampleThe adjusted percentile m ethod for the nonparam etric case is developed by applying the m ethod for the param etric case with no nuisance param eters to a specially constructed nonparam etric exponential family with support on the data values, the least-favourable family derived from the multinom ial distribution for frequencies o f the da ta values under nonparam etric resampling.

Specifically, if lj denotes the empirical influence value for t a t yj, then the resampling model for an individual Y * is the exponential tilted distribution

Pr(7* = y j ) = pj = (5-26)

The param eter o f interest 6 is a m onotone function o f r\ with inverse rj(6), say. The M LE o f rj is fj = rj(t) = 0, which corresponds to the E D F F being the nonparam etric M LE o f the sampling distribution F.

The bias correction factor w is calculated as before from (5.22), but using nonparam etric bootstrap sim ulation to obtain values of t*. The skewness correction a is given by the empirical analogue of (5.23), where now

W hen the m om ents needed in (5.23) are evaluated at 6, or equivalently at fj = 0, two simplifications occur. First we have E*(L*) = 0, and secondly the m ultiplier ij(t) cancels when (5.23) is applied. The result is that

a 6 / \ 3/2’6 (e i f )

which is the direct analogue o f (5.25).

Example 5.8 (Air-conditioning data, continued) The nonparam etric version o f the calculations in the preceding example involves the same form ula (5.21), but now with a = 0.0938 and w = 0.0728. The form er constant is calculated from (5.27) with lj = y7 — y. The confidence limit calculations are shown in Table 5.4 for 90% and 95% intervals. ■

Page 222: Bootstrap Methods and Their Application

210 5 ■ Confidence Intervals

a za = w + z a 5 = ®(w + i ^ i ; ) r = (R + 1)5 C(r)

0.025 -1.8872 0.0629 62.93 55.330.975 2.0327 0.9951 995.12 243.500.050 -1.5721 0.0973 97.26 61.500.950 1.7176 0.9830 983.01 202.08

function o f sample moments, say t == t ( s ) where In = n - ' E ”=i Sitij)for i = then (5.26) is a one-dimensional reduction o f a /c-dimensionalexponential family for si(Y * ),... ,s*( Y *). By equation (2.38) the influence values lj for t are given simply by lj = tT {s(yj) — s} with t = dt/ds.

The m ethod as described will apply as given to any single-sample problem, and to m ost regression problems (Chapters 6 and 7), but not exactly to problems where statistics are based on several independent samples, including stratified samples.

Nonparametric case: several samples

In the param etric case the BCa m ethod as described applies quite generally through the unifying likelihood function. In the nonparam etric case, however, there are predictable changes in the B C a method. The background approx­im ation m ethods are described in Section 3.2.1, which defines an estim ator in terms o f the E D Fs o f k samples, t = t (F\ , . . . , Fk). The empirical influence values lij for j = 1 and i = 1 , . . . , k and the variance approxim ation vL are defined in (3.2) and (3.3).

I f we return to the origin and developm ent o f the BCa method, we see that the definition o f bias correction w in (5.22) will rem ain the same. The skewness correction a will again be one-sixth the estim ated standardized skewness of the linear approxim ation to t, which here is

, i s i , » r J

This can be verified as an application o f the param etric m ethod by constructing the least-favourable jo in t family o f k distributions from the k multinom ial distributions on the da ta values in the k samples.

N ote that (5.28) can be expressed in the same form as (5.27) by defining hj = nlij/ni, where n = Y so that

ij (e u?5)!vL = n 2 Y ^ i - ° = (5.29)

Table 5.4 Calculation of adjusted percentile bootstrap confidence limits for p. in Example 1.1 using nonparametric bootstrap with R ~ 999; a = 0.0938, w = 0.0728.

Page 223: Bootstrap Methods and Their Application

5.4 ■ Theoretical Comparison o f Methods 211

see Problem 3.7. This can be helpful in writing an all-purpose algorithm for the BCa m ethod; see also the discussion o f the A B C m ethod in the next section.

A n example is given at the end o f the next section.

5.4 Theoretical Comparison of MethodsThe studentized bootstrap and adjusted percentile m ethods for calculating confidence limits are inherently more accurate than the basic bootstrap and percentile m ethods. This is quite clear from empirical evidence. Here we look briefly at the theoretical side o f the story for statistics which are approxim ately normal. Some aspects o f the theory were discussed in Section 2.6.1. For simplicity we shall restrict m ost of the detailed discussion to the single-sample case, but the results generalize w ithout much difficulty.

5.4.1 Second-order accuracyTo assess the accuracies of the various bootstrap confidence limits we calculate coverage probabilities up to the n_1/2 term s in series approxim ations, these based on corresponding approxim ations for the C D Fs o f U = ( T — 0 ) /v 1/2 and Z = (T — 0 ) / F 1/2. Here v is var(T ) or any approxim ation which agrees to first order with vL, the variance of the linear approxim ation to T. Similarly V is assumed to agree to first order with VL. For example, in the scalar param etric case where T is the maximum likelihood estim ator, v is the inverse o f the expected Fisher inform ation matrix. In all o f the equations in this section equality is correct to order n~1 2, i.e. ignoring errors o f order n ~ \

The relevant approxim ations for C D Fs are the one-term C ornish-Fisher approxim ations

P r([/ < u) = G(6 + v 1/ 2m) = <5 (u — n~l/2(m\ — |m 3 + |m 3w2) j , (5.30)

where G is the C D F o f T, and

K(z) = Pr(Z < z) = <D (u - n~l/2{mi - gm3 - ( |m n - gm3)u2}) , (5.31)

with the constants defined by

E(U) = n~1/2m u E ( l /3) = n“ 1/2(m3 - 3wi), E {(F - v)(T - 0)} = rT 1/2m „ ;(5.32)

note tha t the skewness o f U is «_ 1/2m3. The corresponding approxim ations for quantiles o f T (rather than U) and Z are

G- 1(oe) = 6 + v1/2za + n_ 1/2v1/2(mi — + gm3z2) (5.33)

and

K _ 1(a) = za + n~l/2 {mi - ±m3 - { {m n - gw3)z2} . (5.34)

Page 224: Bootstrap Methods and Their Application

The analogous approxim ations apply to bootstrap distributions and quan­tiles, ignoring sim ulation error, by substituting appropriate estimates for the various constants. In fact, provided the estimates for mi, m3 and mu have errors o f order n~1 2 o r less, the n~1/2 terms in the approxim ations will not change. This greatly simplifies the calculations that we are about to do.

Studentized bootstrap method

Consider the studentized bootstrap confidence limit 0a = t — v[/2k ~ l (l — a). Since the right-hand side o f (5.34) holds also for K ~1 we see tha t to order

ea = t — v l/2z i - x - n~1/2v l/2{mi - ±m3 - (±mn - i/n 3)z^_o;}. (5.35)

It then follows by applying (5.31) and expanding by Taylor series that

Pr(0 < 0a) = a + 0 (n _1).

This property is referred to as second-order accuracy, as distinct from the first- order accuracy o f the norm al approxim ation confidence limit, whose coverage is a + 0 ( n ~ l/2).

Adjusted percentile methodFor the adjusted percentile limit (5.20) we use (5.33) with estim ated constants as the approxim ation for G_1( ). For the norm al integral inside (5.20) we can use the approxim ation ®(za + 2w + ai\) , because a and w are o f order n~{/2. Then the a confidence limit (5.20) is approxim ately

= t + v1/2za + n_ 1/V /2 {2n1/2w + mi — ±m3 + (nI/2a + |m 3)z2} . (5.36)

This will also be second-order accurate if it agrees with (5.35), which requires that to order n~1/2,

a = n~ll2{ \ m n - |m 3), w = - n ~ 1/2(mi - gm3). (5.37)

To verify these approxim ations we use expressions for mu m3 and m n derived from the quadratic approxim ation for T described in Section 2.7.2.

In slightly simplified notation, the quadratic approxim ation (2.41) for T is

T = 6 + n~l Y Lt(Yi) + \ n ~ 2 £ Q,(Yj, Yk). (5.38)» ‘J

It will be helpful here and later to define the constants

a = I „ - 3v-V 2 ^ E {L3(y,)}, b = i n ” 2 Y E iQt(Yi, i i

c = \n-4v-3/2J 2 E{Lt(Yj)L,(Yk)Qt{Yj, 7,)}. i+k

212 5 ■ Confidence Intervals

(5.39)

Page 225: Bootstrap Methods and Their Application

Then calculations of the first and third m om ents o f T — 0 from the quadratic approxim ation show that

m, = n1/2v~l/2b, m3 = n1/2(6<z + 6c). (5.40)

For m u, it is enough to take V to be the delta m ethod estim ator, which in the full notation is Vl = n~2 ^ L 2(Y;;F). Then using the approxim ation

n n

L t ( Y r , F ) = L , ( Y ,) - n - 1 £ L t ( Y j ) + n ” 1 £ Q ,(Y „ Y j )

7=1 7=1

given in Problem 2.20, detailed calculation leads to

mu = « ^ 2(6a + 4c). (5.41)

The results in (5.40) and (5.41) imply the identity for a in (5.37), after noting tha t the definitions o f a in (5.23), (5.25) and (5.27) used in the adjusted percentile m ethod are obtained by substituting estimates for m om ents o f the influence function. The identity for w in (5.37) is confirmed by noting that the original definition w = 4>~’{G(t)} approxim ates <1>_ 1{G(0)}, which by applying (5.30) with u = 0 agrees with (5.37).

Basic and percentile methods

Similar calculations show that the basic bootstrap and percentile confidence limits are only first-order accurate. However, they are both superior to the norm al approxim ation limits, in the sense that equi-tailed confidence intervals are second-order accurate. For example, consider the 1 — 2a basic bootstrap confidence interval with limits

0 x, 8 i - x = 2t — G_1(l — a),2 t — G_ 1(a).

It follows from the estim ated version o f (5.33) that

21 — G_1(l — a) = t — t>1/2zi_c( — n- 1/V /2(mi — ^m3 +

and by (5.31) the error o f this lower limit is

Pr{0 < 2 1 — G ~ '(l — a)} = 1 — <J>(zi_a + n~1/2m n z j _ J .

Correspondingly the error o f the upper limit is

Pr{0 > 2 1 — G- 1(a)} = <P(za + n_ 1/2m n z2).

Therefore the com bined coverage error o f the confidence interval is

1 - <J>(zi_a + n~l/2m n z 2- J + ®(za + n_ 1/2m iiz2)

which, after expanding in Taylor series and dropping n~l terms, and then noting tha t z2 = z \_a and <j)(za) = <f>{zi_«), turns out to equal

2a + n~1/2mn {z2_a0 (zi-a) - z«0 (za)} = 2a.

5.4 ■ Theoretical Comparison o f Methods 213

Page 226: Bootstrap Methods and Their Application

214 5 • Confidence Intervals

These results are suggestive o f the behaviour that we observe in specific ex­amples, tha t bootstrap m ethods in general are superior to norm al approxim a­tion, but that only the adjusted percentile and studentized bootstrap m ethods correctly adjust for the effects o f bias, nonconstant variance, and skewness. It would take an analysis including n-1 term s to distinguish between the preferred methods, and to see the effect o f transform ation prior to use o f the studentized bootstrap method.

5.4.2 The ABC methodIt is fairly clear that, to the order n~l/2 considered above, there are many equivalent confidence limit methods. One o f these, the A B C m ethod, is o f particular interest. The m ethod rests on the approxim ation (5.35), which by using (5.40) and (5.41) can be re-expressed as

6x = t + v 1/2{zx + a + c — v~l/2b + (2 a + c)za}; (5.42)

here v has been approxim ated by v in the definition o f mi, and we have used Z\-a — — z«.

The constants a, b and c in (5.42) are defined by (5.39), in which the expectations will be estimated. Special forms o f the A B C m ethod correspond to special-case estimates o f these expectations. In all cases we take v to be vl-

Parametric case

If the estimate t is a sm ooth function o f sample moments, as is the case for an exponential family, then the constants in (5.39) are easy to estimate. W ith a tem porary change o f notation, suppose that t = t(s) where s = n~l ^ s(yj) has p com ponents, and define fi = E(S), so that 6 = t(n). Then

L t(Y,) = t(n)T{s(Yj) - fi}, Qt(Yj, Yk) = {s(Yj) - fi}Tt(fi){s(Yk) - /i}. (5.43)

Estimates for a, b and c can therefore be calculated using estimates for the first three m om ents o f s( Y ).

For the particular case where the distribution o f S has the exponential family PD F

f S ) = exp{//r s -£ ( f / )} ,

the calculations can be simplified. First, define L(^) = var(5) = l(rj). Then

vl = t(s)T'L(s)i(s).

Substitution from (5.43) in (5.39), and estim ation o f the expectations, gives estim ated constants which can be expressed simply as

, b = ^tr{t(s)£(s)}, £=0 L

t = dt(s)/ds, and V = d2t(s)/dsdsT.

Ul) = 81Ul)/d'ldqT-

tr(A) is the trace of the square matrix A.

Page 227: Bootstrap Methods and Their Application

5.4 ■ Theoretical Comparison o f Methods 215

1 d2t(s + ke)c =

2vm del(5.44)

£=0

where k = £ (s)i ( s ) /v ^2.The confidence limit (5.42) can also be approxim ated by an evaluation of

the statistic t, analogous to the BCa confidence limit (5.20). This follows by equating (5.42) with the right-hand side o f the approxim ation t(s + v 1 2e) = t(s) + v ^ 2eT 't(s), with appropriate choice o f e. The result is

where= t i i + F k ? k | * ( 5 -4 5 )

za = w + za = a + c - bvL i/2 + z«.

In this form the ABC confidence limit is an explicit approxim ation to the BCa confidence limit.

If the several derivatives in (5.44) are calculated by numerical differencing, then only 4p + 4 evaluations o f t are necessary, plus one for every confidence limit calculated in the final step (5.45). Algorithm s also exist for exact numerical calculation o f derivatives.

Nonparametric case: single sampleIf the estimate t is again a sm ooth function o f sample moments, t = t(s), then (5.43) still applies, and substitution o f empirical m om ents leads to

b = ! t (•;£) 1 (E 5jO)rf'(£ sjQ) k== 1 E s)h6(E/,2)3/2 2 { U 2n (E '72)3/2 ’ n (E^)1/2'

(5.46)An alternative, m ore general form ulation is possible in which s is replaced by

the m ultinom ial proportions n~l{ f \ , . . . , f n) attaching to the da ta values. C or­respondingly fi is replaced by the probability vector p, and with distributions F restricted to the data values, we re-express t(F) as t(p); cf. Section 4.4. Now F is equivalent to p = (£,■■■,£) and t = t(p). In this notation the empirical influence values and second derivatives are defined by

andd2

qjj =

(5.47)

(5.48)£=0

where 1; is the vector with 1 in the ; th position and 0 elsewhere. Let us set tj(p) = dt(p)/8pj, and tjk(p) = d2t(p)/dpjdpk ; see Section 2.7.2 and Prob­lem 2.16. Then alternative forms for the vector I and the full m atrix q are

/ = (/ - iT* J)t(p), q = ( I ~ n - ' J i i m i ~ n~lJ),

Page 228: Bootstrap Methods and Their Application

216 5 ■ Confidence Intervals

where J = 11T. For each derivative the first form is convenient for approx­im ation by numerical differencing, while the second form is often easier for theoretical calculation.

Estimates for a and b can be calculated directly as empirical versions of their definitions in (5.39), while for c it is simplest to use the analogue o f the representation in (5.44). The resulting estimates are

i E ' j , i v - 1 / ,6 ( £ I ]? '1' 2n2 ^ qjj ~ 2n1 ^

(5.49)1 d2t(p + ek)

a =

c —2 V 1/ 2 d £ l 2 * v l »

t (I — n J)t(I — n J)t

where k = n 2vL i/2lT and t , i are evaluated at p.The approxim ation (5.45) can also be used here, but now in the form

d« = t [ P + n ' . J ■ (5.50)V (1 - a z a y )

If the several derivatives are calculated by numerical differencing, then the num ber o f evaluations o f t(p) needed is only 2n+2, plus one for each confidence limit and the original value t. Note tha t the probability vector argum ent in (5.50) is not constrained to be proper, or even positive, so that it is possible for A B C confidence limits to be undefined.

Example 5.9 (Air-conditioning data, continued) The adjusted percentile m ethod was applied to the air-conditioning da ta in Example 5.6 under the gamma model and in Example 5.8 under the nonparam etric model. Here we examine how well the A B C m ethod approxim ates the adjusted percentile confidence limits. For the m ean param eter, calculations are simple under all models. For example, in the gam m a case the exponential family is two-dimensional with s = (y .lo g y )7’,

rj i = - hk/ h, rj 2 = me, xp(ri) = - f /2 log( -t] i /n) + n log r(rj2/n),

and t(s) = si. The last implies that t = ( l ,0 ) r and t = 0. It then follows straightforw ardly that the constant a is given by \ (nk)~ l/2 as in Example 5.6, that b = c = 0, and tha t k = v ^ 2( l , 0 )T. Similar calculations apply for the nonparam etric model, except that a is given by the corresponding value in Example 5.8. So under both models

0i_a = 108.083 + v[/2-------- + - - - -{1 - a ( a + z i-c)}2 '

Num erical com parisons between the adjusted percentile confidence limits and

1 is a vector of ones.

Page 229: Bootstrap Methods and Their Application

5.4 • Theoretical Comparison o f Methods 217

Table 5.5 Adjusted percentile (BCa) and ABC confidence intervals for mean failure time fi for the air-conditioning data. R = 999 simulated samples for BCa methods.

N om inal confidence 1 — 2a

0.99 0.95 0.90

G am m a m odel BCaA B C

51.5, 241.652.5, 316.6

63.0, 226.0 61.4, 240.5

67.2, 208.0 66.9, 210.5

N o nparam etric m odel BC aA B C

44.6, 268.846.6, 287.0

55.3, 243.5 57.2, 226.7

61.5, 202.163.6, 201.5

A B C limits are shown in Table 5.5. The A B C m ethod appears to give rea­sonable approxim ations, except for the 99% interval under the gam m a model.

Nonparametric case: several samplesThe estim ated constants (5.49) for the single-sample case can be applied to several samples by using a single artificial probability vector n o f length n = J 2 ni as follows. The estim ator will originally be defined by a function t (pi , . . . ,pk) , where p, = ( p u , . . . ,p ini) is the vector o f probabilities on the ith sample values y n , . . . , yim■ The artificial representation o f the estim ator in terms o f the single probability vector

71 = (7T11, . . . , 7T21,. . . , )

o f length n is u(n) = t (pi , . . . ,pk) where p, has elements

py = (5.51)E ;= i nn

The set o f E D Fs is equivalent to ft = (£,•■•,£) and the observed value o f the estimate is t = u(n). This artificial representation leads to expressions such as (5.29), in which the definition o f 7i; is obtained by applying (5.47) to u(p). (Note tha t the real influence values /y and second derivatives q(j j derived from t (pi , . . . ,pk) should not be used.) T hat this m ethod produces correct results is quite easy to verify using the several sample extension o f the quadratic approxim ation (5.38); see Section 3.2.1 and Problem 3.7.

Example 5.10 (Air-conditioning data failure ratio) The da ta o f Example 1.1 form one o f several samples corresponding to different aircraft. The previous sample («i = 12) and a second sample (n2 = 24) are given in Table 5.6. Suppose that we w ant to estimate the ratio o f failure rates for the two aircraft, and give confidence intervals for this ratio.

To set notation, let the mean failure times be fii and fi2 for the first and second aircraft, with 6 = n t / n \ the param eter o f interest. The corresponding

Page 230: Bootstrap Methods and Their Application

218 5 ■ Confidence Intervals

F irst a ircraft

3 5 7 18 43 85 91 98 100 130 230 487

Second aircraft

3 5 5 13 14 15 22 22 23 30 36 3944 46 50 72 79 88 97 102 139 188 197 210

sample means are y\ = 108.083 and y2 = 64.125, so the estim ate for 6 is t = y i / y i = 0-593.

The empirical influence values are (Problem 3.5)

h j =yi yi

We use (5.29) to calculate vL = 0.05614 and a = —0.0576. In R = 999 nonparam etric sim ulations there are 473 values o f t* below t, so by (5.22) w = —0.0954. W ith these values we can calculate B Ca confidence limit (5.21). For example, for a = 0.025 and 0.975 the values o f a are 0.0076 and 0.944 respectively, so tha t the limits o f the 95% interval are r(*76) = 0.227 and £(944) = 1-306; the first value is interpolated using (5.8).

The studentized bootstrap m ethod gives 95% confidence interval [0.131,1.255] using the original scale. The distribution o f t* values is highly skew here, and the logarithm ic scale is strongly indicated by diagnostic plots. Figure 5.2 shows the norm al Q-Q plot o f the t* values, the variance-param eter plots for original and logarithm ic scales, and the norm al Q-Q plot o f z* values after logarithm ic transform ation. A pplication o f the studentized bootstrap m ethod on the loga­rithm ic scale leads to 95% confidence interval [0.183,1.318] for 6, m uch closer to the B C a limits.

For the A B C m ethod, the original definition o f the estim ator is t = t(pi ,p2) = Y ^ y i j P i j / ^ y i j P i j - The artificial definition in term s o f a single probability vector n is

u(n\ = E£i yunn/ E£i n2i

( E"li yijnij /TjLinij '

Table 5.6 Failure times for air-conditioning equipment in two aircraft (Proschan,1963).

A pplication o f (5.47) shows tha t the artificial empirical influence values are

Page 231: Bootstrap Methods and Their Application

5.4 • Theoretical Comparison o f Methods 219

Figure 5.2 Diagnostic plots forair-conditioning data confidence intervals, based on R = 999 nonparametric simulations. Top left panel: normal Q-Q plot of t*, dotted line is N(t,VL) approximation. Top right:variance-parameter plot, v*L versus r \ Bottom left: variance-parameter plot after logarithmic transformation. Bottom right: normal Q-Q plot of z* after logarithmic transformation.

Quantiles of Standard Normal t*

/

CM /

■ . • i s - v y

N O

/

• ‘ •

CVJ

/

-1.5 -0.5 0.5 1.0 - 2 0 2

iog(t*) Quantiles of Standard Normal

and

' n \ ( y 2 j - y i» j - .......«■

This leads to formulae in agreement with (5.29), which gives the values o f a and vL already calculated. It remains to calculate b and c.

For b, application o f (5.48) gives

. r,. \2= 2t f n2(yij

h I »i- , w , , - y \ ) . n n 2( y i j - y i )

iUjj — t-. 1 „2r, „2yi n\

Page 232: Bootstrap Methods and Their Application

220 5 ■ Confidence Intervals

and*,nrn(y2J - y2)

^ l.jj 2- 'n \y x

so by (5.49) we have

b = n f M T 3 ' Y i y i j - y i f ,

whose value is b = 0.0720. (The bootstrap estimates b and v are respectively 0.104 and 0.1125.) Finally, for c we apply the second form in (5.49) to u(n), that is

c = ^n~4v ^ 3/2l Tii(7t)I,

and calculate c = 0.3032. The implied value o f w is —0.0583, quite different from the bootstrap value —0.0954. The A B C form ula (5.50) is now applied to u(jt) with k = n~2v~[1/20 n , . ■ ■ The resulting 95% confidence interval is [0.250,1.283], which is fairly close to the B Ca interval.

It seems possible that the approxim ation theory does not work well here, which would explain the larger-than-usual differences between B C a, A B C and studentized bootstrap confidence limits; see Section 5.7.

One practical point is that the theoretical calculation o f derivatives is quite time-consuming, com pared to application o f numerical differencing in (5.47)-(5.49). ■

5.5 Inversion of Significance TestsThere is a duality between significance tests for param eters and confidence sets for those param eters, in the sense tha t — for a prescribed level — a confidence region includes param eter values which are not rejected by an appropriate significance test. This can provide another option for calculating confidence limits.

Suppose that 8 is an unknow n scalar param eter, and tha t the model includes no other unknow n param eters. If Ra(0o) is a size a critical region for testing the null hypothesis H0 : 8 = 80, which means that

Pr{(Yu . . . , Y n) e R a( d o ) \e 0} = «,

then the set

C W Y ,,. . . , Y„) = {6 : (Y l t . . . , Y„) £ J^(0)}

is a 1 — a confidence region for 6. The shape o f the region will be determ ined by the form o f the test, including the alternative hypothesis for which the test is designed. In particular, an interval would usually be obtained if the alternative is two-sided, HA : 6 0O; an upper limit if HA : 8 < 0O; and a lower limit ifHa : 8 > 80.

Page 233: Bootstrap Methods and Their Application

5.5 ■ Inversion o f Significance Tests 221

For definiteness, suppose tha t we w ant to calculate a lower 1 — a confidence limit, which we denote by 9X. The associated test o f Ho : 9 = do versus Ha : 8 > do will be based on a test statistic t(9o) for which large values are evidence in favour o f H A : for example, t(0o) m ight be an estimate o f 6 minus Oo- We will have an algorithm for approxim ating the P-value, which we can write as p(9o) = Pr{T(0o) > ?(^o) I Fo}, where Fo is the null hypothesis distribution with param eter value 9o. The 1 — a confidence set is all values of9 such tha t p(8) > a, so the lower confidence limit 0a is the smallest solution o f p(9) = a. A simple way to solve this is to evaluate p(0) over a grid of, say, 20 values, and to interpolate via a simple curve fit. The grid can sometimes be determ ined from the norm al approxim ation confidence limits (5.4). For the curve fit, a simple general m ethod is to fit a logistic function to p(9) using either a simple polynom ial in 9 or a spline. Once the curve is fitted, solutions to p(9) = a can be com puted: usually there will be one solution, which isK

For an upper 1 — a confidence limit 9 \ - a, note tha t this is identical to a lower a confidence limit, so the same procedure as above with the same t(9o) can be used, except that we solve p(0) = 1 — a. The com bination o f lower and upper 1 — a confidence limits defines an equi-tailed 1 — 2a confidence interval.

The following example illustrates this procedure.

Example 5.11 (Hazard ratio) For the A M L da ta in Example 3.9, also an­alysed in Example 4.4, assume that the ratio o f hazard functions h2(z) /hi(z) for the two groups is a constant 9. As before, let rtJ be the num ber in group i who were at risk just prior to the y'th failure time zj, and let y} be 0 or 1 according as the failure ^t Zj is in group 1 or 2. Then a suitable statistic for testing Ho : 9 = 9o is

this is the score test statistic in the Cox proportional hazards model. Large values o f t(6o) are evidence that 9 > Oo-

There are several possible resampling schemes tha t could be used here, including those described in Section 3.5 but modified to fix the constant hazard ratio 9o- Here we use the simpler conditional model o f Example 4.4, which holds fixed the survival and censoring times. Then for any fixed 9o the simulated values y \ , . . . , y 'n are generated by

Page 234: Bootstrap Methods and Their Application

222 5 • Confidence Intervals

log(theta)

where the num bers a t risk just prior to zj are given by

f J - i ) (r\j = max I 0, m - ^ ( 1 - y ’k) - c1;-

I *=ir2j = max 0, r 2i

1Y . y ' kk= 1

C2j

with Cij the num ber of censoring times in group i before zj.

For the A M L data we simulated R = 199 samples in this way, and calculated the corresponding values t*(90) for a grid o f 21 values o f 90 in the range 0.5 < 0o ^ 10. For each Go we com puted the one-sided P-value

#{t*(0o) > t(0o)}Pieo) =

200

then on the logit scale we fitted a spline curve (in log 6), and interpolated the solutions to p(9o) = a, 1—a to determ ine the endpoints o f the (1—2a) confidence interval for 9. Figure 5.3 illustrates this procedure for a = 0.05, which gives the 90% confidence interval [1.07,6.16]; the 95% interval is [0.86,7.71] and the point estimate is 2.52. Thus there is mild evidence tha t 6 > 1.

A m ore efficient approach would be to use R = 99 for the initial grid to determ ine rough values o f the confidence limits, near which further sim ulation with R = 999 would provide accurate interpolation o f the confidence limits. Yet more efficient algorithm s are possible. ■

In a m ore systematic developm ent o f the m ethod, we m ust allow for a nuisance param eter X, say, which also governs the data distribution bu t is not constrained by Ho. Then both Ra(0) and C \ - a{Y \ , . . . , Y„) m ust depend upon X to make the inversion m ethod work exactly. U nder the bootstrap approach X is replaced by an estimate.

Figure 5.3 Bootstrap P-values p(0o) for testing constant hazard ratio 0o, with R = 199 at each point. Solid curve is spline fit on logistic scale. Dotted lines interpolate solutions to p(l?o) = 0.05,0.95, which are endpoints of 90% confidence interval.

Page 235: Bootstrap Methods and Their Application

5.6 • Double Bootstrap Methods 223

Suppose, for example, tha t we w ant a lower 1 — a confidence limit, which is obtained via the critical region for testing Ho : 9 = 9q versus the alternative hypothesis H a : 9 > 9o■ Define ip = (9, A). I f the test statistic is T(9o), then the size a critical region has the form

R«(8o) = { ( y u - - - ,y n) ■ Pr{T(0o) > t(90) | ip = (0o,A)} < a},

and the exact lower confidence limit is the value uy = ua(y, X), such that

Pr{ T(ua) > t(ua) | xp = (ua,/1)} = a.

We replace X by an estimate s, say, to obtain the lower 1 — a bootstrap confidence limit u i_a = ua(y,s). The solution is found by applying for u the equation

Pr* {T*(u) > t(u) | xp = (u,s)} = a,

where T*(w) follows the distribution under xp = (u,s). This requires application o f an interpolation m ethod such as the one illustrated in the previous example.

The simplest test statistic is the point estimate T o f 9, and then T(9o) = T. The m ethod will tend to be more accurate if the test statistic is the studentized estimate. T hat is, if var(T ) = o 2(9,A), then we take Z = (T — 9o)/v(9o,S)\ for further details see Problem 5.11. The same rem ark would apply to score statistics, such as that in the previous example, where studentization would involve the observed or expected Fisher inform ation.

N ote that for the particular alternative hypothesis used to derive an upper limit, it would be standard practice to define the P-value as Pr{T(0o) < t(9o) \ Fo}, for example if T(0 q) were an estim ator for 9 or its studentized form. Equivalently one can retain the general definition and solve p(9o) = 1 — a for an upper limit.

In principle these m ethods can be applied to both param etric and sem ipara­metric problems, but not to completely nonparam etric problems.

5.6 Double Bootstrap MethodsW hether the basic or percentile bootstrap m ethod is used to calculate con­fidence intervals, there is a possibly non-negligible difference between the nom inal 1 — a coverage and the actual probability coverage o f the interval in repeated sampling, even if R is very large. The difference represents a bias in the m ethod, and as indicated in Section 3.9 the bootstrap can be used to estimate and correct for such a bias. T hat is, by bootstrapping a bootstrap confidence interval m ethod it can be m ade m ore accurate. This is analogous to the bootstrap adjustm ent for bootstrap P-values described in Section 4.5. One straightforw ard application o f this idea is to the norm al-approxim ation confidence interval (5.4), which produces the studentized bootstrap interval;

Page 236: Bootstrap Methods and Their Application

224 5 • Confidence Intervals

see Problem 5.12. A m ore am bitious application is bootstrap adjustm ent o f the basic bootstrap confidence limit, which we develop here.

First we recall the full notations for the quantities involved in the basic bootstrap confidence interval method. The “ideal” upper 1 — a confidence limit is t(F) — ax(F), where

Pr { T - 6 < ax(F) | Fj = Pr{f(F) - t (F) < aa(F) \ F} = a.

W hat is calculated, ignoring sim ulation error, is the confidence limit t(F)—ax(F). The bias in the m ethod arises from the fact tha t aa(F) ^ aa(F) in general, so that

Pr{f(F) < t(F) - aa(F) | F} ± 1 - a. (5.52)A

We could try to eliminate the bias by adding a correction to ax(F), bu t a more successful approach is to adjust the subscript a. T hat is, we replace ax(F) by Oq(a)(F) and estimate w hat the adjusted value q(a) should be. This is in the same spirit as the BCa method.

Ideally we w ant q(a) to satisfy

P r{t(F) < t(F) - fl, (a)(F) | F} = 1 - a. (5.53)

The solution q(a) will depend upon F, i.e. q(oc) = q(a, F). Because F is unknown, we estimate q(a) by q(a) = q(a, F). This means tha t we obtain q(a) by solving the bootstrap version o f (5.53), namely

Pr*{t(F) < t (F') - ai{a)( h I F} = 1 - a. (5.54)

This looks intim idating, but from the definition o f aa(F) we see tha t (5.54) can be rewritten as

Pr*{Pr**(T** < 2 T ' - t \ F*) > q{oc) | F} = 1 - a. (5.55)

The same m ethod o f adjustm ent can be applied to any bootstrap confi­dence limit m ethod, including the percentile m ethod (Problem 5.13) and the studentized bootstrap m ethod (Problem 5.14).

To verify tha t the nested bootstrap reduces the order o f coverage error made by the original bootstrap confidence limit, we can apply the general discussion o f Section 3.9.1. In general we find that coverage 1 — a + 0(n~“) is corrected to1—a + 0 (n ~ fl~1/2) for one-sided confidence limits, w hether a = | or 1. However, for equi-tailed confidence intervals coverage 1 — 2a + 0 (n-1 ) is corrected to1 — 2a -I- 0(n~2); see Problem 5.15.

Before discussing how to solve equation (5.55) using simulated samples, we look at a simple illustrative example where the solution can be found theoretically.

Example 5.12 (Exponential mean) Consider the param etric problem o f ex­ponential da ta with unknow n mean /i. The data estim ate for fi is t = y, F is

Page 237: Bootstrap Methods and Their Application

5.6 ■ Double Bootstrap Methods 225

I {A} is the zero-one indicator function of the event A.

the fitted exponential C D F with mean y, and F * is the fitted exponential C D F with mean y * — the m ean o f a param etric bootstrap sample y \ , . . . , y 'n drawn from F. A result tha t we use repeatedly is tha t if X \ , . . . , X n are independent exponential with mean y, then 2n X / y has the x ln distribution.

The basic bootstrap upper 1 — a confidence limit for n is

2y - yc2n,u/(2n),

where Pt(xI„ < cjn,%) = oc. To evaluate the left-hand side o f (5.55), for the inner probability we have

P r**(F" < 2 ? - y | F*) = Pr{*2„ < 2n(2 - J) / ? ) } ,

which exceeds q if and only if 2n(2 — y / y ’) > C2n,q■ Therefore the outer probability on the left-hand side of (5.55) is

Pr" {2„(2 - « ? • ) > I = Pr { & > 2 _ ^ / (2„ , } . (5-56)

with q = q(a). Setting the probability on the right-hand side o f (5.56) equal to1 — a, we deduce that

2 n2 - cl n m l{2n) C2n’a'

Using q(a) in place o f a in the basic bootstrap confidence limit gives the adjusted upper 1 — a confidence limit 2 n y /c 2n,a, which has exact coverage 1 — oc.So in this case the double bootstrap adjustm ent is perfect.

Figure 5.4 shows the actual coverages o f nom inal 1 — a bootstrap upper confidence limits when n = 10. There are quite large discrepancies for both basic and percentile m ethods, which are completely removed using the double bootstrap adjustm ent; see Problem 5.13. ■

In general, and especially for nonparam etric problems, the calculations in (5.55) cannot be done exactly and sim ulation or approxim ation m ethods must be used. A basic sim ulation algorithm is as follows. Suppose that we draw R samples from F, and denote the model fitted to the rth sample by F ’ — the E D F for one-sample nonparam etric problems. Define

ur = Pr(T** < 21* - 1 1 F*).

This will be approxim ated by drawing M samples from F", calculating the estim ator values r” for m = 1 , . . . , M and com puting the estimate

M

«M ,r = ^ K ~ '} •m= 1

Page 238: Bootstrap Methods and Their Application

2 2 6 5 • Confidence Intervals

Figure 5.4 Actual coverages of percentile (dotted line) and basic bootstrap (dashed line) upper confidence limits for exponential mean when n = 10. Solid line is attained by nested bootstrap confidence limits.

0.0 0.2 0.4 0.6 0.8 1.0

Nominal coverage

Then the M onte Carlo version of (5.55) is

R

^ «(«)} = 1 -r= l

which is to say tha t q(a) is the a quantile o f the uMr. The simplest way to obtain <j(ot) is to order the values uMr into uM{l) < ■ ■ ■ < and thenset q{a) = W hat this am ounts to is tha t the (R + l)a th orderedvalue is read off from a Q-Q plot o f the uMr against quantiles o f the U(0 ,1) distribution, and that ordered value is then used to give the required quantile o f the t* — t. We illustrate this in the next example.

The total num ber o f samples involved in this calculation is R M . Since we always think o f sim ulating as m any as 1000 samples to approxim ate probabilites, here this would suggest as m any as 106 samples overall. The calculations o f Section 4.5 would suggest som ething a bit smaller, say M = 249 to be safe, bu t this is still ra ther impractical. However, there are ways o f greatly reducing the overall num ber o f simulations, two o f which are described in C hapter 9.

Example 5.13 (Kernel density estimate) B ootstrap confidence intervals for the value o f a density raise some awkward issues, which we now discuss, before outlining the use of the nested bootstrap in this context.

The standard kernel estimate o f the P D F f ( y ) given a random sample y u - - - , y n is

Page 239: Bootstrap Methods and Their Application

5.6 ■ Double Bootstrap Methods 227

where w( ) is a symmetric density with mean zero and unit variance, and h is the bandw idth. One source o f difficulty is that if we consider the estim atorto be t(F), as we usually do, then t(F) = h~l f w{h~l (y — x) } f (x )dx is beingestim ated, no t f (y) . The mean and variance of f ( y ; h ) are approxim ately

f ( y ) + j h 2f ' ( y ) , (nh)~lf ( y ) J w2(u)du, (5.57)

for small h and large n. In general one assumes that as n—► o o so h—>0 in such a way that nh—*-oo, and this makes both bias and variance tend to zero as n increases. The density estimate then has the form t„(F), such that t„ (F ) - t ( F ) = f (y) .

Because the variance in (5.57) is approxim ately proportional to the mean, it makes sense to work with the square root o f the estimate. T hat is we take T = {f ( y ; h )}1/2 as estim ator o f 9 = {f ( y )}1/2. By the delta m ethod of Section 2.7.1 we have from (5.57) that the approxim ate mean and variance of T are

{f(y)Y/2 + Uf(yT1/2{h2f"(y)-i2(nhr iK}, (5.58)

where K = f w2(u) du.There remains the problem of choosing h. For point estim ation o f f ( y ) it is

usually suggested, on the grounds o f minimizing m ean squared error, that one take h o c n-1/5. This makes both bias and standard error o f order n~2 5. But there is no reason to do the same for setting confidence intervals, and in fact h o c n-1/5 turns out to be a poor choice, particularly for standard bootstrap m ethods, as we now show.

Suppose that we resample y i , . . . , y ‘ from the E D F F. Then the bootstrap version o f the density estimate, that is

has mean exactly equal to f ( y ’,h); the approxim ate variance is the same as in (5.57) except that f ( y \ h ) replaces f (y ) . It follows that T* = { f ' ( y \ h ) } 1 2 has approxim ate m ean and variance

{ f ( y , h ) } 1/2 - K/ONfc)}-172^ ) -1^ , { (nh) - lK. (5.59)

Now consider the studentized estimates

7 = { f ( y M l l 2 - { f ( y ) Y 12 z< = {r(^}1/2-{/(>>; ft)}1/2i(n /j) - ‘/ 2K i /2 ’ \ ( n h ) ~ ^ K ^

From (5.58) and (5.59) we see that if h o c n“ 1/5, then as n increases

2 = e + { f ( y ) } - l/2K - ' /2{f" (y) - \ K } , Z* = e \

Page 240: Bootstrap Methods and Their Application

228 5 • Confidence Intervals

20 50 100 200 5001000 20 50 100 200 5001000 20 50 100 200 5001000 20 50 100 200 5001000

where both e and s' are N ( 0,1). This means that quantiles o f Z cannot be well approxim ated by quantiles o f Z*, no m atter how large is n. The same thing happens for the untransform ed density estimate.

There are several ways in which we can try to overcome this problem. One o f the simplest is to change h to be o f order «-1/3, when calculations similar to those above show that Z = e and Z* = e*. Figure 5.5 illustrates the effect. Here we estim ate the density a t y = 0 for samples from the N (0,1) distribution, with w(-) the standard norm al density. The first two panels show box plots o f 500 values o f z and z* when h = n~1/s, which is near-optim al for estim ation in this case, for several values o f n; the values o f z* are obtained by resampling from one dataset. The last two panels correspond to h = n~1/3. The figure confirms the key points o f the theory sketched above: that Z is biased away from zero when h = n-1^5, but not when h = n_1/3; and tha t the distributions o f Z and Z ’ are quite stable and similar when h = n-1/3.

Under resampling from F, the studentized bootstrap applied to {/(>’; ^)}1/2 should be consistent if h oc n~1/3. From a practical point o f view this means considerable undersm oothing in the density estimate, relative to standard practice for estimation. A bias in Z o f order n~1/3 or worse will remain, and this suggests a possibly useful role for the double bootstrap.

For a num erical example o f nested bootstrapping in this context we revisit Example 4.18, where we discussed the use o f a kernel density estimate in estim ating species abundance. The estim ated P D F is

f (y .h) = zz

where </>(•) is the standard norm al density, and the value o f interest is / ( 0 ;/i), which is used to estimate /(0 ). In light o f the previous discussion, we base

Figure 5.5 Studentized quantities for density estimation. The left panels show values of Z when h = n~1 5 for 500 standard normal samples of sizes n and 500 bootstrap values for one sample at each n. The right panels show the corresponding values when h = n-1 3.

Page 241: Bootstrap Methods and Their Application

5.6 ■ Double Bootstrap Methods 229

Figure 5.6 Adjusted bootstrap procedure for variance-stabilized density estimate f = {/(0;0.5)}1/2 for the tuna data. The left panel shows the EDF of 1000 values of I* — t.The right panel shows a plot of the ordered u'Mr against quantiles r/(R + 1) of the 1/(0,1) distribution. The dashed line shows how the quantiles of the u are used to obtain improved confidence limits, by using the right panel to read off the estimated coverage q{a) corresponding to the required nominal coverage a, and then using the left panel to read off the q(a) quantile of t* — t.

OLU

<DO)5O>oo■O0)foE

LU

t * - t Nominal coverage

confidence intervals on the variance-stabilized estimate t = { /(0 ;h )}1/2. We also use a value o f h considerably smaller than the value (roughly 1.5) used to estimate / in Example 4.18.

The right panel o f Figure 5.6 shows the quantiles o f the uMr obtained when the double bootstrap bias adjustm ent is applied with R = 1000 and M = 250, for the estimate with bandw idth h = 0.5. If T* — t were an exact pivot, the distribution o f the u would lie along the dotted line, and nom inal and estim ated coverage would be equal. The distribution is close to uniform, confirming our decision to use a variance-stabilized statistic.

The dashed line shows how the distribution o f the u* is used to remove the bias in coverage levels. For an upper confidence limit with nom inal level 1 — a = 0.9, so that a = 0.1, the estim ated level is 4(0-1) = 0.088. The 0.088 quantile o f the values o f tj. — t is t(*gg) — t = —0.091, while the 0.10 quantile is t(*100) — t = —0.085. The corresponding upper 10% confidence limits for f ( 0 ) V 2 are t - (t(*88) - t) = 0.356 - (-0 .091) = 0.447 and t - (t(*100) - t) = 0.356 — (—0.085) = 0.441. For this value o f a the adjustm ent has only a small effect.

Table 5.7 com pares the 95% limits for /(0 ) for different methods, using bandw idth h = 0.5, for which /(0 ;0 .5 ) = 0.127. The longer upper tail for the double bootstrap interval is a result o f adjusting the nom inal a = 0.025 to §(0.025) = 0.004; a t the upper tail we obtain §(0.975) = 0.980. The lower tail o f the interval agrees well with the o ther second-order correct methods.

For larger values o f h the density estimates are higher and the confidence intervals narrower.

Page 242: Bootstrap Methods and Their Application

230 5 • Confidence Intervals

Basic Basic1- Student Student Percentile BCa Double

Upper 0.204 0.240 0.273 0.266 0.218 0.240 0.301Lower 0.036 0.060 0.055 0.058 0.048 0.058 0.058

In Example 9.14 we describe how saddlepoint m ethods can greatly reduce the time taken to perform the double bootstrap in this problem. It might be possible to avoid the difficulties caused by the bias o f the kernel estimate by using a clever resam pling scheme, but it would be more com plicated than the direct approach described above. ■

5.7 Empirical Comparison of Bootstrap MethodsThe several bootstrap confidence limit m ethods can be com pared theoretically on the basis o f first- and second-order accuracy, as in Section 5.4, but this really gives only suggestions as to which m ethods we would expect to be good. The theory needs to be bolstered by num erical comparisons. One rather extreme com parison was described in Example 5.7. In this section we consider one moderately com plicated application, estim ation o f a ratio o f means, and assess through sim ulation the perform ances o f the main bootstrap confidence limit methods. The conclusions appear to agree qualitatively with the results o f other sim ulation studies involving applications o f similar complexity: references to some o f these are given in the bibliographic notes a t the end o f the chapter.

The application here is similar to tha t in Example 5.10, and concerns the ratio o f means for data from two different gam m a distributions. The first sample o f size ni is draw n from a gam m a distribution with mean fi\ = 100 and index 0.7, while the second independent sample o f size n2 is draw n from the gamma distribution with mean n2 = 50 and index 1. The param eter 9 = n i / ( i 2, whose value is 2, is estim ated by the ratio o f sample means t = y \ / y 2. For particular choices o f sample sizes we simulated 10000 datasets and to each applied several o f the nonparam etric bootstrap confidence limit m ethods discussed earlier, always with R = 999. We did not include the double bootstrap method. As a control we added the exact param etric m ethod when the gam m a indexes are known: this turns out not to be a strong control, but it does provide a check on sim ulation validity.

The results quoted here are for two cases, n\ = n2 = 10 and n\ = n2 = 25. In each case we assess the left- and right-tail error rates o f confidence intervals, and their lengths.

Table 5.8 shows the empirical error rates for both cases, as percentages, for nom inal rates between 1% and 10% : sim ulation standard errors are rates

Table 5.7 Upper and lower endpoints of 95% confidence limits for / ( 0) for the tuna data, with bandwidth h = 0.5; t indicates use of square-root transformation.

Page 243: Bootstrap Methods and Their Application

5.8 • Multiparameter Methods 231

Table 5.8 Empirical error rates (%) for nonparametric bootstrap confidence limits in ratio estimation: rates for sample sizes wi = n2 = 10 are given above those for sample sizes «| = «2 = 25.R = 999 for all bootstrap methods. 10000 datasets generated from gamma distributions.

M ethod N om inal erro r rate

Lower limit U p p er lim it

1 2.5 5 10 10 5 2.5 1

Exact 1.0 2.8 5.5 10.5 9.8 4.8 2.6 1.01.0 2.3 4.8 9.9 10.2 4.9 2.5 1.1

N orm al approxim ation 0.1 0.5 1.7 6.3 20.6 15.7 12.5 9.60.1 0.5 2.1 6.4 16.3 11.5 8.2 5.5

Basic 0.0 0.0 0.2 1.8 24.4 21.0 18.6 16.40.0 0.1 0.4 3.0 19.2 15.0 12.5 10.3

Basic, log scale 2.6 4.9 8.1 12.9 13.1 7.5 4.8 2.51.6 3.2 6.0 11.4 11.5 6.3 3.3 1.7

S tudentized 0.6 2.1 4.6 9.9 11.9 6.7 4.0 2.00.8 2.3 4.6 9.9 10.9 5.9 3.0 1.4

Studentized, log scale 1.1 2.8 5.6 10.7 11.6 6.3 3.5 1.71.1 2.5 5.0 10.1 10.8 5.7 2.9 1.3

B ootstrap percentile 1.8 3.6 6.5 11.6 14.6 8.9 5.9 3.31.2 2.6 5.1 10.1 12.6 7.1 4.2 2.1

BCa 1.9 4.0 6.9 12.3 14.0 8.3 5.3 3.01.4 3.0 5.6 10.9 11.8 6.8 3.8 1.9

AB C 1.9 4.2 7.4 12.7 14.6 8.7 5.5 3.11.3 3.0 5.7 11.0 12.1 6.8 3.7 1.9

divided by 100. The norm al approxim ation m ethod uses the delta m ethod variance approxim ation. The results suggest tha t the studentized m ethod gives the best results, provided the log scale is used. Otherwise, the studentized m ethod and the percentile, B C a and A B C m ethods are com parable but only really satisfactory at the larger sample sizes.

Figure 5.7 shows box plots o f the lengths o f 1000 confidence intervals for both sample sizes. The m ost pronounced feature for ni = n2 = 10 is the long— sometimes very long — lengths for the two studentized methods, which helps to account for their good error rates. This feature is far less prom inent a t the larger sample sizes. I t is noticeable tha t the normal, percentile, B C a and A B C intervals are short com pared to the exact ones, and that taking logs improves the basic intervals. Similar comments apply when ni = n2 = 25, but with less force.

5.8 Multiparameter MethodsW hen we w ant a confidence region for a vector param eter, the question o f shape arises. Typically a rectangular region formed from intervals for each com ponent param eter will not have high enough coverage probability, although a Bonferroni argum ent can be used to give a conservative confidence coefficient,

Page 244: Bootstrap Methods and Their Application

232 5 ■ Confidence Intervals

n1= n2= 101000

100

10...... ^ ...................B " " S .......E3........ Et3.......S "

Figure 5.7 Box plots of confidence interval lengths for the first 1000 simulated samples in the numerical experiment with gamma data.

n1=n2=2510

5

2

1

■0....0 .... 0 .....0 .... 6 .... B .... [j.....0 ....0 -

as follows. Suppose tha t 9 has d components, and tha t the confidence region Ca is rectangular, with interval Cxj = (9Lyi, 9Vj) for the ith com ponent 9t. Then

Pr(0 * Ca) = P r ( \ J { 9 t $ ^ Pr(0, ^ Q ,) = ^

say. If we take a, = a /d then the region Ca has coverage at least equal to1 — a. For certain applications this could be useful, in part because o f its simplicity. But there are two potential disadvantages. First, the region could be very conservative — the true coverage could be considerably more than the nom inal 1 — a. Secondly, the rectangular shape could be quite at odds with plausible likelihood contours. This is especially true if the estimates for param eter com ponents are quite highly correlated, when also the Bonferroni m ethod is m ore conservative.

One simple possibility for a jo in t bootstrap confidence region when T is approxim ately norm al is to base it on the quadratic form

Q = ( T - 9 ) t V ~ 1( T - 9 ) , (5.60)

where V is the estim ated variance m atrix o f T. N ote tha t Q is the m ultivariate extension o f the square o f the studentized statistic o f Section 5.2. If Q had exact p quantiles ap, say, then a 1 — a confidence set for 9 would be

{9 : ( T - 9 ) t V ~ 1( T - 9 ) < a ^ } . (5.61)

Page 245: Bootstrap Methods and Their Application

5.8 ■ Multiparameter Methods 233

y and logy are the averages o f the data and the log data.

The elliptical shape o f this set is correct if the distribution o f T has elliptical contours, as the m ultivariate norm al distribution does. So if T is approxim ately m ultivariate normal, then the shape will be approxim ately correct. Moreover, Q will be approxim ately distributed as a y2d variable. But as in the scalar case such distributional approxim ations will often be unreliable, so it makes sense to approxim ate the distribution o f Q, and in particular the required quantiles a i_a, by resampling. The m ethod then becomes completely analogous to the studentized bootstrap m ethod for scalar param eters. The bootstrap analogue o f Q will be

Q’ = ( T , - t ) r F * -1( T * - t ) ,

which will be calculated for each o f R simulated samples. If we denote the ordered bootstrap values by q[ <■■■ < q'R, then the 1 — a bootstrap confidence region is the set

{0 : (t - 9)Tv~l (t - 0) < 5(*R+i)(i-a)}- (5-62)

As in the scalar case, a com m on and useful choice for v is the delta m ethodvariance estim ate v^.

The same m ethod can be applied on any scales which are m onotone trans­form ations o f the original param eter scales. For example, if h(6) has ith com ponent /i,(0;), say, and if d is the diagonal m atrix with elements dhi/d6j evaluated at 0 = t, then we can apply (5.62) with the revised definition

q = {h(t) - h(0)}T(dTvd)~l {h(t) - fe(0)}.

If corresponding ordered bootstrap values are again denoted by q*, then the bootstrap confidence region will be

{0 : {h(t) - h(6)}T(dTvd )- l {h(t) - h(6)} < 9(*r+1Mi_«)}- (5.63)

A particular choice for h(-) would often be based on diagnostic plots ofcom ponents o f t* and v", the objectives being to attain approxim ate norm ality and approxim ately stable variance for each component.

This m ethod will be subject to the same potential defects as the studentized bootstrap m ethod o f Section 5.2. There is no vector analogue o f the adjusted percentile m ethods, but the nested bootstrap m ethod can be applied.

Example 5.14 (Air-conditioning data) For the air-conditioning data o f Exam­ple 1.1, consider setting a confidence region for the two param eters 0 = (ji, k) in a gam m a model. The log likelihood function is

/(,u, k ) = n{K\og{K/ii) ~ logr(jc) + (k - l)logy - Ky/n},

from which we calculate the maxim um likelihood estim ators T = (p,,k). The

Page 246: Bootstrap Methods and Their Application

234 5 ■ Confidence Intervals

numerical values are p. = 108.083 and k = 0.7065. A straightforw ard calcula­tion shows that the delta m ethod variance approxim ation, equal to the inverse o f the expected inform ation m atrix as in Section 5.2, is

vL = n_1d ia g |/c _1/i2, ~ lo g r(£ ) - k_1j . (5.64)

The standard likelihood ratio 1 — a confidence region is the set o f values of (f i , k ) for which

2{/(fi, k) - Z(f i , jc)} < c2,i—«,

where c2,i_« is the 1 — a quantile o f the x l distribution. The top left panel o f Figure 5.8 shows the 0.50, 0.95 and 0.99 confidence regions obtained in this way. The top right panel is the same, except tha t C2,i_a is replaced by a bootstrap estim ate obtained from R = 999 samples simulated from the fitted gam m a model. This second region is somewhat larger than, but o f course has the same shape as, the first.

From the bootstrap sim ulation we have estim ators t" = (£*,£*) from each sample, from which we calculate the corresponding variance approxim ations using (5.64), and hence the quadratic forms q* = ( f — f)r i>2-1 (f* — t). We then apply (5.62) to obtain the studentized bootstrap confidence regions shown in the bottom left panel o f Figure 5.8. This is clearly nothing like the likelihood- based confidence regions above, partly because it fails completely to take account o f the mild skewness in the distribution o f fi and the heavy skewness in the distribution o f k. These features are clear in the histogram plots o f Figure 5.9.

Logarithm ic transform ation o f both fi and k improves m atters considerably: the bottom right panel o f Figure 5.8 comes from applying the studentized bootstrap m ethod after dual logarithm ic transform ation. Nevertheless, the solution is not completely satisfactory, in tha t the region is too wide on the k

axis and slightly narrow on the f i axis. This could be predicted to some extent by plotting v'L versus f*, which shows that the log transform ation o f k is not quite strong enough. Perhaps m ore im portant is tha t there is a substantial bias in k: the bootstrap bias estimate is 0.18.

One lesson from this example is tha t where a likelihood is available and usable, it should be used — with param etric sim ulation to check on, and if necessary replace, standard approxim ations for quantiles o f the log likelihood ratio statistic. ■

Example 5.15 (Laterite data) The data in Table 5.9 are axial da ta consisting o f 50 pole positions, in degrees o f latitude and longitude, from a palaeo- magnetic study o f New Caledonian laterites. The data take values only in the lower unit half-sphere, because an axis is determined by a single pole.

Page 247: Bootstrap Methods and Their Application

5.8 ■ Multiparameter Methods 235

Figure 5.8 Bootstrap confidence regions for the parameters /*, k of a gamma model for the air-conditioning data, with levels 0.50, 0.95 and 0.99. Top left: likelihood ratio region with x\ quantiles; top right: likelihood ratio region with bootstrap quantiles; bottom left: studentized bootstrap on original scales; bottom right: studentized bootstrap on logarithmic scales.R = 999 bootstrap samples from fitted gamma model with ft = 108.083 and k = 0.7065. + denotes MLE.

(0Q.Q.(0

Q.Q.<oJ*

<0Q.Q.CO

mu mu

m u

Let Y denote a unit vector on the lower half-sphere with cartesian coordi­nates (c o sX c o sZ ,c o sX s in Z ,s in X )T, where X and Z are degrees o f latitude and longitude. The population quantity o f interest is the mean polar axis, a(6,0) = (cos 8 cos 0 , cos 9 sin <j), sin 6)T, defined as the axis given by the eigen­vector corresponding to the largest eigenvalue o f E (7 Y T ). The sample value of this is given by the corresponding eigenvector o f the m atrix n-1 y j y f , where y/ is the vector o f cartesian coordinates o f the j th pole position. The sample

A A

m ean polar axis has latitude 9 = —76.3 and longitude (f> = 83.8. Figure 5.10 shows the original da ta in an equal-area projection onto a plane tangential to the South Pole, at 9 = —90°; the hollow circle represents the sample mean polar axis.

Page 248: Bootstrap Methods and Their Application

236 5 ■ Confidence Intervals

C\J

oo

cooo

soo

oo

50 100 150 200 250 300

mu

ino

oo I I I i i i i—. i—i i—

0.5 1.0 1.5 2.0 2.5 3.0

kappa

Lat Long Lat Long Lat Long Lat Long

-26.4 324.0 -52.1 83.2 -80.5 108.4 -74.3 90.2-32.2 163.7 -77.3 182.1 -77.7 266.0 -81.0 170.9-73.1 51.9 -68.8 110.4 -6.9 19.1 -12.7 199.4-80.2 140.5 -68.4 142.2 -59.4 281.7 -75.4 118.6-71.1 267.2 -29.2 246.3 -5 .6 107.4 -85.9 63.7-58.7 32.0 -78.5 222.6 -62.6 105.3 -84.8 74.9-40.8 28.1 -65.4 247.7 -74.7 120.2 -7 .4 93.8-14.9 266.3 -49.0 65.6 -65.3 286.6 -29.8 72.8-66.1 144.3 -67.0 282.6 -71.6 106.4 -85.2 113.2

-1.8 256.2 -56.7 56.2 -23.3 96.5 -53.1 51.5-38.3 146.8 -72.7 103.1 -60.2 33.2 -63.4 154.8-17.2 89.9 -81.6 295.6 -40.4 41.0-56.2 35.6 -75.1 70.7 -53.6 59.1

In order to set a confidence region for the mean polar axis, or equivalently (6, <f>), we let

b(6, <£) = (sin 6 cos (j), sin 9 sin 0 , — cos d)T, c(0 ,0 ) = (— sin <j>t — cos <j>, 0)T

denote the unit vectors orthogonal to a(0, </>). The sample values o f these vectors are 2, b and c, and the sample eigenvalues are 1\ < %2 < ^3- Let A denote the 2 x 3 m atrix (S,c)r and B the 2 x 2 m atrix with {j ,k) th element

------— n~ l y ^ ( b Tyj)(cTyj)(aTyj)2.

Figure 5.9 Histograms of ft and k* from R = 999 bootstrap samples from gamma model with p. = 108.083 and ic = 0.7065, fitted to air-conditioning data.

Table 5.9 Latitude (°) and longitude (°) of pole positions determined from the paleomagnetic study of New Caledonian laterites (Fisher et a/., 1987, p. 278).

Page 249: Bootstrap Methods and Their Application

5.8 • Multiparameter Methods 237

Figure 5.10 Equal-area projection of the laterite data onto the plane tangential to the South Pole (+). The sample mean polar axis is the hollow circle, and the square region is for comparison with Figures 5.11 and 10.3.

90

Then the analogue o f (5.60) is

Q = na(9,(j>)T A T J3_1/la(0, <>), (5.65)

which is approxim ately distributed as a y\ variable in large samples. In the bootstrap analogue o f Q, a is replaced by a, and A and B are replaced by the corresponding quantities calculated from the bootstrap sample.

Figure 5.11 shows results from setting confidence regions for the mean polar axis based on Q. The panels show the 0.5, 0.95 and 0.99 contours, using x\ quantiles and those based on R = 999 nonparam etric bootstrap replicates q". The contours are elliptical in this projection. For this sample size it would not be misleading to use the asym ptotic 0.5 and 0.95 quantiles, though the 0.99 quantiles differ by more. However, simulations with a random subset of size n — 20 gave dram atically different quantiles, and it seems to be essential to use the bootstrap quantiles for smaller sample sizes.

A different approach is to set T = (6, (j>)T, and then to base a confidence region for (d,4>) on (5.60), with V taken to be nonparam etric delta m ethod estimate o f the covariance matrix. This approach does not take into account the geometry o f spherical da ta and works very poorly in this example, partly because the estimate t is close to the South Pole, which limits the range o f ().

Page 250: Bootstrap Methods and Their Application

238 5 * Confidence Intervals

Figure 5.U The 0.5, 0.95, and 0.99 confidence regions for the mean polar axis of the laterite data based on (5.65), using x\ quantiles (left) and bootstrap quantiles (right). The boundary of each panel is the square region in Figure 5.10; also shown are the South Pole (+) and the sample mean polar axis <°).

5.9 Conditional Confidence Regions

In param etric inference the probability calculations for confidence regions should in principle be m ade conditional on the ancillary statistics for the model, when these exist, the basic reason being to ensure tha t the inference accounts for the actual inform ation content in the observed data. In param etric models what is ancillary is often specific to the m athem atical form o f F, and there is no nonparam etric analogue. However, there are situations where there is a model-free ancillary indicator o f the experiment, as with the design o f a regression experim ent (C hapter 6). In fact there is such an indicator in one o f our earlier examples, and we now use this to illustrate some o f the points which arise with conditional bootstrap confidence intervals.

Example 5.16 (City population data) For the ra tio estim ation problem of Example 1.2, the statistic d = u would often be regarded as ancillary. The reason rests in part on the notion o f a model for linear regression o f x on u with variation proportional to u. The left panel o f Figure 5.12 shows the scatter plot o f t* versus d" for the R = 999 nonparam etric bootstrap samples used earlier. The observed value o f d is 103.1. The middle and right panels o f the figure show trends in the conditional mean and variance, E*(T* | d') and var*(T ’ | d"), these being approxim ated by crude local averaging in the scatter plot on the left.

The calculation o f confidence limits for the ratio 6 = E(AT)/E(l/) is to be m ade conditional on d* = d, the observed mean o f u. Suppose, for example, that we want to apply the basic bootstrap m ethod. Then we need to approxim ate the conditional quantiles ap(d) o f T — 6 given D = d for p = a and 1 — a, and

Page 251: Bootstrap Methods and Their Application

5.9 ■ Conditional Confidence Regions 239

Figure 5.12 City population data, n = 49. Scatter plot of bootstrap ratio estimates t* versus d*, and conditional means and variances of t* given d*. R = 999 nonparametric samples.

Table 5.10 City population data, n = 49. Comparison of unconditional and conditional cumulative probabilities for bootstrap ratio T*.R = 9999 nonparametric samples, Rj = 499 used for conditional probabilities.

. ... 5 \

91001

\

■ l i f t ; , .

5

1 5

<3 V vaitH

0.00

08

0.00

10

0.00

12

0.00

14

(

V80 100 120 140 160 80 90 100 110 120 130 80 90 100 110 120 130

6’ d ' <S'

U nconditional 0.010 0.025 0.050 0.100 0.900 0.950 0.975 0.990C onditional 0.006 0.020 0.044 0.078 0.940 0.974 0.988 1.000

use these in (5.3). The bootstrap estimate o f ap(d) is the value ap(d) defined by

Pr{T* — t < ap(d) \ D* = d} = p,

and the simplest way to use our simulated samples to approxim ate this is to use only those samples for which d* is “near” d. For example, we could take the Ri = 99 samples whose d* values are closest to d and approxim ate ap(d) by the lOOpth ordered value o f t* in those samples.

Certainly stratification o f the sim ulation results by intervals o f d* values shows quite strong conditional effects, as evidenced in Figure 5.12. The difficulty is tha t Rj = 99 samples is not enough to obtain good estimates o f conditional quantiles, and certainly not to distinguish between unconditional quantiles and the conditional quantiles given d' = d, which is near the mean. Only with an increase o f R to 9999, and using strata of Rd = 499 samples, does a clear picture emerge. Figure 5.13 shows plots o f conditional quantile estimates from this larger simulation.

How different are the conditional and unconditional distributions? Table 5.10 shows bootstrap estimates o f the cumulative conditional probabilities Pr( T < ap | D = d), where ap is the unconditional p quantile, for several values o f p. Each estim ate is the proportion o f times in Rd = 499 samples tha t t" is less than or equal to the unconditional quantile estimate £(’10ooop)- The com parison suggests tha t conditioning does not have a large effect in this case.

A more efficient use o f bootstrap samples, which takes advantage o f the smoothness o f quantiles as a function o f d, is to estim ate quantiles for interval stra ta o f Rd samples and then for each level p to fit a sm ooth curve. For example, if the k th such stratum gives quantile estimates ap# and average

Page 252: Bootstrap Methods and Their Application

240 5 * Confidence Intervals

Figure 5.13 City population data, n = 49. Conditional 0.025 and 0.975 quantiles of bootstrap ratio t* from R = 9999 samples, with strata of size Rj = 499. The horizonal dotted lines are unconditional quantiles, and the vertical dotted line is at d' =d.

Ancillary d*

Figure 5.14 City population data, n = 49. Smooth spline fits to 0.025 and 0.975 conditional quantiles of bootstrap ratio t* from R = 9999 samples, using overlapping strata of size Rj = 199.

Ancillary d*

value dk for d', then we can fit a sm oothing spline to the points (dk, (ip^) for each p and interpolate the required value ap(d) at the observed d. Figure 5.14 illustrates this for R = 9999 and non-overlapping s tra ta o f size R^ = 199, with p = 0.025 and 0.975. N ote that interpolation is only needed at the centre o f the curve. Use o f non-overlapping intervals seems to give the best results. ■

An alternative sm oothing m ethod is described in Problem 5.16. In C hapter 9 we shall see tha t in some cases, including the preceding example, it is possible to get accurate approxim ations to conditional quantiles using theoretical methods.

Page 253: Bootstrap Methods and Their Application

5.9 ■ Conditional Confidence Regions 241

Figure 5.15 Annual discharge of River Nile at Aswan, 1871-1970 (Cobb, 1978).

oo

ooCM

<D

E_2o>

ooo

oo00

ooCO

' l\ ./ h m . i r

•%

'■ : j* .♦ • • •• • T • : • ;* j! : k * !■ :•;* •

’ • M M ii m #i I * ?! iii • . . * M i 1/1

* ii \Mi ; i\i * * \i. ** r ; '»■ U i \ . • *• !:* * . *•

i

1880 1900 1920 1940 1960

Year

Just as with unconditional analysis, so with conditional analysis there is a choice o f bootstrap confidence interval methods. From our earlier discussion the studentized bootstrap and adjusted percentile m ethods are likely to work best for statistics that are approxim ately norm al, as in the previous example. The adjusted percentile m ethod requires constants a, v i and w, all o f which m ust now be conditional; see Problem 5.17. The studentized bootstrap m ethod can be applied as before with Z = (T — 0 ) / F 1/2, except that now conditional quantiles will be needed. Some simplification may occur if it is possible to standardize with a conditional standard error.

The next example illustrates another way o f overcoming the paucity of bootstrap samples which satisfy the conditioning constraint.

Example 5.17 (Nile data) The data plotted in Figure 5.15 are annual dis­charges y o f the River Nile at Aswan from 1871 to 1970. Interest lies in the year 1870+0 in which the m ean discharge drops from n\ — H 00 to H2 = 870; these mean values are estimated, bu t it is reasonable to ignore this fact and we shall do so.

The least squares estim ate o f the integer 0 maximizes

eS(0) = ^ { > 7 “ 3 ^ i+ w )} -

j= i

S tandard norm al-theory likelihood analysis suggests tha t differences in S(6) for 0 near 0 are ancillary statistics. We shall reduce these differences to two particular statistics which measure skewness and curvature o f S( ) near 0,

Page 254: Bootstrap Methods and Their Application

242 5 ■ Confidence Intervals

c* b'

-0.62 -0.37 -0.17 0 0.17 0.37 0.62 0.87 2.45

1.64 59 52 53 71 68 62 50 53 _2.44 62 88 81 83 79 82 68 81 50

4.62 .. 92 84 93 93 95 97 87 93 764.87 91 91 91 95 89 92 92 95 765.12 .. 92 96 100 95 86 97 100 97 815.49 97 96 89 98 96 95 97 96 856.06 94 100 100 100 97 96 95 95 866.94 93 100 100 100 100 100 100 100 100

namely

B = S(d + 5) - S(6 - 5 ) , C = S(0 + 5) - 2S(0) + S(0 - 5);

for numerical convenience we rescale B and C by 0.0032. It is expected that B and C respectively influence the bias and variablity o f 0. We are interested in the conditional confidence that should be attached to the set 0 + 1, that is

Pr(|0 — 0| < 1 | b,c).

The data analysis gives 0 = 28 (year 1898), b = 0.75 and c = 5.5.W ith no assum ption on the shape o f the distribution of Y , except tha t it is

constant, the obvious bootstrap sampling scheme is as follows. First calculate the residuals ej = Xj — f i u j = 1 ,. . . ,2 8 and e; = x j — fi2, j = 2 9 ,. . . , 100. Then simulate da ta series by x ' = m + e ’, j = 1 ,. . . ,2 8 and x* = n2 + s ) , j =29.......100, where e’ is random ly sampled from eioo- Each such sampleseries then gives 0*,fr* and c*.

From R = 10 000 bootstrap samples we find tha t the proportion o f samplesA A

with 16 — 9\ < 1 is 0.862, which is the unconditional bootstrap confidence. But when these samples are partitioned according to b* and c”, strong effects show up. Table 5.11 shows part o f the table o f proportions for outcom e 10* — 01 < 1 for a 16 x 15 partition, 201 o f these partitions being non-em pty and m ost o f them having at least 50 bootstrap samples. The proportions are consistently higher than 0.95 for (b ' ,c ' ) near (b,c), which strongly suggests that the conditional confidence Pr(|0 — 0| < 1 | b = 0.75, c = 5.5) exceeds 0.95.

The conditional probability Pr(|0 — 0| < 1 | b,c) will be sm ooth in b and c, so it makes sense to assume that the estimate

Table 5.11 Nile data. Part of the table of proportions (%) of bootstrap samples for which 10" —§ | ^ 1, for interval values of b' and c*. R = 10000 samples.

p(b’,c*) = Pr*(|0* — 0| < 1 | 6*,c’)

Page 255: Bootstrap Methods and Their Application

5.10 • Prediction 243

is sm ooth in b ' ,c ' . We fitted a logistic regression to the proportions in the 201 non-em pty cells o f the complete version o f Table 5.11, the result being

logit p(b* , c ) = —0.51 — 0.20b’2 + 0.68c*.

The residual deviance is 223 on 198 degrees o f freedom, which indicates an adequate fit for this simple model. The conditional bootstrap confidence is the fitted value o f p a t b' = b, c* = c, which is 0.972 with standard error 0.009. So the conditional confidence attached to 6 = 28 + 1 is much higher than the unconditional value.

The value o f the standard error for the fitted value corresponds to a binomial standard error for a sample o f size 3500, or 35% o f the whole bootstrap simu­lation, which indicates high efficiency for this m ethod o f estim ating conditional probability. ■

5.10 PredictionClosely related to confidence regions for param eters are confidence regions for future outcom es o f the response Y, more usually called prediction regions. A pplications are typically in more com plicated contexts involving regression models (Chapters 6 and 7) and time series models (C hapter 8), so here we give only a brief discussion o f the main ideas.

In the simplest situation we are concerned with prediction o f one future response Yn+l given observations y \ , . . . , y n from a distribution F. The ideal upper y prediction limit is the y quantile o f F, which we denote by ay(F). The simplest approach to calculating a prediction limit is the plug-in approach, that is substituting the estimate F for F to give ay = ay(F). But this is clearly biased in the optimistic direction, because it does not allow for the uncertainty in F. Resam pling is used to correct for, or remove, this bias.

Parametric caseSuppose first tha t we have a fully param etric model, F = Fg, say. Then the prediction limit ay(F) can be expressed m ore directly as ay(9). The true coverage o f this limit over repetitions o f both data and predictand will not generally be y, but rather

Pr{7n+i < ay(6) \ 6} = h(y), (5.66)

say, where h(-) is unknow n except that it must be increasing. (The coverage also depends on 6 in general, but we suppress this from the notation for simplicity.) The idea is to estimate h(-) by resampling. So, for data Y J , . . . , Y * and predictand Yn*+1 all sampled from F = Fg, we estimate (5.66) by

Mv) = Pr*{y„*+1 < a y(d')}, (5.67)

Page 256: Bootstrap Methods and Their Application

244 5 • Confidence Intervals

where as usual O' is the estim ator calculated for da ta Y Y ‘. In practice itwould usually be necessary to use R simulated repetitions o f the sampling and approxim ate (5.67) by

Example 5.18 (Normal prediction limit) Suppose tha t Y^,..., Y„+i are inde­pendently sampled from the N(/i, cr2) distribution, where fi and a are unknown, and that we wish to predict Yn+\ having observed yi , . . . ,y„- The plug-in m ethod gives the basic y prediction limit

where y„ = n 1 ^ yj and s2 = n 1 — -V)2- ^ we write Yj = n + ere,-, so thatthe Ej are independent JV(0,1), then (5.66) becomes

where Z„_i has the Student-f distribution with n — 1 degrees o f freedom. This leads directly to the Student-f prediction limit

where /c„_i,y is the y quantile o f the Student-t distribution with n — 1 degrees o f freedom.

In this particular case, then, h( ) does not need to be estimated. But if we had not recognized the occurrence o f the Student-f distribution, then the first probability in (5.69) would have been estim ated by applying (5.68) with samples generated from the N( yn, s2) distribution. Such an estim ate (corresponding to infinite R) is plotted in Figure 5.16 for sample size n = 10. The plot has logit scales to emphasize the discrepancy between h(y) and y. Given values o f the estimate h{y), a sm ooth curve can be obtained by quadratic regression o f their logits on logits o f y; this is illustrated in the figure, where the solid line is the regression fit. The required value g(y) can be read off from the curve. ■

The preceding example suggests a more direct m ethod for special cases involving means, which makes use o f a point prediction y n+\ and the distribu­tion o f prediction error Yn+l — y„+1: resampling can be used to estimate this distribution directly. This m ethod will be applied to linear regression models in Section 6.3.3.

(5.68)

Once h(y) has been calculated, the adjusted y prediction limit is taken to be

at<7) = ag(y)(h where

Hgiv)} = 7-

aY = y„ + s„<l> ‘(y),

e„ is the average of

Page 257: Bootstrap Methods and Their Application

5.10 ■ Prediction 245

Figure 5.16Adjustment function /i(y) for prediction with sample size n = 10 from N(n,cr2), with quadratic logistic fit (solid), and line giving /i(y) = y (dots).

Logit of gamma

Nonparametric case

Now consider the nonparam etric context, where F is the E D F o f a single sample. The calculations outlined for the param etric case apply here also. First, if r /n < y < (r + 1 )/n then the plug-in prediction limit is ay(F) = y(r)\ equivalently, ay(F) = y([ny\), where [■] means integer part. Straightforw ard calculation shows that

Pr(Y„+1 < yw ) = r / ( n + l ) ,

which means tha t (5.66) becomes h(y) = [ny]/(n+1). Therefore [ng(y)]/(n+ l) = y, so that the adjusted prediction limit is y ( [ (n+ i ) v ] ) : this is exact if (n + l ) y is an integer.

It seems intuitively clear that the efficiency o f this nonparam etric prediction limit relative to a param etric prediction limit would be considerably lower than would be the case for confidence limits on a param eter. For example, a com parison between the norm al-theory and nonparam etric m ethods for samples from a norm al distribution shows the efficiency to be about j for a = 0.05.

For sem iparam etric problems similar calculations apply. One general ap­proach which makes sense in certain applications, as m entioned earlier, bases prediction limits on point predictions, and uses resampling to estimate the distribution o f prediction error. For further details see Sections 6.3.3 and7.2.4.

Page 258: Bootstrap Methods and Their Application

246 J • Confidence Intervals

5.11 Bibliographic NotesStandard m ethods for obtaining confidence intervals are described in C hap­ters 7 and 9 o f Cox and Hinkley (1974), while more recent developments in likelihood-based m ethods are outlined by Barndorff-Nielsen and Cox (1994). Corresponding m ethods based on resample likelihoods are described in C hap­ter 10.

B ootstrap confidence intervals were introduced in the original bootstrap paper by Efron (1979); bias adjustm ent and studentizing were discussed by Efron (1981b). The adjusted percentile m ethod was developed by Efron (1987), who gives detailed discussion o f the bias and skewness adjustm ent factors b and a. In part this development responded to issues raised by Schenker (1985). The A B C m ethod and its theoretical justification were laid out by DiCiccio and Efron (1992). Hall (1988a, 1992a) contain rigorous developments o f the second-order com parisons between com peting methods, including the studentized bootstrap m ethods, and give references to earlier work dating back to Singh (1981). DiCiccio and Efron (1996) give an excellent review o f the B C a and A B C m ethods, together with their asym ptotic properties and com parisons to likelihood-based methods. A n earlier review, with discussion, was given by DiCiccio and R om ano (1988).

O ther empirical com parisons o f the accuracy o f bootstrap confidence interval m ethods are described in Section 4.4.4 o f Shao and Tu (1995), while Lee and Young (1995) make com parisons with iterated bootstrap methods. Their conclusions and those o f Canty, D avison and Hinkley (1996) broadly agree with those reached here.

Tibshirani (1988) discussed empirical choice o f a variance-stabilizing trans­form ation for use with the studentized bootstrap method.

Choice o f sim ulation size R is investigated in detail by Hall (1986). See also the related references for C hapter 4 concerning choice o f R to m aintain high test power.

The significance test m ethod has been studied by K abaila (1993a) and discussed in detail by C arpenter (1996). Buckland and G arthw aite (1990) and G arth waite and Buckland (1992) describe an efficient algorithm to find confidence limits in this context. The particular application discussed in Exam ­ple 5.11 is a modified version o f Jennison (1992). One intriguing application, to phylogenetic trees, is described by Efron, H alloran and Holmes (1996).

The double bootstrap m ethod o f adjustm ent in Section 5.6 is similar to that developed by Beran (1987) and Hinkley and Shi (1989); see also Loh (1987). The m ethod is sometimes called bootstrap calibration. Hall and M artin (1988) give a detailed analysis o f the reduction in coverage error. Lee and Young (1995) provide an efficient algorithm for approxim ating the m ethod w ithout sim ulation when the param eter is a sm ooth function o f means. Booth and Hall

Page 259: Bootstrap Methods and Their Application

5.12 • Problems 247

s2 is the usual sample variance of .........y„.

(1994) discuss the num bers o f samples required when the nested bootstrap is used to calibrate a confidence interval.

Conditional m ethods have received little attention in the literature. Exam ­ple 5.17 is taken from Hinkley and Schechtm an (1987). Booth, H all and W ood (1992) describe kernel m ethods for estim ating the conditional distribution of a bootstrap statistic.

Confidence regions for vector param eters are alm ost untouched in the lit­erature. There are no general analogues o f adjusted percentile methods. Hall (1987) discusses likelihood-based shapes for confidence regions.

Geisser (1993) surveys several approaches to calculating prediction intervals, including resam pling m ethods such as cross-validation.

References to confidence interval and prediction interval m ethods for regres­sion models are given in the notes for C hapters 6 and 7; see also C hapter 8 for time series.

5.12 Problems1 Suppose that we have a random sample from a distribution F whose

mean is unknown but whose variance is known and equal to a 1. Discuss possi­ble nonparametric resampling methods for obtaining confidence intervals for ^, including the following: (i) use z = Jn ( y — n ) /a and resample from the EDF; (ii) use z = J n ( y — fi)/s and resample from the EDF; (iii) as in (ii) but replace the E D F o f the data by the E D F o f values y + a(yi — y) / s; (iv) as in (ii) but replace the EDF by a distribution on the data values whose mean and variance are y and a 2.

2 Suppose that 9 is the correlation coefficient for a bivariate distribution. If this distribution is bivariate normal, show that the MLE 9 is approximately N(9, ( 1 — 92)2/n). Use the delta method to show that the transformed correlation parameter f for which fj is approximately N(0,n ') is ( = | log{(l + 9)/ ( 1 — 0)}.Compare the use o f normal approximations for 9 and f with use o f a parametric bootstrap analysis to obtain confidence intervals for 9: see Practical 5.1.(Section 5.2)

3 Independent measurements y i , . . . , y n come from a distribution with range [0,0], Suppose that we resample by taking samples o f size m from the data, and base confidence intervals on Q = m{t — T' )/ t , where T ’ = m a x { . . . , Ym’ }. Show that this works provided that m /n—>0 as n—»oo, and use simulation to check its performance when n = 100 and Y has the (7(0,0) distribution.(Sections 2.6.1, 5.2)

4 The gamma model (1.1) with mean /i and index k can be applied to the data o f Example 1.1. For this model, show that the profile log likelihood for pt is

prof(M) = nk„ log (kft/fi) + (k„ - 1) Y 2 lo8 JO ~ ^ Y I Vi/t1 ~ nlog r ^ ’

where kh is the solution to the estimating equation

n log(K/n) + n + log yj - y j / f i - mp ( K) = 0,

Page 260: Bootstrap Methods and Their Application

248 5 • Confidence Intervals

with tp(fc) the derivative o f logr(K).Describe an algorithm for simulating the distribution o f the log likelihood ratio statistic W(p) = 2{<fprof(/i) — <fprof(^)}, where p. is the overall maximum likelihood estimate.(Section 5.2)

5 Consider simulation to estimate the distribution o f Z = (T — 6 ) / V </2, using R independent replicates with ordered values z[ < ■■■ < z ’R, where z" = (t’ —t ) / v ' 1/2 is based on nonparametric bootstrapping o f a sample yi , . . . , y„ . Let a = (r+ 1 ) / ( R + 1), so that a one-sided confidence interval for 6 with nominal coverage a is I r = [ t - v l/2z'r+l,co).(a) Show that

Pr'(0 € I r | F) = Pr*(z < Z r'+1) = £ ( R ) f ( l - p f ~ s,5 = 0 W

where p = p(F) = Pr'(Z ‘ < z \ F). Let P be the random variable corresponding to p(F), with C D F G( ). Hence show that the unconditional probability is

Pr(0 6 Ir) = J 2 ( * ) f o “S(1 - u)R~s dG(u).

Note that Pr(P < a) = Pr{0 6 [T — 7 1/2Z a',oo)}, where Z a* is the a quantile o f the distribution o f Z ', conditional on Y i , . . . , Y n.(b) Suppose that it is reasonable to approximate the distribution o f P by the beta distribution with density wa l (1 — u)b~l / B(a,b), 0 < u < 1; note that a, b—>\ as n—► oo. For some representative values o f R, a, a and b, compare the coverage error o f I , with that o f the interval [T — V 1/2Z ’,oo).(Section 5.2.3; Hall, 1986)

6 Capability or precision indices are used to indicate whether a process satisfies a specification o f form (L, U ), where L and U are the lower and upper specification limits. If the process is “in control”, observations y i , . . . , y„ on it are taken to have mean p and standard deviation a. Two basic capability indices are then 9 = (U — L)/a and t] = 2m in {([/ — p)/a,(p — L)/a], with precision regarded as low if 9 < 6, medium if 6 < 6 < 8, and high if 6 > 8, and similarly for r\, which is intended to be sensitive to the possibility that p ^ j ( L + U ) . Estimates o f 9 and r\ are obtained by replacing p and a with sample estimates, such as

(i) the usual estimates p = y = n~l Y , y j and a = {(« — I)-1 Y ( y j ~ y)2}1/2;(ii) p — y and a — rk/dk, where rk = b~' Y l rKi and r/y is the range max yj — min yj o f the ith block o f k observations, namely yk(i-i)+i, ■ ■ ■, yki, where n = kb. Here du is a scaling factor chosen so that rk estimates a.

(a) When estimates (i) are used, and the are independent N ( p , a 2) variables, show that an exact (1 — 2a) confidence interval for 8 has endpoints

s{ ^ f • where c„(a) is the a quantile o f the x, distribution.(b) With the set-up in (a), suppose that parametric simulation from the fitted normal distribution is used to generate replicate values 9’1, . . . , 8 ‘R o f 6. Show that for R = oo , the true coverage o f the percentile confidence interval with nominal

Page 261: Bootstrap Methods and Their Application

5.12 ■ Problems 249

coverage (1 — 2a) is

( n - l ) 2 _ 2 „ ( n - l ) 2Pr i < t n_x <

i. ^n—1,1— a Cn— l,a

where C has the x l - 1 distribution. Give also the coverages o f the basic bootstrap confidence intervals based on 9 and log 6.Calculate these coverages for n = 25, 50, 75 and a = 0.05, 0.025, and 0.005. Which o f these intervals is preferable?(c) See Practical 5.4, in which we take d5 = 2.236.(Section 5.3.1)

7 Suppose that we have a parametric model with parameter vector tp, and that9 = h(xp) is the parameter o f interest. The adjusted percentile (BCa) method is found by applying the scalar parameter method to the least-favourable family, for which the log likelihood <f(ip) is replaced by / l f ( 0 = £($>+£&), with S = i~l {rp)h(y)) and h( ) is the vector o f partial derivatives. Equations (5.21), (5.22) and (5.24) still apply.Show in detail how to apply this extension o f the B C a method to the problem o f calculating confidence intervals for the ratio 9 = Hi/\i\ o f the means o f two exponential distributions, given independent samples from those distributions. Use a numerical example (such as Example 5.10) to compare the BCa method to theexact method, which is based on the fact that 9 /9 has an F distribution.(Sections 5.3.2, 5.4.2; Efron, 1987)

8 For the ratio o f independent means in Example 5.10, show that the matrix o f second derivatives ii{n) has elements

n2t 12(yu - y i X y i j ~ y \ )uu,ij — ~ ^ r \ --------- =--------------- h (yu — y 0 + (y\j — y\)

njyi I y i

n2uu.2j = r j {(yi,- - yi )(yn - h)}, n\n2y {

and

f t 2 _

“2i,2j = — j~i(y2‘ ~ fo) + (yy ~ ^)}- n2y i

Use these results to check the value o f the constant c used in the A B C method in that example.

For the data o f Example 1.2 we are interested in the ratio o f means 9 = E (X) /E(U) . Define /j. = (E((7), E(X))T and write 9 = t(n), which is estimated by t = t(s) with s = («, 5c)t . Show that

' - h 2/ h \ \ -i = ( l -Hi /Hl — 1/Mi. ___ ( I*2// l \” V l /in ) ’Vm r \ - i / f i i o

From Problem 2.16 we have lj = e j / u with = x; — tUj. Derive expressions for the constants a, b and c in the nonparametric A B C method, and note that b = cv1/ 2-

Page 262: Bootstrap Methods and Their Application

250 5 ■ Confidence Intervals

Hence show that the A B C confidence limit is given by

~ = x + d„ Y X j e j / ( n 2v l / 2u)

u + d x Y l u j e j / ( n 2v l / 2u) ’

where da = (a + z« )/{ l - a(a + za)}2.Apply this result to the full dataset with n = 49, for which u = 103.14, x = 127.80, t = 1.239, vL = 0.0119, and a = 0.0205.(Section 5.4.2)

10 Suppose that the parameter 9 is estimated by solving the monotone estimating equation SY(0) = 0, with unique solution T. I f the random variable c (Y ,9 ) has (approximately or exactly) the known, continuous distribution function G, and if U ~ G, then define t v to be the solution to c(y, t v ) = U for a fixed observation vector y. Show that for suitable A, t — tu = —A ~ lc (Y ,9 ) has roughly the same distribution as — A~l U = —A ' ' c ( y , t u ) = T — 9, and deduce that the distributions o f t — tv and T — 9 are roughly the same.The distribution o f t — tu can be approximated by simulation, and this provides a way to approximate the distribution o f T — 6. Comment critically on this resampling confidence limit method.(Parzen, Wei and Ying, 1994)

11 Consider deriving an upper confidence limit for θ by test inversion. If T is an estimator for θ, and S is an estimator for the nuisance parameter λ, and if var(T | θ, λ) = σ²(θ, λ), then define Z = (T − θ₀)/σ(θ₀, S). Show that an exact upper 1 − α confidence limit is u₁₋α = u₁₋α(t, s, λ), which satisfies Pr(θ ≤ U₁₋α | θ, λ) = 1 − α.
The bootstrap confidence limit is û₁₋α = u₁₋α(t, s, s). Show that if S is a consistent estimator for λ (that is, S = λ + o_p(1) as n → ∞), then the method is consistent in the sense that Pr(θ ≤ û₁₋α) = 1 − α + o(1). Further show that under certain conditions the coverage differs from 1 − α by O(n⁻¹).
(Section 5.5; Kabaila, 1993a; Carpenter, 1996)

12 The normal approximation method for an upper 1 − α confidence limit gives θ̂ + z₁₋α v^{1/2}. Show that bootstrap adjustment of the nominal level 1 − α in z₁₋α leads to the studentized bootstrap method.
(Section 5.6; Beran, 1987)

13 The bootstrap method of adjustment can be applied to the percentile method. Show that the analogue of (5.55) is

Pr*{ Pr**(T** ≤ t | F̂*) ≤ 1 − q(α) | F̂ } = 1 − α.

The adjusted 1 − α upper confidence limit is then the 1 − q(α) quantile of T*.
In the parametric bootstrap analysis for a single exponential mean, show that the percentile method gives upper 1 − α limit ȳ c_{2n,1−α}/(2n), where c_{ν,α} denotes the α quantile of the χ²_ν distribution. Verify that the bootstrap adjustment of this limit gives the exact upper 1 − α limit 2nȳ/c_{2n,α}.
(Section 5.6; Beran, 1987; Hinkley and Shi, 1989)

14 Show how to make a bootstrap adjustment of the studentized bootstrap confidence limit method for a scalar parameter.
(Section 5.6)


15 For an equi-tailed (1 − 2α) confidence interval, the ideal endpoints are t + β with values of β solving (3.31) with

h(F̂, F; β) = I{t(F̂) − t(F) ≤ β} − α,   h(F̂, F; β) = I{t(F̂) − t(F) ≤ β} − (1 − α).

Suppose that the bootstrap solutions are denoted by β̂_α and β̂_{1−α}, and that in the language of Section 3.9.1 the adjustments b(F̂, γ) are β̂_{α+γ₁} and β̂_{1−α+γ₂}. Show how to estimate γ₁ and γ₂, and verify that these adjustments modify coverage 1 − 2α + O(n⁻¹) to 1 − 2α + O(n⁻²).
(Sections 3.9.1, 5.6; Hall and Martin, 1988)

16 Suppose that D is an approximate ancillary statistic and that we want to estimate the conditional probability G(u | d) = Pr(T − θ ≤ u | D = d) using R simulated values (t*_r, d*_r). One smooth estimate is the kernel estimate

Ĝ(u | d) = Σ_{r=1}^R I{t*_r − t ≤ u} w{h⁻¹(d*_r − d)} / Σ_{r=1}^R w{h⁻¹(d*_r − d)},

where w(·) is a density symmetric about zero and h is an adjustable bandwidth. Investigate the bias and variance of this estimate in the case where (T, D) is approximately bivariate normal and w(·) = φ(·). Show that h = R^{−1/2} is a reasonable choice.
(Section 5.9; Booth, Hall and Wood, 1992)

17 Suppose that (T, D) are approximately bivariate normal, with D an ancillary statistic upon whose observed value d we wish to condition when calculating confidence intervals. If the adjusted percentile method is to be used, then we need conditional evaluations of the constants a, v_L and w. One approach to this is based on selecting the subset of the R bootstrap samples for which d* ≐ d. Then w can be calculated in the usual way, but restricted to this subset. For a and v_L we need empirical influence values, and these can be approximated by the regression method of Section 2.7.4, but using only the selected subset of samples.
Investigate whether or not this approach makes sense.
(Section 5.9)

18 Suppose that y₁, …, y_n are sampled from an unknown distribution, which is known to be symmetric about its median. Then to calculate a 1 − α upper prediction limit for a further observation Y_{n+1}, the plug-in approach would use the 1 − α quantile of the symmetrized EDF (Example 3.4). Develop a resampling algorithm for obtaining a bias-corrected prediction limit.
(Section 5.10)

19 For estimating the mean μ of a population with unknown variance, we want to find a (1 − 2α) confidence interval with specified length ℓ. Given data y₁, …, y_n, consider the following approach. Create bootstrap samples of sizes N = n, n + 1, … and calculate confidence intervals (e.g. by the studentized bootstrap method) for each N. Then choose as total sample size that N for which the interval length is ℓ or less. An additional N − n data values are then obtained, and a bootstrap confidence interval applied. Discuss this approach, and investigate it numerically for the case where the data are sampled from a N(μ, σ²) distribution.

5.13 Practicals

1 Suppose that we wish to calculate a 90% confidence interval for the correlation θ between the two counts in the columns of cd4; see Practical 2.3. To obtain


confidence intervals for θ under nonparametric resampling, using the empirical influence values to calculate v_L:

cd4.boot <- boot(cd4, corr.fun, stype="w", R=999)
boot.ci(cd4.boot, conf=0.9)

To obtain intervals on the variance-stabilized scale, i.e. based on

t = ½ log{(1 + θ̂)/(1 − θ̂)}:

fisher <- function(r) 0.5*log((1+r)/(1-r))
fisher.dot <- function(r) 1/(1-r^2)
fisher.inv <- function(z) (exp(2*z)-1)/(exp(2*z)+1)
boot.ci(cd4.boot, h=fisher, hdot=fisher.dot, hinv=fisher.inv, conf=0.9)

How well do the intervals compare? Is the normal approximation reliable here? To compare intervals under parametric simulation from a fitted bivariate normal distribution:

cd4.rg <- function(data, mle)
{ d <- matrix(rnorm(2*nrow(data)), nrow(data), 2)
  d[,2] <- mle[5]*d[,1] + sqrt(1-mle[5]^2)*d[,2]
  d[,1] <- mle[1] + mle[3]*d[,1]
  d[,2] <- mle[2] + mle[4]*d[,2]
  d }
n <- nrow(cd4)
cd4.mle <- c(apply(cd4,2,mean), sqrt(apply(cd4,2,var)*(n-1)/n), corr(cd4))
cd4.para <- boot(cd4, corr.fun, R=999, sim="parametric",
                 ran.gen=cd4.rg, mle=cd4.mle)
boot.ci(cd4.para, type=c("norm","basic","stud","perc"), conf=0.9)
boot.ci(cd4.para, h=fisher, hdot=fisher.dot, hinv=fisher.inv,
        type=c("norm","basic","stud","perc"), conf=0.9)

To obtain the corresponding interval using the nonparametric ABC method:

abc.ci(cd4, corr, conf=0.9)

Do the differences among the various intervals reflect what you would expect?
(Sections 5.2, 5.3, 5.4.2; DiCiccio and Efron, 1996)

2 Suppose that we wish to calculate a 90% confidence interval for the largest eigenvalue θ of the covariance matrix of the two counts in the columns of cd4; see Practicals 2.3 and 5.1. To obtain confidence intervals for θ under nonparametric resampling, using the empirical influence values to calculate v_L:

eigen.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
{ w <- w/sum(w)
  n <- nrow(d)
  m <- crossprod(w, d)
  m2 <- sweep(d, 2, m)
  v <- crossprod(diag(sqrt(w)) %*% m2)
  eig <- eigen(v, symmetric=T)
  stat <- eig$values[1]
  e <- eig$vectors[,1]
  i <- rep(1:n, round(n*w))
  ds <- sweep(d[i,], 2, m)
  L <- (ds %*% e)^2 - stat
  c(stat, sum(L^2)/n^2) }
cd4.boot <- boot(cd4, eigen.fun, R=999, stype="w")
boot.ci(cd4.boot, conf=0.90)
abc.ci(cd4, eigen.fun, conf=0.9)

Discuss the differences among the various intervals.
(Sections 5.2, 5.3, 5.4.2; DiCiccio and Efron, 1996)

3 Dataframe amis contains data made available by G. Amis of Cambridgeshire County Council on the speeds in miles per hour of cars at pairs of sites on roads in Cambridgeshire. Speeds were measured at each site before and then again after the erection of a warning sign at one site of each pair. The quantity of interest is the mean relative change in the 0.85 quantile of the speeds for each pair, i.e. the mean of the quantities (η_{a1} − η_{b1}) − (η_{a0} − η_{b0}); here η_{a0} and η_{a1} are the 0.85 quantiles of the speed distribution at the site where the sign was placed, before and after its erection. This quantity is chosen because the warning is particularly intended to slow faster drivers. About 100 speeds are available for each combination of 14 pairs of sites and three periods, one before and two after the warnings were erected, but some of the pairs overlap. We work with a slightly smaller dataset, for which the ηs are:

amis1 <- amis[(amis$pair!=4)&(amis$pair!=6)&(amis$period!=3),]
tapply(amis1$speed, list(amis1$period, amis1$warning, amis1$pair),
       quantile, 0.85)

To attempt to set confidence intervals for θ, by stratified resampling from the speeds at each combination of site and period:

amis.fun <- function(data, i)
{ d <- data[i, ]
  d <- tapply(d$speed, list(d$period, d$warning, d$pair), quantile, 0.85)
  mean((d[2,1, ] - d[1,1, ]) - (d[2,2, ] - d[1,2, ])) }
str <- 4*(amis1$pair-1) + 2*(amis1$warning-1) + amis1$period
amis1.boot <- boot(amis1, amis.fun, R=99, strata=str)
amis1.boot$t0
qqnorm(amis1.boot$t)
abline(mean(amis1.boot$t), sqrt(var(amis1.boot$t)), lty=2)
boot.ci(amis1.boot, type=c("basic","perc","norm"), conf=0.9)

(There are 4800 cases in amis1 so this is demanding on memory: it may be necessary to increase the object.size.) Do the resampled averages look normal? Can you account for the differences among the intervals? How big is the average effect of the warnings?
(Section 5.2)

4 Dataframe capability gives "data" from Bissell (1990) comprising 75 successive observations with specification limits U = 5.79 and L = 5.49; see Problem 5.6. To check that the process is "in control" and that the data are close to independent normal random variables:

par(mfrow=c(2,2))
tsplot(capability$y, ylim=c(5,6))
abline(h=5.79, lty=2); abline(h=5.49, lty=2)
qqnorm(capability$y)
acf(capability$y)


acf(capability$y,type="partial")

To find nonparametric confidence limits for η using the estimates given by (ii) in Problem 5.6:

capability.fun <- function(data, i, U=5.79, L=5.49, dk=2.236)
{ y <- data$y[i]
  m <- mean(y)
  r5 <- apply(matrix(y,15,5), 1, function(y) diff(range(y)))
  s <- mean(r5)/dk
  2*min((U-m)/s, (m-L)/s) }
capability.boot <- boot(capability, capability.fun, R=999)
boot.ci(capability.boot, type=c("norm","basic","perc"))

Do the values of t* look normal? Why is there such a difference between the percentile and basic bootstrap limits? Which do you think are more reliable here?
(Sections 5.2, 5.3)

5 Following on from Practical 2.3, we use a double bootstrap with M = 249 to adjust the studentized bootstrap interval for a correlation coefficient applied to the cd4 data.

nested.corr <- function(data, w, t0, M)
{ n <- nrow(data)
  i <- rep(1:n, round(n*w))
  t <- corr.fun(data, w)
  z <- (t[1]-t0)/sqrt(t[2])
  nested.boot <- boot(data[i,], corr.fun, R=M, stype="w")
  z.nested <- (nested.boot$t[,1]-t[1])/sqrt(nested.boot$t[,2])
  c(z, sum(z.nested<z)/(M+1)) }
cd4.boot <- boot(cd4, nested.corr, R=9, stype="w", t0=corr(cd4), M=249)

To get some idea how long you will have to wait if you set R = 999 you can time the call to boot using unix.time or dos.time: beware of time and memory problems. It may be best to run a batch job, with contents

cd4.boot <- boot(cd4, nested.corr, R=99, stype="w", t0=corr(cd4), M=249)
junk <- boot(cd4, nested.corr, R=100, stype="w", t0=corr(cd4), M=249)
cd4.boot$t <- rbind(cd4.boot$t, junk$t)
cd4.boot$R <- cd4.boot$R + junk$R

but with the last three lines repeated eight further times.
cd4.nested contains a nested simulation we did earlier. To compare the actual and nominal coverage levels:

par(pty="s")qqplot((1:cd4.nested$R)/ (l+cd4.nested$R),cd4.nested$t[,2],

xlab="nominal coverage",ylab="estimated coverage",pch=".") lines(c(0,l),c(0,l))

How close to nominal is the estimated coverage? To read off the original and corrected 95% confidence intervals:

q <- c(0.975, 0.025)
q.adj <- quantile(cd4.nested$t[,2], q)
t0 <- corr.fun(cd4)
z <- sort(cd4.nested$t[,1])


t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q)]
t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q.adj)]

Does the correction have much effect? Compare this interval with the corresponding ABC interval.
(Section 5.6)


6

Linear Regression

6.1 Introduction

One of the most important and frequent types of statistical analysis is regression analysis, in which we study the effects of explanatory variables or covariates on a response variable. In this chapter we are concerned with linear regression, in which the mean of the random response Y observed at value x = (x₁, …, x_p)ᵀ of the explanatory variable vector is

E(Y | x) = μ(x) = xᵀβ.

The model is completed by specifying the nature of random variation, which for independent responses amounts to specifying the form of the variance var(Y | x). For a full parametric analysis we would also have to specify the distribution of Y, be it normal, Poisson or whatever. Without this, the model is semiparametric.

For linear regression with normal random errors having constant variance, the least squares theory of regression estimation and inference provides clean, exact methods for analysis. But for generalizations to non-normal errors and non-constant variance, exact methods rarely exist, and we are faced with approximate methods based on linear approximations to estimators and central limit theorems. So, just as in the simpler context of Chapters 2-5, resampling methods have the potential to provide more accurate analysis.

We begin our discussion in Section 6.2 with simple least squares linear regression, where in ideal conditions resampling essentially reproduces the exact theoretical analysis, but also offers the potential to deal with non-ideal circumstances such as non-constant variance. Section 6.3 covers the extension to multiple explanatory variables. The related topics of aggregate prediction error and of variable selection based on predictive ability are discussed in Section 6.4. Robust methods of regression are examined briefly in Section 6.5.



Figure 6.1 Average body weight (kg) and brain weight (g) for 62 species of mammals, plotted on original scales and logarithmic scales (Weisberg, 1985, p. 144).

The further topics of generalized linear models, survival analysis, other nonlinear regression, classification error, and nonparametric regression models are deferred to Chapter 7.

6.2 Least Squares Linear Regression

6.2.1 Regression fit and residuals

The left panel of Figure 6.1 shows the scatter plot of response "brain weight" versus explanatory variable "body weight" for n = 62 mammals. As the right panel of the figure shows, the data are well described by a simple linear regression after the two variables are transformed logarithmically, so that

y = log(brain weight), x = log(body weight).

The simple linear regression model is

Y_j = β₀ + β₁x_j + ε_j,   j = 1, …, n,   (6.1)

where the ε_j are uncorrelated with zero means and equal variances σ². This constancy of variance, or homoscedasticity, seems roughly right for the example data. We refer to the data (x_j, y_j) as the jth case.

In general the values x_j might be controlled (by design), randomly sampled, or merely observed as in the example. But we analyse the data as if the x_j were fixed, because the amount of information about β = (β₀, β₁)ᵀ depends upon their observed values.

The simplest analysis of data under (6.1) is by the ordinary least squares


method, on which we concentrate here. The least squares estimates for β are

β̂₁ = Σ_{j=1}^n (x_j − x̄) y_j / SS_x,   β̂₀ = ȳ − β̂₁x̄,   (6.2)

where x̄ = n⁻¹ Σ x_j and SS_x = Σ_{j=1}^n (x_j − x̄)². The conventional estimate of the error variance σ² is the residual mean square

s² = (n − 2)⁻¹ Σ_{j=1}^n e_j²,

where

e_j = y_j − μ̂_j   (6.3)

are raw residuals with

μ̂_j = β̂₀ + β̂₁x_j   (6.4)

the fitted values, or estimated mean values, for the response at the observed x values.

The basic properties of the parameter estimates β̂₀, β̂₁, which are easily obtained under model (6.1), are

E(β̂₀) = β₀,   var(β̂₀) = σ²(n⁻¹ + x̄²/SS_x),   (6.5)

and

E(β̂₁) = β₁,   var(β̂₁) = σ²/SS_x.   (6.6)

The estimates are normally distributed and optimal if the errors ε_j are normally distributed, they are often approximately normal for other error distributions, but they are not robust to gross non-normality of errors or to outlying response values.

The raw residuals e_j are important for various aspects of model checking, and potentially for resampling methods since they estimate the random errors ε_j, so it is useful to summarize their properties also. Under (6.1),

e_j = Σ_{k=1}^n (δ_{jk} − h_{jk}) ε_k,   (6.7)

where

h_{jk} = n⁻¹ + (x_j − x̄)(x_k − x̄)/SS_x,

with δ_{jk} equal to 1 if j = k and zero otherwise. The quantities h_{jj} are known as leverages, and for convenience we denote them by h_j. It follows from (6.7) that

E(e_j) = 0,   var(e_j) = σ²(1 − h_j).   (6.8)


One consequence of this last result is that the estimator S² that corresponds to s² has expected value σ², because Σ(1 − h_j) = n − 2. Note that with the intercept β₀ in the model, Σ e_j = 0 automatically.

The raw residuals e_j can be modified in various ways to make them suitable for diagnostic methods, but the most useful modification for our purposes is to change them to have constant variance, that is

r_j = e_j / (1 − h_j)^{1/2}.   (6.9)

We shall refer to these as modified residuals, to distinguish them from standardized residuals, which are in addition divided by the sample standard deviation. (Standardized residuals are called studentized residuals by some authors.) A normal Q-Q plot of the r_j will reveal obvious outliers, or clear non-normality of the random errors, although the latter may be obscured somewhat because of the averaging property of (6.7).

A simpler modification of residuals is to use 1 − h̄ = 1 − 2n⁻¹ instead of individual leverages 1 − h_j, where h̄ is the average leverage; this will have a very similar effect only if the leverages h_j are fairly homogeneous. This simpler modification implies multiplication of all raw residuals by (1 − 2n⁻¹)^{−1/2}: the average will equal zero automatically because Σ e_j = 0.

If (6.1) holds with homoscedastic random errors ε_j and if those random errors are normally distributed, or if the dataset is large, then standard distributional results will be adequate for drawing inferences with the least squares estimates. But if the errors are very non-normal or heteroscedastic, meaning that their variances are unequal, then those standard results may not be reliable and a resampling method may offer genuine improvement. In Sections 6.2.3 and 6.2.4 we describe two quite different resampling methods, the second of which is robust to failure of the model assumptions.

If strong non-normality or heteroscedasticity (which can be difficult to distinguish) appear to be present, then robust regression estimates may be considered in place of least squares estimates. These will be discussed in Section 6.5.

6.2.2 Alternative models

The linear regression model (6.1) can arise in two ways, and for our purposes it can be useful to distinguish them.

First formulation

The first possibility is that the pairs are randomly sampled from a bivariate distribution F for (X, Y). Then linear regression refers to linearity of the conditional mean of Y given X = x, that is

E(Y | X = x) = μ_Y + γ(x − μ_X),   γ = σ_{XY}/σ_X²,   (6.10)


with μ_X = E(X), μ_Y = E(Y), σ_X² = var(X) and σ_{XY} = cov(X, Y). This conditional mean corresponds to the mean in (6.1), with

β₀ = μ_Y − γμ_X,   β₁ = γ.   (6.11)

The parameters β = (β₀, β₁)ᵀ are here seen to be statistical functions of the kind met in earlier chapters, in this case based on the first and second moments of F. The random errors ε_j in (6.1) will be homoscedastic with respect to x if F is bivariate normal, for example, but not in general.

The least squares estimators (6.2) correspond to the use of sample moments in (6.10). For future reference we note (Problem 6.1) that the influence function for the least squares estimators t = (β̂₀, β̂₁)ᵀ is the vector

L_t{(x, y); F} = ( 1 − μ_X(x − μ_X)/σ_X²
                   (x − μ_X)/σ_X²        ) (y − β₀ − β₁x).   (6.12)

The empirical influence values as defined in Section 2.7.2 are therefore

l_j = ( 1 − n x̄(x_j − x̄)/SS_x
        n(x_j − x̄)/SS_x        ) e_j.   (6.13)

The nonparametric delta method variance approximation (2.36) applied to β̂₁ gives

v_L = Σ (x_j − x̄)² e_j² / SS_x².   (6.14)

This makes no assumption of homoscedasticity. In practice we modify the variance approximation to account for leverage, replacing e_j by r_j as defined in (6.9).

Second formulation

The second possibility is that at any value of x, responses Y_x can be sampled from a distribution F_x(y) whose mean and variance are μ(x) and σ²(x), such that μ(x) = β₀ + β₁x. Evidently β₀ = μ(0), and the slope parameter β₁ is a linear contrast of mean values μ(x₁), μ(x₂), …, namely

β₁ = Σ (x_j − x̄) μ(x_j) / SS_x.

In principle several responses could be obtained at each x_j. Simple linear regression with homoscedastic errors, with which we are initially concerned, corresponds to σ(x) = σ and

F_x(y) = G{y − μ(x)}.   (6.15)

So G is the distribution of random error, with mean zero and variance σ². Any particular application is characterized by the design x₁, …, x_n and the corresponding distributions F_x, the means of which are defined by linear regression.


The influence function for the least squares estimator is again given by (6.12), but with μ_X and σ_X² respectively replaced by x̄ and n⁻¹ Σ(x_j − x̄)². Empirical influence values are still given by (6.13). The analogue of linear approximations (2.35) and (3.1) is β̂ ≐ β + n⁻¹ Σ L_t{(x_j, y_j); F}, with variance n⁻² Σ_{j=1}^n var[L_t{(x_j, Y_j); F}]. If the assumed homoscedasticity of errors is used to evaluate this, with the constant variance σ² estimated by n⁻¹ Σ e_j², then the delta method variance approximation for β̂₁, for example, is

v_L = Σ e_j² / (n SS_x);

strictly speaking this is a semiparametric approximation. This differs by a factor of (n − 2)/n from the standard estimate, which is given by (6.6) with residual mean square s² in place of σ².

The standard analysis for linear regression as outlined in Section 6.2.1 is the same for both situations, provided the random errors ε_j have equal variances, as would usually be judged from plots of the residuals.

6.2.3 Resampling errors

To extend the resampling algorithms of Chapters 2-3 to regression, we have first to identify the underlying model F. Now if (6.1) is literally correct with homoscedastic errors, then those errors are effectively sampled from a single distribution. If the x_j are treated as fixed, then the second formulation of Section 6.2.2 applies, G being the common error distribution. The model F is the series of distributions F_x for x = x₁, …, x_n, defined by (6.15). The resampling model is the corresponding series of estimated distributions F̂_x in which each μ(x_j) is replaced by the regression fit μ̂(x_j) and G is estimated from all residuals.

For parametric resampling we would estimate G according to the assumed form of error distribution, for example the N(0, s²) distribution if normality were judged appropriate. (Of course resampling is not necessary for the normal linear model, because exact theoretical results are available.) For nonparametric resampling, on which we concentrate in this chapter, we need a generalization of the EDF used in Chapter 2. If the random errors ε_j were known, then their EDF would be appropriate. As it is we have the raw residuals e_j which estimate the ε_j, and their EDF will usually be consistent for G. But for practical use it is better to use the residuals r_j defined in (6.9), because their variances agree with those of the ε_j. Noting that G is assumed to have mean zero in the model, we then estimate G by the EDF of r_j − r̄, where r̄ is the average of the r_j. These centred residuals have mean zero, and we refer to their EDF as Ĝ.

The full resampling model is taken to have the same "design" as the data, that is x*_j = x_j; it then specifies the conditional distribution of Y*_j given x*_j through the estimated version of (6.1), which is


Y*_j = μ̂*_j + ε*_j,   j = 1, …, n,   (6.16)

with μ̂*_j = β̂₀ + β̂₁x*_j and the ε*_j randomly sampled from Ĝ. So the algorithm to generate simulated datasets and corresponding parameter estimates is as follows.

Algorithm 6.1 (Model-based resampling in linear regression)

For r = 1, …, R,

1 For j = 1, …, n,
  (a) set x*_j = x_j;
  (b) randomly sample ε*_j from r₁ − r̄, …, r_n − r̄; then
  (c) set y*_j = β̂₀ + β̂₁x*_j + ε*_j.

2 Fit least squares regression to (x*₁, y*₁), …, (x*_n, y*_n), giving estimates β̂*₀,r, β̂*₁,r, s*²_r.
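A minimal sketch of Algorithm 6.1 with the boot library, in the style of the earlier practicals; the dataframe mammals (with columns x and y holding the logged body and brain weights) and all object names here are illustrative assumptions, not code from the original text.

fit <- lm(y ~ x, data = mammals)
h <- lm.influence(fit)$hat                 # leverages h_j
r <- resid(fit)/sqrt(1 - h)                # modified residuals (6.9)
r <- r - mean(r)                           # centre, as step 1(b) requires
mammals.mle <- list(fits = fitted(fit), res = r)
mammals.rg <- function(data, mle)
{ # step 1: x*_j = x_j and y*_j = fitted value plus resampled error
  data$y <- mle$fits + sample(mle$res, replace = TRUE)
  data }
mammals.fun <- function(data) coef(lm(y ~ x, data = data))   # step 2
mammals.boot <- boot(mammals, mammals.fun, R = 499, sim = "parametric",
                     ran.gen = mammals.rg, mle = mammals.mle)

The empirical means and standard deviations of the columns of mammals.boot$t then estimate the quantities discussed next.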

The resampling means and variances of β̂*₀ and β̂*₁ will agree very closely with standard least squares theory. To see this, consider for example the slope estimate, whose bootstrap sample value can be written

β̂*₁ = Σ (x_j − x̄) Y*_j / SS_x = β̂₁ + Σ (x_j − x̄) ε*_j / SS_x.

Because E*(ε*) = n⁻¹ Σ (r_j − r̄) = 0, it follows that E*(β̂*₁) = β̂₁. Also, because var*(ε*_j) = n⁻¹ Σ_{j=1}^n (r_j − r̄)² for all j,

var*(β̂*₁) = Σ (x_j − x̄)² var*(ε*_j) / SS_x² = n⁻¹ Σ (r_j − r̄)² / SS_x.

The latter will be approximately equal to the usual estimate s²/SS_x, because n⁻¹ Σ (r_j − r̄)² ≈ (n − 2)⁻¹ Σ e_j² = s². In fact if the individual h_j are replaced by their average h̄, then the means and variances of β̂*₀ and β̂*₁ are given exactly by (6.5) and (6.6) with the estimates β̂₀, β̂₁ and s² substituted for parameter values. The advantage of resampling is improved quantile estimation when normal-theory distributions of the estimators β̂₀, β̂₁, S² are not accurate.

Example 6.1 (Mammals) For the data plotted in the right panel of Figure 6.1, the simple linear regression model seems appropriate. Standard analysis suggests that errors are approximately normal, although there is a small suspicion of heteroscedasticity: see Figure 6.2. The parameter estimates are β̂₀ = 2.135 and β̂₁ = 0.752.

From R = 499 bootstrap simulations according to the algorithm above, the


Figure 6.2 Normal Q-Q plot of modified residuals r_j and their plot against leverage values h_j for linear regression fit to log-transformed mammal data.


estimated standard errors of intercept and slope are respectively 0.0958 and 0.0273, compared to the theoretical values 0.0960 and 0.0285. The empirical distributions of bootstrap estimates are almost perfectly normal, as they are for the studentized estimates. The estimated 0.05 and 0.95 quantiles for the studentized slope estimate

z* = (β̂*₁ − β̂₁) / SE(β̂*₁),

where SE(β̂*₁) is the standard error for β̂*₁ obtained from (6.6), are z*₍₂₅₎ = −1.640 and z*₍₄₇₅₎ = 1.589, compared to the standard normal quantiles ±1.645. So, as expected for a moderately large "clean" dataset, the resampling results agree closely with those obtained from standard methods. ■

Zero intercept

In some applications the intercept β₀ will not be included in (6.1). This affects the estimation of β₁ and σ² in obvious ways, but the resampling algorithm will also differ. First, the leverage values are different, namely

h_j = x_j² / Σ_{k=1}^n x_k²,

so the modified residual will be different. Secondly, because now Σ e_j ≠ 0, it is essential to mean-correct the residuals before using them to simulate random errors.

Repeated design points

If there are repeat observations at some or all values of x, this offers an enhanced opportunity to detect heteroscedasticity: see Section 6.2.6. With


many such repeats it is in principle possible to estimate the CDFs F_x separately (Section 6.2.2), but there is rarely enough data for this to be useful in practice.

The main advantage of repeats is the opportunity it affords to test the adequacy of the linear regression formulation, by splitting the residual sum of squares into a "pure error" component and a "goodness-of-fit" component. To the extent that the comparison of these components through the usual F ratio is quite sensitive to non-normality and heteroscedasticity, resampling methods may be useful in interpreting that F ratio (Practical 6.3).

6.2.4 Resampling cases

A completely different approach would be to imagine the data as a sample from some bivariate distribution F of (X, Y). This will sometimes, but not often, mimic what actually happened. In this approach, as outlined in Section 6.2.2, the regression coefficients are viewed as statistical functions of F, and defined by (6.10). Model (6.1) still applies, but with no assumption on the random errors ε_j other than independence. When (6.10) is evaluated at F̂ we obtain the least squares estimates (6.2).

With F now the bivariate distribution of (X, Y), it is appropriate to take F̂ to be the EDF of the data pairs, and resampling will be from this EDF, just as in Chapter 2. The resampling simulation therefore involves sampling pairs with replacement from (x₁, y₁), …, (x_n, y_n). This is equivalent to taking (x*_j, y*_j) = (x_I, y_I), where I is uniformly distributed on {1, 2, …, n}. Simulated values β̂*₀, β̂*₁ of the coefficient estimates are computed from (x*₁, y*₁), …, (x*_n, y*_n) using the least squares algorithm which was applied to obtain the original estimates β̂₀, β̂₁. So the resampling algorithm is as follows.

Algorithm 6.2 (Resampling cases in regression)

For r = 1, …, R,

1 sample i*₁, …, i*_n randomly with replacement from {1, 2, …, n};
2 for j = 1, …, n, set x*_j = x_{i*_j}, y*_j = y_{i*_j}; then
3 fit least squares regression to (x*₁, y*₁), …, (x*_n, y*_n), giving estimates β̂*₀,r, β̂*₁,r, s*²_r.
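Case resampling needs no separate error model, so with the boot library Algorithm 6.2 reduces to a few lines; the mammals dataframe is again our illustrative assumption.

case.fun <- function(data, i) coef(lm(y ~ x, data = data[i, ]))  # steps 2-3
mammals.case <- boot(mammals, case.fun, R = 999)                 # step 1
mammals.case   # prints bootstrap biases and standard errors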

There are two important differences between this second bootstrap method and the previous one using a parametric model and simulated errors. First, with the second method we make no assumption about variance homogeneity; indeed we do not even assume that the conditional mean of Y given X = x is linear. This offers the advantage of potential robustness to heteroscedasticity, and the disadvantage of inefficiency if the constant-variance model is correct. Secondly, the simulated samples have different designs, because the values

The model E(Y | X = x) = α + β₁(x − x̄), which some writers use in place of (6.1), is not useful here because α = β₀ + β₁x̄ is a function not only of F but also of the data, through x̄.


Table 6.1 Mammals data. Comparison of bootstrap biases and standard errors of intercept and slope with theoretical results, standard and robust. Resampling cases with R = 999.

                        Theoretical   Resampling cases   Robust theoretical
β̂₀  bias                    0             0.0006                —
    standard error          0.096         0.091              0.088
β̂₁  bias                    0             0.0002                —
    standard error          0.0285        0.0223             0.0223

x*₁, …, x*_n are randomly sampled. The design fixes the information content of a sample, and in principle our inference should be specific to the information in our data. The variation in x*₁, …, x*_n will cause some variation in information, but fortunately this is often unimportant in moderately large datasets; see, however, Examples 6.4 and 6.6.

Note that in general the resampling distribution of a coefficient estimate will not have mean equal to the data estimate, contrary to the unbiasedness property that the estimate in fact possesses. However, the difference is usually negligible.

Example 6.2 (Mammals) For the data of Example 6.1, a bootstrap simulation was run by resampling cases with R = 999. Table 6.1 shows the bias and standard error results for both intercept and slope. The estimated biases are very small. The striking feature of the results is that the standard error for the slope is considerably smaller than in the previous bootstrap simulation, which agreed with standard theory. The last column of the table gives robust versions of the standard errors, which are calculated by estimating the variance of ε_j to be r_j². For example, the robust estimate of the variance of β̂₁ is

v_rob = Σ (x_j − x̄)² r_j² / SS_x².   (6.17)

This corresponds to the delta method variance approximation (6.14), except that r_j is used in preference to e_j. As we might have expected from previous discussion, the bootstrap gives an approximation to the robust standard error.
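As a sketch, the robust standard error (6.17) might be computed directly from a fitted lm object, here called fit; the variable names are ours.

x <- model.matrix(fit)[, 2]
h <- lm.influence(fit)$hat
r <- resid(fit)/sqrt(1 - h)                # modified residuals (6.9)
SSx <- sum((x - mean(x))^2)
sqrt(sum((x - mean(x))^2 * r^2)/SSx^2)     # robust standard error (6.17)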


Figure 6.3 shows normal Q-Q plots of the bootstrap estimates β̂*₀ and β̂*₁. For the slope parameter the right panel shows lines corresponding to normal distributions with the usual and the robust standard errors. The distribution of β̂*₁ is close to normal, with variance much closer to the robust form (6.17) than to the usual form (6.6). ■

One disadvantage of the robust standard error is its inefficiency relative to the usual standard error when the latter is correct. A fairly straightforward calculation (Problem 6.6) gives the efficiency, which is approximately 40% for the slope parameter in the previous example. Thus the effective degrees of freedom for the robust standard error is approximately 0.40 times 62, or 25.



The same loss of efficiency would apply approximately to bootstrap results for resampling cases.

6.2.5 Significance tests for slope

Suppose that we want to test whether or not the covariate x has an effect on the response y, assuming linear regression is appropriate. In terms of model parameters, the null hypothesis is H₀ : β₁ = 0. If we use the least squares estimate as the basis for such a test, then this is equivalent to testing the Pearson correlation coefficient. This connection immediately suggests one nonparametric test, the permutation test of Example 4.9. However, this is not always valid, so we need also to consider other possible bootstrap tests.

Permutation test

The permutation test of correlation applies to the null hypothesis of independence between X and Y when these are both random. Equivalently it applies when the null hypothesis implies that the conditional distribution of Y given X = x does not depend upon x. In the context of linear regression this means not only zero slope, but also constant error variance. The justification then rests simply on the exchangeability of the response values under the null hypothesis.

If we use X₍·₎ to denote the ordered values of X₁, …, X_n, and so forth, then the exact level of significance for one-sided alternative H_A : β₁ > 0 and test statistic T is

p = Pr( T ≥ t | X₍·₎ = x₍·₎, Y₍·₎ = y₍·₎, H₀ )
  = Pr[ T ≥ t | X = x, Y = perm{y₍·₎} ],

Figure 6.3 Normal plots for bootstrapped estimates of intercept (left) and slope (right) for linear regression fit to logarithms of mammal data, with R = 999 samples obtained by resampling cases. The dotted lines give approximate normal distributions based on the usual formulae (6.5) and (6.6), while the dashed line shows the normal distribution for the slope using the robust variance estimate (6.17).


where perm{·} denotes a permutation. Because all permutations are equally likely, we have

p = ( # of permutations such that T ≥ t ) / n!,

as in (4.20). In the present context we can take T = β̂₁, for which p is the same as if we used the sample Pearson correlation coefficient, but the same method applies for any appropriate slope estimator. In practice the test is performed by generating samples (x*₁, y*₁), …, (x*_n, y*_n) such that x*_j = x_j and (y*₁, …, y*_n) is a random permutation of (y₁, …, y_n), and fitting the least squares slope estimate β̂*₁. If this is done R times, then the one-sided P-value for alternative H_A : β₁ > 0 is

p = ( #{β̂*₁ ≥ β̂₁} + 1 ) / (R + 1).

It is easy to show that studentizing the slope estimate would not affect this test; see Problem 6.4. The test is exact in the sense that the P-value has a uniform distribution under H₀, as explained in Section 4.1; note that this uniform distribution holds conditional on the x values, which is the relevant property here.
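A sketch of this permutation test, assuming a dataframe dat with columns x and y (our names, not from the original text):

slope <- function(x, y) coef(lm(y ~ x))[2]
t0 <- slope(dat$x, dat$y)
t.perm <- replicate(999, slope(dat$x, sample(dat$y)))   # permute responses
(1 + sum(t.perm >= t0))/(999 + 1)                       # one-sided P-value

As noted above, studentizing the slope inside slope() would leave the P-value unchanged.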

First bootstrap test

A bootstrap test whose result will usually differ negligibly from that of the permutation test is obtained by taking the null model as the pair of marginal EDFs of x and y, so that the x*_j are randomly sampled with replacement from the x_j, and independently the y*_j are randomly sampled from the y_j. Again β̂*₁

is the slope fitted to the simulated data, and the formula for p is the same. As with the permutation test, the null hypothesis being tested is stronger than just zero slope.

The permutation method and its bootstrap look-alike apply equally well to any slope estimate, not just the least squares estimate.

Second bootstrap test

The next bootstrap test is based explicitly on the linear model structure with homoscedastic errors, and applies the general approach of Section 4.4. The null model is the null mean fit and the EDF of residuals from that fit. We calculate the P-value for the slope estimate under sampling from this fitted model. That is, data are simulated by

x*_j = x_j,   y*_j = μ̂_{j0} + ε*_{j0},

where μ̂_{j0} = ȳ and the ε*_{j0} are sampled with replacement from the null model residuals e_{j0} = y_j − ȳ, j = 1, …, n. The least squares slope β̂*₁ is calculated from the simulated data. After R repetitions of the simulation, the P-value is calculated as before.


This second bootstrap test differs from the first bootstrap test only in that the values of explanatory variables x are fixed at the data values for every case. Note that if residuals were sampled without replacement, this test would duplicate the exact permutation test, which suggests that this bootstrap test will be nearly exact.

The test could be modified by standardizing the residuals before sampling from them, which here would mean adjusting for the constant null model leverage n⁻¹. This would affect the P-value slightly for the test as described, but not if the test statistic were changed to the studentized slope estimate. It therefore seems wise to studentize regression test statistics in general, if model-based simulation is used; see the discussion of bootstrap pivot tests below.
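Under the same assumptions, and reusing slope and t0 from the previous sketch, the second bootstrap test replaces the permutation by sampling the null-model residuals with replacement:

e0 <- dat$y - mean(dat$y)                  # null model residuals e_{j0}
t.boot <- replicate(999,
  slope(dat$x, mean(dat$y) + sample(e0, replace = TRUE)))
(1 + sum(t.boot >= t0))/(999 + 1)          # one-sided P-value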

Testing non-zero slope values

All of the preceding tests can be easily modified to test a non-zero value of β₁. If the null value is β₁,₀, say, then we apply the test to modified responses y_j − β₁,₀ x_j, as in Example 6.3 below.

Bootstrap pivot tests

Further bootstrap tests can be based on the studentized bootstrap approach outlined in Section 4.4.1. For simplicity suppose that we can assume homoscedastic errors. Then Z = (β̂₁ − β₁)/S₁ is a pivot, where S₁ is the usual standard error for β̂₁. As a pivot, Z has a distribution not depending upon parameter values, and this can be verified under the linear model (6.1). The null hypothesis is H₀ : β₁ = 0, and as before we consider the one-sided alternative H_A : β₁ > 0. Then the P-value is

p = Pr( Z ≥ z₀ | β₁ = 0, β₀, σ ) = Pr( Z ≥ z₀ | β₁, β₀, σ ),   (6.18)

where z₀ = β̂₁/s₁, because Z is a pivot. The probability on the right is approximated by the bootstrap probability Pr*(Z* ≥ z₀), where Z* = (β̂*₁ − β̂₁)/S*₁ is computed from a sample simulated according to Algorithm 6.1, which uses the fit from the full model as in (6.16). So, applying the bootstrap as described in Section 6.2.3, we calculate the bootstrap P-value from the results of R simulated samples as

p = ( #{z*_r ≥ z₀} + 1 ) / (R + 1).   (6.19)

The relation of this method to confidence limits is that if the lower 1 − α


Figure 6.4 Linear regression model fitted to monthly excess returns over riskless rate y for one company versus excess market returns x. The left panel shows the data and fitted line. The right panel plots the absolute values of the standardized residuals against x (Simonoff and Tsai, 1994).


confidence limit for β₁ is above zero, then p < α. Similar interpretations apply with upper confidence limits and confidence intervals.

The same method can be used with case resampling. If this were done as a precaution against error heteroscedasticity, then it would be appropriate to replace s₁ with the robust standard error defined as the square root of (6.17).

If we wish to test a non-zero value β₁,₀ for the slope, then in (6.18) we simply replace β̂₁/s₁ by z₀ = (β̂₁ − β₁,₀)/s₁, or equivalently compare the lower confidence limit to β₁,₀.

With all of these tests there are simple modifications if a different alternative hypothesis is appropriate. For example, if the alternative is H_A : β₁ < 0, then the inequalities "≥" used in defining p are replaced by "≤", and the two-sided P-value is twice the smaller of the two one-sided P-values.

On balance there seems little to choose among the various tests described. The permutation test and its bootstrap look-alike are equally suited to statistics other than least squares estimates. The bootstrap pivot test with case resampling is the only one designed to test slope without assuming constant error variance under the null hypothesis. But one would usually expect similar results from all the tests.

The extensions to multiple linear regression are discussed in Section 6.3.2.

Example 6.3 (Returns data) The data plotted in Figure 6.4 are n = 60 consecutive cases of monthly excess returns y for a particular company and excess market returns x, where excess is relative to riskless rate. We shall ignore the possibility of serial correlation. A linear relationship appears to fit the data, and the hypothesis of interest is H₀ : β₁ = 1 with alternative H_A : β₁ > 1, the latter corresponding to the company outperforming the market.


Figure 6.5 Returns data: histogram of R = 999 bootstrap values of studentized slope z* = (β̂*₁ − β̂₁)/s*_rob obtained by resampling cases. Unshaded area corresponds to values in excess of data value z₀ = (β̂₁ − 1)/s_rob = 0.669.

Figure 6.4 and plots of regression diagnostics suggest that error variation increases with x and is non-normal. It is therefore appropriate to apply the bootstrap pivot test with case resampling, using the robust standard error from (6.17), which we denote here by s_rob, to studentize the slope estimate.

Figure 6.5 shows a histogram of R = 999 values of z*. The unshaded part corresponds to z* greater than the data value

z₀ = (β̂₁ − 1)/s_rob = (1.133 − 1)/0.198 = 0.669,

which happens 233 times. Therefore the bootstrap P-value is 0.234. In fact the use of the robust standard error makes little difference here: using the ordinary standard error gives P-value 0.252. Comparison of the ordinary t-statistic to the standard normal table gives P-value 0.28. ■
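A sketch of the calculation in this example, assuming a dataframe returns (our name) with columns x and y; the robust variance is recomputed within each resample so that z* can be formed:

returns.fun <- function(data, i)
{ d <- data[i, ]
  f <- lm(y ~ x, data = d)
  r <- resid(f)/sqrt(1 - lm.influence(f)$hat)   # modified residuals
  xc <- d$x - mean(d$x)
  c(coef(f)[2], sum(xc^2 * r^2)/sum(xc^2)^2) }  # slope, robust variance (6.17)
returns.boot <- boot(returns, returns.fun, R = 999)
z0 <- (returns.boot$t0[1] - 1)/sqrt(returns.boot$t0[2])
z <- (returns.boot$t[, 1] - returns.boot$t0[1])/sqrt(returns.boot$t[, 2])
(1 + sum(z >= z0))/(1 + returns.boot$R)         # bootstrap P-value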

6.2.6 Non-constant variance: weighted error resampling

In some applications the linear model (6.1) will apply, but with heteroscedastic random errors. If the heteroscedasticity can be modelled, then bootstrap simulation by resampling errors is still possible. We assume to begin with that ordinary, i.e. unweighted, least squares estimates are fitted, as before.

Known variance function

Suppose that in (6.1) the random error ε_j at x = x_j has variance σ_j², where either σ_j² = κV(x_j) or σ_j² = κV(μ_j), with V(·) a known function. It is possible to estimate κ, but we do not need to do this. We only require the modified residuals

r_j = (y_j − μ̂_j) / {V(x_j)(1 − h_j)}^{1/2}   or   (y_j − μ̂_j) / {V(μ̂_j)(1 − h_j)}^{1/2},


which will be approximately homoscedastic. The EDF of these modified residuals, after subtracting their mean, will estimate the distribution function G of the scaled, homoscedastic random errors δ_j in the model

Y_j = β₀ + β₁x_j + V_j^{1/2} δ_j,   (6.20)

where V_j = V(x_j) or V(μ_j). Algorithm 6.1 for resampling errors is now modified as follows.

Algorithm 6.3 (Resampling errors with unequal variances)

For r = 1, …, R,

1 For j = 1, …, n,
  (a) set x*_j = x_j;
  (b) randomly sample δ*_j from r₁ − r̄, …, r_n − r̄; then
  (c) set y*_j = β̂₀ + β̂₁x*_j + V_j^{1/2} δ*_j, where V_j is V(x_j) or V(μ̂_j) as appropriate.

2 Fit linear regression by ordinary least squares to data (x*₁, y*₁), …, (x*_n, y*_n), giving estimates β̂*₀,r, β̂*₁,r, s*²_r.

Weighted least squares

Of course in this situation ordinary least squares is inferior to weighted least squares, in which ideally the jth case is given weight w_j = V_j⁻¹. If V_j = V(x_j) then weighted least squares can be done in one pass through the data, whereas if V_j = V(μ_j) we first estimate μ_j by ordinary least squares fitted values μ̂⁰_j, say, and then do a weighted least squares fit with the empirical weights w_j = 1/V(μ̂⁰_j). In the latter case the standard theory assumes that the weights are fixed, which is adequate for first-order approximations to distributional properties. The practical effect of using empirical weights can be incorporated into the resampling, and so potentially more accurate distributional properties can be obtained; cf. Example 3.2.

For weighted least squares, the estimates of intercept and slope are

β̂₁ = Σ w_j (x_j − x̄_w) y_j / Σ w_j (x_j − x̄_w)²,   β̂₀ = ȳ_w − β̂₁ x̄_w,

where x̄_w = Σ w_j x_j / Σ w_j and ȳ_w = Σ w_j y_j / Σ w_j. Fitted values and raw residuals are defined as for ordinary least squares, but leverage values and modified residuals differ. The leverage values are now

h_j = w_j / Σ w_i + w_j (x_j − x̄_w)² / Σ w_i (x_i − x̄_w)²,


and the modified residuals (standardized to equal variance κ) are

r_j = w_j^{1/2} (y_j − μ̂_j) / (1 − h_j)^{1/2},

where κ is estimated by the weighted residual mean square κ̂ = s² = (n − 2)⁻¹ Σ w_j (y_j − μ̂_j)²; correspondingly, var(β̂₁) = κ / Σ w_j (x_j − x̄_w)².

The algorithm for resampling errors is the same as for ordinary least squares, summarized in Algorithm 6.3, but with the full weighted least squares procedure implemented in the final step.

The situation where error variance depends on the mean is a special case of the generalized linear model, which is discussed more fully in Section 7.2.

Wild bootstrap

What if the variance function V(·) is unspecified? In some circumstances there may be enough data to model it from the pattern of residual variation, for example using a plot of modified residuals r_j (or their absolute values or squares) versus fitted values μ̂_j. This approach can work if there is a clear monotone relationship of variance with x or μ, or if there are clearly identifiable strata of constant variance (cf. Figure 7.14). But where the heteroscedasticity is unpatterned, either resampling of cases should be done with least squares estimates, or something akin to local estimation of variance will be required.

The most local approach possible is the wild bootstrap, which estimates variances from individual residuals. This uses the model-based resampling Algorithm 6.1, but with the jth resampled error ε*_j taken from the two-point distribution

ε*_j = e_j (1 − √5)/2 with probability π,   ε*_j = e_j (1 + √5)/2 with probability 1 − π,   (6.21)

where π = (5 + √5)/10 and e_j = y_j − μ̂_j is the raw residual. The first three moments of ε*_j are zero, e_j² and e_j³ (Problem 6.8). This algorithm generates at most 2ⁿ different values of parameter estimates, and typically gives results that are underdispersed relative to model-based resampling or resampling cases. Note that if modified residuals r_j were used in place of raw residuals e_j, then the variance of β̂*₁ under the wild bootstrap would equal the robust variance estimate (6.17).
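A sketch of the wild bootstrap in the same parametric set-up as earlier, with fit the least squares fit to the assumed dataframe dat; the factor multiplying each raw residual is drawn from the two-point distribution (6.21).

wild.rg <- function(data, mle)
{ pp <- (5 + sqrt(5))/10                     # pi in (6.21)
  f <- ifelse(runif(length(mle$res)) < pp,
              (1 - sqrt(5))/2, (1 + sqrt(5))/2)
  data$y <- mle$fits + f*mle$res             # e*_j = f_j e_j
  data }
dat.wild <- boot(dat, function(d) coef(lm(y ~ x, data = d)), R = 999,
                 sim = "parametric", ran.gen = wild.rg,
                 mle = list(fits = fitted(fit), res = resid(fit)))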

Example 6.4 (Returns data) As mentioned in Example 6.3, the data in Figure 6.4 show an increase in error variance with market return, x. Table 6.2 compares the bootstrap variances of the parameter estimates from ordinary least squares for case resampling and the wild bootstrap, with R = 999. The estimated variance of β̂₁ from resampling cases is larger than for the wild



Table 6.2 Bootstrap variances (×10⁻³) of ordinary least squares estimates for returns data, with R = 999.

                      All cases        Without case 22
                      β̂₀     β̂₁        β̂₀     β̂₁
Cases                 0.32   44.3      0.42   73.2
Cases, subset         0.28   38.4      0.39   59.1
Wild, e_j             0.31   37.9      0.37   62.5
Wild, r_j             0.33   37.0      0.41   67.2
Robust theoretical    0.34   39.4      0.40   67.2

bootstrap, and for the full data it makes little difference when the modified residuals are used.

Case 22 has high leverage, and its exclusion increases the variances of both estimates. The wild bootstrap is again less variable than bootstrapping cases, with the wild bootstrap of modified residuals intermediate between them.

We mentioned earlier that the design will vary when resampling cases. The left panel of Figure 6.6 shows the simulated slope estimates plotted against the sums of squares Σ(x*_j − x̄*)², for 200 bootstrap samples. The plotting character distinguishes the number of times case 22 occurs in the resamples: we return to this below. The variability of β̂*₁ decreases sharply as the sum of squares increases. Now usually we would treat the sum of squares as fixed in the analysis, and this suggests that we should calculate the variance of β̂*₁ from those bootstrap samples for which Σ(x*_j − x̄*)² is close to the original value Σ(x_j − x̄)², shown by the dotted vertical line. If we take the subset between the dashed lines, the estimated variance is closer to that for the wild bootstrap, as shown by the values in Table 6.2 and by the Q-Q plot in the right panel of Figure 6.6. This is also true when case 22 is excluded.

The main reason for the large variability of Σ(x*_j − x̄*)² is that case 22 has high leverage, as its position at the bottom left of Figure 6.4 shows. Figure 6.6 shows that it has a substantial effect on the precision of the slope estimate: the most variable estimates are those where case 22 does not occur, and the least variable those where it occurs two or more times. ■

6.3 Multiple Linear Regression

The extension of the simple linear regression model (6.1) to several explanatory variables is

Y_j = β₀x_{j0} + β₁x_{j1} + ⋯ + β_p x_{jp} + ε_j,   j = 1, …, n,   (6.22)



where for models with an intercept x_{j0} = 1. In the more convenient vector form the model is

Y_j = x_jᵀβ + ε_j

with x_jᵀ = (x_{j0}, x_{j1}, …, x_{jp}). The combined matrix representation for all responses Yᵀ = (Y₁, …, Y_n) is

Y = Xβ + ε   (6.23)

with Xᵀ = (x₁, …, x_n) and εᵀ = (ε₁, …, ε_n). As before, the responses Y_j are supposed independent. This general linear model will encompass polynomial and interaction models, by judicious definition of x in terms of primitive variables; for example, we might have x_{j1} = u_{j1} and x_{j2} = u_{j1}², or x_{j3} = u_{j1}u_{j2}, and so forth. When the x_{jk} are dummy variables representing levels of factors, we omit x_{j0} if the intercept is a redundant parameter.

In many respects the bootstrap analysis for multiple regression is an obvious extension of the analysis for simple linear regression in Section 6.2. We again concentrate on least squares model fitting. Particular issues which arise are: (i) testing for the effect of a subset of the explanatory variables, (ii) assessment of predictive accuracy of a fitted model, (iii) the effect of p large relative to n, and (iv) selection of the "best" model by suitable deletion of explanatory variables. In this section we focus on the first two of these, briefly discuss the third, and address variable selection methods in Section 6.4. We begin by outlining the extensions of Sections 6.2.1-6.2.4.

Figure 6.6 Comparison of wild bootstrap and bootstrapping cases for monthly returns data. The left panel shows 200 estimates of slope plotted against sum of squares Σ(x*_j − x̄*)² for case resampling. Resamples where case 22 occurred zero or one times are labelled accordingly. The right panel shows a Q-Q plot of the values of β̂*₁ for the wild bootstrap and the subset of the cases lying within the dashed lines in the left panel.


6.3.1 Bootstrapping the least squares fit

The ordinary least squares estimates of β for model (6.23) based on observed response vector y are

β̂ = (XᵀX)⁻¹Xᵀy,

and corresponding fitted values are μ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ is the "hat" matrix, whose diagonal elements h_{jj}, again denoted by h_j for simplicity, are the leverage values. The raw residuals are e = (I − H)y.

Under homoscedasticity the standard formula for the estimated variance of β̂ is

v̂ar(β̂) = s²(XᵀX)⁻¹,   (6.24)

with s² equal to the residual mean square (n − p − 1)⁻¹eᵀe. The empirical influence values for ordinary least squares estimates are

l_j = n(XᵀX)⁻¹x_j e_j,   (6.25)

which give rise to the robust estimate of var(β̂),

v_L = (XᵀX)⁻¹ ( Σ_{j=1}^n e_j² x_j x_jᵀ ) (XᵀX)⁻¹;   (6.26)

see Problem 6.1. These generalize equations (6.13) and (6.14). The variance approximation is improved by using the modified residuals

r_j = e_j / (1 − h_j)^{1/2}

in place of the e; , and then v i generalizes (6.17).B ootstrap algorithm s generalize those in Sections 6.2.3-6.2.4. T hat is, model-

based resampling generates data according to

Y] = x J P + E p

where the s' are random ly sampled from the modified residuals n , . . . , rn, or their centred counterparts — r. Case resampling operates by random ly resampling cases from the data. Pros and cons o f the two m ethods are the same as before, provided p is small relative to n and the design is far from being singular. The situation where p is large requires special attention.
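As a sketch, the robust variance matrix (6.26) can be formed directly from the design matrix and modified residuals; we assume a dataframe cement holding the variables of Table 6.3 as columns x1-x4 and y (our naming, for illustration).

fit <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
X <- model.matrix(fit)
r <- resid(fit)/sqrt(1 - lm.influence(fit)$hat)   # modified residuals
XtXi <- solve(crossprod(X))                       # (X'X)^{-1}
vL <- XtXi %*% crossprod(X*r) %*% XtXi            # sandwich estimate (6.26)
sqrt(diag(vL))                                    # robust standard errors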

Large p

Difficulty can arise with both model-based resampling and case resampling if p is very large relative to n. The following theoretical example illustrates an extreme version of the problem.


Example 6.5 (One-way model) Consider the regression model that corresponds to m independent samples each of size two. If the regression parameters β₁, …, β_m are the means of the populations sampled, then we omit the intercept term from the model, and the design matrix has p = m columns and n = 2m rows with dummy explanatory variables x_{2i−1,i} = x_{2i,i} = 1, x_{j,i} = 0 otherwise, i = 1, …, p. That is,

X = ( 1  0  0  ⋯  0
      1  0  0  ⋯  0
      0  1  0  ⋯  0
      0  1  0  ⋯  0
      ⋮           ⋮
      0  0  0  ⋯  1
      0  0  0  ⋯  1 ).

For this model

β̂_i = ½(y_{2i−1} + y_{2i}),   i = 1, …, p,

and

e_j = (−1)^j ½(y_{2i} − y_{2i−1}),   h_j = ½,   j = 2i − 1, 2i,   i = 1, …, p.

The EDF of the residuals, modified or not, could be very unlike the true error distribution: for example, the EDF will always be symmetric.

If the random errors are homoscedastic then the model-based bootstrap will give consistent estimates of bias and standard error for all regression coefficients. However, the bootstrap distributions must be symmetric, and so may be no better than normal approximations if true random errors are skewed. There appears to be no remedy for this. The problem is not so serious for contrasts among the β_i. For example, if θ = β₁ − β₂ then it is easy to see that θ̂ has a symmetric distribution, as does θ̂*. The kurtosis is, however, different for θ̂ and θ̂*; see Problem 6.10.

Case resampling will not work because in those samples where both y_{2i−1} and y_{2i} are absent β̂_i is inestimable: the resample design is singular. The chance of this is 0.48 for m = 5, increasing to 0.96 for m = 20. This can be fixed by omitting all bootstrap samples where the case frequencies satisfy f*_{2i−1} + f*_{2i} = 0 for any i. The resulting bootstrap variance for β̂*_i consistently overestimates by a factor of about 1.3. Further details are given in Problem 6.9. ■

The implication for more general designs is that difficulties will arise with combinations cᵀβ̂ where c is in the subspace spanned by those eigenvectors of XᵀX corresponding to small eigenvalues. First, model-based resampling will give adequate results for standard error calculations, but bootstrap distributions may not improve on normal approximations in calculating confidence limits for the β_i, or for prediction. Secondly, unconstrained case resampling


Table 6.3 Cement data (Woods, Steinour and Starke, 1932). The response y is the heat (calories per gram of cement) evolved while samples of cement set. The explanatory variables are percentages by weight of four constituents: tricalcium aluminate $x_1$, tricalcium silicate $x_2$, tetracalcium alumino ferrite $x_3$ and dicalcium silicate $x_4$.

      x1   x2   x3   x4      y
 1     7   26    6   60   78.5
 2     1   29   15   52   74.3
 3    11   56    8   20  104.3
 4    11   31    8   47   87.6
 5     7   52    6   33   95.9
 6    11   55    9   22  109.2
 7     3   71   17    6  102.7
 8     1   31   22   44   72.5
 9     2   54   18   22   93.1
10    21   47    4   26  115.9
11     1   40   23   34   83.8
12    11   66    9   12  113.3
13    10   68    8   12  109.4

may induce near-collinearity in the design matrix $X^*$, or equivalently near-singularity in $X^{*T}X^*$, and hence produce grossly inflated bootstrap estimates of some standard errors. One solution would be to reject simulated samples where the smallest eigenvalue of $X^{*T}X^*$ is lower than a threshold just below the smallest eigenvalue $\ell_1$ of $X^TX$. An alternative solution, more in line with the general thinking that analysis should be conditioned on $X$, is to use only those simulated samples corresponding to the middle half of the values of $\ell_1^*$. This probably represents the best strategy for getting good confidence limits which are also robust to error heteroscedasticity. The difficulty may be avoided by an appropriate use of principal component regression.
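The eigenvalue-screening strategy just described might be implemented as follows; this is a sketch of ours under the convention that X includes the intercept column, not a prescription from the text.

import numpy as np

def screened_case_boot(X, y, R=999, keep=0.5, seed=3):
    """Case resampling, retaining only resamples whose smallest eigenvalue
    of X*'X* lies in the middle `keep` fraction of the simulated values."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = np.empty((R, X.shape[1]))
    ell1 = np.empty(R)
    for r in range(R):
        idx = rng.integers(0, n, n)
        Xs, ys = X[idx], y[idx]
        betas[r] = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        ell1[r] = np.linalg.eigvalsh(Xs.T @ Xs)[0]      # smallest eigenvalue
    lo, hi = np.quantile(ell1, [(1 - keep) / 2, (1 + keep) / 2])
    sel = (ell1 >= lo) & (ell1 <= hi)                   # middle half of ell_1*
    return betas[sel].std(axis=0, ddof=1)               # screened standard errors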

Example 6.6 (Cement data) The data in Table 6.3 are classic in the regression literature as an example of near-collinearity. The four covariates are percentages of constituents which sum to nearly 100: the smallest eigenvalue of $X^TX$ is $\ell_1 = 0.0012$, corresponding to eigenvector $(-1, 0.01, 0.01, 0.01, 0.01)$.

Theoretical and bootstrap standard errors for coefficients are given in Table 6.4. For error resampling the results agree closely with theory, as expected. The bootstrap distributions of $\hat\beta^*$ are very normal-looking: the hat matrix $H$ is such that modified residuals $r_j$ would look normal even for very skewed errors $\varepsilon_j$.

Case resampling gives much higher standard errors for coefficients, and the bootstrap distributions are visibly skewed with several outliers. Figure 6.7 shows scatter plots of two bootstrap coefficients versus the smallest eigenvalue of $X^{*T}X^*$; plots for the other two coefficients are very similar. The variability of $\hat\beta_i^*$ increases substantially for small values of $\ell_1^*$, whose reciprocal ranges from about a third to 100 times the reciprocal of $\ell_1$. Taking only those bootstrap samples which give the middle 500 values of $\ell_1^*$ (which are between 0.0005 and 0.0012)


Table 6.4 Standard errors of linear regression coefficients for cement data. Theoretical and error resampling assume homoscedasticity. Resampling results use R = 999 samples, but only on those samples with the middle 500 and the largest 800 values of $\ell_1^*$.

                                    beta0  beta1  beta2  beta3  beta4
Normal-theory                        70.1   0.74   0.72   0.75   0.71
Error resampling, R = 999            66.3   0.70   0.69   0.72   0.67
Case resampling, all R = 999        108.5   1.13   1.12   1.18   1.11
Case resampling, middle 500          68.4   0.76   0.71   0.78   0.69
Case resampling, largest 800         67.3   0.77   0.69   0.78   0.68

Figure 6.7 Bootstrap regression coefficients and fit, versus smallest eigenvalue ($\times 10^{-5}$) of $X^{*T}X^*$ for R = 999 resamples of cases from the cement data. The vertical line is the smallest eigenvalue of $X^TX$, and the horizontal lines show the original coefficients ± two standard errors.

gives more reasonable standard errors, as seen in the penultimate row of Table 6.4. The last row, corresponding to dropping the smallest 200 values of $\ell_1^*$, gives very similar results. ■

Weighted least squares

The general discussion extends in a fairly obvious way to weighted least squares estimation, just as in Section 6.2.6 for the case $p = 1$. Suppose that $\mathrm{var}(\varepsilon) = \kappa W^{-1}$, where $W$ is the diagonal matrix of known case weights $w_j$. Then the weighted least squares estimates are

$$\hat\beta = (X^TWX)^{-1}X^TWy, \qquad (6.27)$$

the fitted values are $\hat\mu = X\hat\beta$, and the residual vector is $e = (I - H)y$, where now the hat matrix $H$ is defined by

$$H = X(X^TWX)^{-1}X^TW. \qquad (6.28)$$

Note that $H$ is not symmetric in general; some authors prefer to work with the symmetric matrix $\tilde X(\tilde X^T\tilde X)^{-1}\tilde X^T$, where $\tilde X = W^{1/2}X$.


whose diagonal elements are the leverage values $h_j$. The residual vector $e$ has variance $\mathrm{var}(e) = \kappa(I - H)W^{-1}$, whose $j$th diagonal element is $\kappa(1 - h_j)w_j^{-1}$. So the modified residual is now

$$r_j = \frac{e_j}{\{w_j^{-1}(1 - h_j)\}^{1/2}}. \qquad (6.29)$$

Model-based resampling is defined by

$$y_j^* = x_j^T\hat\beta + w_j^{-1/2}\varepsilon_j^*,$$

where $\varepsilon_j^*$ is randomly sampled from the centred residuals $r_1 - \bar r, \ldots, r_n - \bar r$. It is not necessary to estimate $\kappa$ to apply this algorithm, but if an estimate were required it would be $\hat\kappa = (n - p - 1)^{-1}y^TW(I - H)y$.

An important modification of case resampling is that each case must now include its weight $w$ in addition to the response $y$ and explanatory variables $x$.
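A minimal sketch (ours, not from the text) of the model-based resampling scheme for weighted least squares, assuming w is the vector of known case weights:

import numpy as np

def wls_model_based_boot(X, y, w, R=999, seed=4):
    """Model-based resampling for weighted least squares, following
    (6.27)-(6.29); w holds the known case weights w_j."""
    rng = np.random.default_rng(seed)
    W = np.diag(w)
    A = np.linalg.inv(X.T @ W @ X)
    beta = A @ X.T @ W @ y                         # (6.27)
    h = np.diag(X @ A @ X.T @ W)                   # leverages from (6.28)
    e = y - X @ beta
    r = e / np.sqrt((1 - h) / w)                   # modified residuals (6.29)
    r = r - r.mean()
    n = len(y)
    boot = np.empty((R, X.shape[1]))
    for i in range(R):
        eps = rng.choice(r, n, replace=True)
        ystar = X @ beta + eps / np.sqrt(w)        # y*_j = x_j'beta + w_j^{-1/2} eps*_j
        boot[i] = A @ X.T @ W @ ystar
    return boot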

6.3.2 Significance tests

Significance tests for the single covariate in simple linear regression were described in Section 6.2.5. Among those tests, which should all behave similarly, are the exact permutation test and a related bootstrap test. Here we look at the more usual practical problem, testing for the effect of one or a subset of several covariates. The tests are based on least squares estimates.

Suppose that the linear regression model is partitioned as

$$Y = X\beta + \varepsilon = X_0\alpha + X_1\gamma + \varepsilon,$$

where $\gamma$ is a vector and we wish to test $H_0 : \gamma = 0$. Initially we assume homoscedastic errors. It would appear that the sufficiency argument which motivates the single-variable permutation test, and makes it exact, no longer applies. But there is a natural extension of that permutation test, and its motivation is clear from the development of bootstrap tests. The basic idea is to subtract out the linear effect of $X_0$ from both $y$ and $X_1$, and then to apply the test described in Section 6.2.5 for simple linear regression.

The first step is to fit the null model, that is

$$\hat\mu_0 = X_0\hat\alpha_0, \qquad \hat\alpha_0 = (X_0^TX_0)^{-1}X_0^Ty.$$

We shall also need the residuals from this fit, which are $e_0 = (I - H_0)y$ with $H_0 = X_0(X_0^TX_0)^{-1}X_0^T$. The test statistic $T$ will be based on the least squares estimate $\hat\gamma$ for $\gamma$ in the full model, which can be expressed as

$$\hat\gamma = (X_{1\cdot0}^TX_{1\cdot0})^{-1}X_{1\cdot0}^Te_0,$$

with $X_{1\cdot0} = (I - H_0)X_1$. The extension of the earlier permutation test is


equivalent to applying the permutation test to "responses" $e_0$ and explanatory variables $X_{1\cdot0}$.

In the permutation-type test and its bootstrap analogue, we simulate data from the null model, assuming homoscedasticity; that is,

$$y^* = \hat\mu_0 + \varepsilon_0^*,$$

where the components of the simulated error vector $\varepsilon_0^*$ are sampled without (permutation) or with (bootstrap) replacement from the $n$ residuals in $e_0$. Note that this makes use of the assumed homoscedasticity of errors. Each case keeps its original covariate values, which is to say that $X^* = X$. With the simulated data we regress $y^*$ on $X$ to calculate $\hat\gamma^*$ and hence the simulated test statistic $t^*$, as described below. When this is repeated $R$ times, the bootstrap P-value is

$$p = \frac{\#\{t_r^* \ge t\} + 1}{R + 1}.$$

The permutation version of the test is not exact when nuisance covariates $X_0$ are present, but empirical evidence suggests that it is close to exact.

Scalar γ

What should $t$ be? For testing a single component, so that $\gamma$ is a scalar, suppose that the alternative hypothesis is one-sided, say $H_A : \gamma > 0$. Then we could take $t$ to be $\hat\gamma$ itself, or possibly a studentized form such as $z_0 = \hat\gamma/v_0^{1/2}$, where $v_0$ is an appropriate estimate of the variance of $\hat\gamma$. If we compute the standard error using the null model residual sum of squares, then

$$v_0 = (n - q)^{-1}e_0^Te_0(X_{1\cdot0}^TX_{1\cdot0})^{-1},$$

where $q$ is the rank of $X_0$. The same formula is applied to every simulated sample to get $v_0^*$ and hence $z^* = \hat\gamma^*/v_0^{*1/2}$.

When there are no nuisance covariates $X_0$, $v_0^* = v_0$ in the permutation test, and studentizing has no effect: the same is true if the non-null standard error is used. Empirical evidence suggests that this is approximately true when $X_0$ is present; see the example below. Studentizing is necessary if modified residuals are used, with standardization based on the null model hat matrix.

An alternative bootstrap test can be developed in terms of a pivot, as described for single-variable regression in Section 6.2.5. Here the idea is to treat $Z = (\hat\gamma - \gamma)/V^{1/2}$ as a pivot, with $V^{1/2}$ an appropriate standard error. Bootstrap simulation under the full fitted model then produces the $R$ replicates of $z^*$ which we use to calculate the P-value. To elaborate, we first fit the full model $\hat\mu = X\hat\beta$ by least squares and calculate the residuals $e = y - \hat\mu$. Still assuming homoscedasticity, the standard error for $\hat\gamma$ is calculated using the residual mean square — a simple formula is

$$v = (n - p - 1)^{-1}e^Te(X_{1\cdot0}^TX_{1\cdot0})^{-1}.$$


Next, datasets are simulated using the model

$$y^* = X\hat\beta + \varepsilon^*, \qquad X^* = X,$$

where the $n$ errors in $\varepsilon^*$ are sampled independently with replacement from the residuals $e$ or modified versions of these. The full regression of $y^*$ on $X$ is then fitted, from which we obtain $\hat\gamma^*$ and its estimated variance $v^*$, these being used to calculate $z^* = (\hat\gamma^* - \hat\gamma)/v^{*1/2}$. From $R$ repeats of this simulation we then have the one-sided P-value

$$p = \frac{\#\{z_r^* \ge z_0\} + 1}{R + 1},$$

where $z_0 = \hat\gamma/v^{1/2}$. Although here we use $p$ to denote a P-value as well as the number of covariates, no confusion should arise.

This test procedure is the same as calculating a $(1 - \alpha)$ lower confidence limit for $\gamma$ by the studentized bootstrap method, and inferring $p < \alpha$ if the lower limit is above zero. The corresponding two-sided P-value is less than $2\alpha$ if the equi-tailed $(1 - 2\alpha)$ studentized bootstrap confidence interval does not include zero.

One can guard against the effects of heteroscedastic errors by using case resampling to do the simulation, and by using a robust standard error for $\hat\gamma$ as described in Section 6.2.5. Also the same basic procedure can be applied to estimates other than least squares.
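As an illustration, the pivot test for a scalar γ might be coded as follows; this sketch (ours, not the book's code) resamples raw centred residuals, though modified residuals could be substituted as noted above.

import numpy as np

def pivot_test(X0, x1, y, R=999, seed=5):
    """One-sided bootstrap test of gamma = 0 using the studentized pivot
    Z = (gamma^* - gamma^)/V^{1/2}, resampling residuals under the full
    model. X0 is the nuisance design (with intercept); x1 is tested."""
    rng = np.random.default_rng(seed)
    n = len(y)
    H0 = X0 @ np.linalg.inv(X0.T @ X0) @ X0.T
    x10 = x1 - H0 @ x1                         # X_{1.0} = (I - H0) x1
    X = np.column_stack([X0, x1])
    q_full = X.shape[1]

    def gamma_var(yy):
        g = (x10 @ yy) / (x10 @ x10)           # LS estimate of gamma
        resid = yy - X @ np.linalg.lstsq(X, yy, rcond=None)[0]
        v = (resid @ resid) / (n - q_full) / (x10 @ x10)
        return g, v

    g, v = gamma_var(y)
    z0 = g / np.sqrt(v)
    mu = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - mu
    e = e - e.mean()                           # centred raw residuals
    zstar = np.empty(R)
    for r in range(R):
        ystar = mu + rng.choice(e, n, replace=True)
        gs, vs = gamma_var(ystar)
        zstar[r] = (gs - g) / np.sqrt(vs)
    return (np.sum(zstar >= z0) + 1) / (R + 1)  # one-sided bootstrap P-value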

Example 6.7 (Rock data) The data in Table 6.5 are measurements on four cross-sections of each of 12 oil-bearing rocks, taken from two sites. The aim is to predict permeability from the other three measurements, which result from a complex image-analysis procedure. In all regression models we use the logarithm of permeability as response $y$. The question we focus on here is whether the coefficient of shape is significant in a multiple linear regression on all three variables.

The problem is nonstandard in that there are four replicates of the explanatory variables for each response value. If we fit a linear regression to all 48 cases treating them as independent, strong correlation among the four residuals for each core sample is evident: see Figure 6.8, in which the residuals have unit variance.

Under a plausible model which accounts for this, which we discuss in Example 6.9, the appropriate linear regression for testing purposes uses core averages of the explanatory variables. Thus if we represent the data as responses $y_j$ and replicate vectors of the explanatory variables $x_{jk}$, $k = 1, 2, 3, 4$, then the model for our analysis is

$$y_j = \bar x_j^T\beta + \varepsilon_j,$$

where the $\varepsilon_j$ are independent. A summary of the least squares regression


Table 6.5 Rock data (Katz, 1995; Venables and Ripley, 1994, p. 251). These are measurements on four cross-sections of 12 core samples, with permeability (milli-Darcies), area (of pore space, in pixels out of 256 x 256), perimeter (pixels), and shape (perimeter/area$^{1/2}$).

case   area  perimeter  shape  permeability
  1    4990       2792   0.09           6.3
  2    7002       3893   0.15           6.3
  3    7558       3931   0.18           6.3
  4    7352       3869   0.12           6.3
  5    7943       3949   0.12          17.1
  6    7979       4010   0.17          17.1
  7    9333       4346   0.19          17.1
  8    8209       4345   0.16          17.1
  9    8393       3682   0.20         119.0
 10    6425       3099   0.16         119.0
 11    9364       4480   0.15         119.0
 12    8624       3986   0.15         119.0
 13   10651       4037   0.23          82.4
 14    8868       3518   0.23          82.4
 15    9417       3999   0.17          82.4
 16    8874       3629   0.15          82.4
 17   10962       4609   0.20          58.6
 18   10743       4788   0.26          58.6
 19   11878       4864   0.20          58.6
 20    9867       4479   0.14          58.6
 21    7838       3429   0.11         142.0
 22   11876       4353   0.29         142.0
 23   12212       4698   0.24         142.0
 24    8233       3518   0.16         142.0
 25    6360       1977   0.28         740.0
 26    4193       1379   0.18         740.0
 27    7416       1916   0.19         740.0
 28    5246       1585   0.13         740.0
 29    6509       1851   0.23         890.0
 30    4895       1240   0.34         890.0
 31    6775       1728   0.31         890.0
 32    7894       1461   0.28         890.0
 33    5980       1427   0.20         950.0
 34    5318        991   0.33         950.0
 35    7392       1351   0.15         950.0
 36    7894       1461   0.28         950.0
 37    3469       1377   0.18         100.0
 38    1468        476   0.44         100.0
 39    3524       1189   0.16         100.0
 40    5267       1645   0.25         100.0
 41    5048        942   0.33        1300.0
 42    1016        309   0.23        1300.0
 43    5605       1146   0.46        1300.0
 44    8793       2280   0.42        1300.0
 45    3475       1174   0.20         580.0
 46    1651        598   0.26         580.0
 47    5514       1456   0.18         580.0
 48    9718       1486   0.20         580.0


Figure 6.8 Rock data: standardized residuals from linear regression of all 48 cases, showing strong intra-core correlations.


Table 6.6 Least squares results for multiple linear regression of rock data, all covariates included and core means used as covariate values.

Variable                  Coefficient      SE   t-value
intercept                       3.465   1.391      2.49
area ($\times 10^{-3}$)         0.864   0.211      4.09
peri ($\times 10^{-3}$)        -1.990   0.400     -4.98
shape                           3.518   4.838      0.73

is shown in Table 6.6. There is evidence of mild non-normality, but not of heteroscedasticity of errors.

Figure 6.9 shows results from both the null model resampling method and the full model pivot resampling method, in both cases using resampling of errors. The observed value of $z$ is $z_0 = 0.73$, for which the one-sided P-value is 0.234 under the first method, and 0.239 under the second method. Thus shape should not be included in the linear regression, assuming that its effect would be linear. Note that $R = 99$ simulations would have been sufficient here. ■

Vector γ

For testing several components simultaneously, we take the test statistic to be the quadratic form

$$T = \hat\gamma^T(X_{1\cdot0}^TX_{1\cdot0})\hat\gamma,$$


Figure 6.9 Resampling distributions of standardized test statistic for variable shape. Left: resampling $z^*$ under null model, R = 999. Right: resampling pivot under full model, R = 999.

or equivalently the difference in residual sums of squares for the null and full model least squares fits. This can be standardized to

$$\frac{n - q}{q} \times \frac{RSS_0 - RSS}{RSS_0},$$

where $RSS_0$ and $RSS$ denote residual sums of squares under the null model and full model respectively.

We can apply the pivot method with full model simulation here also, using $Z = (\hat\gamma - \gamma)^T(X_{1\cdot0}^TX_{1\cdot0})(\hat\gamma - \gamma)/S^2$ with $S^2$ the residual mean square. The test statistic value is $z_0 = \hat\gamma^T(X_{1\cdot0}^TX_{1\cdot0})\hat\gamma/s^2$, for which the P-value is given by

$$p = \frac{\#\{z_r^* \ge z_0\} + 1}{R + 1}.$$

This would be equivalent to rejecting $H_0$ at level $\alpha$ if the $1 - \alpha$ confidence set for $\gamma$ does not include the point $\gamma = 0$. Again, case resampling would provide protection against heteroscedasticity: $z$ would then require a robust standard error.

6.3.3 Prediction

A fitted linear regression is often used for prediction of a new individual response $Y_+$ when the explanatory variable vector is equal to $x_+$. Then we shall want to supplement our predicted value by a prediction interval. Confidence limits for the mean response can be found using the same resampling as is used to get confidence limits for individual coefficients, but limits for the response $Y_+$ itself — usually called prediction limits — require additional resampling to simulate the variation of $Y_+$ about $x_+^T\beta$.


The quantity to be predicted is $Y_+ = x_+^T\beta + \varepsilon_+$, say, and the point predictor is $\hat Y_+ = x_+^T\hat\beta$. The random error $\varepsilon_+$ is assumed to be independent of the random errors $\varepsilon_1, \ldots, \varepsilon_n$ in the observed responses, and for simplicity we assume that they all come from the same distribution: in particular the errors have equal variances.

To assess the accuracy of the point predictor, we can estimate the distribution of the prediction error

$$\delta = \hat Y_+ - Y_+ = x_+^T\hat\beta - (x_+^T\beta + \varepsilon_+)$$

by the distribution of

$$\delta^* = x_+^T\hat\beta^* - (x_+^T\hat\beta + \varepsilon_+^*), \qquad (6.30)$$

where $\varepsilon_+^*$ is sampled from $\hat G$ and $\hat\beta^*$ is a simulated vector of estimates from the model-based resampling algorithm. This assumes homoscedasticity of random error. Unconditional properties of the prediction error correspond to averaging over the distributions of both $\varepsilon_+$ and the estimates $\hat\beta$, which we do in the simulation by repeating (6.30) for each set of values of $\hat\beta^*$. Having obtained the modified residuals from the data fit, the algorithm to generate $R$ sets each with $M$ predictions is as follows.

Algorithm 6.4 (Prediction in linear regression)

For $r = 1, \ldots, R$:

1 simulate responses $y_j^*$ according to (6.16);
2 obtain least squares estimates $\hat\beta_r^* = (X^TX)^{-1}X^Ty^*$; then
3 for $m = 1, \ldots, M$:
  (a) sample $\varepsilon_{+m}^*$ from $r_1 - \bar r, \ldots, r_n - \bar r$, and
  (b) compute prediction error $\delta_{rm}^* = x_+^T\hat\beta_r^* - (x_+^T\hat\beta + \varepsilon_{+m}^*)$.

It is acceptable to use $M = 1$ here: the key point is that $RM$ be large enough to estimate the required properties of $\delta^*$. Note that if predictions at several values of $x_+$ are required, then only the third step of the algorithm needs to be repeated for each $x_+$. (The algorithm and the prediction limits below are sketched in code after (6.32).)

The mean squared prediction error is estimated by the simulation mean squared error $(RM)^{-1}\sum_{r,m}(\delta_{rm}^* - \bar\delta^*)^2$. More useful would be a $(1 - 2\alpha)$ prediction interval for $Y_+$, for which we need the $\alpha$ and $(1 - \alpha)$ quantiles $a_\alpha$ and $a_{1-\alpha}$, say, of prediction error $\delta$. Then the prediction interval would have limits

$$\hat y_+ - a_{1-\alpha}, \qquad \hat y_+ - a_\alpha.$$

The exact, but unknown, quantiles are estimated by empirical quantiles of


the pooled $\delta^*$s, whose ordered values we denote by $\delta_{(1)}^* \le \cdots \le \delta_{(RM)}^*$. The bootstrap prediction limits are

$$\hat y_+ - \delta_{((RM+1)(1-\alpha))}^*, \qquad \hat y_+ - \delta_{((RM+1)\alpha)}^*, \qquad (6.31)$$

where $\hat y_+ = x_+^T\hat\beta$. This is analogous to the basic bootstrap method for confidence intervals (Section 5.2).

A somewhat better approach which mimics the standard normal-theory analysis is to work with the studentized prediction error

$$Z = \frac{\delta}{S},$$

where $S$ is the square root of the residual mean square for the linear regression. (It is unnecessary to standardize also by the square root of $1 + x_+^T(X^TX)^{-1}x_+$, which would make the variance of $Z$ close to 1, unless bootstrap results for different $x_+$ are pooled.) The corresponding simulated values are $z_{rm}^* = \delta_{rm}^*/s_r^*$, with $s_r^*$ calculated in step 2 of Algorithm 6.4. The $\alpha$ and $(1 - \alpha)$ quantiles of $Z$ are estimated by $z_{((RM+1)\alpha)}^*$ and $z_{((RM+1)(1-\alpha))}^*$ respectively, where $z_{(1)}^* \le \cdots \le z_{(RM)}^*$ are the ordered values of all $RM$ $z^*$s. Then the studentized bootstrap prediction interval for $Y_+$ is

$$\hat y_+ - sz_{((RM+1)(1-\alpha))}^*, \qquad \hat y_+ - sz_{((RM+1)\alpha)}^*. \qquad (6.32)$$
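A compact sketch (ours, not the book's code) of Algorithm 6.4 together with the limits (6.31) and (6.32); for simplicity empirical quantiles replace the $(RM+1)\alpha$-th order statistics, which makes a negligible difference for large $RM$.

import numpy as np

def boot_prediction_interval(X, y, xplus, R=999, M=1, alpha=0.025, seed=6):
    """Algorithm 6.4 with the basic (6.31) and studentized (6.32) limits."""
    rng = np.random.default_rng(seed)
    n, q = X.shape
    XtXinv = np.linalg.inv(X.T @ X)
    beta = XtXinv @ X.T @ y
    h = np.einsum('ij,jk,ik->i', X, XtXinv, X)           # leverages
    r = (y - X @ beta) / np.sqrt(1 - h)                  # modified residuals
    rc = r - r.mean()
    s = np.sqrt(np.sum((y - X @ beta) ** 2) / (n - q))
    yhat = xplus @ beta
    deltas, zs = [], []
    for _ in range(R):
        ystar = X @ beta + rng.choice(rc, n, replace=True)   # step 1
        bstar = XtXinv @ X.T @ ystar                         # step 2
        sstar = np.sqrt(np.sum((ystar - X @ bstar) ** 2) / (n - q))
        for _ in range(M):                                   # step 3
            eps = rng.choice(rc)                             # eps*_{+m}
            d = xplus @ bstar - (yhat + eps)                 # delta*_{rm}
            deltas.append(d)
            zs.append(d / sstar)
    dlo, dhi = np.quantile(deltas, [alpha, 1 - alpha])
    zlo, zhi = np.quantile(zs, [alpha, 1 - alpha])
    basic = (yhat - dhi, yhat - dlo)                         # (6.31)
    student = (yhat - s * zhi, yhat - s * zlo)               # (6.32)
    return basic, student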

Example 6.8 (Nuclear power stations) Table 6.7 contains data on the cost of 32 light water reactors. The cost (in dollars $\times 10^{-6}$, adjusted to a 1976 base) is the response of interest, and the other quantities in the table are explanatory variables; they are described in detail in the data source.

We take log(cost) as the working response $y$, and fit a linear model with covariates PT, CT, NE, date, log(capacity) and log(N). The dummy variable PT indicates six plants for which there were partial turnkey guarantees, and it is possible that some subsidies may be hidden in their costs.

Suppose that we wish to obtain 95% prediction intervals for the cost of a station like case 32, except that its value for date is 73.00. The predicted value of log(cost) from the regression is $x_+^T\hat\beta = 6.72$, and the residual standard error from the regression is $s = 0.159$. With $\alpha = 0.025$ and a simulation with $R = 999$ and $M = 1$, $(RM + 1)\alpha = 25$ and $(RM + 1)(1 - \alpha) = 975$. The values of $\delta_{(25)}^*$ and $\delta_{(975)}^*$ are $-0.539$ and $0.551$, so the 95% limits (6.31) are 6.18 and 7.27, which are slightly wider than the normal-theory limits of 6.25 and 7.19. For the limits (6.32) we get $z_{(25)}^* = -3.680$ and $z_{(975)}^* = 3.512$, so the limits for log(cost) are 6.13 and 7.28. The corresponding prediction interval for cost is $[\exp(6.13), \exp(7.28)] = [459.4, 1451]$.

The usual caveats apply about extrapolating a trend outside the range of the data, and we should use these intervals with great caution. ■


The next example involves an unusual data structure, where there is hierarchical variation in the covariates.


Table 6.7 Data on light water reactors constructed in the USA (Cox and Snell, 1981, p. 81).

      cost   date  T1  T2  capacity  PR  NE  CT  BW   N  PT
 1  460.05  68.58  14  46       687   0   1   0   0  14   0
 2  452.99  67.33  10  73      1065   0   0   1   0   1   0
 3  443.22  67.33  10  85      1065   1   0   1   0   1   0
 4  652.32  68.00  11  67      1065   0   1   1   0  12   0
 5  642.23  68.00  11  78      1065   1   1   1   0  12   0
 6  345.39  67.92  13  51       514   0   1   1   0   3   0
 7  272.37  68.17  12  50       822   0   0   0   0   5   0
 8  317.21  68.42  14  59       457   0   0   0   0   1   0
 9  457.12  68.42  15  55       822   1   0   0   0   5   0
10  690.19  68.33  12  71       792   0   1   1   1   2   0
11  350.63  68.58  12  64       560   0   0   0   0   3   0
12  402.59  68.75  13  47       790   0   1   0   0   6   0
13  412.18  68.42  15  62       530   0   0   1   0   2   0
14  495.58  68.92  17  52      1050   0   0   0   0   7   0
15  394.36  68.92  13  65       850   0   0   0   1  16   0
16  423.32  68.42  11  67       778   0   0   0   0   3   0
17  712.27  69.50  18  60       845   0   1   0   0  17   0
18  289.66  68.42  15  76       530   1   0   1   0   2   0
19  881.24  69.17  15  67      1090   0   0   0   0   1   0
20  490.88  68.92  16  59      1050   1   0   0   0   8   0
21  567.79  68.75  11  70       913   0   0   1   1  15   0
22  665.99  70.92  22  57       828   1   1   0   0  20   0
23  621.45  69.67  16  59       786   0   0   1   0  18   0
24  608.80  70.08  19  58       821   1   0   0   0   3   0
25  473.64  70.42  19  44       538   0   0   1   0  19   0
26  697.14  71.08  20  57      1130   0   0   1   0  21   0
27  207.51  67.25  13  63       745   0   0   0   0   8   1
28  288.48  67.17   9  48       821   0   0   1   0   7   1
29  284.88  67.83  12  63       886   0   0   0   1  11   1
30  280.36  67.83  12  71       886   1   0   0   1  11   1
31  217.38  67.25  13  72       745   1   0   0   0   8   1
32  270.71  67.83   7  80       886   1   0   0   1  11   1

Example 6.9 (Rock data) For the data discussed in Example 6.7, one objective is to see how well one can predict permeability from a single replicate of the three image-based measurements, as opposed to the four replicates obtained in the study. The previous analysis suggested that variable shape did not contribute usefully to a linear regression relationship for the logarithm of permeability, and this is confirmed by cross-validation analysis of prediction errors (Section 6.4.1). So here we concentrate on predicting permeability from the linear regression of $y$ = log(permeability) on area and peri.

In Example 6.7 we commented on the strong intra-core correlation among the explanatory variables, and that must be taken into account here if we are to correctly analyse prediction of core permeability from single measurements of area and peri. One way to do this is to think of the four replicate values of $u = (\text{area}, \text{peri})^T$ as unbiased estimates of an underlying core variable $\xi$, on which $y$ has a linear regression. Then the data are modelled by

$$y_j = \alpha + \xi_j^T\gamma + \eta_j, \qquad u_{jk} = \xi_j + \delta_{jk}, \qquad (6.33)$$


Table 6.8 Rock data: fits of linear regression models with K replicate values of explanatory variables area and peri. Normal-theory analysis is via model (6.33).

Method                                Intercept  area ($\times 10^{-4}$)  peri ($\times 10^{-4}$)
K = 1  Direct regression on $u_{jk}$s     5.746                    5.144                   -16.16
       Normal-theory fit                  5.694                    5.300                   -16.39
K = 4  Regression on $\bar u_j$s          4.295                    9.257                   -21.78
       Normal-theory fit                  4.295                    9.257                   -21.78

for $j = 1, \ldots, 12$ and $k = 1, \ldots, K$, where the $\eta_j$ and $\delta_{jk}$ are uncorrelated errors with zero means, and for our data $K = 4$.

Under normality assumptions on the errors and the $\xi_j$, the linear regression of $y_j$ on $u_{j1}, \ldots, u_{jK}$ depends only on the core average $\bar u_j = K^{-1}\sum_{k=1}^K u_{jk}$. The regression coefficients depend strongly on $K$. For prediction from a single measurement $u_+$ we need the model with $K = 1$, and for resampling analysis we shall need the model with $K = 4$. These two versions of the observation regression model we write as

$$y_j = x_j^T\beta^{(K)} + \varepsilon_j^{(K)} = \alpha^{(K)} + \bar u_j^T\gamma^{(K)} + \varepsilon_j^{(K)}, \qquad (6.34)$$

for $K = 1$ and 4; the parameters $\alpha$ and $\gamma$ in (6.33) correspond to $\alpha^{(K)}$ and $\gamma^{(K)}$ when $K = \infty$. Fortunately it turns out that both observation models can be fitted easily: for $K = 4$ we regress the $y_j$s on the core averages $\bar u_j$; and for $K = 1$ we fit linear regression with all 48 individual cases as tabled, ignoring the intra-core correlation among the $\varepsilon_{jk}$s, i.e. pretending that $y_j$ occurs four times independently. Table 6.8 shows the coefficients for both fits, and compares them to corresponding estimates based on exact normal-theory analysis.

Suppose, then, that we want to predict the new response $y_+$ given a single set of measurements $u_+$. If we define $x_+^T = (1, u_+^T)$, then the point prediction is $\hat Y_+ = x_+^T\hat\beta^{(1)}$, where $\hat\beta^{(1)}$ are the coefficients in the fit of model (6.34) with $K = 1$, shown in the first row of Table 6.8. The EDF of the 48 modified residuals from this fit estimates the marginal distribution of the $\varepsilon^{(1)}$ in (6.34), and hence of the error $\varepsilon_+$ in

$$Y_+ = x_+^T\beta^{(1)} + \varepsilon_+.$$

Our concern is with the prediction error

$$\delta = \hat Y_+ - Y_+ = x_+^T(\hat\beta^{(1)} - \beta^{(1)}) - \varepsilon_+, \qquad (6.35)$$

whose distribution is to be estimated by resampling.

The question is how to do the resampling, given the presence of intra-core correlation. A resampled dataset must consist of 12 subsets each with 4 replicates $u_{jk}^*$ and a single response $y_j^*$, from which we shall fit $\hat\beta^{(1)*}$. The prediction



error (6.35) will then be simulated by

$$\delta^* = \hat Y_+^* - Y_+^* = x_+^T(\hat\beta^{(1)*} - \hat\beta^{(1)}) - \varepsilon_+^*,$$

where $\varepsilon_+^*$ is sampled from the EDF of the 48 modified residuals as mentioned above. It remains to decide how to simulate the data from which we calculate $\hat\beta^{(1)*}$.

Usually with error resampling we would fix the covariate values, so here we fix the 12 values of $\bar u_j$, which are surrogates for the $\xi_j$s in model (6.33). Then we simulate responses from the fitted regression on these averages, and simulate the replicated measured covariates using an appropriate hierarchical-data algorithm. Specifically we take

$$u_{jk}^* = \bar u_j + d_{Jk},$$

where $d_{jk} = u_{jk} - \bar u_j$ and $J$ is randomly sampled from $\{1, 2, \ldots, 12\}$. Our justification for this, in terms of retaining intra-core correlation, is given by the discussion in Section 3.8. It is potentially important to build the variation of $u$ into the analysis. Since $\bar u_j^* = \bar u_j$, the resampled responses are defined by

$$y_j^* = x_j^T\hat\beta^{(4)} + \varepsilon_j^{(4)*},$$

where the $\varepsilon_j^{(4)*}$ are randomly sampled from the 12 mean-adjusted, modified residuals $r_j^{(4)} - \bar r^{(4)}$ from the regression of the $y_j$s on the $\bar u_j$s. The estimates $\hat\beta^{(1)*}$ are now obtained by fitting the regression to the 48 simulated cases $(u_{jk}^*, y_j^*)$, $k = 1, \ldots, 4$ and $j = 1, \ldots, 12$.

Figure 6.10 shows typical normal plots for prediction error $y_+^* - \hat y_+$, these for $x_+^T = (1, 4000, 1000)$ and $x_+^T = (1, 10000, 4000)$, which are near the edge of the observed space, from $R = 999$ resamples and $M = 1$. The skewness of prediction error is quite noticeable. The resampling standard deviations for prediction errors are 0.91 and 0.93, somewhat larger than the theoretical standard deviations 0.88 and 0.87 obtained by treating the 48 cases as independent.

To calculate 95% intervals we set $\alpha = 0.025$, so that $(RM + 1)\alpha = 25$ and $(RM + 1)(1 - \alpha) = 975$. The simulation values $\delta_{(25)}^*$ and $\delta_{(975)}^*$ are $-1.63$ and $1.93$ at $x_+^T = (1, 4000, 1000)$, and $-1.57$ and $2.19$ at $x_+^T = (1, 10000, 4000)$. The corresponding point predictions are 6.19 and 4.42, so 95% prediction intervals are $(4.26, 7.82)$ at $x_+^T = (1, 4000, 1000)$ and $(2.23, 5.99)$ at $x_+^T = (1, 10000, 4000)$. These intervals differ markedly from those based on normal theory treating all 48 cases as independent, those being $(4.44, 7.94)$ and $(2.68, 6.17)$. Much of the difference is due to the skewness of the resampling distribution of prediction error. ■


Figure 6.10 Rock data: normal plots of resampled prediction errors for $x_+^T = (1, 4000, 1000)$ (left panel) and $x_+^T = (1, 10000, 4000)$ (right panel), based on $R = 999$ and $M = 1$. Dotted lines correspond to theoretical means and standard deviations.

6.4 Aggregate Prediction Error and Variable Selection

In Section 6.3.3 our discussion of prediction focused on individual cases, and particularly on intervals of uncertainty around point predictions. For some applications, however, we are interested in an aggregate measure of prediction error — such as average squared error or misclassification error — which summarizes accuracy of prediction across a range of values of the covariates, using a given regression model. Such a measure may be of interest in its own right, or as the basis for comparing alternative regression models. In the first part of this section we outline the main resampling methods for estimating aggregate prediction error, and in the second part we discuss the closely related problem of variable selection for linear regression models.

6.4.1 Aggregate prediction error

The least squares fit of the linear regression model (6.22) provides the least squares prediction rule $\hat y_+ = x_+^T\hat\beta$ for predicting what a single response $y_+$ would be at value $x_+$ of the vector of covariates. What we want to know is how accurate this prediction rule will be for predicting data similar to those already observed. Suppose first that we measure accuracy of prediction by squared error $(y_+ - \hat y_+)^2$, and that our interest is in predictions for covariate values that exactly duplicate the data values $x_1, \ldots, x_n$. Then the aggregate prediction error is

$$D = n^{-1}\sum_{j=1}^n E(Y_{+j} - x_j^T\hat\beta)^2,$$


($X$ is the $n \times q$ matrix with rows $x_1^T, \ldots, x_n^T$, where $q = p + 1$ if there are $p$ covariate terms and an intercept in the model.)

in which $\hat\beta$ is fixed and the expectation is over $Y_{+j} = x_j^T\beta + \varepsilon_{+j}$. We cannot calculate $D$ exactly, because the model parameters are unknown, so we must settle for an estimate — which in reality is an estimate of $\Delta = E(D)$, the average over all possible samples of size $n$. Our objective is to estimate $D$ or $\Delta$ as accurately as possible.

As stated the problem is quite simple, at least under the ideal conditions where the linear model is correct and the error variance is constant, for then

$$D = n^{-1}\sum_{j=1}^n \mathrm{var}(Y_{+j}) + n^{-1}\sum_{j=1}^n (x_j^T\beta - x_j^T\hat\beta)^2 = \sigma^2 + n^{-1}(\hat\beta - \beta)^TX^TX(\hat\beta - \beta), \qquad (6.36)$$

whose expectation is

$$\Delta = \sigma^2(1 + qn^{-1}), \qquad (6.37)$$

where $q = p + 1$ is the number of regression coefficients. Since the residual mean square $s^2$ is an unbiased estimate for $\sigma^2$, we have the natural estimate

$$\hat\Delta = s^2(1 + qn^{-1}). \qquad (6.38)$$

However, this estimate is very specialized, in two ways. First, it assumes that the linear model is correct and that the error variance is constant, both unlikely to be exactly true in practice. Secondly, the estimate applies only to least squares prediction and the squared error measure of accuracy, whereas in practice we need to be able to deal with other measures of accuracy and other prediction rules — such as robust linear regression (Section 6.5) and linear classification, where $y$ is binary (Section 7.2). There are no simple analogues of (6.38) to cover these situations, but resampling methods can be applied to all of them.

In order that our discussion apply as broadly as possible, we shall use general notation in which prediction error is measured by $c(y_+, \hat y_+)$, typically an increasing function of $|y_+ - \hat y_+|$, and the prediction rule is $\hat y_+ = \mu(x_+, \hat F)$, where the EDF $\hat F$ represents the observed data. Usually $\mu(x_+, \hat F)$ is an estimate of the mean response at $x_+$, a function of $x_+^T\hat\beta$ with $\hat\beta$ an estimate of $\beta$, and the form of this prediction rule is closely tied to the form of $c(y_+, \hat y_+)$. We suppose that the data $\hat F$ are sampled from distribution $F$, from which the cases to be predicted are also sampled. This implies that we are considering $x_+$ values similar to data values $x_1, \ldots, x_n$. Prediction accuracy is measured by the aggregate prediction error

$$D = D(F, \hat F) = E_+[c\{Y_+, \mu(X_+, \hat F)\} \mid \hat F], \qquad (6.39)$$

where $E_+$ emphasizes that we are averaging only over the distribution of $(X_+, Y_+)$, with the data fixed. Because $F$ is unknown, $D$ cannot be calculated, and so we look for accurate methods of estimating it, or rather its expectation

$$\Delta = \Delta(F) = E\{D(F, \hat F)\}, \qquad (6.40)$$


the average prediction accuracy over all possible datasets of size $n$ sampled from $F$.

The most direct approach to estimation of $\Delta$ is to apply the bootstrap substitution principle, that is substituting the EDF $\hat F$ for $F$ in (6.40). However, there are other widely used resampling methods which also merit consideration, in part because they are easy to use, and in fact the best approach involves a combination of methods.

Apparent error

The simplest way to estimate $D$ or $\Delta$ is to take the average prediction error when the prediction rule is applied to the same data that were used to fit it. This gives the apparent error, sometimes called the resubstitution error,

$$\hat\Delta_{app} = D(\hat F, \hat F) = n^{-1}\sum_{j=1}^n c\{y_j, \mu(x_j, \hat F)\}. \qquad (6.41)$$

This is not the same as the bootstrap estimate $\Delta(\hat F)$, which we discuss later.

It is intuitively clear that $\hat\Delta_{app}$ will tend to underestimate $\Delta$, because the latter refers to prediction of new responses. The underestimation can be easily checked for least squares prediction with squared error, when $\hat\Delta_{app} = n^{-1}RSS$, the average squared residual. If the model is correct with homoscedastic random errors, then $\hat\Delta_{app}$ has expectation $\sigma^2(1 - qn^{-1})$, whereas from (6.37) we know that $\Delta = \sigma^2(1 + qn^{-1})$.

The difference between the true error and the apparent error is the excess error, $D(F, \hat F) - D(\hat F, \hat F)$, whose mean is the expected excess error,

$$e(F) = E\{D(F, \hat F) - D(\hat F, \hat F)\} = \Delta(F) - E\{D(\hat F, \hat F)\}, \qquad (6.42)$$

where the expectation is taken over possible datasets $\hat F$. For squared error and least squares prediction the results in the previous paragraph show that $e(F) = 2qn^{-1}\sigma^2$. The quantity $e(F)$ is akin to a bias and can be estimated by resampling, so the apparent error can be modified to a reasonable estimate, as we see below.

Cross-validation

The apparent error is downwardly biased because it averages errors of predictions for cases at zero distance from the data used to fit the prediction rule. Cross-validation estimates of aggregate error avoid this bias by separating the data used to form the prediction rule and the data used to assess the rule. The general paradigm is to split the dataset into a training set $\{(x_j, y_j) : j \in S_t\}$ and a separate assessment set $\{(x_j, y_j) : j \in S_a\}$, represented by $\hat F_t$ and $\hat F_a$, say. The linear regression predictor is fitted to $\hat F_t$, used to predict responses $y_j$ for


$j \in S_a$, and then $\Delta$ is estimated by

$$D(\hat F_a, \hat F_t) = n_a^{-1}\sum_{j \in S_a} c\{y_j, \mu(x_j, \hat F_t)\}, \qquad (6.43)$$

with $n_a$ the size of $S_a$. There are several variations on this estimate, depending on the size of the training set, the manner of splitting the dataset, and the number of such splits.

The version of cross-validation that seems to come closest to actual use of our predictor is leave-one-out cross-validation. Here training sets of size $n - 1$ are taken, and all such sets are used, so we measure how well the prediction rule does when the value of each response is predicted from the rest of the data. If $\hat F_{-j}$ represents the $n - 1$ observations $\{(x_k, y_k), k \ne j\}$, and if $\mu(x_j, \hat F_{-j})$ denotes the value predicted for $y_j$ by the rule based on $\hat F_{-j}$, then the cross-validation estimate of prediction error is

$$\hat\Delta_{CV} = n^{-1}\sum_{j=1}^n c\{y_j, \mu(x_j, \hat F_{-j})\}, \qquad (6.44)$$

which is the average error when each observation is predicted from the rest of the sample.

In general (6.44) requires $n$ fits of the model, but for least squares linear regression only one fit is required if we use the case-deletion result (Problem 6.2)

$$\hat\beta - \hat\beta_{-j} = (X^TX)^{-1}x_j\frac{y_j - x_j^T\hat\beta}{1 - h_j},$$

where as usual $h_j$ is the leverage for the $j$th case. For squared error in particular we then have

$$\hat\Delta_{CV} = n^{-1}\sum_{j=1}^n \left(\frac{e_j}{1 - h_j}\right)^2. \qquad (6.45)$$

From the nature of $\hat\Delta_{CV}$ one would guess that this estimate has only a small bias, and this is so: assuming an expansion of the form $\Delta(F) = a_0 + a_1n^{-1} + a_2n^{-2} + \cdots$, one can verify from (6.44) that $E(\hat\Delta_{CV}) = a_0 + a_1(n - 1)^{-1} + \cdots$, which differs from $\Delta$ by terms of order $n^{-2}$ — unlike the expectation of the apparent error, which differs by terms of order $n^{-1}$.
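In code, the shortcut (6.45) amounts to the following (a sketch of ours):

import numpy as np

def loo_cv_mse(X, y):
    """Leave-one-out cross-validation for least squares via the
    leverage shortcut (6.45): a single fit suffices."""
    XtXinv = np.linalg.inv(X.T @ X)
    beta = XtXinv @ X.T @ y
    h = np.einsum('ij,jk,ik->i', X, XtXinv, X)   # leverages h_j
    e = y - X @ beta                             # raw residuals
    return np.mean((e / (1 - h)) ** 2)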

K-fold cross-validation

In general there is no reason that training sets should be of size $n - 1$. For certain methods of estimation the number $n$ of fits required for $\hat\Delta_{CV}$ could itself be a difficulty — although not for least squares, as we have seen in (6.45). There is also the possibility that the small perturbations in the fitted model when single observations are left out make $\hat\Delta_{CV}$ too variable, if fitted values $\mu(x, \hat F)$ do not depend smoothly on $\hat F$ or if $c(y_+, \hat y_+)$ is not continuous. These


potential problems can be avoided to a large extent by leaving out groups of observations, rather than single observations. There is more than one way to do this.

One obvious implementation of group cross-validation is to repeat (6.43) for a series of $R$ different splits into training and assessment sets, keeping the size of the assessment set fixed at $n_a = m$, say. Then in a fairly obvious notation the estimate of aggregate prediction error would be

$$\hat\Delta_{CV} = R^{-1}\sum_{r=1}^R m^{-1}\sum_{j \in S_{a,r}} c\{y_j, \mu(x_j, \hat F_{t,r})\}. \qquad (6.46)$$

In principle there are $\binom{n}{m}$ possible splits, possibly an extremely large number, but it should be adequate to take $R$ in the range 100 to 1000. It would be in the spirit of resampling to make the splits at random. However, consideration should be given to balancing the splits in some way — for example, it would seem desirable that each case should occur with equal frequency over the $R$ assessment sets; see Section 9.2. Depending on the value of $n_t = n - m$ and the number $p$ of explanatory variables, one might also need some form of balance to ensure that the model can always be fitted.

There is an efficient version of group cross-validation that does involve just one prediction of each response. We begin by splitting the data into $K$ disjoint sets of nearly equal size, with the corresponding sets of case subscripts denoted by $C_1, \ldots, C_K$, say. These $K$ sets define $R = K$ different splits into training and assessment sets, with $S_{a,k} = C_k$ the $k$th assessment set and the remainder of the data $S_{t,k} = \bigcup_{i \ne k} C_i$ the $k$th training set. For each such split we apply (6.43), and then average these estimates. The result is the K-fold cross-validation estimate of prediction error

$$\hat\Delta_{CV,K} = n^{-1}\sum_{j=1}^n c\{y_j, \mu(x_j, \hat F_{-k(j)})\}, \qquad (6.47)$$

where $\hat F_{-k(j)}$ represents the data from which the group containing the $j$th case has been deleted. Note that $\hat\Delta_{CV,K}$ is equal to the leave-one-out estimate (6.44) when $K = n$. Calculation of (6.47) requires just $K$ model fits. Practical experience suggests that a good strategy is to take $K = \min\{n^{1/2}, 10\}$, on the grounds that taking $K > 10$ may be too computationally intensive when the prediction rule is complicated, while taking groups of size at least $n^{1/2}$ should perturb the data sufficiently to give small variance of the estimate.

The use of groups will have the desired effect of reducing variance, but at the cost of increasing bias. For example, it can be seen from the expansion used earlier for $\Delta$ that the bias of $\hat\Delta_{CV,K}$ is $a_1\{n(K - 1)\}^{-1} + \cdots$, which could be substantial if $K$ is small, unless $n$ is very large. Fortunately the bias of $\hat\Delta_{CV,K}$ can be reduced by a simple adjustment. In a harmless abuse of notation, let $\hat F_{-k}$ denote the data with the $k$th group omitted, for $k = 1, \ldots, K$, and let $p_k$ denote the proportion of the data falling in the $k$th group; if $n/K = m$ is an integer, then all groups are of size $m$ and $p_k = 1/K$. The adjusted cross-validation estimate of aggregate prediction error is

$$\hat\Delta_{ACV,K} = \hat\Delta_{CV,K} + D(\hat F, \hat F) - \sum_{k=1}^K p_kD(\hat F, \hat F_{-k}). \qquad (6.48)$$

This has smaller bias than $\hat\Delta_{CV,K}$ and is almost as simple to calculate, because it requires no additional fits of the model. For a comparison between $\hat\Delta_{CV,K}$ and $\hat\Delta_{ACV,K}$ in a simple situation, see Problem 6.12.

The following algorithm summarizes the calculation of $\hat\Delta_{ACV,K}$ when the split into groups is made at random; a code sketch follows the algorithm.

Algorithm 6.5 (K-fold adjusted cross-validation)

1 Fit the regression model to all cases, calculate predictions from that model, and average the values of $c(y_j, \hat y_j)$ to get $\hat D$.
2 Choose group sizes $m_1, \ldots, m_K$ such that $m_1 + \cdots + m_K = n$.
3 For $k = 1, \ldots, K$:
  (a) choose $C_k$ by sampling $m_k$ times without replacement from $\{1, 2, \ldots, n\}$ minus elements chosen for previous $C_i$s;
  (b) fit the regression model to all data except cases $j \in C_k$;
  (c) calculate new predictions $\tilde y_j = \mu(x_j, \hat F_{-k})$ for $j \in C_k$;
  (d) calculate predictions $\check y_j = \mu(x_j, \hat F_{-k})$ for all $j$; then
  (e) average the $n$ values $c(y_j, \check y_j)$ to give $D(\hat F, \hat F_{-k})$.
4 Average the $n$ values of $c(y_j, \tilde y_j)$ using $\tilde y_j$ from step 3(c) to give $\hat\Delta_{CV,K}$.
5 Calculate $\hat\Delta_{ACV,K}$ as in (6.48) with $p_k = m_k/n$.

Bootstrap estimates

A direct application of the bootstrap principle to $\Delta(F)$ gives the estimate

$$\hat\Delta = \Delta(\hat F) = E^*\{D(\hat F, \hat F^*)\},$$

where $\hat F^*$ denotes a simulated sample $(x_1^*, y_1^*), \ldots, (x_n^*, y_n^*)$ taken from the data by case resampling. Usually simulation is required to approximate this estimate, as follows. For $r = 1, \ldots, R$ we randomly resample cases from the data to obtain the sample $(x_{r1}^*, y_{r1}^*), \ldots, (x_{rn}^*, y_{rn}^*)$, which we represent by $\hat F_r^*$, and to this sample we fit the prediction rule and calculate its predictions $\mu(x_j, \hat F_r^*)$ of the data responses $y_j$ for $j = 1, \ldots, n$. The aggregate prediction error estimate is then calculated as

$$R^{-1}\sum_{r=1}^R n^{-1}\sum_{j=1}^n c\{y_j, \mu(x_j, \hat F_r^*)\}. \qquad (6.49)$$


Intuitively this bootstrap estimate is less satisfactory than cross-validation, because the simulated dataset $\hat F^*$ used to calculate the prediction rule is part of the data $\hat F$ used for assessment of prediction error. In this sense the estimate is a hybrid of the apparent error estimate and a cross-validation estimate, a point to which we return shortly.

As we have noted in previous chapters, care is often needed in choosing what to bootstrap. Here, an approach which works better is to use the bootstrap to estimate the expected excess error $e(F)$ defined in (6.42), which is the bias of the apparent error $\hat\Delta_{app}$, and to add this estimate to $\hat\Delta_{app}$. In theory the bootstrap estimate of $e(F)$ is

$$e(\hat F) = E^*\{D(\hat F, \hat F^*) - D(\hat F^*, \hat F^*)\},$$

and its approximation from the simulations described in the previous paragraph defines the bootstrap estimate of expected excess error

$$\hat e_B = R^{-1}\sum_{r=1}^R\left[n^{-1}\sum_{j=1}^n c\{y_j, \mu(x_j, \hat F_r^*)\} - n^{-1}\sum_{j=1}^n c\{y_{rj}^*, \mu(x_{rj}^*, \hat F_r^*)\}\right]. \qquad (6.50)$$

That is, for the $r$th bootstrap sample we construct the prediction rule $\mu(x, \hat F_r^*)$, then calculate the average difference between the prediction errors when this rule is applied first to the original data and secondly to the bootstrap sample itself, and finally average across bootstrap samples. We refer to the resulting estimate of aggregate prediction error, $\hat\Delta_B = \hat e_B + \hat\Delta_{app}$, as the bootstrap estimate of prediction error, given by

$$\hat\Delta_B = n^{-1}\sum_{j=1}^n R^{-1}\sum_{r=1}^R c\{y_j, \mu(x_j, \hat F_r^*)\} - R^{-1}\sum_{r=1}^R D(\hat F_r^*, \hat F_r^*) + D(\hat F, \hat F). \qquad (6.51)$$

Note that the first term of (6.51), which is also the simple bootstrap estimate (6.49), is expressed as the average of the contributions $R^{-1}\sum_{r=1}^R c\{y_j, \mu(x_j, \hat F_r^*)\}$ that each original observation makes to the estimate of aggregate prediction error. These contributions are of interest in their own right, most importantly in assessing how the performance of the prediction rule changes with values of the explanatory variables. This is illustrated in Example 6.10 below.

Hybrid bootstrap estimates

It is useful to observe that the naive estimate (6.49), which is also the first term of (6.51), can be broken into two qualitatively different parts,

$$n^{-1}\sum_{j=1}^n R^{-1}\sum_{r : j \text{ out}} c\{y_j, \mu(x_j, \hat F_r^*)\} \qquad (6.52)$$

and

$$n^{-1}\sum_{j=1}^n R^{-1}\sum_{r : j \text{ in}} c\{y_j, \mu(x_j, \hat F_r^*)\}, \qquad (6.53)$$

where "$j$ out" indexes the bootstrap samples $\hat F_r^*$ from which $(x_j, y_j)$ is absent, "$j$ in" those in which it appears, and $R_{-j}$ is the number of the $R$ bootstrap samples $\hat F_r^*$ in which $(x_j, y_j)$ does not appear. In (6.52) $y_j$ is always predicted using data from which $(x_j, y_j)$ is excluded, which is analogous to cross-validation, whereas (6.53) is similar to an apparent error calculation because $y_j$ is always predicted using data that contain $(x_j, y_j)$.

Now $R_{-j}/R$ is approximately equal to the constant $e^{-1} = 0.368$, so (6.52) is approximately proportional to

$$\hat\Delta_{BCV} = n^{-1}\sum_{j=1}^n R_{-j}^{-1}\sum_{r : j \text{ out}} c\{y_j, \mu(x_j, \hat F_r^*)\}, \qquad (6.54)$$

sometimes called the leave-one-out bootstrap estimate of prediction error. The notation refers to the fact that $\hat\Delta_{BCV}$ can be viewed as a bootstrap smoothing of the cross-validation estimate $\hat\Delta_{CV}$. To see this, consider replacing the term $c\{y_j, \mu(x_j, \hat F_{-j})\}$ in (6.44) by the expectation $E_{-j}^*[c\{y_j, \mu(x_j, \hat F^*)\}]$, where $E_{-j}^*$ refers to the expectation over bootstrap samples $\hat F^*$ of size $n$ drawn from $\hat F_{-j}$. The estimate (6.54) is a simulation approximation of this expectation, because of the result noted in Section 3.10.1 that the $R_{-j}$ bootstrap samples in which case $j$ does not appear are equivalent to random samples drawn from $\hat F_{-j}$.

The smoothing in (6.54) may effect a considerable reduction in variance, compared to $\hat\Delta_{CV}$, especially if $c(y_+, \hat y_+)$ is not continuous. But there will also be a tendency toward positive bias. This is because the typical bootstrap sample from which predictions are made in (6.54) includes only about $(1 - e^{-1})n = 0.632n$ distinct data values, and the bias of cross-validation estimates increases as the size of the training set decreases.

What we have so far is that the bootstrap estimate of aggregate prediction error essentially involves a weighted combination of $\hat\Delta_{BCV}$ and an apparent error estimate. Such a combination should have good variance properties, but may suffer from bias. However, if we change the weights in the combination it may be possible to reduce or remove this bias. This suggests that we consider the hybrid estimate

$$\hat\Delta_w = w\hat\Delta_{BCV} + (1 - w)\hat\Delta_{app}, \qquad (6.55)$$

and then select $w$ to make the bias as small as possible, ideally $E(\hat\Delta_w) = \Delta + O(n^{-2})$.

Not unexpectedly it is difficult to calculate $E(\hat\Delta_w)$ in general, but for quadratic error and least squares prediction it is relatively easy. We already know that the apparent error estimate has expectation $\sigma^2(1 - qn^{-1})$, and that the true aggregate error is $\Delta = \sigma^2(1 + qn^{-1})$. It remains only to calculate $E(\hat\Delta_{BCV})$, where here

$$\hat\Delta_{BCV} = n^{-1}\sum_{j=1}^n E_{-j}^*(y_j - x_j^T\hat\beta_{-j}^*)^2,$$

with $\hat\beta_{-j}^*$ the least squares estimate of $\beta$ from a bootstrap sample with the $j$th case excluded. A rather lengthy calculation (Problem 6.13) shows that

$$E(\hat\Delta_{BCV}) = \sigma^2(1 + 2qn^{-1}) + O(n^{-2}),$$

from which it follows that

$$E\{w\hat\Delta_{BCV} + (1 - w)\hat\Delta_{app}\} = \sigma^2\{1 + (3w - 1)qn^{-1}\} + O(n^{-2}),$$

which agrees with $\Delta$ to terms of order $n^{-1}$ if $w = 2/3$.

It seems impossible to find an optimal choice of $w$ for general measures of prediction error and general prediction rules, but detailed calculations do suggest that $w = 1 - e^{-1} = 0.632$ is a good choice. Heuristically this value for $w$ is equivalent to an adjustment for the below-average distance between cases and bootstrap samples without them, compared to what we expect in the real prediction problem. That the value 0.632 is close to the value $2/3$ derived above is reassuring. The hybrid estimate (6.55) with $w = 0.632$ is known as the 0.632 estimator of prediction error and is denoted here by $\hat\Delta_{0.632}$. There is substantial empirical evidence favouring this estimate, so long as the number of covariates $p$ is not close to $n$.
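For squared error and least squares, the leave-one-out bootstrap (6.54) and the 0.632 estimator (6.55) can be sketched as follows (ours; it assumes R is large enough that every case is omitted from at least one resample):

import numpy as np

def hybrid_632(X, y, R=200, seed=10):
    """Leave-one-out bootstrap (6.54) combined with the apparent error
    to give the 0.632 estimator (6.55)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
    app = np.mean((y - X @ fit(X, y)) ** 2)          # apparent error
    err_sum = np.zeros(n)                            # errors over samples excluding j
    R_out = np.zeros(n)                              # R_{-j}
    for _ in range(R):
        idx = rng.integers(0, n, n)
        b = fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(n), idx)        # cases absent from F^*_r
        err_sum[out] += (y[out] - X[out] @ b) ** 2
        R_out[out] += 1
    bcv = np.mean(err_sum / R_out)                   # Delta_BCV (6.54)
    return 0.632 * bcv + 0.368 * app                 # Delta_0.632 (6.55)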

Example 6.10 (Nuclear power stations) Consider predicting the cost of a new power station based on the data of Example 6.8. We base our prediction on the linear regression model described there, so we have $\mu(x_j, \hat F) = x_j^T\hat\beta$, where $\hat\beta$ is the least squares estimate for a model with six covariates. The estimated error variance is $s^2 = 0.6337/25 = 0.0253$ with 25 degrees of freedom. The downwardly biased apparent error estimate is $\hat\Delta_{app} = 0.6337/32 = 0.020$, whereas the idealized estimate (6.38) is $0.025 \times (1 + \frac{7}{32}) = 0.031$. In this situation the prediction error for a particular station seems most useful, but before we turn to individual stations, we discuss the overall estimates, which are given in Table 6.9.

Table 6.9 Estimates of aggregate prediction error ($\times 10^{-2}$) for data on nuclear power plants. Results for adjusted cross-validation are shown in parentheses.

Apparent                        K-fold (adjusted) cross-validation
error     Bootstrap  0.632      32     16         10         6
2.0       3.2        3.5        3.6    3.7 (3.7)  3.8 (3.7)  4.4 (4.2)

Those estimates show the pattern we would anticipate from the general discussion. The apparent error is considerably smaller than other estimates. The bootstrap estimate, with $R = 200$, is larger than the apparent error, but smaller than the cross-validation estimates, and the 0.632 estimate agrees well with the ordinary cross-validation estimate (6.44), for which $K = n = 32$. Adjustment slightly decreases the cross-validation estimates. Note that the idealized estimate appears to be quite accurate here, presumably because the model fits well and errors are not far from homoscedastic — except for the last six cases.

Figure 6.11 Components of prediction error for nuclear power data based on 200 bootstrap simulations. The top panel shows the values of $y_j - \mu(x_j, \hat F_r^*)$. The lower left panel shows the average error for each case, plotted against the residuals. The lower right panel shows the ratio of the model-based to the bootstrap prediction standard errors.

Now consider the individual predictions. Prediction error arises from two components: the variability of the predictor and that of the associated error $\varepsilon_+$. Figure 6.11 gives some insight into these. Its top panel shows the values of $y_j - \mu(x_j, \hat F_r^*)$ for $r = 1, \ldots, R$, plotted against case number $j$. The variability of the average error corresponds to the variation of individual observations about their predicted values, while the variance within each group reflects parameter estimation uncertainty. A striking feature is the small prediction error for the last six power plants, whose variances and means are both small. The lower left panel shows the average values of $y_j - \mu(x_j, \hat F_r^*)$ over the 200 simulations, plotted against the raw residuals. They agree closely, as we should expect with a well-fitting model. The lower right panel shows the ratio of the model-based prediction standard error to the bootstrap prediction standard error. It confirms that the model-based calculation described in Example 6.8 overestimates the predictive standard error for the last six plants, which have the partial turnkey guarantee. The estimated bootstrap prediction error for these plants is 0.003, while it is 0.032 for the rest. The last six cases fall into three groups determined by the values of the explanatory variables: in effect they are replicated.

It might be preferable to plot $y_j - \mu(x_j, \hat F_r^*)$ only for those bootstrap samples which exclude the $j$th case, and then mean prediction error would better be compared to jackknifed residuals $y_j - x_j^T\hat\beta_{-j}$. For these data the plots are very similar to those we have shown. ■

Example 6.11 (Times on delivery suite) For a more systematic comparison of prediction error estimates in linear regression, we use data provided by E. Burns on the times taken by 1187 women to give birth at the John Radcliffe Hospital in Oxford. An appropriate linear model has response the log time spent on delivery suite and dummy explanatory variables indicating the type of labour, the use of electronic fetal monitoring, the use of an intravenous drip, the reported length of labour before arriving at the hospital and whether or not the labour is the woman's first; seven parameters are estimated in all. We took 200 samples of size $n = 50$ at random from the full data. For each of these samples we fitted the model described above, and then calculated cross-validation estimates of prediction error $\hat\Delta_{CV,K}$ with $K = 50$, 10, 5 and 2 groups, the corresponding adjusted cross-validation estimates $\hat\Delta_{ACV,K}$, the bootstrap estimate $\hat\Delta_B$, and the hybrid estimate $\hat\Delta_{0.632}$. We took $R = 200$ for the bootstrap calculations.

The results of this experiment are summarized in terms of estimates of the expected excess error in Table 6.10. The average apparent error and excess error were $15.7 \times 10^{-2}$ and $5.2 \times 10^{-2}$, the latter taken to be $e(F)$ as defined in (6.42). The table shows averages and standard deviations of the differences between estimates $\hat\Delta$ and $\hat\Delta_{app}$. The cross-validation estimate with $K = 50$, the bootstrap and the 0.632 estimate have similar properties, while other choices of $K$ give estimates that are more variable; the half-sample estimate $\hat\Delta_{CV,2}$ is worst. Results for cross-validation with 10 and 5 groups are almost


Table 6.10 Summary results for estimates of prediction error for 200 samples of size n = 50 from a set of data on the times 1187 women spent on delivery suite at the John Radcliffe Hospital, Oxford. The table shows the average, standard deviation, and conditional mean squared error ($\times 10^{-2}$) for the 200 estimates of excess error. The "target" average excess error is $5.2 \times 10^{-2}$.

                            K-fold (adjusted) cross-validation
      Bootstrap  0.632      50      10           5            2
Mean  4.6        5.3        5.3     6.0 (5.7)    6.2 (5.5)    9.2 (5.7)
SD    1.3        1.6        1.6     2.3 (2.2)    2.6 (2.3)    5.4 (3.3)
MSE   0.23       0.24       0.24    0.28 (0.26)  0.30 (0.27)  0.71 (0.33)

the same. Adjustment significantly improves cross-validation when group size is not small. The bootstrap estimate is least variable, but is downwardly biased.

The final row of the table gives the conditional mean squared error, defined as $(200)^{-1}\sum\{\hat\Delta_i - D_i(F, \hat F)\}^2$ for each error estimate $\hat\Delta$. This measures the success of $\hat\Delta$ in estimating the true aggregate prediction error $D(F, \hat F)$ for each of the 200 samples. Again the ordinary cross-validation, bootstrap, and 0.632 estimates perform best.

In this example there is little to choose between K-fold cross-validation with 10 and 5 groups, which both perform worse than the ordinary cross-validation, bootstrap, and 0.632 estimators of prediction error. K-fold cross-validation should be used with adjustment if ordinary cross-validation or the simulation-based estimates are not feasible. ■

6.4.2 Variable selectionIn m any applications o f multiple linear regression, one purpose o f the analysis is to decide which covariate terms to include in the final model. The supposition is that the full model y = x T fi + s with p covariates in (6.22) is correct, but that it may include some redundant terms. O ur aim is to eliminate those redundant terms, and so obtain the true model, which will form the basis for further inference. This is somewhat simplistic from a practical viewpoint, because it assumes tha t one subset o f the proposed linear model is “ true” : it may be more sensible to assume that a few subsets may be equally good approxim ations to a com plicated true relationship between mean response and covariates.

Given that there are p covariate terms in the model (6.22), there are 2P candidates for true model because we can include or exclude each covariate. In practice the num ber o f candidates will be reduced if prior inform ation necessitates inclusion o f particular covariates or com binations of them.

There are several approaches to variable selection, including various stepwise methods. But the approach we focus on here is the direct one of minimizing aggregate prediction error, when each candidate model is used to predict independent, future responses at the data covariate values. For simplicity we assume that models are fitted by least squares, and that aggregate prediction


error is average squared error. It would be a simple matter to use other prediction rules and other measures of prediction accuracy.

First we define some notation. We denote an arbitrary candidate model by M, which is one of the $2^p$ possible linear models. Whenever M is used as a subscript, it refers to elements of that model. Thus the $n \times p_M$ design matrix $X_M$ contains those $p_M$ columns of the full design matrix X that correspond to covariates included in M; the jth row of $X_M$ is $x_{M,j}^T$, the least squares estimates for regression coefficients in M are $\hat\beta_M$, and $H_M$ is the hat matrix $X_M(X_M^TX_M)^{-1}X_M^T$ that defines fitted values $\hat y_M = H_My$ under model M. The total number of regression coefficients in M is $q_M = p_M + 1$, assuming that an intercept term is always included.

Now consider prediction of single responses $y_+$ at each of the original design points $x_1,\ldots,x_n$. The average squared prediction error using model M is

  $n^{-1}\sum_{j=1}^{n}(y_{+j} - x_{M,j}^T\hat\beta_M)^2$,

and its expectation under model (6.22), conditional on the data, is the aggregate prediction error

  $D(M) = \sigma^2 + n^{-1}\sum_{j=1}^{n}(\mu_j - x_{M,j}^T\hat\beta_M)^2$,

where $\mu^T = (\mu_1,\ldots,\mu_n)$ is the vector of mean responses for the true multiple regression model. Taking expectation over the data distribution we obtain

  $\Delta(M) = E\{D(M)\} = (1 + n^{-1}q_M)\sigma^2 + n^{-1}\mu^T(I - H_M)\mu$,   (6.56)

where $\mu^T(I - H_M)\mu$ is zero only if model M is correct. The quantities D(M) and $\Delta(M)$ generalize D and $\Delta$ defined in (6.36) and (6.37).

In principle the best model would be the one that minimizes D(M), but since the model parameters are unknown we must settle for minimizing a good estimate of D(M) or $\Delta(M)$. Several resampling methods for estimating $\Delta$ were discussed in the previous subsection, so the natural approach would be to choose a good method and apply it to all possible models. However, accurate estimation of $\Delta(M)$ is not itself important: what matters is to estimate accurately the signs of differences among the $\Delta(M)$, so that we can identify which of the $\Delta(M)$ is smallest.

Of the methods considered earlier, the apparent error estimate $\hat\Delta_{app}(M) = n^{-1}RSS_M$ was poor. Its use here is immediately ruled out when we observe that it always decreases when covariates are added to a model, so minimization always leads to the full model.


Cross-validation

One good estimate, when used with squared error, is the leave-one-out cross-validation estimate. In the present notation this is

  $\hat\Delta_{CV}(M) = n^{-1}\sum_{j=1}^{n}\left(\frac{y_j - \hat y_{M,j}}{1 - h_{M,j}}\right)^2$,   (6.57)

where $\hat y_{M,j}$ is the fitted value for model M based on all the data and $h_{M,j}$ is the leverage for case j in model M. The bias of $\hat\Delta_{CV}(M)$ is small, but that is not enough to make it a good basis for selecting M. To see why, note first that an expansion gives

  $n\hat\Delta_{CV}(M) \doteq \varepsilon^T(I - H_M)\varepsilon + 2p_M\sigma^2 + \mu^T(I - H_M)\mu$.   (6.58)

Then if model M is true, and M′ is a larger model, it follows that for large n

  $\Pr\{\hat\Delta_{CV}(M) < \hat\Delta_{CV}(M')\} = \Pr(\chi_d^2 < 2d)$,

where $d = p_{M'} - p_M$. This probability is substantially below 1 unless d is large. It is therefore quite likely that selecting M to minimize $\hat\Delta_{CV}(M)$ will lead to overfitting, even for large n. So although the term $\mu^T(I - H_M)\mu$ in (6.58) guarantees that, for large n, incorrect models will not be selected, minimization of $\hat\Delta_{CV}(M)$ does not provide consistent selection of the true model.
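As a concrete illustration of (6.57), the leave-one-out estimate can be computed from a single fit of each candidate model, using the leverages; the following is a minimal sketch, assuming a dataframe d whose response column is named y.

cv.loo <- function(fm, d)
{ fit <- glm(fm, data=d)
  h <- glm.diag(fit)$h                        # leverages for this model
  mean(((fit$y - fitted(fit))/(1 - h))^2) }
# e.g. compare nested candidate models:
# cv.loo(y~x1, d); cv.loo(y~x1+x2, d)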

One explanation for this is that to estimate $\Delta(M)$ with sufficient accuracy we need both large amounts of data to fit model M and a large number of independent predictions. This can be accomplished using the more general cross-validation measure (6.43), under conditions given below. In principle we need to average (6.43) over all possible splits, but for practical purposes we follow (6.46). That is, using R different splits into training and assessment sets of sizes $n_t = n - m$ and $n_a = m$, we generalize (6.57) to

  $\hat\Delta_{CV}(M) = R^{-1}\sum_{r=1}^{R} m^{-1}\sum_{j\in S_{a,r}}\{y_j - \hat\mu_{M,j}(S_{t,r})\}^2$,

where $\hat\mu_{M,j}(S_{t,r}) = x_{M,j}^T\hat\beta_M(S_{t,r})$ and $\hat\beta_M(S_{t,r})$ are the least squares estimates for coefficients in M fitted to the rth training set, whose subscripts are in $S_{t,r}$. Note that the same R splits into training and assessment sets are used for all models. It can be shown that, provided m is chosen so that $n - m \to \infty$ and $m/n \to 1$ as $n \to \infty$, minimization of $\hat\Delta_{CV}(M)$ will give consistent selection of the true model as $n \to \infty$ and $R \to \infty$.
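A minimal sketch of this consistent version, with the same R random splits reused for every model, might look as follows; the response name y and the training fraction are assumptions.

n <- nrow(d); m <- round(2*n/3)               # assessment size m, training size n - m
splits <- matrix(0, 100, n - m)               # R = 100 splits, shared across models
for (r in 1:100) splits[r,] <- sample(n, n - m)
cv.split <- function(fm, d, splits)
{ R <- nrow(splits); err <- numeric(R)
  for (r in 1:R)
  { tr <- splits[r,]
    fit <- glm(fm, data=d[tr,])
    err[r] <- mean((d$y[-tr] - predict(fit, d[-tr,]))^2) }
  mean(err) }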


Bootstrap methods

Corresponding results can be obtained for bootstrap resampling methods. The bootstrap estimate of aggregate prediction error (6.51) becomes

  $\hat\Delta_B(M) = n^{-1}RSS_M + R^{-1}\sum_{r=1}^{R}\left\{n^{-1}\sum_{j=1}^{n}(y_j - x_{M,j}^T\hat\beta_{M,r}^*)^2 - n^{-1}RSS_{M,r}^*\right\}$,   (6.59)

where the second term on the right-hand side is an estimate of the expected excess error defined in (6.42). The resampling scheme can be either case resampling or error resampling, with $x_{M,j,r}^* = x_{M,j}$ for the latter.

It turns out that minimization of $\hat\Delta_B(M)$ behaves much like minimization of the leave-one-out cross-validation estimate, and does not lead to a consistent choice of the true model as $n \to \infty$. However, there is a modification of $\hat\Delta_B(M)$, analogous to that made for the cross-validation procedure, which does produce a consistent model selection procedure. The modification is to make simulated datasets be of size n − m rather than n, such that $m/n \to 1$ and $n - m \to \infty$ as $n \to \infty$. Also, we replace the estimate (6.59) by the simpler bootstrap estimate

  $\hat\Delta_B(M) = R^{-1}\sum_{r=1}^{R} n^{-1}\sum_{j=1}^{n}(y_j - x_{M,j}^T\hat\beta_{M,r}^*)^2$,   (6.60)

which is a generalization of (6.49). (The previous doubts about this simple estimate are less relevant for small n − m.) If case resampling is used, then n − m cases are randomly selected from the full set of n. If model-based resampling is used, the model being M with assumed homoscedasticity of errors, then $X_M^*$ is a random selection of n − m rows from $X_M$ and the n − m errors $\varepsilon_j^*$ are randomly sampled from the n mean-corrected modified residuals $r_{M,j} - \bar r_M$ for model M.

Bearing in mind the general advice that the number of simulated datasets should be at least R = 100 for estimating second moments, we should use at least that many here. The same R bootstrap resamples are used for each model M, as with the cross-validation procedure.
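A minimal sketch of (6.60) under case resampling, with resamples of size n − m assessed on all n cases, is as follows; again the response name y is an assumption.

boot.apse <- function(fm, d, R, ntrain)
{ n <- nrow(d); err <- numeric(R)
  for (r in 1:R)
  { i <- sample(n, ntrain, replace=T)          # case resample of size n - m
    fit <- glm(fm, data=d[i,])
    err[r] <- mean((d$y - predict(fit, d))^2) }  # assess on the full data
  mean(err) }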

One major practical difficulty that is shared by the consistent cross-validation and bootstrap procedures is that fitting all candidate models to small subsets of data is not always possible. What empirical evidence there is concerning good choices for m/n suggests that this ratio should be about 2/3. If so, then in many applications some of the R subsets will have singular designs $X_M^*$ for big models, unless subsets are balanced by appropriate stratification on covariates in the resampling procedure.
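In practice one must check each resampled training set for estimability before fitting; a small sketch of such a check, using the rank of the QR decomposition of the resampled design matrix X, is

is.estimable <- function(X, i) qr(X[i, , drop=F])$rank == ncol(X)
repeat
{ i <- sample(n, n - m, replace=T)
  if (is.estimable(X, i)) break }              # discard singular resamples

Discarding singular resamples in this way is what was done to obtain the fragmentary results for small training sets in Example 6.12 below.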

Example 6.12 (Nuclear power stations) In Examples 6.8 and 6.10 our analyses focused on a linear regression model that includes six of the p = 10 covariates available. Three of these covariates (date, log(cap) and NE) are highly significant, all others having P-values of 0.1 or more.


Figure 6.12 Aggregate prediction error estimates plotted against the number of covariates (0 to 10) for the sequence of models fitted to the nuclear power stations data; see text. Leave-one-out cross-validation (solid line), bootstrap with R = 100 resamples of size 32 (dashed line) and 16 (dotted line).

Here we consider the selection of variables to include in the model. The total number of possible models, $2^{10} = 1024$, is prohibitively large, and for the purposes of illustration we consider only the particular sequence of models in which variables enter in the order date, log(cap), NE, CT, log(N), PT, T1, T2, PR, BW: the first three are the highly significant variables.

Figure 6.12 plots the leave-one-out cross-validation estimates and the bootstrap estimates (6.60) with R = 100 of aggregate prediction error for the models with 0, 1, ..., 10 covariates. The two estimates are very close, and both are minimized when six covariates are included (the six used in Examples 6.8 and 6.10). Selection of five or six covariates, rather than fewer, is quite clear-cut. These results bear out the rough rule-of-thumb that variables are selected by cross-validation if they are significant at roughly the 0.1 level.

As the previous discussion would suggest, use of corresponding cross-validation and bootstrap estimates from training sets of size 20 or less is precluded, because for training sets of such sizes the models with more than five covariates are frequently unidentifiable. That is, the unbalanced nature of the covariates, coupled with the binary nature of some of them, frequently leads to singular resample designs. Figure 6.12 includes bootstrap estimates for models with up to five covariates and training sets of size 16: these results were obtained by omitting many singular resamples. These rather fragmentary results confirm that the model should include at least five covariates.

A useful lesson from this is that there is a practical obstacle to what in theory is a preferred variable selection procedure. One way to try to overcome this difficulty is to stratify on the binary covariates, but this is difficult to implement and does not work well here. ■


Figure 6.13 Cross-validation and bootstrap estimates of aggregate prediction error for the sequence of six models fitted to ten datasets of size n = 50 with p = 5 covariates; the true model includes only two covariates. Panels: cv, resample 10; boot, resample 10; cv, resample 20; boot, resample 20; cv, resample 30; boot, resample 30; leave-one-out cv; boot, resample 50.


Example 6.13 (Simulation exercise) In order to assess the variable selection procedures without the complication of singular resample designs, we consider a small simulation exercise in which the procedures are applied to ten datasets simulated from a given model. There are p = 5 independent covariates, whose values are sampled from the uniform distribution on [0, 1], and responses y are generated by adding N(0,1) variates to the means $\mu = x^T\beta$. The cases we examine have sample size n = 50, and $\beta_3 = \beta_4 = \beta_5 = 0$, so the true model includes an intercept and two covariate terms. To simplify calculations only six models are fitted, by successively adding $x_1,\ldots,x_5$ to an initial model with constant intercept. All resampling calculations are done with R = 100 samples. The number of datasets is admittedly small, but sufficient to make rough comparisons of performance.
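A minimal sketch of how one such dataset might be generated, for the case $\beta_1 = \beta_2 = 2$ described below, is:

n <- 50; p <- 5
beta <- c(2, 2, 0, 0, 0)
X <- matrix(runif(n*p), n, p)                  # covariates uniform on [0,1]
y <- as.vector(X %*% beta) + rnorm(n)          # N(0,1) errors
sim.df <- data.frame(X, y=y)
# six nested models: y~1, y~X1, y~X1+X2, ..., y~X1+...+X5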

The main results concern models with $\beta_1 = \beta_2 = 2$, which means that the two non-zero coefficients are about four standard errors away from zero. Each panel of Figure 6.13 shows, for the ten datasets, one variable selection criterion plotted against the number of covariates included in the model. Evidently the clearest indications of the true model occur when the training set size is 10 or 20. Larger training sets give flat profiles for the criterion, and more frequent selection of overfitted models.

These indications match the evidence from more extensive simulations, which suggest that if the training set size n − m is about n/3 then the probability of correct model selection is 0.9 or higher, compared to 0.7 or less for leave-one-out cross-validation.

Further results were obtained with $\beta_1 = 2$ and $\beta_2 = 0.5$, the latter equal to one standard error away from zero. In this situation underfitting, that is failure to include $x_2$ in the selected model, occurred quite frequently even when using training sets of size 20. This degradation of variable selection procedures when coefficients are smaller than two standard errors is reputed to be typical. ■



The theory used to justify the consistent cross-validation and bootstrap procedures may depend heavily on the assumptions that the dimension of the true model is small compared to the number of cases, and that the non-zero regression coefficients are all large relative to their standard errors. It is possible that leave-one-out cross-validation may work well in certain situations where model dimension is comparable to the number of cases. This would be important, in light of the very clear difficulties of using small training sets with typical applications, such as Example 6.12. Evidently further work, both theoretical and empirical, is necessary to find broadly applicable variable selection methods.

6.5 Robust Regression

The use of least squares regression estimates is preferred when errors are near-normal in distribution and homoscedastic. However, the estimates are very sensitive to outliers, that is, cases which deviate strongly from the general relationship. Also, if errors have a long-tailed distribution (possibly due to heteroscedasticity), then least squares estimation is not an efficient method. Any regression analysis should therefore include appropriate inspection of diagnostics based on residuals to detect outliers, and to determine whether a normal assumption for errors is reasonable. If the occurrence of outliers does not cause a change in the regression model, then they will likely be omitted from the fitting of that model. Depending on the general pattern of residuals for the remaining cases, we may feel confident in fitting by least squares, or we may choose to use a more robust method to be safe. Essentially the resampling methods that we have discussed previously in this chapter can be adapted quite easily for use with many robust regression methods. In this section we briefly review some of the main points.

Perhaps the most important point is that gross outliers should be removed before the final regression analysis, including resampling, is undertaken. There are two reasons for this. The first is that methods of fitting that are resistant to outliers are usually not very efficient, and may behave badly under resampling. The second reason is that outliers can be disruptive to resampling analysis of methods such as least squares that are not resistant to outliers. For model-based resampling, the error distribution will be contaminated, and in the resampling the outliers can then occur at any x values. For case resampling, outlying cases will occur with variable frequency and make the bootstrap estimates of coefficients too variable; see Example 6.4. The effects can be diagnosed from the jackknife-after-bootstrap plots of Section 3.10.1 or similarly informative diagnostic plots, but such plots can fail to show the occurrence of multiple outliers.


Table 6.11 Survival data (Efron, 1988).

  Dose (rads)   117.5    235.0    470.0    705.0    940.0    1410
  Survival %    44.000   16.000   4.000    0.500    0.110    0.700
                55.000   13.000   1.960    0.320    0.015    0.006
                          6.120                     0.019

For datasets with possibly multiple outliers, diagnosis is aided by initial use of a fitting method that is highly resistant to the effects of outliers. One preferred resistant method is least trimmed squares, which minimizes

  $\sum_{j=1}^{m} e_{(j)}^2(\beta)$,   (6.61)

the sum of the m smallest squared deviations $e_j(\beta) = y_j - x_j^T\beta$. Usually m is taken to be $[\frac{1}{2}n] + 1$, where [·] denotes integer part. Residuals from the least trimmed squares fit should clearly identify outliers. The fit itself is not very efficient, and is best thought of as an initial step in a more efficient analysis. (It should be noted that in some implementations of least trimmed squares, local minima of (6.61) may be found far away from the global minimum.)
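A least trimmed squares fit can be obtained with the ltsreg function used in the practicals; a minimal sketch, assuming a covariate matrix X and response vector y, is

lts.fit <- ltsreg(X, y)                        # minimizes the m smallest squared deviations
res <- y - cbind(1, X) %*% lts.fit$coef        # residuals; large values flag outliers
s <- median(abs(res))/0.6745                   # robust scale estimate, used below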

Example 6.14 (Survival proportions) The data in Table 6.11 and the left panel of Figure 6.14 are survival percentages for rats at a succession of doses of radiation, with two or three replicates at each dose. The theoretical relationship between survival rate and dose is exponential, so linear regression applies to

x = dose, y = log(survival percentage).

The right panel of Figure 6.14 plots these variables. There is a clear outlier, case 13, at x = 1410. The least squares estimate of slope is −59 × 10⁻⁴ using all the data, changing to −78 × 10⁻⁴ with standard error 5.4 × 10⁻⁴ when case 13 is omitted. The least trimmed squares estimate of slope is −69 × 10⁻⁴.

From the scatter plot it appears that heteroscedasticity may be present, so we resample cases. The effect of the outlier on the resample least squares estimates is illustrated in Figure 6.15, which plots R = 200 bootstrap least squares slopes $\hat\beta_1^*$ against the corresponding values of $\sum(x_j^* - \bar x^*)^2$, differentiated by the frequency with which case 13 appears in the resample. There are three distinct groups of bootstrapped slopes, with the lowest corresponding to resamples in which case 13 does not occur and the highest to samples where it occurs twice or more. A jackknife-after-bootstrap plot would clearly reveal the effect of case 13.

The resampling standard error of $\hat\beta_1^*$ is 15.3 × 10⁻⁴, but only 7.6 × 10⁻⁴ for samples without case 13. The corresponding resampling standard errors of the least trimmed squares slope are 20.5 × 10⁻⁴ and 18.0 × 10⁻⁴, showing both the resistance and inefficiency of the least trimmed squares method. ■



Figure 6.14 Scatter plots of survival data: survival percentage (left) and log survival percentage (right), each plotted against dose (rads).

Figure 6.15 Bootstrap estimates of slope $\hat\beta_1^*$ plotted against the design sum-of-squares $\sum(x_j^* - \bar x^*)^2$ (×10⁵), differentiated by frequency of case 13 (appears zero, one or more times), for case resampling with R = 200 from the survival data.


Example 6.15 (Salinity data) The data in Table 6.12 are n = 28 observations on the salinity of water in Pamlico Sound, North Carolina. The response in the second column is the bi-weekly average of salinity. The next three columns contain values of the covariates, respectively a lagged value of salinity, a trend indicator, and the river discharge. We consider a linear regression model with these three covariates.


Table 6.12 Salinity data (Ruppert and Carroll, 1980).

        Salinity   Lagged salinity   Trend indicator   River discharge
        sal        lag               trend             dis
   1     7.6        8.2              4                 23.01
   2     7.7        7.6              5                 22.87
   3     4.3        4.6              0                 26.42
   4     5.9        4.3              1                 24.87
   5     5.0        5.9              2                 29.90
   6     6.5        5.0              3                 24.20
   7     8.3        6.5              4                 23.22
   8     8.2        8.3              5                 22.86
   9    13.2       10.1              0                 22.27
  10    12.6       13.2              1                 23.83
  11    10.4       12.6              2                 25.14
  12    10.8       10.4              3                 22.43
  13    13.1       10.8              4                 21.79
  14    12.3       13.1              5                 22.38
  15    10.4       13.3              0                 23.93
  16    10.5       10.4              1                 33.44
  17     7.7       10.5              2                 24.86
  18     9.5        7.7              3                 22.69
  19    12.0       10.0              0                 21.79
  20    12.6       12.0              1                 22.04
  21    13.6       12.1              4                 21.03
  22    14.1       13.6              5                 21.01
  23    13.5       15.0              0                 25.87
  24    11.5       13.5              1                 26.29
  25    12.0       11.5              2                 22.93
  26    13.0       12.0              3                 21.31
  27    14.1       13.0              4                 20.77
  28    15.1       14.1              5                 21.39


The initial least squares analysis gives coefficients 0.78, −0.03 and −0.30, with intercept 9.70. The usual standard error for the trend coefficient is 0.16, so this coefficient would be judged far from significant. However, this fit is suspect, as can be seen not from the Q-Q plot of modified residuals but from the plot of cross-validation residuals versus leverages, where case 16 stands out as an outlier, due apparently to its unusual value of dis. The outlier is much more easily detected using the least trimmed squares fit, which has the quite different coefficient values 0.61, −0.15 and −0.86 with intercept 24.72: the residual of case 16 from this fit has standardized value 6.9. (Note that application of standard algorithms for least trimmed squares with default settings can give very different, incorrect solutions.) Figure 6.16 shows normal Q-Q plots of standardized residuals from the least squares (left panel) and least trimmed squares (right panel) fits; for the latter the scale factor is taken to be the median absolute residual divided by 0.6745, the value appropriate for estimating the standard deviation of normal errors.



Figure 6.16 Salinity data: normal Q-Q plots of standardized residuals from least squares (left) and least trimmed squares (right) fits using all cases; the x-axes show quantiles of the standard normal.

There is some question as to whether the outlier is really aberrant, or simply reflects the need for a quadratic term in dis. ■

Robust methods

We suppose now tha t outliers have been isolated by diagnostic plots and set aside from further analysis. The problem now is whether or not that analysis should use least squares estim ation: if there is evidence o f a long-tailed error distribution, then we should downweight large deviations yj — x j fi by using a robust m ethod. Two m ain options for this are now described.

One approach is to minimize not sums of squared deviations but sums of absolute values of deviations, $\sum|y_j - x_j^T\beta|$, so giving less weight to those cases with the largest errors. This is the $L_1$ method, which generalizes, and has efficiency comparable to, the sample median estimate of a population mean. There is no simple expression for the approximate variance of $L_1$ estimators.

More efficient is M-estimation, which is analogous to maximum likelihood estimation. Here the coefficient estimates $\hat\beta$ for a multiple linear regression solve the estimating equation

  $\sum_{j=1}^{n} x_j\,\psi\!\left(\frac{y_j - x_j^T\beta}{s}\right) = 0$,   (6.62)

where ψ(z) is a bounded replacement for z, and s is either the solution to a simultaneous estimating equation, or is fixed in advance. We choose the latter, taking s to be the median absolute deviation (divided by 0.6745) from the least trimmed squares regression fit. The solution to (6.62) is obtained by iterative weighted least squares, for which least trimmed squares estimates are good starting values.
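A minimal sketch of this iterative weighted least squares solution, with Huber's ψ (defined below) and fixed scale s, is as follows; the starting value beta would typically be the least trimmed squares estimate from the earlier sketch.

m.est <- function(X, y, s, beta, c=1.345, niter=20)
{ # X includes the intercept column; psi(z) = z*min(1, c/|z|)
  for (it in 1:niter)
  { z <- as.vector(y - X %*% beta)/s
    w <- pmin(1, c/abs(z))                     # weights psi(z)/z
    beta <- lsfit(X, y, wt=w, intercept=F)$coef }
  beta }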


With a careful choice of ψ(·), M-estimates should have smaller standard errors than least squares estimates for long-tailed distributions of random errors ε, yet have comparable standard errors should those errors be homoscedastic normal. One standard choice is $\psi(z) = z\min(1, c/|z|)$, Huber's winsorizing function, for which the coefficient estimates have approximate efficiency 95% relative to least squares estimates for homoscedastic normal errors when c = 1.345.

For large sample sizes M-estimates $\hat\beta$ are approximately normal in distribution, with approximate variance

  $\mathrm{var}(\hat\beta) = \sigma^2\,\frac{E\{\psi^2(\varepsilon/\sigma)\}}{[E\{\dot\psi(\varepsilon/\sigma)\}]^2}\,(X^TX)^{-1}$   (6.63)

under homoscedasticity, where $\dot\psi(u) = d\psi(u)/du$. A more robust, empirical variance estimate is provided by the nonparametric delta method. First, the empirical influence values are, analogous to (6.25),

  $l_j = k\,n\,(X^TX)^{-1}x_j\psi(e_j/s)$,

where $k = s\{n^{-1}\sum_{j=1}^{n}\dot\psi(e_j/s)\}^{-1}$ and $e_j = y_j - x_j^T\hat\beta$ is the raw residual; see Problem 6.7. The variance approximation is then

  $v_L = n^{-2}\sum_{j=1}^{n} l_jl_j^T = k^2(X^TX)^{-1}X^TDX(X^TX)^{-1}$,   (6.64)

where $D = \mathrm{diag}\{\psi^2(e_1/s),\ldots,\psi^2(e_n/s)\}$; this generalizes (6.17).
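A minimal sketch of the variance approximation (6.64), as reconstructed here for Huber's ψ, is

psi <- function(z, c=1.345) z*pmin(1, c/abs(z))
psi.dot <- function(z, c=1.345) as.numeric(abs(z) <= c)   # derivative of psi
vL <- function(X, y, beta, s)
{ e <- as.vector(y - X %*% beta)
  k <- s/mean(psi.dot(e/s))
  XtXi <- solve(t(X) %*% X)
  k^2 * XtXi %*% t(X) %*% diag(psi(e/s)^2) %*% X %*% XtXi }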

Resampling

As with least squares estimation, so with robust estimates we have two simple choices for resampling: case resampling, or model-based resampling. Depending on which robust method is used, the resampling algorithm may need to be modified from the simple form that it takes for least squares estimation.

The $L_1$ estimates will behave like the sample median under either resampling scheme, so that the distribution of $\hat\beta^* - \hat\beta$ can be very discrete, and close to that of $\hat\beta - \beta$ only for very large samples. Use of the smooth bootstrap (Section 3.4) will improve accuracy. No simple studentization is possible for $L_1$ estimates.

For M-estimates case resampling should be satisfactory except for small datasets, especially those with unreplicated design points. The advantage of case resampling is simplicity. For model-based resampling, some modifications are required to the algorithm used to resample least squares estimation in Section 6.3. First, the leverage correction of raw residuals becomes

  $r_j = \frac{e_j}{(1 - d\,h_j)^{1/2}}$,

where d is a constant ratio built from the sums $\sum\psi^2(e_j/s)$ and $\{\sum\dot\psi(e_j/s)\}^2$, and equals one in the least squares case ψ(z) = z.

Simulated errors are randomly sampled from the uncentred $r_1,\ldots,r_n$. Mean correction to the $r_j$ is replaced by a slightly more complicated correction in the estimating equation itself: the resample version of (6.62) sets the resampled score to have mean zero,

  $\sum_{j=1}^{n} x_j\left[\psi\!\left(\frac{y_j^* - x_j^T\beta}{s^*}\right) - n^{-1}\sum_{k=1}^{n}\psi(r_k/s)\right] = 0$.

The scale estimate $s^*$ is obtained by the same method as s, but from the resample data.

Studentization of $\hat\beta^* - \hat\beta$ is possible, using the resample analogue of the delta method variance (6.64), or more simply just using $s^*$.
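A minimal sketch of case resampling with studentization by $s^*$, assuming a dataframe df with covariate x and response y, and the m.est function of the earlier sketch, is

rob.fun <- function(data, i)
{ d <- data[i,]
  X <- cbind(1, d$x)
  lts <- ltsreg(d$x, d$y)
  s <- median(abs(d$y - X %*% lts$coef))/0.6745
  b <- m.est(X, d$y, s, beta=lts$coef)
  c(b, s) }
rob.boot <- boot(df, rob.fun, R=999)
z.star <- (rob.boot$t[,2] - rob.boot$t0[2])/rob.boot$t[,3]   # studentized slope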


Example 6.16 (Salinity data) In our previous look at the salinity data in Example 6.15, we identified case 16 as a clear outlier. We now set that case aside and re-analyse the linear regression with all three covariates. One objective is to determine whether or not the trend variable should be included in the model: the initial, incorrect least squares analysis suggested not.

A normal Q-Q plot of the modified residuals from the new least squares fit suggests somewhat long tails for the error distribution, so robust methods may be worthwhile. We fit the model by four methods: least squares, Huber M-estimate (with c = 1.345), $L_1$ and least trimmed squares. Coefficient estimates are fairly similar under all methods, except for trend, whose coefficients are −0.17, −0.22, −0.18 and −0.08.

For further analysis we apply case resampling with R = 99. Figure 6.17 illustrates the results for estimates of the coefficient of trend. The dotted lines on the top two panels correspond to the theoretical normal approximations: evidently the standard variance approximation, based on (6.63), for the Huber estimate is too low. Note also the relatively large resampling variance for the least trimmed squares estimate, part of which may be due to unconverged estimates: two resampling outliers have been trimmed from this plot.

To assess the significance of trend we apply the studentized pivot method of Section 6.3.2 with both least squares and M-estimates, studentizing by the theoretical standard error in each case. The corresponding values of z are −1.25 and −1.80, with respectively 23 and 12 smaller values of z* out of 99. So there appears to be little evidence of the need to include trend.

If we checked diagnostic plots for any of the four regression fits, a question might be raised about whether or not case 5 should be included in the analysis. An alternative view of this is provided by jackknife-after-bootstrap plots (Section 3.10.1) of the four fits: such plots correspond to case-deletion resampling. As an illustration, Figure 6.18 shows the jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit. This shows clearly that case 5 has an appreciable effect on the resampling distribution, and that its omission would give tighter confidence limits on the coefficient. It also raises questions about two other cases. Clearly some further exploration is needed before firm conclusions can be reached. ■



The previous example illustrates the point that it is often worthwhile to incorporate robust methods into a regression analysis, both to help isolate outliers and to assess the reliability of conclusions based on the least squares fit to supposedly "clean" data. In some areas of application, for example those involving relationships between financial series, long-tailed distributions may be quite common, and then robust methods will be especially important. To the extent that theoretical normal approximations are inaccurate for many robust estimates, resampling methods are a natural companion to robust analysis.

Figure 6.17 Salinity data: normal Q-Q plots of resampled estimates of the trend coefficient, based on case resampling (R = 99) for the data excluding case 16. Clockwise from top left: least squares, Huber M-estimation, least trimmed squares, $L_1$. Dotted lines correspond to theoretical normal approximations.


Figure 6.18 Jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit to the salinity data, omitting case 16 (x-axis: standardized jackknife value).

6.6 Bibliographic Notes

There are several comprehensive accounts of linear regression analysis, including the books by Draper and Smith (1981), Seber (1977), and Weisberg (1985). Diagnostic methods are described by Atkinson (1985) and by Cook and Weisberg (1982). A good general reference on robust regression is the book by Rousseeuw and Leroy (1987). Many linear regression methods and their properties are summarized, with illustrations using S-Plus, in Venables and Ripley (1994).

The use of bootstrap methods in regression was initiated by Efron (1979). Important early work on the theory of resampling for linear regression was by Freedman (1981) and Bickel and Freedman (1983). See also Efron (1988). Freedman (1984) and Freedman and Peters (1984a,b) assessed the methods in practical applications. Wu (1986) gives a quite comprehensive theoretical treatment, including comparisons between various resampling and jackknife methods; for further developments see Shao (1988) and Liu and Singh (1992b). Hall (1989b) shows that bootstrap methods can provide unusually accurate confidence intervals in regression problems.

Theoretical properties of bootstrap significance tests, including the use of both studentized pivots and F statistics, were established by Mammen (1993). Recent interest in resampling tests for econometric models is reviewed by Jeong and Maddala (1993).

Use of the bootstrap for calculating prediction intervals was discussed by Stine (1985). The asymptotic theory for the most elementary case was given by Bai and Olshen (1988). For further theoretical development see Beran (1992).


Olshen et al. (1989) described an interesting application to a complicated prediction problem.

The wild bootstrap is based on an idea suggested by Wu (1986), and has been explored in detail by Härdle (1989, 1990) and Mammen (1992). The effectiveness of the wild bootstrap, particularly for studentized coefficients, was demonstrated by Mammen (1993).

Cross-validation methods for the assessment of prediction error have a long history, but modern developments originated with Stone (1974) and Geisser (1975). What we refer to as K-fold cross-validation was proposed by Breiman et al. (1984), and further studied by Burman (1989). Important theoretical results were developed by Bunke and Droge (1984), Li (1987), and Shao (1993). The theoretical foundation of cross-validation and bootstrap estimates of prediction error, with particular emphasis on classification problems, was developed in Chapter 7 of Efron (1982) and by Efron (1983), the latter introducing the 0.632 estimate. Further developments, with applications and empirical studies, were given by Efron (1986) and Efron and Tibshirani (1997). The discussion of hybrid estimates in Section 6.4 is based on Hall (1995). In a simple case Davison and Hall (1992) attempt to explain the properties of the bootstrap and cross-validation error estimates.

There is a large literature on variable selection in regression, much of which overlaps with the cross-validation literature. Cross-validation is related to the $C_p$ method of linear model selection, proposed by Mallows (1973), and to the AIC method of Akaike (1973), as was shown by Stone (1977). For a summary discussion of various methods of model selection see Chapter 2 of Ripley (1996), for example. The consistent bootstrap methods outlined in Section 6.4 were developed by Shao (1996).

Asymptotic properties of resampled M-estimates were derived by Shorack (1982), who described the adjustment necessary for unbiasedness of the resampled coefficients. Mammen (1989) provided additional asymptotic support. Aspects of residuals from robust regression were discussed by Cook, Hawkins and Weisberg (1992) and McKean, Sheather and Hettmansperger (1993), the latter showing how to standardize raw residuals in M-estimation. De Angelis, Hall and Young (1993) gave a detailed theoretical analysis of model-based resampling in $L_1$ estimation, which confirmed that a smooth bootstrap is advisable; further numerical results were provided by Stangenhaus (1987).

6.7 Problems

1 Show that for a multivariate distribution with mean vector μ and variance matrix Ω, the influence functions for the sample mean and variance are respectively

  $L_{\mu}(z) = z - \mu, \qquad L_{\Omega}(z) = (z - \mu)(z - \mu)^T - \Omega$.


Hence show that for the linear regression model derived as the conditional expectation E(y | X = x) of a multivariate CDF F, the empirical influence function values for the linear regression parameters are

  $l_\beta(x_j, y_j) = n(X^TX)^{-1}x_j e_j$,

where X is the matrix of explanatory variables.
(Sections 2.7.2, 6.2.2)

2 For homogeneous data as in Chapter 2, the empirical influence values for an estimator can be approximated using case-deletion values. Use the matrix identity

  $(X^TX - x_jx_j^T)^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_jx_j^T(X^TX)^{-1}}{1 - x_j^T(X^TX)^{-1}x_j}$

to show that in the linear regression model with least squares fitting,

  $\hat\beta - \hat\beta_{-j} = (X^TX)^{-1}x_j\,\frac{y_j - x_j^T\hat\beta}{1 - h_j}$.

Compare this to the corresponding empirical influence value in Problem 6.1, and obtain the jackknife estimates of the bias and variance of $\hat\beta$.
(Sections 2.7.3, 6.2.2, 6.4)

3 For the linear regression model $y_j = x_j\beta + \varepsilon_j$, with no intercept, show that the least squares estimate of β is $\hat\beta = \sum x_jy_j/\sum x_j^2$. Define residuals by $e_j = y_j - x_j\hat\beta$. If the resampling model is $y_j^* = x_j\hat\beta + \varepsilon_j^*$, with $\varepsilon_j^*$ randomly sampled from the $e_j$s, show that the resample estimate $\hat\beta^*$ has mean and variance respectively

  $\hat\beta + \frac{n\bar x\,\bar e}{\sum x_j^2}, \qquad \frac{n^{-1}\sum(e_j - \bar e)^2}{\sum x_j^2}$,

where $\bar e$ and $\bar x$ are the averages of the $e_j$ and $x_j$. Thus in particular the resampling mean is incorrect. Examine the improvements made by leverage adjustment and mean correction of the residuals.
(Section 6.2.3)

4 The usual estimated variance of the least squares slope estimate $\hat\beta_1$ in simple linear regression can be written

  $v = \frac{\sum(y_j - \bar y)^2 - \hat\beta_1^2\sum(x_j - \bar x)^2}{(n-2)\sum(x_j - \bar x)^2}$.

If the $x^*$s and $y^*$s are random permutations of the xs and ys, show that

  $v^* = \frac{\sum(y_j - \bar y)^2 - \hat\beta_1^{*2}\sum(x_j - \bar x)^2}{(n-2)\sum(x_j - \bar x)^2}$.

Hence show that in the permutation test for zero slope, the R values of $\hat\beta_1^*$ are in the same order as those of $\hat\beta_1^*/v^{*1/2}$, and that $\hat\beta_1^* \ge \hat\beta_1$ is equivalent to $\hat\beta_1^*/v^{*1/2} \ge \hat\beta_1/v^{1/2}$. This confirms that the P-value of the permutation test is unaffected by studentizing.
(Section 6.2.5)


5 For least squares regression, model-based resampling gives a bootstrap estimator $\hat\beta^*$ which satisfies

  $\hat\beta^* = \hat\beta + (X^TX)^{-1}\sum_{j=1}^{n} x_j\varepsilon_j^*$,

where the $\varepsilon_j^*$ are randomly sampled modified residuals. An alternative proposal is to bypass the resampling model for data and to define directly

  $\hat\beta^{**} = \hat\beta + (X^TX)^{-1}\sum_{j=1}^{n} u_j^*$,

where the $u^*$s are randomly sampled from the vectors

  $u_j = x_j(y_j - x_j^T\hat\beta), \qquad j = 1,\ldots,n$.

Show that under this proposal $\hat\beta^{**}$ has mean $\hat\beta$ and variance equal to the robust variance estimate (6.26). Examine, theoretically or through numerical examples, to what extent the skewness of $\hat\beta^{**}$ matches the skewness of $\hat\beta^*$.
(Section 6.3.1; Hu and Zidek, 1995)

6 For the linear regression model $y = X\beta + \varepsilon$, the improved version of the robust estimate of variance for the least squares estimates $\hat\beta$ is

  $v_{rob} = (X^TX)^{-1}X^T\mathrm{diag}(r_1^2,\ldots,r_n^2)X(X^TX)^{-1}$,

where $r_j$ is the jth modified residual. If the errors have equal variances, then the usual variance estimate

  $v = s^2(X^TX)^{-1}$

would be appropriate and $v_{rob}$ could be quite inefficient. To quantify this, examine the case where the random errors $\varepsilon_j$ are independent $N(0,\sigma^2)$. Show first that

  $E(r_j^2) = \sigma^2, \qquad j = 1,\ldots,n$.

Hence show that the efficiency of the ith diagonal element of $v_{rob}$ relative to the ith diagonal element of v, as measured by the ratio of their variances, is

  $\frac{b_{ii}^2}{(n-p)\,g_i^TQg_i}$,

where $b_{ii}$ is the ith diagonal element of $(X^TX)^{-1}$, $g_i^T = (d_{i1}^2,\ldots,d_{in}^2)$ with $D = (X^TX)^{-1}X^T$, and Q has elements $(\delta_{jk} - h_{jk})^2/\{(1-h_j)(1-h_k)\}$, where $h_{jk}$ is the (j,k)th element of the hat matrix H, $h_{jj} = h_j$, and $\delta_{jk}$ is the Kronecker delta. Calculate this relative efficiency for a numerical example.
(Sections 6.2.4, 6.2.6, 6.3.1; Hinkley and Wang, 1991)

7 The statistical function β(F) for M-estimation is defined by the estimating equation

  $\int x\,\psi\!\left\{\frac{y - x^T\beta(F)}{\sigma(F)}\right\}dF(x,y) = 0$,

where σ(F) is typically a robust scale parameter. Assume that the model contains an intercept, so that the covariate vector x includes the dummy variable 1. Use the


technique of Problem 2.12 to show that the influence function for β(F) is

  $L_\beta(x,y) = \left[\int xx^T\dot\psi(\varepsilon)\,dF(x,y)\right]^{-1}\sigma x\,\psi(\varepsilon)$,

where $\varepsilon = (y - x^T\beta)/\sigma$ and $\dot\psi(u) = d\psi(u)/du$; it is assumed that $\psi(\varepsilon)$ has mean zero. If the distribution of the covariate vector is taken to be the EDF of $x_1,\ldots,x_n$, show that

  $L_\beta(x,y) = n\sigma k^{-1}(X^TX)^{-1}x\,\psi(\varepsilon)$,

where X is the usual covariate matrix and $k = E\{\dot\psi(\varepsilon)\}$. Use the empirical version of this to verify the variance approximation

  $v_L = ns^2(X^TX)^{-1}\,\frac{\sum\psi^2(e_j/s)}{\{\sum\dot\psi(e_j/s)\}^2}$,

where $e_j = y_j - x_j^T\hat\beta$ and s is the estimated scale parameter.
(Section 6.5)

8 Given raw residuals $e_1,\ldots,e_n$, define independent random variables $\varepsilon_j^*$ by (6.21). Show that the first three moments of $\varepsilon_j^*$ are 0, $e_j^2$, and $e_j^3$.
(a) Let $e_1,\ldots,e_n$ be raw residuals from the fit of a linear model $y = X\beta + \varepsilon$, and define bootstrap data by $y^* = X\hat\beta + \varepsilon^*$, where the elements of $\varepsilon^*$ are generated according to the wild bootstrap. Show that the bootstrap least squares estimates $\hat\beta^*$ take at most $2^n$ values, and that

  $E^*(\hat\beta^*) = \hat\beta, \qquad \mathrm{var}^*(\hat\beta^*) = v_{wild} = (X^TX)^{-1}X^TWX(X^TX)^{-1}$,

where $W = \mathrm{diag}(e_1^2,\ldots,e_n^2)$.
(b) Show that when all the errors have equal variances and the design is balanced, so that $h_j = p/n$, $v_{wild}$ is negatively biased as an estimate of $\mathrm{var}(\hat\beta)$.
(c) Show that for the simple linear regression model (6.1) the expected value of $\mathrm{var}^*(\hat\beta_1^*)$ is

  $\frac{\sigma^2}{m_2n^2}\left(n - 1 - m_4/m_2^2\right)$,

where $m_r = n^{-1}\sum(x_j - \bar x)^r$. Hence show that if the $x_j$ are uniformly spaced and the errors have equal variances, the wild bootstrap variance estimate is too small by a factor of about $1 - 14/(5n)$.
(d) Show that if the $e_j$ are replaced by $r_j$, the difficulties in (b) and (c) do not arise.
(Sections 6.2.4, 6.2.6, 6.3.2)

9 Suppose that responses $y_1,\ldots,y_n$ with n = 2m correspond to m independent samples of size two, where the ith sample comes from a population with mean $\mu_i$, and these means are of primary interest; the m population variances may differ. Use appropriate dummy variables $x_i$ to express the responses in the linear model $y = X\beta + \varepsilon$, where $\beta_i = \mu_i$. With parameters estimated by least squares, consider estimating the standard error of $\hat\beta_i$ by case resampling.
(a) Show that the probability of getting a simulated sample in which all the parameters are estimable is

  $\sum_{k=0}^{m}(-1)^k\binom{m}{k}\left(1 - \frac{k}{m}\right)^{2m}$.

320 6 ■ Linear Regression

(b) Consider constrained case resampling in which each o f the m samples must be represented at least once. Show that the probability that there are r resample cases from the ith sample is

(b) Consider constrained case resampling in which each of the m samples must be represented at least once. Show that the probability that there are r resample cases from the ith sample is

  $\binom{m}{r-1}\left(\frac{1}{m}\right)^{r-1}\left(1 - \frac{1}{m}\right)^{m-r+1}$,

for $r = 1,\ldots,m+1$. Hence calculate the resampling mean of $\hat\mu_i^*$ and give an expression for its variance.
(Section 6.3; Feller, 1968, p. 102)

10 For the one-way model of Problem 6.9 with two observations per group, suppose that $\theta = \beta_2 - \beta_1$. Note that the least squares estimator of θ satisfies

  $\hat\theta = \theta + \tfrac{1}{2}(\varepsilon_3 + \varepsilon_4 - \varepsilon_1 - \varepsilon_2)$.

Suppose that we use model-based resampling with the assumption of error homoscedasticity. Show that the resample estimate can be expressed as

  $\hat\theta^* = \hat\theta + \tfrac{1}{2}(\varepsilon_3^* + \varepsilon_4^* - \varepsilon_1^* - \varepsilon_2^*)$,

where the $\varepsilon_j^*$ are randomly sampled from the 2m modified residuals $\pm\frac{1}{\sqrt 2}(e_{2i} - e_{2i-1})$, $i = 1,\ldots,m$. Use this representation to calculate the first four resampling moments of $\hat\theta^* - \hat\theta$. Compare the results with the first four moments of $\hat\theta - \theta$, and comment.
(Section 6.3)

11 Suppose that a $2^{-r}$ fraction of a $2^8$ factorial experiment is run, where 1 < r < 4. Under what circumstances would a bootstrap analysis based on case resampling be reliable?
(Section 6.3)

12 The several cross-validation estimates of prediction error can be calculated explicitly in the simple problem of least squares prediction for homogeneous data with no covariates. Suppose that data $y_1,\ldots,y_n$ and future responses $y_+$ are all sampled from a population with mean μ and variance $\sigma^2$, and consider the prediction rule $\mu(\hat F) = \bar y$ with accuracy measured by quadratic error.
(a) Verify that the overall prediction error is $\Delta = \sigma^2(1 + n^{-1})$, that the expectation of the apparent error estimate is $\sigma^2(1 - n^{-1})$, and that the cross-validation estimate $\hat\Delta_{CV}$ with training sets of size $n_t$ has expectation $\sigma^2(1 + n_t^{-1})$.
(b) Now consider the K-fold cross-validation estimate $\hat\Delta_{CV,K}$ and suppose that n = Km with m an integer. Re-label the data in the kth group as $y_{k1},\ldots,y_{km}$, and define $\bar y_k = m^{-1}\sum_{l=1}^{m} y_{kl}$. Verify that

  $\hat\Delta_{CV,K} = n^{-1}\sum_{k=1}^{K}\sum_{l=1}^{m}\left\{(y_{kl} - \bar y) + \frac{\bar y_k - \bar y}{K-1}\right\}^2$,

and hence show that

  $E(\hat\Delta_{CV,K}) = \sigma^2\{1 + n^{-1} + n^{-1}(K-1)^{-1}\}$.

Thus the bias of $\hat\Delta_{CV,K}$ is $\sigma^2 n^{-1}(K-1)^{-1}$.


(c) Extend the calculations in (b) to show that the adjusted estimate can be written

  $\hat\Delta_{ACV,K} = \hat\Delta_{CV,K} - K^{-1}(K-1)^{-2}\sum_{k=1}^{K}(\bar y_k - \bar y)^2$,

and use this to show that $E(\hat\Delta_{ACV,K}) = \Delta$.
(Section 6.4; Burman, 1989)

13 The leave-one-out bootstrap estimate of aggregate prediction error for linear prediction and squared error is equal to

  $\hat\Delta_{BCV} = n^{-1}\sum_{j=1}^{n} E_{-j}^*\left\{(y_j - x_j^T\hat\beta_{-j}^*)^2\right\}$,

where $\hat\beta_{-j}^*$ is the least squares estimate of β from a bootstrap sample with the jth case excluded and $E_{-j}^*$ denotes expectation over such samples. To calculate the mean of $\hat\Delta_{BCV}$, use the substitution

  $y_j - x_j^T\hat\beta_{-j}^* = (y_j - x_j^T\hat\beta_{-j}) - x_j^T(\hat\beta_{-j}^* - \hat\beta_{-j})$,

and then show that

  $E\{(Y_j - x_j^T\hat\beta_{-j})^2\} = \sigma^2\{1 + q(n-1)^{-1}\}$,
  $E\left[E_{-j}^*\left\{x_j^T(\hat\beta_{-j}^* - \hat\beta_{-j})(\hat\beta_{-j}^* - \hat\beta_{-j})^Tx_j\right\}\right] = \sigma^2 q(n-1)^{-1} + O(n^{-2})$,
  $E\left\{(Y_j - x_j^T\hat\beta_{-j})\,x_j^T E_{-j}^*(\hat\beta_{-j}^* - \hat\beta_{-j})\right\} = O(n^{-2})$.

These results combine to show that $E(\hat\Delta_{BCV}) = \sigma^2(1 + 2qn^{-1}) + O(n^{-2})$, which leads to the choice w = 2/3 for the estimate $\hat\Delta_w = w\hat\Delta_{BCV} + (1-w)\hat\Delta_{app}$.
(Section 6.4; Hall, 1995)

6.8 Practicals

1 Dataset catsM contains a set of data on the heart weights and body weights of 97 male cats. We investigate the dependence of heart weight (g) on body weight (kg). To see the data, fit a straight-line regression and do diagnostic plots:

catsM
plot(catsM$Bwt, catsM$Hwt, xlim=c(0,4), ylim=c(0,24))
cats.lm <- glm(Hwt~Bwt, data=catsM)
summary(cats.lm)
cats.diag <- glm.diag.plots(cats.lm, ret=T)

The summary suggests that the line passes through the origin, but we cannot rely on normal-theory results here, because the residuals seem skewed, and their variance possibly increases with the mean. Let us assess the stability of the fitted regression. For case resampling:

cats.fit <- function(data) coef(glm(data$Hwt~data$Bwt))
cats.case <- function(data, i) cats.fit(data[i,])
cats.boot1 <- boot(catsM, cats.case, R=499)
cats.boot1


plot(cats.boot1, jack=T)
plot(cats.boot1, index=2, jack=T)

to see a summary and plots for the bootstrapped intercepts and slopes. How normal do they seem? Is the model-based standard error from the original fit accurate? To what extent do the results depend on any single observation? We can calculate the estimated standard error by the nonparametric delta method by

cats.L <- empinf(cats.boot1, type="reg")
sqrt(var.linear(cats.L))

Compare it with the quoted standard error from the regression output, and with the empirical variance of the intercepts. Are the three standard errors in the order you would expect? For model-based resampling:

cats.res <- cats.diag$res*cats.diag$sd
cats.res <- cats.res - mean(cats.res)
cats.df <- data.frame(catsM, res=cats.res, fit=fitted(cats.lm))
cats.model <- function(data, i)
{ d <- data
  d$Hwt <- d$fit + d$res[i]
  cats.fit(d) }
cats.boot2 <- boot(cats.df, cats.model, R=499)
cats.boot2
plot(cats.boot2)

Compare the properties of these bootstrapped coefficients with those from case resampling. How would you use a resampling method to test the hypothesis that the line passes through the origin?

Compare the properties o f these bootstrapped coefficients with those from case resampling.How would you use a resampling method to test the hypothesis that the line passes through the origin?(Section 6.2; Fisher, 1947)

2 The data of Example 6.14 are in dataframe survival. For a jackknife-after-bootstrap plot for the regression slope $\hat\beta_1$:

survival.fun <- function(data, i)
{ d <- data[i,]
  d.reg <- glm(log(d$surv)~d$dose)
  c(coefficients(d.reg)) }
survival.boot <- boot(survival, survival.fun, R=999)
jack.after.boot(survival.boot, index=2)

Compare this with Figure 6.15. What is happening?

3 poisons contains the survival times of animals in a 3×4 factorial experiment. Each combination of three poisons and four treatments is used for four animals, the allocation to the animals being completely randomized. The data are standard in the literature as an example where transformation can be applied. Here we apply resampling to the data on the original scale, and use it to test whether an interaction between the two factors is needed. To calculate the test statistic, the standard F statistic, and to see its significance using the usual F test:

poison.fun <- function(data)
{ assign("data.junk", data, frame=1)
  data.anova <- anova(glm(time~poison*treat, data=data.junk))
  dev <- as.numeric(unlist(data.anova[2]))
  df <- as.numeric(unlist(data.anova[1]))
  res.dev <- as.numeric(unlist(data.anova[4]))
  res.df <- as.numeric(unlist(data.anova[3]))
  (dev[4]/df[4])/(res.dev[4]/res.df[4]) }
poison.fun(poisons)
anova(glm(time~poison*treat, data=poisons), test="F")

To apply resampling analysis, using as the null model that with main effects:

poison.lm <- glm(time~poison+treat, data=poisons)
poison.diag <- glm.diag(poison.lm)
poison.mle <- list(fit=fitted(poison.lm),
  res=residuals(poison.lm)/sqrt(1-poison.diag$h))
poison.gen <- function(data, mle)
{ i <- sample(48, replace=T)
  data$time <- mle$fit + mle$res[i]
  data }
poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric",
  ran.gen=poison.gen, mle=poison.mle)
sum(poison.boot$t > poison.boot$t0)

At what level does this give significance? Is this in line with the theoretical value? One assumption of the above analysis is homogeneity of variances, but the data cast some doubt on this. To test the hypothesis without this assumption:

poison.gen1 <- function(data, mle)
{ i <- matrix(1:48, 4, 12, byrow=T)
  i <- apply(i, 1, sample, replace=T, size=4)
  data$time <- mle$fit + mle$res[i]
  data }
poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric",
  ran.gen=poison.gen1, mle=poison.mle)
sum(poison.boot$t > poison.boot$t0)

What do you conclude now?
(Section 6.3; Box and Cox, 1964)

4 For an example of prediction, we consider using the nuclear power station data to predict the cost of new stations like cases 27-32, except that their value for date is 73. We choose to make the prediction using the model with all covariates. To fit that model, and to make the 'new' station:

nuclear.glm <- glm(log(cost)~date+log(t1)+log(t2)+log(cap)+pr+ne
  +ct+bw+log(cum.n)+pt, data=nuclear)
nuclear.diag <- glm.diag(nuclear.glm)
nuke <- data.frame(nuclear, fit=fitted(nuclear.glm),
  res=nuclear.diag$res*nuclear.diag$sd)
nuke.p <- nuke[32,]
nuke.p$date <- 73
nuke.p$fit <- predict(nuclear.glm, nuke.p)

The bootstrap function and the call to boot are:

nuke.pred <- function(data, i, i.p, d.p)
{ d <- data
  d$cost <- exp(d$fit + d$res[i])
  d.glm <- glm(log(cost)~date+log(t1)+log(t2)+log(cap)+pr+ne
    +ct+bw+log(cum.n)+pt, data=d)
  predict(d.glm, d.p) - (d.p$fit + d$res[i.p]) }
nuclear.boot.pred <- boot(nuke, nuke.pred, R=199, m=1, d.p=nuke.p)

Finally the 95% prediction intervals are obtained by

as.vector(exp(nuke.p$fit - quantile(nuclear.boot.pred$t,
  c(0.975, 0.025))))

How do these compare to those in Example 6.8? Modify the above analysis to use a studentized pivot. What effect has this change on your interval?
(Section 6.3.3; Cox and Snell, 1981, pp. 81-90)

5 Consider predicting the log brain weight of a mammal from its log body weight, using squared error cost. The data are in dataframe mammals. For an initial model, apparent error and ordinary cross-validation estimates of aggregate prediction error:

cost <- function(y, mu=0) mean((y-mu)^2)
mammals.glm <- glm(log(brain)~log(body), data=mammals)
muhat <- fitted(mammals.glm)
app.err <- cost(mammals.glm$y, muhat)
mammals.diag <- glm.diag(mammals.glm)
cv.err <- mean((mammals.glm$y-muhat)^2/(1-mammals.diag$h)^2)

For 6-fold unadjusted and adjusted estimates of aggregate prediction error:

cv.err.6 <- cv.glm(mammals, mammals.glm, cost, K=6)

Experiment with other values of K. For bootstrap and 0.632 estimates, and a plot of error components:

mammals.pred.fun <- function(data, i, formula)
{ d <- data[i,]
  d.glm <- glm(formula, data=d)
  D.F.hatF <- cost(log(data$brain), predict(d.glm, data))
  D.hatF.hatF <- cost(log(d$brain), fitted(d.glm))
  c(log(data$brain)-predict(d.glm, data), D.F.hatF - D.hatF.hatF) }
mam.boot <- boot(mammals, mammals.pred.fun, R=200,
  formula=formula(mammals.glm))
n <- nrow(mammals)
err.boot <- app.err + mean(mam.boot$t[,n+1])
err.632 <- 0
mam.boot$f <- boot.array(mam.boot)
for (i in 1:n)
  err.632 <- err.632 + cost(mam.boot$t[mam.boot$f[,i]==0,i])/n
err.632 <- 0.368*app.err + 0.632*err.632
ord <- order(mammals.diag$res)
mam.pred <- mam.boot$t[,ord]
mam.fac <- factor(rep(1:n, rep(200,n)), labels=ord)
plot(mam.fac, mam.pred, ylab="Prediction errors",
  xlab="Case ordered by residual")

What are cases 34, 35, and 32? (Section 6.4.1)


6 The data of Examples 6.15 and 6.16 are in dataframe salinity. For the linear regression model with all three covariates, consider the effect of discharge dis and the influence of case 16 on estimating this. Resample the least squares, $L_1$ and least trimmed squares estimates, and then look at the jackknife-after-bootstrap plots:

salinity.rob.fun <- function(data, i)
{ data.i <- data[i,]
  ls.fit <- lm(sal~lag+trend+dis, data=data.i)
  l1.fit <- l1fit(data.i[,-1], data.i[,1])
  lts.fit <- ltsreg(data.i[,-1], data.i[,1])
  c(ls.fit$coef, l1.fit$coef, lts.fit$coef) }
salinity.boot <- boot(salinity, salinity.rob.fun, R=1000)
jack.after.boot(salinity.boot, index=4)
jack.after.boot(salinity.boot, index=8)
jack.after.boot(salinity.boot, index=12)

What conclusions do you draw from these plots about (a) the shapes of the distributions of the estimates, (b) comparisons between the estimation methods, and (c) the effects of case 16? One possible explanation for case 16 being an outlier with respect to the multiple linear regression model used previously is that a quadratic effect in discharge should be added to the model. We can test for this using the pivot method with least squares estimates and case resampling:

salinity.quad.fun <- function(data, i)
{ data.i <- data[i,]
  ls.fit <- lm(sal~lag+trend+poly(dis,2), data=data.i)
  ls.sum <- summary(ls.fit)
  ls.std <- sqrt(diag(ls.sum$cov))*ls.sum$sigma
  c(ls.fit$coef, ls.std) }
salinity.boot <- boot(salinity, salinity.quad.fun, R=99)
quad.z <- salinity.boot$t0[5]/salinity.boot$t0[10]
quad.z.star <- (salinity.boot$t[,5]-salinity.boot$t0[5])/
  salinity.boot$t[,10]
(1+sum(quad.z<quad.z.star))/(1+salinity.boot$R)

Out of curiosity, look at the normal Q-Q plots of raw and studentized coefficients:

qqnorm(salinity.boot$t[,5], ylab="discharge quadratic coefficient")
qqnorm(quad.z.star, ylab="discharge quadratic z statistic")

Is it reasonable to use least squares estimates here? See whether or not the same conclusion would be reached using other methods of estimation.
(Section 6.5; Ruppert and Carroll, 1980; Atkinson, 1985, p. 48)


7 Further Topics in Regression

7.1 Introduction

In Chapter 6 we showed how the basic bootstrap methods of earlier chapters extend to linear regression. The broad aim of this chapter is to extend the discussion further, to various forms of nonlinear regression models, especially generalized linear models and survival models, and to nonparametric regression, where the form of the mean response is not fully specified.

A particular feature of linear regression is the possibility of error-based resampling, when responses are expressible as means plus homoscedastic errors. This is particularly useful when our objective is prediction. For generalized linear models, especially for discrete data, responses cannot be described in terms of additive errors. Section 7.2 describes ways of generalizing error-based resampling for such models. The corresponding development for survival data is given in Section 7.3. Section 7.4 looks briefly at nonlinear regression with additive error, mainly to illustrate the useful contribution that resampling methods can make to the analysis of such models. There is often a need to estimate the potential accuracy of predictions based on regression models, and Section 6.4 contained a general discussion of resampling methods for this. In Section 7.5 we focus on one type of application, the estimation of misclassification rates when a binary response y corresponds to a classification.

Not all relationships between a response y and covariates x can be readily modelled in terms of a parametric mean function of known form. At least for exploratory purposes it is useful to have flexible nonparametric curve-fitting methods, and there is now a wide variety of these. In Section 7.6 we examine briefly how resampling can be used in conjunction with some of these nonparametric regression methods.



7.2 Generalized Linear Models

7.2.1 Introduction

The generalized linear model extends the linear regression model of Section 6.3 in two ways. First, the distribution of the response Y has the property that the variance is an explicit function of the mean μ,

  $\mathrm{var}(Y) = \phi V(\mu)$,

where V(·) is the known variance function and φ is the dispersion parameter, which may be unknown. This includes the important cases of binomial, Poisson, and gamma distributions in addition to the normal distribution. Secondly, the linear mean structure is generalized to

  $g(\mu) = \eta, \qquad \eta = x^T\beta$,

where g(·) is a specified monotone link function which "links" the mean to the linear predictor η. As before, x is a (p+1) × 1 vector of explanatory variables associated with Y. The possible combinations of different variance functions and link functions include such things as logistic and probit regression, and log-linear models for contingency tables, without making ad hoc transformations of responses.

The first extension was touched on briefly in Section 6.2.6 in connection with weighted least squares, which plays a key role in fitting generalized linear models. The second extension, to linear models for transformed means, represents a very special type of nonlinear model.

When independent responses y_j are obtained with explanatory variables x_j, the full model is usually taken to be

E(Y_j) = μ_j,   g(μ_j) = x_jᵀβ,   var(Y_j) = κc_jV(μ_j),    (7.1)

where κ may be unknown and the c_j are known weights. For example, for binomial data with probability π(x_j) and denominator m_j, we take c_j = 1/m_j; see Example 7.3. The constant κ equals one for binomial, Poisson and exponential data. Notice that (7.1) strictly only specifies first and second moments of the responses, and in that sense is a semiparametric model. So, for example, we can model overdispersed count data by using the Poisson variance function V(μ) = μ but allowing κ to be a free overdispersion parameter which is to be estimated.
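As an aside, such a model is easy to fit in R; the following is a minimal sketch with made-up data (not from the book), using the quasipoisson family, whose reported dispersion is exactly the free overdispersion parameter κ:

# A minimal sketch: Poisson variance function but free dispersion kappa,
# fitted by quasilikelihood; the data are invented for illustration.
y <- c(2, 5, 1, 9, 14, 20, 31, 45)
x <- 1:8
fit <- glm(y ~ x, family = quasipoisson)
summary(fit)$dispersion    # the estimated overdispersion parameter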

One important point about generalized linear models is the non-unique definitions of residuals, and consequent non-uniqueness of nonparametric resampling algorithms.

After illustrating these ideas with an example we briefly review the main aspects of generalized linear models. We then go on to discuss resampling methods.



Table 7.1 Survival times y (weeks) for two groups of acute leukaemia patients, together with x = log10 white blood cell count (Feigl and Zelen, 1965).

            Group 1                     Group 2
Case     x       y          Case     x       y
  1     3.36     65          18     3.64     56
  2     2.88    156          19     3.48     65
  3     3.63    100          20     3.60     17
  4     3.41    134          21     3.18      7
  5     3.78     16          22     3.95     16
  6     4.02    108          23     3.72     22
  7     4.00    121          24     4.00      3
  8     4.23      4          25     4.28      4
  9     3.73     39          26     4.43      2
 10     3.85    143          27     4.45      3
 11     3.97     56          28     4.49      8
 12     4.51     26          29     4.41      4
 13     4.54     22          30     4.32      3
 14     5.00      1          31     4.90     30
 15     5.00      1          32     5.00      4
 16     4.72      5          33     5.00     43
 17     5.00     65

Example 7.1 (Leukaemia data) Table 7.1 contains data on the survival times in weeks of two groups of acute leukaemia victims, as a function of their white blood cell counts.

A simple model is that within each group survival time Y is exponential with mean μ = exp(β₀ + β₁x), where x = log10(white blood cell count). Thus the link function is logarithmic. The intercept is different for each group, but the slope is assumed common, so the full model for the jth response in group i is

E(Y_ij) = μ_ij,   log(μ_ij) = β₀ᵢ + β₁x_ij,   var(Y_ij) = κV(μ_ij),   V(μ) = μ².

The fitted means μ̂ and the data are shown in the left panel of Figure 7.1. The mean survival times for group 2 are shorter than those for group 1 at the same white blood cell count.

Under this model the ratios Y/μ are exponentially distributed with unit mean, and hence the Q-Q plot of y_ij/μ̂_ij against exponential quantiles in the right panel of Figure 7.1 would ideally be a straight line. Systematic curvature might indicate that we should use a gamma density with index ν,

f(y | μ, ν) = {y^{ν−1}ν^ν/(μ^νΓ(ν))} exp(−νy/μ),   y > 0,   μ, ν > 0.

In this case var(Y) = μ²/ν, so the dispersion parameter is taken to be κ = 1/ν and c_j = 1. In fact the exponential model seems to fit adequately. ■
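The Feigl-Zelen data are distributed with R as leuk in the MASS package (columns wbc, ag, time); a minimal sketch of the fit, assuming that version of the data, is:

# Sketch of the fit in Example 7.1: the exponential model fitted as a
# Gamma GLM with log link. The coefficient estimates depend only on the
# link and variance function, not on the assumed value of kappa.
library(MASS)
fit <- glm(time ~ ag + log10(wbc), family = Gamma(link = "log"), data = leuk)
z <- leuk$time / fitted(fit)        # under the model, approximately Exp(1)
qqplot(qexp(ppoints(length(z))), sort(z),
       xlab = "Quantile of standard exponential", ylab = "y/mu")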




Figure 7.1 Summary plots for fits of an exponential model fitted to two groups of survival times for leukaemia patients. The left panel shows the times and fitted means as a function of white blood cell count (group 1, fitted line solid; group 2, fitted line dots). The right panel shows an exponential Q-Q plot of the y/μ̂.


7.2.2 Model fitting and residuals

Estimation

Suppose that independent data (x₁, y₁),...,(xₙ, yₙ) are available, with response mean and variance described by (7.1). If the response distributions are assumed to be given by the corresponding exponential family model, then the maximum likelihood estimates of the regression parameters β solve the (p + 1) × 1 system of estimating equations

Σⱼ₌₁ⁿ (y_j − μ_j)x_j / {c_jV(μ_j)ġ(μ_j)} = 0,    (7.2)

where ġ(μ) = dη/dμ is the derivative of the link function. Because the dispersion parameters are taken to have the form κc_j, the estimate β̂ does not depend on κ. Note that although the estimates are derived as maximum likelihood estimates, their values depend only upon the regression relationship as expressed by the assumed variance function and the link function and choice of covariates.

The usual method for solving (7.2) is iterative weighted least squares, in which at each iteration the adjusted responses z_j = η_j + (y_j − μ_j)ġ(μ_j) are regressed on the x_j with weights w_j given by

w_j⁻¹ = c_jV(μ_j)ġ²(μ_j);    (7.3)

all these quantities are evaluated at the current values of the estimates. The weighted least squares equation (6.27) applies at each iteration, with y replaced by the adjusted dependent variable z. The approximate variance matrix for β̂ is given by the analogue of (6.24), namely



var(β̂) = κ(XᵀWX)⁻¹,    (7.4)

with the diagonal weight matrix W = diag(w₁,...,wₙ) evaluated at the final fitted values μ̂_j.

The corresponding 'hat' matrix is

H = X(XᵀWX)⁻¹XᵀW,    (7.5)

as in (6.28). The relationship of H to fitted values is η̂ = Hz, where z is the vector of adjusted responses. Note that in general W, and hence H, depends upon the fitted values. (Some authors prefer to work with X̃(X̃ᵀX̃)⁻¹X̃ᵀ, where X̃ = W^{1/2}X.) The residual vector e = y − μ̂ has approximate variance matrix (I − H)var(Y), this being exact only for linear regression with known W.

When the dispersion parameter κ is unknown, it is estimated by the analogue of the residual mean square,

κ̂ = (n − p − 1)⁻¹ Σⱼ (y_j − μ̂_j)² / {c_jV(μ̂_j)}.    (7.6)

For a linear model with V(μ) = 1 and dispersion parameter κ = σ², this gives κ̂ = s², the residual mean square.
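In R the estimate (7.6) can be computed directly from a fitted GLM, as in this minimal sketch (data made up for illustration):

# kappa.hat of (7.6): Pearson statistic over residual degrees of freedom.
y <- c(2, 5, 1, 9, 14, 20, 31, 45); x <- 1:8
fit <- glm(y ~ x, family = poisson)
kappa.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
kappa.hat   # the same value family = quasipoisson would report as dispersion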

Let ℓ_j(μ_j) denote the contribution that the jth observation makes to the overall log likelihood ℓ(μ), parametrized in terms of the means μ_j. Then the fit of a generalized linear model is measured by the deviance

D = 2κ{ℓ(y) − ℓ(μ̂)} = 2κ Σⱼ {ℓ_j(y_j) − ℓ_j(μ̂_j)},    (7.7)

which is the scaled difference between the maximized log likelihoods for the saturated model — which has a parameter for each observation — and the fitted model. The deviance corresponds to the residual sum of squares in the analysis of a linear regression model. For example, there are large reductions in the deviance when important explanatory variables are added to a model, and competing models may be compared via their deviances. When the fitted model is correct, the scaled deviance κ⁻¹D will sometimes have an approximate chi-squared distribution on n − p − 1 degrees of freedom, analogous to the rescaled residual sum of squares in a normal linear model.

Significance tests

Individual coefficients β_r can be tested using studentized estimates, with standard errors estimated using (7.4), with κ replaced by the estimate κ̂ if necessary. The null distributions of these studentized estimates will be approximately standard normal, but the accuracy of this approximation can be open to question. Allowance for estimation of κ can be made by using the t distribution with




n − p − 1 degrees of freedom, as is justifiable for normal-theory linear regression, but in general the accuracy is questionable.

The analogue of analysis of variance is the analysis of deviance, wherein differences of deviances are used to measure effects. To test whether or not a particular subset of covariates has no effect on mean response, we use as test statistic the scaled difference of deviances, D for the full model with p covariates and D₀ for the reduced model with p₀ covariates. If κ is known, then the test statistic is Q = (D₀ − D)/κ. Approximate properties of log likelihood ratios imply that the null distribution of Q is approximately chi-squared, with degrees of freedom equal to p − p₀, the number of covariate terms being tested. If κ is estimated for the full model by κ̂, as in (7.6), then the test statistic is

Q = (D₀ − D)/κ̂.    (7.8)

In the special case of linear regression, (p − p₀)⁻¹Q is the F statistic, and this motivates the use of the F_{p−p₀,n−p−1} distribution as approximate null distribution for (p − p₀)⁻¹Q here, although this has little theoretical justification.
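In R the statistic (7.8) and its F approximation are produced by anova for quasilikelihood fits; a minimal sketch with made-up data:

# Analysis of deviance with estimated dispersion: anova computes
# (D0 - D)/{(p - p0) kappa.hat}, with kappa.hat from the full model,
# and refers it to the F distribution.
y <- c(2, 5, 1, 9, 14, 20, 31, 45)
x1 <- 1:8; x2 <- c(0, 1, 0, 1, 1, 0, 1, 0)
fit0 <- glm(y ~ x1, family = quasipoisson)
fit1 <- glm(y ~ x1 + x2, family = quasipoisson)
anova(fit0, fit1, test = "F")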

Residuals

Residuals and other regression diagnostics for linear models may be extended to generalized linear models. The general form of residuals will be a suitably standardized version of d(y, μ̂), where d(Y, μ) matches some notion of random error.

The simplest way to define residuals is to mimic the earlier definitions for linear models, and to take the set of standardized differences, the Pearson residuals, (y_j − μ̂_j)/{c_jκ̂V(μ̂_j)}^{1/2}. Leverage adjustment of these to compensate for estimation of β involves h_j, the jth diagonal element of the hat matrix H in (7.5), and yields standardized Pearson residuals

r_Pj = (y_j − μ̂_j) / {c_jκ̂V(μ̂_j)(1 − h_j)}^{1/2},   j = 1,...,n.    (7.9)

The standardized Pearson residuals are essentially scaled versions of the modified residuals defined in (6.29), except that the denominators of (7.9) may depend on the parameter estimates. In large samples one would expect the r_Pj to have mean and variance approximately zero and one, as they do for linear regression models.

In general the Pearson residuals inherit the skewness of the responses themselves, which can be considerable, and it may be better to standardize a transformed response. One way to do this is to define standardized residuals on the linear predictor scale,

r_Lj = {g(y_j) − η̂_j} / {c_jκ̂ġ²(μ̂_j)V(μ̂_j)(1 − h_j)}^{1/2}.    (7.10)

For discrete data this definition must be altered if g(y_j) is infinite, as for



example when g(y) = log y and y = 0. For a non-identity link function one should not expect the mean and variance of the r_Lj to be approximately zero and one, unless κ is unusually small; see Example 7.2.

An alternative approach to defining residuals is based on the fact that in a linear model the residual sum of squares equals the sum of squared residuals. This suggests that residuals for generalized linear models can be constructed from the contributions that individual observations make to the deviance. Suppose first that κ is known. Then the scaled deviance can be written as

κ⁻¹D = Σⱼ d_j²,

where d_j = d(y_j, μ̂_j) is the signed square root of the scaled deviance contribution due to the jth case, the sign being that of y_j − μ̂_j. The deviance residual is d_j. Definition (7.7) implies that

d_j = sign(y_j − μ̂_j)[2{ℓ_j(y_j) − ℓ_j(μ̂_j)}]^{1/2}.

When ℓ is the normal log likelihood and κ = σ² is unknown, D is scaled by κ̂ = s² rather than κ before defining d_j. Similarly for the gamma log likelihood; see Example 7.2. In practice standardized deviance residuals

r_Dj = d_j / (1 − h_j)^{1/2},   j = 1,...,n,    (7.11)

are more commonly used than the unadjusted d_j.

For the linear regression model of Section 6.3, r_Dj is proportional to the modified residual (6.9). For other models the r_Dj can be seriously biased, but once bias-corrected they are typically closer to standard normal than are the r_Pj or r_Lj.
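The standardized residuals (7.9) and (7.11) are easily computed in R (the glm.diag function of the boot package returns closely related quantities); a minimal sketch with made-up data:

# Standardized Pearson and deviance residuals for a GLM fit. The
# residuals() values use unit dispersion, so dividing by sqrt(kappa.hat)
# supplies the kappa scaling of (7.9) and of the scaled d_j in (7.11).
y <- c(2, 5, 1, 9, 14, 20, 31, 45); x <- 1:8
fit <- glm(y ~ x, family = poisson)
h <- hatvalues(fit)                 # diagonal elements h_j of H in (7.5)
kappa.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
rP <- residuals(fit, type = "pearson") / sqrt(kappa.hat * (1 - h))   # (7.9)
rD <- residuals(fit, type = "deviance") / sqrt(kappa.hat * (1 - h))  # (7.11)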

One general point to note about all of these residuals is that they are scaled, implicitly or explicitly, unlike the modified residuals of Chapter 6.

Quasilikelihood estimation

As we have noted before, only the link and variance functions must be specified in order to find estimates β̂ and approximate standard errors. So although (7.2) and (7.6) arise from a parametric model, they are more generally applicable — just as least squares results are applicable beyond the normal-theory linear model. When no response distribution is assumed, the estimates β̂ are referred to as quasilikelihood estimates, and there is an associated theory for such estimates, although this is not of concern here. The most common application is to data with a response in the form of counts or proportions, which are often found to be overdispersed relative to the Poisson or binomial distributions. One approach to modelling such data is to use the variance function appropriate to binomial or Poisson data, but to allow the dispersion parameter κ to be a free parameter, estimated by (7.6). This estimate is then used in calculating standard errors for β̂ and residuals, as indicated above.





7.2.3 Sampling plans

Parametric simulation for a generalized linear model involves simulating new sets of data from the fitted parametric model. It has the usual disadvantage of the parametric bootstrap, that datasets generated from a poorly fitting model may not have the statistical properties of the original data. This applies particularly when count data are overdispersed relative to a Poisson or binomial model, unless the overdispersion has been modelled successfully.

Nonparametric simulation requires generating artificial data without assuming that the original data have some particular parametric distribution. A completely nonparametric approach is to resample cases, which applies exactly as described in Section 6.2.4. However, it is important to be clear what a case is in any particular application, because count and proportion data are often aggregated from larger datasets of independent variables.

Provided that the model (7.1) is correct, as would be checked by appropriate diagnostic methods, it makes sense to use the fitted model and generalize the semiparametric approach of resampling errors, as described in Section 6.2.3. We focus now on ways to do this.

Resampling errors

The simplest approach mimics the linear model sampling scheme but allows for the different response variances, just as in Section 6.2.6. So we define simulated responses by

y_j* = μ̂_j + {c_jκ̂V(μ̂_j)}^{1/2}ε_j*,   j = 1,...,n,    (7.12)

where ε₁*,...,εₙ* is a random sample from the mean-adjusted, standardized Pearson residuals r_Pj − r̄_P, with r_Pj defined at (7.9). Note that for count data we are not assuming κ = 1. This resampling scheme duplicates the method of Section 6.2.6 for linear models, where the link function is the identity.
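A minimal R sketch of scheme (7.12) for an overdispersed Poisson fit (c_j = 1, V(μ) = μ; the data are invented, and the rounding fix for counts discussed below is included):

y <- c(2, 5, 1, 9, 14, 20, 31, 45); x <- 1:8
fit <- glm(y ~ x, family = quasipoisson)
mu <- fitted(fit); h <- hatvalues(fit)
kappa.hat <- summary(fit)$dispersion
rP <- residuals(fit, type = "pearson") / sqrt(kappa.hat * (1 - h))
rP <- rP - mean(rP)                          # mean adjustment
beta.star <- matrix(NA, 999, 2)
for (r in 1:999) {
  eps <- sample(rP, length(y), replace = TRUE)
  ystar <- mu + sqrt(kappa.hat * mu) * eps   # equation (7.12)
  ystar <- pmax(round(ystar), 0)             # rounding fix for count data
  beta.star[r, ] <- coef(glm(ystar ~ x, family = quasipoisson))
}
apply(beta.star, 2, sd)                      # bootstrap standard errors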

Because in general there is no explicit function connecting response y_j to random error ε_j, as there is for linear regression models, the resampling scheme (7.12) is not the only approach, and sometimes it is not suitable. One alternative is to use the same idea on the linear predictor scale. That is, we generate bootstrap data by setting

y_j* = g⁻¹{x_jᵀβ̂ + ġ(μ̂_j)(c_jκ̂V(μ̂_j))^{1/2}ε_j*},   j = 1,...,n,    (7.13)

where g⁻¹(·) is the inverse link function and ε₁*,...,εₙ* is a bootstrap sample from the residuals r_L1,...,r_Ln defined at (7.10). Here the residuals should not be mean-adjusted unless g(·) is the identity link, in which case r_Lj = r_Pj and the two schemes (7.12) and (7.13) are the same. (In these first two resampling schemes the scale factor κ̂^{−1/2} can be omitted, provided it is omitted from both the residual definition and from the definition of y_j*.)

A third approach is to use the deviance residuals as surrogate errors. If the deviance residual d_j is written as d(y_j, μ̂_j), then imagine that corresponding random errors ε_j are defined by ε_j = d(y_j, μ_j). The distribution of these ε_j



is estimated by the EDF of the standardized deviance residuals (7.11). This suggests that we construct a bootstrap sample as follows. Randomly sample ε₁*,...,εₙ* from r_D1,...,r_Dn and let y₁*,...,yₙ* be the solutions to

ε_j* = d(y_j*, μ̂_j),   j = 1,...,n.    (7.14)

This also gives the method of Section 6.2.3 for linear models, except for the mean adjustment of residuals.

None of these three methods is perfect. One obvious drawback is that they can all give negative or non-integer values of y* when the original data are non-negative integer counts. A simple fix for discrete responses is to round the value of y_j* from (7.12), (7.13), or (7.14) to the nearest appropriate value. For count data this is a non-negative integer, and if the response is a proportion with denominator m, it is a number in the set 0, 1/m, 2/m,..., 1. However, rounding can appreciably increase the proportion of extreme values of y* for a case whose fitted value is near the end of its range.

A similar difficulty can occur when responses are positive with var(Y) = κμ², as in Example 7.1. The Pearson residuals are κ^{−1/2}(y_j − μ̂_j)/μ̂_j, all necessarily greater than −κ^{−1/2}. But the standardized versions r_Pj are not so constrained, so that the result y_j* = μ̂_j(1 + κ̂^{1/2}ε_j*) from applying (7.12) can be negative. The obvious fix is to truncate y_j* at zero, but this may distort the distribution of y*, and so is not generally recommended.

Example 7.2 (Leukaemia data) For the data introduced in Example 7.1 the parametric model is gamma with log likelihood contributions

ℓ_ij(μ_ij) = −κ⁻¹{log(μ_ij) + y_ij/μ_ij},

and the regression is additive on the logarithmic scale, log(μ_ij) = β₀ᵢ + β₁x_ij. The deviance for the fitted model is D = 40.32 with 30 degrees of freedom, and equation (7.6) gives κ̂ = 1.09. The deviance residuals are calculated with κ set equal to κ̂,

d_ij = sign(z_ij − 1){2κ̂⁻¹(z_ij − 1 − log z_ij)}^{1/2},

where z_ij = y_ij/μ̂_ij. The corresponding standardized values r_D,ij have sample mean and variance respectively −0.37 and 1.15. The Pearson residuals are κ̂^{−1/2}(z_ij − 1).

The z_ij would be approximately a sample from the standard exponential distribution if in fact κ = 1, and the right-hand panel of Figure 7.1 suggests that this is a reasonable assumption.

Our basic parametric model for these data sets κ = 1 and puts Y = με, where ε has an exponential distribution with unit mean. Hence the parametric bootstrap involves simulating exponential data from the fitted model, that is setting y* = μ̂ε*, where ε* is standard exponential. A slightly more cautious approach would be to generate gamma data with mean μ̂ and index κ̂⁻¹, but we shall not do this here.



Table 7.2 Lower and upper limits of 95% studentized bootstrap confidence intervals for β₀₁ and β₁ for leukaemia data, based on 999 replicates of different simulation schemes.

                             β₀₁                  β₁
                       Lower    Upper       Lower    Upper
Exponential             5.16    11.12       -1.42    -0.04
Linear predictor, r_L   3.61    10.58       -1.53     0.17
Deviance, r_D           5.00    11.10       -1.46     0.02
Cases                   0.31     8.78       -1.37     0.81


For nonparametric simulation, we consider all three schemes described earlier. First, with variance function V(μ) = μ², the Pearson residuals are κ̂^{−1/2}(y − μ̂)/μ̂. Resampling Pearson residuals via (7.12) would be equivalent to setting y* = μ̂ε*, where ε* is sampled at random from the zs (Problem 7.2). However, (7.12) cannot be used with the standardized Pearson residuals r_P, because negative values of y* will occur, possibly as low as −4. Truncation at zero is not a sufficient remedy for this.

For the second resampling scheme (7.13), the logarithmic link gives y* = μ̂ exp(κ̂^{1/2}ε*), where ε* is randomly sampled from the r_Ls, which here are given by r_L = κ̂^{−1/2}(1 − h)^{−1/2} log(z). The sample mean and variance of the r_L are −0.61 and 1.63, in very close agreement with those for the logarithm of a standard exponential variate. It is important that no mean correction be made to the r_L.

To implement the bootstrap for deviance residuals, the scheme (7.14) can be simplified as follows. We solve the equations d(z̃_j, 1) = r_Dj for j = 1,...,n to obtain z̃₁,...,z̃ₙ, and then set y_j* = μ̂_jε_j* for j = 1,...,n, where ε₁*,...,εₙ* is a bootstrap sample from the z̃s (Problem 7.2).
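A minimal R sketch of this simplification, assuming r_D and κ̂ are available from the fit (the numerical values below are made up):

# Solve d(z, 1) = rD, i.e. z - 1 - log(z) = kappa.hat * rD^2 / 2, taking
# the root above 1 for positive rD and the root below 1 for negative rD.
solve.z <- function(rD, kappa) {
  target <- kappa * rD^2 / 2
  g <- function(z) z - 1 - log(z) - target
  if (rD >= 0) uniroot(g, c(1, 1e6))$root else uniroot(g, c(1e-12, 1))$root
}
rD <- c(-1.2, -0.3, 0.4, 1.5); kappa.hat <- 1.09
ztilde <- vapply(rD, solve.z, numeric(1), kappa = kappa.hat)
# then y*_j = mu.hat_j * eps*_j, with the eps*_j resampled from the ztilde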

Table 7.2 shows 95% studentized bootstrap confidence intervals for β₀₁ (the intercept for group 1) and β₁ using these schemes with R = 999. The variance estimates used are from (7.4) rather than the nonparametric delta method. The intervals for the three model-based schemes are very similar, while those for resampling cases are rather different, particularly for β₁, for which the bootstrap distribution of the studentized statistic is very non-normal.

Figure 7.2 compares simulated deviances with quantiles of the chi-squared distribution. Naive asymptotics would suggest that the scaled deviance κ̂⁻¹D has approximately a chi-squared distribution on 30 degrees of freedom, but these asymptotics — which apply as κ→0 — are clearly not useful here, even when data are in fact generated from the exponential distribution. The fitted deviance of 40.32 is not extreme, and the variation of the simulated estimates κ̂* is large enough that the observed value κ̂ = 1.09 could easily occur by chance if the data were indeed exponential. ■





Comparison of resampling schemes

To compare the performances of the resampling schemes described above in setting confidence intervals, we conducted a series of Monte Carlo experiments, each based on 1000 sets of data of size n = 15, with linear predictor η = β₀ + β₁x. In the first experiment, the values of x were generated from a distribution uniform on the interval (0, 1), we took β₀ = β₁ = 4, and responses were generated from the exponential distribution with mean exp(η). Each sample was then bootstrapped 199 times using case resampling and by model-based resampling from the fitted model, with variance function V(μ) = μ², by applying (7.13) and (7.14). For each of these resampling schemes, various confidence intervals were obtained for parameters β₀, β₁, ψ₁ = β₀β₁ and ψ₂ = β₀/β₁. The confidence intervals used were: the standard interval based on the large-sample normal distribution of the estimate, using the usual rather than a robust standard error; the interval based on a normal approximation with bias and variance estimated from the resamples; the percentile and BCa intervals; and the basic bootstrap and studentized bootstrap intervals, the latter using nonparametric delta method variance estimates. The first part of Table 7.3 shows the empirical coverages of nominal 90% confidence intervals for these combinations of resampling scheme, method of interval construction, and parameter.

The second experiment used the same design matrix, linear predictor, and model-fitting and resampling schemes as the first, but the data were generated from a lognormal model with mean exp(η) and unit variance on the log scale.

Figure 7.2 Leukaemia data. Chi-squared Q-Q plots of simulated deviances for parametric sampling from the fitted exponential model (left) and case resampling (right).



Table 7.3 Empirical coverages (%) for four parameters based on applying various resampling schemes with R = 199 to 1000 samples of size 15 generated from various models. Target coverage is 90%. The first two sets of results are for an exponential model fitted to exponential and lognormal data, and the second two are for a Poisson model fitted to Poisson and negative binomial data. See text for details.

                 Cases                 r_L or r_P              r_D
            β₀  β₁  ψ₁  ψ₂         β₀  β₁  ψ₁  ψ₂         β₀  β₁  ψ₁  ψ₂

Standard    85  86  89  85         85  86  89  86         85  86  90  86
Normal      88  89  92  90         88  89  90  89         87  89  90  89
Percentile  85  87  83  89         86  89  86  89         86  88  86  89
BCa         84  86  82  86         86  88  83  88         86  88  83  88
Basic       86  88  87  84         86  89  86  83         85  89  87  83
Student     89  89  86  81         92  92  89  84         92  92  89  84

Standard    79  79  82  81         79  78  82  82         79  78  82  82
Normal      81  81  84  85         81  80  84  84         82  80  84  84
Percentile  80  84  73  85         80  82  77  83         80  81  76  82
BCa         78  83  72  81         80  80  74  79         79  81  74  80
Basic       78  78  82  78         81  80  83  80         80  81  84  80
Student     84  85  82  74         90  88  84  79         90  88  84  79

Standard    90  90  91  90         89  90  92  90         89  91  92  91
Normal      88  88  88  88         87  86  88  88         87  93  97  93
Percentile  87  87  85  86         89  88  88  88         90  94  97  91
BCa         86  86  82  86         88  87  85  87         88  94  96  91
Basic       87  87  85  87         87  87  88  88         86  92  97  92
Student     95  90  80  92         90  89  89  89         90  93  92  91

Standard    69  64  59  70         69  63  59  69         67  64  60  71
Normal      87  84  86  90         88  84  84  89         87  89  92  94
Percentile  85  86  84  86         90  86  82  88         90  91  93  91
BCa         85  85  80  85         88  83  77  86         87  89  88  89
Basic       86  84  83  85         88  84  83  87         87  89  91  91
Student     93  87  82  87         89  89  85  85         89  93  90  85

The third experiment used the same design matrix as the first two, but linear predictor η = β₀ + β₁x, with β₀ = β₁ = 2 and Poisson responses with mean μ = exp(η). The fourth experiment used the same means as the third, but had negative binomial responses with variance function μ + μ²/10. The bootstrap schemes for these two experiments were case resampling and model-based resampling using (7.12) and (7.14).

Table 7.3 shows that while all the methods tend to undercover, the standard method can be disastrously bad when the random part of the fitted model is incorrect, as in the second and fourth experiments. The studentized method generally does better than the basic method, but the BCa method does not improve on the percentile intervals. Thus here a more sophisticated method does not necessarily lead to better coverage, unlike in Section 5.7, and in particular there seems to be no reason to use the BCa method. Use of the studentized interval on another scale might improve its performance for the ratio ψ₂, for which the simpler methods seem best. As far as the resampling schemes are concerned, there seems to be little to choose between the model-based schemes, which improve slightly on bootstrapping cases, even when the fitted variance function is incorrect.




We now consider an important caveat to these general comments.

Inhomogeneous residuals

For some types of data the standardized Pearson residuals may be very inhomogeneous. If y is Poisson with mean μ, for example, the distribution of (y − μ)/μ^{1/2} is strongly positively skewed when μ < 1, but it becomes increasingly symmetric as μ increases. Thus when a set of data contains both large and small counts, it is unwise to treat the r_P as exchangeable. One possibility for such data is to apply (7.12) but with fitted values stratified by the estimated skewness of their residuals.

Example 7.3 (Sugar cane) Carvão da cana-de-açúcar — coal of sugar cane — is a disease of sugar cane that is common in some areas of Brazil, and its effects on production of the crop have led to a search for resistant varieties of cane. We use data kindly provided by Dr C. G. B. Demétrio of Escola Superior de Agricultura, Universidade de São Paulo, from a randomized block experiment in which the resistance to the disease of 45 varieties of cane was compared in four blocks of 45 plots. Fifty stems from a variety were put in a solution containing the disease agent, and then planted in a plot. After a fixed period, the total number of shoots appearing, m, and the number of diseased shoots, r, were recorded for each plot. Thus the data form a 4 × 45 layout of pairs (m, r). The purpose of analysis was to identify the most resistant varieties, for further investigation.

A simple model is tha t the num ber o f diseased shoots ry for the ith block and / th variety is a binom ial random variable with denom inator my and probability nij. For the generalized linear model form ulation, the responses are taken to be y tj = rij/niij so tha t the m ean response fiij is equal to the probability 7iy that a shoot is diseased. Because the variance o f Y is 7t( l — n) /m, the variance function is V(n) = fi(\ — fi) and the dispersion param eters are (fi = 1/m , so that in the two-way version o f (7.1), cy = 1 /my and k = 1. The probability o f disease for the ith block and / th variety is related to the linear predictortjij = a, + Pj through the logit link function t] = log {n / ( l — 7i)}. So the fullmodel for all da ta is

E(Yij) - fiij, fiij = exp(a, + Pj)/ {1 + exp(a, + Pj)} ,

var(Ytj) = m-jl V(fiij), V(fitj) = /i,7(l - fi,j).

Interest focuses on the varieties with small values of β_j, which are likely to be the most resistant to the disease.

For an adequate fit, the deviance would roughly be distributed according to a χ²₁₃₂ distribution; in fact it is 1142.8. This indicates severe overdispersion relative to the model.



Figure 7.3 Model fit for the cane data. The left panel shows the estimated variety effects α̂₁ + β̂_j for block 1: varieties 1 and 3 are least resistant, and 31 is most resistant. The lines show the levels on the logit scale corresponding to π = 0.5, 0.2, 0.05 and 0.01. The right panel shows standardized Pearson residuals r_P plotted against α̂_i + β̂_j; the lines are at 0, ±3.


The left panel of Figure 7.3 shows estimated variety effects for block 1. Varieties 1 and 3 are least resistant to the disease, while variety 31 is most resistant. The right panel shows the residuals plotted against linear predictors. The skewness of the r_P drops as η̂ increases.

Parametric simulation involves generating binomial observations from the fitted model. This greatly overstates the precision of conclusions, because this model clearly does not reflect the variability of the data. We could instead use the beta-binomial distribution. Suppose that, conditional on π, a response is binomial with denominator m and probability π, but instead of being fixed, π is taken to have a beta distribution. The resulting response has unconditional mean and variance

mμ,   mμ(1 − μ){1 + (m − 1)φ},    (7.15)

where μ = E(π) and φ > 0 controls the degree of overdispersion. Parametric simulation from this model is discussed in Problem 7.5.

Two variance functions for overdispersed binomial data are V₁(μ) = φμ(1 − μ), with φ > 1, and V₂(μ) = μ(1 − μ){1 + (m − 1)φ}, with φ > 0. The first of these gives common overdispersion for all the observations, while the second allows proportionately greater spread when m is larger. We use the first, for which φ̂ = 8.3, and perform nonparametric simulation using (7.12). The simulated responses are rounded to the nearest integer in 0, 1,..., m.
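Parametric simulation from (7.15) needs only a beta and a binomial draw; a minimal R sketch (with made-up μ and φ):

# pi ~ Beta with mean mu and a + b + 1 = 1/phi, then y | pi ~ Binomial(m, pi);
# this gives the mean and variance (7.15).
rbetabin <- function(m, mu, phi) {
  a <- mu * (1/phi - 1)
  b <- (1 - mu) * (1/phi - 1)
  rbinom(length(m), size = m, prob = rbeta(length(m), a, b))
}
rbetabin(rep(50, 10), mu = 0.2, phi = 0.1)   # e.g. 10 plots of 50 shoots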

The left panel of Figure 7.4 shows box plots of the ratio of deviance to degrees of freedom for 200 simulations from the binomial model, the beta-binomial model, for nonparametric simulation by (7.12), and for (7.12) but with residuals stratified into groups for the fifteen varieties with the smallest values of β̂_j, the middle fifteen values of β̂_j, and the fifteen largest values of




β̂_j. The dotted line shows the observed ratio. The binomial results are clearly quite inappropriate, those for the beta-binomial and unstratified simulation are better, and those for the stratified simulation are best.

To explain this, we return to the right panel of Figure 7.3. This shows that the residuals are not homogeneous: residuals for observations with small values of η̂ are more positively skewed than those for larger values. This reflects the varying skewness of binomial data, which must be taken into account in the resampling scheme.

The right panel of Figure 7.4 shows the estimated variety effects for the 200 simulations from the stratified simulation. Varieties 1 and 3 are much less resistant than the others, but variety 31 is not much more resistant than 11, 18, and 23; other varieties are close behind. As might be expected, results for the binomial simulation are much less variable. The unstratified resampling scheme gives large negative estimated variety effects, due to inappropriately large negative residuals being used. To explain this, consider the right panel of Figure 7.3. In effect the unstratified scheme allows residuals from the right half of the panel to be sampled and placed at its left-hand end, leading to negative simulated responses that are rounded up to zero: the varieties for which this happens seem spuriously resistant.

Finer stratification of the residuals seems unnecessary for this application. ■

Figure 7.4 Resampling results for cane data. The left panel shows (left to right) simulated deviance/degrees of freedom ratios for fitted binomial and beta-binomial models, a nonparametric bootstrap, and a nonparametric bootstrap with residuals stratified by varieties; the dotted line is at the data ratio 8.66 = 1142.8/132. The right panel shows the variety effects in 200 replicates of the stratified nonparametric resampling scheme.

7.2.4 Prediction

In Section 6.3.3 we showed how to use resampling methods to obtain prediction intervals based on a linear regression fit. The same idea can be applied here.




Beyond having a suitable resampling algorithm to produce the appropriate variation in parameter estimates, we must also produce suitable response variation. In the linear model this is provided by the EDF of standardized residuals, which estimates the CDF of homoscedastic errors. Now we need to be able to produce the correct heteroscedasticity.

Suppose that we want to predict the response Y₊ at x₊, with a prediction interval. One possible point prediction is the regression estimate

ŷ₊ = g⁻¹(x₊ᵀβ̂),

although it would often be wise to make a bias correction. For the prediction interval, let us assume for the moment that some monotone function δ(Y, μ) is homoscedastic, with pth quantile a_p, and that the mean value μ₊ of Y₊ is known. Then the 1 − 2α prediction interval should be the values y₊,α, y₊,1−α where y₊,p satisfies δ(y₊,p, μ₊) = a_p. If μ is estimated by μ̂ independently of Y₊ and if δ(Y₊, μ̂) has known quantiles, then the same method applies. So the appropriate bootstrap method is to estimate quantiles of δ(Y₊, μ̂₊), and then set δ(y₊, μ̂₊) equal to the estimated α and 1 − α quantiles. The function δ(Y, μ) will correspond to one of the definitions of residuals, and the bootstrap algorithm will use resampling from the corresponding standardized residuals, whose homoscedasticity is critical. The full resampling algorithm, which generalizes Algorithm 6.4, is as follows.

Algorithm 7.1 (Prediction in generalized linear models)

For r = 1,...,R:

1 create bootstrap sample responses y_j* at x_j by solving

d(y_j*, μ̂_j) = ε_j*,   j = 1,...,n,

where the ε_j* are randomly sampled from residuals r₁,...,rₙ;

2 fit estimates β̂* and κ̂*, and compute the fitted value μ̂*₊ᵣ corresponding to the new observation with x = x₊; then

3 for m = 1,...,M,

(a) sample δ_m* from r₁,...,rₙ;
(b) set y*₊ᵣₘ equal to the solution of the equation δ(y, μ̂₊) = δ_m*;
(c) compute simulated prediction 'errors' d*₊ᵣₘ = δ(y*₊ᵣₘ, μ̂*₊ᵣ).

Finally, order the RM values d*₊ᵣₘ to give d*₊,(1) ≤ ··· ≤ d*₊,(RM). Then calculate the prediction limits as the solutions to

δ(y₊, μ̂₊) = d*₊,((RM+1)α),   δ(y₊, μ̂₊) = d*₊,((RM+1)(1−α)).



In principle any of the resampling methods in Section 7.2.3 could be used. In practice the homoscedasticity is important, and should be checked.
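To fix ideas, here is a minimal R sketch of Algorithm 7.1 for a count response with the Pearson choice δ(y, μ) = (y − μ)/μ^{1/2}; the data and x₊ are made up, and for brevity the residuals are neither standardized nor mean-adjusted:

y <- c(2, 5, 1, 9, 14, 20, 31, 45); x <- 1:8; xplus <- 9
fit <- glm(y ~ x, family = poisson)
mu <- fitted(fit)
r <- (y - mu) / sqrt(mu)                             # residuals r_1,...,r_n
mu.plus <- predict(fit, data.frame(x = xplus), type = "response")
R <- 199; M <- 10; d.star <- numeric(0)
for (i in 1:R) {
  ystar <- pmax(round(mu + sqrt(mu) * sample(r, length(y), TRUE)), 0) # step 1
  mustar <- predict(glm(ystar ~ x, family = poisson),                 # step 2
                    data.frame(x = xplus), type = "response")
  delta <- sample(r, M, TRUE)                                         # step 3(a)
  yplus <- pmax(round(mu.plus + sqrt(mu.plus) * delta), 0)            # step 3(b)
  d.star <- c(d.star, (yplus - mustar) / sqrt(mustar))                # step 3(c)
}
mu.plus + sqrt(mu.plus) * quantile(d.star, c(0.025, 0.975))  # 95% limits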

Example 7.4 (AIDS diagnoses) Table 7.4 contains the numbers of AIDS reports in England and Wales to the end of 1992. They are cross-classified by diagnosis period and length of reporting delay, in three-month intervals. A blank in the table corresponds to an unknown entry, and ≥ indicates where an entry is a lower bound for the actual value. We shall treat these incomplete data as unknown in our analysis below. The problem was to predict the state of the epidemic at the time from the given data. This depends heavily on the values missing towards the foot of the table.

The data support the assumption that the reporting delay does not depend on the diagnosis period. In this case a simple model is that the number of reports in row j and column k of the table, y_jk, has a Poisson distribution with mean μ_jk = exp(α_j + β_k). If all the cells of the table are regarded as independent, the total diagnoses in period j have a Poisson distribution with mean Σ_k μ_jk = exp(α_j) Σ_k exp(β_k). Hence the eventual total for an incomplete row can be predicted by adding the observed row total and the fitted values for the unobserved part of the row. How accurate is this prediction?

To assess this, we first simulate a complete table of bootstrap data, y_jk*, using the fitted values μ̂_jk = exp(α̂_j + β̂_k) from the original fit. We shall discuss below how to do this; for now simply note that this amounts to steps 1 and 3(b) of Algorithm 7.1. We then fit the two-way layout model to the simulated data, excluding the cells where the original table was incomplete, thereby obtaining parameter estimates α̂_j* and β̂_k*. We then calculate

y*₊ⱼ = Σ_{k unobs} y*_jk,   μ̂*₊ⱼ = exp(α̂_j*) Σ_{k unobs} exp(β̂_k*),   j = 1,...,38,

where the summation is over the cells of row j for which y_jk was unobserved; this is step 2. Note that y*₊ⱼ is equivalent to the results of steps 3(a) and 3(b) with M = 1.

We take δ(y, μ) = (y − μ)/μ^{1/2}, corresponding to Pearson residuals for the Poisson distribution. This means that step 3(c) involves setting

d*₊ⱼ = (y*₊ⱼ − μ̂*₊ⱼ) / μ̂*₊ⱼ^{1/2}.

We repeat this R times, to obtain values d*₊ⱼ,(1) ≤ ··· ≤ d*₊ⱼ,(R) for each j. The final step is to obtain the bootstrap upper and lower limits y₊ⱼ,α, y₊ⱼ,1−α for y₊ⱼ, by solving the equations

(y₊ⱼ,α − μ̂₊ⱼ)/μ̂₊ⱼ^{1/2} = d*₊ⱼ,((R+1)α),   (y₊ⱼ,1−α − μ̂₊ⱼ)/μ̂₊ⱼ^{1/2} = d*₊ⱼ,((R+1)(1−α)).



Table 7.4 Numbers of AIDS reports in England and Wales to the end of 1992 (De Angelis and Gilks, 1994). A ≥ sign in the body of the table indicates a count incomplete at the end of 1992, and † indicates a reporting delay of less than one month.

Diagnosis period        Reporting-delay interval (quarters)                                  Total reports
Year   Quarter    0†   1    2   3   4   5   6   7   8   9  10  11  12  13  ≥14    to end 1992

1983      3        2    6   0   1   1   0   0   1   0   0   0   0   0   0    1         12
          4        2    7   1   1   1   0   0   0   0   0   0   0   0   0    0         12
1984      1        4    4   0   1   0   2   0   0   0   0   2   1   0   0    0         14
          2        0   10   0   1   1   0   0   0   1   1   1   0   0   0    0         15
          3        6   17   3   1   1   0   0   0   0   0   0   1   0   0    1         30
          4        5   22   1   5   2   1   0   2   1   0   0   0   0   0    0         39
1985      1        4   23   4   5   2   1   3   0   1   2   0   0   0   0    2         47
          2       11   11   6   1   1   5   0   1   1   1   1   0   0   0    1         40
          3        9   22   6   2   4   3   3   4   7   1   2   0   0   0    0         63
          4        2   28   8   8   5   2   2   4   3   0   1   1   0   0    1         65
1986      1        5   26  14   6   9   2   5   5   5   1   2   0   0   0    2         82
          2        7   49  17  11   4   7   5   7   3   1   2   2   0   1    4        120
          3       13   37  21   9   3   5   7   3   1   3   1   0   0   0    6        109
          4       12   53  16  21   2   7   0   7   0   0   0   0   0   1    1        120
1987      1       21   44  29  11   6   4   2   2   1   0   2   0   2   2    8        134
          2       17   74  13  13   3   5   3   1   2   2   0   0   0   3    5        141
          3       36   58  23  14   7   4   1   2   1   3   0   0   0   3    1        153
          4       28   74  23  11   8   3   3   6   2   5   4   1   1   1    3        173
1988      1       31   80  16   9   3   2   8   3   1   4   6   2   1   2    6        174
          2       26   99  27   9   8  11   3   4   6   3   5   5   1   1    3        211
          3       31   95  35  13  18   4   6   4   4   3   3   2   0   3    3        224
          4       36   77  20  26  11   3   8   4   8   7   1   0   0   2    2        205
1989      1       32   92  32  10  12  19  12   4   3   2   0   2   2   0    2        224
          2       15   92  14  27  22  21  12   5   3   0   3   3   0   1    1        219
          3       34  104  29  31  18   8   6   7   3   8   0   2   1   2             ≥253
          4       38  101  34  18   9  15   6   1   2   2   2   3   2                 ≥233
1990      1       31  124  47  24  11  15   8   6   5   3   3   4                     ≥281
          2       32  132  36  10   9   7   6   4   4   5   0                         ≥245
          3       49  107  51  17  15   8   9   2   1   1                             ≥260
          4       44  153  41  16  11   6   5   7   2                                 ≥285
1991      1       41  137  29  33   7  11   6   4  ≥3                                 ≥271
          2       56  124  39  14  12   7  10  ≥1                                     ≥263
          3       53  175  35  17  13  11  ≥2                                         ≥306
          4       63  135  24  23  12  ≥1                                             ≥258
1992      1       71  161  48  25  ≥5                                                 ≥310
          2       95  178  39  ≥6                                                     ≥318
          3       76  181 ≥16                                                         ≥273
          4       67  ≥66                                                             ≥133

This procedure takes into account two aspects of uncertainty that are important in prediction, namely the inaccuracy of parameter estimates, and the random fluctuations in the unobserved y_jk*. The first enters through variation in α̂* and β̂* from replicate to replicate, and the second enters through the sampling variability of the predictand y*₊ⱼ over different replicates. The procedure does not allow for a third component of predictive error, due to uncertainty about the form of the model.
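The two-way fit and row-total prediction are straightforward in R; this minimal sketch uses a small synthetic table in place of Table 7.4 (all numbers invented), with the bottom-right cell unobserved:

tab <- data.frame(diag  = factor(rep(1:3, each = 3)),
                  delay = factor(rep(1:3, times = 3)),
                  y     = c(10, 6, 2, 12, 7, 3, 15, 9, NA))
fit <- glm(y ~ diag + delay, family = poisson, data = tab)  # NA row is dropped
mu <- predict(fit, newdata = tab, type = "response")
# predicted eventual total for row 3: observed part plus fitted unobserved part
sum(tab$y[tab$diag == 3], na.rm = TRUE) + sum(mu[tab$diag == 3 & is.na(tab$y)])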

The model described above is a generalized linear model with Poisson errors and the log link function. It contains 52 parameters. The deviance of 716.5 on 413 degrees of freedom is strong evidence that the data are overdispersed relative to the Poisson distribution. The estimate of κ is κ̂ = 1.78, and in fact a quasilikelihood model in which var(Y) = κμ appears to fit the data; this corresponds to treating the counts in Table 7.4 as independent negative binomial random variables.





The predicted value exp(α̂_j) Σ_k exp(β̂_k) is shown as the solid line in the left panel of Figure 7.5, together with the observed total reports to the end of 1992. The right panel shows the standardized Pearson residuals plotted against the estimated skewness μ̂^{−1/2}. The banding of residuals at the right is characteristic of data containing small counts, with the lower band corresponding to zeroes in the original data, the next to ones, and so forth. The distributions of the r_P change markedly, and it would be inappropriate to treat them as a homogeneous group. The same conclusion holds for the standardized deviance residuals, although they are less skewed for larger fitted values. The dotted lines in the figure divide the observations into three strata, within each of which the residuals are more homogeneous. Finer stratification has little effect on the results described below.

One parametric bootstrap involves generating Poisson random variables Y_jk* with means exp(α̂_j + β̂_k). This fails to account for the overdispersion, which can be mimicked by parametric sampling from a fitted negative binomial distribution with the same means and estimated overdispersion.

Nonparametric resampling from standardized Pearson residuals will give overdispersion, but the right panel of Figure 7.5 suggests that the residuals should be stratified. Figure 7.6 shows the ratio of deviances to degrees of freedom for 999 samples taken under these four sampling schemes; the strata used in the lower right panel are shown in Figure 7.5. Parametric simulation from the Poisson model is plainly inappropriate because the data so generated are much less dispersed than the original data, for which the ratio is 716.5/413.

Figure 7.5 Results from the fit of a Poisson two-way layout to the AIDS data. The left panel shows predicted diagnoses (solid), together with the actual totals to the end of 1992 (+). The right panel shows standardized Pearson residuals plotted against estimated skewness, μ̂^{−1/2}; the vertical lines are at skewness 0.6 and 1.



Figure 7.6 Resampling results for AIDS data. The left panels show deviance/degrees of freedom ratios for the four resampling schemes, with the observed ratio given as the vertical dotted line. The right panel shows predicted diagnoses (solid line), with pointwise 95% predictive intervals, based on 999 replicates of Poisson simulation (small dashes), of resampling residuals (dots), and of stratified resampling of residuals (large dashes).


Table 7.5 Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992.

                              1990         1991         1992
Poisson                     296  315     294  327     356  537
Negative binomial           294  318     289  333     317  560
Nonparametric               294  318     289  333     314  547
Stratified nonparametric    292  319     288  335     310  571

The negative binomial simulation gives more appropriate results, which seem rather similar to those for nonparametric simulation without stratification. When stratification is used, the results mimic the overdispersion much better.

The pointwise 95% prediction intervals for the numbers of AIDS diagnoses are shown in the right panel of Figure 7.6. The intervals for simulation from the fitted Poisson model are considerably narrower than the intervals from resampling residuals, both of which are similar. The intervals for the last quarters of 1990, 1991, and 1992 are given in Table 7.5.

There is little change if intervals are based on the deviance residual formula for the Poisson distribution, δ(y, μ) = ±[2{y log(y/μ) + μ − y}]^{1/2}.

A serious drawback with this analysis is that predictions from the two-way layout model are very sensitive to the last few rows of the table, to the extent that the estimate for the last row is determined entirely by the bottom left cell. Some sort of temporal smoothing is preferable, and we reconsider these data in Example 7.12. ■




7.3 Survival Data

Section 3.5 describes resampling methods for a single homogeneous sample of data subject to censoring. In this section we turn to problems where survival is affected by explanatory variables.

Suppose that the data (Y, D, x) on an individual consist of: a survival time Y; an indicator of censoring, D, that equals one if Y is observed and zero if Y is right-censored; and a covariate vector x. Under random censorship the observed value of Y is supposed to be min(Y⁰, C), where C is a censoring variable with distribution G, and the true failure time Y⁰ is a variable whose distribution F(y; β, x) depends on the covariates x through a vector of parameters, β. More generally we might suppose that Y⁰ and C are conditionally independent given x, and that C has distribution G(c; γ, x). In either case, the value of C is supposed to be uninformative about the parameter β.

Parametric model

In a parametric model F is fully specified once β has been chosen. So if the data consist of measurements (y₁, d₁, x₁),...,(yₙ, dₙ, xₙ) on independent individuals, we suppose that β is estimated, often by the maximum likelihood estimator β̂. Parametric simulation is performed by generating values Y_j⁰* from the fitted distributions F(y; β̂, x_j) and generating appropriate censoring times C_j*, setting Y_j* = min(Y_j⁰*, C_j*), and letting D_j* indicate the event Y_j⁰* ≤ C_j*. The censoring variables may be generated according to any one of the schemes outlined in Section 3.5, or otherwise if appropriate.
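For instance, with an exponential fit and censoring at a known time, one simulated dataset might be generated as in this minimal sketch (all values made up):

n <- 50; rate.hat <- 0.2; c0 <- 8        # hypothetical fit and censoring time
y0.star <- rexp(n, rate = rate.hat)      # failure times Y0* from the fitted model
y.star <- pmin(y0.star, c0)              # observed times
d.star <- as.numeric(y0.star <= c0)      # 1 = uncensored, 0 = censored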

Example 7.5 (PET film data) Table 7.6 contains data from an accelerated life test on PET film in gas insulated transformers; the film is used in electrical insulation. There are failure times y at each of four different voltages x. Three failure times are right-censored at voltage x = 5: according to the data source they were subject to censoring at a pre-determined time, but their values make it more likely that they were censored after a pre-determined number of failures, and we shall assume this in what follows.

The Weibull distribution is often used for such data. In this case plots suggest that both of its parameters depend on the voltage applied, and that there is an unknown threshold voltage x₀ below which failure cannot occur. Our model is that the distribution function for y at voltage x is given by

F(y; β, x) = 1 − exp{−(y/λ)^κ},   y > 0,

λ = exp{β₀ − β₁ log(x − 5 + e^{β₄})},    (7.16)

κ = exp(β₂ − β₃ log x).



Table 7.6 Failure times (hours) from an accelerated life test on PET film in SF₆ gas insulated transformers (Hirose, 1993). ≥ indicates right-censoring.


Voltage (kV)

 5    7131     8482     8559     8762     9026     9034     9104
      ≥9104.25 ≥9104.25 ≥9104.25

 7    50.25    87.75    87.76    87.77    92.90    92.91    95.96
      108.30   108.30   117.90   123.90   124.30   129.70   135.60
      135.60

10    15.17    19.87    20.18    21.50    21.88    22.23    23.02
      23.90    28.17    29.70

15    2.40     2.42     3.17     3.75     4.65     4.95     6.23
      6.68     7.30

This parametrization is chosen so that the range of each parameter is unbounded; note that x₀ = 5 − e^{β₄}.

The upper panels of Figure 7.7 show the fit of this model when the parameters are estimated by maximizing the log likelihood ℓ. The left panel shows Q-Q plots for each of the voltages, and the right panel shows the fitted mean failure time and estimated threshold x̂₀. The fit seems broadly adequate.

We simulate replicate datasets by generating observations from the Weibull model obtained by substituting the MLEs into (7.16). In order to apply our assumed censoring mechanism, we sort the observations simulated with x = 5 to get y*₍₁₎ ≤ ··· ≤ y*₍₁₀₎, say, and then set y*₍₈₎, y*₍₉₎, and y*₍₁₀₎ equal to y*₍₇₎ + 0.25. We give these three observations censoring indicators d* = 0, so that they are treated as censored, treat all the other observations as uncensored, and fit the Weibull model to the resulting data.

For sake of illustration, suppose that interest focuses on the mean failure time θ when x = 4.9. To facilitate this we reparametrize the model to have parameters θ and β = (β₁,...,β₄), where θ = 10⁻³λΓ(1 + 1/κ), with x = 4.9; here Γ(ν) is the gamma function ∫₀^∞ u^{ν−1}e^{−u} du. The lower left panel of Figure 7.7 shows the profile log likelihood for θ, i.e.

ℓ_prof(θ) = max_β ℓ(θ, β);

in the figure we renormalize the log likelihood to have maximum zero. Under the standard large-sample likelihood asymptotics outlined in Section 5.2.1, the approximate distribution of the likelihood ratio statistic

W(θ) = 2{ℓ_prof(θ̂) − ℓ_prof(θ)}

is χ²₁, so a 1 − α confidence set for the true θ is the set of θ such that

ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½c₁,₁₋α

(c_{ν,p} denotes the p quantile of the χ²_ν distribution),




Figure 7.7 PET reliability data analysis. Top left panel: Q-Q plot of log failure times against quantiles of log Weibull distribution, with fitted model given by dotted lines, and censored data by o. Top right panel: fitted mean failure time as a function of voltage x; the dotted line shows the estimated voltage x̂₀ below which failure is impossible. Lower left panel: normalized profile log likelihood for mean failure time θ at x = 4.9; the dotted line shows the 95% confidence interval for θ using the asymptotic chi-squared distribution, and the dashed line shows the 95% confidence interval using bootstrap calibration of the likelihood ratio statistic. Lower right panel: chi-squared Q-Q plot for simulated likelihood ratio statistic, with dotted line showing its large-sample distribution.


where θ̂ is the overall MLE. For these data θ̂ = 24.85 and the 95% confidence interval is [19.75, 35.53]; the confidence set contains values of θ for which ℓ_prof(θ) exceeds the dotted line in the bottom left panel of Figure 7.7.

The use of the chi-squared quantile to set the confidence interval presupposes that the sample is large enough for the likelihood asymptotics to apply, and this can be checked by the parametric simulation outlined above. The lower right panel of the figure is a Q-Q plot of likelihood ratio statistics w*(θ̂) = 2{ℓ*_prof(θ̂*) − ℓ*_prof(θ̂)} based on 999 sets of data simulated from the fitted model. The distribution of the w*(θ̂) is close to chi-squared, but with



Table 7.7 Comparison of estimated biases and standard errors of maximum likelihood estimates for the PET reliability data, using standard first-order likelihood theory, parametric bootstrap simulation, and model-based nonparametric resampling.

ℓ̈ is the matrix of second derivatives of ℓ with respect to θ and β.

Parameter    MLE      Likelihood        Parametric        Nonparametric
                      Bias     SE       Bias     SE       Bias     SE

β₀          6.346     0       0.117     0.007   0.117     0.001   0.112
β₁          1.958     0       0.082     0.007   0.082     0.006   0.080
β₂          4.383     0       0.850     0.127   0.874     0.109   0.871
β₃          1.235     0       0.388     0.022   0.393     0.022   0.393
x₀          4.758     0       0.029    −0.004   0.030    −0.002   0.028

mean 1.12, and their 0.95 quantile is w*₍₉₅₀₎ = 4.09, to be compared with c₁,₀.₉₅ = 3.84. This gives as bootstrap calibrated 95% confidence interval the set of θ such that ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ × 4.09, that is [19.62, 36.12], which is slightly wider than the standard interval.

Table 7.7 compares the bias estimates and standard errors for the model parameters using the parametric bootstrap described above and standard first-order likelihood theory, under which the estimated biases are zero, and the variance estimates are obtained as the diagonal elements of the inverse observed information matrix (−ℓ̈)⁻¹ evaluated at the MLEs. The estimated biases are small but significantly different from zero. The largest differences between the standard theory and the bootstrap results are for β₂ and β₃, for which the biases are of order 2-3%. The threshold parameter x₀ is well determined; the standard 95% confidence interval based on its asymptotic normal distribution is [4.701, 4.815], whereas the normal interval with estimated bias and variance is [4.703, 4.820].

A model-based nonparametric bootstrap may be performed by using residuals e = (y/λ̂)^{κ̂}, three of which are censored, then resampling errors ε* from their product-limit estimate, and then making uncensored bootstrap observations λ̂ε*^{1/κ̂}. The observations with x = 5 are then modified as outlined above, and the model refitted to the resulting data. The product-limit estimate for the residuals is very close to the survivor function of the standard exponential distribution, so we expect this to give results similar to the parametric simulation, and this is what we see in Table 7.7.

For censoring at a pre-determined time c, the simulation algorithms would work as described above, except that values of y* greater than c would be replaced by c and the corresponding censoring indicators d* set equal to zero. The number of censored observations in each simulated dataset would then be random; see Practical 7.3.

Plots show that the simulated MLEs are close to normally distributed: in this case standard likelihood theory works well enough to give good confidence intervals for the parameters. The benefit of parametric simulation is that the bootstrap estimates give empirical evidence that the standard theory can be trusted, while providing alternative methods for calculating measures of uncertainty if the standard theory is unreliable. It is typical of first-order likelihood methods that the variability of likelihood quantities is underestimated, although here the effect is small enough to be unimportant. ■




Proportional hazards model

If it can be assumed that the explanatory variables act multiplicatively on the hazard function, an elegant and powerful approach to survival data analysis is possible. Under the usual form of proportional hazards model the hazard function for an individual with covariates x is dΛ(y) = exp(xᵀβ)dΛ⁰(y), where dΛ⁰(y) is the 'baseline' hazard function that would apply to an individual with a fixed value of x, often x = 0. The corresponding cumulative hazard and survivor functions are

Λ(y) = ∫₀^y exp(xᵀβ) dΛ⁰(u),   1 − F(y; β, x) = {1 − F⁰(y)}^{exp(xᵀβ)},

where 1 − F⁰(y) is the baseline survivor function for the hazard dΛ⁰(y). The regression parameters β are usually estimated by maximizing the partial

likelihood, which is the product over cases with d_j = 1 of terms

exp(x_jᵀβ) / {Σₖ₌₁ⁿ H(y_k − y_j) exp(x_kᵀβ)},    (7.17)

where H(u) equals zero if u < 0 and equals one otherwise. Since (7.17) is unaltered by recentring the x_j, we shall assume below that Σ x_j = 0; the baseline hazard then corresponds to the average covariate value x̄ = 0.

In terms of the estimated regression parameters the baseline cumulative hazard function is estimated by the Breslow estimator

Λ̂⁰(y) = Σ_{j: y_j ≤ y} d_j / {Σₖ₌₁ⁿ H(y_k − y_j) exp(x_kᵀβ̂)},    (7.18)

a non-decreasing function that jumps at y_j by

dΛ̂⁰(y_j) = d_j / {Σₖ₌₁ⁿ H(y_k − y_j) exp(x_kᵀβ̂)}.

One standard estimator of the baseline survivor function is

1 − F̂⁰(y) = Π_{j: y_j ≤ y} {1 − dΛ̂⁰(y_j)},    (7.19)

which generalizes the product-limit estimate (3.9), although other estimators also exist. Whichever of them is used, the proportional hazards assumption implies that

{1 − F̂⁰(y)}^{exp(x_jᵀβ̂)}


7.3 ■ Survival Data 351

will be the estim ated survivor function for an individual with covariate valuesX j .
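For concreteness, (7.18) and (7.19) might be computed as below; this is our own sketch, assuming untied failure times (the function name breslow is ours, not from the text).

    import numpy as np

    def breslow(y, d, x, beta):
        # Breslow estimate (7.18) of the baseline cumulative hazard and the
        # survivor estimate (7.19), assuming no tied failure times
        risk = np.exp(x @ beta)
        order = np.argsort(y)
        y, d, risk = y[order], d[order], risk[order]
        denom = np.cumsum(risk[::-1])[::-1]     # sum of exp(x_k' beta) over y_k >= y_j
        dLambda0 = d / denom                    # jumps of the cumulative hazard
        t = y[d == 1]
        Lambda0 = np.cumsum(dLambda0)[d == 1]   # (7.18), at the failure times
        S0 = np.cumprod(1 - dLambda0)[d == 1]   # (7.19)
        return t, Lambda0, S0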

Under the random censorship model, the survivor function of the censoring distribution G is given by (3.11).

The bootstrap methods for censored data outlined in Section 3.5 extend straightforwardly to this setting. For example, if the censoring distribution is independent of the covariates, we generate a single sample under the conditional sampling plan according to the following algorithm.

Algorithm 7.2 (Conditional resampling for censored survival data)

For j = 1, …, n,

1  generate Y_j^{0*} from the estimated failure time survivor function {1 − F̂⁰(y)}^{exp(x_jᵀβ̂)};

2  if d_j = 0, set C*_j = y_j, and if d_j = 1, generate C*_j from the conditional censoring distribution given that C_j > y_j, namely {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}; then

3  set Y*_j = min(Y_j^{0*}, C*_j), with D*_j = 1 if Y*_j = Y_j^{0*} and zero otherwise.
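Algorithm 7.2 can be sketched as follows, representing each fitted survivor function by its jump times and values. The helper step_masses, the function names, and the tie-breaking rule in step 3 are our own assumptions, not prescriptions from the text.

    import numpy as np

    def step_masses(surv, power=1.0):
        # point masses of the distribution whose survivor function is surv**power,
        # surv being the values of a step survivor function at its jump times
        s = np.concatenate(([1.0], surv)) ** power
        p = s[:-1] - s[1:]
        p[-1] += s[-1]            # lump any remaining mass on the largest time
        return p

    def conditional_resample(x, y, d, beta, t0, S0, tG, SG, rng):
        # t0, S0: jump times/values of the baseline survivor estimate (7.19);
        # tG, SG: the same for the censoring survivor estimate G-hat
        n = len(y)
        y_star, d_star = np.empty(n), np.empty(n, dtype=int)
        pG = step_masses(SG)
        for j in range(n):
            T = rng.choice(t0, p=step_masses(S0, power=np.exp(x[j] @ beta)))  # step 1
            if d[j] == 0:
                C = y[j]                                                      # step 2
            else:
                keep = tG > y[j]
                w = pG[keep]
                # no censoring mass beyond y_j: treat as never censored (our convention)
                C = rng.choice(tG[keep], p=w / w.sum()) if w.sum() > 0 else np.inf
            y_star[j] = min(T, C)                                             # step 3
            d_star[j] = int(T <= C)
        return y_star, d_star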

Under the more general model where the distribution G of C also depends upon the covariates and a proportional hazards assumption is appropriate for G, the estimated censoring survivor function when the covariate is x is

    1 − G(y; γ̂, x) = {1 − Ĝ⁰(y)}^{exp(xᵀγ̂)},

where Ĝ⁰(y) is the estimated baseline censoring distribution given by the analogues of (7.18) and (7.19), in which 1 − d_j and γ replace d_j and β. Under model-based resampling, a bootstrap dataset is then obtained by

Algorithm 7.3 (Resampling for censored survival data)

For j = 1, …, n,

1  generate Y_j^{0*} from the estimated failure time survivor function {1 − F̂⁰(y)}^{exp(x_jᵀβ̂)}, and independently generate C*_j from the estimated censoring survivor function {1 − Ĝ⁰(y)}^{exp(x_jᵀγ̂)}; then

2  set Y*_j = min(Y_j^{0*}, C*_j), with D*_j = 1 if Y*_j = Y_j^{0*} and zero otherwise.
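Algorithm 7.3 then differs from the sketch above only in drawing C*_j unconditionally. A short version reusing step_masses (again our own naming, with gamma the fitted censoring coefficients):

    def model_based_resample(x, y, d, beta, gamma, t0, S0, tG0, SG0, rng):
        # both T* and C* are drawn from fitted proportional hazards models
        n = len(y)
        y_star, d_star = np.empty(n), np.empty(n, dtype=int)
        for j in range(n):
            T = rng.choice(t0, p=step_masses(S0, power=np.exp(x[j] @ beta)))
            C = rng.choice(tG0, p=step_masses(SG0, power=np.exp(x[j] @ gamma)))
            y_star[j] = min(T, C)
            d_star[j] = int(T <= C)
        return y_star, d_star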

The next example illustrates the use of these algorithms.


Example 7.6 (Melanoma data) To illustrate these ideas, we consider data on the survival of patients with malignant melanoma, whose tumours were removed by operation at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Operations took place from 1962 to 1977, and patients were followed to the end of 1977. Each tumour was completely removed, together with about 2.5 cm of the skin around it. The following variables were available for 205 patients: time in days since the operation, possibly censored; status at the end of the study (alive, dead from melanoma, dead from other causes); sex; age; year of operation; tumour thickness in mm; and an indicator of whether or not the tumour was ulcerated. Ulceration and tumour thickness are important prognostic variables: to have a thick or ulcerated tumour substantially increases the chance of death from melanoma, and we shall investigate how they affect survival. We assume that censoring occurs at random.

We fit a proportional hazards model under the assumption that the baseline hazards are different for the ulcerated group of 90 individuals and the non-ulcerated group, but that there is a common effect of tumour thickness. For a flexible assessment of how thickness affects the hazard function, we fit a natural spline with four degrees of freedom; its knots are placed at the empirical 0.25, 0.5 and 0.75 quantiles of the tumour thicknesses. Thus our model is that the survivor functions for the ulcerated and non-ulcerated groups are

    1 − F₁(y; β, x) = {1 − F₁⁰(y)}^{exp(xᵀβ)},        1 − F₂(y; β, x) = {1 − F₂⁰(y)}^{exp(xᵀβ)},

where x has dimension four and corresponds to the spline, β is common to the groups, but the baseline survivor functions 1 − F₁⁰(y) and 1 − F₂⁰(y) may differ. For illustration we take the fitted censoring distribution to be the product-limit estimate obtained by setting censoring indicators d′ = 1 − d and fitting a model with no covariates, so Ĝ is just the product-limit estimate of the censoring time distribution. The left panel of Figure 7.8 shows the estimated survivor functions 1 − F̂₁⁰(y) and 1 − F̂₂⁰(y); there is a strong effect of ulceration. The right panel shows how the linear predictor xᵀβ̂ depends on tumour thickness: from 0–3 mm the effect on the baseline hazard changes from about exp(−1) = 0.37 to about exp(0.6) = 1.8, followed by a slight dip and a gradual upward increase to a risk of about exp(1.2) = 3.3 for a tumour 15 mm thick. Thus the hazard increases by a factor of about 10, but most of the increase takes place from 0–3 mm. However, there are too few individuals with tumours more than 10 mm thick for reliable inferences at the right of the panel.

The top left panel of Figure 7.9 shows the original fitted linear predictor, together with 19 replicates obtained by resampling cases, stratified by ulceration. The lighter solid lines in the panel below are pointwise 95% confidence limits, based on R = 999 replicates of this sampling scheme. In effect these are percentile method confidence limits for the linear predictor at each thickness.


Figure 7.8 Fit of a proportional hazards model for ulcer histology and survival of patients with malignant melanoma (Andersen et al., 1993, pp. 709–714). Left panel: estimated baseline survivor functions for cases with ulcerated (dots) and non-ulcerated (solid) tumours, against time (days). Right panel: fitted linear predictor xᵀβ̂ for risk as a function of tumour thickness (mm). The lower rug is for non-ulcerated patients, and the upper rug for ulcerated patients.

The sharp increase in risk for small thicknesses is clearly a genuine effect, while beyond 3 mm the confidence interval for the linear predictor is roughly [0, 1], with thickness having little or no effect.

Results from model-based resampling using the fitted model and applying Algorithm 7.3, and from conditional resampling using Algorithm 7.2, are also shown; they are very similar to the results from resampling cases. In view of the discussion in Section 3.5, we did not apply the weird bootstrap.

The right panels of Figure 7.9 show how the estimated 0.2 quantile of the survival distribution, ŷ₀.₂ = min{y : F̂₁(y; β̂, x) ≥ 0.2}, depends on tumour thickness. There is an initial sharp decrease from 3000 days to about 750 days as tumour thickness increases from 0–3 mm, but the estimate is roughly constant from then on. The individual estimates are highly variable, but the degree of uncertainty mirrors roughly that in the left panels. Once again results for the three resampling schemes are very similar.

Unlike the previous example, where resampling and standard likelihood methods led to similar conclusions, this example shows the usefulness of resampling when standard approaches would be difficult or impossible to apply. ■

7.4 Other Nonlinear Models

A nonlinear regression model with independent additive errors is of the form

    y_j = μ(x_j, β) + ε_j,        j = 1, …, n,        (7.20)



Figure 7.9 Bootstrap results for melanoma data analysis. Top left: fitted linear predictor (heavy solid) and 19 replicates from case resampling (solid); the rug shows observed thicknesses. Top right: estimated 0.2 quantile of survivor distribution as a function of tumour thickness, for an individual with an ulcerated tumour (heavy solid), and 19 replicates for case resampling (solid); the rug shows observed thicknesses. Bottom left: pointwise 95% percentile confidence limits for linear predictor, from case (solid), model-based (dots), and conditional (dashes) resampling. Bottom right: pointwise 95% percentile confidence limits for 0.20 quantile of survivor distribution, from case (solid), model-based (dots), and conditional (dashes) resampling, R = 999.

with μ(x, β) nonlinear in the parameter β, which may be vector or scalar. The linear algebra associated with least squares estimates for linear regression no longer applies exactly. However, least squares theory can be developed by linear approximation, and the least squares estimate β̂ can often be computed accurately by iterative linear fitting.

The linear approximation to (7.20), obtained by Taylor series expansion, gives

    y_j − μ(x_j, β′) ≈ u_jᵀ(β − β′) + ε_j,        j = 1, …, n,        (7.21)


where

    u_j = ∂μ(x_j, β)/∂β evaluated at β = β′.

This defines an iteration that starts at β′ using a linear regression least squares fit, and at the final iteration β′ = β̂. At that stage the left-hand side of (7.21) is simply the residual e_j = y_j − μ(x_j, β̂). Approximate leverage values and other diagnostics are obtained from the linear approximation, that is, using the definitions in previous sections but with the u_j evaluated at β′ = β̂ as the values of explanatory variable vectors. This use of the linear approximation can give misleading results, depending upon the "intrinsic curvature" of the regression surface. In particular, the residuals will no longer have zero expectation in general, and standardized residuals r_j will no longer have constant variance under homoscedasticity of the true errors.
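In code, the iteration might look as follows. This is a sketch under our own naming; mu and dmu are assumed user-supplied functions returning μ(x_j, β) and the n × p matrix with rows u_j.

    import numpy as np

    def gauss_newton(mu, dmu, x, y, beta, steps=100, tol=1e-10):
        # repeated linear least squares fits of the residuals on u_j, as in (7.21)
        for _ in range(steps):
            U = dmu(x, beta)                  # n x p matrix with rows u_j
            e = y - mu(x, beta)
            delta = np.linalg.lstsq(U, e, rcond=None)[0]
            beta = beta + delta
            if np.max(np.abs(delta)) < tol:   # converged: beta' = beta-hat
                break
        return beta

    # at convergence, the approximate variance (7.22) below is obtained from
    #   U = dmu(x, beta); e = y - mu(x, beta)
    #   s2 = e @ e / (len(y) - len(beta)); var = s2 * np.linalg.inv(U.T @ U)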

The usual normal approximation for the distribution of β̂ is also based on the linear approximation. For the approximate variance, (6.24) applies with X replaced by U = (u₁, …, u_n)ᵀ evaluated at β̂. So with s² equal to the residual mean square, we have

    β̂ − β ∼ N(0, s²(UᵀU)⁻¹).        (7.22)

The accuracy of this approximation will depend upon two types of curvature effects, called parameter effects and intrinsic effects. The first of these is specific to the parametrization used in expressing μ(x, ·), and can be reduced by careful choice of parametrization. Of course resampling methods will be the more useful the larger are the curvature effects, and the worse the normal approximation.

Resampling methods apply here just as with linear regression, either simulating data from the fitted model with resampled modified residuals or by resampling cases. For the first of these it will generally be necessary to make a mean adjustment to whatever residuals are being used as the error population. It would also be generally advisable to correct the raw residuals for bias due to nonlinearity: we do not show how to do this here.
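Both schemes can be sketched in a few lines, reusing gauss_newton above; the bias correction of raw residuals just mentioned is omitted, and mu, dmu are the same assumed user-supplied functions.

    def boot_nonlinear(mu, dmu, x, y, beta_hat, R, rng, scheme="cases"):
        # case resampling, or model-based resampling of mean-adjusted residuals
        n = len(y)
        fitted = mu(x, beta_hat)
        res = y - fitted
        res = res - res.mean()                # mean adjustment of residuals
        beta_star = np.empty((R, len(beta_hat)))
        for r in range(R):
            if scheme == "cases":
                idx = rng.integers(0, n, n)
                xs, ys = x[idx], y[idx]
            else:
                xs, ys = x, fitted + rng.choice(res, size=n)
            beta_star[r] = gauss_newton(mu, dmu, xs, ys, beta_hat)
        return beta_star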

Example 7.7 (Calcium uptake data) The data plotted in Figure 7.10 show the calcium uptake of cells, y, as a function of time x after being suspended in a solution of radioactive calcium. Also shown is the fitted curve

    μ(x, β) = β₀{1 − exp(−β₁x)}.

The least squares estimates are β̂₀ = 4.31 and β̂₁ = 0.209, and the estimate of σ is 0.55 with 25 degrees of freedom. The standard errors for β̂₀ and β̂₁ based on (7.22) are 0.30 and 0.039.



Figure 7.10 Calcium uptake data and fitted curve (left panel), with raw residuals (right panel) (Rawlings, 1988, p. 403).

Table 7.8 Results from R = 999 replicates of stratified case resampling for nonlinear regression model fitted to calcium data.

         Estimate   Bootstrap bias   Theoretical SE   Bootstrap SE
    β̂₀    4.31       0.028            0.30             0.38
    β̂₁    0.209      0.004            0.039            0.040

The right panel of Figure 7.10 shows that homogeneity of variance is slightly questionable here, so we resample cases by stratified sampling. Estimated biases and standard errors for β̂₀ and β̂₁ based on 999 bootstrap replicates are given in Table 7.8. The main point to notice is the appreciable difference between theoretical and bootstrap standard errors for β̂₀.

Figure 7.11 illustrates the results. Note the non-elliptical pattern of variation and the non-normality: the z-statistics are also quite non-normal. In this case the bootstrap should give better results for confidence intervals than normal approximations, especially for β̂₀. The bottom right panel suggests that the parameter estimates are closer to normal on logarithmic scales.

Results for model-based resampling assuming homoscedastic errors are fairly similar, although the standard error for β̂₀ is then 0.32. The effects of nonlinearity are negligible in this case: for example, the maximum absolute bias of residuals is about 0.012σ.

Suppose that we want confidence limits on some aspect of the curve, such as the "proportion of maximum" π = 1 − exp(−β₁x). Ordinarily one might


Figure 7.11 Parameter estimates for case resampling of calcium data, with R = 999. The upper panels show normal plots of β̂₀* and β̂₁*, while the lower panels show their joint distributions on the original (left) and logarithmic (right) scales.


approach this by applying the delta method together with the bivariate normal approximation for least squares estimates, but the bootstrap can deal with this using only the simulated parameter estimates. So consider the times x = 1, 5, 15, at which the estimates π̂ = 1 − exp(−β̂₁x) are 0.188, 0.647 and 0.956 respectively. The top panel of Figure 7.12 shows bootstrap distributions of π* = 1 − exp(−β₁*x): note the strong non-normality at x = 15.

The constraint that π must lie in the interval (0, 1) means that it is unwise to construct basic or studentized confidence intervals for π itself. For example, the basic bootstrap 95% interval for π at x = 15 is [0.922, 1.025]. The solution is to do all the calculations on the logit scale, as shown in the lower panel of Figure 7.12, and untransform the limits obtained at the end. That is, we obtain


Figure 7.12 Calcium uptake data: bootstrap histograms for estimated proportion of maximum π = 1 − exp(−β₁x) at x = 1, 5 and 15, based on R = 999 resamples of cases.

intervals [η₁, η₂] for η = log{π/(1 − π)}, and then take

    [ exp(η₁)/{1 + exp(η₁)},  exp(η₂)/{1 + exp(η₂)} ]

as the corresponding intervals for π. The resulting 95% intervals are [0.13, 0.26] at x = 1, [0.48, 0.76] at x = 5, and [0.83, 0.98] at x = 15. The standard linear theory gives slightly different values, e.g. [0.10, 0.27] at x = 1 and [0.83, 1.03] at x = 15. ■
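As an illustration of the transformation, a basic bootstrap interval computed on the logit scale and then untransformed might be coded as follows (our own sketch; pi_hat is the original estimate and pi_star the vector of bootstrap replicates):

    import numpy as np

    def basic_interval_logit(pi_hat, pi_star, alpha=0.05):
        # basic bootstrap limits for eta = logit(pi), untransformed at the end
        eta_hat = np.log(pi_hat / (1 - pi_hat))
        eta_star = np.log(pi_star / (1 - pi_star))
        hi_q, lo_q = np.quantile(eta_star, [1 - alpha / 2, alpha / 2])
        eta1, eta2 = 2 * eta_hat - hi_q, 2 * eta_hat - lo_q
        return 1 / (1 + np.exp(-eta1)), 1 / (1 + np.exp(-eta2))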

7.5 Misclassification Error

The discussion of aggregate prediction error in Section 6.4.1 was expressed in a general notation that would apply also to the regression models described in this chapter, with appropriate definitions of prediction rule ŷ₊ = μ(x₊, F̂) for a response y₊ at covariate values x₊, and measure of accuracy c(y₊, ŷ₊). The general conclusions of Section 6.4.1 concerning bootstrap and cross-validation estimates of aggregate prediction error should apply here also. In particular the adjusted K-fold cross-validation estimate and the 0.632 bootstrap estimate should be preferred in most situations.



One type of problem that deserves special attention, in part because it differs most from the examples of Section 6.4.1, is the estimation of prediction error for binary responses, supposing these to be modelled by a generalized linear model of the sort discussed in Section 7.2. If the binary response corresponds to a classification indicator, then prediction of response y₊ for an individual with covariate vector x₊ is equivalent to classification of that individual, and incorrect prediction (ŷ₊ ≠ y₊) is a misclassification error.

Suppose, then, that the response y is 0 or 1, and that the prediction rule μ(x₊, F̂) is an estimate of Pr(Y₊ = 1 | x₊) for a new case (x₊, y₊). We imagine that this estimated probability is translated into a prediction of y₊, or equivalently a classification of the individual with covariate x₊. For simplicity we set ŷ₊ = 1 if μ(x₊, F̂) > 1/2 and ŷ₊ = 0 otherwise; this would be modified if incidence rates for the two classes differed. If costs of both types of misclassification error are equal, as we shall assume, then it is enough to set

    c(y₊, ŷ₊) = 1 if ŷ₊ ≠ y₊, and 0 otherwise.        (7.23)

The aggregate prediction error D is simply the overall misclassification rate, equal to the proportion of cases where y₊ is wrongly predicted.

The special feature of this problem is that the prediction and the measure of error are not continuous functions of the data. According to the discussion in Section 6.4.1 we should then expect bootstrap methods for estimating D or its expected value Δ to be superior to cross-validation estimates, in terms of variability. Also leave-one-out cross-validation is no longer attractive on computational grounds, because we now have to refit the model for each resample.
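As a concrete sketch, the 0.632 estimate of misclassification rate described in Section 6.4.1 might be computed as follows. fit and predict are assumed user-supplied functions, with predict returning 0/1 classifications using the 1/2 threshold of (7.23); the name err632 is ours.

    import numpy as np

    def err632(x, y, fit, predict, R, rng):
        # 0.632 estimator: weight the apparent error and the error on cases
        # left out of each bootstrap resample by 0.368 and 0.632
        n = len(y)
        app = np.mean(predict(fit(x, y), x) != y)       # apparent error
        errs = []
        for _ in range(R):
            idx = rng.integers(0, n, n)
            out = np.setdiff1d(np.arange(n), idx)       # cases not resampled
            if len(out) == 0:
                continue
            model = fit(x[idx], y[idx])
            errs.append(np.mean(predict(model, x[out]) != y[out]))
        return 0.368 * app + 0.632 * np.mean(errs)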

Example 7.8 (Urine data) For an example of the estimation of misclassification error, we take binary data on the presence of calcium oxalate crystals in 79 samples of urine. Explanatory variables are specific gravity, i.e. the density of urine relative to water; pH; osmolarity (mOsm); conductivity (mMho); urea concentration (millimoles per litre); and calcium concentration (millimoles per litre). After dropping two incomplete cases, 77 remain.

Consider how well the presence of crystals can be predicted from the explanatory variables. Analysis of deviance for binary logistic regression suggests the model which includes the p = 4 covariates specific gravity, conductivity, log calcium concentration, and log urea concentration, and we base our predictions on this model. The simplest estimate of the expected aggregate prediction error Δ is the average number of misclassifications, Δ̂_app = n⁻¹ Σ_j c(y_j, ŷ_j), with c(·, ·) given by (7.23); it would be equivalent to use instead

    c{y_j, μ̂(x_j; F̂)} = 1 if |y_j − μ̂(x_j; F̂)| > 1/2, and 0 otherwise.


Table 7.9 Estimates of aggregate prediction error (×10⁻²), or misclassification rate, for urine data (Andrews and Herzberg, 1985, pp. 249–251).

                            K-fold (adjusted) cross-validation
    Bootstrap   0.632    K=77    K=38          K=10          K=7           K=2
    24.7        22.1     23.4    23.4 (23.7)   20.8 (21.0)   26.0 (25.4)   20.8 (20.8)


In this case Δ̂_app = 20.8 × 10⁻². Other estimates of aggregate prediction error are given in Table 7.9. For the bootstrap and 0.632 estimates, we used R = 200 bootstrap resamples.

The discontinuous nature of prediction error gives more variable results than for the examples with squared error in Section 6.4.1. In particular the results for K-fold cross-validation now depend more critically on which observations fall into the groups. For example, the average and standard deviation of Δ̂_CV for 40 repeats were 23.0 × 10⁻² and 2.0 × 10⁻². However, the broad pattern is similar to that in Table 6.9.

Figure 7.13 Components of 0.632 estimate of prediction error, y_j − μ̂(x_j; F̂*), for urine data based on 200 bootstrap simulations, with cases ordered by residual. Values within the dotted lines make no contribution to prediction error. The components from cases 54 and 66 are the rightmost and the fourth from rightmost sets of errors shown; the components from case 27 are leftmost.

Figure 7.13 shows box plots of the quantities y_j − μ̂(x_j; F̂*) that contribute to the 0.632 estimate of prediction error, plotted against case j ordered by the residual; only three values of j are labelled. There are about 74 contributions at each value of j. Only values outwith the horizontal dotted lines contribute to prediction error. The pattern is broadly what we would expect: observations with residuals close to zero are generally well predicted, and make little contribution to prediction error. More extreme residuals contribute most to prediction error. Note cases 66 and 54, which are always misclassified; their standardized Pearson residuals are 2.13 and 2.54. The figure suggests that case 54 is outlying.


Table 7.10 Summary results for estimates of prediction error for 200 samples of size n = 50 from data on low birth weights (Hosmer and Lemeshow, 1989, pp. 247–252; Venables and Ripley, 1994, p. 193). The table shows the average, standard deviation, and conditional mean squared error (×10⁻²) for the 200 estimates of excess error. The "target" average excess error is 8.3 × 10⁻².

                              K-fold (adjusted) cross-validation
          Bootstrap  0.632   K=50    K=25          K=10          K=5           K=2
    Mean    9.1      8.8     11.5    11.7 (11.5)   12.2 (11.7)   12.4 (11.3)   15.3 (11.1)
    SD      1.2      1.9     4.4     4.5 (4.2)     5.0 (4.6)     4.8 (3.9)     7.1 (4.6)
    MSE     0.38     0.29    0.62    0.64 (0.63)   0.76 (0.73)   0.64 (0.54)   1.14 (0.59)

At the other end is case 27, whose residual is −1.84; this was misclassified 42 times out of 65 in our simulation. ■

Example 7.9 (Low birth weights) In order to compare the properties of estimates of misclassification error under repeated sampling, we took data on 189 births at a US hospital to be our population F. The binary response equals zero for babies with birth weight less than 2.5 kg, and equals one otherwise. We took 200 samples of size n = 50 from these data, and to each sample we fitted a binary logistic model with nine regression parameters expressing dependence on maternal characteristics: weight, smoking status, number of previous premature labours, hypertension, uterine irritability and the number of visits to the physician in the first trimester. For each of the samples we calculated various cross-validation and bootstrap estimates of misclassification rate, using R = 200 bootstrap resamples.

Table 7.10 shows the results of this experiment, expressed in terms of estimates of the excess error, which is the difference between the true misclassification rate D and the apparent error rate Δ̂_app found by applying the prediction rule to the data. The "target" value of the average excess error over the 200 samples was 8.3 × 10⁻²; the average apparent error was 20.0 × 10⁻².

The bootstrap and 0.632 excess error estimates again perform best overall in terms of mean, variability, and conditional mean squared error. Note that the standard deviations for the bootstrap and 0.632 estimates suggest that R = 50 would have given results accurate enough for most purposes.

Ordinary cross-validation is significantly better than K-fold cross-validation, unless K = 25. However, the results for K-fold adjusted cross-validation are not significantly different from those for unadjusted cross-validation, even with K = 2. Thus if cross-validation is to be used, adjusted K-fold cross-validation offers considerable computational savings over ordinary cross-validation, and is about equally accurate.

For reasons outlined in Example 3.6, the EDF of the data may be a poor estimate of the original CDF when there are binary responses y_j. One way to overcome this is to switch the response value with small probability, i.e. to replace (x*_j, y*_j) with (x*_j, 1 − y*_j) with probability (say) 0.1. This corresponds to a binomial simulation using probabilities shrunk somewhat towards 0.5 from the observed values of 0 and 1. It should produce results that are smoother than those obtained under case resampling from the original data. Our simulation experiment included this randomized bootstrap, but although typically it improves slightly on bootstrap results, the results here were very similar to those for the ordinary bootstrap. ■

In principle resampling estimates of misclassification rates could be used to select which covariates to include in the prediction rule, along the lines given for linear regression in Section 6.4.2. It seems likely, in the light of the preceding example, that the bootstrap approach would be preferable.

7.6 Nonparametric Regression

So far we have considered regression models in which the mean response is related to covariates x through a function of known form with a small number of unknown parameters. There are, however, occasions when it is useful to assess the effects of covariates x without completely specifying the form of the relationship between mean response μ and x. This is done using nonparametric regression methods, of which there are now a large number.

The simplest nonparametric regression relationship for scalar x is

    y = μ(x) + ε,

where μ(x) has completely unknown form but would be assumed continuous in many applications, and ε is a random error with zero mean. A typical application is illustrated by the scatter plot in Figure 7.14. Here no simple parametric regression curve seems appropriate, so it makes sense to fit a smooth curve (which we do later in Example 7.10) with as few restrictions as possible.

Often nonparametric regression is used as an exploratory tool, either directly by producing a curve estimate for visual interpretation, or indirectly by providing a comparison with some tentative parametric model fit via a significance test. In some applications the rather different objective of prediction will be of interest. Whatever the application, the complicated nature of nonparametric regression methods makes it unlikely that probability distributions for statistics of interest can be evaluated theoretically, and so resampling methods will play a prominent role.

It is not possible here to describe all of the nonparametric regression methods that are now available, and in any event many of them do not yet have fully developed companion resampling methods. We shall limit ourselves to a brief discussion of some of the main methods, and to applications in generalized additive models, where nonparametric regression is used to extend the generalized linear models of Section 7.2.


Figure 7.14 Motorcycle impact data. Acceleration y (g) at a time x milliseconds after impact (Silverman, 1985).

7.6.1 Nonparametric curves

Several nonparametric curve-fitting algorithms are variants on the idea of local averaging. One such method is kernel smoothing, which estimates the mean response E(Y | x) = μ(x) by

    μ̂(x) = Σ_j y_j w{(x − x_j)/b} / Σ_j w{(x − x_j)/b},        (7.24)

with w(·) a symmetric density function and b an adjustable "bandwidth" constant that determines how widely the averaging is done. This estimate is similar in many ways to the kernel density estimate discussed in Example 5.13, and as there the choice of b depends upon a trade-off between bias and variability of the estimate: small b gives small bias and large variance, whereas large b has the opposite effects. Ideally b would vary with x, to reflect large changes in the derivative of μ(x) and heteroscedasticity, both evident in Figure 7.14.
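For concreteness, (7.24) can be computed in a couple of lines; here we assume a Gaussian kernel for w, purely for illustration.

    import numpy as np

    def kernel_smooth(x0, x, y, b):
        # Nadaraya-Watson estimate (7.24) evaluated at the points x0;
        # x, y are numpy arrays of covariates and responses
        w = np.exp(-0.5 * ((np.asarray(x0)[:, None] - x[None, :]) / b) ** 2)
        return (w * y).sum(axis=1) / w.sum(axis=1)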

Modifications to the estimate (7.24) are needed at the ends of the x range, to avoid the inherent bias when there is little or no data on one side of x. In many ways more satisfactory are the local regression methods, where a local linear or quadratic curve is fitted using weights w{(x − x_j)/b} as above, and then μ̂(x) is taken to be the fitted value at x. Implementations of this idea include the lowess method, which also incorporates trimming of outliers. Again the choice of b is critical.

A different approach is to define a curve in terms of basis functions, such as powers of x, which define polynomials. The fitted model is then a linear combination of basis functions, with coefficients determined by least squares regression. Which basis to use depends on the application, but polynomials are generally bad because fitted values become increasingly variable as x moves toward the ends of its data range; polynomial extrapolation is notoriously poor. One popular choice of basis functions is cubic splines, with which μ(x) is modelled by a series of cubic polynomials joined at "knot" values of x, such that the curve has continuous second derivatives everywhere. The least squares cubic spline fit minimizes the penalized least squares criterion for fitting μ(x),

    Σ_j {y_j − μ(x_j)}² + λ ∫ {μ″(x)}² dx;

weighted sums of squares can be used if necessary. In most software implementations the spline fit can be determined either by specifying the degrees of freedom of the fitted curve, or by applying cross-validation (Section 6.4.1).

A spline fit will generally be biased, unless the underlying curve is in fact a cubic. That such bias is nearly always present for nonparametric curve fits can create difficulties. The other general feature that makes interpretation difficult is the occurrence of spurious bumps and bends in the curve estimates, as we shall see in Example 7.10.

Resampling methods

Two types of applications of nonparametric curves are use in checking a parametric curve, and use in setting confidence limits for μ(x) or prediction limits for Y = μ(x) + ε at some values of x. The first type is quite straightforward, because data would be simulated from the fitted parametric model: Example 7.11 illustrates this. Here we look briefly at confidence limits and prediction limits, where the nonparametric curve is the only "model".

The basic difficulty for resampling here is similar to that with density estimation, illustrated in Example 5.13, namely bias. Suppose that we want to calculate a confidence interval for μ(x) at one or more values of x. Case resampling cannot be used with standard recommendations for nonparametric regression, because the resampling bias of μ̂*(x) will be smaller than that of μ̂(x). This could probably be corrected, as with density estimation, by using a larger bandwidth or equivalent tuning constant. But simpler, at least in principle, is to apply the idea of model-based resampling discussed in Chapter 6.

The naive extension of model-based resampling would generate responses y*_j = μ̂(x_j) + ε*_j, where μ̂(x_j) is the fitted value from some nonparametric regression method, and ε*_j is sampled from appropriately modified versions of the residuals y_j − μ̂(x_j). Unfortunately the inherent bias of most nonparametric regression methods distorts both the fitted values and the residuals, and thence biases the resampling scheme. One recommended strategy is to use as simulation model a curve that is oversmoothed relative to the usual estimate. For definiteness, suppose that we are using a kernel method or a local smoothing method with tuning constant b, and that we use cross-validation to determine the best value of b. Then for the simulation model we use the corresponding curve with, say, 2b as the tuning constant. To try to eliminate bias from the simulation errors ε*_j, we use residuals from an undersmoothed curve, say with tuning constant b/2. As with linear regression, it is appropriate to use modified residuals, where leverage is taken into account as in (6.9). This is possible for most nonparametric regression methods, since they are linear. Detailed asymptotic theory shows that something along these lines is necessary to make resampling work, but there is no clear guidance as to precise relative values for the tuning constants.
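The strategy can be summarized in a short sketch reusing kernel_smooth above; the factors 2b and b/2 follow the suggestion in the text, while the leverage modification and any stratification are omitted here.

    def smooth_resample(x, y, b, R, rng):
        # simulate responses from an oversmoothed curve, with errors taken
        # from mean-adjusted residuals of an undersmoothed curve
        mu_over = kernel_smooth(x, x, y, 2 * b)
        res = y - kernel_smooth(x, x, y, b / 2)
        res = res - res.mean()
        return [mu_over + rng.choice(res, size=len(y)) for _ in range(R)]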

Example 7.10 (Motorcycle impact data) The response y here is acceleration measured x milliseconds after impact in an accident simulation experiment. The full data were shown in Figure 7.14, but for computational reasons we eliminate replicates for the present analysis, which leaves n = 94 cases with distinct x values. The solid line in the top left panel of Figure 7.15 shows a cubic spline fit for the data of Figure 7.14, chosen by cross-validation and having approximately 12 degrees of freedom. The top right panel of the figure gives the plot of modified residuals against x for this fit. Note the heteroscedasticity, which broadly corresponds to the three strata separated by the vertical dotted lines. The estimated variances for these strata are approximately 4, 600 and 140. Reciprocals of these were used as weights for the spline fit in the left panel. Bias in these residuals is evident at times 10–15 ms, where the residuals are first mostly negative and then positive because the curve does not follow the data closely enough.

There is a rough correspondence between kernel smoothing and spline smoothing, and this, together with the previous discussion, suggests that for model-based resampling we use y*_j = μ̃(x_j) + ε*_j, where μ̃ is the spline fit obtained by doubling the cross-validation choice of λ. This fit is the dotted line in the top left panel of Figure 7.15. The random errors ε*_j are sampled from the modified residuals for another spline fit in which λ is half the cross-validation value. The lower right panel of the figure displays these residuals, which show less bias than those for the original fit, though perhaps a smaller bandwidth would be better still. The sampling is stratified, to reflect the very strong heteroscedasticity.

We simulated R = 999 datasets in this way, and to each fitted the spline curve μ̂*(x), with the bandwidth chosen by cross-validation each time. We then calculated 90% confidence intervals at six values of x, using the basic bootstrap method modified to equate the distributions of μ̂*(x) − μ̃(x) and μ̂(x) − μ(x). For example, at x = 20 the estimates μ̂ and μ̃ are respectively −110.8 and −106.2, and the 950th ordered value of μ̂* is −87.2, so that the upper confidence limit is −110.8 − {−87.2 − (−106.2)} = −129.8. The resulting confidence intervals are shown in the bottom left panel of Figure 7.15, together with the original fit. Note how the confidence limits are centred on the convex side of the fitted curve in order to account for its bias; this is most evident at x = 20. ■

Figure 7.15 Bootstrap analysis of motorcycle data, without replicate responses. Top left: data, original cubic spline fit (solid) and oversmoothed fit (dots). Top right: residuals from original fit; note their bias at times 10–15 ms. Bottom right: residuals from undersmoothed fit. The lines in these plots show strata used in the resampling. Bottom left: original fit and 90% basic bootstrap confidence intervals at six values of x; they are not centred on the fitted curve.

7.6.2 Generalized additive models

The structural part of a generalized linear model, as outlined in Section 7.2.1, is the linear predictor η = xᵀβ, which is additive in the components x_i of x. It may not always be the case that we know whether x_i or some transformation of it should be used in the linear predictor. Then it makes sense, at least for exploratory purposes, to include in η a nonparametric curve component s_i(x_i) for each component x_i (except those corresponding to qualitative factors). This still assumes additivity of the effects of the x_i on the linear predictor scale.


The result is the generalized additive model

    g{μ(x)} = η(x) = Σ_{i=1}^p s_i(x_i),        (7.25)

where g(·) is a known link function, as before. As for a generalized linear model, the model specification is completed by a variance function, var(Y) = κV(μ).

In practice we might force some terms s_i(x_i) in (7.25) to be linear, depending upon what is known about the application. Each nonparametric term is typically fitted as a linear term plus a nonlinear term, the latter using smoothing splines or a local smoother. This means that the corresponding generalized linear model is a sub-model, so that the effects of nonlinearity can be assessed using differences of residual deviances, suitably scaled, as in (7.8). In standard computer implementations each nonparametric curve s_i(x_i) has (approximately) three degrees of freedom for nonlinearity. Standard distributional approximations for the resulting test statistics are sometimes quite unreliable, so that resampling methods are particularly helpful in this context. For tests of this sort the null model for resampling is the generalized linear model, and the approach taken can be summarized by the following algorithm.

Algorithm 7.4 (Comparison of generalized linear and generalized additive models)

For r = 1, …, R,

1  fix the covariate values at those observed;
2  generate bootstrap responses y*₁, …, y*ₙ by resampling from the fitted generalized linear null model;
3  fit the generalized linear model to the bootstrap data and calculate the residual deviance d*₀ᵣ;
4  fit the generalized additive model to the bootstrap data, calculate the residual deviance d*ᵣ and dispersion κ̂*ᵣ; then
5  calculate t*ᵣ = (d*₀ᵣ − d*ᵣ)/κ̂*ᵣ.

Finally, calculate the P-value as [1 + #{t*ᵣ ≥ t}]/(R + 1), where t = (d₀ − d)/κ̂ is the scaled difference of deviances for the original data. •
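A skeleton of Algorithm 7.4 in Python, with the model fitting and null simulation left as user-supplied functions, is shown below; fit_glm and fit_gam are assumed to return a (residual deviance, dispersion) pair, and all names are ours.

    def gam_vs_glm_pvalue(y, fit_glm, fit_gam, simulate_null, R, rng):
        # covariates are held fixed inside fit_glm and fit_gam (step 1)
        d0, _ = fit_glm(y)
        d1, kappa = fit_gam(y)
        t = (d0 - d1) / kappa                    # observed test statistic
        count = 0
        for _ in range(R):
            y_star = simulate_null(rng)          # step 2
            d0s, _ = fit_glm(y_star)             # step 3
            d1s, ks = fit_gam(y_star)            # step 4
            if (d0s - d1s) / ks >= t:            # step 5
                count += 1
        return (1 + count) / (R + 1)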

The following example illustrates the use of nonparametric curve fits in model-checking.

Example 7.11 (Leukaemia data) For the data in Example 7.1, we originally fitted a generalized linear model with gamma variance function and linear predictor group + x with logarithmic link, where group is a factor with two levels. The fitted mean function for that model is shown as two solid curves in Figure 7.16, the upper curve corresponding to Group 1. Here we consider


Figure 7.16 Generalized linear model fits (solid) and generalized additive model fits (dashed) for leukaemia data of Example 7.1, plotted against log10 white blood cell count.

whether or not the effect o f x is linear. To do this, we com pare the original fit to that o f the generalized additive model in which x is replaced by s(x), which is a sm oothing spline with three degrees o f freedom. The link and variance functions are unchanged. The fitted m ean function for this model is shown as dashed curves in the figure.

Is the smooth curve a significantly better fit? To answer this we use the test statistic Q defined in (7.8), where here D corresponds to the residual deviance for the generalized additive model, κ̂ is the dispersion for that model, and D₀ is the residual deviance for the smaller generalized linear model. For these data D₀ = 40.32 with 30 degrees of freedom, D = 30.75 with 27 degrees of freedom, and κ̂ = 0.725, so that q = (40.32 − 30.75)/0.725 = 13.2. The standard approximation for the null distribution of Q is chi-squared with degrees of freedom equal to the difference in model dimensions, here p − p₀ = 3, so the approximate P-value is 0.004. Alternatively, to allow for estimation of the dispersion, (p − p₀)⁻¹Q is compared to the F distribution with denominator degrees of freedom n − p − 1, here 27, and this gives approximate P-value 0.012. It looks as though there is strong evidence against the simpler, loglinear model. However, the accuracies of the approximations used here are somewhat questionable, so it makes sense to apply the resampling analysis.

To calculate a bootstrap P-value corresponding to q = 13.2, we simulate the distribution of Q under the fitted null model, that is, the original generalized linear model fit, but with nonparametric resampling. The particular resampling scheme we choose here uses the linear predictor residuals r_Lj defined in (7.10), one advantage of which is that positive simulated responses are guaranteed. The residuals in this case are

    r_Lj = {log(y_j) − log(μ̂₀_j)} / {κ̂₀^{1/2} (1 − h₀_j)^{1/2}},


Figure 7.17 Chi-squared Q-Q plot of standardized deviance differences q* for comparing generalized linear and generalized additive model fits to the leukaemia data. The lines show the theoretical χ²₃ approximation (dashes) and the F approximation (dots). Resampling uses Pearson residuals on the linear predictor scale, with R = 999.

where h₀_j, μ̂₀_j and κ̂₀ are the leverage, fitted value and dispersion estimate for the null (generalized linear) model. These residuals appear quite homogeneous, so no stratification is used. Thus step 2 of Algorithm 7.4 consists of sampling ε*₁, …, ε*ₙ randomly with replacement from r_L1, …, r_Ln (without mean correction), and then generating responses y*_j = μ̂₀_j exp(κ̂₀^{1/2} ε*_j) for j = 1, …, n, as sketched below.
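In outline (our own sketch, with mu0, rL and kappa0 denoting μ̂₀_j, r_Lj and κ̂₀):

    import numpy as np

    def resample_null_gamma(mu0, rL, kappa0, rng):
        # step 2 of Algorithm 7.4 for this example: resample linear predictor
        # residuals without mean correction, then exponentiate back
        eps = rng.choice(rL, size=len(mu0))
        return mu0 * np.exp(np.sqrt(kappa0) * eps)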
Applying this algorithm with R = 999 for our data gives the P-value 0.035, larger than the theoretical approximations, but still suggesting that the linear term in x is not sufficient. The bootstrap null distribution of q* deviates markedly from the standard χ²₃ approximation, as the Q-Q plot in Figure 7.17 shows. The F approximation is also inaccurate.

A jackknife-after-bootstrap plot reveals that quantiles of q* are moderately sensitive to case 2, but without this case the P-value is virtually unchanged.

Very similar results are obtained under parametric resampling with the exponential model, as might be expected from the original data analysis. ■

Our next example illustrates the use of semiparametric regression in prediction.

Example 7.12 (AIDS diagnoses) In Example 7.4 we discussed prediction of AIDS diagnoses based on the data in Table 7.4. A smooth time trend seems preferable to fitting a separate parameter for each diagnosis period, and accordingly we consider a model where the mean number of diagnoses in period j reported with delay k, the mean for the (j, k) cell of the table, equals

    μ_jk = exp{α(j) + β_k}.

We take α(j) to be a locally quadratic lowess smooth with bandwidth 0.5.


The delay distribution is so sharply peaked here that although we could take a smooth function in the delay time, it is equally parsimonious to take 15 separate parameters β_k. We use the same variance function as in Example 7.4, which assumes that the observed counts y_jk are overdispersed Poisson with means μ_jk, and we fit the model as a generalized additive model. The residual deviance is 751.7 on 444.2 degrees of freedom, increased from 716.5 and 413 in the previous fit. The curve shown in the left panel of Figure 7.18 fits well, and is much more plausible as a model for underlying trend than the curve in Figure 7.5. The panel also shows the predicted values from this curve, which of course are heavily affected by the observed diagnoses in Table 7.4.

As mentioned above, in resampling from fitted curves it is important to take residuals from an undersmoothed curve, in order to avoid bias, and to add them to an oversmoothed curve. We take Pearson residuals (y − μ̂)/μ̂^{1/2} from a similar curve with bandwidth 0.3, and add them to a curve with bandwidth 0.7. These fits have deviances 745.3 on 439.2 degrees of freedom and 754.1 on 446.1 degrees of freedom. Both of these curves are shown in Figure 7.18. Leverage adjustment is awkward for generalized additive models, but the large number of degrees of freedom here makes such adjustments unnecessary. We modify resampling scheme (7.12), and repeat the calculations as for Algorithm 7.1 applied to Example 7.4, with R = 999.

Table 7.11 shows the resulting prediction intervals for the last quarters of 1990, 1991, and 1992. The intervals for 1992 are substantially shorter than those in Table 7.5, because of the different model. The generalized additive model is based on an underlying smooth trend in diagnoses, so predictions for the last few rows of the table depend less critically on the values observed

Figure 7.18 Generalized additive model prediction of UK AIDS diagnoses. The left panel shows the fitted curve with bandwidth 0.5 (smooth solid line), the predicted diagnoses from this fit (jagged dashed line), and the fitted curves with bandwidths 0.7 (dots) and 0.3 (dashes), together with the observed totals (+). The right panel shows the predicted quarterly diagnoses for 1989–92 (central solid line), and pointwise 95% prediction limits from the Poisson bootstrap (solid), negative binomial bootstrap (dashes), and nonparametric bootstrap without (dots) and with (dot-dash) stratification.


Table 7.11 Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992, using the generalized additive model.

                                1990        1991        1992
    Poisson                   295  314    302  336    415  532
    Negative binomial         293  317    298  339    407  547
    Nonparametric             294  316    296  337    397  545
    Stratified nonparametric  293  315    295  338    394  542

in those rows. This contrasts with the Poisson two-way layout model, for which the predictions depend completely on single rows of the table and are much more variable. Compare the slight forecast drop in Figure 7.6 with the predicted increase in Figure 7.18.

The dotted lines in Figure 7.18 show pointwise 95% prediction bands for the AIDS diagnoses. The prediction intervals for the negative binomial and nonparametric schemes are similar, although the effect of stratification is smaller. Stratification has no effect on the deviances. The negative binomial deviances are typically about 90 larger than those generated under the nonparametric scheme.

The plausibility of the smooth underlying curve and its usefulness for prediction is of course central to the approach outlined here. ■

7.6.3 Other methods

Often a nonparametric regression fit will be compared to a parametric fit, but not all applications are of this kind. For example, we may want to see whether or not a regression curve is monotone without specifying its form. The following application is of this kind.

Example 7.13 (Downs syndrome) Table 7.12 contains a set of data on incidence of Downs syndrome babies for mothers in various age ranges. Mean age is the approximate mean age of the m mothers whose babies included y babies with Downs syndrome. These data are plotted on the logistic scale in Figure 7.19, together with a generalized additive spline fit as an exploratory aid in modelling the incidence rate.

What we notice about the curve is that it decreases with age for young mothers, contrary to intuition and expert belief. A similar phenomenon occurs for other datasets. We want to see if this dip is real, as opposed to a statistical artefact. So a null model is required under which the rate of occurrence is increasing with age. Linear logistic regression is clearly inappropriate, and most other standard models give non-increasing rates. The approach taken is isotonic regression, in which the rates are fitted nonparametrically subject to their increasing with age. Further, in order to make the null model a special


Table 7.12 Number y of Downs syndrome babies in m births for mothers with age groups centred on x years (Geyer, 1991).

    x    17.0   18.5   19.5   20.5   21.5   22.5   23.5   24.5   25.5   26.5
    m   13555  13675  18752  22005  23896  24667  24807  23986  22860  21450
    y      16     15     16     22     16     12     17     22     15     14

    x    27.5   28.5   29.5   30.5   31.5   32.5   33.5   34.5   35.5   36.5
    m   19202  17450  15685  13954  11987  10983   9825   8483   7448   6628
    y      27     14      9     12     12     18     13     11     23     13

    x    37.5   38.5   39.5   40.5   41.5   42.5   43.5   44.5   45.5   47.0
    m    5780   4834   3961   2952   2276   1589   1018    596    327    249
    y      17     15     30     31     33     20     16     22     11      7


case of the general model, the latter is taken to be an arbitrary convex curve for the logit of the incidence rate.

If the incidence rate at age x_i is π(x_i) with logit{π(x_i)} = η(x_i) = η_i, say, for i = 1, …, k, then the binomial log likelihood is

    ℓ(η₁, …, η_k) = Σ_{i=1}^k [y_i η_i − m_i log{1 + exp(η_i)}].

A convex model is one in which

    η_i ≤ {(x_{i+1} − x_i)/(x_{i+1} − x_{i−1})} η_{i−1} + {(x_i − x_{i−1})/(x_{i+1} − x_{i−1})} η_{i+1},        i = 2, …, k − 1.

The general model fit will maximize the binomial log likelihood subject to these constraints, giving estimates η̂₁, …, η̂_k. The null model satisfies the constraints η_i ≤ η_{i+1} for i = 1, …, k − 1, which are equivalent to the previous convexity

Figure 7.19 Logistic scale plot of Downs syndrome incidence rates against mean age x. Solid curve is generalized additive spline fit with 3 degrees of freedom.


Figure 7.20 Logistic scale plot of incidence rates for Downs syndrome data, with convex fit (solid line) and isotonic fit (dotted line), against mean age x.

constraints plus the single constraint η₁ ≤ η₂. The null fit essentially pools adjacent age groups for which the general estimates η̂_i violate the monotonicity of the null model. If the null estimates are denoted by η̂₀₁, …, η̂₀_k, then we take as our test statistic the deviance difference

    T = 2{ℓ(η̂₁, …, η̂_k) − ℓ(η̂₀₁, …, η̂₀_k)}.

The difficulty now is that the standard chi-squared approximation for deviance differences does not apply, essentially because there is not a fixed value for the degrees of freedom. There is a complicated large-sample approximation which may well not be reliable. So a parametric bootstrap is used to calculate the P-value. This requires simulation from the binomial model with sample sizes m_i, covariate values x_i and logits η̂₀,i.
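This parametric bootstrap might be sketched as follows; fit_convex and fit_isotone stand for constrained maximizers of the log likelihood above, which we do not implement here, and all names are ours.

    import numpy as np

    def loglik(y, m, eta):
        # binomial log likelihood in the logits eta
        return np.sum(y * eta - m * np.log(1 + np.exp(eta)))

    def downs_pvalue(y, m, eta_hat, eta0_hat, fit_convex, fit_isotone, R, rng):
        t = 2 * (loglik(y, m, eta_hat) - loglik(y, m, eta0_hat))
        p0 = 1 / (1 + np.exp(-eta0_hat))      # null (isotonic) incidence rates
        count = 0
        for _ in range(R):
            y_star = rng.binomial(m, p0)      # simulate from the isotone fit
            t_star = 2 * (loglik(y_star, m, fit_convex(y_star))
                          - loglik(y_star, m, fit_isotone(y_star)))
            if t_star >= t:
                count += 1
        return (1 + count) / (R + 1)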

Figure 7.20 shows the convex and isotone regression fits, which clearly differ for ages below 30. The deviance difference for these fits is t = 5.873. Simulation of R = 999 binomial datasets from the isotone model gave 33 values of t* in excess of 5.873, so the P-value is 0.034 and we conclude that the dip in incidence rate may be real. (Further analysis with additional data does not support this conclusion.) Figure 7.21 is a histogram of the t* values.

It is possible that the null distribution of T is unstable with respect to parameter values, in which case the nested bootstrap procedure of Section 4.5 should be used, possibly in conjunction with the recycling method of Section 9.4.4 to accelerate the computation. ■



7.7 Bibliographic Notes

A full treatment of all aspects of generalized linear models is given by McCullagh and Nelder (1989). Dobson (1990) is a more elementary discussion, while Firth (1991) gives a useful shorter account. Davison and Snell (1991) describe methods of checking such models. Books by Chambers and Hastie (1992) and Venables and Ripley (1994) cover most of the basic methods discussed in this chapter, but restricted to implementations in S and S-Plus.

Published discussions of bootstrap methods for generalized linear models are usually limited to one-step iterations from the model fit, with resampling of Pearson residuals; see, for example, Moulton and Zeger (1991). There appears to be no systematic study of the various schemes described in Section 7.2.3. Nelder and Pregibon (1987) briefly discuss a more general application. Moulton and Zeger (1989) discuss bootstrap analysis of repeated measures data, while Booth (1996) describes methods for use when there is nested variation.

Books giving general accounts of survival data are mentioned in Section 3.12. Hjort (1985) describes model-based resampling methods for proportional hazards regression, and studies their theoretical properties such as confidence interval accuracy. Burr and Doss (1993) outline how the double bootstrap can be used to provide confidence bands for a median survival time, and compare its performance with simulated bands based on asymptotic results. Lo and Singh (1986) and Horvath and Yandell (1987) make theoretical contributions to bootstrapping survival data. Bootstrap and permutation tests for comparison of survivor functions are discussed by Heller and Venkatraman (1996).

Figure 7.21 Histogram of 999 resampled deviance test statistics for the Downs syndrome data. The unshaded portion corresponds to values exceeding the observed test statistic t = 5.873.

Burr (1994) studies empirically various bootstrap confidence interval methods for the proportional hazards model. She finds no overall best combination, but concludes that normal-theory asymptotic confidence intervals and basic bootstrap intervals are generally good for regression parameters β, while percentile intervals are satisfactory for survival probabilities derived from the product-limit estimate. Results from the conditional bootstrap are more erratic than those for resampling cases or from model-based resampling, and the latter is generally preferred.

Altman and Andersen (1989), Chen and George (1985) and Sauerbrei and Schumacher (1992) apply case resampling to variable selection in survival data models, but there seems to be little theoretical justification for this. The use of bootstrap methods in general assessment of model uncertainty in regression is discussed by Faraway (1992).

Bootstrap methods for general nonlinear regression models are usually studied theoretically via linear approximation. See Huet, Jolivet and Messean (1990) for some simulation results. There appears to be no literature on incorporating curvature effects into model-based resampling. The behaviour of residuals, leverages and diagnostics for nonlinear regression models is developed by Cook, Tsai and Wei (1986) and St. Laurent and Cook (1993).

The large literature on prediction error as related to discrimination is surveyed by McLachlan (1992). References for bootstrap estimation of prediction error are mentioned in Section 6.6. Those dealing particularly with misclassification error include Efron (1983) and Efron and Tibshirani (1997). Gong (1983) discusses a particular case where the prediction rule is based on a logistic regression model obtained by forward selection.

References to bootstrap methods for model selection are mentioned in Section 6.6. The treatment by Shao (1996) covers both generalized linear models and nonlinear models.

There are now numerous accounts of nonparametric regression, such as Hastie and Tibshirani (1990) on generalized additive models, and Green and Silverman (1994) on penalized likelihood methods. A useful treatment of local weighted regression by Hastie and Loader (1993) is followed by a discussion of the relative merits of various kernel-type estimators. Venables and Ripley (1994) discuss implementation in S-Plus with examples; see also Chambers and Hastie (1992). Considerable theoretical work has been done on bootstrap methods for setting confidence bands on nonparametric regression curves, mostly focusing on kernel estimators. Härdle and Bowman (1988) and Härdle and Marron (1991) both emphasize the need for different levels of smoothing in the components of model-based resampling schemes. Hall (1992b) gives a detailed theoretical assessment of the properties of such confidence band methods, and emphasizes the benefits of the studentized bootstrap. There appears to be no corresponding treatment for spline smoothing methods, nor for the many complex methods now used for fitting surfaces to model the effects of multiple covariates.


A summary of much of the theory for resampling in nonlinear and nonparametric regression is given in Chapter 8 of Shao and Tu (1995).

7.8 Problems

1  The estimator β̂ in a generalized linear model may be defined as the solution to the theoretical counterpart of (7.2), namely

    ∫ x(y − μ) / {cV(μ) ∂η/∂μ} dF(x, c, y) = 0,

where μ is regarded as a function of β through the link function g(μ) = η = xᵀβ. Use the result of Problem 2.12 to show that the empirical influence value for β̂ based on data (x₁, c₁, y₁), …, (xₙ, cₙ, yₙ) is

    l_j = n(XᵀWX)⁻¹ x_j (y_j − μ_j) / {c_j V(μ_j) ∂η_j/∂μ_j},

evaluated at the fitted model, where W is the diagonal matrix with elements given by (7.3).
Hence show that the approximate variance matrix for β̂* under case resampling in a generalized linear model is

    κ̂(XᵀWX)⁻¹ XᵀWSX (XᵀWX)⁻¹,

where S = diag(r²_P1, …, r²_Pn) with the r_Pj standardized Pearson residuals (7.9). Show that for the linear model this yields the modified version of the robust variance matrix (6.26).
(Section 7.2.2; Moulton and Zeger, 1991)

2 For the gamma model of Examples 7.1 and 7.2, verify that var(Y) = κμ² and that the log likelihood contribution from a single observation is

$$\ell(\mu) = -\kappa^{-1}\{\log\mu + y/\mu\}.$$

Show that the unstandardized Pearson and deviance residuals are respectively κ^{-1/2}(z − 1) and sign(z − 1)[2κ^{-1}{z − 1 − log(z)}]^{1/2}, where z = y/μ. If the regression is loglinear, meaning that the log link is used, verify that the unstandardized linear predictor residuals are simply κ^{-1/2} log(z).
What are the possible ranges of the standardized residuals r_P, r_L and r_D? Calculate these for the model fitted in Example 7.2.
If the deviance residual is expressed as d(y, μ), check that d(y, μ) = d(z, 1). Hence show that the resampling scheme based on standardized deviance residuals can be expressed as y*_j = μ̂_j z*_j, where z*_j is defined by d(z*_j, 1) = ε* with ε* randomly sampled from r_{D1}, …, r_{Dn}. What further simplification can be made?
(Sections 7.2.2, 7.2.3)

3 The figure below shows the fit to data pairs (x₁,y₁), …, (xₙ,yₙ) of a binary logistic model

$$\Pr(Y=1) = 1 - \Pr(Y=0) = \frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}.$$


(a) Under case resampling, show that the maximum likelihood estimate for a bootstrap sample is infinite with probability close to e⁻². What effect has this on the different types of bootstrap confidence intervals for β₁?
(b) Bias-corrected maximum likelihood estimates are obtained by modifying response values (0, 1) to (hⱼ/2, 1 + hⱼ), where hⱼ is the jth leverage for the model fit to the original data. Do infinite parameter estimates arise when bootstrapping cases from the modified data?
(Section 7.2.3; Firth, 1993; Moulton and Zeger, 1991)

4 Investigate whether resampling schemes given by (7.12), (7.13), and (7.14) yield Algorithm 6.1 for bootstrapping the linear model.

5 Suppose that conditional on Π = π, Y has a binomial distribution with probability π and denominator m, and that Π has a beta density

$$f(\pi \mid \alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}, \qquad 0<\pi<1, \quad \alpha,\beta>0.$$

Show that Y has unconditional mean and variance (7.15) and express π and φ in terms of α and β. Express α and β in terms of π and φ, and hence explain how to generate data with mean and variance (7.15) by generating π from a beta distribution, and then, conditional on the probabilities, generating binomial variables with probabilities π and denominators m.
How should your algorithm be amended to generate beta-binomial data with variance function φΠ(1 − Π)?
(Example 7.3)
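As a hedged illustration of the generation algorithm just described, the sketch below assumes the parametrization φ = 1/(α + β + 1), so that α = π(1 − φ)/φ and β = (1 − π)(1 − φ)/φ; the function name rbetabin and the parameter values are hypothetical.

rbetabin <- function(n, m, pi, phi)
{ # beta parameters implied by mean pi and dispersion phi = 1/(alpha+beta+1)
  alpha <- pi*(1-phi)/phi
  beta <- (1-pi)*(1-phi)/phi
  p <- rbeta(n, alpha, beta)   # probabilities generated from the beta density
  rbinom(n, m, p) }            # binomial counts, conditional on the probabilities
y <- rbetabin(100, m=20, pi=0.3, phi=0.1)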

6 For generalized linear models the analogue of the case-deletion result in Problem 6.2 is

$$\hat\beta_{-j} \doteq \hat\beta - \frac{(X^T W X)^{-1} x_j w_j (y_j-\mu_j)\,\partial\eta_j/\partial\mu_j}{1-h_j}.$$

(a) Use this to show that when the jth case is deleted the predicted value for yⱼ is


(b) Use (a) to give an approximation for the leave-one-out cross-validation estimate of prediction error for a binary logistic regression with cost (7.23).
(Sections 6.4.1, 7.2.2)

7.9 Practicals

1 Dataframe remission contains data from Freeman (1987) concerning a measure of cancer activity, the LI values, for 27 cancer patients, of whom 9 went into remission. Remission is indicated by the binary variable r = 1. Consider testing the hypothesis that the LI values do not affect the probability of remission. First, fit a binary logistic model to the data, plot them, and perform a permutation test:

attach(remission)
plot(LI+0.03*rnorm(27), r, pch=1, xlab="LI, jittered", xlim=c(0,2.5))
rem.glm <- glm(r~LI, binomial, data=remission)
summary(rem.glm)
x <- seq(0.4, 2.0, 0.02)
eta <- cbind(rep(1,81), x) %*% coefficients(rem.glm)
lines(x, inv.logit(eta), lty=2)
rem.perm <- function(data, i)
{ d <- data
  d$LI <- d$LI[i]
  d.glm <- glm(r~LI, binomial, data=d)
  coefficients(d.glm) }
rem.boot <- boot(remission, rem.perm, R=199, sim="permutation")
qqnorm(rem.boot$t[,2], ylab="Coefficient of LI", ylim=c(-3,3))
abline(h=rem.boot$t0[2], lty=2)

Compare this significance level with that from using a normal approximation for the coefficient of LI in the fitted model. Construct bootstrap tests of the hypothesis by extending the methods outlined in Section 6.2.5.
(Freeman, 1987; Hall and Wilson, 1991)

2 Dataframe breslow contains data from Breslow (1985) on death rates from heart disease among British male doctors. A standard model is that the numbers of deaths y have a Poisson distribution with mean nλ, where n is the number of person-years and λ is the death rate. The focus of interest is how death rate depends on two explanatory variables, a factor representing the age group and an indicator of smoking status, x. Two competing models are

λ = exp(α_age + βx),   λ = α_age + βx;

these are respectively multiplicative and additive. To fit these models we proceed as follows:

breslow.mult <- glm(y~offset(log(n))+age+smoke, poisson(log),
                    data=breslow)
breslow.add <- glm(y~n:age+ns-1, poisson(identity), data=breslow)

Here ns is a variable for the effect of smoking, constructed to allow for the difficulty in applying an offset in fitting the additive model. The deviances of the fitted models are D_add = 7.43 and D_mult = 12.13. Although it appears that the additive model is the better fit, these models are not nested, so a chi-squared approximation cannot be applied to the difference of deviances. For bootstrap


assessment of fit based on the difference of deviances, we simulate in turn from each fitted model. Because fits of the additive model fail if there are no deaths in the lowest age group, and this happens with appreciable probability, we constrain the simulation so that there are deaths at each age.

breslow.fun <- function(data)
{ mult <- glm(y~offset(log(n))+age+smoke, poisson(log), data=data)
  add <- glm(y~n:age+ns-1, poisson(identity), data=data)
  deviance(mult)-deviance(add) }
breslow.sim <- function(data, mle)
{ data$y <- rpois(nrow(data), mle)
  while(min(data$y)==0) data$y <- rpois(nrow(data), mle)
  data }
add.mle <- fitted(breslow.add)
add.boot <- boot(breslow, breslow.fun, R=99, sim="parametric",
                 ran.gen=breslow.sim, mle=add.mle)
mult.mle <- fitted(breslow.mult)
mult.boot <- boot(breslow, breslow.fun, R=99, sim="parametric",
                  ran.gen=breslow.sim, mle=mult.mle)
boxplot(mult.boot$t, add.boot$t, ylab="Deviance difference",
        names=c("multiplicative","additive"))
abline(h=mult.boot$t0, lty=2)

What does this tell you about the relative fit of the models?
A different strategy would be to use parametric simulation, simulating not from the fitted models, but from the model with separate Poisson distributions for each of the original data. Discuss critically this approach.
(Section 7.2; Example 4.5; Wahrendorf, Becher and Brown, 1987; Hall and Wilson, 1991)

3 Dataframe hirose contains the PET reliability data of Table 7.6. Initially we consider estimating the bias and variance of the MLEs of the parameters β₀, …, β₄ and x₀ discussed in Example 7.5, using parametric simulation from the fitted Weibull model, but assuming that the data were subject to censoring at the fixed time 9104.25. Functions to calculate the minus log likelihood (in a parametrization where x₀ = 5 − exp(ψ), as in the code below) and to find the MLEs are:

hirose.lik <- function(mle, data)
{ x0 <- 5-exp(mle[5])
  lambda <- exp(mle[1]+mle[2]*(-log(data$volt-x0)))
  beta <- exp(mle[3]+mle[4]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun <- function(data, start)
{ d <- nlminb(start, hirose.lik, data=data)
  conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }

The MLEs for the original data can be obtained by setting hirose.start <- c(6,2,4,1,1) (obtained by introspection), and then iterating the following lines

hirose.start <- hirose.fun(hirose, start=hirose.start)[3:7]
hirose.start

a few times. New data are generated by


hirose.gen <- function(data, mle)
{ x0 <- 5-exp(mle[5])
  xl <- -log(data$volt-x0)
  xb <- -log(data$volt)
  lambda <- exp(mle[1]+mle[2]*xl)
  beta <- exp(mle[3]+mle[4]*xb)
  y <- rweibull(nrow(data), shape=beta, scale=lambda)
  data$cens <- ifelse(y<=9104.25, 1, 0)
  data$time <- ifelse(data$cens==1, y, 9104.25)
  data }

and the bootstrap results are obtained by

hirose.mle <- hirose.start
hirose.boot <- boot(hirose, hirose.fun, R=19, sim="parametric",
                    ran.gen=hirose.gen, mle=hirose.mle,
                    start=hirose.start)
hirose.boot$t[,7] <- 5-exp(hirose.boot$t[,7])
hirose.boot$t0[7] <- 5-exp(hirose.boot$t0[7])
hirose.boot

Try this with a larger value of R — but don't hold your breath.
For a full likelihood analysis for the parameter θ, the log likelihood must be maximized over β₁, …, β₄ for a given value of θ. A little thought shows that the necessary code is

beta0 <- function(theta, mle)
{ x49 <- -log(4.9-(5-exp(mle[4])))
  x <- -log(4.9)
  log(theta*10^3) - mle[1]*x49 - lgamma(1+exp(-mle[2]-mle[3]*x)) }
hirose.lik2 <- function(mle, data, theta)
{ x0 <- 5-exp(mle[4])
  lambda <- exp(beta0(theta,mle)+mle[1]*(-log(data$volt-x0)))
  beta <- exp(mle[2]+mle[3]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun2 <- function(data, start, theta)
{ d <- nlminb(start, hirose.lik2, data=data, theta=theta)
  conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }
hirose.f <- function(data, start, theta)
  c(hirose.fun(data, start),
    hirose.fun2(data, start[-1], theta))

so that hirose.f does likelihood fits when θ is fixed and when it is not. The quantiles of the simulated likelihood ratio statistic are then obtained by

make.theta <- function(mle, x=hirose$volt)
{ x0 <- 5-exp(mle[5])
  lambda <- exp(mle[1]-mle[2]*log(x-x0))/10^3
  beta <- exp(mle[3]-mle[4]*log(x))
  lambda*gamma(1+1/beta) }
theta <- make.theta(hirose.mle, 4.9)
hirose.boot <- boot(hirose, hirose.f, R=19, sim="parametric",
                    ran.gen=hirose.gen, mle=hirose.mle,
                    start=hirose.start, theta=theta)


R <- hirose.boot$R
i <- c(1:R)[(hirose.boot$t[,1]==1)&(hirose.boot$t[,8]==1)]
w <- 2*(hirose.boot$t[i,9]-hirose.boot$t[i,2])
qqplot(qchisq(c(1:length(w))/(1+length(w)), 1), w)
abline(0, 1, lty=2)

Again, try this with a larger R. Can you see how the code would be modified for nonparametric simulation?
(Section 7.3; Hirose, 1993)

4 Dataframe nodal contains data on 53 patients with prostate cancer. For each patient there are five explanatory variables, each with two levels. These are aged (<60, >60); stage, a measure of the seriousness of the tumour; grade, a measure of the pathology of the tumour; xray, a measure of the seriousness of an x-ray; and acid, the level of serum acid phosphatase. The higher level of each of the last four variables indicates a more severe condition. The response r indicates whether the cancer has spread to the neighbouring lymph nodes. The data were collected to see whether nodal involvement can be predicted from the explanatory variables. Analysis of deviance for a binary logistic regression model suggests that the response depends only on stage, xray and acid, and we base our predictions on the model with these variables. Our measure of error is the average number of misclassifications n⁻¹ Σⱼ c(yⱼ, μ̂ⱼ), where c(y, μ̂) is given by (7.23).
For an initial model, apparent error, and ordinary and K-fold cross-validation estimates of prediction error:

attach(nodal)
cost <- function(r, pi=0) mean(abs(r-pi)>0.5)
nodal.glm <- glm(r~stage+xray+acid, binomial, data=nodal)
nodal.diag <- glm.diag(nodal.glm)
app.err <- cost(r, fitted(nodal.glm))
cv.err <- cv.glm(nodal, nodal.glm, cost, K=53)$delta
cv.11.err <- cv.glm(nodal, nodal.glm, cost, K=11)$delta

For resampling-based estimates and plot for 0.632 errors:

nodal.pred.fun <- function(data, i, model)
{ d <- data[i,]
  d.glm <- update(model, data=d)
  pred <- predict(d.glm, data, type="response")
  D.F.Fhat <- cost(data$r, pred)
  D.Fhat.Fhat <- cost(d$r, fitted(d.glm))
  c(data$r-pred, D.F.Fhat - D.Fhat.Fhat) }
nodal.boot <- boot(nodal, nodal.pred.fun, R=200, model=nodal.glm)
nodal.boot$f <- boot.array(nodal.boot)
n <- nrow(nodal)
err.boot <- mean(nodal.boot$t[,n+1]) + app.err
ord <- order(nodal.diag$res)
nodal.pred <- nodal.boot$t[,ord]
err.632 <- 0
n.632 <- NULL
pred.632 <- NULL
for (i in 1:n) {
  inds <- nodal.boot$f[,i]==0
  err.632 <- err.632 + cost(nodal.pred[inds,i])/n
  n.632 <- c(n.632, sum(inds))
  pred.632 <- c(pred.632, nodal.pred[inds,i]) }


err.632 <- 0.368*app.err + 0.632*err.632
nodal.fac <- factor(rep(1:n, n.632), labels=ord)
plot(nodal.fac, pred.632, ylab="Prediction errors",
     xlab="Case ordered by residual")
abline(h=-0.5, lty=2); abline(h=0.5, lty=2)

Cases with errors entirely outside the dotted lines are always misclassified, and conversely. Estimate the misclassification error using the model with all five explanatory variables.
(Section 7.5; Brown, 1980)

5 Dataframe cloth records the number of faults y in lengths x of cloth. Is it true that E(y) ∝ x?

plot(cloth$x, cloth$y)
cloth.glm <- glm(y~offset(log(x)), poisson, data=cloth)
lines(cloth$x, fitted(cloth.glm))
summary(cloth.glm)
cloth.diag <- glm.diag(cloth.glm)
cloth.gam <- gam(y~s(log(x)), poisson, data=cloth)
lines(cloth$x, fitted(cloth.gam), lty=2)
summary(cloth.gam)

There is some overdispersion relative to the Poisson model with identity link, and strong evidence that the generalized additive model fit cloth.gam improves on the straight-line model in which y is Poisson with mean β₀ + β₁x. We can try parametric simulation from the model with the linear fit (the null model) to assess the significance of the decrease; cf. Algorithm 7.4:

cloth.gen <- function(data, fits)
{ y <- rpois(n=nrow(data), fits)
  data.frame(x=data$x, y=y) }
cloth.fun <- function(data)
{ d.glm <- glm(y~offset(log(x)), poisson, data=data)
  d.gam <- gam(y~s(log(x)), poisson, data=data)
  c(deviance(d.glm), deviance(d.gam)) }
cloth.boot <- boot(cloth, cloth.fun, sim="parametric", R=99,
                   ran.gen=cloth.gen, mle=fitted(cloth.glm))

Are the simulated drops in deviance roughly as they would be if standard asymptotics applied? How significant is the observed drop?
In addition to the hypothesis that we want to test — that E(y) depends linearly on x — the parametric bootstrap imposes the constraint that the data are Poisson, which is not intended to be part of the null hypothesis. We avoid this by a nonparametric bootstrap, as follows:

cloth1 <- data.frame(cloth, fits=fitted(cloth.glm),
                     pearson=cloth.diag$rp)
cloth.fun1 <- function(data, i)
{ y <- data$fits+sqrt(data$fits)*data$pearson[i]
  y <- round(y)
  y[y<0] <- 0
  d.glm <- glm(y~offset(log(data$x)), poisson)
  d.gam <- gam(y~s(log(data$x)), poisson)
  c(deviance(d.glm), deviance(d.gam)) }
cloth.boot <- boot(cloth1, cloth.fun1, R=99)


Here we have used resampled standardized Pearson residuals for the null model, obtained by cloth.diag$rp. How significant is the observed drop in deviance under this resampling scheme?
(Section 7.6.2; Bissell, 1972; Firth, Glosup and Hinkley, 1991)

6 The data nitrofen are taken from a test of the toxicity of the herbicide nitrofen on the zooplankton Ceriodaphnia dubia, an important species that forms the basis of freshwater food chains for the higher invertebrates and for fish and birds. The standard test measures the survival and reproductive output of 10 juvenile C. dubia in each of four concentrations of the herbicide, together with a control in which the herbicide is not present. During the 7-day period of the test each of the original individuals produces three broods of offspring, but for illustration we analyse the total offspring.
A previous model for the data is that at concentration x the total offspring y for each individual is Poisson distributed with mean exp(β₀ + β₁x + β₂x²). The fit of this model to the data suggests that low doses of nitrofen augment reproduction, but that higher doses inhibit it.
One thing required from analysis is an estimate of the concentration x₅₀ of nitrofen at which the mean brood size is halved, together with a 95% confidence interval for x₅₀. A second issue is posed by the surprising finding from a previous analysis that brood sizes are slightly larger at low doses of herbicide than at high or zero doses: is this true?
A wide variety of nonparametric curves could be fitted to the data, though care is needed because there are only five distinct values of x. The data do not look Poisson, but we use models with Poisson errors and the log link function to ensure that fitted values and predictions are positive. To compare the fits of the generalized linear model described above and a robustified generalized additive model with Poisson errors:

nitro <- rbind(nitrofen, nitrofen, nitrofen, nitrofen, nitrofen)
nitro <- rbind(nitro, nitro, nitro, nitro, nitro)
nitro$conc <- seq(0, 310, length=nrow(nitro))
attach(nitrofen)
plot(conc, jitter(total), ylab="total")
nitro.glm <- glm(total~conc+conc^2, poisson, data=nitrofen)
lines(nitro$conc, predict(nitro.glm, nitro, "response"), lty=3)
nitro.gam <- gam(total~s(conc,df=3), robust(poisson), data=nitrofen)
lines(nitro$conc, predict(nitro.gam, nitro, "response"))

To compare bootstrap confidence intervals for x₅₀ based on these models:

nitro.fun <- function(data, i, nitro)
{ assign("d", data[i,], frame=1)
  d.fit <- gam(total~s(conc,df=3), robust(poisson), data=d)
  f <- predict(d.fit, nitro, "response")
  f.gam <- max(nitro$conc[f>0.5*f[1]])
  d.fit <- glm(total~conc+conc^2, poisson, data=d)
  f <- predict(d.fit, nitro, "response")
  f.glm <- max(nitro$conc[f>0.5*f[1]])
  c(f.gam, f.glm) }
nitro.boot <- boot(nitrofen, nitro.fun, R=499,
                   strata=rep(1:5, rep(10,5)), nitro=nitro)
boot.ci(nitro.boot, index=1, type=c("norm","basic","perc","bca"))
boot.ci(nitro.boot, index=2, type=c("norm","basic","perc","bca"))


Do the values of x₅₀* look normal? What is the bias estimate for x₅₀ using the two models?
To perform a bootstrap test of whether the peak is a genuine effect, we simulate from a model satisfying the null hypothesis of no peak to see if the observed value of a suitable test statistic t, say, is unusual. This involves fitting a model with no peak, and then simulating from it. We read fitted values m₀(x) from the robust generalized additive model fit, but with 2.2 df (chosen by eye as the smallest for which the curve is flat through the first two levels of concentration). We then generate bootstrap responses by setting y* = m₀(x) + ε*, where the ε* are chosen randomly from the modified residuals at that x. We take as test statistic the difference between the highest fitted value and the fitted value at x = 0.

nitro.test <- fitted(gam(total~s(conc,df=2.2), robust(poisson),
                         data=nitrofen))
f <- predict(nitro.glm, nitro, "response")
nitro.orig <- max(f) - f[1]
res <- (nitrofen$total-nitro.test)/sqrt(1-0.1)
nitro1 <- data.frame(nitrofen, res=res, fit=nitro.test)
nitro1.fun <- function(data, i, nitro)
{ assign("d", data[i,], frame=1)
  d$total <- round(d$fit+d$res[i])
  d.fit <- glm(total~conc+conc^2, poisson, data=d)
  f <- predict(d.fit, nitro, "response")
  max(f)-f[1] }
nitro1.boot <- boot(nitro1, nitro1.fun, R=99,
                    strata=rep(1:5, rep(10,5)), nitro=nitro)
(1+sum(nitro1.boot$t>nitro.orig))/(1+nitro1.boot$R)

Do your conclusions change if other smooth curves are fitted?
(Section 7.6.2; Bailer and Oris, 1994)


8

Complex Dependence

8.1 Introduction

In previous chapters our models have involved variables independent at some level, and we have been able to identify independent components that can be simulated. Where a model can be fitted and residuals of some sort identified, the same ideas can be applied in the more complex problems discussed in this chapter. Where that model is parametric, parametric simulation can in principle be used to obtain resamples, though Markov chain Monte Carlo techniques may be needed in practice. But in nonparametric situations the dependence may be so complex, or our knowledge of it so limited, that neither of these approaches is feasible. Of course some assumption of repeatedness within the data is essential, or it is impossible to proceed. But the repeatability may not be at the level of individual observations, but of groups of them, and there is typically dependence between as well as within groups. This leads to the idea of constructing bootstrap data by taking blocks of some sort from the original observations. The area is in rapid development, so we avoid a detailed mathematical exposition, and merely sketch key aspects of the main ideas. In Section 8.2 we describe some of the resampling schemes proposed for time series. Section 8.3 outlines some ideas useful in resampling point processes.

8.2 Time Series

8.2.1 Introduction

A time series is a sequence of observations arising in succession, usually at times spaced equally and taken to be integers. Most models for time series assume that the data are stationary, in which case the joint distribution of any subset of them depends only on their times of occurrence relative to each other

and not on their absolute position in the series. A weaker assumption used in data analysis is that the joint second moments of observations depend only on their relative positions; such a series is said to be second-order or weakly stationary.

Time domain

There are two basic types of summary quantities for stationary time series. The first, in the time domain, rests on the joint moments of the observations. Let {Yⱼ} be a second-order stationary time series, with zero mean and autocovariance function γⱼ. That is, E(Yⱼ) = 0 and cov(Yₖ, Yₖ₊ⱼ) = γⱼ for all k and j; the variance of Yⱼ is γ₀. Then the autocorrelation function of the series is ρⱼ = γⱼ/γ₀, for j = 0, ±1, …, which measures the correlation between observations lag j apart; of course −1 ≤ ρⱼ ≤ 1, ρ₀ = 1, and ρⱼ = ρ₋ⱼ. An uncorrelated series would have ρⱼ = 0 for j ≠ 0, and if the data were normally distributed this would imply that the observations were independent.

For example, the stationary moving average process of order one, or MA(1) model, has

$$Y_j = \varepsilon_j + \beta\varepsilon_{j-1}, \qquad j = \ldots, -1, 0, 1, \ldots, \qquad (8.1)$$

where {εⱼ} is a white noise process of innovations, that is, a stream of independent observations with mean zero and variance σ². The autocorrelation function for the {Yⱼ} is ρ₁ = β/(1 + β²) and ρⱼ = 0 for |j| > 1; this sharp cut-off in the autocorrelations is characteristic of a moving average process. Only if β = 0 is the series Yⱼ independent. On the other hand the stationary autoregressive process of order one, or AR(1) model, has

$$Y_j = \alpha Y_{j-1} + \varepsilon_j, \qquad j = \ldots, -1, 0, 1, \ldots, \qquad |\alpha| < 1. \qquad (8.2)$$

The autocorrelation function for this process is ρⱼ = α^{|j|} for j = ±1, ±2 and so forth, so large α gives high correlation between successive observations. The autocorrelation function decreases rapidly for both models (8.1) and (8.2).

A close relative of the autocorrelation function is the partial autocorrelation function, defined as ρ′ⱼ = γ′ⱼ/γ₀, where γ′ⱼ is the covariance between Yₖ and Yₖ₊ⱼ after adjusting for the intervening observations. The partial autocorrelations for the MA(1) model are

$$\rho'_j = -(-\beta)^{|j|}(1-\beta^2)\{1-\beta^{2(|j|+1)}\}^{-1}, \qquad j = \pm 1, \pm 2, \ldots.$$

The AR(1) model has ρ′₁ = α, and ρ′ⱼ = 0 for |j| > 1; a sharp cut-off in the partial autocorrelations is characteristic of autoregressive processes.

The sample estimates of ρⱼ and ρ′ⱼ are basic summaries of the structure of a time series. Plots of them against j are called the correlogram and partial correlogram of the series.

One widely used class of linear time series models is the autoregressive-moving average or ARMA process. The general ARMA(p, q) model is defined by

$$Y_j = \sum_{k=1}^{p}\alpha_k Y_{j-k} + \varepsilon_j + \sum_{k=1}^{q}\beta_k\varepsilon_{j-k}, \qquad (8.3)$$

where {εⱼ} is a white noise process. If all the αₖ equal zero, {Yⱼ} is the moving average process MA(q), whereas if all the βₖ equal zero, it is AR(p). In order for (8.3) to represent a stationary series, conditions must be placed on the coefficients. Packaged routines enable models (8.3) to be fitted readily, while series from them are easily simulated using a given innovation series …, ε₋₁, ε₀, ε₁, ….
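As an illustration, here is a minimal sketch of such a simulation for the ARMA(1,1) special case of (8.3), applying the defining equation to normal innovations and discarding a burn-in period; the function name and coefficient values are hypothetical.

arma.sim <- function(n, alpha, beta, sigma=1, burn=100)
{ e <- rnorm(n+burn+1, 0, sigma)       # innovations eps_0, eps_1, ...
  y <- numeric(n+burn)
  y[1] <- e[2] + beta*e[1]
  for (j in 2:(n+burn))                # defining equation of the series
    y[j] <- alpha*y[j-1] + e[j+1] + beta*e[j]
  y[-(1:burn)] }                       # discard the burn-in values
y <- arma.sim(200, alpha=0.5, beta=0.3)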

Frequency domain

The second approach to time series is based on the frequency domain. The spectrum of a stationary series with autocovariances γⱼ is

$$g(\omega) = \gamma_0 + 2\sum_{j=1}^{\infty}\gamma_j\cos(\omega j), \qquad 0 \le \omega \le \pi. \qquad (8.4)$$

This summarizes the values of all the autocorrelations of {Yⱼ}. A white noise process has the flat spectrum g(ω) = γ₀, while a sharp peak in g(ω) corresponds to a strong periodic component in the series. For example, the spectrum for a stationary AR(1) model is g(ω) = σ²{1 − 2α cos(ω) + α²}⁻¹.

The empirical Fourier transform plays a key role in data analysis in the frequency domain. The treatment is simplified if we relabel the series as y₀, …, y_{n−1}, and suppose that n = 2n_F + 1 is odd. Let ζ = e^{2πi/n} be the nth complex root of unity, so ζⁿ = 1. Then the empirical Fourier transform of the data is the set of n complex-valued quantities

$$\tilde y_k = \sum_{j=0}^{n-1}\zeta^{jk}y_j, \qquad k = 0, \ldots, n-1;$$

note that ỹ₀ = nȳ and that the complex conjugate of ỹₖ is ỹ_{n−k}, for k = 1, …, n − 1. For different k the vectors (1, ζᵏ, …, ζ^{(n−1)k}) are orthogonal. It is straightforward to see that

$$\frac{1}{n}\sum_{k=0}^{n-1}\zeta^{-jk}\tilde y_k = y_j, \qquad j = 0, \ldots, n-1,$$

so this inverse Fourier transform retrieves the data. Now define the Fourier frequencies ωₖ = 2πk/n, for k = 1, …, n_F. The sample analogue of the spectrum at ωₖ is the periodogram,

$$I(\omega_k) = n^{-1}|\tilde y_k|^2 = n^{-1}\left[\left\{\sum_{j=0}^{n-1}y_j\cos(\omega_k j)\right\}^2 + \left\{\sum_{j=0}^{n-1}y_j\sin(\omega_k j)\right\}^2\right].$$


The orthogonality properties of the vectors involved in the Fourier transform imply that the overall sum of squares of the data may be expressed as

$$\sum_{j=0}^{n-1}y_j^2 = n\bar y^2 + 2\sum_{k=1}^{n_F}I(\omega_k). \qquad (8.5)$$

The empirical Fourier transform and its inverse can be rapidly calculated by an algorithm known as the fast Fourier transform.
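For instance, a minimal sketch of the periodogram and cumulative periodogram ordinates computed through the fast Fourier transform, assuming n odd as above; fft and Mod are standard functions, while the function name is hypothetical.

periodogram <- function(y)
{ n <- length(y)
  nF <- (n-1)/2                      # number of Fourier frequencies, n odd
  ytilde <- fft(y)                   # empirical Fourier transform
  I <- Mod(ytilde[2:(nF+1)])^2/n     # I(omega_k) for k = 1, ..., nF
  list(omega=2*pi*(1:nF)/n, I=I,
       cum=cumsum(I)/sum(I)) }       # cumulative periodogram ordinates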

If the data arise from a stationary process {Yⱼ} with spectrum g(ω), where Yⱼ = Σ_{l=−∞}^{∞} a_l ε_{j−l}, with {ε_l} a normal white noise process, then as n increases and provided the terms |a_l| decrease sufficiently fast as l → ±∞, the real and imaginary parts of the complex-valued random variables ỹ₁, …, ỹ_{n_F} are asymptotically independent normal variables with means zero and variances ng(ω₁)/2, …, ng(ω_{n_F})/2; furthermore the ỹₖ at different Fourier frequencies are asymptotically independent. This implies that as n → ∞ for such a process, the periodogram values I(ωₖ) at different Fourier frequencies will be independent, and that I(ωₖ) will have an exponential distribution with mean g(ωₖ). (If n is even, I(π) must be added to (8.5); I(π) is approximately independent of the I(ωₖ) and its asymptotic distribution is g(π)χ²₁.) Thus (8.5) decomposes the total sum of squares into asymptotically independent components, each associated with the amount of variation due to a particular Fourier frequency. Weaker versions of these results hold when the process is not linear, or when the process {ε_l} is not normal, the key difference being that the joint limiting distribution of the periodogram values holds only for a finite number of fixed frequencies.

I f the series is white noise, under mild conditions its periodogram ordinates I{co\) , . . . , I{(o„F) are roughly a random sample from an exponential distribu­tion with m ean yo. Tests o f independence may be based on the cumulative periodogram ordinates,

W hen the da ta are white noise these ordinates have roughly the same jo in t distribution as the order statistics o f np — 1 uniform random variables.

Example 8.1 (Rio Negro data) The data for our first time series example are monthly averages of the daily stages — heights — of the Rio Negro, 18 km upstream at Manaus, from 1903 to 1992, made available to us by Professors H. O'Reilly Sternberg and D. R. Brillinger of the University of California at Berkeley. Because of the tiny slope of the water surface and the lower courses of its flatland affluents, these data may be regarded as a reasonable approximation of the water level in the Amazon River at the confluence of the

Figure 8.1 Deseasonalized monthly average stage (metres) of the Rio Negro at Manaus, 1903–1992 (Sternberg, 1995).

two rivers. To remove the strong seasonal component, we subtract the average value for each month, giving the series of length n = 1080 shown in Figure 8.1.

For an initial example, we take the first ten years of observations. The top panels of Figure 8.2 show the correlogram and partial correlogram for this shorter series, with horizontal lines showing approximate 95% confidence limits for correlations from a white noise series. The shape of the correlogram and the cut-off in the partial correlogram suggest that a low-order autoregressive model will fit the data, which are quite highly correlated. The lower left panel of the figure shows the periodogram of the series, which displays the usual high variability associated with single periodogram ordinates. The lower right panel shows the cumulative periodogram, which lies well outside its overall 95% confidence band and clearly does not correspond to a white noise series.

An AR(2) model fitted to the shorter series gives α̂₁ = 1.14 and α̂₂ = −0.31, both with standard error 0.062, and estimated innovation variance 0.598. The left panel of Figure 8.3 shows a normal probability plot of the standardized residuals from this model, and the right panel shows the cumulative periodogram of the residual series. The residuals seem close to Gaussian white noise. ■

8.2.2 Model-based resampling

There are two approaches to resampling in the time domain. The first and simplest is analogous to model-based resampling in regression. The idea is to fit a suitable model to the data, to construct residuals from the fitted model, and then to generate new series by incorporating random samples from the

Figure 8.2 Summary plots for the Rio Negro data, 1903–1912. The top panels show the correlogram and partial correlogram for the series. The bottom panels show the periodogram and cumulative periodogram.

residuals into the fitted model. The residuals are typically recentred to have the same mean as the innovations of the model. About the simplest situation is when the AR(1) model (8.2) is fitted to an observed series y₁, …, yₙ, giving estimated autoregressive coefficient α̂ and estimated innovations

$$e_j = y_j - \hat\alpha y_{j-1}, \qquad j = 2, \ldots, n;$$

e₁ is unobtainable because y₀ is unknown. Model-based resampling might then proceed by equi-probable sampling with replacement from centred residuals e₂ − ē, …, eₙ − ē to obtain simulated innovations ε₀*, ε₁*, …, εₙ*, and then setting

Figure 8.3 Plots for residuals from AR(2) model fitted to the Rio Negro data, 1903–1912: normal Q-Q plot of the standardized residuals (left), and cumulative periodogram of the residual series (right).

y₀* = ε₀* and

$$y_j^* = \hat\alpha y_{j-1}^* + \varepsilon_j^*, \qquad j = 1, \ldots, n; \qquad (8.6)$$

of course we must have |α̂| < 1. In fact the series so generated is not stationary, and it is better to start the series in equilibrium, or to generate a longer series of innovations and start (8.6) at j = −k, where the ‘burn-in’ period −k, …, 0 is chosen large enough to ensure that the observations y₁*, …, yₙ* are essentially stationary; the values y₋ₖ*, …, y₀* are discarded.
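A minimal sketch of this scheme for the AR(1) model, with the fitted coefficient and residuals as inputs and an arbitrary burn-in length; the function name is hypothetical.

ar1.boot <- function(n, alpha, e, burn=50)
{ eps <- sample(e - mean(e), n+burn+1, replace=TRUE)  # resampled innovations
  ystar <- numeric(n+burn+1)
  ystar[1] <- eps[1]                  # start the recursion (8.6) at j = -burn
  for (j in 2:(n+burn+1))
    ystar[j] <- alpha*ystar[j-1] + eps[j]
  ystar[-(1:(burn+1))] }              # discard burn-in, return y*_1, ..., y*_n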

Thus model-based resampling for time series is based on applying the defining equation(s) of the series to innovations resampled from residuals. This procedure is simple to apply, and leads to good theoretical behaviour for estimates based on such data when the model is correct. For example, studentized bootstrap confidence intervals for the autoregressive coefficients αₖ in an AR(p) process enjoy the good asymptotic properties discussed in Section 5.4.1, provided that the model fitted is chosen correctly. Just as there, confidence intervals based on transformed statistics may be better in practice.

Example 8.2 (Wool prices) The Australian Wool Corporation monitors prices weekly when wool markets are held, and sets a minimum price just before each week's markets open. This reflects the overall price of wool for that week, but the prices actually paid can vary considerably relative to the minimum. The left panel of Figure 8.4 shows a plot of log(price paid/minimum price) for those weeks when markets were held from July 1976 to June 1984. The series does not seem stationary, having some of the characteristics of a random walk, as well as a possible overall trend.

If the log ratio in week j follows a random walk, we have Yⱼ = Yⱼ₋₁ + εⱼ,

Figure 8.4 Weekly log ratio of price paid to minimum price for Australian wool from July 1976 to June 1984 (Diggle, 1990, pp. 229–237). Left panel: original data. Right panel: first differences of data.

where the εⱼ are white noise; a non-zero mean for the innovations εⱼ will lead to drift in yⱼ. The right panel of Figure 8.4 shows the differenced series, eⱼ = yⱼ − yⱼ₋₁, which appears stationary apart from a change in the innovation variance at about the 100th week. In our analysis we drop the first 100 observations, leaving a differenced series of length 208.

An alternative to the random walk model is the AR(1) model

$$Y_j - \mu = \alpha(Y_{j-1} - \mu) + \varepsilon_j; \qquad (8.7)$$

this gives the random walk when α = 1. If the innovations have mean zero and α is close to but less than one, (8.7) gives stationary data, though subject to the climbs and falls seen in the left panel of Figure 8.4. The implications for forecasting depend on the value of α, since the variance of a forecast is only asymptotically bounded when |α| < 1. We test the unit root hypothesis that the data are a random walk, or equivalently that α = 1, as follows.

Our test is based on the ordinary least squares estimate of α in the regression Yⱼ = γ + αYⱼ₋₁ + εⱼ for j = 2, …, n, using test statistic T = (1 − α̂)/S, where S is the standard error for α̂ calculated using the usual formula for a straight-line regression model. Large values of T are evidence against the random walk hypothesis, with or without drift. The observed value of T is t = 1.19. The distribution of T is far from the usual standard normal, however, because of the regression of each observation on its predecessor.

Under the hypothesis that α = 1 we simulate new time series Y₁*, …, Yₙ* by generating a bootstrap sample ε₂*, …, εₙ* from the differences e₂, …, eₙ and then setting Y₁* = y₁, Y₂* = Y₁* + ε₂*, and Yⱼ* = Y*ⱼ₋₁ + εⱼ* for subsequent j. This is (8.6) applied with the null hypothesis value α = 1. The value of T* is then obtained from the regression of Yⱼ* on Y*ⱼ₋₁ for j = 2, …, n. The left panel

Figure 8.5 Results for 199 replicates of the random walk test statistic, T*. The left panel is a normal plot of t*. The right panel shows t* plotted against the inverse sum of squares for the regressor, with the dotted line giving the observed value.

of Figure 8.5 shows the empirical distribution of T* in 199 simulations. The distribution is close to normal with mean 1.17 and variance 0.88. The observed significance level for t is (97 + 1)/(199 + 1) = 0.49: there is no evidence against the random walk hypothesis.

The right panel of Figure 8.5 shows the values of t* plotted against the inverse sum of squares for the regressor y*ⱼ₋₁. In a conventional regression, inference is usually conditional on this sum of squares, which determines the precision of the estimate. The dotted line shows the observed sum of squares. If the conditional distribution of t* is thought to be appropriate here, the distribution of values of t* close to the dotted line shows that the conditional significance level is even higher; there is no evidence against the random walk conditionally or unconditionally. ■
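A minimal sketch of this test, assuming the shortened series is in y; rw.stat computes T from the straight-line regression of yⱼ on yⱼ₋₁, and both function and object names are hypothetical.

rw.stat <- function(y)
{ n <- length(y)
  fit <- lsfit(y[-n], y[-1])          # regression of y_j on y_{j-1}
  s2 <- sum(fit$residuals^2)/(n-3)    # residual variance, n-1 pairs, 2 parameters
  se <- sqrt(s2/sum((y[-n]-mean(y[-n]))^2))
  (1 - fit$coef[2])/se }
e <- diff(y)
t.star <- numeric(199)
for (r in 1:199)                      # random walks from resampled differences
  t.star[r] <- rw.stat(y[1] + cumsum(c(0, sample(e, replace=TRUE))))
(1 + sum(t.star >= rw.stat(y)))/(1 + 199)   # bootstrap significance level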

Models are commonly fitted in order to predict future values of a time series, but as in other settings, it can be difficult to allow for the various sources of uncertainty that affect the predictions. The next example shows how bootstrap methods can give some idea of the relative contributions from innovations, estimation error, and model error.

Example 8.3 (Sunspot numbers) Figure 8.6 shows the much-analysed annual sunspot numbers y₁, …, y₂₈₉ from 1700–1988. The data show a strong cycle with a period of about 11 years, and some hint of non-reversibility, which shows up as a lack of symmetry in the peaks. We use values from 1930–1979 to predict the numbers of sunspots over the next few years, based on fitting

Figure 8.6 Annual sunspot numbers, 1700–1988 (Tong, 1990, p. 470).

Table 8.1 Predictions and their standard errors for 2{(y + 1)^{1/2} − 1} for sunspot data, 1980–1988, based on data for 1930–1979. The standard errors are nominal, and also those obtained under model-based resampling assuming the simulated series y* are AR(9), not assuming y* is AR(9), and by a conditional scheme, and the block and post-blackened bootstraps with block length l = 10. See Examples 8.3 and 8.5 for details.

                       1980   81   82   83   84   85   86   87 1988
Actual                 23.0 21.8 19.6 14.4 11.7  6.7  5.6  9.0 18.1
Predicted              21.6 18.9 14.9 12.2  9.1  7.5  6.8  8.8 13.6
Standard error
  Nominal               2.0  2.9  3.2  3.2  3.2  3.2  3.3  3.4  3.4
  Model, AR(9)          2.2  2.9  3.0  3.2  3.3  3.8  4.1  4.0  3.6
  Model                 2.3  3.3  3.6  3.5  3.5  3.6  3.8  3.9  3.8
  Model, condit'l       2.5  3.6  4.1  3.9  3.8  3.8  3.9  4.0  4.1
  Block, l = 10         7.8  7.0  6.9  6.9  6.7  6.6  6.7  6.8  6.5
  Post-black'd, l = 10  2.1  3.3  3.9  4.0  3.6  3.6  3.9  4.3  4.3

AR(p) models

$$Y_j - \mu = \sum_{k=1}^{p}\alpha_k(Y_{j-k} - \mu) + \varepsilon_j, \qquad (8.8)$$

to the transformed observations y′ⱼ = 2{(yⱼ + 1)^{1/2} − 1}; this transformation is chosen to stabilize the variance. The corresponding maximized log likelihoods are denoted ℓ̂_p. A standard approach to model selection is to select the model that minimizes AIC = −2ℓ̂_p + 2p, which trades off goodness of fit (measured by the maximized log likelihood) against model complexity (measured by p). Here the resulting "best" model is AR(9), whose predictions ŷⱼ for 1980–88 and their nominal standard errors are given at the top of Table 8.1. These standard errors allow for prediction error due to the new innovations, but not for parameter estimation or model selection, so how useful are they?

To assess this we consider model-based simulation from (8.8) using centred residuals and the estimated coefficients of the fitted AR(9) model to generate series y*_{r1}, …, y*_{r59}, corresponding to the period 1930–1988, for r = 1, …, R. We then fit autoregressive models up to order p = 25 to y*_{r1}, …, y*_{r50}, select the model giving the smallest AIC, and use this model to produce predictions ŷ*_{rj} for j = 51, …, 59. The prediction error is y*_{rj} − ŷ*_{rj}, and the estimated standard


errors of this are given in Table 8.1, based on R = 999 bootstrap series. The orders of the fitted models were

Order   1   2   3   4   5   6   7   8   9  10  11  12
#       3 257 126 100 273  85  22  18  83  23   7   2

so the AR(9) model is chosen in only 8% of cases, and most of the models selected are less complicated. The fifth and sixth rows of Table 8.1 give the estimated standard errors of the y*_{rj} − ŷ*_{rj} using the 83 simulated series for which the selected model was AR(9) and using all the series, based on the 999 replications. There is about a 10–15% increase in standard error due to parameter estimation, and the standard errors for the AR(9) models are mostly smaller.

Prediction errors should take account of the values of yⱼ immediately prior to the forecast period, since presumably these are relevant to the predictions actually made. Predictions that follow on from the observed data can be obtained by using innovations sampled at random except for the period j = n − k + 1, …, n, where we use the residuals actually observed. Taking k = n yields the original series, in which case the only variability in the y*_{rj} is due to the innovations in the forecast period; the standard errors of the predictions will then be close to the nominal standard error. However, if k is small relative to n, the differences y*_{rj} − ŷ*_{rj} will largely reflect the variability due to the use of estimated parameters, although the y*_{rj} will follow on from yₙ. The conditional standard errors in Table 8.1, based on k = 9, are about 10% larger than the unconditional ones, and substantially larger than the nominal standard errors.

The distributions of the y*_{rj} − ŷ*_{rj} appear close to normal with zero means, and a summary of variation in terms of standard errors seems appropriate. There will clearly be difficulties with normal-based prediction intervals in 1985 and 1986, when the lower limits of 95% intervals for y are negative, and it might be better to give one-sided intervals for these years. It would be better to use a studentized version of y*_{rj} − ŷ*_{rj} if an appropriate standard error were readily available.

When bootstrap series are generated from the AR(9) model fitted to the data from 1700–1979, the orders of the fitted models are

Order   5   9  10  11  12  13  14  15  16  17  18  19  20
#       1 765  88  57  28  21  11  11   5   1   4   2   5

so the AR(9) model is chosen in about 75% of cases. There is a tendency for AIC to lead to overfitting: just one of the models has order less than 9. For this longer series parameter estimation and model selection inflate the nominal standard error by at most 6%.

The above analysis gives the variability of predictions based on selecting the model that minimizes AIC on the basis that an AR(9) model is correct, and


does not give a true reflection of the error otherwise. Is an autoregressive or more generally a linear model appropriate? A test for linearity of a time series can be based on the non-additivity statistic T = w²(n − 2m − 2)/(RSS − w²), where RSS is the residual sum of squares for regression of (y_{m+1}, …, yₙ) on the (n − m) × (m + 1) matrix X whose jth row is (1, y_{m+j−1}, …, yⱼ), with residuals qⱼ and fitted values ĝⱼ. Let q′ⱼ denote the residuals from the regression of ĝ²ⱼ on X, and let w equal Σⱼ qⱼq′ⱼ/(Σⱼ q′ⱼ²)^{1/2}. Then the approximate distribution of T is F_{1,n−2m−2}, with large values of T indicating potential nonlinearity. The observed value of T when m = 20 is 5.46, giving significance level 0.02, in good agreement with bootstrap simulations from the fitted AR(9) model. The significance level varies little for values of m from 6 to 30. There is good evidence that the series is nonlinear. We return to these data in Example 8.5. ■

The major drawback with model-based resampling is that in practice not only the parameters of a model, but also its structure, must be identified from the data. If the chosen structure is incorrect, the resampled series will be generated from a wrong model, and hence they will not have the same statistical properties as the original data. This suggests that some allowance be made for model selection, as in Section 3.11, but it is unclear how to do this without some assumptions about the dependence structure of the process, as in the previous example. Of course this difficulty is less critical when the model selected is strongly indicated by subject-matter considerations or is well-supported by extensive data.

8.2.3 Block resampling

The second approach to resampling in the time domain treats as exchangeable not innovations, but blocks of consecutive observations. The simplest version of this idea divides the data into b non-overlapping blocks of length l, where we suppose that n = bl. We set z₁ = (y₁, …, y_l), z₂ = (y_{l+1}, …, y_{2l}), and so forth, giving blocks z₁, …, z_b. The procedure is to take a bootstrap sample with equal probabilities b⁻¹ from the zᵢ, and then to paste these end-to-end to form a new series. As a simple example, suppose that the original series is y₁, …, y₁₂, and that we take l = 4 and b = 3. Then the blocks are z₁ = (y₁, y₂, y₃, y₄), z₂ = (y₅, y₆, y₇, y₈), and z₃ = (y₉, y₁₀, y₁₁, y₁₂). If the resampled blocks are z₁* = z₂, z₂* = z₁, and z₃* = z₂, the new series of length 12 is

{yⱼ*} = z₁*, z₂*, z₃* = y₅, y₆, y₇, y₈, y₁, y₂, y₃, y₄, y₅, y₆, y₇, y₈.
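A minimal sketch of this scheme with non-overlapping blocks and no wrapping, assuming that n is a multiple of l; the function name is hypothetical.

block.boot <- function(y, l)
{ b <- length(y)/l                          # number of blocks, n = b*l
  starts <- (sample(b, replace=TRUE)-1)*l   # equi-probable choice of blocks
  y[rep(starts, rep(l, b)) + rep(1:l, b)] } # paste the blocks end-to-end
y.star <- block.boot(y, l=4)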

In general, the resampled series are more like white noise than the original series, because of the joins between blocks where successive independently chosen z* meet.

The idea that underlies this block resampling scheme is that if the blocks


are long enough, enough of the original dependence will be preserved in the resampled series that statistics t* calculated from {yⱼ*} will have approximately the same distribution as values t calculated from replicates of the original series. Clearly this approximation will be best if the dependence is weak and the blocks are as long as possible, thereby preserving the dependence more faithfully. On the other hand, the distinct values of t* must be as numerous as possible to provide a good estimate of the distribution of T, and this points towards short blocks. Theoretical work outlined below suggests that a compromise in which the block length l is of order n^γ for some γ in the interval (0, 1) balances these two conflicting needs. In this case both the block length l and the number of blocks b = n/l tend to infinity as n → ∞, though different values of γ are appropriate for different types of statistic t.

There are several variants on this resampling plan. One is to let the original blocks overlap, in our example giving the n − l + 1 = 9 blocks z₁ = (y₁, …, y₄), z₂ = (y₂, …, y₅), z₃ = (y₃, …, y₆), and so forth up to z₉ = (y₉, …, y₁₂). This incurs end effects, as the first and last l − 1 of the original observations appear in fewer blocks than the rest. Such effects can be removed by wrapping the data around a circle, in our example adding the blocks z₁₀ = (y₁₀, y₁₁, y₁₂, y₁), z₁₁ = (y₁₁, y₁₂, y₁, y₂), and z₁₂ = (y₁₂, y₁, y₂, y₃). This ensures that each of the original observations has an equal chance of appearing in a simulated series. End correction by wrapping also removes the minor problem with the non-overlapping scheme that the last block is shorter than the rest if n/l is not an integer.

Post-blackening

The most important difficulty with resampling schemes based on blocks is that they generate series that are less dependent than the original data. In some circumstances this can lead to catastrophically bad resampling approximations, as we shall see in Example 8.4. It is clearly inappropriate to take blocks of length l = 1 when resampling dependent data, for the resampled series is then white noise, but the "whitening" can remain substantial for small and moderate values of l. This suggests a strategy intermediate between model-based and block resampling. The idea is to "pre-whiten" the series by fitting a model that is intended to remove much of the dependence between the original observations. A series of innovations is then generated by block resampling of residuals from the fitted model, and the innovation series is then "post-blackened" by applying the estimated model to the resampled innovations. Thus if an AR(1) model is used to pre-whiten the original data, new series are generated by applying (8.6) but with the innovation series {εⱼ*} sampled not independently but in blocks taken from the centred residual series e₂ − ē, …, eₙ − ē.
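A minimal sketch of the post-blackened scheme with AR(1) pre-whitening, using overlapping blocks of centred residuals as innovations and the recursion (8.6); the function name is hypothetical.

postblack.boot <- function(y, l, burn=50)
{ n <- length(y)
  alpha <- ar(y, aic=FALSE, order.max=1)$ar   # pre-whiten with an AR(1) fit
  e <- y[-1] - alpha*y[-n]                    # residuals e_2, ..., e_n
  e <- e - mean(e)                            # centre the residuals
  b <- ceiling((n+burn)/l)                    # innovations in blocks
  starts <- sample(length(e)-l+1, b, replace=TRUE) - 1
  eps <- e[rep(starts, rep(l, b)) + rep(1:l, b)][1:(n+burn)]
  ystar <- numeric(n+burn)
  ystar[1] <- eps[1]
  for (j in 2:(n+burn))                       # post-blacken via (8.6)
    ystar[j] <- alpha*ystar[j-1] + eps[j]
  ystar[-(1:burn)] }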


Blocks of blocks

A different approach to removing the whitening effect of block resampling is to resample blocks of blocks. Suppose that the focus of interest is a statistic T which estimates θ and depends only on blocks of m successive observations. An example is the lag k autocovariance (n − k)⁻¹ Σ_{j=1}^{n−k} (yⱼ − ȳ)(y_{j+k} − ȳ), for which m = k + 1. Then unless l ≫ m the distribution of T* − t is typically a poor approximation to that of T − θ, because a substantial proportion of the pairs (Yⱼ*, Y*_{j+k}) in a resampled series will lie across a join between blocks, and will therefore be independent. To implement resampling blocks of blocks we define a new m-variate process {Y′ⱼ} for which Y′ⱼ = (Yⱼ, …, Y_{j+m−1}), rewrite T so that it involves averages of the Y′ⱼ, and resample blocks of the new "data" y′₁, …, y′_{n−m+1}, each of the observations of which is a block of the original data. For the lag 1 autocovariance, for example, we set

y′ⱼ = (yⱼ, y_{j+1})ᵀ,  j = 1, …, n − 1,

and write t = (n − 1)⁻¹ Σⱼ (y′_{1j} − ȳ′₁.)(y′_{2j} − ȳ′₂.). The key point is that t should not compare observations adjacent in each row. With n = 12 and l = 4 a bootstrap replicate might be

y₅ y₆ y₇ y₈   y₁ y₂ y₃ y₄   y₇ y₈ y₉ y₁₀
y₆ y₇ y₈ y₉   y₂ y₃ y₄ y₅   y₈ y₉ y₁₀ y₁₁

Since a bootstrap version of t based on this series will only contain products of (centred) adjacent observations of the original data, the whitening due to resampling blocks will be reduced, though not entirely removed.
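A minimal sketch of the blocks-of-blocks scheme for the lag 1 autocovariance, resampling non-overlapping blocks of columns of the two-row array above; the function name is hypothetical.

bob.boot <- function(y, l)
{ n <- length(y)
  yp <- rbind(y[-n], y[-1])             # columns are y'_j = (y_j, y_{j+1})
  b <- floor(ncol(yp)/l)
  starts <- (sample(b, replace=TRUE)-1)*l
  cols <- rep(starts, rep(l, b)) + rep(1:l, b)
  ystar <- yp[, cols]                   # resampled m-variate series
  mean((ystar[1,]-mean(ystar[1,])) * (ystar[2,]-mean(ystar[2,]))) }  # t*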

This approach leads to a shorter series being resampled, but this is unimportant relative to the gain from avoiding whitening.

Stationary bootstrap

A further but less important difficulty with these block schemes is that the artificial series generated by them are not stationary, because the joint distribution of resampled observations close to a join between blocks differs from that in the centre of a block. This can be overcome by taking blocks of random length. The stationary bootstrap takes blocks whose lengths L are geometrically distributed, with density

Pr(L = j) = (1 − p)^{j−1} p,  j = 1, 2, ….

This yields resampled series that are stationary with mean block length l̄ = p⁻¹. Properties of this scheme are explored in Problems 8.1 and 8.2.

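A minimal sketch of the stationary bootstrap, wrapping the series around a circle and taking geometrically distributed block lengths with mean l̄; the tsboot function in the boot library provides a related implementation through its sim="geom" argument, so the function below is only illustrative.

stat.boot <- function(y, l.bar)
{ n <- length(y)
  p <- 1/l.bar
  inds <- NULL
  while (length(inds) < n)
  { start <- sample(n, 1)        # random start on the circle
    len <- rgeom(1, p) + 1       # Pr(L = j) = (1-p)^(j-1) p, j = 1, 2, ...
    inds <- c(inds, start:(start+len-1)) }
  y[(inds[1:n]-1) %% n + 1] }    # wrap indices around the circle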

Example 8.4 (Rio Negro data) To illustrate these resampling schemes we consider the shorter series of river stages, of length 120, with its average subtracted. Figure 8.7 shows the original series, followed by three bootstrap

Figure 8.7 Resamples from the shorter Rio Negro data. The top panel shows the original series, followed by three series generated by model-based sampling from the fitted AR(2) model, then three series generated using the block bootstrap with l = 24 and no end correction, and three series made using the post-blackened method, with the same blocks as the block series and the fitted AR(2) model.

series generated by model-based sampling from the AR(2) model. The next three panels show series generated using the block bootstrap with length l = 24 and no wrapping. There are some sharp jumps at the ends of contiguous blocks in the resampled series. The bottom panels show series generated using the same blocks applied to the residuals, and then post-blackened using the AR(2) model. The jumps from using the block bootstrap are largely removed by post-blackening.

For a more systematic comparison of the methods, we generated 200 bootstrap replicates under different resampling plans. For each plan we calculated the standard error SE* of the average ȳ* of the resampled series, and the average of the first three autocorrelation coefficients. The more dependent


Table 8.2 Comparison of time-domain resampling plans applied to the average and first three autocorrelation coefficients for the Rio Negro data, 1903–1912.

                                  SE     ρ̂₁     ρ̂₂     ρ̂₃
Original values                    —    0.85   0.62   0.45
Sampling SE                      0.017  0.002  0.007  0.010

Resampling plan    Details
Model-based        AR(2)         0.34   0.83   0.60   0.38
                   AR(1)         0.49   0.82   0.67   0.54
                   AR(3)         0.44   0.83   0.58   0.39
Blockwise          l = 2         0.20   0.41  −0.02  −0.01
                   l = 5         0.26   0.67   0.35   0.14
                   l = 10        0.33   0.75   0.47   0.27
                   l = 20        0.33   0.79   0.54   0.35
Blocks of blocks   l = 2         0.20   0.85   0.63   0.45
                   l = 5         0.26   0.85   0.63   0.45
                   l = 10        0.33   0.85   0.64   0.47
                   l = 20        0.33   0.85   0.64   0.48
Stationary         l = 2         0.25   0.40   0.13   0.03
                   l = 5         0.28   0.66   0.37   0.20
                   l = 10        0.31   0.74   0.47   0.28
                   l = 20        0.28   0.79   0.54   0.36
Post-blackened     AR(2), l = 2  0.39   0.83   0.59   0.38
                   AR(1), l = 2  0.58   0.85   0.69   0.56
                   AR(3), l = 2  0.43   0.83   0.58   0.40

the series, the larger we expect SE and the autocorrelation coefficients to be. Table 8.2 gives the results. The top two rows show the correlations in the data and approximate standard errors for the resampling results below.

The results for model-based simulation depend on the model used, although the overfitted AR(3) model gives results similar to the AR(2). The AR(1) model adds correlation not present in the original data.

The block method is applied with no end correction, but further simulations show that it makes little difference. Block length has a dramatic effect, and in particular, block length l = 2 essentially removes correlation at lags larger than one. Even blocks of length 20 give resampled data noticeably less dependent than the original series.

The whitening is overcome by resampling blocks of blocks. We took blocks of length m = 4, so that the m-variate series had length 117. The mean resampled autocorrelations are essentially unchanged even with l = 2, while SE* does depend on block length.



The stationary bootstrap is used with end correction. The results are similar to those for the block bootstrap, except that the varying block length preserves slightly more of the original correlation structure; this is noticeable at l = 2.

Results for the post-blackened method with AR(2) and AR(3) models are similar to those for the corresponding model-based schemes. The results for the post-blackened AR(1) scheme are intermediate between AR(1) and AR(2) model-based resampling, reflecting the fact that the AR(1) model underfits the data, and hence structure remains in the residuals. Longer blocks have little effect for the AR(2) and AR(3) models, but they bring results for the AR(1) model more into line with those for the others. ■

The previous example suggests that post-blackening generates resampled series with correlation structure similar to the original data. Correlation, however, is a measure of linear dependence. Is nonlinear dependence preserved by resampling blocks?

Example 8.5 (Sunspot numbers) To assess the success o f the block and post­blackened schemes in preserving nonlinearity, we applied them to the sunspot data, using / = 10. We saw in Example 8.3 tha t although the best autoregressive model for the transform ed da ta is AR(9), the series is nonlinear. This nonlin­earity m ust rem ain in the residuals, which are alm ost a linear transform ation o f the series. Figure 8.8 shows probability plots o f the nonlinearity statistic T from Example 8.3, with m = 20, for the block and post-blackened bootstraps with I = 10. The results for model-based resampling o f residuals are not shown bu t lie on the diagonal line, so it is clear tha t both schemes preserve some o f the nonlinearity in the data, which m ust derive from lags up to 10. Curiously the post-blackened scheme seems to preserve more.

Table 8.1 gives the predictive standard errors for the years 1980-1988 when the simple block resampling scheme with l = 10 is applied to the data for 1930-1979. Once data for 1930-1988 have been generated, the procedure outlined in Example 8.3 is used to select, fit, and predict from an autoregressive model. Owing to the joins between blocks, the standard errors are much larger than for the other schemes, including the post-blackened one with l = 10, which gives results similar to but somewhat more variable than the model-based bootstraps. Unadorned block resampling seems inappropriate for assessing prediction error, as one would expect. ■

Choice of block length

Suppose that we want to use the block bootstrap to estimate some feature κ based on a series of length n. An example would be the standard error of the series average, as in the third column of Table 8.2. Different block lengths l result in different bootstrap estimates κ̂(n, l). Which should we use?

A key result is that under suitable assumptions and for large n and l the


mean squared error of κ̂(n, l) is proportional to

\[
n^{-d}\left( C_1 l^{-2} + C_2\, l^{c} n^{-1} \right), \tag{8.9}
\]

where C_1 and C_2 depend only on κ and the dependence structure of the series. In (8.9) d = 2, c = 1 if κ is a bias or variance, d = 1, c = 2 if κ is a one-sided significance probability, and d = 2, c = 3 if κ is a two-sided significance probability. The justification for (8.9) when κ is a bias or a variance is discussed after the next example. The implication of (8.9) is that for large n, the mean squared error of κ̂(n, l) is minimized by taking l ∝ n^{1/(c+2)}, but we do not know the constant of proportionality. However, it can be estimated as follows.

We guess an initial value of l, and simulate to obtain κ̂(n, l). We then take m < n and k < l and calculate the values of κ̂_j(m, k) from the n − m + 1 subseries y_j, ..., y_{j+m−1}, for j = 1, ..., n − m + 1. The estimated mean squared error for κ̂(m, k) from a series of length m with block size k is then

\[
\mathrm{MSE}(m, k) = \frac{1}{n-m+1}\sum_{j=1}^{n-m+1}\left\{\hat\kappa_j(m,k) - \hat\kappa(n,l)\right\}^2.
\]

By repeating this procedure for different values of k but the same m, we obtain the value k̂ for which MSE(m, k) is minimized. We then choose

\[
\hat{l} = \hat{k} \times (n/m)^{1/(c+2)} \tag{8.10}
\]

as the optimum block length for a series of length n, and calculate κ̂(n, l̂). This procedure eliminates the constant of proportionality. We can check on the adequacy of l̂ by repeating the procedure with initial value l = l̂, iterating if necessary.
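The selection rule can be sketched in a few lines of Python. Here kappa_hat, the function that computes the block bootstrap estimate from a series with a given block length, is an assumed helper, and for simplicity all n − m + 1 subseries are used rather than a random selection of them.

    import numpy as np

    def choose_block_length(y, kappa_full, kappa_hat, m, ks, c):
        # kappa_full is kappa_hat(n, l) from the full series at the
        # initial guess l; kappa_hat(series, k) returns the block
        # bootstrap estimate from a (sub)series with block length k.
        n = len(y)
        mse = {k: np.mean([(kappa_hat(y[j:j + m], k) - kappa_full) ** 2
                           for j in range(n - m + 1)])
               for k in ks}
        k_best = min(mse, key=mse.get)              # minimizes MSE(m, k)
        return k_best * (n / m) ** (1 / (c + 2))    # rule (8.10)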

Figure 8.8 Distributions of nonlinearity statistic for block resampling schemes applied to sunspot data. The left panel shows R = 999 replicates of a test statistic for nonlinearity, based on detecting nonlinearity at up to 20 lags, for the block bootstrap with l = 10. The right panel shows the corresponding plot for the post-blackened bootstrap using the AR(9) model. (Horizontal axes: quantiles of the F distribution.)


Figure 8.9 Ten-year running average of Manaus data (left), together with Abelson-Tukey coefficients (right) (Abelson and Tukey, 1963).


The minimum asymptotic mean squared error is n^{−d−2/(c+2)}(C_1 + C_2), so if the block length selection procedure is applicable,

\[
A(m) = \log\left\{\min_k \mathrm{MSE}(m,k)\right\} + \left\{d + 2/(c+2)\right\}\log m
\]

should be approximately independent of m. This suggests that values of A(m) for different m should be compared as a check on the asymptotics.

Example 8.6 (Rio Negro data) There is concern that river heights at Manaus may be increasing due to deforestation, so we test for trend in the river series, a ten-year running average of which is shown in the left panel of Figure 8.9. There may be an upward trend, but it is hard to say whether the effect is real. To proceed, we suppose that the data consist of a stationary time series to which has been added a monotonic trend. Our test statistic is T = Σ_{j=1}^{n} a_j Y_j, where the coefficients

\[
a_j = \left[(j-1)\{1-(j-1)/n\}\right]^{1/2} - \left\{j(1-j/n)\right\}^{1/2}
\]

are optimal for detecting a monotonic trend in independent observations. The plot of the a_j in the right panel of Figure 8.9 shows that T strongly contrasts the ends of the series. We can think of T as almost being a difference of averages for the two ends of the series, and this falls into the class of statistics for which the method of choosing the block length described above is appropriate. Resampling blocks of blocks would not be appropriate here. The value of T for the full series is 7.908. Is this significantly large?
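A short Python sketch of the statistic, using the Abelson-Tukey coefficients given above:

    import numpy as np

    def abelson_tukey(y):
        # Maximin linear trend statistic T = sum_j a_j y_j;
        # the a_j strongly contrast the two ends of the series.
        n = len(y)
        j = np.arange(1, n + 1)
        a = np.sqrt((j - 1) * (1 - (j - 1) / n)) - np.sqrt(j * (1 - j / n))
        return np.sum(a * y)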

To simulate data under the null hypothesis of no trend, we use the stationary


Figure 8.10 Estimated variances of T* for Rio Negro data, plotted against block length, for stationary (solid) and block (dots) bootstraps. The left panel is for 1903-1912 (R = 999), the right panel is for the whole series (R = 199).

bootstrap with wrapping to generate new series Y*. We initially apply this to the shorter series of length 120, adjusted to have mean zero, for which T takes value 0.654. Under the null hypothesis the mean of T* = Σ_j a_j Y_j* is zero and the distribution of T* will be close to normal. We estimate its variance by taking the empirical variance of values T* generated with the stationary bootstrap. The left panel of Figure 8.10 shows these variances κ̂(n, l) based on different mean block lengths l, for both stationary and block bootstraps. The stationary bootstrap smooths the variances for different fixed block lengths, resulting in a fairly stable variance for l ≥ 6 or so. Variances of T* based on the block bootstrap are more variable and increase to a higher eventual value. The variances for the full series are larger and more variable.
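A minimal sketch of the stationary bootstrap with wrapping, as used to generate the series Y* here: block starts are uniform on the circularly extended series, and block lengths are geometric with the given mean, so the resampled series is stationary.

    import numpy as np

    def stationary_bootstrap(y, mean_length, rng):
        n = len(y)
        out = np.empty(n)
        i = 0
        while i < n:
            start = rng.integers(n)                         # uniform start
            length = min(rng.geometric(1 / mean_length), n - i)
            idx = (start + np.arange(length)) % n           # wrap around
            out[i:i + length] = y[idx]
            i += length
        return out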

In order to choose the block length l, we took 50 randomly selected subseries of m consecutive observations from the series with n = 120, and for each value of k = 2, ..., 20 calculated values of κ̂(m, k) from R = 50 stationary bootstrap replicates. The left part of Table 8.3 shows the values k̂ that minimize the mean squared error for different possible values of κ̂(n, l). Note that the values of k̂ do not broadly increase with m, as the theory would predict. For smaller values of κ̂(n, l) the values of k̂ vary considerably, and even for κ̂(n, l) = 30 the corresponding values of l̂ as given by (8.10) with c = 1 and d = 2 vary from 12 to 20. The left panel of Figure 8.10 shows that for l in this range, the variance κ̂(n, l) takes value roughly 25. For κ̂(n, l) = 25, Table 8.3 gives l̂ in the range 8-20, so overall we take κ̂(n, l) = 25 based on the stationary bootstrap.

The right part of Table 8.3 gives the values of k̂ when the block bootstrap with wrapping is used. The series so generated are not exactly stationary, but are nearly so. Overall the values are more consistent than for the stationary


Table 8.3 Estimated values of k̂ for Rio Negro data, 1903-1912, based on the stationary bootstrap with mean block length k applied to 50 subseries of length m (left figures) and the block bootstrap with block length k applied to 50 subseries of length m (right figures).

               Stationary, m                 Block, m
  κ̂(n, l)    20   30   40   50   60   70    20   30   40   50   60   70

  15          10    6    3    2    2    2     4    3   18    2    2    2
  17.5        11   18    3    3    3    2     4   10   16    3    3    3
  20          11   18    6    3    3    3     4    5    4    6    3    4
  22.5        11   18    6    6    5    4     4    5    5    6    4    4
  25          11   11   12    6    7    8     4    5    5    6    5    5
  27.5        11   11   14    9   10    8     4    5    5    9    6    8
  30          11   11   14    9   10   11     4    5    5    9    6    8

bootstrap, with broadly increasing values of k̂ within each row, provided κ̂(n, l) ≥ 20. For these values of κ̂(n, l), the values of k̂ suggest that l̂ lies in the range 5-8, giving κ̂(n, l) = 25 or slightly less. Thus both the stationary and the block bootstrap suggest that the variance of T is roughly 25, and since t = 0.654, there is no evidence of trend in the first ten years of data.

For the stationary bootstrap, the values of A(m) have smallest variance for κ̂(n, l) = 22.5, when they are 13.29, 13.66, 14.18, 14.01, 13.99 and 13.59 for m = 20, ..., 70. For the block bootstrap the variance is smallest when κ̂(n, l) = 27.5, when the values are 13.86, 14.25, 14.63, 14.69, 14.73 and 14.44. However, the minimum mean squared error shows no obvious pattern for any value of κ̂(n, l), and it seems that the asymptotics apply adequately well here.

Overall Table 8.3 suggests that a range of values of m should be used, and that results for different m are more consistent for the block than for the stationary bootstrap. For given values of m and k, the variances κ̂_j(m, k) have approximate gamma distributions, but calculation of their mean squared error on the variance-stabilizing log scale does little to improve matters.

For the stationary bootstrap applied to the full series, we take l in the range (8, 20) × (1080/120)^{1/3} = (17, 42), which gives variances 46-68, with average variance roughly 55. The corresponding range of l for the block bootstrap is 10-17, which gives variances κ̂(n, l) in the range 43-53 or so, with average value 47. In either case the lowest reasonable variance estimate is about 45. Since the value of t for the full series is 7.9, an approximate significance level for the hypothesis of no trend based on a normal approximation to T* is 1 − Φ(7.9/45^{1/2}) = 0.12. The evidence for trend based on the monthly data is thus fairly weak. ■

Some block theory

In order to gain some theoretical insight into block resampling and the fundamental approximation (8.9) which guides the choice of l, we examine the estimation of bias and variance for a special class of statistics.


Consider a stationary time series {Y_j} with mean μ and covariances γ_j = cov(Y_0, Y_j), and suppose that the parameter of interest is θ = h(μ). The obvious estimator of θ based on Y_1, ..., Y_n is T = h(Ȳ), where Ȳ is the average of Y_1, ..., Y_n; its bias and variance are

\[
\beta = E\{h(\bar Y) - h(\mu)\} \doteq \tfrac12 h''(\mu)\,\mathrm{var}(\bar Y), \qquad
v = \mathrm{var}\{h(\bar Y)\} \doteq h'(\mu)^2\,\mathrm{var}(\bar Y), \tag{8.11}
\]

by the delta method of Section 2.7.1. Note that

\[
\mathrm{var}(\bar Y) = n^{-2}\left\{n\gamma_0 + 2(n-1)\gamma_1 + \cdots + 2\gamma_{n-1}\right\} = n^{-2}\zeta_0^{(n)},
\]

say, and that as n→∞,

\[
n^{-1}\zeta_0^{(n)} = \gamma_0 + 2\sum_{j=1}^{n-1}(1 - j/n)\gamma_j \to \sum_{j=-\infty}^{\infty}\gamma_j = \zeta,
\]

say.

Therefore β ∼ ½h''(μ)n^{−1}ζ and v ∼ h'(μ)²n^{−1}ζ for large n. Now suppose that we estimate β and v by simple block resampling, with b non-overlapping blocks of length l, where n = bl, and use S_j to denote the average l^{−1}(Y_{(j−1)l+1} + ⋯ + Y_{jl}) of the jth block, for j = 1, ..., b. Thus S̄ = Ȳ, and Ȳ* = b^{−1} Σ_{j=1}^{b} S_j*, where the S_j* are sampled independently from S_1, ..., S_b. The bootstrap estimates of the bias and variance of T are

\[
\hat\beta = E^*\{h(\bar Y^*) - h(\bar Y)\} \doteq h'(\bar Y)E^*(\bar Y^* - \bar Y) + \tfrac12 h''(\bar Y)E^*\{(\bar Y^* - \bar Y)^2\},
\]
\[
\hat v = \mathrm{var}^*\{h(\bar Y^*)\} \doteq h'(\bar Y)^2\,\mathrm{var}^*(\bar Y^*). \tag{8.12}
\]

What we want to know is how the accuracies of β̂ and v̂ vary with l. Since the blocks are non-overlapping,

\[
E^*(\bar Y^*) = \bar S, \qquad \mathrm{var}^*(\bar Y^*) = b^{-2}\sum_{j=1}^{b}(S_j - \bar S)^2.
\]

It follows by comparing (8.11) and (8.12) that the means of β̂ and v̂ will be asymptotically correct provided that when n is large, E{b^{−2} Σ(S_j − S̄)²} ∼ n^{−1}ζ. This will be so because Σ(S_j − S̄)² = Σ(S_j − μ)² − b(S̄ − μ)² has mean

\[
b\,\mathrm{var}(S_1) - b\,\mathrm{var}(\bar S) = b\left(l^{-2}\zeta_0^{(l)} - n^{-2}\zeta_0^{(n)}\right) \sim b l^{-1}\zeta
\]

if l→∞ and l/n→0 as n→∞. To calculate approximations for the mean squared errors of β̂ and v̂ requires more careful calculations and involves the variance of Σ(S_j − S̄)². This is messy in general, but the essential points remain under the simplifying assumptions that {Y_j} is an m-dependent normal process. In this case γ_{m+1} = γ_{m+2} = ⋯ = 0, and the third and higher cumulants of the



process are zero. Suppose also that m < l. Then the variance of Σ(S_j − S̄)² is approximately

\[
\mathrm{var}\left\{\sum(S_j - \bar S)^2\right\} \doteq b\,\mathrm{var}\{(S_1 - \mu)^2\} + 2(b-1)\,\mathrm{cov}\{(S_1 - \mu)^2, (S_2 - \mu)^2\}.
\]

For normal data,

\[
\mathrm{var}\{(S_1 - \mu)^2\} = 2\{\mathrm{var}(S_1 - \mu)\}^2, \qquad
\mathrm{cov}\{(S_1 - \mu)^2, (S_2 - \mu)^2\} = 2\{\mathrm{cov}(S_1 - \mu, S_2 - \mu)\}^2,
\]

so

\[
\mathrm{var}\left\{\sum(S_j - \bar S)^2\right\} \doteq 2b\left(l^{-2}\zeta_0^{(l)}\right)^2 + 4b\left(l^{-2}\zeta_1^{(l)}\right)^2,
\]

where under suitable conditions on the process,

\[
\zeta_1^{(l)} = \gamma_1 + 2\gamma_2 + \cdots + l\gamma_l \to \sum_{j=1}^{\infty} j\gamma_j = \tau,
\]

say. After a delicate calculation we find that

\[
E(\hat\beta) - \beta \sim -\tfrac12 h''(\mu) \times n^{-1}l^{-1}\tau, \qquad
\mathrm{var}(\hat\beta) \sim \{\tfrac12 h''(\mu)\}^2 \times 2 l n^{-3}\zeta^2, \tag{8.13}
\]
\[
E(\hat v) - v \sim -h'(\mu)^2 \times n^{-1}l^{-1}\tau, \qquad
\mathrm{var}(\hat v) \sim h'(\mu)^4 \times 2 l n^{-3}\zeta^2, \tag{8.14}
\]

thus establishing that the mean squared errors of β̂ and v̂ are of the form (8.9).

This development can clearly be extended to multivariate time series, and

thence to more complicated parameters of a single series. For example, for the first-order correlation coefficient of the univariate series {X_j}, we would apply the argument to the trivariate series {Y_j} = {(X_j, X_j², X_j X_{j−1})} with mean μ = (μ_1, μ_{11}, μ_{12}) and set θ = h(μ_1, μ_{11}, μ_{12}) = (μ_{12} − μ_1²)/(μ_{11} − μ_1²).

When overlapping blocks are resampled, the argument is similar but the

details change. If the data are not wrapped around a circle, there are n − l + 1 blocks with averages S_j = l^{−1}(Y_j + ⋯ + Y_{j+l−1}), and

\[
E^*(\bar Y^* - \bar Y) = \frac{1}{l(n-l+1)}\left\{ l(l-1)\bar Y - \sum_{k=1}^{l-1}(l-k)\left(Y_k + Y_{n-k+1}\right) \right\}. \tag{8.15}
\]

In this case the leading term of the expansion for β̂ is the product of h'(Ȳ) and the right-hand side of (8.15), so the bootstrap bias estimate for Ȳ as an estimator of θ = μ is non-zero, which is clearly misleading since E(T) = μ. With overlapping blocks, the properties of the bootstrap bias estimator depend on E*(Ȳ*) − Ȳ, and it turns out that its variance is an order of magnitude larger than for non-overlapping blocks. This difficulty can be removed by wrapping Y_1, ..., Y_n around a circle and using n blocks, in which case E*(Ȳ*) = Ȳ, or by re-centring the bootstrap bias estimate to β̂ = E*{h(Ȳ*)} − h{E*(Ȳ*)}. In either case (8.13) and (8.14) apply. One asymptotic benefit of using overlapping


blocks when the re-centred estimator is used is that var(β̂) and var(v̂) are reduced by a factor 2/3, though in practice the reduction may not be visible for small n.

The corresponding argument for tail probabilities involves Edgeworth expansions and is considerably more intricate than that sketched above.

Apart from smoothness conditions on h(·), the key requirement for the above argument to work is that τ and ζ be finite, and that the autocovariances decrease sharply enough for the various terms neglected to be negligible. This is the case if γ_j ∼ a^j for sufficiently large j and some a with |a| < 1, as is the case for stationary finite ARMA processes. However, if for large j we find that γ_j ∼ j^{−δ}, where ½ < δ < 1, then ζ and τ are not finite and the argument will fail. In this case g(ω) ∼ ω^{δ−1} for small ω, so long-range dependence of this sort is characterized by a pole in the spectrum at the origin, where ζ = g(0) is the value of the spectrum. The data counterpart of this is a sharp increase in periodogram ordinates at small values of ω. Thus a careful examination of the periodogram near the origin and of the long-range correlation structure is essential before applying the block bootstrap to data.

8.2.4 Phase scrambling

Recall the basic stochastic properties of the empirical Fourier transform of a series y_0, ..., y_{n−1} of length n = 2n_F + 1: for large n and under certain conditions on the process generating the data, the transformed values ỹ_k for k = 1, ..., n_F are approximately independent, and their real and imaginary parts are approximately independent normal variables with means zero and variances ng(ω_k)/2, where ω_k = 2πk/n. The approximate independence of ỹ_1, ..., ỹ_{n_F} suggests that, provided the conditions on the underlying process are met, the frequency domain is a better place to look for exchangeable components than the time domain. Expression (8.4) shows that the spectrum summarizes the covariance structure of the process {Y_j}, and correspondingly the periodogram values I(ω_k) = |ỹ_k|²/n summarize the second-order structure of the data, which as far as possible we should preserve when resampling. This suggests that we generate resamples by keeping fixed the moduli |ỹ_k|, but randomizing their phases U_k = arg ỹ_k, which anyway are asymptotically uniformly distributed on the interval [0, 2π), independent of the |ỹ_k|. This phase scrambling can be done in a variety of ways, one of which is the following.

Algorithm 8.1 (Phase scrambling)

1 Compute from the data y_0, ..., y_{n−1} the empirical Fourier transform

\[
\tilde e_k = \sum_{j=0}^{n-1} \zeta^{jk}(y_j - \bar y), \qquad k = 0, \ldots, n-1,
\]

where ζ = exp(2πi/n).

Page 421: Bootstrap Methods and Their Application

8.2 ■ Time Series 409

2 Set X_k = exp(iU_k)ẽ_k, k = 0, ..., n−1, where the U_k are independent variables uniform on [0, 2π).

3 Set

\[
\tilde e_k^* = 2^{-1/2}\left(X_k + X^c_{n-k}\right), \qquad k = 0, \ldots, n-1,
\]

where superscript c denotes complex conjugate and we take X_n = X_0.
4 Apply the inverse Fourier transform to ẽ_0*, ..., ẽ*_{n−1} to obtain

\[
Y_j^* = \bar y + n^{-1}\sum_{k=0}^{n-1} \zeta^{-jk}\tilde e_k^*, \qquad j = 0, \ldots, n-1.
\]

5 Calculate the bootstrap statistic T* from Y_0*, ..., Y*_{n−1}.

Step 3 guarantees that ẽ_k* has complex conjugate ẽ*_{n−k}, and therefore that the bootstrap series Y_0*, ..., Y*_{n−1} is real. An alternative to step 2 is to resample the U_k from the observed phases.
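Algorithm 8.1 is easily coded. The sketch below uses numpy's FFT, whose sign convention for ζ is opposite to that above; this only relabels the frequencies and does not affect the scramble.

    import numpy as np

    def phase_scramble(y, rng):
        n = len(y)
        ybar = y.mean()
        e = np.fft.fft(y - ybar)                           # step 1
        x = np.exp(1j * rng.uniform(0, 2 * np.pi, n)) * e  # step 2
        x_rev = np.conj(np.roll(x[::-1], 1))               # X^c_{n-k}, X_n = X_0
        e_star = (x + x_rev) / np.sqrt(2)                  # step 3
        return ybar + np.fft.ifft(e_star).real             # step 4 (ifft has 1/n)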

The bootstrap series always has average ȳ, which implies that phase scrambling should be applied only to statistics that are invariant to location changes of the original series; in fact it is useful only for linear contrasts of the y_j, as we shall see below. It is straightforward to see that

\[
Y_j^* = \bar y + \frac{2^{1/2}}{n}\sum_{l=0}^{n-1}(y_l - \bar y)\sum_{k=0}^{n-1}\cos\left\{2\pi k(l-j)/n + U_k\right\}, \qquad j = 0, \ldots, n-1, \tag{8.16}
\]

from which it follows that the bootstrap data are stationary, with covariances equal to the circular covariances of the original series, and that all their odd joint cumulants equal zero (Problem 8.4). This representation also makes it clear that the resampled series will be essentially linear with normal margins.

The difference between phase scrambling and model-based resampling can be deduced from Algorithm 8.1. Under phase scrambling,

\[
|\tilde y_k^*|^2 = |\tilde y_k|^2\left\{1 + \cos\left(U_k + U_{n-k}\right)\right\}, \tag{8.17}
\]

which gives

\[
E^*\left(|\tilde y_k^*|^2\right) = |\tilde y_k|^2, \qquad \mathrm{var}^*\left(|\tilde y_k^*|^2\right) = \tfrac12|\tilde y_k|^4.
\]

Under model-based resampling the approximate distribution of n^{−1}|ỹ_k*|² is ĝ(ω_k)X*, where ĝ(·) is the spectrum of the fitted model and X* has a standard exponential distribution; this gives

\[
E^*\left(n^{-1}|\tilde y_k^*|^2\right) = \hat g(\omega_k), \qquad \mathrm{var}^*\left(n^{-1}|\tilde y_k^*|^2\right) = \hat g(\omega_k)^2.
\]

Clearly these resampling schemes will give different results unless the quantities of interest depend only on the means of the |ỹ_k*|², i.e. are essentially quadratic

Page 422: Bootstrap Methods and Their Application

410 8 ■ Complex Dependence

Figure 8.11 Three time series generated by phase scrambling the shorter Rio Negro data.

in the data. Since the quantity of interest must also be location-invariant, this restricts the domain of phase scrambling to such tasks as estimating the variances of linear contrasts in the data.

Example 8.7 (Rio Negro data) We assess empirical properties of phase scrambling using the first 120 months of the Rio Negro data, which we saw previously were well fitted by an AR(2) model with normal errors. Note that our statistic of interest, T = Σ a_j Y_j, has the necessary structure for phase scrambling not automatically to fail.

Figure 8.11 shows three phase scrambled datasets, which look similar to the AR(2) series in the second row of Figure 8.7.

The top panels of Figure 8.12 show the empirical Fourier transform for the original data and for one resample. Phase scrambling seems to have shrunk the moduli of the series towards zero, giving a resampled series with lower overall variability. The lower left panel shows smoothed periodograms for the original data and for 9 phase scrambled resamples, while the right panel shows corresponding results for simulation from the fitted AR(2) model. The results are quite different, and show that data generated by phase scrambling are less variable than those generated from the fitted model.

Resampling with 999 series generated from the fitted AR(2) model and by phase scrambling, the distribution of T* is close to normal under both schemes but it is less variable under phase scrambling; the estimated variances are 27.4 and 20.2. These are similar to the estimates of about 27.5 and 22.5 obtained using the block and stationary bootstraps.

Before applying phase scrambling to the full series, we must check that it shows no sign of nonlinearity or of long-range dependence, and that it is plausibly close to a linear series with normal errors. With m = 20 the nonlinearity statistic described in Example 8.3 takes value 0.015, and no value for m ≤ 30 is greater than 0.84: this gives no evidence that the series is nonlinear. Moreover the periodogram shows no signs of a pole as ω→0+, so long-range dependence seems to be absent. An AR(8) model fits the series well, but the residuals have heavier tails than the normal distribution, with kurtosis 1.2. The variance of T* under phase scrambling is about 51, which

Page 423: Bootstrap Methods and Their Application

8.2 • Time Series 411

Figure 8.12 Phase scrambling for the shorter Rio Negro data. The upper left panel shows an Argand diagram containing the empirical Fourier transform ỹ_k of the data, with phase scrambled ỹ_k* in the upper right panel. The lower panels show smoothed periodograms for the original data (heavy solid), 9 phase scrambled datasets (left) and 9 datasets generated from an AR(2) model (right); the theoretical AR(2) spectrum is the lighter solid line.


again is similar to the estimates from the block resampling schemes. Although this estimate may be untrustworthy, on the face of things it casts no doubt on the earlier conclusion that the evidence for trend is weak. ■

The discussion above suggests that not only should phase scrambling be confined to statistics that are linear contrasts, but also that it should be used only after careful scrutiny of the data to detect nonlinearity and long-range dependence. With non-normal data there is the further difficulty that the Fourier transform and its inverse are averaging operations, which can produce resampled data quite unlike the original series; see Problem 8.4 and Practical 8.3. In particular, when phase scrambling is used in a test of the null

Page 424: Bootstrap Methods and Their Application

412 8 ■ Complex Dependence

hypothesis of linearity, it imposes on the distribution of the scrambled data the additional constraints of stationarity and a high degree of symmetry.

8.2.5 Periodogram resampling

Like time domain resampling methods, phase scrambling generates an entire new dataset. This is unnecessary for such problems as setting a confidence interval for the spectrum at a particular frequency or for assessing the variability of an estimate that is based on periodogram values. There are well-established limiting results for the distributions of periodogram values, which under certain conditions are asymptotically independent exponential random variables, and this suggests that we somehow resample periodogram values.

The obvious approach is to note that if g†(ω_k) is a suitable consistent estimate of g(ω_k) based on data y_0, ..., y_{n−1}, where n = 2n_F + 1, then for k = 1, ..., n_F the residuals e_k = I(ω_k)/g†(ω_k) are approximately standard exponential variables. This suggests that we generate bootstrap periodogram values by setting I*(ω_k) = g̃(ω_k)e_k*, where g̃(ω_k) is also a consistent estimate of g(ω_k), and the e_k* are sampled randomly from the set (e_1/ē, ..., e_{n_F}/ē), where ē is the average of e_1, ..., e_{n_F}; this ensures that E*(e_k*) = 1. The choice of g†(ω) and g̃(ω) is discussed below. Such a resampling scheme will only work in special circumstances. To see why, we consider estimation of θ = ∫ a(ω)g(ω) dω by a statistic that can be written in the form

\[
T = \frac{\pi}{n_F}\sum_{k=1}^{n_F} a_k I_k,
\]

where I_k = I(ω_k), a_k = a(ω_k), and ω_k is the kth Fourier frequency. For a linear process

\[
Y_j = \sum_{i=-\infty}^{\infty} b_i \varepsilon_{j-i},
\]

where {ε_j} is a stream of independent and identically distributed random variables with standardized fourth cumulant κ_4, the means and covariances of the I_k are approximately

\[
E(I_k) = g(\omega_k), \qquad \mathrm{cov}(I_k, I_l) = g(\omega_k)g(\omega_l)\left(\delta_{kl} + n^{-1}\kappa_4\right), \tag{8.18}
\]

where δ_{kl} is the Kronecker delta symbol, which equals one if k = l and zero otherwise.

From this it follows that under suitable conditions,

\[
E(T) \doteq \int a(\omega)g(\omega)\,d\omega,
\]
\[
\mathrm{var}(T) \doteq n^{-1}\left[2\pi\int a^2(\omega)g^2(\omega)\,d\omega + \kappa_4\left\{\int a(\omega)g(\omega)\,d\omega\right\}^2\right].
\]


Page 425: Bootstrap Methods and Their Application

8.2 ■ Time Series 413

The bootstrap analogue of T is T* = π n_F^{−1} Σ_k a_k I_k*, and under the resampling scheme described above this has mean and variance

\[
E^*(T^*) \doteq \int a(\omega)\tilde g(\omega)\,d\omega, \qquad \mathrm{var}^*(T^*) \doteq 2\pi n^{-1}\int a^2(\omega)\tilde g^2(\omega)\,d\omega.
\]

For var*(T*) to converge to var(T) it is therefore necessary that κ_4 = 0 or that κ_4{∫ a(ω)g(ω) dω}² be asymptotically negligible relative to the first variance term. A process with normal innovations will have κ_4 = 0, but since this cannot be ensured in general the structure of T must be examined carefully before this resampling scheme is applied; see Problem 8.6. One situation where it can be applied is kernel estimation of the spectral density g(·), as we now see.
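In code the basic scheme is very simple. A minimal sketch, assuming arrays I, g_dagger and g_tilde holding I(ω_k), g†(ω_k) and g̃(ω_k) at the Fourier frequencies:

    import numpy as np

    def periodogram_resample(I, g_dagger, g_tilde, rng):
        e = I / g_dagger              # approximately standard exponential
        e = e / e.mean()              # rescale so that E*(e*_k) = 1
        e_star = rng.choice(e, size=len(e), replace=True)
        return g_tilde * e_star       # bootstrap periodogram I*(w_k)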

Example 8.8 (Spectral density estimation) Suppose that our goal is inference for the spectral density g(η) at some η in the interval (0, π), and let our estimate of g(η) be

\[
T = \frac{\pi}{n_F h}\sum_{k=1}^{n_F} K\left(\frac{\eta - \omega_k}{h}\right) I_k,
\]

where K(·) is a symmetric PDF with mean zero and unit variance and h is a positive smoothing parameter. Then

\[
E(T) \doteq \int h^{-1}K\left(\frac{\eta - \omega}{h}\right)g(\omega)\,d\omega \doteq g(\eta) + \tfrac12 h^2 g''(\eta),
\]
\[
\mathrm{var}(T) \doteq \frac{2\pi}{nh}\{g(\eta)\}^2\int K^2(u)\,du + \frac{\kappa_4}{n}\left\{\int h^{-1}K\left(\frac{\eta - \omega}{h}\right)g(\omega)\,d\omega\right\}^2.
\]

Since we must have h→0 as n→∞ in order to remove the bias of T, the second term in the variance is asymptotically negligible relative to the first term, as is necessary for the resampling scheme outlined above to work with a time series for which κ_4 ≠ 0. Comparison of the variance and bias terms implies that the asymptotic form of the relative mean squared error for estimation of g(η) is minimized by taking h ∝ n^{−1/5}. However, there are two difficulties in using resampling to make inference about g(η) from T.

The first difficulty is analogous to that seen in Example 5.13, and appears on comparing T and its bootstrap analogue

\[
T^* = \frac{\pi}{n_F h}\sum_{k=1}^{n_F} K\left(\frac{\eta - \omega_k}{h}\right) I_k^*.
\]

We suppose that I_k* is generated using a kernel estimate g̃(ω_k) with smoothing parameter h̃. The standardized versions of T and T* are

\[
Z = (nhc)^{1/2}\,\frac{T - g(\eta)}{g(\eta)}, \qquad Z^* = (nhc)^{1/2}\,\frac{T^* - \tilde g(\eta)}{\tilde g(\eta)},
\]


where c = {2π ∫ K²(u) du}^{−1}. These have means

\[
E(Z) = (nhc)^{1/2}\,\frac{E(T) - g(\eta)}{g(\eta)}, \qquad E^*(Z^*) = (nhc)^{1/2}\,\frac{E^*(T^*) - \tilde g(\eta)}{\tilde g(\eta)}.
\]

Considerations similar to those in Example 5.13 show that E*(Z*) ∼ E(Z) if h̃→0 in such a way that h/h̃→0 as n→∞.

The second difficulty concerns the variances of Z and Z*, which will both be approximately one if the rescaled residuals e_k have the same asymptotic distribution as the "errors" I_k/g(ω_k). For this to happen with g†(ω) a kernel estimate, it must have smoothing parameter h† ∝ n^{−1/4}. That is, asymptotically g†(ω) must be undersmoothed compared to the estimate that minimizes the asymptotic relative mean squared error of T.

Thus the application of the bootstrap outlined above involves three kernel density estimates: the original, ĝ(ω), with h ∝ n^{−1/5}; a surrogate g̃(ω) for g(ω) used when generating bootstrap spectra, with smoothing parameter h̃ asymptotically larger than h; and g†(ω), from which residuals are obtained, with smoothing parameter h† ∝ n^{−1/4} asymptotically smaller than h. This raises substantial difficulties for practical application, which could be avoided by explicit correction to reduce the bias of T or by taking h asymptotically narrower than n^{−1/5}, in which case the limiting means of Z and Z* equal zero.

For a numerical assessment of this procedure, we consider estimating the spectrum g(ω) = {1 − 2α cos(ω) + α²}^{−1} of an AR(1) process with α = 0.9 at η = π/2. The kernel K(·) is the standard normal PDF. Table 8.4 compares the means and variances of Z with the average means and variances of Z* for 1000 time series of various lengths, with normal and χ² innovations. The first set of results has bandwidths h = an^{−1/5}, h† = an^{−1/4}, and h̃ = an^{−1/6}, with a chosen to minimize the asymptotic relative mean squared error of ĝ(η).

Even for time series of length 1025, the means and variances of Z and Z* can be quite different, with the variances more sensitive to the distribution of innovations. For the second block of numbers we took a non-optimal bandwidth h = an^{−1/4}, and h† = h̃ = h. Although in this case the true and bootstrap moments agree better for normal innovations, the results for chi-squared innovations are almost as bad as previously, and it would be unwise to rely on the results even for fairly long series.

Mean and variance only summarize limited aspects of the distributions, and for a more detailed comparison we compare 1000 values of Z and of Z* for a particular series of length 257. The left panel of Figure 8.13 shows that the Z* are far from normally distributed, while the right panel compares the simulated Z* and Z. Although Z* captures the shape of the distribution of Z quite well, there is a clear difference in their means and variances, and confidence intervals for g(η) based on Z* can be expected to be poor. ■


Table 8.4 Comparison of actual and bootstrap means and variances for a standardized kernel spectral density estimate Z. For the means the upper figure is the average of Z from 1000 AR(1) time series with α = 0.9 and length n, and the lower figure is the average of E*(Z*) for those series; for the variances the upper and lower figures are estimates of var(Z) and E{var*(Z*)}. The upper 8 lines of results are for h ∝ n^{−1/5}, h† ∝ n^{−1/4}, and h̃ ∝ n^{−1/6}; for the lower 8 lines h = h† = h̃ ∝ n^{−1/4}.

  Innovations                 n = 65   129   257   513   1025    ∞

  Normal       Mean     Z       1.4   0.9   0.8   0.7   0.6    0.5
                        Z*      2.0   1.7   1.3   1.0   0.8
               Variance Z       2.5   1.5   1.3   1.1   1.1    1.0
                        Z*      2.7   2.0   1.7   1.5   1.3
  Chi-squared  Mean     Z       1.2   1.0   0.8   0.7   0.7    0.5
                        Z*      2.1   1.7   1.3   1.0   0.8
               Variance Z       6.9   4.9   3.8   3.1   2.7    1.0
                        Z*      2.8   2.0   1.6   1.4   1.3

  Normal       Mean     Z       0.9   0.5   0.5   0.3   0.2    0.0
                        Z*      0.6   0.4   0.3   0.3   0.2
               Variance Z       2.3   1.3   1.1   1.1   1.0    1.0
                        Z*      1.5   1.4   1.4   1.3   1.3
  Chi-squared  Mean     Z       1.0   0.6   0.5   0.4   0.3    0.0
                        Z*      0.7   0.4   0.3   0.3   0.2
               Variance Z       5.6   3.7   3.1   2.5   2.2    1.0
                        Z*      1.4   1.4   1.4   1.3   1.2

Figure 8.13 Comparison of distributions of Z and Z* for time series of length 257. The left panel shows a normal plot of 1000 values of Z*. The right panel compares the distributions of Z and Z*.


8.3 Point Processes

8.3.1 Basic ideas

A point process is a collection of events in a continuum. Examples are times of arrivals at an intensive care unit, positions of trees in a forest, and epicentres


of earthquakes. Mathematical properties of such processes are determined by the joint distribution of the numbers of events in subsets of the continuum. Statistical analysis is based on some notion of repeatability, usually provided by assumptions of stationarity.

Let N(A) denote the number of events in a set A. A point process is stationary if Pr{N(A_1) = n_1, ..., N(A_k) = n_k} is unaffected by applying the same translation to all the sets A_1, ..., A_k, for any finite k. Under second-order stationarity only the first and joint second moments of the N(A_i) remain unchanged by translation.

For a stationary process E{N(A)} = λ|A|, where λ is the intensity of the process and |A| is the length, area, or volume of A. Second-order moment properties can be defined in various ways, with the most useful definition depending on the context.

The simplest stationary point process model is the homogeneous Poisson process, for which the random variables N(A_1), N(A_2) have independent Poisson distributions whenever A_1 and A_2 are disjoint. This completely random process is a natural standard with which to compare data, although it is rarely a plausible model. More realistic models of dependence can lead to estimation problems that seem analytically insuperable, and Monte Carlo methods are often used, particularly for spatial processes. In particular, simulation from fitted parametric models is often used as a baseline against which to judge data. This often involves graphical tests of the type outlined in Section 4.2.4.

In practice the process is observed only in a finite region. This can give rise to edge effects, which are increasingly severe in higher dimensions.

Example 8.9 (Caveolae) The upper left panel of Figure 8.14 shows the positions of n = 138 caveolae in a 500 unit square region, originally a 2.65 μm square of muscle fibre. The upper right panel shows a realization of a binomial process, for which n points were placed at random in the same region; this is an homogeneous Poisson process conditioned to have 138 events. The data seem to have fewer almost-coincident points than the simulation, but it is hard to be sure.

Spatial dependence is often summarized by K-functions. Suppose that the process is orderly and isotropic, i.e. multiple coincident events are precluded and joint probabilities are invariant under rotation as well as translation. Then a useful summary of spatial dependence is Ripley's K-function,

\[
K(t) = \lambda^{-1} E\left(\#\{\text{events within distance } t \text{ of an arbitrary event}\}\right), \qquad t > 0.
\]

The mean- and variance-stabilized function Z(t) = {K(t)/π}^{1/2} − t is sometimes used instead. For an homogeneous Poisson process, K(t) = πt². Empirical versions of K(t) must allow for edge effects, as made explicit in Example 8.12.

The solid line in the lower left panel of Figure 8.14 is the empirical version


Figure 8.14 Muscle caveolae analysis. Top left: positions of 138 caveolae in a 500 unit square of muscle fibre (Appleyard et al., 1985). Top right: realization of an homogeneous binomial process with n = 138. Lower left: Ẑ(t) (solid), together with pointwise 95% confidence bands (dashes) and overall 92% confidence bands (dots) based on R = 999 simulated binomial processes. Lower right: corresponding results for R = 999 realizations of a fitted Strauss process.


Ẑ(t) of Z(t). The dashed lines are pointwise 95% confidence bands from R = 999 realizations of the binomial process, and the dotted lines are overall bands with level about 92%, obtained by using the method outlined after (4.17) with k = 2. Relative to a Poisson process there is a significant deficiency of pairs of points lying close together, which confirms our previous impression.

The lower right panel of the figure shows the corresponding results for simulations from the Strauss process, a parametric model of interaction that can inhibit patterns in which pairs lie close together. This models the local behaviour of the data better than the stationary Poisson process. ■



Figure 8.15 Neurophysiological point process. The rows of the left panel show 100 replicates of the interval surrounding the times at which a human subject was given a stimulus; each point represents the time at which the firing of a neuron was observed. The right panels show a histogram and kernel intensity estimate (×10^{−2} ms^{−1}) from superposing the events on the left, which are shown by the rug in the lower right panel.

8.3.2 Inhomogeneous Poisson processes

The sampling plans used in the previous example both assume stationarity of the process underlying the data, and rely on simulation from fitted parametric models. Sometimes independent cases can be identified, in which case it may be possible to avoid the assumption of stationarity.

Example 8.10 (Neurophysiological point process) The data in Figure 8.15 were recorded by Dr S. J. Boniface of the Clinical Neurophysiology Unit at the Radcliffe Infirmary, Oxford, in a study of how a human subject responded to a stimulus. Each row of the left panel of the figure shows the times at which the firing of a motoneurone was observed, in an interval extending 250 ms either side of 100 applications of the stimulus, which is taken to be at time zero. Although little can be assumed about dependence within each interval, the stimulus was given far enough apart for firings in different intervals to be treated as independent. Firings occur at random about 100 ms apart prior to the stimulus, but on about one-third of occasions a firing is observed about 28 ms after it, and this partially synchronizes the firings immediately following.

Theoretical results imply that under mild conditions the process obtained by superposing all N = 100 intervals will be a Poisson process with time-varying intensity Nλ(y). Here it seems plausible that the conditions are met: for example, 90 of the 100 intervals contain four or fewer events, so the overall intensity is not dominated by any single interval. The superposed data have n = 389 events whose times we denote by y_j.


The right panels of Figure 8.15 show a histogram of the superposed data and a rescaled kernel estimate of the intensity λ(y) in units of 10^{−2} ms^{−1},

\[
\hat\lambda(y; h) = 100 \times (Nh)^{-1}\sum_{j=1}^{n} w\left(\frac{y - y_j}{h}\right),
\]

where w(·) is a symmetric density with mean zero and unit variance; we use the standard normal density with bandwidth h = 7.5 ms. Over the observation period this estimate integrates to 100n/N. The estimated intensity is highly variable and it is unclear which of its features are spurious. We can try to construct a confidence region for λ(y) at a set of y values of interest, but the same problems arise as in Examples 5.13 and 8.8.
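A sketch of the estimate in Python, with w(·) the standard normal density; events is assumed to hold the n = 389 superposed firing times in ms, and y_grid the evaluation points.

    import numpy as np

    def intensity(y_grid, events, h, N=100):
        # lambda_hat(y; h) = 100 (N h)^{-1} sum_j w((y - y_j)/h),
        # in units of 10^{-2} per ms when times are in ms
        u = (y_grid[:, None] - events[None, :]) / h
        w = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
        return 100 * w.sum(axis=1) / (N * h)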

Once again the key difficulty is bias: λ̂(y; h) estimates not λ(y) but ∫ w(u)λ(y − hu) du. For large n and small h this means that

\[
E\{\hat\lambda(y;h)\} \doteq \lambda(y) + \tfrac12 h^2\lambda''(y), \qquad
\mathrm{var}\{\hat\lambda(y;h)\} \doteq c(Nh)^{-1}\lambda(y),
\]

where c = ∫ w²(u) du. As in Example 5.13, the delta method (Section 2.7.1) implies that λ̂(y; h)^{1/2} has approximately constant variance ¼c(Nh)^{−1}. We choose to work with the standardized quantities

\[
Z(y; h) = \frac{\hat\lambda^{1/2}(y;h) - \lambda^{1/2}(y)}{\tfrac12(Nh)^{-1/2}c^{1/2}}, \qquad y \in \mathcal{Y}.
\]

In principle an overall 1 − 2α confidence band for λ(y) over 𝒴 is determined by the quantiles z_{L,α}(h) and z_{U,α}(h) that satisfy

\[
1 - \alpha = \Pr\{z_{L,\alpha}(h) \le Z(y;h),\ y \in \mathcal{Y}\} = \Pr\{Z(y;h) \le z_{U,\alpha}(h),\ y \in \mathcal{Y}\}. \tag{8.19}
\]

The lower and upper limits of the band would then be

\[
\left\{\hat\lambda^{1/2}(y;h) - \tfrac12(Nh)^{-1/2}c^{1/2}z_{U,\alpha}(h)\right\}^2, \qquad
\left\{\hat\lambda^{1/2}(y;h) - \tfrac12(Nh)^{-1/2}c^{1/2}z_{L,\alpha}(h)\right\}^2. \tag{8.20}
\]

In practice we must use resampling analogues Z*(y; h) of Z(y; h) to estimate z_{L,α}(h) and z_{U,α}(h), and for this to be successful we must choose h and the resampling scheme to ensure that Z* and Z have approximately the same distributions.

In this context there are a number of possible resampling schemes. The simplest is to take n events at random from the observed events. This relies on the independence assumptions for Poisson processes. A second scheme generates n* events from the observed events, where n* has a Poisson distribution with mean n. A more robust scheme is to superpose 100 resampled intervals, though this does not hold fixed the total number of events. These schemes would be


inappropriate if the estimator of interest presupposed that events could not coincide, as did the K-function of Example 8.9.
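Sketches of these schemes in Python; events is the array of superposed times, and intervals is assumed to be a list of 100 arrays, one per stimulus.

    import numpy as np

    def resample_events(events, rng, poisson=False):
        # n events at random, or n* ~ Poisson(n) events
        size = rng.poisson(len(events)) if poisson else len(events)
        return rng.choice(events, size=size, replace=True)

    def resample_intervals(intervals, rng):
        # more robust: superpose a bootstrap sample of the intervals
        idx = rng.integers(0, len(intervals), size=len(intervals))
        return np.concatenate([intervals[i] for i in idx])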

For all of these resampling schemes the bootstrap estimators λ̂*(y; h) are unbiased for λ̂(y; h). The natural resampling analogue of Z is

\[
Z^*(y;h) = \frac{\{\hat\lambda^*(y;h)\}^{1/2} - \{\hat\lambda(y;h)\}^{1/2}}{\tfrac12(Nh)^{-1/2}c^{1/2}},
\]

but E*(Z*) ≐ 0 and E(Z) ≠ 0. This situation is analogous to that of Example 5.13, and the conclusion is the same: to make the first two moments of Z and Z* agree asymptotically, one must choose h ∝ N^{−γ} with γ > 1/5. Further detailed calculations for the joint distributions over 𝒴 suggest also an upper limit for γ. The essential idea is that h should be smaller than is commonly used for point estimation of the intensity.

A quite different approach is to generate realizations of an inhomogeneous Poisson process from a smooth estimate λ̂(y; h) of the intensity. This can be achieved by using the smoothed bootstrap, as outlined in Section 3.4, and detailed in Problem 8.7. Under this scheme

\[
E^*\left\{\hat\lambda^*(y;h)\right\} = \int \hat\lambda(y - hu;h)\,w(u)\,du \doteq \hat\lambda(y;h) + \tfrac12 h^2\hat\lambda''(y;h),
\]

and the resampling analogue of Z is

\[
Z^*(y;h) = \frac{\{\hat\lambda^*(y;h)\}^{1/2} - \{\hat\lambda(y;h)\}^{1/2}}{\tfrac12(Nh)^{-1/2}c^{1/2}},
\]

whose mean and variance closely match those of Z.

Whatever resampling scheme is employed, simulated values of Z* will be

used to estimate the quantiles z_{L,α}(h) and z_{U,α}(h) in (8.19). If R realizations are generated, then we take ẑ_{L,α}(h) and ẑ_{U,α}(h) to be respectively the (R+1)αth and (R+1)(1−α)th ordered values of

\[
\min_{y \in \mathcal{Y}} z^*(y;h), \qquad \max_{y \in \mathcal{Y}} z^*(y;h).
\]
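Given an R × (grid length) array z_star of simulated processes, the quantiles are obtained directly; a minimal sketch:

    import numpy as np

    def band_quantiles(z_star, alpha):
        # (R+1)alpha-th smallest row minimum and (R+1)(1-alpha)-th
        # smallest row maximum estimate z_{L,alpha} and z_{U,alpha}
        R = z_star.shape[0]
        mins = np.sort(z_star.min(axis=1))
        maxs = np.sort(z_star.max(axis=1))
        return (mins[int((R + 1) * alpha) - 1],
                maxs[int((R + 1) * (1 - alpha)) - 1])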

The upper panel of Figure 8.16 shows overall 95% confidence bands for λ(y; 5), using three of the sampling schemes described above. In each case R = 999, and z_{L,0.025}(5) and z_{U,0.025}(5) are estimated by the empirical 0.025 and 0.975 quantiles of the R replicates of min{z*(y; 5), y = −250, −248, ..., 250} and max{z*(y; 5), y = −250, −248, ..., 250}. Results from resampling intervals and events are almost indistinguishable, while generating data from a fitted intensity gives slightly smoother results. In order to avoid problems at the boundaries, the set 𝒴 is taken to be (−230, 230). The experimental setup implies that the intensity should be about 1 × 10^{−2} firings per millisecond, the only significant departure from which is in the range 0-130 ms, where there is strong evidence that the stimulus affects the firing rate.


Figure 8.16 Confidence bands for the intensity of the neurophysiological point process data. The upper panel shows the estimated intensity λ̂(y; 5) (10^{−2} ms^{−1}) (heavy solid), with overall 95% equi-tailed confidence bands based on resampling intervals (solid), resampling events (dots), and generating events from a fitted intensity (dashes). The outer lines in the lower panel show the 2.5% and 97.5% quantiles of the standardized quantile processes z*(y; h) for resampling intervals (solid) and generating from a fitted intensity (dashes), while the lines close to zero are the bootstrap bias estimates for λ̂.


The lower panel of the figure shows z_{0.025}(5), z_{0.975}(5), and the bootstrap bias estimate for λ̂(y) for resampling intervals and for generating data from a fitted intensity function, with h = 7.5 ms. The quantile processes suggest that the variance-stabilizing transformation has worked well, but the double smoothing effect of the latter scheme shows in the bias. The behaviour of the quantile process when y ≈ 50 ms, where there are no firings, suggests that a variable bandwidth smoother might be better. ■

Essentially the same ideas can be applied when the data are a single realization of an inhomogeneous Poisson process (Problem 8.8).

8.3.3 Tests of association

When a point process has events of different types, interest often centres on association between the different types of events or between events and associated covariates. Then permutation or bootstrap tests may be appropriate, although the simulation scheme will depend on the context.

Example 8.11 (Spatial epidemiology) Suppose that events of a point pattern correspond to locations y of cases of a rare disease 𝒟 that is thought to be related to emissions from an industrial site at the origin, y = 0. A model for the incidence of 𝒟 is that it occurs at rate λ(y) per person-year at location y,


where the suspicion is that λ(y) decreases with distance from the origin. Since the disease is rare, the number of cases at y will be well approximated by a Poisson variable with mean λ(y)μ(y), where μ(y) is the population density of susceptible persons at y. The null hypothesis is that λ(y) = λ_0, i.e. that y has no effect on the intensity of cases, other than through μ(y). A crucial difficulty is that μ(y) is unknown and will be hard to estimate from the data available.

One approach to testing for constancy of λ(y) is to compare the point pattern for 𝒟 to that of another disease 𝒟′. This disease is chosen to have the same population of susceptible individuals as 𝒟, but its incidence is assumed to be unrelated to emissions from the site and to incidence of 𝒟, and so it arises with constant but unknown rate λ′ per person-year. If 𝒟′ is also rare, it will be reasonable to suppose that the number of cases of 𝒟′ at y has a Poisson distribution with mean λ′μ(y). Hence the conditional probability of a case of

𝒟 at y given that there is a case of 𝒟 or 𝒟′ at y is π(y) = λ(y)/{λ′ + λ(y)}. If the disease locations are indicated by y_j, and d_j is zero or one according as the case at y_j has 𝒟′ or 𝒟, the likelihood is

\[
\prod_j \pi(y_j)^{d_j}\left\{1 - \pi(y_j)\right\}^{1-d_j}.
\]

If a suitable form for λ(y) is assumed we can obtain the likelihood ratio or perhaps another statistic T to test the hypothesis that π(y) is constant. This is a test of proportional hazards for 𝒟 and 𝒟′, but unlike in Example 4.4 the alternative is specified, at least weakly.

When λ(y) = λ_0 an approximation to the null distribution of T can be obtained by permuting the labels on cases at different locations. That is, we perform R random reallocations of the labels 𝒟 and 𝒟′ to the y_j, recompute T for each such reallocation, and see whether the observed value of t is extreme relative to the simulated values t_1*, ..., t_R*. ■
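A sketch of the permutation step in Python; t_stat, the user's statistic computed from the fixed locations and the 0-1 labels d, is an assumed helper.

    import numpy as np

    def perm_pvalue(locations, d, t_stat, R, rng):
        # randomly reallocate the disease labels to the fixed locations
        t_obs = t_stat(locations, d)
        t_perm = np.array([t_stat(locations, rng.permutation(d))
                           for _ in range(R)])
        return (1 + np.sum(t_perm >= t_obs)) / (R + 1)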

Example 8.12 (Brambles) The upper left panel of Figure 8.17 shows the locations of 103 newly emergent and 97 one-year-old bramble canes in a 4.5 m square plot. It seems plausible that these two types of event are related, but how should this be tested? Events of both types are clustered, so a Poisson null hypothesis is not appropriate, nor is it reasonable to permute the labels attached to events, as in the previous example.

Let us denote the locations of the two types of event by y_1, ..., y_n and y_1′, ..., y_{n′}′. Suppose that a statistic T = t(y_1, ..., y_n, y_1′, ..., y_{n′}′) is available that tests for association between the event types. If the extent of the observation region were infinite, we might construct a null distribution for T by applying random translations to events of one type. Thus we would generate values T* = t(y_1 + U*, ..., y_n + U*, y_1′, ..., y_{n′}′), where U* is a randomly chosen location in the plane. This sampling scheme has the desirable property of fixing the


Figure 8.17 Brambles data. Top left: positions of newly emergent (+) and one-year bramble canes (•) in a 4.5 m square plot. Top right: random toroidal shift of the newly emergent canes, with the original edges shown by dotted lines. Bottom left: original dependence function Ẑ_12 (solid) and 20 replicates (dots) under the null hypothesis of no association between newly emergent and one-year canes. Bottom right: original dependence function and pointwise (dashes) and overall (dots) 95% null confidence sets. The data used here are the upper left quarter of those displayed on p. 113 of Diggle (1983).



relative locations of each type of event, but cannot be applied directly to the data in Figure 8.17 because the resampled patterns will not overlap by the same amount as the original.

We overcome this by random toroidal shifts, where we imagine that the pattern is wrapped on a torus, the random translation is applied, and the translated pattern is then unwrapped. Thus for points in the unit square we would generate U* = (U_1*, U_2*) at random in the unit square, and then map the event at y_j = (y_{1j}, y_{2j}) to y_j* = (y_{1j} + U_1* − [y_{1j} + U_1*], y_{2j} + U_2* − [y_{2j} + U_2*]), where [·] denotes integer part. The upper right panel of Figure 8.17 shows how such a shift uncouples the two types of events.
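For a pattern in the unit square the shift is two lines of code; points are stored as an (n, 2) numpy array.

    import numpy as np

    def toroidal_shift(xy, rng):
        u = rng.uniform(0, 1, size=2)   # U* uniform on the unit square
        return (xy + u) % 1.0           # y + U* - [y + U*], coordinatewise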


We can construct a test through an extension of the K-function to events of two types, that is the function

\[
K_{12}(t) = \lambda_2^{-1} E\left(\#\{\text{type 2 events within distance } t \text{ of an arbitrary type 1 event}\}\right),
\]

where λ_2 is the overall intensity of type 2 events. Suppose that there are n_1, n_2 events of types 1 and 2 in an observation region A of area |A|, that u_{ij} is the distance from the ith type 1 event to the jth type 2 event, that w_i(u) is the proportion of the circumference of the circle that is centred at the ith type 1 event and has radius u that lies within A, and let I(·) denote the indicator of an event. Then the sample version of this bivariate K-function is

\[
\hat K_{12}(t) = (n_1 n_2)^{-1}|A|\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} w_i^{-1}(u_{ij})\,I(u_{ij} \le t).
\]

Although it is possible to base an overall statistic on K̂_12(t), for example taking T = ∫ Ẑ_12(t)² dt, where Ẑ_12(t) = {K̂_12(t)/π}^{1/2} − t, a graphical test is usually more informative.
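A sketch of K̂_12(t) in Python; the edge-correction weight w_i(u) is assumed to be supplied by a helper w(point, u) returning the proportion of the circle of radius u centred at the point that lies within A.

    import numpy as np

    def khat12(xy1, xy2, t, area, w):
        # distances u_{ij} between all type 1 and type 2 events
        u = np.linalg.norm(xy1[:, None, :] - xy2[None, :, :], axis=2)
        total = sum(1.0 / w(xy1[i], u[i, j])
                    for i in range(len(xy1)) for j in range(len(xy2))
                    if u[i, j] <= t)
        return area * total / (len(xy1) * len(xy2))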

The lower left panel of Figure 8.17 shows results from 20 random toroidal shifts of the data. The original value of Ẑ_12(t) seems to show much stronger local association than do the simulations. This is confirmed by the lower right panel, which shows 95% pointwise and overall confidence bands for Ẑ_12(t) based on R = 999 shifts. There is clear evidence that the point patterns are not independent: as the original data suggest, new canes emerge close to those from the previous year. ■

8.3.4 Tiles

Little is known about resampling spatial processes when there is no parametric model. One nonparametric approach that has been investigated starts from a partition of the observation region ℛ into disjoint tiles 𝒜_1, ..., 𝒜_n of equal size and shape. If we abuse notation by identifying each tile with the pattern it contains, we can write the original value of the statistic as T = t(𝒜_1, ..., 𝒜_n). The idea is to create a resampled pattern by taking a random sample of tiles 𝒜_1*, ..., 𝒜_n* from 𝒜_1, ..., 𝒜_n, with corresponding bootstrap statistic T* = t(𝒜_1*, ..., 𝒜_n*). The hope is that if dependence is relatively short-range, taking large tiles will preserve enough dependence to make the properties of T* close to those of T. If this is to work, the size of the tile must be chosen to trade off preserving dependence, which requires a few large tiles, and getting a good estimate of the distribution of T, which requires many tiles.

This idea is analogous to block resampling in time series, and is capable of similar variations. For example, rather than choosing the 𝒜_j* independently from the fixed tiles 𝒜_1, ..., 𝒜_n, we may resample moving tiles by setting


Figure 8.18 Tile resampling for the caveolae data. The left panel shows the original data, with nine tiles sampled at random using toroidal wrapping. The right panel shows the resampled point pattern.


𝒜_j* = U_j* + 𝒜_j, where U_j* is a random vector chosen so that 𝒜_j* lies wholly within ℛ; we can avoid bias due to undersampling near the boundaries of ℛ by toroidal wrapping. As in all problems involving spatial data, edge effects are likely to play a critical role.
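A sketch of moving-tile resampling with toroidal wrapping for a square region of the given side, partitioned into k × k tiles; xy is an (n, 2) array of event locations.

    import numpy as np

    def moving_tiles(xy, side, k, rng):
        tile = side / k
        pieces = []
        for a in range(k):
            for b in range(k):
                u = rng.uniform(0, side, size=2)  # random tile corner
                shifted = (xy - u) % side         # wrap pattern on torus
                keep = np.all(shifted < tile, axis=1)
                pieces.append(shifted[keep] + np.array([a, b]) * tile)
        return np.vstack(pieces)                  # tiles laid side by side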

Example 8.13 (Caveolae) Figure 8.18 illustrates tile resampling for the data of Example 8.9. The left panel shows the original caveolae data, with the dotted lines showing nine square tiles taken using the moving scheme with toroidal wrapping. The right panel shows the resampled pattern obtained when the tiles are laid side-by-side. For example, the centre top tile and middle right tiles were respectively taken from the top left and bottom right of the original data. Along the tile edges, events seem to lie closer together than in the left panel; this is analogous to the whitening that occurs in blockwise resampling of time series. No analogue of the post-blackened bootstrap springs to mind, however.

For a numerical evaluation of tile resampling, we experimented with estimating the variance θ of the number of events in an observation region ℛ of side 200 units, using data generated from three random processes. In each case we generated 8800 events in a square of side 4000, then estimated θ from 2000 squares of side 200 taken at random. For each of 100 random squares of side 200 we calculated the empirical mean squared error for estimation of θ using bootstraps of size R, for both fixed and moving tiles. Data were generated from a spatial Poisson process (θ = 23.4), from the Strauss process that gave the results in the bottom right panel of Figure 8.14 (θ = 17.5), and from a sequential spatial inhibition process, which places points sequentially at random but not within 15 units of an existing event (θ = 15.6).


Table 8.5 Mean squared errors for estimation of the variance of the number of events in a square of side 200, based on bootstrapping fixed and moving tiles. Data were generated from a Poisson process, a Strauss process with parameters chosen to match the data in Figure 8.14, and from a sequential spatial inhibition process with radius 15. In each case the mean number of events is 22. For n ≤ 64 we took R = 200, for n = 100, 144 we took R = 400, and for n ≥ 196 we took R = m.

                 n =   4     16     36     64    100    144    196    256

  Poisson  theory    224.2   77.9   47.3   36.3   31.2   28.4   26.7   25.6
           fixed     255.2   66.1   40.2   31.7   27.6   27.6   25.5   27.8
           moving     92.2   39.7   35.8   31.6   33.0   30.8   27.4   27.0
  Strauss  fixed     129.1   49.1   27.9   19.2   16.4   19.3   20.8   21.9
           moving     53.2   26.4   19.0   17.4   15.9   18.9   18.7   17.9
  SSI      fixed     123.8   37.7   14.8   13.5   17.9   25.1   34.6   42.4
           moving     36.5   12.9   11.2   15.6   18.3   21.2   28.6   35.4

Table 8.5 shows the results. For the Poisson process the fixed tile results broadly agree with theoretical calculations (Problem 8.9), and the moving tile results accord with general theory, which predicts that mean squared errors for moving tiles should be lower than for fixed tiles. Here the mean squared error decreases to 22 as n→∞.

The fitted Strauss process inhibits pairs of points closer together than 12 units. The mean squared error is minimized when n = 100, corresponding to tiles of side 20; the average estimated variances from the 100 replicates are then 19.0 and 18.2. The mean squared errors for moving tiles are rather lower, but their pattern is similar.

The sequential spatial inhibition results are similar to those for the Strauss process, but with a sharper rise in mean squared error for larger n.

In this setting theory predicts that for a process with sufficiently short-range dependence, the optimal n ∝ |ℛ|^{1/2}. If the caveolae data were generated by a Strauss process, results from Table 8.5 would suggest that we take n = 100 × 500/200 = 250, so there would be 16 tiles along each side of ℛ. With R = 200 and fixed and moving tiles this gives variance estimates of 101.6 and 100.4, both considerably smaller than the variance for Poisson data, which would be 138. ■

8.4 Bibliographic Notes

There are many books on time series. Brockwell and Davis (1991) is a recent book aimed at a fairly mathematical readership, while Brockwell and Davis (1996) and Diggle (1990) are more suitable for the less theoretically inclined. Tong (1990) discusses nonlinear time series, while Beran (1994) covers long-memory processes. Bloomfield (1976), Brillinger (1981), Priestley (1981), and Percival and Walden (1993) are introductions to spectral analysis of time series.

Table 8.5 M ean squared errors for estimation o f the variance o f the number o f events in a square of side 200, based on bootstrapping fixed and moving tiles. D ata were generated from a Poisson process, a Strauss process with parameters chosen to match the data in Figure 8.14, and from a sequential spatial inhibition process with radius 15. In each case the mean number o f events is 22. For n £ 64, we took R = 200, for n = 100, 144, we took R = 400, and for n ^ 196 we took R = m .

Page 439: Bootstrap Methods and Their Application

8.4 ■ Bibliographic Notes 427

M odel-based resampling for time series was discussed by Freedm an (1984), Freedm an and Peters (1984a,b), Swanepoel and van Wyk (1986) and Efron and Tibshirani (1986), am ong others. Li and M addala (1996) survey much o f the related time dom ain literature, which has a som ewhat theoretical emphasis; their account stresses econometric applications. For a m ore applied account of param etric bootstrapping in time series, see Tsay (1992). B ootstrap prediction in time series is discussed by K abaila (1993b), while the bootstrapping o f state- space models is described by Stoffer and Wall (1991). The use o f model-based resampling for order selection in autoregressive processes is discussed by Chen et al. (1993).

Block resam pling for time series was introduced by Carlstein (1986). In an im portan t paper, Kiinsch (1989) discussed overlapping blocks in time series, although in spatial da ta the proposal o f block resam pling in Hall (1985) predates both. Liu and Singh (1992a) also discuss the properties o f block resampling schemes. Politis and R om ano (1994a) introduced the stationary bootstrap, and in a series o f papers (Politis and R om ano, 1993, 1994b) have discussed theoretical aspects o f more general block resam pling schemes. See also Buhlm ann and Kiinsch (1995) and Lahiri (1995). The m ethod for block length choice outlined in Section 8.2.3 is due to Hall, Horowitz and Jing (1995); see also H all and Horowitz (1993). Bootstrap tests for unit roots in autoregressive models are discussed by Ferretti and Rom o (1996). Hall and Jing (1996) describe a block resampling approach in which the construction of new series is replaced by R ichardson extrapolation.

Bose (1988) showed that model-based resam pling for autoregressive pro­cesses has good asym ptotic higher-order properties for a wide class o f statistics. Lahiri (1991) and G otze and Kiinsch (1996) show that the same is true for block resampling, but Davison and H all (1993) point out tha t unfortunately — and unlike when the da ta are independent — this depends crucially on the variance estim ate used.

Form s o f phase scram bling have been suggested independently by several authors (N ordgaard, 1990; Theiler et al., 1992), and Braun and K ulperger (1995, 1997) have studied its properties. H artigan (1990) describes a m ethod for variance estim ation in G aussian series tha t involves similar ideas bu t needs no random ization; see Problem 8.5.

Frequency dom ain resampling has been discussed by Franke and Hardle (1992), who m ake a strong analogy with bootstrap m ethods for nonparam etric regression. It has been further studied by Janas (1993) and D ahlhaus and Janas (1996), on which our account is based.

O ur discussion o f the R io Negro data is based on Brillinger (1988, 1989), which should be consulted for statistical details, while Sternberg (1987, 1995) gives accounts o f the da ta and background to the problem.

M odels based on point processes have a long history and varied provenance.

Page 440: Bootstrap Methods and Their Application

428 8 • Complex Dependence

Daley and Vere-Jones (1988) and K arr (1991) provide careful accounts o f their m athem atical basis, while Cox and Isham (1980) give a m ore concise treatm ent. Cox and Lewis (1966) is a standard account o f statistical m ethods for series o f events, i.e. point processes in the line. Spatial po in t processes and their statistical analysis are described by Diggle (1983), Ripley (1981, 1988), and Cressie (1991). Spatial epidemiology has recently received attention from various points o f view (M uirhead and D arby, 1989; Bithell and Stone, 1989; Diggle, 1993; Lawson, 1993). Example 8.11 is based on Diggle and Rowlingson (1994).

Owing to the impossibility o f exact inference, a num ber o f statistical proce­dures based on random ization or sim ulation originated in spatial da ta analysis. Examples include graphical tests, which were used extensively by Ripley (1977), and various approaches to param etric inference based on M arkov chain M onte C arlo m ethods (Ripley, 1988, C hapters 4, 5). However, nonparam etric boo t­strap m ethods for spatial da ta have received little attention. One exception is H all (1985), a pioneering work on the theory tha t underlies block resampling in coverage processes, a particular type o f spatial data. Further discussion o f resampling these processes is given by H all (1988b) and G arcia-Soidan and H all (1997). Possolo (1986) discusses subsam pling m ethods for estim ating the param eters o f a random field. O ther applications include Hall and Keenan (1989), who use the bootstrap to set confidence “gloves” for the outlines of hands, and Journel (1994), who uses param etric bootstrapping to account for estim ation uncertainty in an application o f kriging. Young (1986) describes bootstrap approaches to testing in some geometrical problems.

Cowling, H all and Phillips (1996) describe the resampling m ethods for inhom ogeneous Poisson processes tha t form the basis o f Example 8.10, as well as outlining the related theory. Ventura, D avison and Boniface (1997) describe a different analysis o f the neurophysiological da ta used in that example. Diggle, Lange and Benes (1991) describe an application o f the bootstrap to a point process problem in neuroanatom y.

8.5 Problems1 Suppose that y i,...,y„ is an observed time series, and let zy denote the block

of length / starting at yu where we set y, = yi+(i_i mod „) and y0 = yn- Also let h , . . . be a stream of random numbers uniform on the integers 1 , . . . ,n and let be a stream of random numbers having the geometric distributionPr(L = I) = p(l — p)‘~ \ I = 1,— The algorithm to generate a single stationary bootstrap replicate is

Algorithm 8.2 (Stationary bootstrap)• Set 7* = z/jx,, and set i = 1.• While length(Y’) < n, {increment /; replace Y ’ with (Y z ; i>Li)}.

Page 441: Bootstrap Methods and Their Application

8.5 ■ Problems 429

• Set 7* =

(a) Show that the algorithm above is equivalent to

Algorithm 8.3. Set Yl' = y , r• For i = 2,. .. ,n, let Y ' = with probability p, and let Y" = yj+l with

probability 1 — p, where y,l, = yj.

(b) Define the empirical circular autocovariancen

Ck = O '; - y ) ( y i + u + k - t mod n) - y ) , k = Q , . . . , n .

;=1

Show that conditional on y i , . . . , y „ ,

E’(y /) = y, cov*(y,-,y;+1) = ( i - PyCj

and deduce that y ' is second-order stationary.(c) Show that if y i , . . . , y n are all distinct, 7 ’ is a first-order Markov chain. Under what circumstances is it a fcth-order Markov chain?(Section 8.2.3; Politis and Romano, 1994a)

2 Let Y i , . . . , Yn be a stationary time series with covariances f j = cov(Y!, Yj+1 ). Show that

var(? ) = y0 + 2 ^ f l - yh;=l '

and that this approaches C = Vo + 2 £ 5 ° yj if ! j\yj\ is finite.Show that under the stationary bootstrap, conditional on the data,

n—1 /v a r '(y ‘) = c0 + 2 ^ 3 ( 1 - " ) (! ~ P ) JCj,

;=l ' nJ

where Co,c1;. . . are the empirical circular autocovariances defined in Problem 8.1. (Section 8.2.3; Politis and Romano, 1994a)

3 (a) Using the setup described on pages 405-408, show that J2($j ~ S)2 has mean vy — b~l v,j and variance

V i j j j + 2VjjVtj - 2b _1(v Uj j t + 2v u viJt) + b - 2(v iJJcJ + 2 v u vkJ),

where vy = cov(S,,S,), = cum(S,, Sj, St) and so forth are the joint cumulantso f the Sj, and summation is understood over each index.(b) For an m-dependent normal process, show that provided / > m,

( l~'4 }, i = j, v‘.i = \ l -2c(l>, \ i - j \ = l,

( 0, otherwise,

and show that /“ ‘cq1—>(, c,1*— as /—»oo. Hence establish (8.13) and (8.14). (Section 8.2.3; Appendix A; Hall, Horowitz and Jing, 1995)

Page 442: Bootstrap Methods and Their Application

430 8 ■ Complex Dependence

4 Establish (8.16) and (8.17). Show that under phase scrambling,

n_1 H YJ = cov‘(y/. Y,'+m) = «_ 1 - y)(yi+* - y)>where j + m is interpreted mod n, and that all odd joint moments o f the Y j are zero.This last result implies that the resampled series have a highly symmetric joint distribution. When the original data have an asymmetric marginal distribution, the following procedure has been proposed:

• let Xj = <t>- 1 { r j/(n + 1 )}, where rj is the rank o f y} among the original series ya, . . . , y n- 1;

• apply Algorithm 8.1 to x0 , . . . , x„-i, giving X ‘_ , ; then• set Y j = y(r/), where rj is the rank o f X j among Aro,...,A '*_1.

Discuss critically this idea (see also Practical 8.3).(Section 8.2.4; Theiler et al., 1992; Braun and Kulperger, 1995, 1997)

5 (a) Let / i , . . . , / m be independent exponential random variables with means fij , and consider the statistic T = Yl"j=\ ai h ’ where the a; are unknown. Show that V = | ajl? is an unbiased estimate o f var(T) = Y jNow let C = (c0 , - . . , c m) be an ( m + 1) x ( m + 1) orthogonal matrix with columns cj, where co is a vector o f ones; the Zth element o f c, is cj,-. That is, for some constant b,

c j c i = 0 , i i = j , c j c j = b, j = l , . . . , m .

Show that for a suitable choice o f b, V is equal to

j ffl+1 ttl+1

2 ^ r n ) g B r ' - TO'

where for i = 1 , . . . , m + 1 , Tf = + ca)h-(b) Now suppose that Yo,. . . , Y„_i is a time series o f length n = 2m + lz with empirical Fourier transform fb .---.iB _ i and periodogram ordinates h = \Yk\2/n, for k = 0 , . . . , m. For each i = 1 , . . . , m + 1, let the perturbed periodogram ordinates be

YJ = ?o, Y> = ( l + c ^ 2Yk, = ( l + c * ) 1/2Y„_*, k = l,...,m ,

from which the ith replacement time series is obtained by the inverse Fourier transform.Let T be the value o f a statistic calculated from the original series. Explain how the corresponding resample values, T1' , . . . ,T ^ +1, may be used to obtain an approximately unbiased estimate o f the variance o f T , and say for what types o f statistics you think this is likely to work.(Section 8.2.4; Hartigan, 1990)

6 In the context o f periodogram resampling, consider a ratio statistic

T = a(u>k)I((Qk) = / a M g M dw( 1 + n } ' /2X a)

Y k F= i 1 (®fc) / g(ft>) dw( 1 -f 1/2Z i)

say. Use (8.18) to show that X a and X i have means zero and that

var(-Xa) = n l aaggl ^ 2 + i(c4, \a r (X i) = n lgel ~ 2 + ^ k4,

COV(XUX a) — 1llagglag Ig “t- 2 ^4 .

Page 443: Bootstrap Methods and Their Application

8.5 ■ Problems 431

where I aagg = / a2(co)g2(co) dco, and so forth. Hence show that to first order the mean and variance o f T do not involve k4, and deduce that periodogram resampling may be applied to ratio statistics.Use simulation to see how well periodogram resampling performs in estimating the distribution o f a suitable version o f the sample estimate o f the lag j autocorrelation,

= e~toJg M dco

Pl f l n g (« ) dco

(Section 8.2.5; Janas, 1993; Dahlhaus and Janas, 1996)

7 Let y \ , . . . , y n denote the times o f events in an inhomogeneous Poisson process o f intensity My), observed for 0 < y < 1, and let

J= 1

denote a kernel estimate o f My), based on a kernel w( ) that is a PDF. Explain why the following two algorithms for generating bootstrap data from the estimated intensity are (almost) equivalent.

Algorithm 8.4 (Inhomogeneous Poisson process 1)

• Let N have a Poisson distribution with mean A = f Q' l(u ;h)du .• For j = 1 , . . . , N , independently take 17* from the t / (0 ,1) distribution, and

then set Y ’ = F ~ l(Uj), where F(y) = A-1 f0} l(u ;h )du .

Algorithm 8.5 (Inhomogeneous Poisson process 2)A p 1• Let N have a Poisson distribution with mean A = J0 /.(u; h) du.

• For j = 1 , . . . , N , independently generate /* at random from the integers { ! , . . . ,« } and let s* be a random variable with PD F w(-). Set YJ = y,- + ht:'.

(Section 8.3.2)

8 Consider an inhomogeneous Poisson process o f intensity /.(y) = Nn(y), where fi(y) is fixed and smooth, observed for 0 < y < 1.A kernel intensity estimate based on events at y i , . . . , y n is

i=i

where w( ) is the PD F o f a symmetric random variable with mean zero and variance one; let K = / w2(u)du.(a) Show that as N -* cc and h—>0 in such a way that N h —>cej,

E { l(y ; h)} = X(y) + ± h2X"(y), var j l(y ; h) j = K h~l X(y);

you may need the facts that the number o f events n has a Poisson distribution with mean A = /J Mu) du, and that conditional on there being n observed events, their

Page 444: Bootstrap Methods and Their Application

432 8 ■ Complex Dependence

times are independent random variables with PDF Hence show that theasymptotic mean squared error of is minimized when h oc N~l/S. Use thedelta method to show that the approximate mean and variance of l 1/2(y;h) are

*'/2(y) + \ * r m (y) {h2f ( y ) - ±K h r 1}, \Kh~l.

(b) Now suppose that resamples are formed by taking n observations at random from yi,...,y„. Show that the bootstrapped intensity estimate

, y - y jw 'h J=l

has mean E’{ l ‘(y, h)} = l(y;h), and that the same is true when there are n' resampled events, provided that E '(n ') = n.For a third resampling scheme, let n have a Poisson distribution with mean n, and generate n events independently from density ).(y;h)/ f Ql l(u;h)du. Show that under this scheme

E*{3.*{_y; Ai)} = J w(u)2(y — hu;h)du.

(c) By comparing the asymptotic distributions of

P 2( y ; h ) - ^ 2(y) , { r ( y ; h ) \ ' - l 1/2(y;h)z i y ’h) = { k U - w ’ Z ( r ’h) = ------- W m F u i ---------*

find conditions under which the quantiles of Z ' can estimate those of Z.(Section 8.3.2; Example 5.13; Cowling, Hall and Phillips, 1996)

Consider resampling tiles when the observation region ^ is a square, the data are generated by a stationary planar Poisson process of intensity X, and the quantity of interest is d = var(Y), where Y is the number of events in 3t.Suppose that 0t is split into n fixed tiles of equal size and shape, which are then resampled according to the usual bootstrap. Show that the bootstrap estimate of6 is t = ^2(yj — y)2, where yj is the number of events in the j th tile. Use the fact that var(T) = (n — 1)2{k4/h + 2 k \ /( n — 1)}, where Kr is the rth cumulant of Yj, to show that the mean squared error of T is

^ { n + ( n - l ) ( 2n + n - l ) } ,

where n = l\9l\. Sketch this when p. > 1, fi = 1, and /i < 1, and explain in qualitative terms its behaviour when fi > 1.Extend the discussion to moving tiles.(Section 8.3)

8.6 Practicals1 Dataframe lynx contains the Canadian lynx data, to the logarithm of which we

fit the autoregressive model that minimizes A IC:

ts.plot(log(lynx)) lynx.ar <- arClogClynx)) lynx.ar$order

Page 445: Bootstrap Methods and Their Application

• Practicals 433

The best model is A R(ll). How well determined is this, and what is the variance of the series average? We bootstrap to see, using lynx .fun (given below), which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.Here are results for fixed-block bootstraps with block length I = 20:

lynx.fun <- function(tsb){ ar.fit <- ar(tsb, order,max=25)

c(ar.fit$order, mean(tsb), tsb) > lynx.l <- tsboot(log(lynx), lynx.fun, R=99, 1=20, sim="fixed") tsplot(ts(lynx.l$t[l,3:116],start=c(1821,1)),

main="Block simulation, 1=20") boot.array(lynx.1) [1,] table(lynx.l$t[,1]) var(lynx.l$t[,2]) qqnormdynx. l$t [,2] )abline(mean(lynx.l$t[,2]),sqrt(var(lynx.l$t[,2])),lty=2)

To obtain similar results for the stationary bootstrap with mean block length 1 = 2 0 :

.Random.seed <- lynx.l$seedlynx.2 <- tsboot(log(lynx), lynx.fun, R=99, 1=20, sim="geom")

See if the results look different from those above. Do the simulated series using blocks look like the original? Compare the estimated variances under the two resampling schemes. Try different block lengths, and see how the variances of the series average change.For model-based resampling we need to store results from the original model:

lynx.model <- list(order=c(lynx.ar$order,0,0),ar=lynx.ar$ar) lynx.res <- lynx.ar$resid[!is.na(lynx.ar$resid)] lynx.res <- lynx.res - mean(lynx.res) lynx.sim <- function(res,n.sim, ran.args){ rgl <- function(n, res) sample(res, n, replace=T)

ts.orig <- ran.args$ts ts.mod <- ran.args$modelmean(ts.orig)+ts(arima.sim(model=ts.mod, n=n.sim,

rand.gen=rgl, res=as.vector(res))) }.Random.seed <- lynx.l$seedlynx.3 <- tsboot(lynx.res, lynx.fun, R=99, sim="model",

n.sim=114,ran.gen=lynx.sim,ran.args=list(ts=log(lynx), model=lynx.model))

Check the orders of the fitted models for this scheme.For post-blackening we need to define yet another function:

lynx.black <- function(res, n.sim, ran.args){ ts.orig <- ran.args$ts

ts.mod <- ran.args$modelmean(ts.orig) + ts(arima.sim(model=ts.mod,n=n.sim,innov=res)) }

.Random.seed <- lynx.l$seedlynx.lb <- tsboot(lynx.res, lynx.fun, R=99, 1=20, sim="fixed",

n .sim=l14,ran.gen=lynx.black, ran.args=list(ts=log(lynx), model=lynx.model))

Page 446: Bootstrap Methods and Their Application

8 ■ Complex Dependence

Compare these results with those above, and try the post-blackened bootstrap with sim=" geom".(Sections 8.2.2, 8.2.3)

The data in beaver consist o f a time series o f n = 100 observations on the body temperature y i , . . . , y „ and an indicator x i , . . . , x n o f activity o f a female beaver, Castor canadensis. We want to estimate and give an uncertainty measure for the body temperature o f the beaver. The simplest model that allows for the clear autocorrelation o f the series is

yj = P o + PiXj + rij, rij = tcrij_i +Ej, j = l , . . . , n , (8.21)

a linear regression model in which the errors r\j form an AR(1) process, and the are independent identically distributed errors with mean zero and variance a 2. Having fitted this model, estimated the parameters a,/?o, j8i,<t2 and calculated the residuals e i , . . . , e n (e\ cannot be calculated), we generate bootstrap series by the following recipe:

y'j = Po + PiXj + *}], n] = j = i , . . . , n , (8.22)

where the error series {>/'} is formed by taking a white noise series {e‘ } at randomfrom the set {a(e2 — e) , . . . , o(e„ — e)} and then applying the second part o f (8.22).To fit the original model and to generate a new series:

f i t <- fu n c t io n ( d a ta ){ X <- c b in d ( r e p ( l ,1 0 0 ) ,d a ta $ a c t iv )

para <- l i s t ( X=X,data=data) a ss ig n (" p a ra " ,p a ra ,fra m e = l)d <- a r im a .m le (x = p a ra $ d a ta $ tem p ,m o d e l= lis t(a r= c (0 .8 )) ,

xreg=para$X)r e s <- a r im a .d ia g (d ,p lo t= F ,s td .r e s id = T )$ s td .r e s id r e s <- r e s [ ! i s .n a ( r e s ) ]l i s t (p a r a s= c (d $ m o d e l$ a r ,d $ r e g .c o e f ,sq r t(d $ s ig m a 2 )) ,

res= re s-m ea n (res) ,f i t= X 7,*7, d $ r e g .c o e f) > b e a v e r .a r g s <- f i t ( b e a v e r )w h ite .n o is e <- fu n c t io n (n .s im , t s ) sa m p le (ts ,s iz e = n .s im ,r e p la c e = T ) b ea v er .g en < - f u n c t io n ( t s , n .s im , r a n .a r g s ){ t s b < - r a n .a r g s$ r e s

f i t <- r a n .a r g s $ f i t c o e f f < - ra n .a rg s$ p a ra sts$tem p <- f i t + c o e f f [ 4 ] * a r im a .s im (m o d e l= lis t (a r = c o e ff[ 1 ] ) ,

n = n .s im ,r a n d .g e n = w h ite .n o ise ,t s= ts b )t s }

new .beaver <- b e a v e r .g e n (b e a v e r , 100, b e a v e r .a r g s )

Now we are able to generate data, we can bootstrap and see the results o f b ea v er .b o o t as follows:

b e a v e r .fu n < - f u n c t io n ( t s ) f i t ( t s ) $ p a r a sb ea v er .b o o t <- t s b o o t(b e a v e r , b e a v e r .fu n , R=99,sim ="model",

n . s im=1 0 0 ,r a n . gen=beaver. g en , r a n . args= b eaver . a r g s ) nam es(beaver. b o o t) b e a v e r . b oo t$ t0 b e a v e r .b o o t$ t [1 :1 0 ,]

showing the original value o f b e a v e r . fun and its value for the first 10 replicate

Page 447: Bootstrap Methods and Their Application

8.6 ■ Practicals 435

series. Are the estimated mean temperatures for the R = 99 simulations normal? Use b o o t . c i to obtain normal and basic bootstrap confidence intervals for the resting and active temperatures.In this analysis we have assumed that the linear model with AR(1) errors is appropriate. How would you proceed if it were not?(Section 8.2; Reynolds, 1994)

3 Consider scrambling the phases o f the sunspot data. To see the original data, two replicates generated using ordinary phase scrambling, and two phase scram­bled series whose marginal distribution is the same as that o f the original data:

su n sp o t .fu n < - fu n c t io n ( t s ) t ss u n s p o t .1 <- tsb o o t(su n sp o t,su n sp o t.fu n ,R = 2 ,s im = " scra m b le" ).Random.seed <- s u n s p o t .l$ se e dsu n s p o t .2 <- tsb oot(su n sp ot,su n sp ot.fu n ,R = 2 ,sim = " scram b le" ,n orm = F ) s p l i t . s c r e e n ( c ( 3 ,2 ) ) y l <- c (-5 0 ,2 0 0 )s c r e e n ( l ) ; t s .p lo t ( s u n s p o t ,y l im = y l) ; a b lin e (h = 0 ,lty = 2 ) s c r e e n ( 3 ) ; t s p lo t ( s u n s p o t . l $ t [ 1 , ] ,y l im = y l) ; a b lin e (h = 0 ,lty = 2 ) s c r e e n (4 ) ; t s p lo t ( s u n s p o t . l $ t [ 2 , ] ,y l im = y l) ; a b lin e (h = 0 ,lty = 2 ) s c r e e n (5 ) ; t s p lo t ( s u n s p o t .2 $ t [ 1 , ] ,y l im = y l) ; a b lin e (h = 0 ,lty = 2 ) s c r e e n (6 ) ; t s p lo t ( s u n s p o t . 2$ t [ 2 , ] ,y l im = y l) ; a b lin e (h = 0 ,lty = 2 )

What features o f the original data are preserved by the two algorithms? (You may find it helpful to experiment with different shapes for the figures.)(Section 8.2.4; Problem 8.4; Theiler et a l , 1992)

4 c o a l contains data on times o f explosions in coal mines from 15 March 1851 to 22 March 1962, often modelled as an inhomogeneous Poisson process. For a kernel intensity estimate (accidents per year):

c o a l .e s t <- fu n c t io n (y , h=5) len gth (y )*k sm ooth (y ,b an d w id th = 2 . 7*h, k ern e l= " n " ,x .p o in ts= se q (1 8 5 1 ,1 9 6 3 ,2 ) )$y

year <- se q (1 8 5 1 ,1 9 6 3 ,2 )p lo t ( y e a r ,c o a l .e s t ( c o a l$ d a te ) ,t y p e = " l" ,y la b = " in te n s i ty " ,

y lim = c (0 ,6 ) ) r u g (c o a l)

Try other choices o f bandwidth h, noting that the estimate for the period (1851 + 4/i, 1962 — 4h) does not have edge effects. D o you think that the drop from about three accidents per year before 1900 to about one thereafter is spurious? What about the peaks at around 1910 and 1940?For an equi-tailed 90% bootstrap confidence band for the intensity, we take h = 5 and R = 199 (a larger R will give more reliable results):

c o a l .fu n < - fu n c t io n (d a ta , i , h=5) c o a l . e s t ( d a t a [ i ] , h) c o a l.b o o t <- b o o t ( c o a l$ d a te , c o a l .f u n , R=199)A <- 0 .5 /s q r t ( 5 * 2 * s q r t ( p i ) )Z <- s w e e p ( s q r t ( c o a l .b o o t $ t ) ,2 ,s q r t ( c o a l .b o o t$ t0 ) ) /A Z.max < - s o r t (a p p ly ( Z ,l ,m a x ) ) [190]Z.min <- s o r t ( a p p ly ( Z ,l .m in ) ) [10] to p < - (sq r t(c o a l.b o o t$ tO )-A * Z .m in )”2 bot <- (sq rt(coa l.b oo t$ tO )-A *Z .m ax)" 2 l i n e s ( y e a r , t o p , l t y = 2 ) ; l in e s ( y e a r ,b o t , l t y = 2 )

Page 448: Bootstrap Methods and Their Application

436 8 ■ Complex Dependence

Z <- apply(Z,2,sort)

Z.05 <- Z[10,]Z.95 <- Z[190,]plot(year,Z .05,type="1",ylab="Z",ylim=c(-3,3)) lines(year,Z .95)

Construct symmetric bootstrap confidence bands based on za{h) such that

Pr{|Z(y; /i)| < z„(h),y € &} = a

(no more simulation is required). How different are they from the equi-tailed ones? For simulation with a random number o f events, use

coal.gen <- function(data, n){ i <- sampled :n,size=rpois(n=l ,lambda=n) ,replace=T)

datafi] }coal.boot2 <- boot(coal$date, coal.est, R=199, sim="parametric",

ran.gen=coal.gen, mle=nrow(coal))

D oes this make any difference?(Section 8.3.2; Cowling, Hall and Phillips 1996; Hand et al., 1994, p. 155)

To see the quantile process:

Page 449: Bootstrap Methods and Their Application

9

Improved Calculation

9.1 IntroductionA few o f the statistical questions in earlier chapters have been am enable to analytical calculation. However, m ost o f our problems have been too com ­plicated for exact solutions, and samples have been too small for theoretical large-sample approxim ations to be trustworthy. In such cases sim ulation has provided approxim ate answers through M onte Carlo estimates o f bias, vari­ance, quantiles, probabilities, and so forth. T hroughout we have supposed that the sim ulation size is limited only by our impatience for reliable results.

Sim ulation o f independent bootstrap samples and their use as described in previous chapters is usually easily program m ed and implemented. I f it takes up to a few hours to calculate enough values o f the statistic o f interest, T, ordinary simulation o f this sort will be an efficient use o f a researcher’s time. But sometimes T is very costly to compute, or sampling is only a single com ponent in a larger procedure — as in a double bootstrap — or the procedure will be repeated m any times with different sets o f data. Then it may pay to invest in m ethods o f calculation that reduce the num ber o f sim ulations needed to obtain a given precision, o r equivalently increase the accuracy o f an estimate based on a given sim ulation size. This chapter is devoted to such methods.

N o lunch is free. The techniques tha t give the biggest potential variance reductions are usually the hardest to implement. O thers yield less spectacular gains, but are m ore easily implemented. Thoughtless use of any o f them may make m atters worse, so it is essential to ensure tha t use o f a variance reduction technique will save the investigator’s time, which is much more valuable than com puter time.

M ost o f our bootstrap estimates depend on averages. For example, in testing a null hypothesis (C hapter 4) we w ant to calculate the significance probability p = Pr’(7” ^ t | Fo), where t is the observed value o f test statistic T and

437

Page 450: Bootstrap Methods and Their Application

438 9 ■ Improved Calculation

the fitted model Fo is an estimate o f F under the null hypothesis. The simple M onte C arlo estimate o f p is R ^ 1 / {T ' > (}, where I is the indicatorfunction and the T ’ are based on R independent samples generated from Fo- The variance o f this estimate is cR~{, where c = p fl — p). N othing can generally be done about the factor R ~ l , but the constant c can be reduced if we use a more sophisticated M onte Carlo technique. M ost o f this chapter concerns such techniques. Section 9.2 describes m ethods for balancing the sim ulation in order to make it m ore like a full enum eration o f all possible samples, and in Section 9.3 we describe m ethods based on the use o f control variates. Section 9.4 describes m ethods based on im portance sampling. In Section 9.5 we discuss one im portan t m ethod o f theoretical approxim ation, the saddlepoint m ethod, which eliminates the need for simulation.

9.2 Balanced BootstrapsSuppose for simplicity tha t the data are a hom ogeneous random sample y \ , . . . , yn with E D F F, and tha t as usual we are concerned with the properties o f a statistic T whose observed value is t = t ( y i , . . . , y n). O ur focus is T ‘ = t ( Y { , . . . , Y„*), where the Y" are a random sample from F. Consider the bias estimate for T, namely B = E ’(T* | F) — t. I f g denotes the jo in t density of

then

B = J t {y \ , . . . , y'n)g(y[, . . . , y'„)dy{

This might be com putable analytically if t( ) is simple enough, particularly for some param etric models. In the nonparam etric case, if the calculation cannot be done analytically, we set g equal to n~n for all possible samples y\ , ..., y'n in the set Sf = {y i , . . . , y„}n and write

B = n~n ^ 2 t ( y [ , . . . , y ' n) - t . (9.1)

This sum over all possible samples need involve only (2n„_1) calculations o f (*, since the symmetry o f t( ) with respect to the sample can be used, bu t even so the complete enumeration of values t* that (9.1) requires will usually be im practicable unless n is very small. So it is that, especially in nonparam etric problems, we usually approxim ate the average in (9.1) by the average o f R random ly chosen elements o f Zf and so approxim ate B by B r = R _i Y , T* — t.

This calculation with a random subset o f has a m ajor defect: the values y i , . . . , yn typically do no t occur with equal frequency in tha t subset. This is illustrated in Table 9.1, which reproduces Table 2.2 but adds (penultim ate row) the aggregate frequencies for the data values; the final row is explained later. In the even simpler case o f the sample average t = y we can see clearly

Page 451: Bootstrap Methods and Their Application

9.2 ■ Balanced Bootstraps 439

Table 9.1 R = 9resamples for city population data, chosen by ordinary bootstrap sampling from F.

Dataj 1 2 3 4 5 6 7 8 9 10u 138 93 61 179 48 37 29 23 30 2X 143 104 69 260 75 63 50 48 111 50

Number o f times j sampled StatisticData 1 1 1 1 1 1 1 1 1 1 t = 1.520

Sample 1 3 2 1 2 1 1 t\ = 1.4662 1 1 2 1 2 1 t ’2 = 1.7613 1 1 1 1 4 2 = 1.9514 1 2 1 1 2 2 1 t’A = 1.5425 3 1 3 1 1 1 t; = 1.3716 1 1 2 1 1 1 3 t'6 = 1.6867 1 1 2 2 2 1 1 t; = 1.3788 2 1 3 1 1 1 1 (• = 1.4209 1 1 1 2 1 2 1 1 t; = 1.660

Aggregate 9 8 11 5 13 8 8 7 11 10

F* 9 8 1 1 5 13 8 8 7 n 10r 50 55 50 50 50 50 50 50 50 50

that the unequal frequencies completely account for the fact tha t B r differs from the correct value B = 0. The corresponding phenom enon for param etric bootstrapping is tha t the aggregated E D F o f the R samples is not as close to the C D F o f the fitted param etric model as it is to the same model with different param eter values.

There are two ways to deal with this difficulty. First, we can try to change the sim ulation to remove the defect; and secondly we can try to adjust the results o f the existing simulation.

9.2.1 Balancing the simulationThe idea o f balanced resampling is to generate tables o f random frequencies, but to force them to be balanced in an appropriate way. A set o f R bootstrap samples is said to have first-order balance if each o f the original observations appears with equal frequency, i.e. exactly R times overall.

F irst-order balance is easy to achieve. A simple algorithm is as follows:

Algorithm 9.1 (Balanced bootstrap)

Concatenate R copies o f y i , . . . , y „ into a single set o f size Rn.

Permute the elements o f 9 at random , giving <&*, say.

For r = 1 ,. . . , /? , take successive sets o f n elements o f to be the balanced resamples, y *, and set t'r = t (y‘ ). •

Page 452: Bootstrap Methods and Their Application

440 9 • Improved Calculation

Data

Sample

Aggregate

1 2 3 4 5 6 7 8 9 10

Number of times j sampled Statistic1 1 1 1 1 1 1 1 1 1 t = 1.520

1 1 1 3 2 1 1 1 t\ = 1.6322 2 1 1 2 1 1 2 t i = 1.8233 2 2 2 1 1 1 1 t"3 = 1.3344 2 2 2 1 1 1 1 t'4 = 1.3175 1 3 i 2 1 2 t‘5 = 1.5316 2 1 1 1 1 1 1 1 1 t‘6 = 1.3447 2 1 1 1 1 1 2 1 ty = 1.7308 1 2 2 1 1 1 1 1 t\ = 1.4249 1 2 1 2 1 1 1 1 t; = 1.678

9 9 9 9 9 9 9 9

O ther algorithm s (e.g. Problem 9.2) have been suggested tha t economize on the time and space needed to generate balanced samples, bu t the m ost time-consuming part o f a bootstrap sim ulation is usually the calculation o f the values o f t \ so the details o f the sim ulation algorithm are rarely critical. W hatever the m ethod used to generate the balanced samples, the result will be that individual observations have equal overall frequencies, just as for complete enum eration — a simple illustration is given below. Indeed, so far as the m arginal frequencies o f the da ta values are concerned, a complete enum eration has been performed.

Example 9.1 (City population data) Consider estim ating the bias o f the ratio estimate t = x / u for the da ta in the second and third rows o f Table 9.1. Table 9.2 shows the results for a balanced bootstrap with R = 9: each data value occurs exactly 9 times overall.

To see how well the balanced bootstrap works, we apply it with the more realistic num ber R = 49. The bias estim ate is B R = T* — t = R ~ l J2r T ' — t, and its variance over 100 replicates o f the ordinary resampling scheme is 7.25 x 10-4 . The corresponding figure for the balanced bootstrap is 9.31 x 10-5 , so the balanced scheme is about 72.5/9.31 = 7.8 times m ore efficient for bias estimation. ■

Here and below we say tha t the efficiency o f a bootstrap estimate such as Br relative to the ordinary bootstrap is the variance ratio

v ' K J b r )var I J B r Y

where for this com parison the subscripts denote the sampling scheme under which B r was calculated.

Table 9.2 First-order balanced bootstrap with R = 9 for city population data.

Page 453: Bootstrap Methods and Their Application

9.2 • Balanced Bootstraps 441

Table 93 Approximate efficiency gains when balancing schemes with R = 49 are applied in estimating biases for estimates of nonlinear regression model applied to the calcium uptake data, based on 100 repetitions of the bootstrap.

Cases Stratified R esiduals

Balanced A djusted Balanced A djusted Balanced A djusted

Po 8.9 6.9 141 108 1.2 0.6

Pi 13.1 8.9 63 49 1.4 0.6a 11.1 9.1 18.7 18.0 15.3 13.5

So far we have focused on the application to bias estimation, for which the balance typically gives a big improvement. The same is not generally true for estim ating higher m om ents or quantiles. For instance, in the previous example the balanced bootstrap has efficiency less than one for calculation o f the variance estim ate VR.

The balanced bootstrap extends quite easily to more com plicated sampling situations. I f the data consist o f several independent samples, as in Section 3.2, balanced sim ulation can be applied separately to each. Some other extensions are straightforward.

Example 9.2 (Calcium uptake data) To investigate the im provement in bias estim ation for the param eters of the nonlinear regression model fitted to the data o f Example 7.7, we calculated 100 replicates o f the estim ated biases based on 49 bootstrap samples. The resulting efficiencies are given in Table 9.3 for different resampling schemes; the results labelled “A djusted” are discussed in Example 9.3. For stratified resampling the data are stratified by the covariate value, so there are nine stra ta each with three observations. The efficiency gains under stratified resampling are very large, and those under case resampling are worthwhile. The gains when resampling residuals are not worthwhile, except for a 2. ■

First-order balance ensures tha t each observation occurs precisely R times in the R samples. In a scheme with second-order balance, each pair o f observations occurs together precisely the same num ber o f times, and so on for schemes with third- and higher-order balance. There is a close connection to certain experim ental designs (Problem 9.7). Detailed investigation suggests, however, tha t there is usually no practical gain beyond first-order balance. A n open question is whether or no t there are useful “nearly balanced” designs.

9.2.2 Post-simulation balanceConsider again estim ating the bias o f T in a nonparam etric context, based on an unbalanced array o f frequencies such as Table 9.1. The usual bias estimate

Page 454: Bootstrap Methods and Their Application

442 9 • Improved Calculation

R

= (9.2)r= l

where as usual F* denotes the E D F corresponding to the rth row o f the array. Let F* denote the average o f these ED Fs, that is

f* = r - ^ f ; + --- + F*r ).

For a frequency table such as Table 9.1, F* is the C D F o f the distribution corresponding to the aggregate frequencies o f da ta values, as shown in the final row. The resulting adjusted bias estimate is

R

B r mj = R - 1 *(£*) - (9-3)r= 1

This is sometimes called the re-centred bias estimate. In addition to the usualA _ _

bootstrap values t(Fr ), its calculation requires only F* and f(F*). N ote that for adjustm ent to work, t( ) m ust be in a functional form, i.e. be defined independently o f sample size n. For example, a variance m ust be calculated with divisor n ra ther than n — 1.

The corresponding calculation for a param etric bootstrap is similar. In effect the adjustm ent com pares the simulated estim ates T ' to the param eter value Or = t(F*) obtained by fitting the model to da ta with E D F F* rather than F.

Example 9.3 (Calcium uptake data) Table 9.3 shows the efficiency gains from using B r ^ in the nonparam etric resampling experim ent described in Example 9.2. The gains are broadly similar to those for balanced resampling, bu t smaller.

For param etric sampling the quantities F ’ in (9.3) represent sets o f da ta generated by param etric sim ulation from the fitted model, and the average F* is the dataset o f size Rn obtained by concatenating the simulated samples. Here the simplest param etric sim ulation is to generate da ta y j = p-j + ej, where the fa are the fitted values from Example 7.7 and the e* are independent iV(0,0.552) variables. In 100 replicates o f this bootstrap with R = 49, the efficiency gains for estim ating the biases o f Po, P\, and a were 24.7, 42.5, and 20.7; the effect o f the adjustm ent is much more m arked for the param etric than for the nonparam etric bootstraps. ■

The same adjustm ent does not apply to the variance approxim ation V r ,

higher m om ents or quantiles. R ather the linear approxim ation is used as a conventional control variate, as described in Section 9.3.

can be w ritten in expanded notation as

Page 455: Bootstrap Methods and Their Application

9.2 ■ Balanced Bootstraps 443

9.2.3 Some theorySome theoretical insight into both balanced sim ulation and post-sim ulation balancing can be gained by means o f the nonparam etric delta m ethod (Sec­tion 2.7). As before, let F* denote the E D F o f a bootstrap sample Y J , . . . , Y„*. The expansion o f T* = t (F’) about F is, to second-order terms,

ti n n

t (F') = tQ(F') = t(F) + n~l 5 3 lj + \ n~ 2 5 3 5 3 q 'jk’ <9'4)j=i j= l t=i

where lj = H Y J ; F) and qjk = q(YJ, Yk‘ ; F) are values o f the empirical first- and second-order derivatives o f t a t F; equation (9.4) is the same as (2.41), but with F and F replaced by F' and F. We call the right-hand side o f (9.4) the quadratic approximation to T". Omission o f the final term leaves the linear approxim ation

n

tL(F’ ) = t(F) + n~l 5 3 l j ’ (9-5)i = i

which is the basis o f the variance approxim ation vL; equation (9.5) is simply a recasting o f (2.44).

In terms o f the frequencies f j with which the yj appear in the bootstrap sample and the empirical influence values lj = l (yj ;F) and qjk = q(yj ,yk;F), the quadratic approxim ation (9.4) is

n = t+ E fpj + K2 E E fjfa*’ w 7=1 7=1 k= 1

in abbreviated notation. Recall tha t 22j h — 0 an^ 22j Qjk = 22k Qjk ~ We can now com pare the resampling schemes through the properties o f the frequencies f j .

Consider bootstrap sim ulation to estimate the bias o f T. Suppose th a t there are R sim ulated samples, and that yj appears in the rth with frequency f ’rJ, while T takes value T ' . Then from (9.2) and (9.6) the bias approxim ation B r = R ~ l 22 T ’ ~ t can be approxim ated by

a-'E *+»-1E a + i»'2E E » ) - c- {9J)r = l \ 7=1 7=1 k = l J

In the ordinary resam pling scheme, the rows o f frequencies (/* 1 , . . . , f ' n) are independent samples from the m ultinom ial distribution with denom inator n and probability vector (n-1 , . . . , n _1). This is the case in Table 9.1. In this situation the first and second jo in t m om ents o f the frequencies are

E*(/V) = 1, co v -(/V ,/;fc) = SrASjk - n~l ),

Page 456: Bootstrap Methods and Their Application

444 9 ■ Improved Calculation

where <5;* = 1 if j = k and zero otherwise, and so forth; the higher cu- m ulants are given in Problem 2.19. Straightforw ard calculations show that approxim ation (9.7) has m ean ^n~2 ^2 jq j j and variance

1Rn 1

j= 1

+

i=1

An1 + 2 ± ^ t ij= i \j= i j j= i * = i

Qjk • (9.8)

For the balanced bootstrap, the jo in t distribution o f the R x n table o f frequencies f ' j is hypergeom etric with row sums n and colum n sums R. Because = 0 and f'r] = R for all j , approxim ation (9.7) becomes

/ = 1 k= l r—l

U nder balanced resam pling one can show (Problem 9.1) that

(nSJk - 1)(JW„ - 1)e *(/*;•) = i, cov*(/;;, / ; , ) =

so the bias approxim ation (9.7) has mean

ni? - 1(9.9)

lM( i ? - l ) _2 , A

j = i

14Rr?

m ore painful calculations show tha t its variance is approxim ately

-2I T 1 qjj + 2nT2R - 2 ( ^ + 2(n - I)/!"1 £ £ q)kj=1 \ ; = 1 / j=1 /c=l

(9.10)The m ean is alm ost the same under bo th schemes, but the leading term o f the variance in (9.10) is smaller than in (9.8) because the term in (9.7) involving the lj is held equal to zero by the balance constraints Y l r f*j = First-order balance ensures that the linear term in the expansion for B r is held equal to its value o f zero for the complete enum eration.

Post-sim ulation balance is closely related to the balanced bootstrap. I t is straightforw ard to see tha t the quadratic nonparam etric delta m ethod approx­im ation o f Bg^adj in (9.3) equals

y = l k= 1 I

(9.11)r= l r= l r= l

Page 457: Bootstrap Methods and Their Application

9.2 • Balanced Bootstraps 445

Figure 9.1 Efficiency comparisons for estimating biases of normal eigenvalues. The left panel compares the efficiency gains over the ordinary bias estimate due to balancing and post-simulation adjustment. The right panel shows the gains for the balanced estimate, as a function of the correlation between the statistic and its linear approximation; the solid line shows the theoretical relation. See text for details.

■O0)ocJSCOm

oin

in©

o 1o jT—

■ V

icy 5.0 . - j :

" . / W v 'c0)oifcLU ■ "■V* ■"

o _____ -»—''r'TV'V.T/* ■

ind

0.1 0.5 5.0

Adjusted

0.0 0.2 0.4 0.6 0.8 1.0

Correlation

Like the balanced bootstrap estimate o f bias, there are no linear term s in this expression. Re-centring has forced those terms to equal their population values o f zero.

W hen the statistic T does not possess an expansion like (9.4), balancing may not help. In any case the correlation between the statistic and its linear approxim ation is im portant: if the correlation is low because the quadratic com ponent o f (9.4) is appreciable, then it may not be useful to reduce variation in the linear com ponent. A rough approxim ation is tha t var*(B«) is reduced by a factor equal to 1 m inus the square o f the correlation between T" and T'L (Problem 9.5).

Example 9.4 (N orm al eigenvalues) For a numerical com parison o f the effi­ciency gains in bias estim ation from balanced resam pling and post-sim ulation adjustm ent, we perform ed M onte C arlo experiments as follows. We generated n variates from the m ultivariate norm al density with dimension 5 and identity covariance m atrix, and took t to be the five eigenvalues of the sample covari­ance matrix. For each sample we used a large bootstrap to estim ate the linear approxim ation t"L for each o f the eigenvalues and then calculated the correla­tion c between t* and t"L. We then estim ated the gains in efficiency for balanced and adjusted estimates o f bias calculated using the bootstrap with R = 39, using variances estim ated from 100 independent bootstrap simulations.

Figure 9.1 shows the gains in efficiency for each o f the 5 eigenvalues, for 50 sets o f da ta with n = 15 and 50 sets with n = 25; there are 500 points in each panel. The left panel com pares the efficiency gains for the balanced and adjusted schemes. Balanced sampling gives better gains than post-sample adjustm ent, bu t the difference is smaller at larger gains. The right panel shows

Page 458: Bootstrap Methods and Their Application

446 9 • Improved Calculation

the efficiency gains for the balanced scheme plotted against the correlation c. The solid line is the theoretical curve (1 — c2)-1 . Knowledge o f c would enable the efficiency gain to be predicted quite accurately, at least for c > 0.8. The potential im provement from balancing is no t guaranteed to be worthwhile when c < 0.7. The corresponding plot for the adjusted estimates suggests that c m ust be at least 0.85 for a useful efficiency gain. ■

This example suggests the following strategy when a good estim ate o f bias is required: perform a small standard unbalanced bootstrap, and use it to estimate the correlation between the statistic and its linear approxim ation. If th a t correlation exceeds about 0.7, it may be worthwhile to perform a balanced simulation, but otherwise it will not. I f the correlation exceeds 0.85, post-sim ulation adjustm ent will usually be worthwhile, but otherwise it will not.

9.3 Control MethodsThe basis o f control m ethods is extra calculation during or after a series o f sim ulations with the aim o f reducing the overall variability o f the estimator. This can be applied to nonparam etric sim ulation in several ways. The post­sim ulation balancing described in the preceding section is a simple control m ethod, in which we store the sim ulated random samples and m ake a single post-sim ulation calculation.

M ost control m ethods involve extra calculations a t the time o f the simulation, and are applicable when there is a simple statistic tha t is highly correlated with T*. Such a statistic is known as a control variate. The key idea is to write T* in terms o f the control variate and the difference between T* and the control variate, and then to calculate the required properties for the control variate analytically, estim ating only the differences by simulation.

Bias and variance

In m any bootstrap contexts where T is an estim ator, a natural choice for the control variate will be the linear approxim ation T[ defined in (2.44). The m om ents o f can be obtained theoretically using m om ents o f the frequencies f j . In ordinary random sampling the f j are m ultinom ial, so the m ean and variance o f T£ are

E'(T'l ) = t, var ' ( T i ) = n~2 £ lj = vL.7=1

In order to use T ’L as a control variate, we write T* = T[ + D ’, so that D* equals the difference T * — T[. The m ean and variance o f T* can then

Page 459: Bootstrap Methods and Their Application

9.3 ■ Control Methods 447

be w ritten

E 'e r* ) = E m( T l ) + E*(D‘), var *(T*) = var *(T£) + 2co v ' { T ’L, D ‘) + var *(/)*),

the leading term s o f which are known. Only terms involving D * need to be approxim ated by simulation. Given sim ulations T with corresponding linear approxim ations and differences D* = T* — T£r, the meanand variance o f T* are estim ated by

i? i? t + D \ VKcon = v L + ^ ^ ( T £ r - f i ) ( D r* - D') + ^ J 2 ( D ; ~ D ' ) 2,

r= l r= l(9.12)

where T[ = Ylr ^L,r and D" = Use o f these and relatedapproxim ations requires the calculation o f the T[ r as well as the T*.

The estim ated bias o f T* based on (9.12) is B r co„ = D ' . This is closely related to the estim ate obtained under balanced sim ulation and to the re­centred bias estim ate B r ^ . Like them, it ensures that the linear com ponent o f the bias estim ate equals its population value, zero. Detailed calculation shows that all three approaches achieve the same variance reduction for the bias estimate in large samples. However, the variance estimate in (9.12) based on linear approxim ation is less variable than the estim ated variances obtained under the o ther approaches, because its leading term is no t random .

Example 9.5 (City population data) To see how effective control m ethods are in reducing the variability o f a variance estimate, we consider the ratio statistic for the city population data in Table 2.1, with n = 10. For 100 bootstrap sim ulations with R = 50, we calculated the usual variance estimate vr = ( R — I)-1 — t*)2 and the estim ate VR>con from (9.12). The estim atedgain in efficiency calculated from the 100 simulations is 1.92, which though worthwhile is no t large. The correlation between t* and t ‘L is 0.94.

For the larger set o f da ta in Table 1.3, with n = 49, we repeated the experim ent with R = 100. Here the gain in efficiency is 7.5, and the correlation is 0.99.

Figure 9.2 shows scatter plots o f the estim ated variances in these experiments. For both sample sizes the values of v r <co„ are more concentrated than the values o f vR, though the m ain effect o f control is to increase underestim ates o f the true variances. ■

Example 9.6 (Frets heads) The data o f Example 3.24 are a sample o f n = 25 cases, each consisting o f 4 measurements. We consider the efficiency gains from using v ^ con to estim ate the bootstrap variances o f the eigenvalues of their covariance matrix. The correlations between the eigenvalues and their linear approxim ations are 0.98, 0.89, 0.85 and 0.74, and the gains in efficiency estim ated from 100 replicate bootstraps o f size R = 39 are 2.3, 1.6, 0.95 and

Page 460: Bootstrap Methods and Their Application

448 9 • Improved Calculation

0

/ : •

0

.....................................

Usual Usual

1.3. The four left panels o f Figure 9.3 show plots o f the values o f v r >co„ against the values o f v r . No strong pattern is discernible.

To get a m ore systematic idea o f the effectiveness o f control m ethods in this setting, we repeated the experim ent outlined in Example 9.4 and com pared the usual and control estimates o f the variances o f the five eigenvalues. The results for the five eigenvalues and n = 15 and 25 are shown in Figure 9.3. G ains in efficiency are not guaranteed unless the correlation between the statistic and its linear approxim ation is 0.80 or more, and they are not large unless the correlation is close to one. The line y = (1 — x4)-1 summarizes the efficiency gain well, though we have not attem pted to justify this. ■

QuantilesC ontrol m ethods may also be applied to quantiles. Suppose tha t we have the simulated values t\, ..., t ’R o f a statistic, and tha t the corresponding control variates and differences are available. We now sort the differences by the values o f the control variates. For example, if our control variate is a linear approxim ation, with R = 4 and t 'L 2 < t"L, < t*L4 < t] 3, we put the differences in order d"2, d\, d"4, d\. The procedure now is to replace the p quantile o f the linear approxim ation by a theoretical approxim ation, tp, for p = 1/(jR + 1 ) ,... , R / ( R + 1), thereby replacing t'r) with t ’C r = tp + d '(r), where 7t(r) is the rank o f t'L r. In our example we would obtain t ’c j = t0.2 + d'2, t'c 2 = £0 . 4 + d.\, t'c 3 = to. 6 + d\, and t ’CA = fo.g + d\. We now estimate the pth quantile o f the distribution o f T by t'c ^ , i.e. the rth quantile o f t“c v .. . ,t*CR. I f the control variate is highly correlated with T m, the bulk o f the variability in the estim ated quantiles will have been removed by using the theoretical approxim ation.

Figure 9.2 Comparison of estimated variances (xlO-2) for city population ratio, using usual and control methods, for n = 10 with R = 50 (left) and for n = 49 with R = 100 (right). The dotted line is the line x = y, and the dashed lines show the “true” variances, estimated from a much larger simulation.

Page 461: Bootstrap Methods and Their Application

9.3 ■ Control Methods 449

Figure 9.3 Efficiency comparisons for estimating variances of eigenvalues. The left panels compare the usual and control variance estimates for the data ofExample 3.24, for which n = 25, when R = 39. The right panel shows the gains made by the control estimate in 50 samples of sizes 15 and 25 from the normal distribution, as a function of the correlation between the statistic and its linear approximation; the solid line shows the line y = (1 — x4)-1. See text for details.

Third Fourth

S 10 15 20 25

0.0 0.2 0.4 0.6 0.8 1.0

Correlation

One desirable property o f the control quantile estim ates is that, unlike most other variance reduction m ethods, their accuracy improves with increasing n as well as R.

There are various ways to calculate the quantiles o f the control variate. The preferred approach is to calculate the entire distribution o f the control variate by saddlepoint approxim ation (Section 9.5), and to read off the required quan­tiles tp. This is better than other methods, such as C orn ish 'F isher expansion, because it guarantees tha t the quantiles o f the control variate will increase with p.

Example 9.7 (Returns data) To assess the usefulness o f the control m ethod ju st described, we consider setting studentized bootstrap confidence intervals for the rate o f return in Example 6.3. We use case resampling to estimate quantiles o f T* = (/?J —/?i ) / S \ where fli is the estim ate o f the regression slope, and S 2 is the robust estim ated variance o f fii based on the linear approxim ation to Pi.

For a single bootstrap sim ulation we calculated three estimates o f the quan­tiles o f T * : the usual estimates, the order statistics < ■ ■ ■ < t 'R); the control estimates taking the control variate to be the linear approxim ation to T* based on exact empirical influence values; and the control estimates obtained using the linear approxim ation with empirical influence values estim ated by regression on the frequency array for the same bootstrap. In each case the quantiles o f the control variate were obtained by saddlepoint approxim ation, as outlined in Example 9.13 below. We used R = 999 and repeated the experi­m ent 50 times in order to estimate the variance o f the quantile estimates. We

Page 462: Bootstrap Methods and Their Application

450 9 * Improved Calculation

CMo

Figure 9.4 Efficiency and bias comparisons for estimating quantiles o f a studentizedbootstrap statistic for the returns data, based on a bootstrap of size R = 999. The left panel

c® O

o

-3 -2 -1 0

Normal quantile

2 3

CMo

-3 -2 -1 0

Normal quantile

2 3

shows the variance of the usual quantile estimate divided by the variance o f the control estimate based on an exact linear approximation, plotted against thecorresponding normal quantile. The dashed lines show efficiencies of 1, 2, 3, 4 and 5. The right panel shows the estimated biases for the exact control (solid) and estimated control (dots)

estim ated their bias by com paring them with quantiles o f T * obtained from quantiles. See text for

100000 bootstrap resamples. detailsFigure 9.4 shows the efficiency gains o f the exact control estimates relative

to the usual estimates. The efficiency gain based on the linear approxim ation is not shown, bu t it is very similar. The right panel shows the biases o f the two control estimates. The efficiency gains are largest for central quantiles, and are o f order 1.5-3 for the quantiles o f m ost interest, a t about 0.025-0.05 and 0.95-0.975. There is some suggestion tha t the control estimates based on the linear approxim ation have the smaller bias, but both sets o f biases are negligible a t all but the m ost extreme quantiles.

The efficiency gains in this example are broadly in line with simulations reported in the literature; see also Example 9.10 below. ■

9.4 Importance Resampling

9.4.1 Basic estimatorsImportance sampling

M ost o f our sim ulation calculations can be thought o f as approxim ate inte­grations, with the aim o f approxim ating

for some function m(), where y ' is abbreviated notation for a simulated d a ta ­set. In expression (9.1), for example, m(y' ) = t(y*), and the distribution G for y* = (y^,..., y„*) puts mass n~n on each element o f the set f f = {y i,...,y„}".

Page 463: Bootstrap Methods and Their Application

9.4 ■ Importance Resampling 451

When it is impossible to evaluate the integral directly, our usual approach is to generate $R$ independent samples $Y_1^*,\ldots,Y_R^*$ from $G$, and to estimate $\mu$ by

$$\hat\mu_G = R^{-1}\sum_{r=1}^R m(Y_r^*).$$

This estimator has mean and variance

$$E_G(\hat\mu_G) = \mu, \qquad \operatorname{var}_G(\hat\mu_G) = R^{-1}\left\{\int m(y^*)^2\, dG(y^*) - \mu^2\right\},$$

and so is unbiased for $\mu$. In the situation mentioned above, this is a re-expression of ordinary bootstrap simulation. We use notation such as $\hat\mu_G$ and $E_G$ to indicate that estimates are calculated from random variables simulated from $G$, and that moment calculations are with respect to the distribution $G$.

One problem with $\hat\mu_G$ is that some values of $y^*$ may contribute much more to $\mu$ than others. For example, suppose that the aim is to approximate the probability $\Pr^*(T^* < t_0 \mid \hat F)$, for which we would take $m(y^*) = I\{t(y^*) < t_0\}$, where $I$ is the indicator function. If the event $t(y^*) < t_0$ is rare, then most of the simulations will contribute zero to the integral. The aim of importance sampling is to sample more frequently from those "important" values of $y^*$ whose contributions to the integral are greatest. This is achieved by sampling from a distribution that concentrates probability on these $y^*$, and then weighting the values of $m(y^*)$ so as to mimic the approximation we would have used if we had sampled from $G$. Importance sampling in the case of the nonparametric bootstrap amounts to re-weighting samples from the empirical distribution function $\hat F$, so in this context it is sometimes known as importance resampling.

The identity that motivates importance sampling is

$$\mu = \int m(y^*)\, dG(y^*) = \int m(y^*)\frac{dG(y^*)}{dH(y^*)}\, dH(y^*), \qquad (9.14)$$

where necessarily the support of $H$ includes the support of $G$. Importance sampling approximates the right-hand side of (9.14) using independent samples $Y_1^*,\ldots,Y_R^*$ from $H$. The new approximation for $\mu$ is the raw importance sampling estimate

$$\hat\mu_{H,\mathrm{raw}} = R^{-1}\sum_{r=1}^R m(Y_r^*)\, w(Y_r^*), \qquad (9.15)$$

where $w(y^*) = dG(y^*)/dH(y^*)$ is known as the importance sampling weight. The estimate $\hat\mu_{H,\mathrm{raw}}$ has mean $\mu$ by virtue of (9.14), so is unbiased, and has variance

$$\operatorname{var}_H(\hat\mu_{H,\mathrm{raw}}) = R^{-1}\left\{\int m(y^*)^2 w(y^*)\, dG(y^*) - \mu^2\right\}. \qquad (9.16)$$


Our aim is now to choose $H$ so that

$$\int m(y^*)^2 w(y^*)\, dG(y^*) < \int m(y^*)^2\, dG(y^*). \qquad (9.17)$$

Clearly the best choice is the one for which $m(y^*)w(y^*) = \mu$, because then $\hat\mu_{H,\mathrm{raw}}$ has zero variance, but this is not usable because $\mu$ is unknown. In general it is hard to choose $H$, but sometimes the choice is straightforward, as we now outline.

Tilted distributions

A potentially important application is calculation of tail probabilities such as $\pi = \Pr^*(T^* < t_0 \mid \hat F)$, and the corresponding quantiles of $T^*$. For probabilities $m(y^*)$ is taken to be the indicator function $I\{t(y^*) < t_0\}$, and if $y_1,\ldots,y_n$ is a single random sample from the EDF $\hat F$ then $dG(y^*) = n^{-n}$. Any admissible nonparametric choice for $H$ is a multinomial distribution with probability $p_j$ on $y_j$, for $j = 1,\ldots,n$. Then

$$dH(y^*) = \prod_j p_j^{f_j^*},$$

where $f_j^*$ counts how many components of $Y^*$ equal $y_j$. We would like to choose the probabilities $p_j$ to minimize $\operatorname{var}_H(\hat\mu_{H,\mathrm{raw}})$, or at least to make this much smaller than $R^{-1}\pi(1-\pi)$. This appears to be impossible in general, but if $T$ is close to normal we can get a good approximate solution.

Suppose that $T^*$ has a linear approximation $T_L^*$ which is accurate, and that the $N(t, v_L)$ approximation for $T_L^*$ under ordinary resampling is accurate. Then the probability $\pi$ we are trying to approximate is roughly $\Phi\{(t_0 - t)/v_L^{1/2}\}$. If we were using simulation to approximate such a normal probability directly, then provided that $t_0 < t$ a good (near-optimal) importance sampling method would be to generate $t^*$'s from the $N(t_0, v_L)$ distribution, where $v_L$ is the nonparametric delta method variance. It turns out that we can arrange that this happens approximately for $T^*$ by setting

$$p_j \propto \exp(\lambda l_j), \qquad j = 1,\ldots,n, \qquad (9.18)$$

where the $l_j$ are the usual empirical influence values for $t$. The result of Problem 9.10 shows that under this distribution $T^*$ is approximately $N(t + \lambda n v_L, v_L)$, so the appropriate choice for $\lambda$ in (9.18) is approximately $\lambda = (t_0 - t)/(n v_L)$, again provided $t_0 < t$; in some cases it is possible to choose $\lambda$ to make $T^*$ have mean exactly $t_0$. The choice of probabilities given by (9.18) is called an exponential tilting of the original values $n^{-1}$. This idea is also used in Sections 4.4, 5.3, and 10.2.2.
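The tilting recipe is simple to put into code. The following minimal Python sketch is purely illustrative, not the book's own software; the data, statistic and all names are assumptions. It tilts the EDF via (9.18) and forms the raw estimate (9.15) of a lower tail probability for the sample variance of an exponential sample.

import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(size=15)              # hypothetical data
n = len(y)
t = np.var(y)                             # t = n^{-1} sum (y_j - ybar)^2
l = (y - y.mean())**2 - t                 # empirical influence values l_j
vL = np.sum(l**2) / n**2                  # delta-method variance v_L

t0 = t - 2.0 * vL**0.5                    # a point in the lower tail (t0 < t)
lam = (t0 - t) / (n * vL)                 # tilt parameter for (9.18)
p = np.exp(lam * l)
p /= p.sum()                              # tilted probabilities p_j

R = 999
idx = rng.choice(n, size=(R, n), p=p)     # R resamples drawn from H
t_star = np.var(y[idx], axis=1)
# w = dG/dH = prod_j (n p_j)^{-f_j}, accumulated on the log scale
log_w = -np.log(n * p[idx]).sum(axis=1)
mu_raw = np.mean((t_star <= t0) * np.exp(log_w))   # raw estimate (9.15)
print(mu_raw)

Because each resample's weight is a product of $n$ factors, working on the log scale is essential to avoid numerical underflow.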

Table 9.4 shows approximate values of the efficiency $R^{-1}\pi(1-\pi)/\operatorname{var}_H(\hat\mu_{H,\mathrm{raw}})$ of near-optimal importance resampling for various values of the tail probability $\pi$. The values were calculated using normal approximations for the distributions of $T^*$ under $G$ and $H$; see Problem 9.8.

Table 9.4 Approximate efficiencies for estimating tail probability π under importance sampling with optimal tilted EDF when T is approximately normal.

π           0.01   0.025   0.05   0.2   0.5   0.8    0.95    0.975    0.99
Efficiency  37     17      9.5    3.0   1.0   0.12   0.003   0.0005   0.00004

The entries in the table suggest that for $\pi < 0.05$ we could attain the same accuracy as with ordinary resampling with $R$ reduced by a factor larger than about 10. Also shown in the table is the result of applying the exponential tilted importance resampling distribution when $t > t_0$, or $\pi > 0.5$: then importance resampling will be worse, possibly much worse, than ordinary resampling.

This last observation is a warning: straightforward importance sampling can be bad if misapplied. We can see how from (9.17). If $dH(y^*)$ becomes very small where $m(y^*)$ and $dG(y^*)$ are not small, then $w(y^*) = dG(y^*)/dH(y^*)$ will become very large and inflate the variance. For the tail probability calculation, if $t_0 > t$ then all samples $y^*$ with $t(y^*) < t_0$ contribute $R^{-1}w(y_r^*)$ to $\hat\mu_{H,\mathrm{raw}}$, and some of these contributions are enormous: although rare, they wreak havoc on $\hat\mu_{H,\mathrm{raw}}$.

A little thought shows that for $t_0 > t$ one should apply importance sampling to estimate $1 - \pi = \Pr^*(T^* > t_0)$ and subtract the result from 1, rather than estimate $\pi$ directly.

Quantiles

To see how quantiles are estimated, suppose that we want to estimate the $\alpha$ quantile of the distribution of $T^*$, and $T^*$ is approximately $N(t, v_L)$ under $G = \hat F$. Then we take a tilted distribution for $H$ such that $T^*$ is approximately $N(t + z_\alpha v_L^{1/2}, v_L)$. For the situation we have been discussing, the exponential tilted distribution (9.18) will be near-optimal with $\lambda = z_\alpha/(n v_L^{1/2})$, and in large samples this will be superior to $G = \hat F$ for any $\alpha \ne \frac{1}{2}$. So suppose that we have used importance resampling from this tilted distribution to obtain ordered values $t_{(1)}^* \le \cdots \le t_{(R)}^*$ with corresponding weights $w_1^*,\ldots,w_R^*$. Then for $\alpha < \frac{1}{2}$ the raw quantile estimate is $t_{(M)}^*$, where

$$\frac{1}{R+1}\sum_{r=1}^{M} w_r^* \le \alpha < \frac{1}{R+1}\sum_{r=1}^{M+1} w_r^*, \qquad (9.19)$$

while for $\alpha > \frac{1}{2}$ we define $M$ by

$$\frac{1}{R+1}\sum_{r=M}^{R} w_r^* \le 1-\alpha < \frac{1}{R+1}\sum_{r=M+1}^{R} w_r^*;$$

see Problem 9.9. When there is no importance sampling we have $w_r^* \equiv 1$, and the estimate equals the usual $t^*_{((R+1)\alpha)}$.
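As a concrete illustration, the following short Python sketch (an illustration under assumed names, not the book's code) implements the raw quantile estimate (9.19) from resampled values and their weights; it is intended for $\alpha < \frac{1}{2}$, with a guard for the extreme lower tail.

import numpy as np

def raw_quantile(t_star, w, alpha):
    # Order the t*'s, then find the largest M with
    # (R+1)^{-1} * sum_{r<=M} w_r <= alpha, as in (9.19).
    order = np.argsort(t_star)
    ts, ws = t_star[order], w[order]
    cum = np.cumsum(ws) / (len(ts) + 1.0)
    M = np.searchsorted(cum, alpha, side="right")  # number of sums <= alpha
    return ts[max(M, 1) - 1]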

The variation in $w(y^*)$ and its implications are illustrated in the following example. We discuss stabilizing modifications to raw importance resampling in the next subsection.

Example 9.8 (Gravity data) For an example of importance resampling, we follow Example 4.19 and consider testing for a difference in means for the last two series of Table 3.1. Here we use the studentized pivot test, with observed test statistic

$$z_0 = \frac{\bar y_2 - \bar y_1}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}, \qquad (9.20)$$

where $\bar y_i$ and $s_i^2$ are the average and variance of the sample $y_{i1},\ldots,y_{in_i}$ for $i = 1, 2$. The test compares $z_0$ to the general distribution of the studentized pivot

$$Z = \frac{\bar Y_2 - \bar Y_1 - (\mu_2 - \mu_1)}{(S_2^2/n_2 + S_1^2/n_1)^{1/2}};$$

$z_0$ is the value taken by $Z$ under the null hypothesis $\mu_1 = \mu_2$. The observed value of $z_0$ is 1.84, with normal one-sided significance probability $\Pr(Z > z_0) = 0.033$.

We aim to estimate $\Pr(Z > z_0)$ by $\Pr^*(Z^* > z_0 \mid \hat F)$, where $\hat F$ stands for the EDFs of the two samples. In this case $y^* = (y_{11}^*,\ldots,y_{1n_1}^*, y_{21}^*,\ldots,y_{2n_2}^*)$, and $G$ is the joint density under the two EDFs, so the probability on each simulated dataset is $dG(y^*) = n_1^{-n_1}\times n_2^{-n_2}$.

Because $z_0 > 0$ and the $P$-value is clearly below $\frac{1}{2}$, raw importance sampling is appropriate, and the estimated $P$-value is

$$\hat\mu_{H,\mathrm{raw}} = R^{-1}\sum_{r=1}^R I\{z_r^* > z_0\}\, w_r^*, \qquad w_r^* = \frac{dG(y_r^*)}{dH(y_r^*)}.$$

The choice of $H$ is made by analogy with the single-sample case discussed earlier. The two EDFs are tilted so as to make $Z^*$ approximately $N(z_0, v_L)$, which should be near-optimal. This is done by working with the linear approximation

$$Z_L^* = n_1^{-1}\sum_{j=1}^{n_1} f_{1j}^*\, l_{1j} + n_2^{-1}\sum_{j=1}^{n_2} f_{2j}^*\, l_{2j},$$

where $f_{1j}^*$ and $f_{2j}^*$ are the bootstrap sample frequencies of $y_{1j}$ and $y_{2j}$, and the empirical influence values are

$$l_{1j} = -\frac{y_{1j} - \bar y_1}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}, \qquad l_{2j} = \frac{y_{2j} - \bar y_2}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}.$$

We take $H$ to be the pair of exponential tilted distributions

$$p_{1j} = \Pr(Y_1^* = y_{1j}) \propto \exp(\lambda l_{1j}/n_1), \qquad p_{2j} = \Pr(Y_2^* = y_{2j}) \propto \exp(\lambda l_{2j}/n_2), \qquad (9.21)$$


[Figure 9.5: Importance resampling to test for a location difference between series 7 and 8 of the gravity data. The solid points in the left panel are the weights w* and bootstrap statistics z* for R = 99 importance resamples; the hollow points are the pairs (z*, w*) for 99 ordinary resamples. The right panel compares the survivor function Pr*(Z* > z) estimated from 50 000 ordinary bootstrap resamples (heavy solid) with estimates of it based on the 99 ordinary bootstrap samples (dashes) and the 99 importance resamples (solid). The vertical dotted lines show z_0.]

where $\lambda$ is chosen so that $Z_L^*$ has mean $z_0$: this should make $Z^*$ approximately $N(z_0, v_L)$ under $H$. The explicit equation for $\lambda$ is

$$\frac{\sum_{j=1}^{n_1} l_{1j}\exp(\lambda l_{1j}/n_1)}{\sum_{j=1}^{n_1}\exp(\lambda l_{1j}/n_1)} + \frac{\sum_{j=1}^{n_2} l_{2j}\exp(\lambda l_{2j}/n_2)}{\sum_{j=1}^{n_2}\exp(\lambda l_{2j}/n_2)} = z_0,$$

with approximate solution $\lambda = z_0$ since $v_L = 1$. For our data the exact solution is $\lambda = 1.42$.
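Each term on the left of this equation is an increasing function of $\lambda$, so the equation has a single root, which is easily found numerically. A minimal sketch (illustrative names, not the book's code; the bracketing interval is an assumption):

import numpy as np
from scipy.optimize import brentq

def tilted_mean(lam, l, n):
    # Mean of the influence values under the exponential tilt (9.21),
    # stabilized by subtracting the largest exponent.
    e = lam * l / n
    u = np.exp(e - e.max())
    return np.sum(l * u) / np.sum(u)

def solve_lambda(l1, l2, z0, lo=-20.0, hi=20.0):
    f = lambda lam: (tilted_mean(lam, l1, len(l1))
                     + tilted_mean(lam, l2, len(l2)) - z0)
    return brentq(f, lo, hi)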

Figure 9.5 shows results for $R = 99$ simulations. The solid points in the left panel are the weights

$$w_r^* = \frac{dG(y_r^*)}{dH(y_r^*)} = \exp\left\{-\sum_j f_{1j}^*\log(n_1 p_{1j}) - \sum_j f_{2j}^*\log(n_2 p_{2j})\right\},$$

plotted against the bootstrap values $z_r^*$ for the importance resamples. These values of $z^*$ are shifted to the right relative to the hollow points, which show the values of $z^*$ and $w^*$ (all equal to 1) for 99 ordinary resamples. The values of $w^*$ for the importance re-weighting vary over several orders of magnitude, with the largest values when $z^* \ll z_0$. But only those for $z^* > z_0$ contribute to $\hat\mu_{H,\mathrm{raw}}$.

How well does this single importance resampling distribution work for estimating all values of the survivor function $\Pr^*(Z^* > z)$? The heavy solid line in the right panel shows the "true" survivor function of $Z^*$ estimated from 50 000 ordinary bootstrap simulations. The lighter solid line is the importance resampling estimate

$$R^{-1}\sum_{r=1}^R w_r^*\, I\{z_r^* > z\}$$

with $R = 99$, and the dotted line is the estimate based on 99 ordinary bootstrap samples from the null distribution. The importance resampling estimate follows the "true" survivor function accurately close to $z_0$ but does poorly for negative $z^*$. The usual estimate does best near $z^* = 0$ but poorly in the tail region of interest, where its estimated significance probability is 0. While the usual estimate decreases by $R^{-1}$ at each $z^*$, the weighted estimate decreases by much smaller jumps close to $z_0$; the raw importance sampling tail probability estimate is $\hat\mu_{H,\mathrm{raw}} = 0.015$, which is very close to the true value. The weighted survivor function estimate has large jumps in its left tail, where the estimate is unreliable.

In 50 repetitions of this experiment the ordinary and raw importance resampling tail probability estimates had variances $2.09\times 10^{-4}$ and $2.63\times 10^{-5}$. For a tail probability of 0.015 this efficiency gain of about 8 is smaller than would be predicted from Table 9.4, the reason being that the distribution of $z^*$ is rather skewed and the normal approximation to it is poor. ■

In general there are several ways to obtain tilted distributions. We can use exponential tilting with exact empirical influence values, if these are readily available. Or we can estimate the influence values by regression using $R_0$ initial ordinary bootstrap resamples, as described in Section 2.7.4. Another way of using an initial set of bootstrap samples is to derive weighted smooth distributions as in (3.39): illustrations of this are given later in Examples 9.9 and 9.11.

9.4.2 Improved estimators

Ratio and regression estimators

One simple modification of the raw importance sampling estimate is based on the fact that the average weight $R^{-1}\sum_r w(Y_r^*)$ from any particular simulation will not equal its theoretical value of $E\{w(Y^*)\} = 1$. This suggests that the weights $w(Y_r^*)$ be normalized, so that (9.15) is replaced by the importance resampling ratio estimate

$$\hat\mu_{H,\mathrm{rat}} = \frac{\sum_{r=1}^R m(Y_r^*)\, w(Y_r^*)}{\sum_{r=1}^R w(Y_r^*)}. \qquad (9.22)$$

To some extent this controls the effect of very large fluctuations in the weights.

In practice it is better to treat the weight as a control variate or covariate. Since our aim in choosing $H$ is to concentrate sampling where $m(\cdot)$ is largest, the values of $m(Y_r^*)w(Y_r^*)$ and $w(Y_r^*)$ should be correlated. If so, and if the average weight differs from its expected value of one under simulation from $H$, then the estimate $\hat\mu_{H,\mathrm{raw}}$ probably differs from its expected value $\mu$. This motivates the covariance adjustment made in the importance resampling regression estimate

$$\hat\mu_{H,\mathrm{reg}} = \hat\mu_{H,\mathrm{raw}} - b(\bar w^* - 1), \qquad (9.23)$$

where $\bar w^* = R^{-1}\sum_r w(Y_r^*)$, and $b$ is the slope of the linear regression of the $m(Y_r^*)w(Y_r^*)$ on the $w(Y_r^*)$. The estimator $\hat\mu_{H,\mathrm{reg}}$ is the predicted value of $m(Y^*)w(Y^*)$ at the point $w(Y^*) = 1$.

The adjustm ents m ade to pH,raw in bo th pH,rat and pH,reg m ay induce bias, but such biases will be o f order R ~ l and will usually be negligible relative to sim ulation standard errors. Calculations outlined in Problem 9.12 indicate that for large R the regression estim ator should outperform the raw and ratio estimators, but the im provement depends on the problem, and in practice the raw estim ator o f a tail probability or quantile is usually the best.
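In code, all three estimators come almost for free once the values $m(Y_r^*)$ and weights $w(Y_r^*)$ are available. A sketch (illustrative only; np.polyfit supplies the regression slope $b$):

import numpy as np

def is_estimates(m, w):
    # Raw (9.15), ratio (9.22) and regression (9.23) estimates from
    # simulated values m_r = m(Y_r*) and weights w_r = w(Y_r*).
    mw = m * w
    raw = mw.mean()                        # (9.15)
    ratio = mw.sum() / w.sum()             # (9.22)
    b = np.polyfit(w, mw, 1)[0]            # slope of regression of mw on w
    reg = raw - b * (w.mean() - 1.0)       # (9.23): prediction at w = 1
    return raw, ratio, reg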

Defensive mixtures

A second improvement aims to prevent the weight $w(y^*)$ from varying wildly. Suppose that $H$ is a mixture of distributions, $\pi H_1 + (1-\pi)H_2$, where $0 < \pi < 1$. The distributions $H_1$ and $H_2$ are chosen so that the corresponding probabilities are not both small simultaneously. Then the weights

$$w(y^*) = \frac{dG(y^*)}{\pi\, dH_1(y^*) + (1-\pi)\, dH_2(y^*)}$$

will vary less, because even if $dH_1(y^*)$ is very small, $dH_2(y^*)$ will keep the denominator away from zero, and vice versa. This choice of $H$ is known as a defensive mixture distribution, and it should do particularly well if many estimates, with different $m(y^*)$, are to be calculated. The mixture is applied by stratified sampling, that is by generating exactly $\pi R$ observations from $H_1$ and the rest from $H_2$, and using $\hat\mu_{H,\mathrm{reg}}$ as usual.

The components of the mixture $H$ should be chosen to ensure that the relevant range of values of $t^*$ is well covered, but beyond this the detailed choice is not critical. For example, if we are interested in quantiles of $T^*$ for probabilities between $\alpha$ and $1-\alpha$, then it would be sensible to target $H_1$ at the $\alpha$ quantile and $H_2$ at the $1-\alpha$ quantile, most simply by the exponential tilting method described earlier. As a further precaution we might add a third component to the mixture, such as $G$, to ensure stable performance in the middle of the distribution. In general the mixture could have many components, but careful choice of two or three will usually be adequate. Always the application of the mixture should be by stratified sampling, to reduce variation.
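For multinomial resampling the mixture weight is a simple function of the resample frequencies. The sketch below is an assumed layout (not the book's code), with $G$ the EDF and $H_1$, $H_2$ tilted multinomials on the $n$ cases; np.logaddexp keeps the products stable.

import numpy as np

def mixture_log_weights(freq, p1, p2, pi1):
    # freq: R x n resample frequencies f_j*; p1, p2: probabilities on
    # the n cases under H1 and H2; pi1: mixture proportion of H1.
    # Returns log of dG / {pi1 dH1 + (1 - pi1) dH2} for each resample.
    n = freq.shape[1]
    logG = -freq.sum(axis=1) * np.log(n)       # log n^{-n}
    logH1 = freq @ np.log(p1)
    logH2 = freq @ np.log(p2)
    logH = np.logaddexp(np.log(pi1) + logH1, np.log1p(-pi1) + logH2)
    return logG - logH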

Example 9.9 (Gravity data) To illustrate the above ideas, we again consider the hypothesis testing problem of Example 9.8. The left panel of Figure 9.6 shows 20 replicate estimates of the null survivor function of $z^*$, using ordinary bootstrap resampling with $R = 299$. The right panel shows 20 estimates of the survivor function using the regression estimate $\hat\mu_{H,\mathrm{reg}}$ after simulations with a defensive mixture distribution. This mixture has three components, which are $G$ (the two EDFs) and two pairs of exponential tilted distributions targeted at the 0.025 and 0.975 quantiles of $Z^*$. From our earlier discussion these distributions are given by (9.21) with $\lambda = \pm 2/v_L^{1/2}$; we shall denote the first pair of distributions by probabilities $p_{1j}$ and $p_{2j}$, and the second by probabilities $q_{1j}$ and $q_{2j}$. The first component $G$ was used for $R_1 = 99$ samples, the second component (the $p$s) for $R_2 = 100$ and the third component (the $q$s) for $R_3 = 100$: the mixture proportions were therefore $\pi_j = R_j/(R_1 + R_2 + R_3)$ for $j = 1, 2, 3$. The importance resampling weights were

$$w_r^* = \frac{n_1^{-n_1} n_2^{-n_2}}{\pi_1\, n_1^{-n_1} n_2^{-n_2} + \pi_2 \prod_j p_{1j}^{f_{1j}^*}\prod_j p_{2j}^{f_{2j}^*} + \pi_3 \prod_j q_{1j}^{f_{1j}^*}\prod_j q_{2j}^{f_{2j}^*}},$$

where as before $f_{1j}^*$ and $f_{2j}^*$ respectively count how many times $y_{1j}$ and $y_{2j}$ appear in the resample.

For convenience we estimated the CDF of $Z^*$ at the sample values $z_r^*$. The regression estimate at $z_r^*$ is obtained by setting $m(y^*) = I\{z(y^*) \le z(y_r^*)\}$ and calculating (9.23); this appears to involve 299 regressions, one for each CDF estimate, but Problem 9.13 shows how in fact just one matrix calculation is needed. The importance resampling estimate of the CDF is about as variable as the ordinary estimate over most of the distribution, but much less variable well into the tails.

[Figure 9.6: Importance resampling to test for a location difference between series 7 and 8 of the gravity data. In each panel the heavy solid line is the survivor function Pr*(Z* > z) estimated from 50 000 ordinary bootstrap resamples and the vertical dotted lines show z_0. The left panel shows the estimates for 20 ordinary bootstraps of size 299. The right panel shows 20 importance resampling estimates using 299 samples with a regression estimate following resampling from a defensive mixture distribution with three components. See text for details.]

Table 9.5 Efficiency gains (ratios of mean squared errors) for estimating a tail probability, a bias, a variance and two quantiles for the gravity data, using importance resampling estimators together with defensive mixture distributions, compared to ordinary resampling. The mixtures have R1 ordinary bootstrap samples mixed with R2 samples exponentially tilted to the 0.025 quantile of z*, and with R3 samples exponentially tilted to the 0.975 quantile of z*. See text for details.

Mixture (R1, R2, R3)   Estimate     Pr*(Z* > z0)   E*(Z*)   var*(Z*)   z*(0.05)   z*(0.025)
-, -, 299              Raw          11.2           0.04     0.03       0.07       -
                       Ratio        3.5            0.06     0.05       0.06       0.05
                       Regression   12.4           0.18     0.07       0.06       0.04
99, 100, 100           Raw          3.8            0.73     1.5        1.3        2.5
                       Ratio        3.4            0.79     1.5        0.93       1.3
                       Regression   4.0            0.93     1.6        0.87       1.2
19, 140, 140           Raw          3.9            0.34     1.2        0.96       2.6
                       Ratio        2.3            0.43     0.82       0.48       1.1
                       Regression   4.3            0.69     1.3        0.44       1.3

For a more systematic comparison, we calculated the ratio of the mean squared error from ordinary resampling to that when using defensive mixture distributions to estimate the tail probability $\Pr^*(Z^* > z_0)$ with $z_0 = 1.77$, two quantiles, and the bias $E^*(Z^*)$ and the variance $\operatorname{var}^*(Z^*)$ for sampling from the two series. The mixture distributions have the same three components as before, but with different values for the numbers of samples $R_1$, $R_2$ and $R_3$ from each. Table 9.5 gives the results for three resampling mixtures with a total of $R = 299$ resamples in each case. The mean squared errors were estimated from 100 replicate bootstraps, with "true" values obtained from a single bootstrap of size 50 000. The main contribution to the mean squared error is from variance rather than bias.

The first resampling distribution is not a mixture, but simply the exponential tilt to the 0.975 quantile. This gives the best estimates of the tail probability, with efficiencies for raw and regression estimates in line with Example 9.8, but it gives very poor estimates of the other quantities. For the other two mixtures the regression estimates are best for estimating the mean and variance, while the raw estimates are best for the quantiles and not really worse for the tail probability. Both mixtures are about the same for tail quantiles, while the first mixture is better for the moments.

In this case the efficiency gains for tail probabilities and quantiles predicted by Table 9.4 are unrealistic, for two reasons. First, the table compares 299 ordinary simulations with just 100 tilted to each tail of the first mixture distribution, so we would expect the variance for a tail quantity based on the mixture to be larger by a factor of about three; this is just what we see when the first distribution is compared to the second. Secondly, the distribution of $Z^*$ is quite skewed, which considerably reduces the efficiency out as far as the 0.95 quantile.

We conclude that the regression estimate is best for estimating central quantities, that the raw estimate is best for quantiles, that results for estimating quantiles are insensitive to the precise mixture used, and that theoretical gains may not be realized in practice unless a single tail quantity is to be estimated.

9.4.3 Balanced importance resampling

Importance resampling works best for the extreme quantiles corresponding to small tail probabilities, but is less effective in the centre of a distribution. Balanced resampling, on the other hand, works best in the centre of a distribution. Balanced importance resampling aims to get the best of both worlds by combining the two, as follows.

Suppose that we wish to generate $R$ balanced resamples in which $y_j$ has overall probability $p_j$ of occurring. To do this exactly in general is impossible for finite $nR$, but we can do so approximately by applying the following simple algorithm; a more efficient algorithm is described in Problem 9.14.

Algorithm 9.2 (Balanced importance resampling)

1. Choose $R_1 = nRp_1, \ldots, R_n = nRp_n$, such that $R_1 + \cdots + R_n = nR$.

2. Concatenate $R_1$ copies of $y_1$ with $R_2$ copies of $y_2$ with ... with $R_n$ copies of $y_n$, to form $\mathscr{Y}^*$.

3. Permute the $nR$ elements of $\mathscr{Y}^*$ at random to form $\mathscr{Y}^{**}$, and read off the $R$ balanced importance resamples as sets of $n$ successive elements of $\mathscr{Y}^{**}$. •

A simple way to choose the $R_j$ is to set $R_j' = 1 + [n(R-1)p_j]$, $j = 1,\ldots,n$, where $[\cdot]$ denotes integer part, and to set $R_j = R_j' + 1$ for the $d = nR - (R_1' + \cdots + R_n')$ values of $j$ with the largest values of $nRp_j - R_j'$; we set $R_j = R_j'$ for the rest. This ensures that all the observations are represented in the bootstrap simulation.
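A direct implementation of Algorithm 9.2 with this rounding rule might look like the following sketch (illustrative function and names, not the book's software):

import numpy as np

def balanced_importance_resample(y, p, R, rng):
    # Approximately balanced importance resampling: case j appears
    # close to nRp_j times over the whole simulation.
    n = len(y)
    Rj = 1 + np.floor(n * (R - 1) * p).astype(int)   # R_j' = 1 + [n(R-1)p_j]
    d = n * R - Rj.sum()                             # copies still to assign
    top = np.argsort(Rj - n * R * p)[:d]             # largest nRp_j - R_j'
    Rj[top] += 1
    pool = np.repeat(np.arange(n), Rj)               # R_1 copies of y_1, ...
    rng.shuffle(pool)                                # random permutation
    return y[pool.reshape(R, n)]                     # R rows of n elements

The approximate weights described next are then attached to each row of the output.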

Provided that $R$ is large relative to $n$, individual samples will be approximately independent, and hence the weight associated with a sample having frequencies $(f_1^*,\ldots,f_n^*)$ is approximately

$$w^* = \prod_{j=1}^n (np_j)^{-f_j^*};$$

this does not take account of the fact that sampling is without replacement.

Figure 9.7 shows the theoretical large-sample efficiencies of balanced resampling, importance resampling, and balanced importance resampling for estimating the quantiles of a normal statistic. Ordinary balance gives maximum efficiency of 2.76 at the centre of the distribution, while importance resampling works well in the lower tail but badly in the centre and upper tail of the distribution. Balanced importance resampling dominates both; this is in line with other studies.

[Figure 9.7: Asymptotic efficiencies of balanced importance resampling (solid), importance resampling (large dashes), and balanced resampling (small dashes) for estimating the quantiles of a normal statistic. The dotted horizontal line is at relative efficiency one.]

Example 9.10 (Returns data) In order to assess how well these ideas might work in practice, we again consider setting studentized bootstrap confidence intervals for the slope in the returns example. We performed an experiment like that of Example 9.7, but with the $R = 999$ bootstrap samples generated by balanced resampling, importance resampling, and balanced importance resampling.

Table 9.6 shows the mean squared error for the ordinary bootstrap divided by the mean squared errors of the quantile estimates for these methods, using 50 replicate simulations from each scheme. This slightly different "efficiency" takes into account any bias from using the improved methods of simulation, though in fact the contribution to mean squared error from bias is small. The "true" quantiles are estimated from an ordinary bootstrap of size 100 000.

The first two lines of the table show the efficiency gains due to using the control method when the linear approximation is used as a control variate, with empirical influence values calculated exactly and estimated by regression from the same bootstrap simulation. The results differ little. The next two rows show the gains due to balanced sampling, both without and with the control method, which gives a worthwhile improvement in performance, except in the tail.

Table 9.6 Efficiencies for estimation of quantiles of studentized slope for returns data, relative to ordinary bootstrap resampling.

Method                Distribution   Quantile (%)
                                     1     2.5   5     10    50     90    95    97.5   99
Control (exact)                      1.7   2.7   2.8   4.0   11.2   5.5   2.4   2.6    1.4
Control (approx)                     1.4   2.8   3.2   4.1   11.8   5.1   2.2   2.6    1.3
Balance                              1.0   1.2   1.5   1.4   3.1    2.9   1.7   1.4    0.6
  with control                       1.4   1.8   3.0   2.8   4.4    4.7   2.5   2.2    1.5
Importance            H1             7.8   3.7   3.6   1.8   0.4    3.5   2.3   3.1    5.5
                      H2             4.6   2.9   3.5   1.1   0.1    2.6   3.1   4.3    5.2
                      H3             3.6   3.7   2.0   1.7   0.5    2.4   2.2   2.6    3.6
                      H4             4.3   2.6   2.5   1.8   0.9    1.6   1.6   2.2    2.3
                      H5             2.6   2.1   0.7   0.3   0.4    0.5   0.6   1.6    2.1
Balanced importance   H1             5.0   5.7   4.1   1.9   0.5    2.6   2.2   6.3    4.5
                      H2             4.2   3.4   2.4   1.8   0.2    2.0   3.6   4.2    3.9
                      H3             5.2   4.2   3.8   1.8   0.9    3.0   2.4   4.0    4.0
                      H4             4.3   3.3   3.4   2.2   2.1    2.7   3.7   3.3    4.3
                      H5             3.2   2.8   1.0   0.4   0.9    0.9   1.4   2.1    2.1

The next five lines show the gains due to different versions of importance resampling, in each case using a defensive mixture distribution and the raw quantile estimate. In practice it is unusual to perform a bootstrap simulation with the aim of setting a single confidence interval, and the choice of importance sampling distribution $H$ must balance various potentially conflicting requirements. Our choices were designed to reflect this. We first suppose that the empirical influence values $l_j$ for $t$ are known and can be used for exponential tilting of the linear approximation $t_L^*$ to $t^*$. The first defensive mixture, $H_1$, uses 499 simulations from a distribution tilted to the $\alpha$ quantile of $t_L^*$ and 500 simulations from a distribution tilted to the $1-\alpha$ quantile of $t_L^*$, for $\alpha = 0.05$. The second mixture is like this but with $\alpha = 0.025$.

The third, fourth and fifth distributions are the sort that might be used in practice with a complicated statistic. We first performed an ordinary bootstrap of size $R_0$, which we used to estimate first the empirical influence values $l_j$ by regression and then the tilt values $\eta$ for the 0.05 and 0.95 quantiles. We then performed a further bootstrap of size $(R - R_0)/2$ using each set of tilted probabilities, giving a total of $R$ simulations from three different distributions, one centred and two tilted in opposite directions. We took $R_0 = 199$ and $R_0 = 499$, giving $H_3$ and $H_4$. For $H_5$ we took $R_0 = 499$, but estimated the tilted distributions by frequency smoothing (Section 3.9.2) with bandwidth $\varepsilon = 0.5 v^{1/2}$ at the 0.05 and 0.95 quantiles of $t^*$, where $v^{1/2}$ is the standard error of $t$ estimated from the ordinary bootstrap.

Balance generally improves importance resampling, which is not sensitive to the mixture distribution used. The effect of estimating the empirical influence values is not marked, while frequency smoothing does not perform so well as exponential tilting. Importance resampling estimates of the central quantiles are poor, even when the simulation is balanced. Overall, any of schemes $H_1$-$H_4$ leads to appreciably more accurate estimates of the quantiles usually of interest. ■

9.4.4 Bootstrap recycling

In Section 3.9 we introduced the idea of bootstrapping the bootstrap, both for making bias adjustments to bootstrap calculations and for studying the variation of properties of statistics. Further applications of the idea were described in Chapters 4 and 5. In both parametric and nonparametric applications we need to simulate samples from a series of distributions, themselves obtained from simulations in the nonparametric case. Recycling methods replace many sets of simulated samples by one set of samples and many sets of weights, and have the potential to reduce the computational effort greatly. This is particularly valuable when the statistic of interest is expensive to calculate, for example when it involves a difficult optimization, or when each bootstrap sample is costly to generate, as when using Markov chain Monte Carlo methods (Section 4.2.2).

The basic idea is repeated use of the importance sampling identity (9.14), as follows. Suppose that we are trying to calculate $\mu_k = E\{m(Y) \mid G_k\}$ for a series of distributions $G_1,\ldots,G_K$. The naive Monte Carlo approach is to calculate each value $\mu_k$ independently, simulating $R$ samples $y_1,\ldots,y_R$ from $G_k$ and calculating $\hat\mu_k = R^{-1}\sum m(y_r)$. But for any distribution $H$ whose support includes that of $G_k$ we have

$$E\{m(Y) \mid G_k\} = \int m(y)\, dG_k(y) = \int m(y)\frac{dG_k(y)}{dH(y)}\, dH(y) = E\left\{m(Y)\frac{dG_k(Y)}{dH(Y)} \,\Big|\, H\right\}.$$

We can therefore estimate all $K$ values using one set of samples $y_1,\ldots,y_N$ simulated from $H$, with estimates

$$\hat\mu_k = N^{-1}\sum_{l=1}^N m(y_l)\,\frac{dG_k(y_l)}{dH(y_l)}. \qquad (9.24)$$

In some contexts we may choose $N$ to be much larger than the value $R$ we might use for a single simulation, but less than $KR$. It is important to choose $H$ carefully, and to take account of the fact that the estimates are correlated. Both $N$ and the choice of $H$ depend upon the use being made of the estimates and the form of $m(\cdot)$.
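A vectorized sketch of (9.24) is given below; it is illustrative and assumes multinomial models, so that $dG_k/dH$ is a product over case frequencies (all names are assumptions, not the book's code).

import numpy as np

def recycled_estimates(m_vals, freq, P, q):
    # m_vals: N values m(y_l); freq: N x n frequencies of the resamples
    # drawn from H; P: K x n probabilities defining G_1,...,G_K;
    # q: n probabilities defining H. Returns the K estimates (9.24).
    log_w = freq @ np.log(P.T) - (freq @ np.log(q))[:, None]  # N x K
    return np.mean(m_vals[:, None] * np.exp(log_w), axis=0)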

Example 9.11 (City population data) Consider again estimating the bias and variance functions for the ratio $\theta = t(F)$ of the city population data with $n = 10$. In Example 3.22 we estimated $b(F) = E(T \mid F) - t(F)$ and $v(F) = \operatorname{var}(T \mid F)$ for a range of values of $\theta = t(F)$ using a first-level bootstrap to calculate values of $t^*$ for 999 bootstrap samples $\hat F^*$, and then doing a second-level bootstrap to estimate $b(\hat F^*)$ and $v(\hat F^*)$ for each of those samples. Here the second level of resampling is avoided by using importance re-weighting. At the same time, we retain the smoothing introduced in Example 3.22.

Rather than take each $G_k$ to be one of the bootstrap EDFs $\hat F^*$, we obtain a smooth curve by using smooth distributions $\hat F_\theta^*$ with probabilities $p_j(\theta)$ as defined by (3.39). Recall that the parameter value of $\hat F_\theta^*$ is $t(\hat F_\theta^*) = \theta^*$, say, which will differ slightly from $\theta$. For $H$ we take $\hat F$, the EDF of the original data, on the grounds that it has the correct support and covers the range of values for $y^*$ well: it is not necessarily a good choice. Then we have weights

$$w_r^*(\theta) = \frac{dG_k(y_r^*)}{dH(y_r^*)} = \frac{d\hat F_\theta^*(y_r^*)}{d\hat F(y_r^*)} = \prod_{j=1}^n \left(\frac{p_j(\theta)}{n^{-1}}\right)^{f_{rj}^*},$$

say, where as usual $f_{rj}^*$ is the frequency with which $y_j$ occurs in the $r$th bootstrap sample. We should emphasize that the samples $y_r^*$ drawn from $H$ here replace second-level bootstrap samples.

Consider the bias estimate. The weighted sum $R^{-1}\sum_r (t_r^* - \theta^*)\, w_r^*(\theta)$ is an unbiased estimate of the bias $E^{**}(T^{**} \mid \hat F_\theta^*) - \theta^*$, and we can plot this estimate to see how the bias varies as a function of $\theta^*$ or $\theta$. However, the weighted sum can behave badly if a few of the $w_r^*(\theta)$ are very large, and it is better to use the ratio and regression estimates (9.22) and (9.23).

The top left panel of Figure 9.8 shows raw, ratio, and regression estimates of the bias, based on a single set of $R = 999$ simulations, with the curve obtained from the double bootstrap calculation used in Figure 3.7. For example, the ratio estimate of bias for a particular value of $\theta$ is $\sum_r (t_r^* - \theta^*) w_r^*(\theta) / \sum_r w_r^*(\theta)$, and this is plotted as a function of $\theta^*$. The raw and ratio estimates are rather poor, but the regression estimate agrees fairly well with the double bootstrap curve. The panel also shows the estimated bias from a defensive mixture with 499 ordinary samples mixed with 250 samples tilted to each of the 0.025 and 0.975 quantiles; this is the best estimate of those we consider. The panels below show 20 replicates of these estimated biases. These confirm the impression from the panel above: with ordinary resampling the regression estimator is best, but it is better to use the mixture distribution.

[Figure 9.8: Estimated bias and standard error functions for the city population data ratio. In the top panels the heavy solid lines are the double bootstrap values shown in Figure 3.7, and the others are the raw estimate (large dashes), the ratio estimate (small dashes), the regression estimate (dots), and the regression estimate based on a defensive mixture distribution (light solid). The lower panels show 20 replicates of raw, ratio, and regression estimates from ordinary sampling, and the regression estimate from a defensive mixture (clockwise from upper left) for the panels above.]

The top right panel shows the corresponding estimates for the standard error of $t$. For each of a range of values of $\theta$, we calculate this by estimating the bias and mean squared error

$$b(\hat F_\theta^*) = E^{**}(T^{**} - \theta^* \mid \hat F_\theta^*), \qquad E^{**}\{(T^{**} - \theta^*)^2 \mid \hat F_\theta^*\}$$

by each of the raw, ratio, and regression methods, and plotting the resulting estimate

$$v^{1/2}(\hat F_\theta^*) = \left[E^{**}\{(T^{**} - \theta^*)^2 \mid \hat F_\theta^*\} - \{E^{**}(T^{**} - \theta^* \mid \hat F_\theta^*)\}^2\right]^{1/2}.$$

The conclusions are the same as for the bias estimate. As we saw in Example 3.22, here $T - \theta$ is not a stable quantity, because its mean and variance depend heavily on $\theta$. ■


The results for the raw estimate suggest that recycling can give very variable results, and it must be used with care, as the next example vividly illustrates.

Example 9.12 (Bias adjustment) Consider the problem of adjusting the bootstrap estimate of bias of $T$, discussed in Section 3.9. The adjustment $C$ in equation (3.30) is based on $(RM)^{-1}\sum_{r=1}^R\sum_{m=1}^M (t_{rm}^{**} - t_r^*)$, which uses $M$ samples from each of the $R$ models $\hat F_r^*$ fitted to samples from $\hat F$. The recycling method replaces each average $M^{-1}\sum_{m=1}^M (t_{rm}^{**} - t_r^*)$ by a weighted average of the form (9.24), so that $C$ is estimated by

$$R^{-1}\sum_{r=1}^R\, N^{-1}\sum_{l=1}^N (t_l^{**} - t_r^*)\,\frac{d\hat F_r^*(y_l^{**})}{dH(y_l^{**})}, \qquad (9.25)$$

where $t_l^{**}$ is the value of $T$ for the $l$th sample $y_l^{**}$ of $y_1^{**},\ldots,y_N^{**}$ drawn from the distribution $H$. If we applied recycling only to the first term of $C$, which estimates $E^{**}(T^{**})$, then a different, and as it turns out inferior, estimate would be obtained for $C$.

The support of $H$ must include all $R$ first-level bootstrap samples, so as in the previous example a natural choice is $H = \hat F$, the model fitted to (or the EDF of) the original sample. However, this can give highly unstable results, as one might predict from the leftmost panel in the second row of Figure 9.8. This can be illustrated by considering the case of the parametric model $Y \sim N(\theta, 1)$, with estimate $T = \bar Y$. Here the terms being summed in (9.25) have infinite variance; see Problem 9.15. The difficulty arises from the choice $H = \hat F$, and can be avoided by taking $H$ to be a mixture as described in Section 9.4.2, with at least three components. ■

Instability due to the choice $H = \hat F$ does not occur with all applications of recycling. Indeed applications to bootstrap likelihood (Chapter 10) work well with this choice.

9.5 Saddlepoint Approximation

9.5.1 Basic approximations

Basic ideas

Let $W_1,\ldots,W_n$ be a random sample from a continuous distribution $F$ with cumulant-generating function

$$K(\xi) = \log\left\{\int \exp(\xi w)\, dF(w)\right\}.$$

Suppose that we are interested in the linear combination $U = \sum_j a_j W_j$, and that this has exact PDF and CDF $g(u)$ and $G(u)$. Under suitable conditions, the cumulant-generating function of $U$, which is $\sum_{j=1}^n K(a_j\xi)$, is the


basis of highly accurate approximations to the PDF and CDF of $U$, known as saddlepoint approximations. The saddlepoint approximation to the density of $U$ at $u$ is

$$g_s(u) = \{2\pi K''(\hat\xi)\}^{-1/2}\exp\{K(\hat\xi) - \hat\xi u\}, \qquad (9.26)$$

where the saddlepoint $\hat\xi$ satisfies the saddlepoint equation

$$K'(\hat\xi) = u, \qquad (9.27)$$

and is therefore a function of $u$. Here $K'$ and $K''$ are respectively the first and second derivatives of $K$ with respect to $\xi$. A simple approximation to the CDF of $U$, $\Pr(U \le u)$, is

$$G_s(u) = \Phi\left\{w + \frac{1}{w}\log\left(\frac{v}{w}\right)\right\}, \qquad (9.28)$$

where $\Phi(\cdot)$ denotes the standard normal CDF, and

$$w = \operatorname{sign}(\hat\xi)\left[2\{\hat\xi u - K(\hat\xi)\}\right]^{1/2}, \qquad v = \hat\xi\{K''(\hat\xi)\}^{1/2}$$

are both functions of $u$. An alternative to (9.28) is the Lugannani-Rice formula

$$G_s(u) = \Phi(w) + \phi(w)\left(\frac{1}{w} - \frac{1}{v}\right), \qquad (9.29)$$

but in practice the difference between them is small. When $\hat\xi = 0$ the CDF approximation is more complicated and we do not give it here. The approximations are constructed by numerical solution of the saddlepoint equation to obtain the value of $\hat\xi$ for each value of $u$ of interest, from which the approximate PDF and CDF are readily calculated.

Formulae such as (9.26) and (9.28) can provide remarkably accurate approximations in many statistical problems. In fact,

$$g(u) = g_s(u)\{1 + O(n^{-1})\}, \qquad G(u) = G_s(u)\{1 + O(n^{-3/2})\},$$

for values of $u$ such that $|w| \le c$ for some positive $c$; the error in the CDF approximation rises to $O(n^{-1})$ when $u$ is such that $|w| \le cn^{1/2}$. A key feature is that the error is relative, so that the ratio of the true density of $U$ to its saddlepoint approximation is bounded over the likely range of $u$. A consequence is that unlike other analytic approximations to densities and tail probabilities, (9.26), (9.28) and (9.29) are very accurate far into the tails of the density of $U$. If there is doubt about the accuracy of (9.28) and (9.29), $G_s$ may be calculated by numerical integration of $g_s$.

The more complex formulae that are used for conditional and marginal density and distribution functions are given in Sections 9.5.2 and 9.5.3.


Application to resampling

In the context of resampling, suppose that we are interested in the distribution of the average of a sample from $y_1,\ldots,y_n$, where $y_j$ is sampled with probability $p_j$, $j = 1,\ldots,n$. Often, but not always, $p_j = n^{-1}$. We can write the average as $U^* = n^{-1}\sum f_j^* y_j$, where as usual $(f_1^*,\ldots,f_n^*)$ has a joint multinomial distribution with denominator $n$. Then $U^*$ has cumulant-generating function

$$K(\xi) = n\log\left\{\sum_{j=1}^n p_j\exp(\xi a_j)\right\}, \qquad (9.30)$$

where $a_j = y_j/n$. The function (9.30) can be used in (9.26) and (9.28) to give non-random approximations to the PDF and CDF of $U^*$. Unlike most of the methods described in this book, the error in saddlepoint approximations arises not from simulation variability, but from deterministic numerical error in using $g_s$ and $G_s$ rather than the exact density and distribution function.

In principle, of course, a nonparametric bootstrap statistic is discrete and so the density does not exist, but as we saw in Section 2.3.2, $U^*$ typically has so many possible values that we can think of it as continuous away from the extreme tails of its distribution. Continuity corrections can sometimes be applied, but they make little difference in bootstrap applications.

When it is necessary to approximate the entire distribution of $U^*$, we calculate the values of $G_s(u)$ for $m$ values of $u$ equally spaced between $\min a_j$ and $\max a_j$ and use a spline smoother to interpolate between the corresponding values of $\Phi^{-1}\{G_s(u)\}$. Quantiles and cumulative probabilities for $U^*$ can be read off the fitted curve. Experience suggests that $m = 50$ is usually ample.
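All of this is easy to program. The sketch below is a minimal Python illustration, not the book's software: it implements the cumulant-generating function (9.30) and the CDF approximation (9.28), solving the saddlepoint equation with scipy.optimize.brentq; the bracketing interval and the crude fallback at $\hat\xi = 0$ are assumptions. For the resampled average, take a = y/n and p_j = 1/n.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def cgf(xi, a, p, n):                      # K(xi), equation (9.30)
    return n * np.log(np.sum(p * np.exp(xi * a)))

def cgf1(xi, a, p, n):                     # K'(xi)
    e = p * np.exp(xi * a)
    return n * np.sum(a * e) / np.sum(e)

def cgf2(xi, a, p, n):                     # K''(xi)
    e = p * np.exp(xi * a)
    m1 = np.sum(a * e) / np.sum(e)
    return n * (np.sum(a * a * e) / np.sum(e) - m1 * m1)

def saddle_cdf(u, a, p, bracket=(-100.0, 100.0)):
    # Gs(u) from (9.28); solve the saddlepoint equation K'(xi) = u.
    n = len(a)
    xi = brentq(lambda x: cgf1(x, a, p, n) - u, *bracket)
    if abs(xi) < 1e-6:                     # (9.28) degenerates at xi = 0
        return 0.5                         # crude value near the mean
    w = np.sign(xi) * np.sqrt(2.0 * (xi * u - cgf(xi, a, p, n)))
    v = xi * np.sqrt(cgf2(xi, a, p, n))
    return norm.cdf(w + np.log(v / w) / w)

Quantiles then follow as just described: evaluate saddle_cdf at, say, 50 grid points, transform with norm.ppf, and interpolate the results with a spline (for example scipy.interpolate.CubicSpline) before reading off the fitted curve.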

Example 9.13 (Linear approximation) A simple application of these ideas is to the linear approximation $t_L^*$ for a bootstrap statistic $t^*$, as was used in Example 9.7. We write $T_L^* = t + n^{-1}\sum_j f_j^* l_j$, where as usual $f_j^*$ is the frequency of the $j$th case in the bootstrap sample and $l_j$ is the $j$th empirical influence value. The cumulant-generating function of $T_L^* - t$ is (9.30) with $a_j = l_j/n$ and


[Figure 9.9: Comparison of the saddlepoint approximation (solid) to the PDF of a linear statistic with results from a bootstrap simulation with R = 49 999, for samples of sizes 10 (left) and 15 (right).]

$p_j = n^{-1}$, so the saddlepoint equation for approximation to the PDF and CDF of $T_L^*$ at $t_L^*$ is

$$\frac{\sum_{j=1}^n l_j\exp(\hat\xi l_j/n)}{\sum_{j=1}^n \exp(\hat\xi l_j/n)} = t_L^* - t,$$

whose solution is $\hat\xi$.

For a numerical example, we take the variance $t = n^{-1}\sum (y_j - \bar y)^2$ for exponential samples of sizes 10 and 15; the empirical influence values are $l_j = (y_j - \bar y)^2 - t$. Figure 9.9 compares the saddlepoint approximations to the PDFs of $t_L^*$ with the histogram from bootstrap calculations with $R = 49\,999$. The saddlepoint approximation accurately reflects the skewed lower tail of the bootstrap distribution, whereas a normal approximation would not do so. However, the saddlepoint approximation does not pick up the multimodality of the density for $n = 10$, which arises for the same reason as in the right panels of Figure 2.9: the bulk of the variability of $T_L^*$ is due to a few observations with large values of $|l_j|$, while those for which $|l_j|$ is small merely add noise. The figure suggests that with so small a sample the CDF approximation will be more useful. This is borne out by Table 9.7, which compares the simulation quantiles and quantiles obtained by fitting a spline to 50 saddlepoint CDF values.

Table 9.7 Comparison of saddlepoint approximation to bootstrap α quantiles (×10⁻²) of a linear statistic for samples of sizes 10 and 15, with results from R = 49 999 simulations.

         α (%)     0.1    0.5    1      2.5    5      95     97.5   99     99.5    99.9
n = 10   Sim'n     7.8    10.9   12.8   15.4   18.1   78.5   85.1   96.0   102.1   116.4
         S'point   7.6    10.8   12.5   15.2   17.8   78.1   85.9   95.3   101.9   115.8
n = 15   Sim'n     11.8   13.6   14.5   16.0   17.4   37.4   39.7   42.3   44.4    48.2
         S'point   11.7   13.5   14.4   15.9   17.4   37.4   39.7   42.4   44.3    48.2

In more complex applications the empirical influence values $l_j$ would usually be estimated by numerical differentiation or by regression, as outlined in Sections 2.7.2, 2.7.4 and 3.2.1. ■

Example 9.14 (Tuna density estimate) We return to the double bootstrap used in Example 5.13 to calibrate confidence intervals based on a kernel density estimate. This involved estimating the probabilities

$$\Pr^{**}(T^{**} < 2t^* - t \mid \hat F^*), \qquad (9.31)$$

where $t$ is the variance-stabilized estimate of the quantity of interest. The double bootstrap version of $t$ can be written as $t^{**} = (\sum_j f_j^{**} a_j)^{1/2}$, where $a_j = (nh)^{-1}\{\phi(-y_j/h) + \phi(y_j/h)\}$ and $f_j^{**}$ is the frequency with which $y_j$ appears in a second-level bootstrap sample. Conditional on a first-level bootstrap sample $\hat F^*$ with frequencies $f_1^*,\ldots,f_n^*$, the $f_j^{**}$ have a joint multinomial distribution with mean vector $(f_1^*,\ldots,f_n^*)$ and denominator $n$.

Now if $2t^* - t < 0$, the probability (9.31) equals zero, because $T^{**}$ is positive. If $2t^* - t > 0$, the event $T^{**} < 2t^* - t$ is equivalent to $\sum_j f_j^{**} a_j < (2t^* - t)^2$. Thus, conditional on $\hat F^*$, if $2t^* - t > 0$ we can obtain a saddlepoint approximation to (9.31) by applying (9.28) and (9.30) with $u = (2t^* - t)^2$ and $p_j = f_j^*/n$.

Including programming, it took about ten minutes to calculate 3000 values of (9.31) by saddlepoint approximation; direct simulation with 250 samples at the second level took about four hours on the same workstation. ■
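The calculation in this example reuses the machinery already sketched after (9.30): conditional on the first-level frequencies, the inner probability (9.31) is a single call to the CDF approximation. A sketch, assuming the saddle_cdf function from the earlier sketch is in scope (names and structure are illustrative):

import numpy as np

def inner_prob(freq1, a, t, t_star):
    # freq1: first-level frequencies f*_j; a: the a_j of the kernel
    # estimate; t, t_star: original and first-level statistic values.
    c = 2.0 * t_star - t
    if c < 0:
        return 0.0                      # T** is positive, so (9.31) is 0
    p = freq1 / freq1.sum()             # p_j = f*_j / n
    return saddle_cdf(c * c, a, p)      # Gs(u) with u = (2t* - t)^2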

Estimating functions

One simple extension of the basic approximations is to statistics determined by monotonic estimating functions. Suppose that the value of a scalar bootstrap statistic $T^*$ based on sampling from $y_1,\ldots,y_n$ is the solution to the estimating equation

$$nU^*(t) = \sum_{j=1}^n a(t; y_j)\, f_j^* = 0, \qquad (9.32)$$

where for each $y$ the function $a(\theta; y)$ is decreasing in $\theta$. Then $T^* \le t$ if and only if $U^*(t) \le 0$. Hence $\Pr^*(T^* \le t)$ may be estimated by $G_s(0)$ applied with cumulant-generating function (9.30) in which $a_j = a(t; y_j)$. A saddlepoint approximation to the density of $T^*$ is

$$g_s(t) = \left|\frac{\dot K(\hat\xi)}{\hat\xi}\right|\{2\pi K''(\hat\xi)\}^{-1/2}\exp\{K(\hat\xi)\}, \qquad (9.33)$$

where $\dot K(\xi) = \partial K/\partial t$, and $\hat\xi$ solves the equation $K'(\hat\xi) = 0$. The first term on the right in (9.33) corresponds to the Jacobian for transformation from the density of $U^*$ to that of $T^*$.
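The CDF approximation for a monotone estimating function therefore needs no new code: with $a_j = a(t; y_j)$ and $u = 0$, $G_s(0)$ gives $\Pr^*(T^* \le t)$. A sketch for a ratio statistic of the kind met in Example 9.16 below, again assuming the saddle_cdf function from the earlier sketch (illustrative only; at $t$ equal to the observed ratio the saddlepoint is zero and the code falls back to 0.5):

import numpy as np

def ratio_cdf(t, x, z):
    # Pr*(T* <= t) for the ratio solving sum (x_j - t z_j) f_j* = 0,
    # via (9.30) with a_j = a(t; y_j) = x_j - t z_j and u = 0. Requires
    # t strictly inside the range of the case-wise ratios x_j / z_j.
    a = x - t * z
    p = np.full(len(x), 1.0 / len(x))    # ordinary resampling, p_j = 1/n
    return saddle_cdf(0.0, a, p)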


Example 9.15 (Maize data) Problem 4.7 contains data from a paired comparison experiment performed by Darwin on the growth of maize plants. The data are reduced to 15 differences $y_1,\ldots,y_{15}$ between the heights (in eighths of an inch) of cross-fertilized and self-fertilized plants. When two large negative values are excluded, the differences have average $\bar y = 33$ and look close to normal, but when those values are included the average drops to 20.9.

When data may have been contaminated by outliers, robust M-estimates are useful. If we assume that $Y = \theta + \sigma\varepsilon$, where the distribution of $\varepsilon$ is symmetric about zero but may have long tails, an estimate of location $\theta$ can be found by solving the equation

$$\sum_{j=1}^n \psi\left(\frac{y_j - \theta}{\sigma}\right) = 0, \qquad (9.34)$$

where $\psi(\varepsilon)$ is designed to downweight large values of $\varepsilon$. A common choice is the Huber estimate determined by

$$\psi(\varepsilon) = \begin{cases} -c, & \varepsilon < -c,\\ \varepsilon, & |\varepsilon| \le c,\\ c, & \varepsilon > c. \end{cases} \qquad (9.35)$$

With $c = \infty$ this gives $\psi(\varepsilon) = \varepsilon$ and leads to the normal-theory estimate $\hat\theta = \bar y$, but a smaller choice of $c$ will give better behaviour when there are outliers.

With $c = 1.345$ and $\sigma$ fixed at the median absolute deviation $s$ of the data, we obtain $\hat\theta = 26.45$. How variable is this? We can get some idea by looking at replicates of $\hat\theta$ based on bootstrap samples $y_1^*,\ldots,y_n^*$. A bootstrap value $\hat\theta^*$ solves

$$\sum_{j=1}^n \psi\left(\frac{y_j^* - \hat\theta^*}{s}\right) = 0,$$

so the saddlepoint approximation to the PDF of bootstrap values is obtained starting from (9.32) with $a(t; y_j) = \psi\{(y_j - t)/s\}$. The left panel of Figure 9.10 compares the saddlepoint approximation with the empirical distribution of $\hat\theta^*$, and with the approximate PDF of the bootstrapped average. The saddlepoint approximation to $\hat\theta^*$ seems quite accurate, while the PDF of the average is wider and shifted to the left.

The assumption of symmetry underlies the use of the estimator $\hat\theta$, because the parameter $\theta$ must be the same for all possible choices of $c$. The discussion in Section 3.3 and Example 3.26 implies that our resampling scheme should take this into account by enlarging the resampling set to $y_1,\ldots,y_n, \tilde\theta - (y_1 - \tilde\theta),\ldots,\tilde\theta - (y_n - \tilde\theta)$, for some very robust estimate $\tilde\theta$ of $\theta$; we take $\tilde\theta$ to be the median. The cumulant-generating function required when taking samples of size $n$ from this set is

$$K(\xi) = n\log\left[(2n)^{-1}\sum_{j=1}^n\left(\exp\{\xi\, a(t; y_j)\} + \exp\{\xi\, a(t; 2\tilde\theta - y_j)\}\right)\right].$$

[Figure 9.10: Comparison of the saddlepoint approximation to the PDF of a robust M-estimate applied to the maize data (solid), with results from a bootstrap simulation with R = 50 000. The heavy curve is the saddlepoint approximation to the PDF of the average. The left panel shows results from resampling the data, and the right shows results from a symmetrized bootstrap.]

The right panel of Figure 9.10 compares saddlepoint and Monte Carlo approximations to the PDF of $\hat\theta^*$ under this symmetrized resampling scheme; the PDF of the average is shown also. All are symmetric about $\tilde\theta$.

One difficulty here is that we might prefer to approximate the PDF of $\hat\theta^*$ when $s$ is replaced by its bootstrap version $s^*$, and this cannot be done in the current framework. More fundamentally, the distribution of interest will often be for a quantity such as a studentized form of $\hat\theta^*$ derived from $\hat\theta^*$, $s^*$, and perhaps other statistics, necessitating the more sophisticated approximations outlined in Section 9.5.3. ■

9.5.2 Conditional approximation

There are numerous ways to extend the discussion above. One of the most straightforward is to situations where $U$ is a $q\times 1$ vector which is a linear function of independent variables $W_1,\ldots,W_n$ with cumulant-generating functions $K_j(\xi)$, $j = 1,\ldots,n$. That is, $U = A^T W$, where $A$ is an $n\times q$ matrix with rows $a_j^T$.

The joint cumulant-generating function of $U$ is

$$K(\xi) = \log E\{\exp(\xi^T A^T W)\} = \sum_{j=1}^n K_j(a_j^T\xi),$$


and the saddlepoint approximation to the density of $U$ at $u$ is

$$g_s(u) = (2\pi)^{-q/2}\, |K''(\hat\xi)|^{-1/2}\exp\{K(\hat\xi) - \hat\xi^T u\}, \qquad (9.36)$$

where $\hat\xi$ satisfies the $q\times 1$ system of equations $\partial K(\xi)/\partial\xi = u$, and $K''(\xi) = \partial^2 K(\xi)/\partial\xi\,\partial\xi^T$ is the $q\times q$ matrix of second derivatives of $K$; $|\cdot|$ denotes determinant.

Now suppose that $U$ is partitioned into $U_1$ and $U_2$, that is, $U^T = (U_1^T, U_2^T)$, where $U_1$ and $U_2$ have dimension $q_1\times 1$ and $(q-q_1)\times 1$ respectively. Note that $U_2 = A_2^T W$, where $A_2$ consists of the last $q - q_1$ columns of $A$. The cumulant-generating function of $U_2$ is simply $K(0, \xi_2)$, where $\xi^T = (\xi_1^T, \xi_2^T)$ has been partitioned conformably with $U$, so the saddlepoint approximation to the marginal density of $U_2$ is

$$g_s(u_2) = (2\pi)^{-(q-q_1)/2}\, |K_{22}''(0,\hat\xi_{20})|^{-1/2}\exp\{K(0,\hat\xi_{20}) - \hat\xi_{20}^T u_2\}, \qquad (9.37)$$

where $\hat\xi_{20}$ satisfies the $(q-q_1)\times 1$ system of equations $\partial K(0,\xi_2)/\partial\xi_2 = u_2$, and $K_{22}''$ is the $(q-q_1)\times(q-q_1)$ corner of $K''$ corresponding to $U_2$.

Division of (9.36) by (9.37) gives a double saddlepoint approximation to the conditional density of $U_1$ at $u_1$ given that $U_2 = u_2$. When $U_1$ is scalar, i.e. $q_1 = 1$, the approximate conditional CDF is again (9.28), but with

$$w = \operatorname{sign}(\hat\xi_1)\left(2\left[\{K(0,\hat\xi_{20}) - \hat\xi_{20}^T u_2\} - \{K(\hat\xi) - \hat\xi^T u\}\right]\right)^{1/2}, \qquad v = \hat\xi_1\left(\frac{|K''(\hat\xi)|}{|K_{22}''(0,\hat\xi_{20})|}\right)^{1/2}.$$

Example 9.16 (City population data) A simple bootstrap application is to obtain the distribution of the ratio $T^*$ in bootstrap sampling from the city population data pairs with $n = 10$. In order to avoid conflicts of notation we set $y_j = (z_j, x_j)$, so that $T^*$ is the solution to the equation $\sum_j (x_j - t z_j) f_j^* = 0$.

For this we take the $W_j$ to be independent Poisson random variables with equal means $\mu$, so $K_j(\xi) = \mu(e^\xi - 1)$. We set

$$U = \begin{pmatrix}\sum_j (x_j - t z_j) W_j\\ \sum_j W_j\end{pmatrix}, \qquad u = \begin{pmatrix}0\\ n\end{pmatrix}, \qquad a_j = \begin{pmatrix}x_j - t z_j\\ 1\end{pmatrix}.$$

Now $T^* \le t$ if and only if $\sum_j (x_j - t z_j)W_j \le 0$, where $W_j$ is the number of times $(z_j, x_j)$ is included in the sample. But the relation between the Poisson and multinomial distributions (Problem 9.19) implies that the joint conditional distribution of $(W_1,\ldots,W_n)$ given that $\sum W_j = n$ is the same as that of the multinomial frequency vector $(f_1^*,\ldots,f_n^*)$ in ordinary bootstrap sampling from a sample of size $n$. Thus the probability that $\sum_j (x_j - t z_j)W_j \le 0$ given that $\sum W_j = n$ is just the probability that $T^* \le t$ in ordinary bootstrap sampling from the data pairs.


In this situation it is of course more direct to use the estimating function method with $a(t; y_j) = x_j - t z_j$ and the simpler approximations (9.28) and (9.33). Then the Jacobian term in (9.33) is $\left|\sum_j z_j\exp\{\hat\xi(x_j - t z_j)\}\big/\sum_j \exp\{\hat\xi(x_j - t z_j)\}\right|$.

Another application is to conditional distributions for $T^*$. Suppose that the population pairs are related by $x_j = z_j\theta + z_j^{1/2}\varepsilon_j$, where the $\varepsilon_j$ are a random sample from a distribution with mean zero. Then conditional on the $z_j$, the ratio $\sum x_j/\sum z_j$ has variance proportional to $(\sum z_j)^{-1}$. In some circumstances we might want to obtain an approximation to the conditional distribution of $T^*$ given that $\sum Z_j^* = \sum z_j$. In this case we can use the approach outlined in the previous paragraph, but with two conditioning variables: we take the $W_j$ to be independent Poisson variables with equal means, and set

$$U = \begin{pmatrix}\sum_j (x_j - t z_j) W_j\\ \sum_j z_j W_j\\ \sum_j W_j\end{pmatrix}, \qquad u = \begin{pmatrix}0\\ \sum_j z_j\\ n\end{pmatrix}, \qquad a_j = \begin{pmatrix}x_j - t z_j\\ z_j\\ 1\end{pmatrix}.$$

A third application is to approximating the distribution of the ratio when a sample of size $m = 10$ is taken without replacement from the $n = 49$ data pairs. Again $T^* \le t$ is equivalent to the event $\sum_j (x_j - t z_j)W_j \le 0$, but now $W_j$ indicates that $(z_j, x_j)$ is included in the $m$ cities chosen; we want to impose the condition $\sum W_j = m$. We take the $W_j$ to be binary variables with equal success probabilities $0 < \pi < 1$, giving $K_j(\xi) = \log(1 - \pi + \pi e^\xi)$, with $\pi$ any value. We then apply the double saddlepoint approximation with

$$U = \begin{pmatrix}\sum_j (x_j - t z_j) W_j\\ \sum_j W_j\end{pmatrix}, \qquad u = \begin{pmatrix}0\\ m\end{pmatrix}, \qquad a_j = \begin{pmatrix}x_j - t z_j\\ 1\end{pmatrix}.$$

Table 9.8 compares the quantiles of these saddlepoint distributions with Monte Carlo approximations based on 100 000 samples. The general agreement is excellent in each case. ■

Table 9.8 Comparison of saddlepoint and simulation quantile approximations for the ratio when sampling from the city population data. The statistics are the ratio Σx_j/Σz_j with n = 10, the ratio conditional on Σz_j = 640 with n = 10, and the ratio in samples of size 10 taken without replacement from the full data. The simulation results are based on 100 000 bootstrap samples, with logistic regression used to estimate the simulated conditional probabilities, from which the quantiles were obtained by a spline fit.

         Unconditional        Conditional          Without replacement
α        S'point   Sim'n      S'point   Sim'n      S'point   Sim'n
0.001    1.150     1.149      1.216     1.215      1.070     1.070
0.005    1.191     1.192      1.236     1.237      1.092     1.092
0.01     1.214     1.215      1.248     1.247      1.104     1.103
0.025    1.251     1.252      1.273     1.269      1.122     1.122
0.05     1.286     1.286      1.301     1.291      1.139     1.138
0.1      1.329     1.329      1.340     1.337      1.158     1.158
0.9      1.834     1.833      1.679     1.679      1.348     1.348
0.95     1.967     1.967      1.732     1.736      1.392     1.392
0.975    2.107     2.104      1.777     1.777      1.436     1.435
0.99     2.303     2.296      1.829     1.833      1.493     1.495
0.995    2.461     2.445      1.865     1.863      1.537     1.540
0.999    2.857     2.802      1.938     1.936      1.636     1.635

A further application is to permutation distributions.

Example 9.17 (Correlation coefficient) In Example 4.9 we applied a permutation test to the sample correlation $t$ between variables $x$ and $z$ based on pairs $(x_1,z_1),\ldots,(x_n,z_n)$. For this statistic and test, the event $T \ge t$ is equivalent to $\sum_j x_j z_{\xi(j)} \ge \sum_j x_j z_j$, where $\xi(\cdot)$ is a permutation of the integers $1,\ldots,n$.

An alternative formulation is as follows. Let $W_{ij}$, $i,j = 1,\ldots,n$, denote independent binary variables with equal success probabilities $0 < \pi < 1$, for any $\pi$. Then consider the distribution of $U_1 = \sum_{i,j} x_i z_j W_{ij}$ conditional on

$$U_2 = \left(\sum_j W_{1j},\ldots,\sum_j W_{nj}, \sum_i W_{i1},\ldots,\sum_i W_{i,n-1}\right)^T = u_2,$$

where $u_2$ is a vector of ones of length $2n-1$. Notice that the condition $\sum_i W_{in} = 1$ is entailed by the other conditions and so is redundant. Each value of $x_i$ and each value of $z_j$ appears precisely once in the sum $U_1$, with equal probabilities, and hence the conditional distribution of $U_1$ given $U_2 = u_2$ is equivalent to the permutation distribution of $T$. Here $m = n^2$, $q = 2n$, and $q_1 = 1$.

Our limited numerical experience suggests that in this example the saddlepoint approximation can be inaccurate if the large number of constraints results in a conditional distribution on only a few values. ■

9.5.3 Marginal approximation

The approximate distribution and density functions described so far are useful in contexts such as testing hypotheses, but they are harder to apply to such problems as setting studentized bootstrap confidence intervals. Although (9.26) and (9.28) can be extended to some types of complicated statistics, we merely outline the results.

Approximate cumulant-generating function

The simplest approach is direct approximation to the cumulant-generating function of the bootstrap statistic of interest, $T^*$. The key idea is to replace the cumulant-generating function $K(\xi)$ by the first four terms of its expansion in powers of $\xi$,

$$K_C(\xi) = \xi\kappa_1 + \tfrac{1}{2}\xi^2\kappa_2 + \tfrac{1}{6}\xi^3\kappa_3 + \tfrac{1}{24}\xi^4\kappa_4, \qquad (9.38)$$

where $\kappa_i$ is the $i$th cumulant of $T^*$. The exact cumulants are usually unavailable, so we replace them with the cumulants of the cubic approximation to $T^*$ given by

$$T_C^* = t + n^{-1}\sum_{i=1}^n f_i^* l_i + \tfrac{1}{2} n^{-2}\sum_{i,j=1}^n f_i^* f_j^* q_{ij} + \tfrac{1}{6} n^{-3}\sum_{i,j,k=1}^n f_i^* f_j^* f_k^* c_{ijk},$$


where $t$ is the original value of the statistic, and $l_j$, $q_{ij}$ and $c_{ijk}$ are the empirical linear, quadratic and cubic influence values; see also (9.6). To the order required the approximate cumulants are

[expressions for $\kappa_{C,1},\ldots,\kappa_{C,4}$ in terms of the $l_i$, $q_{ij}$ and $c_{ijk}$, involving sums of terms such as $\tfrac{1}{2}\bigl(n^{-3}\sum_{i,j,k} l_i l_j q_{jk}\bigr)$, not recoverable from this copy]

where the quantities in parentheses are of order one.

We get an approximate cumulant-generating function $K_C(\xi)$ by substituting the $\kappa_{C,i}$ into (9.38), and then use the standard approximations (9.26) and (9.28) with $K(\xi)$ replaced by $K_C(\xi)$. Detailed consideration establishes that this preserves the usual asymptotic accuracy of the saddlepoint approximations. From a practical point of view it may be better to sacrifice some theoretical accuracy but reduce the computational burden by dropping from $\kappa_{C,2}$ and $\kappa_{C,4}$ the terms involving the $c_{ijk}$; with this modification both PDF and CDF approximations have error of order $n^{-1}$.

In principle this approach is fairly simple, but in applications there is no guarantee that K c(£) is close to the true cum ulant-generating function o f T ' except for small I t may be necessary to modify K c (£) to avoid multiple roots to the saddlepoint equation or if Kc,4 < 0, for then K c(£) cannot be convex. In these circumstances we can modify K c (£) to ensure that the cubic and quartic term s do not cause trouble, for example replacing it by

where b is chosen to ensure tha t the second derivative o f Kc,b(£) with respect to £ is positive; Kc,b(£) is then convex. A suitable value is

and this can be found by numerical search.

Empirical Edgeworth expansion

The approxim ate cum ulants can also be used in series expansions for the density and distribution o f T*. The Edgeworth expansion for the C D F o f

+ !2 (n 3 Y .ijk l‘lm qjk) +4 (n 3 Y ,ijk W ^ w ) } -

K c M ) = f rc . i + { £ 2*c ,2 + (|<^3Kc,3 + J4 %4kc,4) exp ( - \ n b 2£2KC,2) ,

b = max [5, in f {a : K'ca(£) > 0, —oo < < oo}],

Page 489: Bootstrap Methods and Their Application

9.5 ■ Saddlepoint Approximation 477

Here H(x) is the Heaviside function, jumping from 0 to 1 at x = 0.

Z q = (T* ~ kc,i ) / k£ is

P r\ Z ' C < z) = <D(z) + {pi(z) + p2(z)}(t>(z) + Op(n~3/2), (9.39)

where

Pi (z ) = -5'fc,3'Cc,2/2(z 2 - 1).

p2{z) = - z { ^ K C,4Kcl( z 2 - 3) + j 2 KC,3Kc U z4 ~ ^ + 15)} •

D ifferentiation o f (9.39) gives an approxim ate density for Z'c and hence forT*. However, experience suggests tha t the saddlepoint approxim ations (9.28)and (9.29) are usually preferable if they can be obtained, prim arily because (9.39) results in less accurate tail probability estim ates: its error is absolute ra ther than relative. Further drawbacks are that (9.39) need not increase with z, and that the density approxim ation may become negative.

Derivation o f the influence values tha t contribute to kc,i , . . . , Kc,4 can be tedious.

Example 9.18 (Studentized statistic) A statistic T = t (F) studentized using the nonparam etric delta m ethod variance estimate obtained from its linear influence values L t(y ,F ) may be w ritten as Z = nx 2 W, where

t(F) - t(F)W = w(F,F) =

{ / L t( y ,F ) 2 dF(y)}1/ 2 ’

with F the E D F o f the data. The corresponding bootstrap statistic is w (F \ F), where F* is the E D F corresponding to a bootstrap sample. For economy of notation below we write

v = v(F) = J L t(y;F) dF(y), L w(yi) = M j ^ F ) , Q A y u y i ) = Q A y u y n F ) ,

and so forth.To obtain the linear, quadratic and cubic influence values for w(G, F) at

G = F, we replace G(y) with

(1 - ei - s2 - e3)F(y) + £1 H(y - j>i) + e2H (y - y 2) + £3H (y - y3),

differentiate with respect to £1, s2, and £3, and set £1 = £2 = £3 = 0. The empirical influence values for W a t F are then obtained by replacing F with F. In terms o f the influence values for t and v the result o f this calculation is

L w(yi) = v~ 1/2L t(yi),

Q v i y u y i ) = v~ll2Qt{yx, y 2) - ^v~3/ 2L t{yi)Lv{y2 )[2],

Cw{y\ ,y2, y i ) = v ^ l/2 Ct(yu y 2, y 3)

- \ v ~ V 2 { 6 f0 'i,j '2 ) l‘.,0'3) + Qv(yuy2)Lt(yi)} [3]+ 1V~5/2L[ (y 1 )LV (y2 )LV (y3) [3],

Page 490: Bootstrap Methods and Their Application

478 9 • Improved Calculation

where [fc] after a term indicates tha t it should be summed over the perm utations o f its y^s tha t give the k distinct quantities in the sum. Thus for example

L t(yi )Lv(y2)Lv(y3)[3] = L t(y i)Lv(y2 )Lv(y}) + L t(yi )Lv(yi )Lv(y2)

+ L t{y2)Lv(yi )Lv(yi).

The influence values for z involve linear, quadratic, and cubic influence values for t, and linear and quadratic influence values for v, the latter given by

L v(yi) = L t{yi)2 — J L t(x )2 dF(x) + 2 J L t(x)Qt( x , y l )dF(x),

lQv(y i ,y2) = L t(y i )Qt(yuy2 )l2 ] - L t{yi)Lt(y2)

~ J { Q t ( x , y i ) + Qt(x ,y 2) }Lt(x)dF(x)

+ J {Qt(x,y2)Qt(x,yi) + L t(x)Ct{x , yu y2)} dF(x).

The simplest example is the average t(F) = f xd F (x ) = y o f a sample of values yu- - - ,y„ from F. Then Lt(j/,-) = y t - y , Qtiyuyj) = Ct(yi ,yj ,yk) = 0, the expressions above simplify greatly, and the required influence quantities are

li = Lw(yi; F) = v~x,2{yi - y),9ij = Q U y i , y j i h = - i v ~ i/2 ( y i - y ) { ( y j - y ) 2 - v } [ 2 ],

Cijk = Cw(yi ,yj , yk ;F) = 3v~i/2(yi - y)(yj - y)(yk - y)

+\v~5n{yi - y) {(yj - y)2 - {(yk - y)2 - [3],where v = n-1 J2(yi ~ y)2- The influence quantities for Z are obtained from those for W by m ultiplication by n1/2. A numerical illustration o f the use of the corresponding approxim ate cum ulant-generating function Kc(£) is given in Example 9.19. ■

Integration approach

A nother approach involves extending the estim ating function approxim ation to the m ultivariate case, and then approxim ating the m arginal distribution of the statistic o f interest. To see how, suppose that the quantity T o f interest is a scalar, and tha t T and S = ( S i , . . . , S q- i) r are determ ined by a q x 1 estim ating function

nU(t,s) = ^ a ( t , s i , . . . , s 9- l ;Yj ).

J=i

Then the bootstrap quantities T* and S ’ are the solutions o f the equationsn

U'(t ,s) = J 2 a j ( t , s ) f j = 0 , (9.40)j=i

Page 491: Bootstrap Methods and Their Application

9.5 • Saddlepoint Approximation 479

where a; (t,s) = a(t ,s;yj) and the frequencies ( / j , . . . , / * ) have a m ultinom ial distribution with denom inator n and m ean vector n ( p \ , - typically pj = n_1. We assume that there is a unique solution (t*,s*) to (9.40) for each possible set o f /* , and seek saddlepoint approxim ations to the m arginal P D F and C D F o f r .

For fixed t and s, the cum ulant-generating function o f U" is

K(£ ; t , s ) = n log Y 2 PJex P l ^ a / M ) } ;'=i

(9.41)

and the jo in t density o f the U * at u is given by (9.36). The Jacobian needed to obtain the jo in t density o f T* and S ' from that o f U' is hard to obtain exactly, bu t can be approxim ated by

j=i

dcij(t,s) 8aj(t ,s) ' dt ’ dsT

where

, , . p j e x p { Z Taj(t,s)}1 ’ Y l= iP ke x p { £ ,Tak{ t , s) } ’

as usual for r x 1 and c x 1 vectors a and s with com ponents at and sj, we write 8 a / d s T for the r x c array whose (i,j) element is dat/dsj. The Jacobian J { t , s \ £,) reduces to the Jacobian term in (9.33) when s is not present. Thus the saddlepoint approxim ation to the density o f (T*,S*) at (t,s) is

J ( t , s ; l ) { 2 n ) - ^ 2 \ K " { l ; t, s) p 1/2 exp K & ; t, s), (9.42)

where £ = £(t,s) is the solution to the q x 1 system o f equations 8K/d£, = 0. Let us write A(t,s) = —K{£( t,s) ;t , s} .

We require the m arginal density and distribution functions o f T* at t. In principle they can be obtained by integration of (9.42) numerically with respect to s, but this is time-consuming when s is a vector. A n alternative approach is analytical approxim ation using Laplace’s m ethod, which replaces the most im portant part o f the integrand — the rightm ost term in (9.42) — by a norm al integral, suitably centred and scaled. Provided tha t the m atrix d2A(t ,s ) /dsdsT is positive definite, the resulting approxim ate m arginal density o f T * at t is

J(t,S;?)(2n)-,/2|X"(|;t,S)|-1/2d2A(t, s)

dsdsT

- 1 /2

exp s), (9.43)

where \ = \ ( t ) and s = s(t) are functions o f t tha t solve simultaneously the

Page 492: Bootstrap Methods and Their Application

480 9 • Improved Calculation

q x 1 and (q — 1) x 1 systems of equations

8K ; t, s) i 8K (< ; t, s) , 8 a j s)— al— = nY lp A s ) = °> — js — = nYlpft>s ) d-s- t = ° -

j = 1 ;= 1(9.44)

These can be solved using packaged routines, with starting values given by noting tha t when t equals its sample value to, say, s equals its sample value and £ = 0.

The second derivatives o f A needed to calculate (9.43) may be expressed as

8 2A(t,s) _ d2K(£; t , s ) f d 2K ( i ; t , s ) Y i d2K(£-,t ,s) 82K(t;-,t,s)8s8sT 8 s8 £T \ 8 ^ 8 ^ T J 8 £,dsT 8sdsT

where a t the solutions to (9.44) the matrices in (9.45) are given by

8 2K{t - , t , s )n^2p' j( t ,s )aj ( t ,s )aj ( t ,s )T, (9.46)

(9.47,

with sc and sj the cth and dth com ponents o f s.The m arginal C D F approxim ation for T ' a t t is (9.28), with

w = s ig n ( t - t0){ 2 X (^ ;t,s )} 1/2, (9.49)

|K "(£ ;t,* )l1/2d2A (t, s)

8 s8sT

1/2

dt

evaluated at s = s, £ = | ; the only additional quantity needed here is

(9.51);= i

Approxim ate quantiles o f T* can be obtained in the way described ju st before Example 9.13.

The expressions above look forbidding, bu t their im plem entation is relatively straightforward. The key point to note is tha t they depend only on the quanti­ties aj(t, s), their first derivatives with respect to t, and their first two derivatives with respect to s. Once these have been program m ed, they can be input to a generic routine to perform the saddlepoint approxim ations. Difficulties tha t sometimes arise with numerical overflow due to large exponents can usually be circumvented by rescaling da ta to zero m ean and unit variance, which has no

Page 493: Bootstrap Methods and Their Application

9.5 ■ Saddlepoint Approximation 481

effect on location- and scale-invariant quantities such as studentized statistics. Rem em ber, however, our initial com m ents in Section 9.1: the investm ent of time and effort needed to program these approxim ations is unlikely to be worthwhile unless they are to be used repeatedly.

Example 9.19 (M aize data) To illustrate these ideas we consider the bootstrap variance and studentized average for the maize data. Both these statistics are location-invariant, so w ithout loss o f generality we replace yj with yj — y and henceforth assume that y = 0. W ith this simplification the statistics o f interest are

ii-1 Y , V 2 = V * {1 + Z * 2/(n - 1)} , n~l Y , Yj = Z ' V l/2{n - 1)~1/2,

needed to calculate (9.43)—(9.51) are readily obtained.To find the m arginal distribution o f Z*, we apply (9.43)-(9.51) with t = z

and s = v. For a given value of z, the three equations in (9.44) are easily solved numerically. The upper panels o f Figure 9.11 com pare the saddlepoint distribution and density approxim ations for Z* with a large simulation. The analytical quantiles are very close to the simulated ones, and although the saddlepoint density seems to have integral greater than one it captures well the skewness o f the distribution.

For V * we take t = v and s = z, but the lower left panel o f Figure 9.11 shows that resulting P D F approxim ation fails to capture the bim odality o f the density. This arises because V * is deflated for resamples in which neither of the two smallest observations — which are som ewhat separated from the rest — appear.

The contours o f —A(z, v) in the lower right panel reveal a potential problem with these methods. For z = —3.5, the Laplace approxim ation used to obtain (9.43) am ounts to replacing the integral o f exp{— A(z, t>)} along the dashed vertical line by a norm al approxim ation centred at A and with precision given by the second derivative o f A(z, v) at A along the line. But A(—3.5, v) is bim odal for v > 0, and the Laplace approxim ation does no t account for the second peak at B. As it turns out, this doesn’t m atter because the peak at B is so much

where Y" = n 1 J2 YJ. A little algebra shows that

so to apply the integration approach we take pj = n 1 and

from which the 2 x 1 m atrices o f derivatives

daj(z,v) daj(z,v) d2cij(z,v) 8 2aj(z,v)8 z dv 8z 2 ’ 8v1

Page 494: Bootstrap Methods and Their Application

482 9 * Improved Calculation

oWa>cco3o

Quantiles of standard normal

1000 2000 3000

v

-4 *2 0 2

z

-2

Figure 9.11Saddlepoint approximations for the bootstrap variance V* and studentized average Z* for the maize data. Top left:approximations to quantiles of Z* by integration saddlepoint (solid) and simulation using 50000 bootstrap samples (every 20th order statistic is shown). Top right: density approximations for Z* by integration saddlepoint (heavy solid), approximate cumulant-generating function (solid), and simulation using 50 000 bootstrap samples. Bottom left: corresponding approximations for V*. Bottom right: contours of — A(z,t>), with local maxima along the dashed line z = —3.5 at A and at B.

lower than at A tha t it adds little to the integral, but clearly (9.43) would be catastrophically bad if the peaks at A and B were com parable. This behaviour occurs because there is no guarantee that A(z, v) is a convex function o f v and z. If the difficulty is thought to have arisen, numerical integration of (9.42) can be used to find the m arginal density o f Z ’, but the problem is no t easily diagnosed except by checking that (9.45) is positive definite a t any solution to (9.44) and by checking tha t different initial values o f c, and s lead to the the same solution for a given value o f t. This may increase the com putational burden to an extent tha t direct sim ulation is m ore efficient. Fortunately this difficulty is m uch rarer in larger samples.

The quantities needed for the approxim ate cum ulant-generating function

Page 495: Bootstrap Methods and Their Application

9.5 • Saddlepoint Approximation 483

approach to obtaining the distribution o f n '/2(n — 1 )-1/2Z* were given in Example 9.18. The approxim ate cum ulants for Z* are Kc,i = 0.13, k c ,2 = 1.08, Kc,3 = 0.51 and kc ,4 = 0.50, with k c ,2 = 0.89 and k c ,4 = —0.28 when the terms involving the are dropped. W ith or w ithout these terms, the cum ulants are some way from the values 0.17, 1.34, 1.05, and 1.55 estimated from 50000 simulations. The upper right panel o f Figure 9.11 shows the PD F approxim ation based on the modified cum ulant-generating function; in this case Kcfi(£) is convex. The modified P D F m atches the centre o f the distribution more closely than the integration PDF, but is poor in the upper tail.

For V' , we have

U = (yi ~ y f ~ t, qtj = - 2 (yt - y)(yj - y), ciJk = 0,

so the approxim ate cum ulants are kc,i = 1241, kc ,2 /k c i = kc j / kc i = 0.013 and , = —0.0015; the corresponding simulated values are 1243,0.18, 0.018, 0.0010. N either saddlepoint approxim ation captures the bimodality o f the simulations, though the integration m ethod is the better o f the two. In this case b = j for the approxim ate cum ulant-generating function method, and the resulting density is clearly too close to norm al. ■

Example 9.20 (Robust M -estim ate) For a second example o f m arginal ap­proximation, we suppose tha t 8 and a are M -estimates found from a random sample y i , . . . , y n by sim ultaneous solution o f the equations

7=1 v 7 ;=1 v 7

The choice rp(e) = e, ^(e) = e2, y = 1 gives the non-robust estimates 8 = y and <t2 = h-1 Y ^ y j ~ y)2- Below we use the more robust H uber M -estimate specified by (9.35), with x(s) = W2(s) and with y taken to equal E{x(e)}, where e is standard normal. For purposes o f illustration we take c = 1.345.

Suppose we w ant to use 8 to get a confidence interval for 8 . Since the nonparam etric delta m ethod estimate o f var(0) is <x2 Y V 2(ej ) / { Y vA f/)}2 (Problem 6.7), where e; = (yj — 8 ) /a, the studentized version o f 8 is

8 - 9 n~] Y nJ=i v' jej )a ( y/n)x/ 2

which is proportional to the usual Student-t statistic when ip(s) = e. In order to set studentized bootstrap confidence limits for 8 , we need approxim ations to the bootstrap quantiles o f Z . These may be obtained by applying the m arginal saddlepoint approxim ation outlined above with T = Z , S = (Si, S i )7 , Pj = n-1 ,

Page 496: Bootstrap Methods and Their Application

484 9 • Improved Calculation

« (% )

0.1 1 2.5 5 10 90 95 97.5 99 99.9

Sim’n -3.81 -2.68 -2.21 -1 .86 -1.49 1.25 1.62 1.94 2.35 3.49S’poin t -3.68 -2 .60 -2.11 -1 .72 -1.31 1.24 1.62 1.97 2.42 3.57

and

(xp (ae j/s i - z d / s 2) \ tp2 (ae j/ si - z d / s 2) - y j , (9.52)

tp' (pe j/ s \ — zd /s 2) - s 2 )

where d = (y/ri)l/2. For the H uber estimate, s2 = n_1 ~ z*d/s\\ < c)takes the discrete range o f values j / n ; here 1(A) is the indicator o f the event A. In the extreme case with c = oo, sj always equals one, bu t even if c is finite it will be unwise to treat sj as continuous unless n is very large. We therefore fix sj = s2, and modify a,- by dropping its third com ponent, so tha t q = 2, and S = Si. W ith this change the quantities needed for the P D F and C D F approxim ations are

daj(t,s) _ / —xp'd/s2

dt \ —2 xpxp'd/s2 ,ddj(t,s) _ f —aejtp' /s2

ds \ —2 oej\py>'/sid2cij(t,s) _ ( 2 aej\p'/sl

8s8sT \4aejxpxp,/ s l + 4 a 2e2\p,2 /s'\

where tp and xp' are evaluated at aej/ s \ — zd / s 2.For the maize da ta the robust fit downweights the largest and two smallest

observations, giving 6 = 26.68 and a = 25.20. Table 9.9 com pares saddlepoint and sim ulated quantiles o f Z*. The agreem ent is generally poorer than in the previous examples.

To investigate the properties o f studentized bootstrap confidence intervals based on Z , we conducted a small sim ulation experiment. We generated 1000 samples o f size n from the norm al distribution, the t$ distribution, the “slash” distribution — the distribution o f an N ( 0 , 1) variate divided by an independent 17(0,1) variate — and the x$ distribution. For each sample confidence intervals were obtained using the saddlepoint m ethod described above. Table 9.10 shows the actual coverages o f nom inal 95% confidence intervals based on the integration saddlepoint approxim ation. For the symmetric distributions the results are rem arkably good. The assum ption o f symmetric errors is false for the xl distribution, and its results are poorer. In the symmetric cases the saddlepoint m ethod failed to converge for about 2% o f samples, for which

Table 9.9 Comparison of results from 50000 simulations with integration saddlepoint approximation to bootstrap a quantiles of a robust studentized statistic for the maize data.

Page 497: Bootstrap Methods and Their Application

9.6 • Bibliographic Notes 485

Table 9.10 Coverage (%) of nominal 90% and 95% confidence intervals based on the integration saddlepoint approximation to a studentized bootstrap statistic, based on 1000 samples of size n from underlying normal, ts, slash, and xl distributions. Two-sided 90% and 95% coverages are given for all distributions, but for the asymmetric x2 distribution one-sided 5, 95, 2.5 and 97.5 % coverages are given also.

N orm al h Slash C hi-squared

90 95 90 95 90 95 5 95 90 2.5 97.5 95

n = 20 91 95 91 96 91 95 14 97 83 9 97 88n = 40 90 94 89 95 89 95 9 94 85 6 95 89

sim ulation would be needed to obtain confidence intervals; we simply left these samples out. Curiously there were no convergence problems for the xl samples.

One com plication arises from assuming that the error distribution is sym­metric, in which case the discussion in Section 3.3 implies that our resampling scheme should be modified accordingly. We can do so by replacing (9.41) with

X (^ ;z ,S !) = nlog i £ ^ e x p { ^ Ta/(z ’Sl)} + I ^ P . / exP { f Ta}(z>s i)}j = 1 j ~ i

where a'j{z,s\) is obtained from (9.52) by replacing ej with —e; . However the odd cum ulants o f are then zero, and a norm al approxim ation to the distribution o f Z will often be adequate.

Even w ithout this modification, it seems that the m ethod described above yields robust confidence intervals with coverages very close to the nom inal level.

Relative timings for sim ulation and saddlepoint approxim ations to the boo t­strap distribution o f Z* depend on how the m ethods are implemented. In our im plem entation for this example it takes about the same time to obtain 1000 values o f Z ' by sim ulation as to calculate 50 saddlepoint approxim ations using the integration m ethod, but this com parison is not realistic because the saddlepoint m ethod gives accurate quantile estimates much further into the tails o f the distribution o f Z*. If just two quantile estimates are needed, as would be the case for a 95% confidence interval, the saddlepoint m ethod is about ten times faster. O ther studies in the literature suggest that, once pro­grammed, saddlepoint m ethods are 20-50 times faster than simulation, and that efficiency gains tend to be larger with larger samples. However, saddle­point approxim ation fails on about 1-2% o f samples, for which sim ulation is needed. ■

9.6 Bibliographic NotesVariance reduction m ethods for param etric sim ulation have a long history and a scattered literature. They are discussed in books on M onte C arlo methods,

Page 498: Bootstrap Methods and Their Application

486 9 ■ Improved Calculation

such as Hammersley and H andscom b (1964), Bratley, Fox and Schrage (1987), Ripley (1987), and N iederreiter (1992).

Balanced bootstrap sim ulation was introduced by Davison, Hinkley and Schechtman (1986). O gbonm wan (1985) describes a slightly different m ethod for achieving first-order balance. G raham et al. (1990) discuss second-order balance and the connections to classical experim ental design. A lgorithm s for balanced sim ulation are described by Gleason (1988). Theoretical aspects o f balanced resampling have been investigated by D o and Hall (1992b). Balanced sampling m ethods are related to num ber-theoretical m ethods for integration (Fang and Wang, 1994), and to Latin hypercube sampling (M cKay, Conover and Beckman, 1979; Stein, 1987; Owen, 1992b). Diaconis and Holmes (1994) discuss the complete enum eration o f bootstrap samples by m ethods based on G ray codes.

Linear approxim ations were used as control variates in bootstrap sampling by Davison, Hinkley and Schechtman (1986). A different approach was taken by Efron (1990), who suggested the re-centred bias estimate and the use o f control variates in quantile estimation. D o and Hall (1992a) discuss the properties o f this m ethod, and provide com parisons with other approaches. Further discussion o f control m ethods is contained in theses by Therneau (1983) and H esterberg (1988).

Im portance resampling was suggested by Johns (1988) and D avison (1988), and was exploited by Hinkley and Shi (1989) in the context o f iterated bootstrap confidence intervals. Gigli (1994) outlines its use in param etric sim ulation for regression and certain time series problems. H esterberg (1995b) suggests the application o f ratio and regression estim ators and o f defensive mixture distributions in im portance sampling, and describes their properties. The large-sample perform ance o f im portance resampling has been investigated by D o and H all (1991). Booth, Hall and W ood (1993) describe algorithm s for balanced im portance resampling.

B ootstrap recycling was suggested by Davison, Hinkley and W orton (1992) and independently by N ew ton and G eyer (1994), following earlier ideas by J. W. Tukey; see M orgenthaler and Tukey (1991) for application o f similar ideas to robust statistics. Properties o f recycling in various applications are discussed by V entura (1997).

Saddlepoint m ethods have a history in statistics stretching back to Daniels (1954), and they have been studied intensively in recent years. Reid (1988) reviews their use in statistical inference, while Jensen (1995) and Field and Ronchetti (1990) give longer accounts; see also Barndorff-Nielsen and Cox (1989). Jensen (1992) gives a direct account o f the distribution function ap­proxim ation we use. Saddlepoint approxim ation for perm utation tests was proposed by Daniels (1955) and further discussed by Robinson (1982). D avi­son and Hinkley (1988), Daniels and Young (1991), and W ang (1993b) in­

Page 499: Bootstrap Methods and Their Application

9.7 - Problems 487

vestigate their use in a num ber o f resampling applications, and others have investigated their use in confidence interval estim ation (DiCiccio, M artin and Young 1992a,b, 1994). The use of approxim ate cum ulant-generating functions is suggested by Easton and Ronchetti (1986), G atto and Ronchetti (1996), and G atto (1994), while W ang (1992) shows how the approxim ation may be modified to ensure the saddlepoint equation has a single root. W ang (1990) discusses the accuracy o f such m ethods in the bootstrap context. Booth and Butler (1990) show how relationships am ong exponential family distributions may be exploited to give saddlepoint approxim ations for a num ber o f boot­strap and perm utation inferences, while W ang (1993a) describes an alternative approach for use in finite population problems. The m arginal approxim ation in Section 9.5.3 extends and corrects tha t o f Davison, Hinkley and W orton (1995); see also Spady (1991). The discussion in Example 9.18 follows Hinkley and Wei (1984). Jing and Robinson (1994) give a careful discussion o f the accuracy o f conditional and m arginal saddlepoint approxim ations in boot­strap applications, while Chen and D o (1994) discuss the efficiency gains from com bining saddlepoint m ethods with im portance resampling.

O ther m ethods o f variance reduction applied to bootstrap sim ulation include antithetic sam pling (Hall, 1989a) — see Problem 9.21 — and R ichardson extrapolation (Bickel and Yahav, 1988) — see Problem 9.22.

Appendix II o f Hall (1992a) com pares the theoretical properties o f some of the m ethods described in this chapter.

9.7 Problems1 Under the balanced bootstrap the descending product factorial moments o f the /*•

are

= Y [ m(p“> R ^ / i n R ) ^ s">, (9.53)U V

where / (a) = / ! / ( / — a)!, and

Pu = ^ Qu — ^ ^ Sw>vv:rw = u w :jw=v

with u and v ranging over the distinct values o f row and column subscripts on the left-hand side o f (9.53).(a) Check the first- and second-order moments for the f’j at (9.9), and verify that the values in Problem 2.19 are recovered as R —* c o .

(b) Use the results from (a) to obtain the mean o f the bias estimate under balanced resampling.(c) Now suppose that 7” is a linear statistic, and let V — ( R — I)-1 ^ r(Tr' — T ' ) 2 be the estimated variance o f T based on the bootstrap samples. Show that the mean o f V ’ under multinomial sampling is asymptotically equivalent to the mean under hypergeometric sampling, as R increases.(Section 9.2.1; Appendix A; Haldane, 1940; Davison, Hinkley and Schechtman, 1986)

Page 500: Bootstrap Methods and Their Application

488 9 ■ Improved Calculation

2 Consider the following algorithm for generation o f R balanced bootstrap samples from y = (y i,...,y „ ):

Algorithm 9.3 (Balanced bootstrap 2)Concatenate y with itself R times to form a list o f length nR.For I = n R , . . . , 2 :

(a) Generate a random integer U in the range 1 , . . . , I.(b) Swap a.Vi and <&iv

Show that this produces output equivalent to Algorithm 9.1.Suggest a balanced bootstrap algorithm that uses storage 2n, rather than the Rn used above.(Section 9.2.1; Gleason, 1988; Booth, Hall and Wood, 1993)

Show that the re-centred estimate o f bias, B r^ , can be approximated by (9.11), and obtain its mean and variance under ordinary bootstrap sampling. Compare the results with those obtained using the balanced bootstrap.(Section 9.2.2; Appendix A; Efron, 1990)

D ata y \ , . . . , yn are sampled from a N{n,a2) distribution and we estimate a by the MLE t = {n-1 Y ( y j ~ >’)2}'/2- The bias o f T can be estimated theoretically:

D [ 21/2r ( f ) 1

l " I/2r ( ¥ ) J

But suppose that we estimate the bias by parametric resampling; that is, we generate samples y[,... ,y„ from the N(y,t2) distribution. Show that the raw and adjusted bootstrap estimates o f B can be expressed as

Br = Y xr/2 ~ 1

and

B RM j = n - ' /2R - ' | Y X r /2 - R i/2 ( Y X' + X ‘R+l

1 /2 >

where X l, . . . , X R are independent xl-i and X R+\ is independently ■/1R_ l.Use simulation with these representations to show that the efficiencies o f BRajj are roughly 8 and 16 for n = 10 and 20, for any R.(Section 9.2.2; Efron, 1990)

(a) Show that, for large n, the variance o f the bias estimate under ordinary resampling, (9.8), can be written (nA + 2B + C) / (Rn 2), while the variance of the bias estimate under balanced resampling, (9.11), is C / ( R n 2); here A, B, and C are quantities o f order one. Show also that the correlation p between a quadratic statistic T ‘ and its linear approximation T'L can be written as (nA + B ) /{nA(nA + 2B + C )}1/2, and hence verify that the variance o f the bias estimate is reduced by a factor o f 1 — p2 when balanced resampling is used.(b) Give p in terms o f sample moments when t = n~‘ Y ( y j — y)2, and evaluate the resulting expression for samples o f sizes n = 10 and 20 simulated from the normal and exponential distributions.(Section 9.2.3)

Page 501: Bootstrap Methods and Their Application

9.7 • Problems 489

6 Consider generating bootstrap samples y ' { , . . . , y ^ n, r = 1 ,.. . ,R , from y i , . . . , y n. Write y' j = y ^ j ) , where the elements o f the R x n matrix £ ( r , j ) take values in1(a) Show that first-order balance can be arranged by ensuring that each column o f £ contains each o f 1 wi th equal frequency, and deduce that when R = n the matrix is a randomized block design with treatment labels 1 , . . . , n and with columns as blocks. Hence explain how to generate such a design when R = kn.(b) Use the representation

nf'rj = ' £ H j - Z ( r , i ) } ,

i= 1

where S(u) = 1 if u = 0 and equals zero otherwise, to show that the ^-balanced design is balanced in terms o f f ’rj. Is the converse true?(c) Suppose that we have a regression model Yj = fixj + ej, where the independent errors e; have mean zero and variance a 2. We estimate fi by T = Y j X j / J 2 x j- Let T ' = 52(tXj+e' j )xj / Y l x ) denote a resampled version o f T , where e’ is selected randomly from centred residuals e; — e, with e} = y, — t x j and e = n~l ^ e;. Show that the average value o f T* equals t if R values o f T ' are generated from a <!;-balanced design, but not necessarily if the design is balanced in terms o f /* . (Section 9.2.1; Graham e t al., 1990)

7 (a) Following on from the previous question, a design is second-order <J-balanced if all n2 values o f ( £ ( r , i ) , l ; ( r , j ) ) occur with equal frequency for any pair o f columns i and j. With R = n2, show that this is achieved by setting the first column o f c to be ( l , . . . , l , 2 , . . . , 2 , . . . , n , . . . , n ) T, the second column to be ( 1 , 2 , . 1 , 2 , . . , , n ) T, and the remaining n — 2 columns to be the elements o f n — 2 orthogonal Latin squares with treatment labels 1 , . . . , n . Exhibit such a design for n = 3.(b) Think o f the design matrix as having rows as blocks, with treatment labels 1 , . . . , n to be allocated within blocks; take R = kn. Explain why a design is said to be second-or der / ’- ba lanced if

R R

Y l f ' J = k (2 n ~ !) . J 2 f r j f r k = k ( n - l ) , j , k = l , . . . , n , j + k.r = l r = l

Such a design is derived by replacing the treatment labels by 0 , . . . , n — 1, choosing k initial blocks with these replacement labels, adding in turn l , 2 , . . . , n — 1, and reducing the values mod n. With n = 5 and k = 3, construct the design with initial blocks (0 ,1 ,2 ,3 ,4 ), (0 ,0 ,0 ,1 ,3 ), and (0 ,0 ,0 ,1 ,2 ), and verify that it is first- and second-order balanced. Can the initial blocks be chosen at random?(Section 9.2.1; Graham e t al., 1990)

8 Suppose that you wish to estimate the normal tail probability f I { z < a}<p(z)dz, where <p(.) is the standard normal density function and I [A] is the indicator o f the event A, by importance sampling from a distribution H( ).Let H be the normal distribution with mean fi and unit variance. Show that the maximum efficiency is

<P(fl){l-< D (q)} exp(/i2)<I>(a + fi) — 9 ( a ) 2 ’

where (i is chosen to minimize exp(/r )<t>(a+^). Use the fact that <t>(z) = —<f>(z)/z for z < 0 to give an approximate value for fj., and plot the corresponding approximate efficiency for — 3 < a < 0. What happens when a > 0 1 (Section 9.4.1)

Page 502: Bootstrap Methods and Their Application

490 9 • Improved Calculation

9 Suppose that T \ , . . . , T R is a random sample from PD F h( ) and C D F H ( ) , and let g( ) be a PDF with the same support as h( ) and with p quantile t]p, i.e.

rip rip

' - L * w - / . S 5 " w

Let T(! ) < • • • < T{R) denote the order statistics o f the Tr, set

p 1 \ " g(T(r)) m — \ RR + i ^ H T {r)y

and let M be the random index determined by SM < p < SM+1- Show that as R —*-oo, and hence justify the estimate t"M given at (9.19).(Section 9.4.1; Johns, 1988)

10 Suppose that T has a linear approximation T[, and let p be the distribution on y y n with probabilities p; oc exp{ l l j / ( n v lL/ 2 ) } , where v L = n ~ 2 Y I l j - Find the moment-generating function o f T[ under sampling from p, and hence show that in this case T* is approximately N( t -I- A v j / 2 , v L ). You may assume that T[ is approximately N ( 0, v L ) when A = 0.(Section 9.4.1; Johns, 1988; Hinkley and Shi, 1989)

11 The linear approximation t ’L for a single-sample resample statistic is typicallyaccurate for t ’ near t, but may not work well near quantiles o f interest. For anapproximation that is accurate near the a quantile o f T \ consider expandingt" = t(p’) about p„ = (pi*,. . . ,p„a) rather than about (£ , . . . ,£ ) .(a) Show that if pja oc cx-p(n~lv~[[/2z j j ) , then t(pa) will be close to the a quantile o f T" for large n.(b) Define

Show that

d ,lj* = ^ t { ( l - f i)p a + e l j]

t’ = ?Ljz = t(pa) + n f ] h a -j = i

(c) For the ratio estimates in Example 2.22, compare numerically t’L, t'Lfi9 and the quadratic approximation

t Q = t + n ^ J l f ? J + l2 n ~ 2 H f j f k t * j = l j = 1 k= 1

with t ’ .(Sections 2.7.4, 3.10.2, 9.4.1; Hesterberg, 1995a)

12 (a) The importance sampling ratio estimator o f n can be written as

E ^ r ) w W n + R - V h ,J 2 W(yr) 1 + R~,/2Eo’

where si = R 1/2 $3{m(yr)w(jv) A1} an(i ®o = R 1/2 EW Ov) — !}• Show that thisimplies that

var (fiH, at) = K ^var {m(Y )w(Y) - /*w(Y)} .

\ j is a vector with one in the jth position and zeroes elsewhere.

Page 503: Bootstrap Methods and Their Application

9.7 ■ Problems 491

(b) The variance o f the importance sampling regression estimator is approximately

var(/iHreg) = R -'var {m (Y)w(Y ) - m v(Y )}, (9.54)

where a = cov{m(Y)w(Y), w(Y )}/var{w (Y )}. Show that this choice o f a achieves minimum variance among estimators for which the variance has form (9.54), and deduce that when R is large the regression estimator will always have variance no larger than the raw and ratio estimators.(c) As an artificial illustration o f (b), suppose that for 0 > 0 and some non-negative integer k we wish to estimate

/ /•°° vke~ym(y)g(y)dy = - j j - x e e ~ eydy

by simulating from density h(y) = fie~^y, y > 0, fi > 0. Give w(y) and show that E{m(Y)w(Y)} = n for any fi and 6, but that var(£//rat) is only finite when0 < fi < 2 6 . Calculate var{m(Y)w(Y)}, cov{m(Y)w(Y), w(Y)}, and var{w(Y)}.Plot the asymptotic efficiencies var(/i;; raw) / var(£// ra, ) and var(/*//ratv) / var(^W rfg) as functions o f fi for 0 = 2 and fc = 0 ,1 ,2 ,3 . Discuss your findings.(Section 9.4.2; Hesterberg, 1995b)

13 Suppose that an application o f importance resampling to a statistic T" has resulted in estimates tj < ■ ■ ■ < t'R and associated weights w”, and that the importance re­weighting regression estimate o f the C D F o f T" is required. Let A be the R x R matrix whose (r,s) element is w“/ ( t “ < t‘ ) and B be the R x 2 matrix whose rth row is ( I X ) . Show that the regression estimate o f the C D F at t \ , . . . , t ’R equals (1,1 ) (BTB)~iB TA.(Section 9.4.2)

14 (a) Let h = (h \ , . • ■,hn), k = 1 , . . . , n R , denote independent identically dis­tributed multinomial random variables with denominator 1 and probability vector p = (p\ , . . . ,p„). Show that SnK = Ylk=l ^ ^as a multinomial distribution with denominator nR and probability vector p, and that the conditional distribution o f I nR given that S„R = q is multinomial with denominator 1 and mean vector (nR)~{q, where q = (R i , . . . ,R„) is a fixed vector. Show also that

Prf/] i i , . . . ,/ni? InR I SnR q )

equalsnfi-l

g(inR I S„r = q) g (inR-j | = q — i„R-J+] — ■ ■ ■ — i„R j ,i =i

where g( ) is the probability mass function o f its argument.(b) Use (a) to justify the following algorithm:

Algorithm 9.4 (Balanced importance resampling)Initialize by setting values o f Ri , . . . ,R„ such that Rj = nRPj and = n^- For m = nR , . . . , 1:

(a) Generate u from the 1/(0,1) distribution.(b) Find the j such that £ 1=i Ri < um < Y2i=i Ri­fe) Set I m = j and decrease Rj to Rj — 1.

Return the sets {I„+l, . . . , I 2n}, •••, { /n(R_i)+1, ... , /„ « } as the indices ofthe R bootstrap samples o f size n. •

(Section 9.4.3; Booth, Hall and Wood, 1993)

Page 504: Bootstrap Methods and Their Application

492 9 • Improved Calculation

15 For the bootstrap recycling estimate o f bias described in Example 9.12, consider the case T = Y with the parametric model Y ~ N ( 0 , 1). Show that if H is taken to be the N ( y , a ) distribution, then the simulation variance o f the recycling estimate o f C is approximately

i / a2 y ~ 1)/2 r « ( « - ! ) 1 ° 2 1 1R + \ 2 « - l / I (2a - 3)3/2 R N 8 (a - \ f ' 2 N J J ’

provided a > Compare this to the simulation variance when ordinary double bootstrap methods are used.What are the implications for nonparametric double bootstrap calculations? In­vestigate the use o f defensive mixtures for H in this problem.(Section 9.4.4; Ventura, 1997)

16 Consider exponential tilting for a statistic whose linear approximation is

where the ( / ' , , . . . , f ‘„s), s = 1 , . . . , S, are independent sets o f multinomial frequen­cies.(a) Show that the cumulant-generating function o f TI is

s f 1 "sK { 0 = f t + Y n* lo6 \ ~ exP (^ y M )

s= l I t= 1

Hence show that choosing £ to give K' (^ ) = t0 is equivalent to exponential tilting o f T[ to have mean to, and verify the tilting calculations in Example 9.8.(b) Explain how to modify (9.26) and (9.28) to give the approximate PDF and C D F o f T[.(c) How can stratification be accommodated in the conditional approximations o f Section 9.5.2?(Section 9.5)

17 In a matched pair design, two treatments are allocated at random to each o f n pairs o f experimental units, with differences dj and average difference d = n~l J2 dj- If there is no real effect, all 2" sequences + d i , . . . , + d n are equally likely, and so are the values D" = n~l J2^j^j> where the Sj take values +1 with probability

The one-sided significance level for testing the null hypothesis o f no effect is Pr*(D* > d).(a) Show that the cumulant-generating function o f D' is

n

K(£ ) = Y iogcosh (Zdj/n), i=i

and find the saddlepoint equation and the quantities needed for saddlepoint approximation to the observed significance level. Explain how this may be fitted into the framework o f a conditional saddlepoint approximation.(b) See Practical 9.5.(Section 9.5.1; Daniels, 1958; Davison and Hinkley, 1988)

18 For the testing problem o f Problem 4.9, use saddlepoint methods to develop an approximation to the exact bootstrap P-value based on the exponential tilted EDF. Apply this to the city population data with n = 10.(Section 9.5.1)

1n

Page 505: Bootstrap Methods and Their Application

9.7 ■ Problems 493

19 (a) If W \ , . . . , W „ are independent Poisson variables with means show thattheir joint distribution conditional on J2j = m is multinomial with probability vector n = (fi\ ^ fij and denominator w. Hence justify the first saddlepoint approximation in Example 9.16.(b) Suppose that T* is the solution to an estimating equation o f form (9.32), but that f j = 0 or 1 and f j = m < n; T" is a delete-m jackknife value o f the original statistic. Explain how to obtain a saddlepoint approximation to the PD F o f T ’. How can this PDF be used to estimate var*(T‘)? D o you think the estimate will be good when m = n — 1 ?(Section 9.5.2; Booth and Butler, 1990)

20 (a) Show that the bootstrap correlation coefficient t ’ based on data pairs ( x j , Zj), j = 1 , . . . ,n , may be expressed as the solution to the estimating equation (9.40) with

where sT = (s1,s 2,s 3,s4), and show that the Jacobian J( t , s ;£) = n5(s3s4)1/2. Obtain the quantities needed for the marginal saddlepoint approximation (9.43) to the density o f T*.(b) What further quantities would be needed for saddlepoint approximation to the marginal density o f the studentized form o f T ‘ ?(Section 9.5.3; Davison, Hinkley and Worton, 1995; DiCiccio, Martin and Young, 1994)

21 Let T[‘ be a statistic calculated from a bootstrap sample in which appears with frequency f j (j = 1 , . . . ,n ) , and suppose that the linear approximation to T ' is T [ = t + n~‘ Y s f j h ’ where /i < k <■ ■■ </ „ . The statistic r2* antithetic to T,' is calculated from the bootstrap sample in which y, appears with frequency /* +l ..(a) Show that if T [ and r 2“ are antithetic,

and deduce that when T is the sample average and F is the exponential distribution the large-sample performance gain o f antithetic resampling is 6 /(12 — n2) = 2.8.(c) What happens if F is symmetric? Explain qualitatively why.(Hall, 1989a)

/ X j - S i \Zj - s2

Oj( t , s )= (Xj S i ) 2 5 3

(Zj - s2 j2 - s4 V (Xj - Si ) (Zj - S2) - t{s3s4)1/2 J

var{i(7Y + r 2*)} = J-n (n-l j 2 lJ + »~l E bh' n+l - j ,\ 7=1 7=1

and that this is roughly x2/2n as n—► 00, where

and t]p is the pth quantile o f the distribution o f L t(Y ;F).(b) Show that if T j is independent o f r,' the corresponding variance is

Page 506: Bootstrap Methods and Their Application

494 9 - Improved Calculation

22 Suppose that resampling from a sample o f size n is used to estimate a quantity z(n) with expansion

z(n) = zQ + n~az\ + n~2az2 -\----- , (9-55)

where zo, zi, z2 are unknown but a is known; often a = j . Suppose that we resample from the E D F F, but with sample sizes nQ, m, where 1 < no < nt < n, instead o f the usual n, giving simulation estimates z ' ( n0), z ' ( n t ) o f z(n0), z(nx).(a) Show that z*(n) can be estimated by

z‘(n) = z ’ (no) + ^ ^ (z‘(n0) - z > i ) } • n o n,

(b) Now suppose that an estimate o f z ’(n; ) based on Rj simulations has variance approximately b /R j and that the computational effort required to obtain it is cnjRj, for some constants b and c. Given no and ni, discuss the choice o f Rq and R\ to minimize the variance o f z"(n) for a given total computational effort.(c) Outline how knowledge o f the limit zo in (9.55) can be used to improve z ’(n). How would you proceed if a were unknown? D o you think it wise to extrapolate from just two values no and ?(Bickel and Yahav, 1988)

9.8 Practicals1 For ordinary bootstrap sampling, balanced resampling, and balanced resampling

within strata:

y <- rnorm(lO)junk.fun <- function(y, i) var(y[i]) junk <- boot(y, junk.fun, R=9) boot.array(junk) apply(j unk$t,2,sum)junk <- boot(y, junk.fun, R=9, sim="balanced") boot.array(junk) apply(j unk$t,2,sum)junk <- boot(y, junk.fun, R=9, sim="balanced",

strata=rep(l:2,c(5,5))) boot.array(j unk) apply(junk$t,2,sum)

Now use balanced resampling in earnest to estimate the bias for the gravity data weighted average:

grav.fun <- function(data, i){ d <- data[i,]

m <- tapply(d$g,d$series,mean) v <- tapply(d$g,d$series,var) n <- table(d$series) v <- (n-l)*v/nc(sum(m*n/v)/sum(n/v), sum(n/v)) }

grav.bal <- boot(gravity, grav.fun, R=49,strata=gravity$series, sim="balanced")

mean (grav. bal$t [, 1] ) -grav. bal$tO [1]

For the adjusted estimate o f bias:

Page 507: Bootstrap Methods and Their Application

Practicals 495

grav.ord <- boot(gravity, grav.fun, R=49, strata=gravity$series)

control(grav.ord,bias.adj =T)

Now a more systematic comparison, with 40 replicates each with R = 19:

R <- 19; nreps <- 40; bias <- matrix(,nreps,3) for (i in 1.’nreps) {grav.ord <- boot(gravity, grav.fun, R=R, strata=gravity$series) grav.bal <- boot(gravity, grav.fun, R=R,

strata=gravity$series, sim="balanced") bias[i,] <- c(mean(grav.ord$t[,l])-grav.ord$tO[l] ,

mean(grav.bal$t[,1])-grav.bal$t0[1], control(grav.ord,bias.adj=T)) }

biasapply(bias,2,mean) apply(bias,2,var) split.screen(c(l,2)) screen(l)

qqplot(bias [,1],bias[,2J,xlab="ordinary",ylab="balanced")abline(0,1,lty=2)screen(2)qqplot (bias [ ,2],bias[,3],xlab="balanced",ylab="adjusted") abline(0,1,lty=2)

What are the efficiency gains due to using balanced simulation and post-simulation adjustment for bias estimation here? Now a calculation to see the correlation between T ' and its linear approximation:

grav.ord <- boot(gravity, grav.fun, R=999, strata=gravity$series)grav.L <- empinf(grav.ord,type="reg")tL <- linear.approx(grav.ord,grav.L,index=l)close.screen(all=T)plot(tL,grav.ord$t[,1])cor(tL,grav.ord$t[,1])

Finally, calculations for the estimates o f bias, variance and quantiles using the linear approximation as control variate:

grav.cont <- control(grav.ord,L=grav.L,index=l)grav.contSbiasgrav.cont$var

grav.cont$quantiles

To use importance resampling to estimate quantiles o f the contrast o f averages for the tau data o f Practical 3.4, we first set up strata, a weighted version o f the statistic t, a contrast o f averages, and calculate the empirical influence values:

tau.w <- function(data, w){ d <- data$rate*w

d <- tapply(d,data$decay,sum)/tapply(w,data$decay,sum) d[l]-sum(d [-1]) }

tau.L <- empinf(data=tau, statistic=tau.w, strata=tau$decay)

We could use exponential tilting to find distributions tilted to 14 and 18 (the original value o f t is 16.16):

Page 508: Bootstrap Methods and Their Application

496 9 ■ Improved Calculation

e x p . t i l t ( t a u .L , t h e t a = c ( 1 4 ,1 8 ) ,t0 = 1 6 .16)

Function t i l t . b o o t does this automatically. Here we do 199 bootstraps without tilting, then 100 each tilted to the 0.05 and 0.95 quantiles o f these 199 values o f t". We then display the weights, without and with defensive mixture distributions:

t a u . t i l t <- t i l t . b o o t ( ta u , t a u . w, R=c( 1 9 9 ,1 0 0 ,1 0 0 ) ,s tr a ta = ta u $ d e c a y , sty p e= "w", L =tau. L, a lp h a= c( 0 .0 5 ,0 .9 5 ) )

s p l i t . s c r e e n ( c ( 1 ,2 ) )s c r e e n ( l ) ; p lo t ( t a u . t i l t $ t , im p .w e i g h t s ( t a u . t i l t ) , l o g = " y " ) s c r e e n ( 2 ) ; p lo t ( t a u . t i l t $ t , imp. w e ig h ts ( ta u . t i l t , d e f= F ), log= " y")

The corresponding estimated quantiles are

im p .q u a n t i l e ( t a u .t i l t ,a lp h a = c ( 0 .0 5 ,0 .9 5 ) )imp. q u a n t i le ( t a u . t i l t , a lp h a= c( 0 .0 5 ,0 .9 5 ) ,def=F)

The same can be done with frequency smoothing, but then the initial value o f R must be larger:

t a u .f r e q <- t i l t . b o o t ( t a u , ta u .w , R=c(4 9 9 ,2 5 0 ,2 5 0 ) ,s tr a ta = ta u $ d e c a y , stype="w", t i l t = F , a lp h a= c( 0 .0 5 ,0 .9 5 ) )

im p .q u a n t ile ( ta u .f r e q ,a lp h a = c (0 .0 5 ,0 .9 5 ) )

For balanced importance resampling we simply add sim ="balanced" to the argu­ments o f t i l t . b o o t . For a small simulation study to see the potential efficiency gains over ordinary sampling, we compare the performance o f ordinary sampling and importance resampling with and without balance, in estimating the 0.1 and0.9 quantiles o f the distribution o f t".

t a u . t e s t < - NULL fo r ( ir e p in 1 :10){ ta u .b o o t < - b o o t( ta u , ta u .w , R=199, stype="w",

stra ta = ta u $ d eca y ) q .o rd <- s o r t ( t a u .b o o t $ t ) [ c ( 2 0 ,180)] t a u . t i l t < - t i l t . b o o t ( t a u , ta u .w , R = c (9 9 ,5 0 ,5 0 ) ,

s tr a ta = ta u $ d e c a y , stype="w", L =tau.L , a lp h a= c( 0 .1 ,0 .9 ) )

q . t i l t <- im p .q u a n t i l e ( t a u . t i l t , a lp h a= c( 0 . 1 , 0 . 9 ) ) $raw t a u .b a l < - t i l t . b o o t ( t a u , ta u .w , R = c (9 9 ,5 0 ,5 0 ) ,

s tra ta = ta u $ d eca y , stype="w", L =tau.L , a lp h a = c (0 .1 ,0 .9 ) , sim ="balanced")

q .b a l < - im p .q u a n t i le ( ta u .b a l , a lp h a = c (0 .1 , 0 .9 ))$ ra w t a u . t e s t < - r b in d ( t a u . t e s t , c (q .o r d , q . t i l t , q .b a l ) ) >

s q r t ( a p p ly ( t a u . t e s t , 2 , v a r ))

What are the efficiency gains o f the two importance resampling methods?

Consider the bias and standard deviation functions for the correlation o f the c la r id g e data (Example 4.9). To estimate them, we perform a double bootstrap and plot the results, as follows.

c la r .f u n < - fu n c t io n (d a ta , f ){ r <- c o r r (d a ta , f / s u m ( f ) )

n <- nrow (data) d <- d a ta [r e p ( 1 : n , f ) , ]us < - ( d [ , l ] - m e a n ( d [ , l ] ) ) / s q r t ( v a r ( d [ , l ] ) ) xs < - (d [ ,2 ]-m e a n (d [ ,2 ] ) ) / s q r t (v a r (d [ ,2 ] ) )

Page 509: Bootstrap Methods and Their Application

9.8 ■ Practicals 497

L <- us*xs - r*(us"2+xs~2)/2 v <- sum((L/n)*2)clar.t <- boot(d, corr, R=25, stype="w")$t i <- is.na(clar.t) clar.t <- clar.t[!i]c(r, v, mean(clar.t)-r, var(clar.t), sum(i)) >

clar.boot <- boot(claridge, clar.fun, R=999, stype="f") split.screen(c(1,2)) screen(l)plot (clar .boot$t [, 1] , clar .boot$t [,3] ,pch=" . " ,

xlab="theta*",ylab="bias") lines(lowess(clar.boot$t[,l] ,clar.boot$t[,3] ,f=l/2) ,lwd=2) screen(2)plot(clar.boot$t[,1],sqrt(clar,boot$t[ , 4 ] ),pch=".",

xlab="theta*",ylab="SD")

1 <- lowess(clar.boot$t[,l] ,clar.boot$t[,4] ,f=l/2) lines(l$x,sqrt(l$y),lwd=2)

To obtain recycled estimates using only the results from a single bootstrap, and to compare them with those from the double bootstrap:

clar.rec <- boot(claridge, corr, R=999, stype="w")

IS.ests <- function(theta, boot.out, statistic, A=0.2){ f <- smooth.f(theta,boot.out,width=A)

theta.f <- statistic(boot.out$data,f/sum(f))IS.w <- imp.weights(boot.out,q=f)moms <- imp.moments(boot.out,t=boot.out$t[,1]-theta.f,w=IS.w) c(theta, theta.f, moms$raw, moms$rat, moms$reg) }

IS.clar <- matrix(,41,8) theta <- seq(0,0.8,length=41)for (j in 1:41) IS.clar[j,] <- IS.ests (theta [j] , clar .rec, corr) screen(l,new=F)lines(IS.clar[,2],IS.clar [,7]) lines (IS. clar [, 2] ,IS.clar[,5] ,lty=3) lines(IS.clar [,2] ,IS.clar[,3] ,lty=2) screen(2, new=F)

lines(IS.clar[,2],sqrt(IS.clar[,8])) lines(IS.clar[,2],sqrt(IS.clar[,6]),lty=3) lines(IS.clar [,2],sqrt(IS.clar[,4]),lty=2)

D o you think these results are close enough to those from the double bootstrap? Compare the values o f 9 in IS. c la r [, 1] to the values o f O' = t(Fg) in IS. c la r [, 2].

Dataframe c a p a b i l i t y gives “data” from Bissell (1990) comprising 75 successive observations with specification limits U = 5.79 and L = 5.49; see Problem 5.6 and Practical 5.4. Suppose that we wish to use the range o f blocks o f 5 observations to estimate <x, in which case 0 = k / r 5, where k = (U — L)d5. Then 8 is the root o f the estimating equation ^T;(/c — r5j-0) = 0; this is just a ratio statistic. We estimate thePD F o f 6' by saddlepoint methods as follow s:

psi <- function(tt, r, k=2.236*(5.79-5.49)) k-r*tt psil <-function(tt, r, k=2.236*(5.79-5.49)) r det.psi <- function(tt, r, xi){ p <- exp(xi * psi(tt, r))

length(r) * abs(sum(p * psil(tt,r))/sum(p)) }

Page 510: Bootstrap Methods and Their Application

498 9 • Improved Calculation

r5 <- apply(matrix(capability$y,15,5,byrow=T), 1, function(x) diff(range(x)))

m <- 300; top <- 10; bot <- 4 sad <- matrix(, m, 3) th <- seq(bot,top,length=m) for (i in l:m){ sp <- saddle(A=psi(th[i], r5), u=0)

sad[i,] <- c(th[i] , sp$spa[l] *det .psi(th[i] , r5, xi=sp$zeta.hat) , sp$spa[2]) }

sad <- sad[! is.na(sad[,2] )&!is.na(sad[,3] ) ,]

plot(sad[,l],sad[,2],type="l",xlab="theta hat",ylab="PDF")

To obtain the quantiles o f the distribution o f 6' , we use the following code; here capab. tO contains 9 and its standard error.

theta.fun <- function(d, w, k=2.236*(5.79-5.49)) k*sum(w)/sum(d*w) capab.v <- var.linear(empinf(data=r5, statistic=theta.fun)) capab.tO <- c(2.236*(5.79-5.49)/mean(r5),sqrt(capab.v))Afn <- function(t, data, k=2.236*(5.79-5.49)) k-t*dataufn <- function(t, data, k=2.236*(5.79-5.49)) 0capab.sp <- saddle.distn(A=Afn, u=ufn, t0=capab.t0, data=r5) capab.sp

We can use the same ideas to apply the block bootstrap. Now we take b = 15 of the n — I + 1 blocks o f successive observations o f length / = 5. We concatenate them to form a new series, and then take the ranges o f each block o f successive observations. This is equivalent to selecting b ranges from among the n — I + 1 possible ranges, with replacement. The quantiles o f the saddlepoint approximation to the distribution o f 6’ under this scheme are found as follows.

r5 <- NULLfor (j in 1:71) r5 <- c(r5, diff(range(capability$y[j:(j+4)])))Afn <- function(t, data, k=2.236*(5.79-5.49)) cbind(k-t*data,1)ufn <- function(t, data, k=2.236*(5.79-5.49)) c(0,15)capab.spl <- saddle.distn(A=Afn,u=ufn,wdist="p",

type="cond",t0=capab.tO,data=r5)capab.spl$quantiles

Compare them with the quantiles above. How do they differ? Why?

5 To apply the saddlepoint approximation given in Problem 9.17 to the paired com­parison data o f Problem 4.7, and obtain a one-sided significance level Pr'(D’ > d):

K <- function(xi) sum(log(cosh(xi*darwin$y)))-xi*sum(darwin$y)K2 <- function(xi) sum(darwin$y~2/cosh(xi*darwin$y)~2) darwin.saddle <- saddle(K.adj=K,K2=K2) darwin.saddle 1-darwin.saddle$spa[2]

Page 511: Bootstrap Methods and Their Application

10

Semiparametric Likelihood Inference

10.1 LikelihoodThe likelihood function is central to inference in param etric statistical models. Suppose tha t da ta y are believed to have come from a distribution F w, where xp is an unknow n p x 1 vector param eter. Then the likelihood for rp is the corresponding density evaluated at y, namely

L ( w ) = / v (y ).

regarded as a function o f xp. This measures the plausibility o f the different values o f ip which m ight have given rise to y , and can be used in various ways.

If further inform ation about \p is available in the form of a prior probability density, n(xp), Bayes’ theorem can be used to form a posterior density for ip given the da ta y,

"(V I ■

Inferences regarding xp or other quantities o f interest may then be based on this density, which in principle contains all the inform ation concerning xp.

If prior inform ation about xp is not available in a probabilistic form, the likelihood itself provides a basis for com parison o f different values o f xp. The m ost plausible value is tha t which maximizes the likelihood, namely the m a x im u m l ikel ihood est imate, xp. The relative plausibility o f o ther values is m easured in terms of the log likelihood t?(xp) = log L(xp) by the l i kel ihood rat io statist ic

W{ y>) = 2 { t ( \ p ) - / ( x p ) } .

A key result is tha t under repeated sampling o f da ta from a regular model, W(xp) has approxim ately a chi-squared distribution with p degrees o f freedom. This forms the basis for the prim ary m ethod of calculating confidence regions

499

Page 512: Bootstrap Methods and Their Application

500 10 ■ Semiparametric Likelihood Inference

in param etric models. One special feature is that the likelihood determ ines the shape o f confidence regions when xp is a vector.

Unlike many o f the confidence interval m ethods described in C hapter 5, likelihood provides a natural basis for the com bination o f inform ation from different experiments. If we have two independent sets o f data, y and z, that bear on the same param eter, the overall likelihood is simply L(xp) = f ( y I W)f(z I and tests and confidence intervals concerning 1p may be based on this. This type o f com bination is particularly useful in applications where several independent experim ents are linked by com m on param eters; see Practical 10.1.

In applications we can often write xp = (6 ,X), where the com ponents o f 8 are o f prim ary interest, while the so-called nuisance param eters X are o f secondary concern. In such situations inference for 8 is based on the profile likelihood,

L p(6 ) = max L( 8 , X), (10.1)

which is treated as if it were a likelihood. In some cases, particularly those where X is high dimensional, the usual properties o f likelihood statistics (consistency o f maximum likelihood estimate, approxim ate chi-squared distribution o f log likelihood ratio) do not apply w ithout m aking an adjustm ent to the profile likelihood. The adjusted likelihood is

L a ( 8 ) = L p ( o m e , i<or1/2, (io.2)where % is the M LE o f X for fixed 6 and jx(xp) is the observed inform ation m atrix for X, i.e. jx(xp) = —d2f(xp)/dXdXT.

W ithout a param etric model the definition o f a param eter is m ore vexed. As in C hapter 2, we suppose that a param eter d is determ ined by a statistical function t(-), so that 8 = t(F) is a mean, median, o r o ther quantity determined by, bu t not by itself determining, the unknow n distribution F. Now the nuisanceparam eter X is all aspects o f F o ther than t(F), so tha t in general X isinfinite dimensional. N ot surprisingly, there is no unique way to construct a likelihood in this situation, and in this chapter we describe some o f the different possibilities.

10.2 Multinomial-Based Likelihoods10.2.1 Empirical likelihood Scalar parameter

Suppose that observations $y_1, \ldots, y_n$ form a random sample from an unknown distribution $F$, and that we wish to construct a likelihood for a scalar parameter $\theta = t(F)$, where $t(\cdot)$ is a statistical function. One view of the EDF $\hat F$ is that it is the nonparametric maximum likelihood estimate of $F$, with corresponding


nonparametric maximum likelihood estimate $t = t(\hat F)$ for $\theta$ (Problem 10.1). The EDF is a multinomial distribution with denominator one and probability vector $(n^{-1}, \ldots, n^{-1})$ attached to the $y_j$. We can think of this distribution as embedded in a more general multinomial distribution with arbitrary probability vector $p = (p_1, \ldots, p_n)$ attached to the data values. If $F$ is restricted to be such a multinomial distribution, then we can write $t(p)$ rather than $t(F)$ for the function which defines $\theta$. The special multinomial probability vector $(n^{-1}, \ldots, n^{-1})$ corresponding to the EDF is $\hat p$, and $t = t(\hat p)$ is the nonparametric maximum likelihood estimate of $\theta$. This multinomial representation was used earlier in Sections 4.4 and 5.4.2.

Restricting the model to be multinomial on the data values with probability vector $p$, the parameter value is $\theta = t(p)$ and the likelihood for $p$ is $L(p) = \prod_{j=1}^n p_j^{f_j}$, with $f_j$ equal to the frequency of value $y_j$ in the sample. But, assuming there are no tied observations, all $f_j$ are equal to 1, so that $L(p) = p_1 \times \cdots \times p_n$: this is the analogue of $L(\psi)$ in the parametric case. We are interested only in $\theta = t(p)$, for which we can use the profile likelihood

$$L_{EL}(\theta) = \sup_{p : t(p) = \theta} \prod_{j=1}^n p_j, \qquad (10.3)$$

which is called the empirical likelihood for $\theta$. Notice that the value of $\theta$ which maximizes $L_{EL}(\theta)$ corresponds to the value of $p$ maximizing $L(p)$ with only the constraint $\sum p_j = 1$, that is $\hat p$. In other words, the empirical likelihood is maximized by the nonparametric maximum likelihood estimate $t$.

In (10.3) we maximize over the $p_j$ subject to the constraints imposed by fixing $t(p) = \theta$ and $\sum p_j = 1$, which is effectively a maximization over $n - 2$ quantities when $\theta$ is scalar. Remarkably, although the number of parameters over which we maximize is comparable with the sample size, the approximate distributional results from the parametric situation carry over. Let $\theta_0$ be the true value of $\theta$, with $T$ the maximum empirical likelihood estimator. Then under mild conditions on $F$ and in large samples, the empirical likelihood ratio statistic

$$W_{EL}(\theta_0) = 2\{\log L_{EL}(T) - \log L_{EL}(\theta_0)\}$$

has an approximate chi-squared distribution with $d$ degrees of freedom. Although the limiting distribution of $W_{EL}(\theta_0)$ is the same as that of $W_p(\theta_0)$ under a correct parametric model, such asymptotic results are typically less useful in the nonparametric setting. This suggests that the bootstrap be used to calibrate empirical likelihood, by using quantiles of bootstrap replicates of $W_{EL}(\theta_0)$, i.e. quantiles of $W^*_{EL}(t)$. This idea is outlined below.

Example 10.1 (Air-conditioning data) We consider the empirical likelihood for the mean of the larger set of air-conditioning data in Table 5.6; $n = 24$ and $\bar y = 64.125$.


[Figure 10.1: Likelihood and log likelihoods for the mean of the air-conditioning data: empirical (dots), exponential (dashes), and gamma profile (solid). Values of $\theta$ whose log likelihood lies above the horizontal dotted line in the right panel are contained in an asymptotic 95% confidence set for the true mean.]

The mean is $\theta = \int y\, dF(y)$, which equals $\sum_j p_j y_j$ for the multinomial distribution that puts masses $p_j$ on the $y_j$. For a specified value of $\theta$, finding (10.3) is equivalent to maximizing $\sum \log p_j$ with respect to the $p_j$, subject to the constraints that $\sum p_j = 1$ and $\sum p_j y_j = \theta$. Use of Lagrange multipliers gives $p_j \propto \{1 + \eta_\theta(y_j - \theta)\}^{-1}$, where the Lagrange multiplier $\eta_\theta$ is determined by $\theta$ and satisfies the equation

$$\sum_{j=1}^n \frac{y_j - \theta}{1 + \eta_\theta(y_j - \theta)} = 0. \qquad (10.4)$$

Thus the log empirical likelihood, normalized to have maximum zero, is

$$\ell_{EL}(\theta) = -\sum_{j=1}^n \log\{1 + \eta_\theta(y_j - \theta)\}. \qquad (10.5)$$

This is maximized at the sample average $\hat\theta = \bar y$, where $\eta_{\bar y} = 0$ and $p_j = n^{-1}$. It is undefined outside $(\min y_j, \max y_j)$, because no multinomial distribution on the $y_j$ can have mean outside this interval.

Figure 10.1 shows $L_{EL}(\theta)$, which is calculated by successive solution of (10.4) to yield $\eta_\theta$ at values of $\theta$ small steps apart. The exponential likelihood and gamma profile likelihood for the mean are also shown. As we should expect, the gamma profile likelihood is always higher than the exponential likelihood, which corresponds to the gamma likelihood but with shape parameter $\kappa = 1$.

Both parametric likelihoods are wider than the empirical likelihood. Direct comparison between parametric and empirical likelihoods is misleading, however, since they are based on different models, and here and in later figures we give the gamma likelihood purely as a visual reference.


[Figure 10.2: Simulated empirical likelihood ratio statistics (left panel) and gamma profile likelihood ratio statistics (right panel) for exponential samples of size 24, plotted against chi-squared quantiles. The dotted line corresponds to the theoretical $\chi^2_1$ approximation.]

The circumstances in which empirical and parametric likelihoods are close are discussed in Problem 10.3.
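To make the computation concrete, here is a minimal S-style sketch of the successive solution of (10.4) for a data vector y. The function name el.loglik is ours rather than part of the bootstrap library (which provides EL.profile for this purpose), and R-style lexical scoping is assumed:

  el.loglik <- function(theta, y)
  {  # log empirical likelihood (10.5) for the mean, normalized to maximum zero
     a <- y - theta
     if (min(a) >= 0 || max(a) <= 0) return(-Inf) # theta outside (min y, max y)
     score <- function(eta) sum(a/(1 + eta*a))    # left side of (10.4)
     eps <- 1e-8                                  # keep all 1 + eta*a_j positive
     eta <- uniroot(score, c(-1/max(a) + eps, -1/min(a) - eps))$root
     -sum(log(1 + eta*a)) }
  theta <- seq(45, 90, length=50)                 # grid of candidate means
  ell <- sapply(theta, el.loglik, y=y)            # trace out the curve

Reading off where this curve crosses $-\frac{1}{2}c_{1,0.95}$ then gives the likelihood ratio interval quoted below.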

The endpoints of an approximate 95% confidence interval for $\theta$ are obtained by reading off where $\ell_{EL}(\theta) = -\frac{1}{2}c_{1,0.95}$, where $c_{d,\alpha}$ is the $\alpha$ quantile of the chi-squared distribution with $d$ degrees of freedom. The interval is $(43.3, 92.3)$, which compares well with the nonparametric $BC_a$ interval of $(42.4, 93.2)$. The likelihood ratio intervals for the exponential and gamma models are $(44.1, 98.4)$ and $(44.0, 98.6)$.

Figure 10.2 shows the empirical likelihood and gamma profile likelihood ratio statistics for 500 exponential samples of size 24. Though good for the parametric statistic, the chi-squared approximation is poor for $W_{EL}$, whose estimated 95% quantile is 5.92 compared to the $\chi^2_1$ quantile of 3.84. This suggests strongly that the empirical likelihood-based confidence interval given above is too narrow. However, the simulations are only relevant when the data are exponential, in which case we would not be concerned with empirical likelihood.

We can use the bootstrap to estimate quantiles for $W_{EL}(\theta_0)$, by setting $\theta_0 = \bar y$ and then calculating $W^*_{EL}(\theta_0)$ for bootstrap samples from the original data. The resulting Q-Q plot is less extreme than the left panel of Figure 10.2, with a 95% quantile estimate of 4.08 based on 999 bootstrap samples; the corresponding empirical likelihood ratio interval is $(42.8, 93.3)$. With a sample of size 12, 41 of the 999 simulations gave infinite values of $W^*_{EL}(\theta_0)$ because $\bar y$ did not lie within the limits $(\min y^*_j, \max y^*_j)$ of the bootstrap sample. With a sample of size 24, this problem did not arise. ■
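A hedged sketch of this calibration, reusing el.loglik from the sketch above: since each resample's empirical log likelihood is normalized to have maximum zero, $W^*_{EL}(\theta_0)$ is simply $-2\ell^*_{EL}(\theta_0)$.

  Wstar <- numeric(999)
  for (r in 1:999)
     Wstar[r] <- -2*el.loglik(mean(y), sample(y, replace=T))
  quantile(Wstar[is.finite(Wstar)], 0.95)  # infinite values occur when ybar
                                           # falls outside the resample range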


Vector parameter

In principle, empirical likelihood is straightforward to construct when $\theta$ has dimension $d < n - 1$. Suppose that $\theta = (\theta_1, \ldots, \theta_d)^T$ is determined implicitly as the root of the simultaneous equations

$$\int u_i(\theta; y)\, dF(y) = 0, \qquad i = 1, \ldots, d,$$

where $u(\theta; y)$ is a $d \times 1$ vector whose $i$th element is $u_i(\theta; y)$. Then the estimate $\hat\theta$ is the solution to the $d$ estimating equations

$$\sum_{j=1}^n u(\hat\theta; y_j) = 0. \qquad (10.6)$$

An extension of the argument in Example 10.1, involving the vector of Lagrange multipliers $\eta_\theta = (\eta_{\theta 1}, \ldots, \eta_{\theta d})^T$, shows that the log empirical likelihood is

$$\ell_{EL}(\theta) = -\sum_{j=1}^n \log\{1 + \eta_\theta^T u_j(\theta)\}, \qquad (10.7)$$

where $u_j(\theta) = u(\theta; y_j)$. The value of $\eta_\theta$ is determined by $\theta$ through the simultaneous equations

$$\sum_{j=1}^n \frac{u_j(\theta)}{1 + \eta_\theta^T u_j(\theta)} = 0. \qquad (10.8)$$

The simplest approximate confidence region for the true $\theta$ is the set of values such that $W_{EL}(\theta) \le c_{d,1-\alpha}$, but in small samples it will again be preferable to replace the $\chi^2_d$ quantile by its bootstrap estimate.
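A sketch of the computation at a fixed $\theta$ follows; it exploits the fact that the solution of (10.8) maximizes the concave function $\sum_j \log\{1 + \eta^T u_j(\theta)\}$, as in Practical 10.2. The function name W.el is ours, nlmin is the Splus minimizer used in that practical, and R-style scoping is assumed (under Splus 3.3 the assign(..., frame=1) device of Practical 10.2 is needed):

  W.el <- function(u)
  {  # u is the n x d matrix with rows u(theta; y_j)
     neg.f <- function(eta) -sum(log(1 + u %*% eta))
     eta <- nlmin(neg.f, rep(0.001, ncol(u)))$x   # solves (10.8)
     2*sum(log(1 + u %*% eta)) }                  # = W_EL(theta), since the
                                                  # maximized log EL is zero
  # (a careful version would guard against 1 + u %*% eta <= 0)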

10.2.2 Empirical exponential family likelihoods

Another data-based multinomial likelihood can be based on an empirical exponential family construction. Suppose that $\hat\theta_1, \ldots, \hat\theta_d$ are defined as the solutions to the equations (10.6). Then rather than putting probability $n^{-1}\{1 + \eta_\theta^T u_j(\theta)\}^{-1}$ on $y_j$, corresponding to (10.7), we can take probabilities proportional to $\exp\{\xi_\theta^T u_j(\theta)\}$: this is the exponential tilting construction described in Example 4.16 and in Sections 5.3 and 9.4. Here $\xi_\theta = (\xi_{\theta 1}, \ldots, \xi_{\theta d})^T$ is determined by $\theta$ through

$$\sum_{j=1}^n u_j(\theta) \exp\{\xi_\theta^T u_j(\theta)\} = 0. \qquad (10.9)$$

This is analogous to (10.8), but it may be solved using a program that fits regression models for Poisson responses (Problem 10.4), which is often more convenient to deal with than the optimization problems posed


by empirical likelihood. The log likelihood obtained by integrating (10.9) is $\ell_{EEF}(\theta) = \sum_j \exp\{\xi_\theta^T u_j(\theta)\}$. This can be close to $\ell_{EL}(\theta)$, which suggests that both the corresponding log likelihood ratio statistics share the same rather slow approach to their large-sample distributions.
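A minimal sketch of the Poisson device of Problem 10.4, at a fixed $\theta$ and with u the $n \times d$ matrix whose rows are the $u_j(\theta)$ (the variable names here are ours, not the library's):

  z <- rep(0, nrow(u))                    # all-zero Poisson responses
  xi <- coef(glm(z ~ u - 1, poisson))     # fitted coefficients solve (10.9)
  ell.eef <- sum(exp(u %*% xi))           # log likelihood value at theta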

In addition to likelihood ratio statistics from empirical exponential families and empirical likelihood, many other related statistics can be defined. For example, we can regard $\xi_\theta$ as the parameter in a Poisson regression model and construct a quadratic form

$$Q_{EEF}(\theta) = \Bigl\{\sum_{j=1}^n u_j(\theta)\Bigr\}^T \Bigl\{\sum_{j=1}^n u_j(\theta)\, u_j(\theta)^T\Bigr\}^{-1} \Bigl\{\sum_{j=1}^n u_j(\theta)\Bigr\} \qquad (10.10)$$

based on the score statistic that tests the hypothesis $\xi_\theta = 0$. There is a close parallel between $Q_{EEF}(\theta)$ and the quadratic forms used to set confidence regions in Section 5.8, but the nonlinear relationship between $\theta$ and $Q_{EEF}(\theta)$ means that the contours of (10.10) need not be elliptical. As discussed there, for example, theory suggests that when the true value of $\theta$ is $\theta_0$, $Q_{EEF}(\theta_0)$ has a large-sample $\chi^2_d$ distribution. Thus an approximate $1 - \alpha$ confidence region for $\theta$ is the set of values of $\theta$ for which $Q_{EEF}(\theta)$ does not exceed $c_{d,1-\alpha}$. And as there, it is generally better to use bootstrap estimates of the quantiles of $Q_{EEF}(\theta)$.
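As a sketch, the quadratic form (10.10) needs only the matrix u of values $u_j(\theta)$ (again our notation, not the library's):

  Q.eef <- function(u)
  {  s <- apply(u, 2, sum)                # sum of u_j(theta)
     V <- t(u) %*% u                      # sum of u_j(theta) u_j(theta)^T
     as.vector(t(s) %*% solve(V, s)) }    # the score quadratic form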

Example 10.2 (Laterite data) We consider again setting a confidence region based on the data in Example 5.15. Recall that the quantity of interest is the mean polar axis,

$$a(\theta, \phi) = (\cos\theta \cos\phi, \cos\theta \sin\phi, \sin\theta)^T,$$

which is the axis given by the eigenvector corresponding to the largest eigenvalue of $E(YY^T)$. The data consist of positions on the lower half-sphere, or equivalently the sample values of $a(\theta, \phi)$, which we denote by $y_j$, $j = 1, \ldots, n$.

In order to set an empirical likelihood confidence region for the mean polar axis, or equivalently for the spherical polar coordinates $(\theta, \phi)$, we let

$$b(\theta, \phi) = (\sin\theta \cos\phi, \sin\theta \sin\phi, -\cos\theta)^T, \qquad c(\theta, \phi) = (-\sin\phi, \cos\phi, 0)^T$$

denote the unit vectors orthogonal to $a(\theta, \phi)$. Then since the eigenvectors of $E(YY^T)$ may be taken to be orthogonal, the population values of $(\theta, \phi)$ satisfy simultaneously the equations

$$b(\theta, \phi)^T E(YY^T)\, a(\theta, \phi) = 0, \qquad c(\theta, \phi)^T E(YY^T)\, a(\theta, \phi) = 0,$$

with sample equivalents

$$\sum_{j=1}^n b(\theta, \phi)^T y_j y_j^T a(\theta, \phi) = 0, \qquad \sum_{j=1}^n c(\theta, \phi)^T y_j y_j^T a(\theta, \phi) = 0.$$


[Figure 10.3: Contours of $W_{EL}$ (left) and $Q_{EEF}$ (right) for the mean polar axis, in the square region shown in Figure 5.10. The dashed lines show the 95% confidence regions using bootstrap quantiles. The dotted ellipse is the 95% confidence region based on a studentized statistic (Fisher, Lewis and Embleton, 1987, equation 6.9).]

In terms of the previous general discussion, we have $d = 2$ and

$$u(\theta, \phi; y_j) = \begin{pmatrix} b(\theta, \phi)^T y_j y_j^T a(\theta, \phi) \\ c(\theta, \phi)^T y_j y_j^T a(\theta, \phi) \end{pmatrix}.$$

The left panel of Figure 10.3 shows the empirical likelihood contours based on (10.7) and (10.8), in the square region shown in Figure 5.10. The corresponding contours for $Q_{EEF}(\theta)$ are shown on the right. The dashed lines show the boundaries of the 95% confidence regions for $(\theta, \phi)$ using bootstrap calibration; these differ little from those based on the asymptotic $\chi^2_2$ distribution. In each panel the dotted ellipse is a 95% confidence region based on a studentized form of the sample mean polar axis, for which the contours are ellipses. The elliptical contours are appreciably tighter than those for the likelihood-based statistics.

Table 10.1 compares theoretical and bootstrap quantiles for several likelihood-based statistics and the studentized bootstrap statistic, $Q$, for the full data and for a random subset of size 20. For the full data, the quantiles for $Q_{EEF}$ and $W_{EL}$ are close to those for the large-sample distribution. For the subset, $Q_{EEF}$ is close to its nominal distribution, but the other statistics seem considerably more variable. Except for $Q_{EEF}$, it would be misleading to rely on the asymptotic results for the subsample. ■

Theoretical work suggests that $W_{EL}$ should have better properties than statistics such as $W_{EEF}$ or $Q_{EEF}$, but since simulations do not always confirm this, bootstrap quantiles should generally be used to set the limits of confidence regions from multinomial-based likelihoods.


Table 10.1 Bootstrap $p$ quantiles of likelihood-based statistics for the mean polar axis data.

                           Full data, n = 50            Subset, n = 20
  p      chi^2_2     Q     W_EL   W_EEF  Q_EEF     Q     W_EL   W_EEF  Q_EEF
  0.80     3.22     3.23   3.40   3.37   3.15     3.67   3.70   3.61   3.15
  0.90     4.61     4.77   4.81   5.05   4.69     5.39   5.66   5.36   4.45
  0.95     5.99     6.08   6.18   6.94   6.43     7.17   7.99  10.82   7.03

10.3 Bootstrap Likelihood

Basic idea

Suppose for simplicity that our data $y_1, \ldots, y_n$ form a homogeneous random sample for which statistic $T$ takes value $t$. If the data were governed by a parametric model under which $T$ had the density $f_T(t; \theta)$, then a partial likelihood for $\theta$ based on $T$ would be $f_T(t; \theta)$ regarded as a function of $\theta$. In the absence of a parametric model, we may estimate the density of $T$ at $t$, for different values of $\theta$, by means of a nonparametric double bootstrap.

To be specific, suppose that we generate a first-level bootstrap sample $y^*_1, \ldots, y^*_n$ from $\hat F$, with corresponding estimator value $t^*$. This bootstrap sample is now considered as a population whose parameter value is $t^*$; the empirical distribution of $y^*_1, \ldots, y^*_n$ is the nonparametric analogue of a parametric model with $\theta = t^*$. We then generate $M$ second-level bootstrap samples by sampling from our first-level sample, and calculate the corresponding values of $T$, namely $t^{**}_1, \ldots, t^{**}_M$. Kernel density estimation based on these second-level values provides an approximate density for $T^{**}$, and by analogy with parametric partial likelihood we take this density at $t^{**} = t$ to be the value of a nonparametric partial likelihood at $\theta = t^*$. If the density estimate uses kernel $w(\cdot)$ with bandwidth $h$, then this leads to the bootstrap likelihood value at $\theta = t^*$ given by

$$L(t^*) = \hat f_{T^{**}}(t \mid t^*) = \frac{1}{Mh} \sum_{m=1}^M w\Bigl(\frac{t - t^{**}_m}{h}\Bigr). \qquad (10.11)$$

On repeating this procedure for $R$ different first-level bootstrap samples, we obtain $R$ approximate likelihood values $L(t^*_r)$, $r = 1, \ldots, R$, from which a smooth likelihood curve $L_B(\theta)$ can be produced by nonparametric smoothing.
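The following is a direct, if slow, S-style sketch of this double bootstrap for the mean of a data vector y; the name boot.lik, and the default choices of R, M and h, are ours:

  boot.lik <- function(y, R=200, M=500, h=0.5*sqrt(var(y)/length(y)))
  {  t0 <- mean(y)
     tstar <- L <- numeric(R)
     for (r in 1:R) {
        ystar <- sample(y, replace=T)             # first-level sample
        tstar[r] <- mean(ystar)
        tss <- numeric(M)                         # second-level values t**
        for (m in 1:M) tss[m] <- mean(sample(ystar, replace=T))
        L[r] <- mean(dnorm((t0 - tss)/h))/h }     # kernel estimate (10.11)
     list(t=tstar, L=L) }                         # smooth L against t for L_B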

Computational improvements

There are various ways to reduce the large amount of computation needed to obtain a smooth curve. One, which was used earlier in Section 3.9.2, is to generate second-level samples from smoothed versions of the first-level samples. As before, probability distributions on the values $y_1, \ldots, y_n$ are denoted


by vectors $p = (p_1, \ldots, p_n)$, and parameter values are expressed as $t(p)$; recall that $\hat p = (n^{-1}, \ldots, n^{-1})$ and $t = t(\hat p)$. The $r$th first-level bootstrap sample gives statistic value $t^*_r$, and the data value $y_j$ occurs with frequency $f^*_{rj} = n p^*_{rj}$, say. In the bootstrap likelihood calculation this bootstrap sample is considered as a population with probability distribution $p^*_r = (p^*_{r1}, \ldots, p^*_{rn})$ on the data values, and $t^*_r = t(p^*_r)$ is considered as the $\theta$-value for this population.

In order to obtain populations which vary smoothly with $\theta$, we apply kernel smoothing to the $p^*_r$, as in Section 3.9.2. Thus for target parameter value $\theta^0$ we define the vector $p^*(\theta^0)$ of probabilities

$$p^*_j(\theta^0) = \frac{\sum_{r=1}^R w\{(\theta^0 - t^*_r)/\varepsilon\}\, p^*_{rj}}{\sum_{r=1}^R w\{(\theta^0 - t^*_r)/\varepsilon\}}, \qquad j = 1, \ldots, n, \qquad (10.12)$$

where typically $w(\cdot)$ is the standard normal density and $\varepsilon = v_L^{1/2}$; as usual $v_L$ is the nonparametric delta method variance estimate for $t$. The distribution $p^*(\theta^0)$ will have parameter value not $\theta^0$ but $\theta = t(p^*(\theta^0))$. With the understanding that $\theta$ is defined in this way, we shall for simplicity write $p^*(\theta)$ rather than $p^*(\theta^0)$. For a fixed collection of $R$ first-level samples and bandwidth $\varepsilon > 0$, the probability vectors $p^*(\theta)$ change gradually as $\theta$ varies over its range of interest.

Second-level bootstrap sampling now uses vectors $p^*(\theta)$ as sampling distributions on the data values, in place of the $p^*_r$. The second-level sample values $t^{**}$ are then used in (10.11) to obtain $L_B(\theta)$. Repeating this calculation for, say, 100 values of $\theta$ in the range $t \pm 4 v_L^{1/2}$, followed by smooth interpolation, should give a good result.

Experience suggests that the value $\varepsilon = v_L^{1/2}$ is safe to use in (10.12) if the $t^*_r$ are roughly equally spaced, which can be arranged by weighted first-level sampling, as outlined in Problem 10.6.

A way to reduce further the amount of calculation is to use recycling, as described in Section 9.4.4. Rather than generate second-level samples from each $p^*(\theta)$ of interest, one set of $M$ samples can be generated using a distribution $\tilde p$ on the data values, and the associated values $t^{**}_1, \ldots, t^{**}_M$ calculated. Then, following the general re-weighting method (9.24), the likelihood values are calculated as

$$L_B(\theta) = \frac{1}{Mh} \sum_{m=1}^M w\Bigl(\frac{t - t^{**}_m}{h}\Bigr) \prod_{j=1}^n \Bigl\{\frac{p^*_j(\theta)}{\tilde p_j}\Bigr\}^{f^{**}_{mj}}, \qquad (10.13)$$

where $f^{**}_{mj}$ is the frequency of the $j$th case in the $m$th second-level bootstrap sample. One simple choice for $\tilde p$ is the EDF $\hat p$. In special cases it will be possible to replace the second level of sampling by use of the saddlepoint approximation method of Section 9.5. This would give an accurate and smooth approximation to the density of $T^{**}$ for sampling from each $p^*(\theta)$.
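The smoothing step (10.12) itself is a one-line weighted average; in outline, with tstar holding the $R$ first-level values $t^*_r$ and pstar the $R \times n$ matrix of probabilities $p^*_{rj}$ (our notation):

  p.smooth <- function(theta0, tstar, pstar, eps)
  {  wt <- dnorm((theta0 - tstar)/eps)       # normal kernel weights
     apply(wt * pstar, 2, sum)/sum(wt) }     # weighted average of the rows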

Example 10.3 (Air-conditioning data) We apply the ideas outlined above to the data from Example 10.1.


[Figure 10.4: Bootstrap likelihood for the mean of the air-conditioning data. Left panel: bootstrap likelihood values obtained by saddlepoint approximation for 200 random samples, with a smooth curve fitted to values obtained by smoothing frequencies from 1000 bootstrap samples. Right panel: gamma profile log likelihood (solid) and bootstrap log likelihood (dots).]

The solid points in the left panel of Figure 10.4 are bootstrap likelihood values for the mean $\theta$ for 200 resamples, obtained by saddlepoint approximation. This replaces the kernel density estimate (10.11) and avoids the second level of resampling, but does not remove the variation in estimated likelihood values for different bootstrap samples with similar values of $t^*_r$. A locally quadratic nonparametric smoother (on the log likelihood scale) could be used to produce a smooth likelihood curve from the values of $L(t^*_r)$, but another approach is better, as we now describe.

The solid line in the left panel of Figure 10.4 interpolates values obtained by applying the saddlepoint approximation using probabilities (10.12) at a few values of $\theta^0$. Here the values of $t^*_r$ are generated at random, and we have taken $\varepsilon = 0.5 v_L^{1/2}$; the results depend little on the value of $\varepsilon$. The log bootstrap likelihood is very close to the log empirical likelihood, with 95% confidence interval $(43.8, 92.1)$. ■

Bootstrap likelihood is based purely on resampling and smoothing, which is a potential advantage over empirical likelihood. However, in its simplest form it is more computer-intensive. This precludes bootstrapping to estimate quantiles of bootstrap likelihood ratio statistics, which would involve three levels of nested resampling.

10.4 Likelihood Based on Confidence Sets

In certain circumstances it is possible to view confidence intervals as being approximately posterior probability sets, in the Bayesian sense. This encourages the idea of defining a confidence distribution for $\theta$ from the set of confidence


limits, and then taking the PDF of this distribution as a likelihood function. That is, if we define the confidence distribution function $C$ by $C(\hat\theta_\alpha) = \alpha$, where $\hat\theta_\alpha$ is the $\alpha$ confidence limit, then the associated likelihood would be the "density" $dC(\theta)/d\theta$. Leaving the philosophical arguments aside, we look briefly at where this idea leads in the context of nonparametric bootstrap methods.

10.4.1 Likelihood from pivots

Suppose that $Z(\theta) = z(\theta, F)$ is a pivot, with CDF $K(z)$ not depending on the true distribution $F$, and that $z(\theta)$ is a monotone function of $\theta$. Then the confidence distribution based on confidence limits derived from $z$ leads to the likelihood

$$L_z(\theta) = |\dot z(\theta)|\, k\{z(\theta)\}, \qquad (10.14)$$

where $k(z) = dK(z)/dz$ and $\dot z(\theta) = dz(\theta)/d\theta$. Since $k$ will be unknown in practice, it must be estimated.

In fact this definition of likelihood has a hidden defect. If the identification of confidence distribution with posterior distribution is accurate, as it is to a good approximation in many cases, then the effect of some prior distribution has been ignored in (10.14). But this effect can be removed by a simple device. Consider an imaginary experiment in which a random sample of size $2n$ is obtained, with outcome exactly two copies of the data $y$ that we have. Then the likelihood would be the square of the likelihood $L_z(\theta \mid y)$ we are trying to calculate. The ratio of the corresponding posterior densities would be simply $L_z(\theta \mid y)$. This argument suggests that we apply the confidence density (10.14) twice, first with data $y$ to give $L^\dagger_n(\theta)$, say, and second with data $(y, y)$ to give $L^\dagger_{2n}(\theta)$. The ratio $L^\dagger_{2n}(\theta)/L^\dagger_n(\theta)$ will then be a likelihood with the unknown prior effect removed. In an explicit notation, this definition can be written

$$L_z(\theta) = \frac{L^\dagger_{2n}(\theta)}{L^\dagger_n(\theta)} = \frac{|\dot z_{2n}(\theta)|\, k_{2n}\{z_{2n}(\theta)\}}{|\dot z_n(\theta)|\, k_n\{z_n(\theta)\}}, \qquad (10.15)$$

where the subscripts indicate sample size. Note that $F$ and $t$ are the same for both sample sizes, but quantities such as variance estimates will depend upon sample size. Note also that the implied prior is estimated by $L^\dagger_n(\theta)^2 / L^\dagger_{2n}(\theta)$.

Example 10.4 (Exponential mean) If data $y_1, \ldots, y_n$ are sampled from an exponential distribution with mean $\theta$, then a suitable choice for $z(\theta, F)$ is $\bar y/\theta$. The gamma distribution for $\bar Y$ can be used to check that the original definition (10.14) gives $L^\dagger_n(\theta) \propto \theta^{-n-1}\exp(-n\bar y/\theta)$, whereas the true likelihood is $\theta^{-n}\exp(-n\bar y/\theta)$. The true result is obtained exactly using (10.15). The implied prior is $\pi(\theta) \propto \theta^{-1}$, for $\theta > 0$. ■
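To check the first claim, note that $Z = \bar Y/\theta$ has the gamma density $k(z) = n^n z^{n-1} e^{-nz}/\Gamma(n)$, so (10.14) gives

$$L^\dagger_n(\theta) = \Bigl|\frac{dz}{d\theta}\Bigr|\, k\{z(\theta)\} \propto \frac{\bar y}{\theta^2}\Bigl(\frac{\bar y}{\theta}\Bigr)^{n-1} e^{-n\bar y/\theta} \propto \theta^{-n-1}\exp(-n\bar y/\theta),$$

and the same calculation with $n$ replaced by $2n$ gives $L^\dagger_{2n}(\theta) \propto \theta^{-2n-1}\exp(-2n\bar y/\theta)$; their ratio is the true likelihood, and $L^\dagger_n(\theta)^2/L^\dagger_{2n}(\theta) \propto \theta^{-1}$ recovers the implied prior.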


In practice the distribution of $Z$ must be estimated, in general by bootstrap


sampling, so the densities $k_n$ and $k_{2n}$ in (10.15) must be estimated. To be specific, consider the particular case of the studentized quantity $z(\theta) = (t - \theta)/v_L^{1/2}$. Apart from a constant multiplier, the definition (10.15) gives

$$L_z(\theta) = k_{2n}\{(t - \theta)/v_{2n,L}^{1/2}\} \big/ k_n\{(t - \theta)/v_{n,L}^{1/2}\}, \qquad (10.16)$$

where $v_{n,L} = v_L$ and $v_{2n,L} = \frac{1}{2}v_L$, and we have used the fact that $t$ is the estimate for both sample sizes. The densities $k_n$ and $k_{2n}$ are approximated using bootstrap sample values as follows. First $R$ nonparametric samples of size $n$ are drawn from $\hat F$ and corresponding values of $z^*_n = (t^* - t)/v_{n,L}^{*1/2}$ calculated. Then $R$ samples of size $2n$ are drawn from $\hat F$ and values of $z^*_{2n} = (t^*_{2n} - t)/v_{2n,L}^{*1/2}$ calculated. Next kernel estimates for $k_n$ and $k_{2n}$, with bandwidths $h_n$ and $h_{2n}$ respectively, are obtained and substituted in (10.16). For example,

$$\hat k_n(z) = \frac{1}{R h_n} \sum_{r=1}^R w\Bigl(\frac{z - z^*_{n,r}}{h_n}\Bigr). \qquad (10.17)$$

In practice these values can be computed via spline smoothing from a dense set of values of the kernel density estimates $\hat k_n(z)$.
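For the mean of a data vector y the whole construction can be sketched as follows; the function name is ours, $v_L$ is approximated by var(y)/n, and the same bandwidth h is used for both densities so that the normalizing constants cancel in the ratio:

  pivot.lik <- function(y, theta, R=999, h=1)
  {  n <- length(y); t0 <- mean(y); v <- var(y)/n
     zn <- z2n <- numeric(R)
     for (r in 1:R) {
        ys <- sample(y, n, replace=T)              # size n resample
        zn[r] <- (mean(ys) - t0)/sqrt(var(ys)/n)
        ys <- sample(y, 2*n, replace=T)            # size 2n resample
        z2n[r] <- (mean(ys) - t0)/sqrt(var(ys)/(2*n)) }
     num <- den <- theta
     for (i in 1:length(theta)) {                  # kernel estimates as in (10.17)
        num[i] <- mean(dnorm(((t0 - theta[i])/sqrt(v/2) - z2n)/h))
        den[i] <- mean(dnorm(((t0 - theta[i])/sqrt(v) - zn)/h)) }
     num/den }                                     # proportional to (10.16)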

There are difficulties with this method. First, just as with bootstrap likelihood, it is necessary to use a large number of simulations $R$. A second difficulty is that of ascertaining whether or not the chosen $Z$ is a pivot, or else what prior transformation of $T$ could be used to make $Z$ pivotal; see Section 5.2.2. This is especially true if we extend (10.16) to vector $\theta$, which is theoretically possible. Note that if the studentized bootstrap is applied to a transformation of $t$ rather than $t$ itself, then the factor $|\dot z(\theta)|$ in (10.14) can be ignored when applying (10.16).

10.4.2 Implied likelihood

In principle any bootstrap confidence limit method can be turned into a likelihood method via the confidence distribution, but it makes sense to restrict attention to the more accurate methods such as the studentized bootstrap used above. Section 5.4 discusses the underlying theory and introduces one other method, the ABC method, which is particularly easy to use as the basis for a likelihood because no simulation is required.

First, a confidence density is obtained via the quadratic approximation (5.42), with $a$, $b$ and $c$ as defined for the nonparametric ABC method in (5.49). Then, using the argument that led to (10.15), it is possible to show that the induced likelihood function is

$$L_{ABC}(\theta) = \exp\{-\tfrac{1}{2} u^2(\theta)\}, \qquad (10.18)$$


where

$$u(\theta) = \frac{2 r(\theta)}{1 + 2 a r(\theta) + \{1 + 4 a r(\theta)\}^{1/2}}, \qquad r(\theta) = \frac{2 z(\theta)}{1 + \{1 - 4 c z(\theta)\}^{1/2}},$$

with $z(\theta) = (t - \theta)/v_L^{1/2}$ as before. This is called the implied likelihood. Based on the discussion in Section 5.4, one would expect results similar to those from applying (10.16).
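Given the ABC constants, (10.18) is trivial to evaluate; a sketch, with $a$ and $c$ as in (5.49) and $t$, $v_L$ the estimate and its variance approximation (the function name is ours):

  L.abc <- function(theta, t, vL, a, c)
  {  z <- (t - theta)/sqrt(vL)
     r <- 2*z/(1 + sqrt(1 - 4*c*z))
     u <- 2*r/(1 + 2*a*r + sqrt(1 + 4*a*r))
     exp(-u^2/2) }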

A further modification is to multiply $L_{ABC}(\theta)$ by $\exp\{(c v_L^{1/2} - b)\theta/v_L\}$, with $b$ the bias estimate defined in (5.49). The effect of this modification is to make the likelihood even more compatible with the Bayesian interpretation, somewhat akin to the adjusted profile likelihood (10.2).

Example 10.5 (Air-conditioning data) Figure 10.5 shows confidence likelihoods for the two sets of air-conditioning data in Table 5.6, samples of size 12 and 24 respectively. The implied likelihoods $L_{ABC}(\theta)$ are similar to the empirical likelihoods for these data. The pivotal likelihood $L_z(\theta)$, calculated from $R = 9999$ samples with bandwidths equal to 1.0 in (10.17), is clearly quite unstable for the smaller sample size. This also occurred with bootstrap likelihood for these data and seems to be due to the discreteness of the simulations with so small a sample. ■

10.5 Bayesian Bootstraps

All the inferences we have described thus far have been frequentist: we have summarized uncertainty in terms of confidence regions for the unknown parameter $\theta$ of interest, based on repeated sampling from a distribution $F$. A

[Figure 10.5: Gamma profile likelihood (solid), implied likelihood $L_{ABC}$ (dashes) and pivot-based likelihood (dots) for the air-conditioning datasets of size 12 (left panel) and size 24 (right panel). The pivot-based likelihood uses $R = 9999$ simulations and bandwidths 1.0.]


quite different approach is possible if prior information is available regarding $F$. Suppose that the only possible values of $Y$ are known to be $u_1, \ldots, u_N$, and that these arise with unknown probabilities $p_1, \ldots, p_N$, so that

$$\Pr(Y = u_j \mid p_1, \ldots, p_N) = p_j, \qquad \sum_{j=1}^N p_j = 1.$$

If our data consist of the random sample $y_1, \ldots, y_n$, and $f_j$ counts how many $y_i$ equal $u_j$, the probability of the observed data given the values of the $p_j$ is proportional to $\prod_{j=1}^N p_j^{f_j}$. If the prior information regarding the $p_j$ is summarized in the prior density $\pi(p_1, \ldots, p_N)$, the joint posterior density of the $p_j$ given the data is proportional to

$$\pi(p_1, \ldots, p_N) \prod_{j=1}^N p_j^{f_j},$$

and this induces a posterior density for $\theta$. Its calculation is particularly straightforward when $\pi$ is the Dirichlet density, in which case the prior and posterior densities are respectively proportional to

$$\prod_{j=1}^N p_j^{a_j}, \qquad \prod_{j=1}^N p_j^{a_j + f_j};$$

the posterior density is Dirichlet also. Bayesian bootstrap samples and the corresponding values of $\theta$ are generated from the joint posterior density for the $p_j$, as follows.

Algorithm 10.1 (Bayesian bootstrap)

For $r = 1, \ldots, R$:

1 Let $G_1, \ldots, G_N$ be independent gamma variables with shape parameters $a_j + f_j + 1$ and unit scale parameters, and for $j = 1, \ldots, N$ set $P^*_j = G_j/(G_1 + \cdots + G_N)$.

2 Let $\theta^*_r = t(F^*_r)$, where $F^*_r = (P^*_1, \ldots, P^*_N)$.

Estimate the posterior density for $\theta$ by kernel smoothing of $\theta^*_1, \ldots, \theta^*_R$. •

In practice with continuous data we have $f_j = 1$. The simplest version of the simulation puts $a_j = -1$, corresponding to an improper prior distribution with support on $y_1, \ldots, y_n$; the $G_j$ are then exponential. Some properties of this procedure are outlined in Problem 10.10.
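In this simplest continuous-data case ($f_j = 1$, common $a_j = a$) Algorithm 10.1 reduces to a few lines; the statistic must be supplied in weighted form, as for the stype="w" functions of Chapter 11, and the name bayes.boot is ours:

  bayes.boot <- function(d, t.fun, R=999, a=-1)
  {  n <- nrow(d); th <- numeric(R)
     for (r in 1:R) {
        G <- rgamma(n, shape=a+2)        # a = -1 gives exponential G_j
        th[r] <- t.fun(d, G/sum(G)) }    # theta from Dirichlet weights
     th }                                # kernel-smooth for the posterior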

Example 10.6 (City population data) In the city population data of Example 2.8, for which $n = 10$, the parameter $\theta = t(F)$ and the $r$th simulated posterior value $\theta^*_r$ are

$$\theta = \frac{\sum_j p_j x_j}{\sum_j p_j u_j}, \qquad \theta^*_r = \frac{\sum_j P^*_{rj} x_j}{\sum_j P^*_{rj} u_j}.$$


o‘CQ_

Figure 10.6 Bayesian bootstrap applied to city population data, with n = 10. The left panel shows posterior densities for ratio 6 estimated from 999 Bayesian bootstrap simulations, with a = —1, 2, 5, 10; the densities are more peaked as a increases. The right panel shows the corresponding prior densities for 0.

theta theta

The left panel of Figure 10.6 shows kernel density estimates of the posterior density of $\theta$ based on $R = 999$ simulations with all the $a_j$ equal to $a = -1$, 2, 5, and 10. The increasingly strong prior information results in posterior densities that are more and more sharply peaked.

The right panel shows the implied priors on $\theta$, obtained using the data doubling device from Section 10.4. The priors seem highly informative, even when $a = -1$. ■

The primary use of the Bayesian bootstrap is likely to be for imputation when data are missing, rather than in inference for $\theta$ per se. There are theoretical advantages to such weighted bootstraps, in which the probabilities $P^*_j$ vary smoothly, but as yet they have been little used in applications.

10.6 Bibliographic Notes

Likelihood inference is the core of parametric statistics. Many elementary textbooks contain some discussion of large-sample likelihood asymptotics, while adjusted likelihoods and higher-order approximations are described by Barndorff-Nielsen and Cox (1994).

Empirical likelihood was defined for single samples by Owen (1988) and extended to wider classes of models in a series of papers (Owen, 1990, 1991). Qin and Lawless (1994) make theoretical connections to estimating equations, while Hall and La Scala (1990) discuss some practical issues in using empirical likelihoods. More general models to which empirical likelihood has been applied include density estimation (Hall and Owen, 1993; Chen, 1996), length-biased data (Qin, 1993), truncated data (Li, 1995), and time series (Monti,


1997). Applications to directional data are discussed by Fisher et al. (1996). Owen (1992a) reports simulations that compare the behaviour of the empirical likelihood ratio statistic with bootstrap methods for samples of size up to 20, with overall conclusions in line with those of Section 5.7: the studentized bootstrap performs best, in particular giving more accurate confidence intervals for the mean than the empirical likelihood ratio statistic, for a variety of underlying populations.

Related theoretical developments are due to DiCiccio, Hall and Romano (1991), DiCiccio and Romano (1989), and Chen and Hall (1993). From a theoretical viewpoint it is noteworthy that the empirical likelihood ratio statistic can be Bartlett-adjusted, though Corcoran, Davison and Spady (1996) question the practical relevance of this. Hall (1990) makes theoretical comparisons between empirical likelihood and likelihood based on studentized pivots.

Empirical likelihood has roots in certain problems in survival analysis, notably using the product-limit estimator to set confidence intervals for a survival probability. Related methods are discussed by Murphy (1995). See also Mykland (1995), who introduces the idea of dual likelihood, which treats the Lagrange multiplier in (10.7) as a parameter. Except in large samples, it seems likely that our caveats about asymptotic results apply here also.

Empirical exponential families have been discussed in Section 10.10 of Efron (1982) and by DiCiccio and Romano (1990), among others; see also Corcoran, Davison and Spady (1996), who make comparisons with empirical likelihood statistics. Jing and Wood (1996) show that empirical exponential family likelihood is not Bartlett adjustable. A univariate version of the statistic $Q_{EEF}$ in Section 10.2.2 is discussed by Lloyd (1994) in the context of M-estimation.

Bootstrap likelihood was introduced by Davison, Hinkley and Worton (1992), who discuss its relationship to empirical likelihood, while a later paper (Davison, Hinkley and Worton, 1995) describes computational improvements.

Early work on the use of confidence distributions to define nonparametric likelihoods was done by Hall (1987), Boos and Monahan (1986), and Ogbonmwan and Wynn (1986). The use of confidence distributions in Section 10.4 rests in part on the similarity of confidence distributions to Bayesian posterior distributions. For related theory see Welch and Peers (1963), Stein (1985) and Berger and Bernardo (1992). Efron (1993) discusses the likelihood derived from ABC confidence limits, shows a strong connection with profile likelihood and related likelihoods, and gives several applications; see also Chapter 24 of Efron and Tibshirani (1993).

The Bayesian bootstrap was introduced by Rubin (1981), and subsequently used by Rubin and Schenker (1986) and Rubin (1987) for multiple imputation in missing data problems. Banks (1988) has described some variants of the Bayesian bootstrap, while Newton and Raftery (1994) describe a variant which


they name the weighted likelihood bootstrap. A comprehensive theoretical discussion of weighted bootstraps is given by Barbe and Bertail (1995).

10.7 Problems

1 Consider empirical likelihood for a parameter $\theta = t(F)$ defined by an estimating equation $\int u(t; y)\, dF(y) = 0$, based on a random sample $y_1, \ldots, y_n$.
(a) Use Lagrange multipliers to maximize $\sum \log p_j$ subject to the conditions $\sum p_j = 1$ and $\sum p_j u(t; y_j) = 0$, and hence show that the log empirical likelihood is given by (10.7) with $d = 1$. Verify that the empirical likelihood is maximized at the sample EDF, when $\theta = t(\hat F)$.
(b) Suppose that $u(t; y) = y - t$ and $n = 2$, with $y_1 < y_2$. Show that $\eta_\theta$ can be written as $(\bar y - \theta)/\{(\theta - y_1)(y_2 - \theta)\}$, and sketch it as a function of $\theta$.
(Section 10.2.1)

2 Suppose that $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ are independent random samples from distributions with means $\mu$ and $\mu + \delta$. Obtain the empirical likelihood ratio statistic for $\delta$.
(Section 10.2.1)

3 (a) In (10.5), suppose that $\theta = \bar y + n^{-1/2}\sigma\varepsilon$, where $\sigma^2 = \mathrm{var}(y_j)$ and $\varepsilon$ has an asymptotic standard normal distribution. Show that $\eta_\theta \doteq -n^{-1/2}\varepsilon/\sigma$, and deduce that near $\bar y$, $\ell_{EL}(\theta) \doteq -\frac{1}{2} n (\bar y - \theta)^2/\sigma^2$.
(b) Now suppose that a single observation from $F$ has log density $\ell(\theta) = \log f(y; \theta)$ and corresponding Fisher information $i(\theta) = E\{-\ddot\ell(\theta)\}$. Use the fact that the MLE $\hat\theta$ satisfies the equation $\dot\ell(\hat\theta) = 0$ to show that near $\hat\theta$ the parametric log likelihood is roughly $\ell(\theta) \doteq -\frac{1}{2} n i(\hat\theta)(\hat\theta - \theta)^2$.
(c) By considering the double exponential density $\frac{1}{2}\exp(-|y - \theta|)$, $-\infty < y < \infty$, and an exponential family density with mean $\theta$, $a(y)\exp\{y b(\theta) - c(\theta)\}$, show that it may or may not be true that $\ell_{EL}(\theta) \doteq \ell(\theta)$.
(Section 10.2.1; DiCiccio, Hall and Romano, 1989)

4 Let $\theta$ be a scalar parameter defined by an estimating equation $\int u(\theta; y)\, dF(y) = 0$. Suppose that we wish to make likelihood inference for $\theta$ based on a random sample $y_1, \ldots, y_n$ using the empirical exponential family

$$\pi_j(\theta) = \Pr(Y = y_j) = \frac{\exp\{\xi\, u(\theta; y_j)\}}{\sum_{k=1}^n \exp\{\xi\, u(\theta; y_k)\}}, \qquad j = 1, \ldots, n,$$

where $\xi = \xi_\theta$ is determined by

$$\sum_{j=1}^n \pi_j(\theta)\, u(\theta; y_j) = 0. \qquad (10.19)$$

(a) Let $Z_1, \ldots, Z_n$ be independent Poisson variables with means $\exp(\xi u_j)$, where $u_j = u(\theta; y_j)$; we treat $\theta$ as fixed. Write down the likelihood equation for $\xi$ and show that when the observed values of the $Z_j$ all equal zero, it is equivalent to (10.19). Hence outline how software that fits generalized linear models may be used to find $\xi_\theta$.
(b) Show that the formulation in terms of Poisson variables suggests that the empirical exponential family likelihood ratio statistic is the Poisson deviance $W_{EEF}(\theta_0)$,

while the multinomial form gives $W'_{EEF}(\theta_0)$, where

$$W_{EEF}(\theta_0) = 2\sum_{j=1}^n \{1 - \exp(\xi u_j)\},$$

$$W'_{EEF}(\theta_0) = 2\Bigl[n \log\Bigl\{n^{-1}\sum_{j=1}^n \exp(\xi u_j)\Bigr\} - \xi \sum_{j=1}^n u_j\Bigr].$$

(c) Plot the log likelihood functions corresponding to $W_{EEF}$ and $W'_{EEF}$ for the data in Example 10.1; take $u_j = y_j - \theta$. Perform a small simulation study to compare the behaviour of $W_{EEF}$ and $W'_{EEF}$ when the underlying data are samples of size 24 from the exponential distribution.
(Section 10.2.2)

5 Suppose that $a = (\sin\theta, \cos\theta)^T$ is the mean direction of a distribution on the unit circle, and consider setting a nonparametric confidence set for $a$ based on a random sample of angles $\theta_1, \ldots, \theta_n$; set $y_j = (\sin\theta_j, \cos\theta_j)^T$.
(a) Show that $a$ is determined by the equation $\int y^T b\, dF(y) = 0$, where $b = (\cos\theta, -\sin\theta)^T$. Hence explain how to construct confidence sets based on statistics from empirical likelihood and from empirical exponential families.
(b) Extend the argument to data taking values on the unit sphere, with mean direction $a = (\cos\theta\cos\phi, \cos\theta\sin\phi, \sin\theta)^T$.
(c) See Practical 10.2.
(Section 10.2.2; Fisher et al., 1996)

6 Suppose that $t$ has empirical influence values $l_j$, and set

$$p_j(\theta^0) = \frac{\exp(\zeta l_j)}{\sum_{k=1}^n \exp(\zeta l_k)}, \qquad j = 1, \ldots, n, \qquad (10.20)$$

where $\zeta = (nv)^{-1}(\theta^0 - t)$ and $v = n^{-2}\sum l_j^2$.
(a) Show that $t(F_\zeta) = \theta^0$, where $F_\zeta$ denotes the CDF corresponding to (10.20). Hence describe how to space out the values $t^*$ in the first-level resampling for a bootstrap likelihood.
(b) Rather than use the tilted probabilities (10.12) to construct a bootstrap likelihood by simulation, suppose that we use those in (10.20). For a linear statistic, show that the cumulant-generating function of $T^{**}$ in sampling from (10.20) is $\lambda t + n\{K(\zeta + n^{-1}\lambda) - K(\zeta)\}$, where $K(\zeta) = \log(\sum e^{\zeta l_j})$. Deduce that the saddlepoint approximation to $f_{T^{**}|T^*}(t \mid \theta^0)$ is proportional to $\exp\{-nK(\zeta)\}$, where $\theta^0 = t + K'(\zeta)$. Hence show that for the sample average, the log likelihood at $\theta^0 = \sum y_j e^{\zeta y_j}/\sum e^{\zeta y_j}$ is $n\{\zeta t - \log(\sum e^{\zeta y_j})\}$.
(c) Extend (b) to the situation where $t$ is defined as the solution to a monotonic estimating equation.
(Section 10.3; Davison, Hinkley and Worton, 1992)

7 Consider the choice of $h$ for the raw bootstrap likelihood values (10.11), when $w(\cdot)$ is the standard normal density. As is often roughly true, suppose that $T^* \sim N(t, v)$, and that conditional on $T^* = t^*$, $T^{**} \sim N(t^*, v)$.
(a) Show that the mean and variance of the product of $v^{1/2}$ with (10.11) are $I_1$ and $M^{-1}(I_2 - I_1^2)$, where

$$I_1 = \{2\pi(1 + \gamma^2)\}^{-1/2} \exp\Bigl\{-\frac{\delta^2}{2(1 + \gamma^2)}\Bigr\}, \qquad I_2 = \{2\pi\gamma(2 + \gamma^2)^{1/2}\}^{-1} \exp\Bigl\{-\frac{\delta^2}{2 + \gamma^2}\Bigr\},$$

where $\gamma = h v^{-1/2}$ and $\delta = v^{-1/2}(t^* - t)$. Hence verify some of the values in the following table:

                                 delta = 0          delta = 1          delta = 2
  gamma                        0.2   0.4   0.6    0.2   0.4   0.6    0.2   0.4   0.6
  Density x 10^-2             39.9  39.9  39.9   24.2  24.2  24.2    5.4   5.4   5.4
  Bias x 10^-2                -0.8  -2.9  -5.7    0    -0.1  -0.5    0.3   1.2   2.5
  M x variance x 10^-2        40.4  13.4   5.6   28.3  11.2   5.7    7.5   3.8   2.6

(b) If $\gamma$ is small, show that the variance of (10.11) is roughly proportional to the square of its mean, and deduce that the variance is approximately constant on the log scale.
(c) Extend the calculations in (a) to (10.13).
(Section 10.3; Davison, Hinkley and Worton, 1992)

8 Let $y$ represent data from a parametric model $f(y; \theta)$, and suppose that $\theta$ is estimated by $t(y)$. Assuming that simulation error may be ignored, under what circumstances would the bootstrap likelihood generated by parametric simulation from $\hat f$ equal the parametric likelihood? Illustrate your answer with the $N(\theta, 1)$ distribution, taking $t$ to be (i) the sample average, (ii) the sample median.
(Section 10.3)

9 Suppose that we wish to construct an implied likelihood for a correlation coefficient $\theta$ based on its sample value $T$ by treating $Z = \frac{1}{2}\log\{(1 + T)/(1 - T)\}$ as normal with mean $g(\theta) = \frac{1}{2}\log\{(1 + \theta)/(1 - \theta)\}$ and variance $n^{-1}$. Show that the implied likelihood and implied prior are proportional to

$$\exp\bigl[-\tfrac{1}{2} n \{g(t) - g(\theta)\}^2\bigr], \qquad (1 - \theta^2)^{-1}, \quad |\theta| < 1.$$

Is the prior here proper?
(Section 10.4)

10 The Dirichlet density with parameters $(c_1, \ldots, c_n)$ is

$$\frac{\Gamma(c_1 + \cdots + c_n)}{\Gamma(c_1)\cdots\Gamma(c_n)} \prod_{j=1}^n p_j^{c_j - 1}, \qquad 0 < p_j < 1, \quad \sum_{j=1}^n p_j = 1.$$

Show that the $P_j$ have joint moments

$$E(P_j) = \frac{c_j}{s}, \qquad \mathrm{cov}(P_j, P_k) = \frac{c_j(\delta_{jk} s - c_k)}{s^2(s + 1)},$$

where $\delta_{jk} = 1$ if $j = k$ and zero otherwise, and $s = c_1 + \cdots + c_n$.
(a) Let $y_1, \ldots, y_n$ be a random sample, and consider bootstrapping its average. Show that under the Bayesian bootstrap with $a_j = a$,

$$E(P^*_j \mid y) = \frac{1}{n}, \qquad \mathrm{cov}(P^*_j, P^*_k \mid y) = \frac{n\delta_{jk} - 1}{n^2(2n + an + 1)}. \qquad (10.21)$$

Hence show that the posterior mean and variance of $\theta^* = \sum y_j P^*_j$ are $\bar y$ and $(2n + an + 1)^{-1} m_2$, where $m_2 = n^{-1}\sum(y_j - \bar y)^2$.
(b) Now consider the average $\bar Y^\dagger$ of bootstrap samples generated as follows. We generate a distribution $F^\dagger = (P^\dagger_1, \ldots, P^\dagger_n)$ on $y_1, \ldots, y_n$ under the Bayesian bootstrap,


and then make $Y^\dagger_1, \ldots, Y^\dagger_n$ by independent multinomial sampling from $F^\dagger$. Show that

$$E(\bar Y^\dagger) = \bar y, \qquad \mathrm{var}(\bar Y^\dagger) = \frac{(a + 3)\, m_2}{2n + an + 1}.$$

Are the properties of this as $n \to \infty$ and $a \to \infty$ what you would expect? How does this compare with samples generated by the usual nonparametric bootstrap?
(Section 10.5)

10.8 Practicals

1 We compare the empirical likelihoods and 95% confidence intervals for the mean of the data in Table 3.1, (a) pooling the eight series:

attach(gravity)
grav.EL <- EL.profile(g, tmin=70, tmax=85, n.t=51)
plot(grav.EL[,1], exp(grav.EL[,2]), type="l", xlab="mu",
     ylab="empirical likelihood")
lik.CI(grav.EL, lim=-0.5*qchisq(0.95,1))

and (b) treating the series as arising from separate distributions with the same mean and plotting eight individual likelihoods:

gravs.EL <- EL.profile(g[series==1], n.t=21)
plot(gravs.EL[,1], exp(gravs.EL[,2]), type="n", xlab="mu",
     ylab="empirical likelihood", xlim=range(g))
lines(gravs.EL[,1], exp(gravs.EL[,2]), lty=2)
for (s in 2:8) {
  gravs.EL <- EL.profile(g[series==s], n.t=21)
  lines(gravs.EL[,1], exp(gravs.EL[,2]), lty=2) }

Now we combine the individual likelihoods into a single likelihood by multiplying them together; we renormalize so that the product has maximum one.

lims <- matrix(NA, 8, 2)
for (s in 1:8) { x <- g[series==s]; lims[s,] <- range(x) }
mu.min <- max(lims[,1]); mu.max <- min(lims[,2])
gravs.EL <- EL.profile(g[series==1],
    tmin=mu.min, tmax=mu.max, n.t=21)
gravs.EL.L <- gravs.EL[,2]
gravs.EL.mu <- gravs.EL[,1]
for (s in 2:8)
  gravs.EL.L <- gravs.EL.L + EL.profile(g[series==s],
      tmin=mu.min, tmax=mu.max, n.t=21)[,2]
gravs.EL.L <- gravs.EL.L - max(gravs.EL.L)
lines(gravs.EL.mu, exp(gravs.EL.L), lwd=2)
lik.CI(cbind(gravs.EL.mu, gravs.EL.L), lim=-0.5*qchisq(0.95,1))

Compare the intervals with those in Example 3.2. Does the result for (b) suggest a limitation of multinomial likelihoods in general? Compare the empirical likelihoods with the profile likelihood (10.1) and the adjusted profile likelihood (10.2), obtained when the series are treated as independent normal samples with different variances but the same mean.
(Section 10.2.1)


2 Dataframe islay contains 18 measurements (in degrees east of north) of palaeocurrent azimuths from the Jura Quartzite on the Scottish island of Islay. We aim to use multinomial-based likelihoods to set 95% confidence intervals for the mean direction $a(\theta) = (\sin\theta, \cos\theta)^T$ of the distribution underlying the data; the vector $b(\theta) = (\cos\theta, -\sin\theta)^T$ is orthogonal to $a$. Let $y_j = (\sin\theta_j, \cos\theta_j)^T$ denote the vectors corresponding to the observed angles $\theta_j$. Then the mean direction $\hat\theta$ is the angle subtended at the origin by $\sum y_j / \|\sum y_j\|$. For the original estimate, plots of the data, log likelihoods and confidence intervals:

attach(islay)
th <- ifelse(theta > 180, theta - 360, theta)
a.t <- function(th) c(sin(th*pi/180), cos(th*pi/180))
b.t <- function(th) c(cos(th*pi/180), -sin(th*pi/180))
y <- t(apply(matrix(theta, 18, 1), 1, a.t))
thetahat <- function(y)
{ m <- apply(y, 2, sum)
  m <- m/sqrt(m[1]^2 + m[2]^2)
  180*atan(m[1]/m[2])/pi }
thetahat(y)
u.t <- function(y, th) crossprod(b.t(th), t(y))
islay.EL <- EL.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
plot(islay.EL[,1], islay.EL[,2], type="l", xlab="theta",
     ylab="log empirical likelihood", ylim=c(-25,0))
points(th, rep(-25,18)); abline(h=-3.84/2, lty=2)
lik.CI(islay.EL, lim=-0.5*qchisq(0.95,1))
islay.EEF <- EEF.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
lines(islay.EEF[,1], islay.EEF[,2], lty=3)
lik.CI(islay.EEF, lim=-0.5*qchisq(0.95,1))

Discuss the shapes of the log likelihoods. To obtain 0.95 quantiles of the bootstrap distributions of $W_{EL}$ and $W_{EEF}$:

islay.fun <- function(y, i, angle)
{ u <- as.vector(u.t(y[i,], angle))
  z <- rep(0, length(u))
  EEF.fit <- glm(z ~ u - 1, poisson)
  W.EEF <- 2*sum(1 - fitted(EEF.fit))
  EL.loglik <- function(lambda) -sum(log(1 + lambda * u))
  EL.score <- function(lambda) -sum(u/(1 + lambda * u))
  assign("u", u, frame=1)
  EL.out <- nlmin(EL.loglik, 0.001)
  W.EL <- -2*EL.loglik(EL.out$x)
  c(thetahat(y[i,]), W.EL, W.EEF, EL.out$converged) }
islay.boot <- boot(y, islay.fun, R=999, angle=thetahat(y))
islay.boot$R <- sum(islay.boot$t[,4])
islay.boot$t <- islay.boot$t[islay.boot$t[,4]==1,]
apply(islay.boot$t[,2:3], 2, quantile, 0.95)

How do the bootstrap-calibrated confidence intervals compare with those based on the $\chi^2_1$ distribution, and with the basic bootstrap intervals using the $\hat\theta^*$?
(Sections 10.2.1, 10.2.2; Hand et al., 1994, p. 198)

3 We compare posterior densities for the mean of the air-conditioning data using (a) the Bayesian bootstrap with $a_j = -1$:


airl <- data.frame(hours=aircondit$hours, G=1)
air.bayes.gen <- function(d, a)
{ out <- d
  out$G <- rgamma(nrow(d), shape=a+2)
  out }
air.bayes.fun <- function(d) sum(d$hours*d$G)/sum(d$G)
air.bayesian <- boot(airl, air.bayes.fun, R=999, sim="parametric",
    ran.gen=air.bayes.gen, mle=-1)
plot(density(air.bayesian$t, n=100, width=25), type="l",
    xlab="theta", ylab="density", ylim=c(0,0.02))

and (b) an exponential model with mean $\theta$ for the data, with a prior according to which $\theta^{-1}$ has a gamma distribution with index $\kappa$ and scale $\lambda$:

kappa <- 0; lambda <- 0
kappa.post <- kappa + length(airl$hours)
lambda.post <- lambda + sum(airl$hours)
theta <- 30:300
lines(theta,
    lambda.post/theta^2*dgamma(lambda.post/theta, kappa.post),
    lty=2)

Repeat this with different values of $a$ in the Bayesian bootstrap and of $\kappa$, $\lambda$ in the parametric case, and discuss your results.
(Section 10.5)


11

Computer Implementation

11.1 Introduction

The key requirements for computer implementation of resampling methods are a flexible programming language with a suite of reliable quasi-random number generators, a wide range of built-in statistical procedures to bootstrap, and a reasonably fast processor. In this chapter we outline how to use one implementation, using the current (May 1997) commercial version Splus 3.3 of the statistical language S, although the methods could be realized in a number of other statistical computing environments.

The remainder of this section outlines the installation of the library, and gives a quick summary of features of Splus essential to our purpose. Each subsequent section describes aspects of the library needed for the material in the corresponding chapter: Section 11.2 corresponds to Chapter 2, Section 11.3 to Chapter 3, and so forth. These sections take the form of a tutorial on the use of the library functions. The outline given here is not intended to replace the help files distributed with the library, which can be viewed by typing help(boot, library="boot") within Splus. At various points below, you will need to consult these files for more details on functions.

The main functions in the library are summarized in Table 11.1.

The best way to learn to use software is to use it, and from Section 11.1.2 onwards, we assume that you, dear reader, know the basics of S, including how to write simple functions, that you are seated comfortably at your favourite computer with Splus launched and a graphics window open, and that you are working through this chapter. We do not show the Splus prompt >, nor the continuation prompt +.


Table 11.1 Functions in the Splus bootstrap library.

  Function          Purpose
  abc.ci            Nonparametric ABC confidence intervals
  boot              Parametric and nonparametric bootstrap simulation
  boot.array        Array of indices or frequencies from bootstrap simulation
  boot.ci           Bootstrap confidence intervals
  censboot          Bootstrap for censored and survival data
  control           Control methods for estimation of quantiles, bias, variance, etc.
  cv.glm            Cross-validation prediction error estimate for generalized linear model
  empinf            Calculate empirical influence values
  envelope          Calculate simulation envelope
  exp.tilt          Exponential tilting to calculate probability distributions
  glm.diag          Generalized linear model diagnostics
  glm.diag.plots    Plot generalized linear model diagnostics
  imp.moments       Importance resampling moment estimates
  imp.prob          Importance resampling tail probability estimates
  imp.quantile      Importance resampling quantile estimates
  imp.weights       Calculate importance resampling weights
  jack.after.boot   Jackknife-after-bootstrap plot
  linear.approx     Calculate linear approximation to a statistic
  saddle            Saddlepoint approximation
  saddle.distn      Saddlepoint approximation for a distribution
  simplex           Simplex method of linear programming
  smooth.f          Frequency smoothing
  tilt.boot         Automatic importance re-weighting bootstrap simulation
  tsboot            Bootstrap for time series data

11.1.1 Installation

UNIX

The bootstrap library can be obtained from the home page for this book,

http://dmawww.epfl.ch/davison.mosaic/BMA/

in the form of a compressed shar file bootlib.sh.Z. This file should be uncompressed and moved to an appropriate directory. The file can then be unpacked by

sh bootlib.sh
rm bootlib.sh

You should then follow the instructions in the README file to complete the installation of the library.

It is best to set up an Splus library boot containing the library files; you may need to ask your system manager to do this. Once this is done, and once inside Splus in your usual working directory, the functions and data are accessed by typing

library(boot, first=T)


This will avoid cluttering your working directory with library files, and reduce the chance that you accidentally overwrite them.

Windows

The disk at the back of this book contains the library functions and documentation for use with Splus for Windows. For instructions on the installation, see the file README.TXT on the disk. The contents of the disk can also be retrieved in the form of a zip file from the home page for the book given above.

11.1.2 Some key Splus ideas

Quasi-random numbers

To put 20 quasi-random $N(0,1)$ data into y and to see its contents, type

y <- rnorm(20)

y

Here <- is the assignment symbol. To see the contents of any S object, simply type its name, as above. This is often done below, and we do not show the output.

In general quasi-random numbers from a distribution are generated by the functions rexp, rgamma, rchisq, rt, ..., with arguments to give parameters where needed. For example,

y <- rgamma(n=10,shape=2)

generates 10 gamma observations with shape parameter 2, and

y <- rgamma(n=10, shape=c(1:10))

generates a vector of ten gamma variables with shape parameters $1, 2, \ldots, 10$.

The function sample is used to sample from a set with or without replacement. For example, to get a random permutation of the numbers $1, \ldots, 10$, a random sample with replacement from them, a random permutation of 11, 22, 33, 44, 55, a sample of size 10 from them, and a sample of size 10 taken with unequal probabilities:

sample(10)

sample(10,replace=T)

set <- c(11,22,33,44,55)

sample(set)

sample(set,size=10,replace=T)

sample(set,size=10,replace=T,prob=c(0.1,0.1,0.1,0.1,0.6))


Subscripts

The city population data with $n = 10$ are

city

city$u

city$x

where the second two commands show the individual variables of city. This Splus object is a dataframe, an array of data in which rows correspond to cases, and the named columns to variables. Elements of an object are accessed by subscripts, so

city$x[1]
city$x[c(1:4)]
city$x[c(1,5,10)]
city[c(1,5,10),2]
city$x[-1]
city[c(1:3),]

give various subsets of the elements of city. To make a nonparametric bootstrap sample of the rows of city, you could type:

i <- sample(10,replace=T)

city[i,]

The row labels result from the algorithm used to give unique labels to rows, and can be ignored for our purposes.

11.2 Basic Bootstraps

11.2.1 Nonparametric bootstrap

The main bootstrap function, boot, works on a vector, a matrix, or a dataframe. A simple use of boot to bootstrap the ratio $t = \bar x/\bar u$ for the city population data of Example 2.8 is

city.fun <- function(data, i)

{ d <- data[i,]

mean(d$x)/mean(d$u) }

city.boot <- boot(data=city, statistic=city.fun, R=50)

The function city.fun takes as input the dataframe data and the vector of indices i. Its first command sets up the bootstrapped dataframe, and its second makes and returns the bootstrapped ratio. The last command instructs the function boot to bootstrap the data in city $R = 50$ times, apply the statistic city.fun to each bootstrap dataset and put the results in city.boot.


Bootstrap objects

The result of a call to boot is a bootstrap object. This is implemented as a list of quantities which is given the class "boot" and for which various methods are defined. For example, typing

city.boot

prints the original statistic, its estimated bias and its standard error, while

plot(city.boot)

gives suitable summary plots.

To see the names of the elements of the bootstrap object city.boot, type

names(city.boot)

You see various names, of which city.boot$t0, city.boot$t, city.boot$R, and city.boot$seed contain the original value of the statistic, the bootstrap values, the value of $R$, and the value of the Splus random number generation seed when boot was invoked. To see their contents, type their names.

Timing

To repeat the simulation, checking how long it takes, type

unix.time(city.boot <- boot(city, city.fun, R=50))

on a UNIX system or

dos.time(city.boot <- boot(city, city.fun, R=50))

on a DOS system. The first number returned is the time the simulation took, and is useful for estimating how long a larger simulation would take.

Although code is generally clearer when dataframes are used, the computation can be speeded up by avoiding them, as here:

mat <- as.matrix(city)

mat.fun <- function(data, i)
{ d <- data[i,]
  mean(d[,2])/mean(d[,1]) }

unix.time(mat.boot <- boot(mat,mat.fun,R=50))

Compare this with the time taken using the dataframe city.

Frequency array

To obtain the R × n array of bootstrap frequencies for city.boot and to display its first 20 lines, type

f <- boot.array(city.boot)

f[1:20,]


The rows of f are the vectors of frequencies for individual bootstrap samples. The array is useful for many post hoc calculations, and is invoked by post-processing functions such as jack.after.boot and imp.weights, which are discussed below. It is calculated from city.boot$seed. The array of indices for the bootstrap samples can be obtained by boot.array(city.boot, indices=T).
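Two properties of the frequency array are easily verified; the following sketch — our check, using only the functions just described — confirms that each row of f sums to n = 10 and that the first row agrees with a tabulation of the corresponding bootstrap indices:

apply(f, 1, sum)

i <- boot.array(city.boot, indices=T)

tabulate(i[1,], nbins=10)

f[1,]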

Types of statistic

For a nonparametric bootstrap, the function statistic can be of one of three types. We have already seen examples of the first, index type, where the arguments are the dataframe data and the vector of indices, i; this is specified by stype="i" (the default).

For the second, weighted type, the arguments are data and a vector of weights w. For example,

city.w <- function(data, w=rep(1,nrow(data))/nrow(data))
{ w <- w/sum(w)
  sum(w*data$x)/sum(w*data$u) }

city.boot <- boot(city, city.w, R=20, stype="w")

writes

$$ t^* = \frac{\sum_j w_j^* x_j / \sum_j w_j^*}{\sum_j w_j^* u_j / \sum_j w_j^*}, $$

where $w_j^*$ is the weight put on the jth case of the dataframe in the bootstrap sample; the first line of city.w ensures that $\sum_j w_j^* = 1$. Setting w in the initial line of the function gives the default value for w, which is a vector of $n^{-1}$s; this enables the original value of t to be obtained by city.w(city). A more complicated example is given by the library correlation function corr. Not all statistics can be written in this form, but when they can, numerical differentiation can be used to obtain empirical influence values and ABC confidence intervals.

For the third, frequency type, the arguments are data and a vector of frequencies f. For example,

city.f <- function(data, f) mean(f*data$x)/mean(f*data$u)

city.boot <- boot(city, city.f, R=20, stype="f")

uses

$$ t^* = n^{-1}\sum_j f_j^* x_j \Big/ n^{-1}\sum_j f_j^* u_j, $$

where $f_j^*$ is the frequency with which the jth row of the dataframe occurs in the bootstrap sample. Not all statistics can be written in this form. It differs from the preceding type in that whereas weights can in principle take any positive


values, frequencies must be integers. Of course in this example it would be easiest to use the function city.fun given earlier.
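As a sanity check on the frequency form — a sketch, not part of the library — setting every frequency to one must reproduce the original ratio, just as city.w(city) does for the weighted form:

city.f(city, f=rep(1, nrow(city)))

city.w(city)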

Function statistic

The contents of statistic can be more-or-less arbitrarily complicated, provided that its output is a scalar or fixed-length vector. For example,

air.fun <- function(data, i)
{ d <- data[i,]
  c(mean(d), var(d)/nrow(data)) }

air.boot <- boot(data=aircondit, statistic=air.fun, R=200)

performs a nonparametric bootstrap for the average of the air-conditioning data, and returns the bootstrapped averages and their estimated variances. We give more complex examples below. Beware of memory and storage problems if you make the output too long.

By default the first element of statistic (and so the first column of boot.out$t) is treated as the main statistic for certain calculations, such as calculation of empirical influence values, the jackknife-after-bootstrap plot, and confidence interval calculations, which are described below. This is changed by use of the index argument, usually a single number giving the column of statistic to which the calculation is to be applied.

Further arguments can be passed to statistic using the ... argument to boot. For example,

city.subset <- function(data, i, n=10)
{ d <- data[i[1:n],]
  mean(d[,2])/mean(d[,1]) }

city.boot <- boot(data=city, statistic=city.subset, R=200, n=5)

gives resampled ratios for bootstrap samples of size 5. Note that the frequency array for city.boot would not be useful in this case. The indices can be obtained by

boot.array(city.boot, indices=T)[,1:5]

11.2.2 Parametric bootstrap

For a parametric bootstrap, the first argument to statistic remains a vector, matrix, or dataframe, but statistic need take no second argument. Instead three further arguments to boot must be supplied. The first, ran.gen, tells boot how to simulate bootstrap data, and is a function that takes two arguments, the original data and an object containing any other parameters, mle. The output of ran.gen should have the same form and attributes as the original dataset. The second new argument to boot is a value for mle itself. The third new argument to boot, sim="parametric", tells boot to perform a parametric simulation: by default the simulation is nonparametric and sim="ordinary". Other possible values for sim are described below.

For example, for parametric simulation from the exponential model fitted to the air-conditioning data in Table 1.2, we set

aircondit.fun <- function(data) mean(data$hours)

aircondit.sim <- function(data, mle)
{ d <- data
  d$hours <- rexp(n=nrow(data), rate=mle)
  d }

aircondit.mle <- 1/mean(aircondit$hours)

aircondit.para <- boot(data=aircondit, statistic=aircondit.fun,
     R=20, sim="parametric", ran.gen=aircondit.sim,
     mle=aircondit.mle)

Air-conditioning data for a different aircraft are given in aircondit7. Obtain their sample average, and perform a parametric bootstrap of the average using the fitted exponential model. Give the bias and variance estimates for the average. Do the bootstrapped averages look normal for this sample size?
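One possible solution, assuming that aircondit7 has the same hours column as aircondit, is the following sketch:

mean(aircondit7$hours)

aircondit7.mle <- 1/mean(aircondit7$hours)

aircondit7.para <- boot(data=aircondit7, statistic=aircondit.fun,
     R=999, sim="parametric", ran.gen=aircondit.sim,
     mle=aircondit7.mle)

mean(aircondit7.para$t) - aircondit7.para$t0

var(aircondit7.para$t)

qqnorm(aircondit7.para$t)

The last three lines give the bias estimate, the variance estimate, and a normal plot of the bootstrapped averages.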

A more complicated example is parametric simulation based on a log bivariate normal distribution fitted to the city population data:

l.city <- log(city)
city.mle <- c(apply(l.city,2,mean), sqrt(apply(l.city,2,var)),
     corr(l.city))

city.sim <- function(data, mle)
{ n <- nrow(data)
  d <- matrix(rnorm(2*n),n,2)
  d[,2] <- mle[2] + mle[4]*(mle[5]*d[,2] + sqrt(1-mle[5]^2)*d[,1])
  d[,1] <- mle[1] + mle[3]*d[,1]
  data$x <- exp(d[,2])
  data$u <- exp(d[,1])
  data }

city.f <- function(data) mean(data[,2])/mean(data[,1])

city.para <- boot(city, city.f, R=200, sim="parametric",
     ran.gen=city.sim, mle=city.mle)

With this definition of city.f, a nonparametric bootstrap can be performed by

city.boot <- boot(data=city,
     statistic=function(data, i) city.f(data[i,]), R=200)


This is useful when comparing parametric and nonparametric bootstraps for the same problem. Compare them for the city data.
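One way to make the comparison — a sketch, not the book's prescribed analysis — is to print the two bootstrap objects and to plot the ordered values of the two sets of bootstrapped ratios against each other:

city.para

city.boot

qqplot(city.para$t[,1], city.boot$t[,1]); abline(0, 1, lty=2)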

11.2.3 Empirical influence values

For a statistic boot.fun in weighted form, the function empinf returns the empirical influence values $l_j$, obtained by numerical differentiation. For the ratio function city.w given above, for example, these and the exact values (Problem 2.9) are

L.diff <- empinf(data=city, statistic=city.w, stype="w")

cbind(L.diff,(city$x-city.w(city)*city$u)/mean(city$u))

Empirical influence values can also be obtained from the output of boot by regression of the values of $t^*$ on the frequency array. For example,

city.boot <- boot(city, city.fun, R=999)

L.reg <- empinf(city.boot)

L.reg

uses regression with the 999 samples in city.boot to estimate the $l_j$. Jackknife values can be obtained by

J <- empinf(data=city,statistic=city.fun,stype="i",type="jack")

The argument type controls how the influence values are to be calculated, but this also depends on the quantities input to empinf: for details see the help file.

Variance approximations

var.linear uses empirical influence values to calculate the nonparametric delta method variance approximation for a statistic:

var.linear(L.diff)

var.linear(L.reg)

Linear approximation

linear.approx uses output from a nonparametric bootstrap simulation to calculate the linear approximations to the bootstrapped quantities. The empirical influence values can be supplied, but if not, they are estimated by a call to empinf. For the city population ratio,

city.tL.reg <- linear.approx(city.boot)

city.tL.diff <- linear.approx(city.boot, L=L.diff)

split.screen(c(1,2))

screen(1); plot(city.tL.reg,city.boot$t); abline(0,1,lty=2)

screen(2); plot(city.tL.diff,city.boot$t); abline(0,1,lty=2)


calculates the linear approximation for the two sets of empirical influence values and plots the actual $t^*$ against them.

11.3 Further Ideas

11.3.1 Stratified sampling

Stratified sampling is performed by including the argument strata in the call to boot. Suppose that we wish to bootstrap the difference in the trimmed averages for the last two groups of gravity data (Example 3.2):

gravity

grav <- gravity[as.numeric(gravity$series)>=7,]

grav

grav.fun <- function(data, i, trim=0.125)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean, trim=trim)
  m[7]-m[8] }

grav.boot <- boot(grav, grav.fun, R=200, strata=grav$series)

Check that the expected properties of boot.array(grav.boot) hold.

Empirical influence values, linear approximations, and nonparametric delta method variance approximations are calculated by

grav.L <- empinf(grav.boot)

grav.tL <- linear.approx(grav.boot)

var.linear(grav.L, strata=grav$series)

grav.boot$strata contains the strata used in the resampling, which are taken into account automatically if grav.boot is used, but otherwise must be supplied, as in the final line of the code above.

11.3.2 Smoothing

The neatest way to perform smooth bootstrapping is to use sim="parametric". For example, to estimate the variance of the median of the data in y, using smoothing parameter h = 0.5:

y <- rnorm(99)

h <- 0.5

y.gen <- function(data, mle)
{ n <- length(data)
  i <- sample(n, n, replace=T)
  data[i] + mle*rnorm(n) }


y.boot <- boot(y, median, R=200, sim="parametric",

ran.gen=y.gen, mle=h)

var(y.boot$t)

This guarantees that y.boot$t0 contains the original median. For shrunk smoothing, see Practical 4.5.

11.3.3 Censored data

censboot is used to bootstrap censored data. Suppose that we wish to assess the variability of the median survival time and the probability of survival beyond 20 weeks for the first group of AML data (Example 3.9).

aml1 <- aml[aml$group==1,]

aml1.fun <- function(data)
{ surv <- survfit(Surv(data$time,data$cens))
  p1 <- min(surv$surv[surv$time<20])
  m1 <- min(surv$time[surv$surv<0.5])
  c(p1, m1) }

aml1.ord <- censboot(data=aml1, statistic=aml1.fun, R=50)

aml1.ord

This involves ordinary bootstrap resampling, and hence could be performed with boot, although aml1.fun would then have to be rewritten to have another argument. For conditional simulation, two additional arguments must be supplied containing the estimated survivor functions for the times to failure and the censoring distribution:

aml1.fail <- survfit(Surv(time,cens), data=aml1)

aml1.cens <- survfit(Surv(time-0.01*cens,1-cens), data=aml1)

aml1.con <- censboot(data=aml1, statistic=aml1.fun, R=50,
     F.surv=aml1.fail, G.surv=aml1.cens, sim="cond")

11.3.4 Bootstrap diagnostics

Jackknife-after-bootstrap

Function jack.after.boot produces a jackknife-after-bootstrap plot of the first column of boot.out$t based on a nonparametric simulation. For example, for the city data ratio:

city.fun <- function(data, i)
{ d <- data[i,]
  rat <- mean(d$x)/mean(d$u)
  L <- (d$x-rat*d$u)/mean(d$u)
  c(rat, sum(L^2)/nrow(d)^2, L) }


city.boot <- boot(city, city.fun, R=999)
city.L <- city.boot$t0[3:12]
split.screen(c(1,2)); screen(1); split.screen(c(2,1)); screen(4)
attach(city)
plot(u,x,type="n",xlim=c(0,300),ylim=c(0,300))
text(u,x,round(city.L,2))
screen(3)
plot(u,x,type="n",xlim=c(0,300),ylim=c(0,300))
text(u,x,c(1:10)); abline(0,city.boot$t0[1],lty=2)
screen(2)
jack.after.boot(boot.out=city.boot, useJ=F, stinf=F, L=city.L)
close.screen(all=T)

The two left panels show the data with case numbers and empirical influence values as plotting symbols. The jackknife-after-bootstrap plot on the right shows the effect of deleting cases in turn: values of $t^*$ are more variable when case 4 is deleted and less variable when cases 9 and 10 are deleted. We see from the empirical influence values that the distribution of $t^*$ shifts downwards when cases with positive empirical influence values are deleted, and conversely.

This plot is also produced by setting true the jack argument to plot when applied to a bootstrap object, as in plot(city.boot, jack=T).

Other arguments for jack.after.boot control whether the influence values are standardized (by default they are, stinf=T) and whether the empirical influence values are used (by default jackknife values are used, based on the simulation, so the default values are useJ=T and L=NULL).

Most post-processing functions allow the user to specify either an index for the component of interest, or a vector of length boot.out$R to be treated as the main statistic. Thus a jackknife-after-bootstrap plot using the second component of city.boot$t — the estimated variances for $t^*$ — would be obtained by either of

jack.after.boot(city.boot, useJ=F, stinf=F, index=2)
jack.after.boot(city.boot, useJ=F, stinf=F, t=city.boot$t[,2])

Frequency smoothing

smooth.f smooths the frequencies of a nonparametric bootstrap object to give a "typical" distribution with expected value roughly at $\theta$. In order to find the smoothed frequencies for $\theta = 1.4$ for the city ratio, and to obtain the corresponding value of t, we set

city.freq <- smooth.f(theta=1.4, boot.out=city.boot)
city.w(city, city.freq)


The smoothing bandwidth is controlled by the width argument to smooth.f and is width $\times\, v^{1/2}$, where v is the estimated variance of t; width=0.5 by default.

11.4 Tests

11.4.1 Parametric tests

Simple parametric tests can be conducted using parametric simulation. For example, to perform the conditional simulation for the data in fir (Example 4.2):

fir.mle <- c(sum(fir$count), nrow(fir))

fir.gen <- function(data, mle)
{ d <- data
  y <- sample(x=mle[2], size=mle[1], replace=T)
  d$count <- tabulate(y, mle[2])
  d }

fir.fun <- function(data)
  (nrow(data)-1)*var(data$count)/mean(data$count)

fir.boot <- boot(fir, fir.fun, R=999, sim="parametric",
     ran.gen=fir.gen, mle=fir.mle)

qqplot(qchisq(c(1:fir.boot$R)/(fir.boot$R+1), df=49), fir.boot$t)

abline(0,1,lty=2); abline(h=fir.boot$t0)

The last two lines here display the results (almost) as in the right panel of Figure 4.1.
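The corresponding significance probability can be estimated by comparing the simulated dispersion statistics with the observed value, in the same way as for the permutation test below; a minimal sketch:

(1 + sum(fir.boot$t >= fir.boot$t0))/(fir.boot$R + 1)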

11.4.2 Permutation tests

Approximate permutation tests are performed by setting sim="permutation" when invoking boot. For example, suppose that we wish to perform a permutation test for zero correlation between the two columns of the dataframe ducks:

perm.fun <- function(data, i) cor(data[,1],data[i,2])

ducks.perm <- boot(ducks, perm.fun, R=499, sim="permutation")

(sum(ducks.perm$t>ducks.perm$t0)+1)/(ducks.perm$R+1)

qqnorm(ducks.perm$t,ylim=c(-1,1))

abline(h=ducks.perm$t0,lty=2)

If strata is included in the call to boot, permutation is performed independently within each stratum.

11.4.3 Bootstrap tests

For a bootstrap test of the hypothesis of zero correlation in the ducks data, we make a new dataframe and function:

duck <- c(ducks[,1], ducks[,2])

n <- nrow(ducks)

duck.fun <- function(data, i, n)
{ x <- data[i]
  cor(x[1:n], x[(n+1):(2*n)]) }

.Random.seed <- ducks.perm$seed

ducks.boot <- boot(duck, duck.fun, R=499,
     strata=rep(c(1,2),c(n,n)), n=n)

(sum(ducks.boot$t>ducks.boot$t0)+1)/(ducks.boot$R+1)

This uses the same seed as for the permutation test, for a more precise comparison. Is the significance level similar to that for the permutation test?

Why cannot boot be directly applied to ducks to perform a bootstrap test?

Exponential tilting

The test of equality of means for two sets of data in Example 4.16 involves exponential tilting. The null distribution puts probabilities given by (4.25) on the two sets of data, and the tilt parameter $\lambda$ solves the equation

$$ \frac{\sum_{ij} z_{ij}\exp(\lambda z_{ij})}{\sum_{ij}\exp(\lambda z_{ij})} = \theta, $$

where $z_{1j} = y_{1j}$, $z_{2j} = -y_{2j}$, and $\theta = 0$. The fitted null distribution is obtained using exp.tilt, as follows:

z <- grav$g

z[grav$series==8] <- -z[grav$series==8]

z.tilt <- exp.tilt(L=z, theta=0, strata=grav$series)

z.tilt

where z.tilt contains the fitted probabilities (which sum to one for each stratum) and the values of $\lambda$ and $\theta$. Other arguments can be input to exp.tilt: see its help file.

The significance probability is then obtained by using the weights argument to boot. This argument is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. In this case the unequal probabilities are given by the tilted distribution, under which the expected value of the test statistic is zero. The code needed to perform the simulation and get the estimated significance level is:


grav.test <- function(data, i)
{ d <- data[i,]
  diff(tapply(d$g, d$series, mean))[7] }

grav.boot <- boot(data=grav, statistic=grav.test, R=999,
     weights=z.tilt$p, strata=grav$series)

(sum(grav.boot$t>grav.boot$t0)+1)/(grav.boot$R+1)

11.5 Confidence Intervals

The main function for setting bootstrap confidence intervals is boot.ci, which takes as input a bootstrap object. For example, to get a 95% confidence interval for the ratio in the city data, using the city.boot object created in Section 11.3.4:

boot.ci(boot.out=city.boot)

By default the confidence level is 0.95, but other values can be obtained using the conf argument. Here invoking boot.ci shows the normal, basic, studentized bootstrap, percentile, and BCa intervals. Subsets of these intervals are obtained using the type argument. For example, if city.boot$t only contained the ratio and not its estimated variance, it would be impossible to obtain the studentized bootstrap interval, and an appropriate use of boot.ci would be

boot.ci(boot.out=city.boot, type=c("norm","perc","basic","bca"),
     conf=c(0.8,0.9))

By default boot.ci assumes that the first and second columns of boot.out$t contain the statistic itself and its estimated variance; otherwise the index argument can be used, as outlined in the help file.
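For instance, with the air.boot object of Section 11.2.1, whose statistic returns the average in its first column and an estimated variance in its second, a studentized interval could be requested by something like the following sketch:

boot.ci(air.boot, type="stud", index=c(1,2))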

To calculate intervals for the parameter $h(\theta)$, and then back-transform them to the original scale, we use the h, hinv, and hdot arguments. For example, to calculate intervals for the city ratio, using $h(\cdot) = \log(\cdot)$, we set

boot.ci(city.boot, h=log, hinv=exp, hdot=function(u) 1/u)

where hinv and hdot are the inverse and first derivative of $h(\cdot)$. Note how transformation improves the basic bootstrap interval.

Nonparametric ABC intervals are calculated using abc.ci. For example

abc.ci(data=city, statistic=city.w)

calculates the 95% ABC interval for the city ratio; statistic must be in weighted form for this. As usual, strata are incorporated using the strata argument.


11.6 Linear Regression

11.6.1 Basic approaches

Resampling for linear regression models is performed using boot. It is simplest when bootstrapping cases. For example, to compare the biases and variances for parameter estimates from bootstrapping least squares and L1 estimates for the mammals data:

fit.model <- function(data)
{ fit <- glm(log(brain)~log(body), data=data)
  l1 <- l1fit(log(data$body), log(data$brain))
  c(coef(fit), coef(l1)) }

mammals.fun <- function(data, i) fit.model(data[i,])

mammals.boot <- boot(mammals, mammals.fun, R=99)

mammals.boot

For model-based resampling it is simplest to set up an augmented dataframe containing the residuals and fitted values. Although the model is a straightforward linear model, we fit it using glm rather than lm so that we can calculate residuals using the library function glm.diag, which calculates various types of residuals, approximate Cook statistics, and measures of leverage for a glm object. (The diagnostics are exact for a linear model.) A related function is glm.diag.plots, which produces standard diagnostic plots for a generalized linear model fit:

mam.lm <- glm(log(brain)~log(body), data=mammals)

mam.diag <- glm.diag(mam.lm)

glm.diag.plots(mam.lm)

res <- (mam.diag$res-mean(mam.diag$res))*mam.diag$sd

mam <- data.frame(mammals,res=res,fit=fitted(mam.lm))

mam.fun <- function(data, i)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  fit.model(d) }

mam.boot <- boot(mam, mam.fun, R=99)

mam.boot

Empirical influence values and the nonparametric delta method standard error for the slope of the linear model could be obtained by putting the slope estimate in weighted form:

mam.w <- function(data, w)
  coef(glm(log(data$brain)~log(data$body), weights=w))[2]

mam.L <- empinf(data=mammals, statistic=mam.w)

sqrt(var.linear(mam.L))


For more complicated regressions, for example with unequal response variances, more information must be added to the new dataframe.

Wild bootstrap

The wild bootstrap can be implemented using sim="parametric", as follows:

mam.mle <- c(nrow(mam), (5+sqrt(5))/10)

mam.wild <- function(data, mle)
{ d <- data
  i <- 2*rbinom(mle[1], size=1, prob=1-mle[2])-1
  d$brain <- exp(d$fit+d$res*(1-i*sqrt(5))/2)
  d }

mam.boot.wild <- boot(mam, fit.model, R=20, sim="parametric",
     ran.gen=mam.wild, mle=mam.mle)

11.6.2 Prediction

Now consider prediction of the log brain weight of new mammals with body weights equal to those for the chimpanzee and baboon. For this we introduce yet another argument to boot — m, which gives the number of $\epsilon_m^*$ to be simulated with each bootstrap sample (see Algorithm 6.4). In this case we want to predict at m = 2 "new" mammals, with covariates contained in d.pred. The statistic function supplied to boot must now take at least one more argument, namely the additional indices for constructing the bootstrap versions of the two "new" mammals. We implement this as follows:

d.pred <- mam[c(46,47),]

pred <- function(data, d.pred)
  predict(glm(log(brain)~log(body), data=data), d.pred)

mam.pred <- function(data, i, i.pred, d.pred)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  pred(d, d.pred) - (d.pred$fit + d$res[i.pred]) }

mam.boot.pred <- boot(mam, mam.pred, R=199, m=2, d.pred=d.pred)

orig <- matrix(pred(mam, d.pred), mam.boot.pred$R, 2, byrow=T)

exp(apply(orig+mam.boot.pred$t, 2, quantile, c(0.025,0.5,0.975)))

giving the 0.025, 0.5, and 0.975 prediction limits for the brain sizes of the "new" mammals. The actual brain sizes lie close to or above the upper limits of these intervals: primates tend to have larger brains than other mammals.

11.6.3 Aggregate prediction error and variable selection

Practical 6.5 shows how to obtain the various estimates of aggregate prediction error based on a given model.


For consistent bootstrap variable selection, a subset of size $n - m$ is used to fit each of the possible models. Consider Example 6.13, where a fake set of data is made by

x1 <- runif(50); x2 <- runif(50); x3 <- runif(50)

x4 <- runif(50); x5 <- runif(50); y <- rnorm(50)+2*x1+2*x2

fake <- data.frame(y,x1,x2,x3,x4,x5)

As in that example, we consider the six possible models with no covariates, with just x1, with x1, x2, and so forth, finishing with x1, ..., x5. The function subset.boot fits these to a subset of n - size observations, and calculates the prediction mean squared error for all the data. It is then applied using boot:

subset.boot <- function(data, i, size=0)
{ n <- nrow(data)
  i.t <- i[1:(n-size)]
  data.t <- data[i.t,]
  res0 <- data$y - mean(data.t$y)
  lm.d <- lm(y ~ x1, data=data.t)
  res1 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x2)
  res2 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x3)
  res3 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x4)
  res4 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x5)
  res5 <- data$y - predict.lm(lm.d, data)
  meansq <- function(y) mean(y^2)
  apply(cbind(res0,res1,res2,res3,res4,res5), 2, meansq)/n }

fake.boot.40 <- boot(fake, subset.boot, R=100, size=40)

delta.hat.40 <- apply(fake.boot.40$t, 2, mean)

plot(c(0:5), delta.hat.40, xlab="Number of covariates",
     ylab="Delta hat (M)", type="l", ylim=c(0,0.1))

For results with a different value of size, but re-using fake.boot.40$seed in order to reduce simulation variability:

.Random.seed <- fake.boot.40$seed

fake.boot.30 <- boot(fake, subset.boot, R=100, size=30)

delta.hat.30 <- apply(fake.boot.30$t,2,mean)

lines(c(0:5),delta.hat.30,lty=2)

Try this with various values of size.


Modify the code above to do variable selection using cross-validation, and compare it with the bootstrap results.
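A hand-rolled sketch of leave-one-out cross-validation for one of the six models — say that with covariates x1 and x2; the other models are handled in the same way — might look like this (our illustration, not the book's solution):

n <- nrow(fake)
cv.res <- numeric(n)
for (j in 1:n)
{ lm.j <- lm(y ~ x1 + x2, data=fake[-j,])
  cv.res[j] <- fake$y[j] - predict.lm(lm.j, fake[j,]) }
mean(cv.res^2)

The final line gives the cross-validation estimate of prediction error for that model.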

11.7 Further Topics in Regression

11.7.1 Nonlinear and generalized linear models

Nonlinear and generalized linear models are bootstrapped using the ideas in the preceding section. For example, to apply case resampling to the calcium data of Example 7.7:

calcium.fun <- function(data, i)
{ d <- data[i,]
  d.nls <- nls(cal~beta0*(1-exp(-time*beta1)), data=d,
       start=list(beta0=5, beta1=0.2))
  c(coefficients(d.nls), sum(d.nls$residuals^2)/(nrow(d)-2)) }

cal.boot <- boot(calcium, calcium.fun, R=19, strata=calcium$time)

Likewise, to apply model-based simulation to the leukaemia data of Example 7.1, resampling standardized deviance residuals according to (7.14),

leuk.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=leuk)

leuk.diag <- glm.diag(leuk.glm)

muhat <- fitted(leuk.glm)

rL <- log(leuk$time/muhat)/sqrt(1-leuk.diag$h)

eps <- 10^(-4)

u <- -log(seq(from=eps, to=1-eps, by=eps))

d <- sign(u-1)*sqrt(2*(u-1-log(u)))/leuk.diag$sd

r.dev <- smooth.spline(d, u)

z <- predict(r.dev, leuk.diag$rd)$y

leuk.mle <- data.frame(muhat, rL, z)

fit.model <- function(data)
{ data.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=data)
  c(coefficients(data.glm), deviance(data.glm)) }

leuk.gen <- function(data, mle)
{ i <- sample(nrow(data), replace=T)
  data$time <- mle$muhat*mle$z[i]
  data }

leuk.boot <- boot(leuk, fit.model, R=19, sim="parametric",
     ran.gen=leuk.gen, mle=leuk.mle)

The other procedures for model-based resampling of generalized linear models are applied similarly. Try to modify this code to resample the linear predictor residuals according to (7.13) (they are already calculated above).
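One plausible modification — our assumption about what is wanted, not the book's own solution — replaces the standardized deviance residuals z by the linear predictor residuals rL already stored in leuk.mle, so that the resampled responses become muhat*exp(rL):

leuk.gen.L <- function(data, mle)
{ i <- sample(nrow(data), replace=T)
  data$time <- mle$muhat*exp(mle$rL[i])
  data }

leuk.boot.L <- boot(leuk, fit.model, R=19, sim="parametric",
     ran.gen=leuk.gen.L, mle=leuk.mle)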


11.7.2 Survival data

Further arguments to censboot are needed to bootstrap survival data. For illustration, we consider the melanoma data of Example 7.6, and fit a model in which survival depends on log tumour thickness. The initial fits are given by

mel.cox <- coxph(Surv(time,status==1)~log(thickness)
     +strata(ulcer), data=melanoma)

mel.surv <- survfit(mel.cox)

mel.cens <- survfit(Surv(time-0.01*(status!=1), status!=1)~1,
     data=melanoma)

The bootstrap function mel.fun given below need only take one argument, a dataframe containing the data themselves. Note how the function uses a smoothing spline to interpolate fitted values for the full range of thickness; this avoids difficulties due to the variability of the covariate when resampling cases. The output of mel.fun is the vector of fitted linear predictors predicted by the spline.

mel.fun <- function(d)
{ attach(d)
  cox <- coxph(Surv(time,status==1)~log(thickness)+strata(ulcer))
  eta <- unique(cox$linear.predictors)
  u <- unique(thickness)
  sp <- smooth.spline(u, eta, df=20)
  th <- seq(from=0.25, to=10, by=0.25)
  eta <- predict(sp, th)$y
  detach("d")
  eta }

The next three commands give the syntax for case resampling, for model-based resampling, and for conditional resampling. For either of these last two schemes, the baseline survivor functions for the survival times and censoring times, and the fitted proportional hazards (Cox) model for the survival distribution, must be supplied via the F.surv, G.surv, and cox arguments.

attach(melanoma)

mel.boot <- censboot(melanoma, mel.fun, R=99, strata=ulcer)

mel.boot.mod <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="model")

mel.boot.con <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="cond")


The bootstrap results are best displayed graphically. Here is the code for the analogue of the left panels of Figure 7.9:

th <- seq(from=0.25, to=10, by=0.25)
split.screen(c(2,1))
screen(1)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
rug(jitter(thickness))
for (i in 1:19) lines(th, mel.boot$t[i,], lwd=0.5)
screen(2)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
mel.env <- envelope(mel.boot$t, level=0.95)
lines(th, mel.env$point[1,], lty=1)
lines(th, mel.env$point[2,], lty=1)
mel.env <- envelope(mel.boot.mod$t, level=0.95)
lines(th, mel.env$point[1,], lty=2)
lines(th, mel.env$point[2,], lty=2)
mel.env <- envelope(mel.boot.con$t, level=0.95)
lines(th, mel.env$point[1,], lty=3)
lines(th, mel.env$point[2,], lty=3)
detach("melanoma")

Note how tight the confidence envelope is relative to that for the more highly parametrized model used in the example. Try again with larger values of R, if you have the patience.

11.7.3 Nonparametric regression

Nonparametric regression is bootstrapped in the same way as other regressions. Consider for example bootstrapping the smoothing spline fit to the motorcycle data of Example 7.10. The data without repeats are in motor, with components accel, times, strata, and v, the last two of which give the strata for resampling and an estimated variance within each stratum. The three fits are obtained by

attach(motor)

motor.smooth <- smooth.spline(times, accel, w=1/v)

motor.small <- smooth.spline(times, accel, w=1/v,
     spar=motor.smooth$spar/2)

motor.big <- smooth.spline(times, accel, w=1/v,
     spar=motor.smooth$spar*2)


Commands to set up and perform the resampling are as follows:

res <- (motor$accel-motor.small$y)/sqrt(1-motor.small$lev)

motor.mle <- data.frame(bigfit=motor.big$y,res=res)

xpoints <- c(10,20,25,30,35,45)

motor.fun <- function(data, x)
{ y.smooth <- smooth.spline(data$times, data$accel, w=1/data$v)
  predict(y.smooth, x)$y }

motor.gen <- function(data, mle)
{ d <- data
  i <- c(1:nrow(data))
  i1 <- sample(i[data$strata==1], replace=T)
  i2 <- sample(i[data$strata==2], replace=T)
  i3 <- sample(i[data$strata==3], replace=T)
  d$accel <- mle$bigfit + mle$res[c(i1,i2,i3)]
  d }

motor.boot <- boot(motor, motor.fun, R=999, sim="parametric",

ran.gen=motor.gen, mle=motor.mle, x=xpoints)

Finally, the 90% basic bootstrap confidence limits are obtained by

mu.big <- predict(motor.big,xpoints)$y

mu <- predict(motor.smooth,xpoints)$y

ylims <- apply(motor.boot$t,2,quantile,c(0.05,0.95))

ytop <- mu - (ylims[1,]-mu.big)

ybot <- mu - (ylims[2,]-mu.big)

What is the effect of using a smaller smoothing parameter when calculating the residuals?

Try altering this code to apply the wild bootstrap, and see what effect it has on the results.
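A sketch of one such alteration, reusing the two-point distribution from the wild bootstrap of Section 11.6.1 (our adaptation, with the same mle dataframe as above):

motor.wild <- function(data, mle)
{ d <- data
  i <- 2*rbinom(nrow(data), size=1, prob=(5-sqrt(5))/10)-1
  d$accel <- mle$bigfit + mle$res*(1-i*sqrt(5))/2
  d }

motor.boot.wild <- boot(motor, motor.fun, R=999, sim="parametric",
     ran.gen=motor.wild, mle=motor.mle, x=xpoints)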

11.8 Time Series

Model-based resampling for time series is analogous to regression. We consider the sunspot data of Example 8.3, to which we fit the autoregressive model that minimizes AIC:

sun <- 2*(sqrt(sunspot+l)-l)

ts.plot(sun)

sun.ar <- ar(sun)

sun.ar$order

The best model is AR(9). How well determined is this, and what is the variance of the series average? We bootstrap to see, using


sun.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }

which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.

Our function for bootstrapping time series is tsboot. Here are results for fixed-block bootstraps with block length $l = 20$:

sun.1 <- tsboot(sun, sun.fun, R=99, l=20, sim="fixed")

tsplot(sun.1$t[1,3:291], main="Block simulation, l=20")

table(sun.1$t[,1])

var(sun.1$t[,2])

qqnorm(sun.1$t[,2])

The statistic for tsboot takes only one argument, the time series. The first plot here shows the results for a single replicate using block simulation: note the occasional big jumps in the resampled series. Note also the large variation in the orders of the fitted autoregressive models.

To obtain similar results for the stationary bootstrap with mean block length $l = 20$:

sun.2 <- tsboot(sun, sun.fun, R=99, l=20, sim="geom")

Are the results similar to having blocks of fixed length?

For model-based resampling we need to store results from the original model, and to make residuals from that fit:

sun.model <- list(order=c(sun.ar$order,0,0), ar=sun.ar$ar)

sun.res <- sun.ar$resid[!is.na(sun.ar$resid)]

sun.res <- sun.res - mean(sun.res)

sun.sim <- function(res, n.sim, ran.args)
{ rg1 <- function(n, res) sample(res, n, replace=T)
  ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim,
       rand.gen=rg1, res=as.vector(res))) }

sun.3 <- tsboot(sun.res, sun.fun, R=99, sim="model", n.sim=114,
     ran.gen=sun.sim, ran.args=list(ts=sun, model=sun.model))

Check the orders of the fitted models for this scheme. Are they similar to those obtained using the block schemes above?
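For instance, a sketch of the comparison:

table(sun.3$t[,1])

table(sun.1$t[,1])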

For "post-blackening" we need to define yet another function:

sun.black <- function(res, n.sim, ran.args)
{ ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }

sun.1b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="fixed",
     ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
     n.sim=length(sun))

Compare these results with those above, and try it with sim="geom".
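A sketch of the post-blackened stationary bootstrap, by analogy with the call above:

sun.2b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="geom",
     ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
     n.sim=length(sun))

table(sun.2b$t[,1])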

11.9 Improved Simulation

11.9.1 Balanced resampling

The balanced bootstrap is invoked via the sim argument to boot:

city.bal <- boot(city, city.fun, R=20, sim="balanced")

If strata is supplied, balancing takes place separately within each stratum.

11.9.2 Control methods

control applies the control methods, including post-simulation balance, to the output from an existing bootstrap simulation. For example,

control(city.boot, bias.adj=T)

produces the adjusted bias estimate, while

city.con <- control(city.boot)

gives a list consisting of the regression estimates of the empirical influence values, linear approximations to the bootstrap statistics, the control estimates of bias, variance, and the third cumulant of the $t^*$, control estimates of selected quantiles of the distribution of $t^*$, and a spline object that summarizes the approximate quantiles used to obtain the control quantile estimates. Saddlepoint approximation is used to obtain these approximate quantiles. Typing

city.con$L
city.con$bias
city.con$var
city.con$quantiles

gives some of the above-mentioned quantities. Arguments to control allow the user to specify the empirical influence values, the spline object, and other quantities to be used by control, if they are already available; see the help file for details.

11.9.3 Importance resampling

We have already met a use of nonparametric simulation with unequal probabilities in Section 11.4, using the weights argument to boot. The simplest form for weights, used there, is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. If we wish to perform importance resampling using several distributions, we can set them up and then perform the sampling as follows:

city.top <- exp.tilt(L=city.L, theta=2, t0=city.w(city))

city.bot <- exp.tilt(L=city.L, theta=1.2, t0=city.w(city))

city.tilt <- boot(city, city.fun, R=c(100,99),
     weights=rbind(city.top$p, city.bot$p))

which performs 100 simulations from the probabilities in city.top$p and 99 from the probabilities in city.bot$p. In the first two lines exp.tilt is used to solve the equation

$$ t_0 + \frac{\sum_j l_j \exp(\lambda l_j)}{\sum_j \exp(\lambda l_j)} = \theta, $$

corresponding to exponential tilting of the linear approximation to $t^*$ to be centred at $\theta = 2$ and 1.2. In the call to boot, R is a vector, and weights a matrix with length(R) rows and nrow(data) columns, corresponding to the length(R) distributions from which resampling takes place.

The importance sampling weights, moments, and selected quantiles of the resamples in city.tilt$t[,1] are calculated by

imp.weights(city.tilt)
imp.moments(city.tilt)
imp.quantile(city.tilt)

Each of these returns raw, ratio, and regression estimates of the corresponding quantities. Some other uses of importance resampling are exemplified by

imp.prob(city.tilt, t0=1.2, def=F)
z <- (city.tilt$t[,1]-city.tilt$t0[1])/sqrt(city.tilt$t[,2])
imp.quantile(boot.out=city.tilt, t=z)

The call to imp.prob calculates the importance sampling estimate of the probability that $t^* \le 1.2$, without using defensive mixture distributions (by default def=T, i.e. defensive mixture distributions are used to obtain the weights and estimates). The last two lines show how importance sampling is used to estimate quantiles of the studentized bootstrap statistic.

For more details and further arguments to the functions, see their help files.


Function tilt.boot

The description above relies on exponential tilting to obtain the resampling probabilities, and requires knowing where to tilt to. If this is difficult, tilt.boot can be used to avoid this, by performing an initial bootstrap with equal resampling probabilities, then using frequency smoothing to estimate appropriate tilted probabilities. For example,

city.tilt <- tilt.boot(city, city.fun, R=c(500,250,249))

performs 500 ordinary bootstraps, uses the results to estimate probability distributions tilted to the 0.025 and 0.975 points of the simulations, and then performs 250 bootstraps tilted to the 0.025 quantile and 249 tilted to the 0.975 quantile, before assigning the result to a bootstrap object. More complex uses of tilt.boot are possible; see its help file.

Importance re-weighting

These functions allow for importance re-weighting as well as importance sampling. For example, suppose that we require to re-weight the simulated values so that they appear to have been simulated from a distribution with expected ratio close to 1.4. We then use the q= option to the importance sampling functions as follows:

q <- smooth.f(theta=1.4, boot.out=city.tilt)
city.w(city, q)
imp.moments(city.tilt, q=q)
imp.quantile(city.tilt, q=q)

where the first line calculates the smoothed distribution, the second obtains the corresponding ratio, and the third and fourth obtain the moment and quantile estimates corresponding to simulation from the distribution q.

11.9.4 Saddlepoint methods

The function used for single saddlepoint approximation is saddle. Its simplest use is to obtain the PDF and CDF approximations for a linear statistic, such as the linear approximation $t + n^{-1}\sum f_j^* l_j$ to a general bootstrap statistic $t^*$. The same results are obtained by using the approximation $n^{-1}\sum f_j^* l_j$ to $t^* - t$, and this is what saddle does. To obtain the approximations at $t^* = 2$ for the city data, we set

saddle(A=city.L/nrow(city), u=2-city.w(city))

which returns the PDF and CDF approximations, and the value of $\hat\xi$.

The function saddle.distn returns the saddlepoint estimate of an entire distribution, using the terms $n^{-1}l_j$ in the random sum and an initial idea of the centre and scale for the distribution of $T^* - t$:


city.t0 <- c(0, sqrt(var.linear(city.L)))

city.sad <- saddle.distn(A=city.L/nrow(city), t0=city.t0)

city.sad

The Lugannani-Rice formula can be applied by setting LR=T in the calls to saddle and saddle.distn; by default LR=F.

For more sophisticated applications, the arguments A and u to saddle.distn can be replaced by functions. For example, the bootstrapped ratio can be defined through the estimating equation

$$ \sum_j f_j^*(x_j - t^* u_j) = 0, \qquad (11.1) $$

where the $f_j^*$ have a joint multinomial distribution with equal probabilities and denominator n = 10, the number of rows of city, as outlined in Example 9.16. Accordingly we set

city.t0 <- c(city.w(city), sqrt(var.linear(city.L)))

Afn <- function(t, data) data$x-t*data$u

ufn <- function(t, data) 0

saddle(A=Afn(2, city), u=0)

city.sad <- saddle.distn(A=Afn, u=ufn, t0=city.t0, data=city)

The penultimate line here gives the exact version of the call to saddle that started this section, while the last line calculates the saddlepoint approximation to the exact distribution of $T^*$. For saddle.distn the quantiles of the distribution of $T^*$ are estimated by obtaining the CDF approximation at a number of values of t, and then interpolating the CDF using a spline smoother. The range of values of t used is determined by the contents of t0, whose first value contains the original value of the statistic, and whose second value contains a measure of the spread of the distribution of $T^*$, such as its standard error.

Another use of saddle and saddle.distn is to give them directly the adjusted cumulant generating function $K(\xi) - t\xi$ and the second derivative $K''(\xi)$. For example, the city data above can be tackled as follows:

K.adj <- function(xi)
{ L <- city$x-city.t*city$u
  nrow(city)*log(sum(exp(xi*L))/nrow(city))-city.t*xi }

K2 <- function(xi)
{ L <- city$x-city.t*city$u
  p <- exp(L*xi)
  nrow(city)*(sum(L^2*p)/sum(p) - (sum(L*p)/sum(p))^2) }

city.t <- 2

saddle(K.adj=K.adj, K2=K2)


This is most useful when $K(\cdot)$ is not of the standard form that follows from a multinomial distribution.

Conditional approximations

Conditional saddlepoint approximation is applied by giving Afn and ufn more columns, and setting the wdist and type arguments to saddle appropriately. For example, suppose that we want to find the distribution of $T^*$, defined as the root of (11.1), but resampling 25 rather than 49 cases of bigcity. Then we set

bigcity.L <- (bigcity$x-city.w(bigcity)*bigcity$u)/
     mean(bigcity$u)

bigcity.t0 <- c(city.w(bigcity), sqrt(var.linear(bigcity.L)))

Afn <- function(t, data) cbind(data$x-t*data$u, 1)

ufn <- function(t, data) c(0,25)

saddle(A=Afn(1.4, bigcity), u=ufn(1.4, bigcity), wdist="p",
     type="cond")

city.sad <- saddle.distn(A=Afn, u=ufn, wdist="p", type="cond",
     data=bigcity, t0=bigcity.t0)

Here the wdist argument gives the distribution of the random variables $W_j$, which is Poisson in this case, and the type argument specifies that a conditional approximation is required. For resampling without replacement, see the help file. A further argument mu allows these variables to have differing means, in which case the conditional saddlepoint will correspond to sampling from multinomial or hypergeometric distributions with unequal probabilities.

11.10 Semiparametric Likelihoods

Basic functions only are provided for semiparametric likelihood inference.

To calculate and plot the log profile likelihood for the mean of a gamma model for the larger air-conditioning data (Example 10.1):

gam.L <- function(y, tmin=min(y)+0.1, tmax=max(y)-0.1, n.t)
{ gam.loglik <- function(l.nu, mu, y)
  { nu <- exp(l.nu)
    -sum(log(dgamma(nu*y/mu, nu)*nu/mu)) }
  out <- matrix(NA, n.t+1, 3)
  for (it in 0:n.t)
  { t <- tmin + (tmax-tmin)*it/n.t
    fit <- nlminb(0, gam.loglik, mu=t, y=y)
    out[1+it,] <- c(t, exp(fit$parameters), -fit$objective) }
  out }


air.gam <- gam.L(aircondit7$hours, 40, 120, 100)

air.gam[,3] <- air.gam[,3] - max(air.gam[,3])

plot(air.gam[,1], air.gam[,3], type="l", xlab="theta",
     ylab="Log likelihood", xlim=c(40,120))

abline(h=-0.5*qchisq(0.95,1), lty=2)

Empirical and empirical exponential family likelihoods are obtained using the functions EL.profile and EEF.profile. They are included in the library for demonstration purposes only, and are not intended for serious use, nor are they currently supported as part of the library. These functions give log likelihoods for the mean of their first argument, calculated at n.t values of $\theta$ from tmin to tmax. The output of EL.profile is an n.t × 3 matrix whose first column is the values of $\theta$, whose next column is the log profile likelihood, and whose final column is the values of the Lagrange multiplier. The output of EEF.profile is an n.t × 4 matrix whose first column is the values of $\theta$, whose next two columns are versions of the log profile likelihood (see Example 10.4), and whose final column is the values of the Lagrange multiplier. For example:

air.EL <- EL.profile(aircondit7$hours, tmin=40, tmax=120, n.t=100)

lines(air.EL[,1], air.EL[,2], lty=2)

air.EEF <- EEF.profile(aircondit7$hours, tmin=40, tmax=120,
     n.t=100)

lines(air.EEF[,1], air.EEF[,3], lty=3)

Note how close the two semiparametric log likelihoods are, compared to the parametric one. The practicals at the end of Chapter 10 give more examples of their use (and abuse).

More general (and more robust!) code to calculate empirical likelihoods is provided by Professor A. B. Owen at Stanford University; see World Wide Web reference http://playfair.stanford.edu/reports/owen/el.S.


APPENDIX A

Cumulant Calculations

In this book several chapters and some of the problems involve moment calculations, which are often simplified by using cumulants.

The cumulant-generating function of a random variable Y is

$$ K(t) = \log E(e^{tY}) = \sum_{s=1}^{\infty} \frac{t^s}{s!}\,\kappa_s, $$

where $\kappa_s$ is the sth cumulant, while the moment-generating function of Y is

$$ M(t) = E(e^{tY}) = \sum_{s=0}^{\infty} \frac{t^s}{s!}\,\mu_s', $$

where $\mu_s' = E(Y^s)$ is the sth moment. A simple example is a $N(\mu, \sigma^2)$ random variable, for which $K(t) = t\mu + \frac{1}{2}t^2\sigma^2$; note the appealing fact that its cumulants of order higher than two are zero. By equating powers of t in the expansions of K(t) and log M(t) we find that $\kappa_1 = \mu_1'$ and that

$$ \kappa_2 = \mu_2' - (\mu_1')^2, $$
$$ \kappa_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3, $$
$$ \kappa_4 = \mu_4' - 4\mu_3'\mu_1' - 3(\mu_2')^2 + 12\mu_2'(\mu_1')^2 - 6(\mu_1')^4, $$

with inverse formulae

$$ \mu_2' = \kappa_2 + (\kappa_1)^2, $$
$$ \mu_3' = \kappa_3 + 3\kappa_2\kappa_1 + (\kappa_1)^3, \qquad (A.1) $$
$$ \mu_4' = \kappa_4 + 4\kappa_3\kappa_1 + 3(\kappa_2)^2 + 6\kappa_2(\kappa_1)^2 + (\kappa_1)^4. $$

The cumulants $\kappa_1$, $\kappa_2$, $\kappa_3$, and $\kappa_4$ are the mean, variance, skewness, and kurtosis of Y.

For vector Y it is better to drop the power notation used above and to adopt index notation and the summation convention. In this notation Y has components $Y^1, \ldots, Y^n$ and we write $Y^iY^i$ and $Y^iY^iY^i$ for the square and cube of $Y^i$. The joint cumulant-generating function K(t) of $Y^1, \ldots, Y^n$ is the logarithm of their joint moment-generating function,

$$ \log E\{\exp(t_iY^i)\} = t_i\kappa^i + \tfrac{1}{2!}\,t_it_j\kappa^{i,j} + \tfrac{1}{3!}\,t_it_jt_k\kappa^{i,j,k} + \tfrac{1}{4!}\,t_it_jt_kt_l\kappa^{i,j,k,l} + \cdots, $$

where summation is implied over repeated indices, so that, for example,

$$ t_i\kappa^i = t_1\kappa^1 + \cdots + t_n\kappa^n, \qquad t_it_j\kappa^{i,j} = t_1t_1\kappa^{1,1} + t_1t_2\kappa^{1,2} + \cdots + t_nt_n\kappa^{n,n}. $$

Thus the n-dimensional normal distribution with means $\kappa^i$ and covariance matrix $\kappa^{i,j}$ has cumulant-generating function $t_i\kappa^i + \frac{1}{2}t_it_j\kappa^{i,j}$. We sometimes write $\kappa^{i,j} = \mathrm{cum}(Y^i, Y^j)$, $\kappa^{i,j,k} = \mathrm{cum}(Y^i, Y^j, Y^k)$ and so forth for the coefficients of $t_it_j$, $t_it_jt_k$ in K(t). The cumulant arrays $\kappa^{i,j}$, etc. are invariant to index permutation, so for example $\kappa^{1,2,3} = \kappa^{2,3,1}$.

A key feature that simplifies calculations with cumulants as opposed to moments is that cumulants involving two or more independent random variables are zero: for independent variables, $\kappa^{i,j} = \kappa^{i,j,k} = \cdots = 0$ unless all the indices are equal.

The above notation extends to generalized cumulants such as

$$ \mathrm{cum}(Y^iY^jY^k) = E(Y^iY^jY^k) = \kappa^{ijk}, $$
$$ \mathrm{cum}(Y^i, Y^jY^k) = \kappa^{i,jk}, \qquad \mathrm{cum}(Y^iY^j, Y^k, Y^l) = \kappa^{ij,k,l}, $$

which can be obtained from the joint cumulant-generating functions of $Y^iY^jY^k$, of $Y^i$ and $Y^jY^k$, and of $Y^iY^j$, $Y^k$, and $Y^l$. Note that ordinary moments can be regarded as generalized cumulants.

Generalized cumulants can be expressed in terms of ordinary cumulants by means of complementary set partitions, the most useful of which are given in Table A.1. For example, we use its second column to see that $\kappa^{ij} = \kappa^{i,j} + \kappa^i\kappa^j$, or

$$ E(Y^iY^j) = \mathrm{cum}(Y^iY^j) = \mathrm{cum}(Y^i, Y^j) + \mathrm{cum}(Y^i)\,\mathrm{cum}(Y^j), $$

more familiarly written $\mathrm{cov}(Y^i, Y^j) + E(Y^i)E(Y^j)$. The boldface 12 represents $\kappa^{12}$, while the 12 [1] and 1|2 [1] immediately below it represent $\kappa^{1,2}$ and $\kappa^1\kappa^2$. With this understanding we use the third column to see that $\kappa^{ijk} = \kappa^{i,j,k} + \kappa^{i,j}\kappa^k\,[3] + \kappa^i\kappa^j\kappa^k$, where [3] is shorthand for $\kappa^{i,j}\kappa^k + \kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$; this is the multivariate version of (A.1). Likewise $\kappa^{ij,k} = \kappa^{i,j,k} + \kappa^{i,k}\kappa^j\,[2]$, where the term $\kappa^{i,k}\kappa^j\,[2]$ on the right is understood in the context of the left-hand side to equal $\kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$: each index in the first block of the partition ij|k appears once with the index in the second block. The expression 123|4 [2] [2] in the fourth column of the table represents the partitions 123|4, 124|3, 134|2, 234|1.

To illustrate these ideas, we calculate $\mathrm{cov}\{\bar Y, (n-1)^{-1}\sum(Y_j - \bar Y)^2\}$, where $\bar Y = n^{-1}\sum Y_j$ is the average of the independent and identically distributed random variables $Y_1, \ldots, Y_n$. Note first that the covariance does not depend on the mean of the $Y_j$, so we can take $\kappa^i = 0$. We then express $\bar Y$ and $(n-1)^{-1}\sum(Y_j - \bar Y)^2$ in index notation as $a_iY^i$ and $b_{ij}Y^iY^j$, where $a_i = 1/n$ and $b_{ij} = (\delta_{ij} - 1/n)/(n-1)$, with

$$ \delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & \text{otherwise}, \end{cases} $$

the Kronecker delta symbol. The covariance is

$$ \mathrm{cum}(a_iY^i, b_{jk}Y^jY^k) = a_ib_{jk}\kappa^{i,jk} = a_ib_{jk}\kappa^{i,j,k} = na_1b_{11}\kappa^{1,1,1}, $$

the second equality following on use of Table A.1 because $\kappa^i = 0$, and the third equality following because the observations are independent and identically distributed. In power notation $\kappa^{1,1,1}$ is $\kappa_3$, the third cumulant of $Y_1$; so $\mathrm{cov}\{\bar Y, (n-1)^{-1}\sum(Y_j - \bar Y)^2\} = \kappa_3/n$. Similarly

$$ \mathrm{var}\{(n-1)^{-1}\sum(Y_j - \bar Y)^2\} = \mathrm{cum}(b_{ij}Y^iY^j, b_{kl}Y^kY^l) = b_{ij}b_{kl}\kappa^{ij,kl}, $$

which Table A.1 shows to be equal to $b_{ij}b_{kl}(\kappa^{i,j,k,l} + \kappa^{i,k}\kappa^{j,l} + \kappa^{i,l}\kappa^{j,k})$. This reduces to

$$ nb_{11}b_{11}\kappa^{1,1,1,1} + 2nb_{11}b_{11}\kappa^{1,1}\kappa^{1,1} + 2n(n-1)b_{12}b_{12}\kappa^{1,1}\kappa^{1,1}, $$

which in turn is $\kappa_4/n + 2(\kappa_2)^2/(n-1)$ in power notation. To perform this calculation using moments and power notation will convince the reader of the elegance and relative simplicity of cumulants and index notation.
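The identity can be checked numerically; here is a minimal simulation sketch, assuming unit-rate exponential data, for which $\kappa_2 = 1$ and $\kappa_4 = 6$:

n <- 10; nrep <- 50000
v <- numeric(nrep)
for (r in 1:nrep) v[r] <- var(rexp(n))
var(v)                # simulated variance of the sample variance
6/n + 2/(n-1)         # kappa_4/n + 2*kappa_2^2/(n-1)

The two final numbers should agree reasonably well.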

McCullagh (1987) makes a cogent more-extended case for these methods. His book includes more-extensive tables of complementary set partitions.


Table A.1  Complementary set partitions

Column 1:
1:        1 [1]

Column 2:
12:       12 [1]   1|2 [1]
1|2:      12 [1]

Column 3:
123:      123 [1]   12|3 [3]   1|2|3 [1]
12|3:     123 [1]   13|2 [2]
1|2|3:    123 [1]

Column 4:
1234:     1234 [1]   123|4 [4]   12|34 [3]   12|3|4 [6]   1|2|3|4 [1]
123|4:    1234 [1]   124|3 [3]   12|34 [3]   14|2|3 [3]
12|34:    1234 [1]   123|4 [2] [2]   13|24 [2]   13|2|4 [4]
12|3|4:   1234 [1]   134|2 [2]   13|24 [2]
1|2|3|4:  1234 [1]


Bibliography

Abelson, R. P. and Tukey, J. W. (1963) Efficient utilization of non-numerical information in quantitative analysis: general theory and the case of simple order. Annals of Mathematical Statistics 34, 1347-1369.

Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds B. N. Petrov and F. Czaki, pp. 267-281. Budapest: Akademiai Kiado. Reprinted in Breakthroughs in Statistics, volume 1, eds S. Kotz and N. L. Johnson, pp. 610-624. New York: Springer.

Akritas, M. G. (1986) Bootstrapping the Kaplan-Meier estimator. Journal of the American Statistical Association 81, 1032-1038.

Altman, D. G. and Andersen, P. K. (1989) Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8, 771-783.

Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer.

Andrews, D. F. and Herzberg, A. M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. New York: Springer.

Appleyard, S. T., Witkowski, J. A., Ripley, B. D., Shotton, D. M. and Dubowicz, V. (1985) A novel procedure for pattern analysis of features present on freeze fractured plasma membranes. Journal of Cell Science 74, 105-117.

Athreya, K. B. (1987) Bootstrap of the mean in the infinite variance case. Annals of Statistics 15, 724-731.

Atkinson, A. C. (1985) Plots, Transformations, and Regression. Oxford: Clarendon Press.

Bai, C. and Olshen, R. A. (1988) Discussion of “Theoretical comparison of bootstrap confidence intervals”, by P. Hall. Annals of Statistics 16, 953-956.

Bailer, A. J. and Oris, J. T. (1994) Assessing toxicity of pollutants in aquatic systems. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 25-40. New York: Wiley.

Banks, D. L. (1988) Histospline smoothing the Bayesian bootstrap. Biometrika 75, 673-684.

Barbe, P. and Bertail, P. (1995) The Weighted Bootstrap. Volume 98 of Lecture Notes in Statistics. New York: Springer.

Barnard, G. A. (1963) Discussion of “The spectral analysis of point processes”, by M. S. Bartlett. Journal of the Royal Statistical Society series B 25, 294.

Barndorff-Nielsen, O. E. and Cox, D. R. (1989) Asymptotic Techniques for Use in Statistics. London: Chapman & Hall.

Barndorff-Nielsen, O. E. and Cox, D. R. (1994) Inference and Asymptotics. London: Chapman & Hall.

Beran, J. (1994) Statistics for Long-Memory Processes. London: Chapman & Hall.

Beran, R. J. (1986) Simulated power functions. Annals of Statistics 14, 151-173.

Beran, R. J. (1987) Prepivoting to reduce level error of confidence sets. Biometrika 74, 457-468.

Beran, R. J. (1988) Prepivoting test statistics: a bootstrap view of asymptotic refinements. Journal of the American Statistical Association 83, 687-697.

Beran, R. J. (1992) Designing bootstrap prediction regions. In Bootstrapping and Related Techniques: Proceedings, Trier, FRG, 1990, eds K.-H. Jöckel, G. Rothe and W. Sendler, volume 376 of Lecture Notes in Economics and Mathematical Systems, pp. 23-30. New York: Springer.

Beran, R. J. (1997) Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49, to appear.

Berger, J. O. and Bernardo, J. M. (1992) On the development of reference priors (with Discussion). In Bayesian Statistics 4, eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, pp. 35-60. Oxford: Clarendon Press.

Besag, J. E. and Clifford, P. (1989) Generalized Monte Carlo significance tests. Biometrika 76, 633-642.

Besag, J. E. and Clifford, P. (1991) Sequential Monte Carlo p-values. Biometrika 78, 301-304.

Besag, J. E. and Diggle, P. J. (1977) Simple Monte Carlo tests for spatial pattern. Applied Statistics 26, 327-333.

Bickel, P. J. and Freedman, D. A. (1981) Some asymptotic theory for the bootstrap. Annals of Statistics 9, 1196-1217.

Bickel, P. J. and Freedman, D. A. (1983) Bootstrapping regression models with many parameters. In A Festschrift for Erich L. Lehmann, eds P. J. Bickel, K. A. Doksum and J. L. Hodges, pp. 28-48. Pacific Grove, California: Wadsworth & Brooks/Cole.

Bickel, P. J. and Freedman, D. A. (1984) Asymptotic normality and the bootstrap in stratified sampling. Annals of Statistics 12, 470-482.

Bickel, P. J., Götze, F. and van Zwet, W. R. (1997) Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 7, 1-32.

Bickel, P. J., Klassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.

Bickel, P. J. and Yahav, J. A. (1988) Richardson extrapolation and the bootstrap. Journal of the American Statistical Association 83, 387-393.

Bissell, A. F. (1972) A negative binomial model with varying element sizes. Biometrika 59, 435-441.

Bissell, A. F. (1990) How reliable is your capability index? Applied Statistics 39, 331-340.

Bithell, J. F. and Stone, R. A. (1989) On statistical methods for analysing the geographical distribution of cancer cases near nuclear installations. Journal of Epidemiology and Community Health 43, 79-85.

Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. New York: Wiley.

Boos, D. D. and Monahan, J. F. (1986) Bootstrap methods using prior information. Biometrika 73, 77-83.

Booth, J. G. (1996) Bootstrap methods for generalized linear mixed models with applications to small area estimation. In Statistical Modelling, eds G. U. H. Seeber, B. J. Francis, R. Hatzinger and G. Steckel-Berger, volume 104 of Lecture Notes in Statistics, pp. 43-51. New York: Springer.

Booth, J. G. and Butler, R. W. (1990) Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika 77, 787-796.

Booth, J. G., Butler, R. W. and Hall, P. (1994) Bootstrap methods for finite populations. Journal of the American Statistical Association 89, 1282-1289.

Booth, J. G. and Hall, P. (1994) Monte Carlo approximation and the iterated bootstrap. Biometrika 81, 331-340.

Booth, J. G., Hall, P. and Wood, A. T. A. (1992) Bootstrap estimation of conditional distributions. Annals of Statistics 20, 1594-1610.

Booth, J. G., Hall, P. and Wood, A. T. A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics 21, 286-298.

Bose, A. (1988) Edgeworth correction by bootstrap in autoregressions. Annals of Statistics 16, 1709-1722.

Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with Discussion). Journal of the Royal Statistical Society series B 26, 211-246.

Bratley, P., Fox, B. L. and Schrage, L. E. (1987) A Guide to Simulation. Second edition. New York: Springer.

Braun, W. J. and Kulperger, R. J. (1995) A Fourier method for bootstrapping time series. Preprint, Department of Mathematics and Statistics, University of Winnipeg.

Braun, W. J. and Kulperger, R. J. (1997) Properties of a Fourier bootstrap method for time series. Communications in Statistics — Theory and Methods 26, to appear.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees. Pacific Grove, California: Wadsworth & Brooks/Cole.

Breslow, N. (1985) Cohort analysis in epidemiology. In A Celebration of Statistics, eds A. C. Atkinson and S. E. Fienberg, pp. 109-143. New York: Springer.

Bretagnolle, J. (1983) Lois limites du bootstrap de certaines fonctionelles. Annales de l'Institut Henri Poincaré, Section B 19, 281-296.

Brillinger, D. R. (1981) Time Series: Data Analysis and Theory. Expanded edition. San Francisco: Holden-Day.

Brillinger, D. R. (1988) An elementary trend analysis of Rio Negro levels at Manaus, 1903-1985. Brazilian Journal of Probability and Statistics 2, 63-79.


Brillinger, D. R. (1989) Consistent detection of a monotonic trend superposed on a stationary time series. Biometrika 76, 23-30.

Brockwell, P. J. and Davis, R. A. (1991) Time Series: Theory and Methods. Second edition. New York: Springer.

Brockwell, P. J. and Davis, R. A. (1996) Introduction to Time Series and Forecasting. New York: Springer.

Brown, B. W. (1980) Prediction analysis for binary data. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 3-18. New York: Wiley.

Buckland, S. T. and Garthwaite, P. H. (1990) Algorithm AS 259: estimating confidence intervals by the Robbins-Monro search process. Applied Statistics 39, 413-424.

Bühlmann, P. and Künsch, H. R. (1995) The blockwise bootstrap for general parameters of a stationary time series. Scandinavian Journal of Statistics 22, 35-54.

Bunke, O. and Droge, B. (1984) Bootstrap and cross-validation estimates of the prediction error for linear regression models. Annals of Statistics 12, 1400-1424.

Burman, P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503-514.

Burr, D. (1994) A comparison of certain bootstrap confidence intervals in the Cox model. Journal of the American Statistical Association 89, 1290-1302.

Burr, D. and Doss, H. (1993) Confidence bands for the median survival time as a function of covariates in the Cox model. Journal of the American Statistical Association 88, 1330-1340.

Canty, A. J., Davison, A. C. and Hinkley, D. V. (1996) Reliable confidence intervals. Discussion of “Bootstrap confidence intervals”, by T. J. DiCiccio and B. Efron. Statistical Science 11, 214-219.

Carlstein, E. (1986) The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Annals of Statistics 14, 1171-1179.

Carpenter, J. R. (1996) Simulated confidence regions for parameters in epidemiological models. Ph.D. thesis, Department of Statistics, University of Oxford.

Chambers, J. M. and Hastie, T. J. (eds) (1992) Statistical Models in S. Pacific Grove, California: Wadsworth & Brooks/Cole.

Chao, M.-T. and Lo, S.-H. (1994) Maximum likelihood summary and the bootstrap method in structured finite populations. Statistica Sinica 4, 389-406.

Chapman, P. and Hinkley, D. V. (1986) The double bootstrap, pivots and confidence limits. Technical Report 34, Center for Statistical Sciences, University of Texas at Austin.

Chen, C., Davis, R. A., Brockwell, P. J. and Bai, Z. D. (1993) Order determination for autoregressive processes using resampling methods. Statistica Sinica 3, 481-500.

Chen, C.-H. and George, S. L. (1985) The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model. Statistics in Medicine 4, 39-46.

Chen, S. X. (1996) Empirical likelihood confidence intervals for nonparametric density estimation. Biometrika 83, 329-341.

Chen, S. X. and Hall, P. (1993) Smoothed empirical likelihood confidence intervals for quantiles. Annals of Statistics 21, 1166-1181.

Chen, Z. and Do, K.-A. (1994) The bootstrap method with saddlepoint approximations and importance resampling. Statistica Sinica 4, 407-421.

Cobb, G. W. (1978) The problem of the Nile: conditional solution to a changepoint problem. Biometrika 65, 243-252.

Cochran, W. G. (1977) Sampling Techniques. Third edition. New York: Wiley.

Collings, B. J. and Hamilton, M. A. (1988) Estimating the power of the two-sample Wilcoxon test for location shift. Biometrics 44, 847-860.

Cook, R. D., Hawkins, D. M. and Weisberg, S. (1992) Comparison of model misspecification diagnostics using residuals from least mean of squares and least median of squares fits. Journal of the American Statistical Association 87, 419-424.

Cook, R. D., Tsai, C.-L. and Wei, B. C. (1986) Bias in nonlinear regression. Biometrika 73, 615-623.

Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. London: Chapman & Hall.

Cook, R. D. and Weisberg, S. (1994) Transforming a response variable for linearity. Biometrika 81, 731-737.

Corcoran, S. A., Davison, A. C. and Spady, R. H. (1996) Reliable inference from empirical likelihoods. Preprint, Department of Statistics, University of Oxford.

Cowling, A., Hall, P. and Phillips, M. J. (1996) Bootstrap confidence regions for the intensity of a Poisson process. Journal of the American Statistical Association 91, 1516-1524.

Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman & Hall.

Cox, D. R. and Isham, V. (1980) Point Processes. London: Chapman & Hall.


Cox, D. R. and Lewis, P. A. W. (1966) The Statistical Analysis of Series of Events. London: Chapman & Hall.

Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman & Hall.

Cox, D. R. and Snell, E. J. (1981) Applied Statistics: Principles and Examples. London: Chapman & Hall.

Cressie, N. A. C. (1982) Playing safe with misweighted means. Journal of the American Statistical Association 77, 754-759.

Cressie, N. A. C. (1991) Statistics for Spatial Data. New York: Wiley.

Dahlhaus, R. and Janas, D. (1996) A frequency domain bootstrap for ratio statistics in time series analysis. Annals of Statistics 24, to appear.

Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes. New York: Springer.

Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25, 631-650.

Daniels, H. E. (1955) Discussion of “Permutation theory in the derivation of robust criteria and the study of departures from assumption”, by G. E. P. Box and S. L. Andersen. Journal of the Royal Statistical Society series B 17, 27-28.

Daniels, H. E. (1958) Discussion of “The regression analysis of binary sequences”, by D. R. Cox. Journal of the Royal Statistical Society series B 20, 236-238.

Daniels, H. E. and Young, G. A. (1991) Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 78, 169-179.

Davison, A. C. (1988) Discussion of the Royal Statistical Society meeting on the bootstrap. Journal of the Royal Statistical Society series B 50, 356-357.

Davison, A. C. and Hall, P. (1992) On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 79, 279-284.

Davison, A. C. and Hall, P. (1993) On Studentizing and blocking methods for implementing the bootstrap with dependent data. Australian Journal of Statistics 35, 215-224.

Davison, A. C. and Hinkley, D. V. (1988) Saddlepoint approximations in resampling methods. Biometrika 75, 417-431.

Davison, A. C., Hinkley, D. V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika 73, 555-566.

Davison, A. C., Hinkley, D. V. and Worton, B. J. (1992) Bootstrap likelihoods. Biometrika 79, 113-130.

Davison, A. C., Hinkley, D. V. and Worton, B. J. (1995) Accurate and efficient construction of bootstrap likelihoods. Statistics and Computing 5, 257-264.

Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 83-106. London: Chapman & Hall.

De Angelis, D. and Gilks, W. R. (1994) Estimating acquired immune deficiency syndrome incidence accounting for reporting delay. Journal of the Royal Statistical Society series A 157, 31-40.

De Angelis, D., Hall, P. and Young, G. A. (1993) Analytical and bootstrap approximations to estimator distributions in L1 regression. Journal of the American Statistical Association 88, 1310-1316.

De Angelis, D. and Young, G. A. (1992) Smoothing the bootstrap. International Statistical Review 60, 45-56.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). Journal of the Royal Statistical Society series B 39, 1-38.

Diaconis, P. and Holmes, S. (1994) Gray codes for randomization procedures. Statistics and Computing 4, 287-302.

DiCiccio, T. J. and Efron, B. (1992) More accurate confidence intervals in exponential families. Biometrika 79, 231-245.

DiCiccio, T. J. and Efron, B. (1996) Bootstrap confidence intervals (with Discussion). Statistical Science 11, 189-228.

DiCiccio, T. J., Hall, P. and Romano, J. P. (1989) Comparison of parametric and empirical likelihood functions. Biometrika 76, 465-476.

DiCiccio, T. J., Hall, P. and Romano, J. P. (1991) Empirical likelihood is Bartlett-correctable. Annals of Statistics 19, 1053-1061.

DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992a) Analytic approximations for iterated bootstrap confidence intervals. Statistics and Computing 2, 161-171.

DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992b) Fast and accurate approximate double bootstrap confidence intervals. Biometrika 79, 285-295.

DiCiccio, T. J., Martin, M. A. and Young, G. A. (1994) Analytical approximations to bootstrap distribution functions using saddlepoint methods. Statistica Sinica 4, 281-295.

DiCiccio, T. J. and Romano, J. P. (1988) A review of bootstrap confidence intervals (with Discussion). Journal of the Royal Statistical Society series B 50, 338-370. Correction, volume 51, p. 470.


DiCiccio, T. J. and Romano, J. P. (1989) On adjustments based on the signed root of the empirical likelihood ratio statistic. Biometrika 76, 447-456.

DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review 58, 59-76.

Diggle, P. J. (1983) Statistical Analysis of Spatial Point Patterns. London: Academic Press.

Diggle, P. J. (1990) Time Series: A Biostatistical Introduction. Oxford: Clarendon Press.

Diggle, P. J. (1993) Point process modelling in environmental epidemiology. In Statistics for the Environment, eds V. Barnett and K. F. Turkman, pp. 89-110. Chichester: Wiley.

Diggle, P. J., Lange, N. and Benes, F. M. (1991) Analysis of variance for replicated spatial point patterns in clinical neuroanatomy. Journal of the American Statistical Association 86, 618-625.

Diggle, P. J. and Rowlingson, B. S. (1994) A conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society series A 157, 433-440.

Do, K.-A. and Hall, P. (1991) On importance resampling for the bootstrap. Biometrika 78, 161-167.

Do, K.-A. and Hall, P. (1992a) Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulation for the bootstrap. Journal of the Royal Statistical Society series B 54, 595-607.

Do, K.-A. and Hall, P. (1992b) Quasi-random resampling for the bootstrap. Statistics and Computing 1, 13-22.

Dobson, A. J. (1990) An Introduction to Generalized Linear Models. London: Chapman & Hall.

Donegani, M. (1991) An adaptive and powerful randomization test. Biometrika 78, 930-933.

Doss, H. and Gill, R. D. (1992) An elementary approach to weak convergence for quantile processes, with applications to censored survival data. Journal of the American Statistical Association 87, 869-877.

Draper, N. R. and Smith, H. (1981) Applied Regression Analysis. Second edition. New York: Wiley.

Ducharme, G. R., Jhun, M., Romano, J. P. and Truong, K. N. (1985) Bootstrap confidence cones for directional data. Biometrika 72, 637-645.

Easton, G. S. and Ronchetti, E. M. (1986) General saddlepoint approximations with applications to L-statistics. Journal of the American Statistical Association 81, 420-430.

Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1-26.

Efron, B. (1981a) Nonparametric standard errors and confidence intervals (with Discussion). Canadian Journal of Statistics 9, 139-172.

Efron, B. (1981b) Censored data and the bootstrap. Journal of the American Statistical Association 76, 312-319.

Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Number 38 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.

Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78, 316-331.

Efron, B. (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81, 461-470.

Efron, B. (1987) Better bootstrap confidence intervals (with Discussion). Journal of the American Statistical Association 82, 171-200.

Efron, B. (1988) Computer-intensive methods in statistical regression. SIAM Review 30, 421-449.

Efron, B. (1990) More efficient bootstrap computations. Journal of the American Statistical Association 85, 79-89.

Efron, B. (1992) Jackknife-after-bootstrap standard errors and influence functions (with Discussion). Journal of the Royal Statistical Society series B 54, 83-127.

Efron, B. (1993) Bayes and likelihood calculations from confidence intervals. Biometrika 80, 3-26.

Efron, B. (1994) Missing data, imputation, and the bootstrap (with Discussion). Journal of the American Statistical Association 89, 463-479.

Efron, B., Halloran, M. E. and Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, USA 93, 13429-13434.

Efron, B. and Stein, C. M. (1981) The jackknife estimate of variance. Annals of Statistics 9, 586-596.

Efron, B. and Tibshirani, R. J. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy (with Discussion). Statistical Science 1, 54-96.

Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.

Efron, B. and Tibshirani, R. J. (1997) Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92, 548-560.

Fang, K. T. and Wang, Y. (1994) Number-Theoretic Methods in Statistics. London: Chapman & Hall.


Faraway, J. J. (1992) On the cost of data analysis. Journal of Computational and Graphical Statistics 1, 213-229.

Feigl, P. and Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826-838.

Feller, W. (1968) An Introduction to Probability Theory and its Applications. Third edition, volume I. New York: Wiley.

Fernholz, L. T. (1983) von Mises Calculus for Statistical Functionals. Volume 19 of Lecture Notes in Statistics. New York: Springer.

Ferretti, N. and Romo, J. (1996) Unit root bootstrap tests for AR(1) models. Biometrika 83, 849-860.

Field, C. and Ronchetti, E. M. (1990) Small Sample Asymptotics. Volume 13 of Lecture Notes — Monograph Series. Hayward, California: Institute of Mathematical Statistics.

Firth, D. (1991) Generalized linear models. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 55-82. London: Chapman & Hall.

Firth, D. (1993) Bias reduction of maximum likelihood estimates. Biometrika 80, 27-38.

Firth, D., Glosup, J. and Hinkley, D. V. (1991) Model checking with nonparametric curves. Biometrika 78, 245-252.

Fisher, N. I., Hall, P., Jing, B.-Y. and Wood, A. T. A. (1996) Improved pivotal methods for constructing confidence regions with directional data. Journal of the American Statistical Association 91, 1062-1070.

Fisher, N. I., Lewis, T. and Embleton, B. J. J. (1987) Statistical Analysis of Spherical Data. Cambridge: Cambridge University Press.

Fisher, R. A. (1935) The Design of Experiments. Edinburgh: Oliver and Boyd.

Fisher, R. A. (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3, 65-68.

Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. New York: Wiley.

Forster, J. J., McDonald, J. W. and Smith, P. W. F. (1996) Monte Carlo exact conditional tests for log-linear and logistic models. Journal of the Royal Statistical Society series B 58, 445-453.

Franke, J. and Härdle, W. (1992) On bootstrapping kernel spectral estimates. Annals of Statistics 20, 121-145.

Freedman, D. A. (1981) Bootstrapping regression models. Annals of Statistics 9, 1218-1228.

Freedman, D. A. (1984) On bootstrapping two-stage least-squares estimates in stationary linear models. Annals of Statistics 12, 827-842.

Freedman, D. A. and Peters, S. C. (1984a) Bootstrapping a regression equation: some empirical results. Journal of the American Statistical Association 79, 97-106.

Freedman, D. A. and Peters, S. C. (1984b) Bootstrapping an econometric model: some empirical results. Journal of Business & Economic Statistics 2, 150-158.

Freeman, D. H. (1987) Applied Categorical Data Analysis. New York: Marcel Dekker.

Frets, G. P. (1921) Heredity of head form in man. Genetica 3, 193-384.

García-Soidán, P. H. and Hall, P. (1997) On sample reuse methods for spatial data. Biometrics 53, 273-281.

Garthwaite, P. H. and Buckland, S. T. (1992) Generating Monte Carlo confidence intervals by the Robbins-Monro process. Applied Statistics 41, 159-171.

Gatto, R. (1994) Saddlepoint methods and nonparametric approximations for econometric models. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva.

Gatto, R. and Ronchetti, E. M. (1996) General saddlepoint approximations of marginal densities and tail probabilities. Journal of the American Statistical Association 91, 666-673.

Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association 70, 320-328.

Geisser, S. (1993) Predictive Inference: An Introduction. London: Chapman & Hall.

Geyer, C. J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association 86, 717-724.

Geyer, C. J. (1995) Likelihood ratio tests and inequality constraints. Technical Report 610, School of Statistics, University of Minnesota.

Gigli, A. (1994) Contributions to importance sampling and resampling. Ph.D. thesis, Department of Mathematics, Imperial College, London.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. London: Chapman & Hall.

Gleason, J. R. (1988) Algorithms for balanced bootstrap simulations. American Statistician 42, 263-266.

Gong, G. (1983) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Journal of the American Statistical Association 78, 108-113.

Page 573: Bootstrap Methods and Their Application

Bibliography 561

Götze, F. and Künsch, H. R. (1996) Second order correctness of the blockwise bootstrap for stationary observations. Annals of Statistics 24, 1914-1933.

Graham, R. L., Hinkley, D. V., John, P. W. M. and Shi, S. (1990) Balanced design of bootstrap simulations. Journal of the Royal Statistical Society series B 52, 185-202.

Gray, H. L. and Schucany, W. R. (1972) The Generalized Jackknife Statistic. New York: Marcel Dekker.

Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.

Gross, S. (1980) Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, pp. 181-184. Alexandria, Virginia: American Statistical Association.

Haldane, J. B. S. (1940) The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31, 346-355.

Hall, P. (1985) Resampling a coverage pattern. Stochastic Processes and their Applications 20, 231-246.

Hall, P. (1986) On the bootstrap and confidence intervals. Annals of Statistics 14, 1431-1452.

Hall, P. (1987) On the bootstrap and likelihood-based confidence regions. Biometrika 74, 481-493.

Hall, P. (1988a) Theoretical comparison of bootstrap confidence intervals (with Discussion). Annals of Statistics 16, 927-985.

Hall, P. (1988b) On confidence intervals for spatial parameters estimated from nonreplicated data. Biometrics 44, 271-277.

Hall, P. (1989a) Antithetic resampling for the bootstrap. Biometrika 76, 713-724.

Hall, P. (1989b) Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields 81, 247-273.

Hall, P. (1990) Pseudo-likelihood theory for empirical likelihood. Annals of Statistics 18, 121-140.

Hall, P. (1992a) The Bootstrap and Edgeworth Expansion. New York: Springer.

Hall, P. (1992b) On bootstrap confidence intervals in nonparametric regression. Annals of Statistics 20, 695-711.

Hall, P. (1995) On the biases of error estimators in prediction problems. Statistics and Probability Letters 24, 257-262.

Hall, P., DiCiccio, T. J. and Romano, J. P. (1989) On smoothing and the bootstrap. Annals of Statistics 17, 692-704.

Hall, P. and Horowitz, J. L. (1993) Corrections and blocking rules for the block bootstrap with dependent data. Technical Report SRI 1-93, Centre for Mathematics and its Applications, Australian National University.

Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995) On blocking rules for the bootstrap with dependent data. Biometrika 82, 561-574.

Hall, P. and Jing, B.-Y. (1996) On sample reuse methods for dependent data. Journal of the Royal Statistical Society series B 58, 727-737.

Hall, P. and Keenan, D. M. (1989) Bootstrap methods for constructing confidence regions for hands. Communications in Statistics — Stochastic Models 5, 555-562.

Hall, P. and La Scala, B. (1990) Methodology and algorithms of empirical likelihood. International Statistical Review 58, 109-28.

Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika 75, 661-671.

Hall, P. and Owen, A. B. (1993) Empirical likelihood confidence bands in density estimation. Journal of Computational and Graphical Statistics 2, 273-289.

Hall, P. and Titterington, D. M. (1989) The effect of simulation order on level accuracy and power of Monte Carlo tests. Journal of the Royal Statistical Society series B 51, 459-467.

Hall, P. and Wilson, S. R. (1991) Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762.

Hamilton, M. A. and Collings, B. J. (1991) Determining the appropriate sample size for nonparametric tests for location shift. Technometrics 33, 327-337.

Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. London: Methuen.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (eds) (1994) A Handbook of Small Data Sets. London: Chapman & Hall.

Härdle, W. (1989) Resampling for inference from curves. In Bulletin of the 47th Session of the International Statistical Institute, Paris, August 1989, volume 3, pp. 53-63.

Härdle, W. (1990) Applied Nonparametric Regression. Cambridge: Cambridge University Press.

Härdle, W. and Bowman, A. W. (1988) Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 102-110.

Page 574: Bootstrap Methods and Their Application

562 Bibliography

Härdle, W. and Marron, J. S. (1991) Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics 19, 778-796.

Hartigan, J. A. (1969) Using subsample values as typical values. Journal of the American Statistical Association 64, 1303-1317.

Hartigan, J. A. (1971) Error analysis by replaced samples. Journal of the Royal Statistical Society series B 33, 98-110.

Hartigan, J. A. (1975) Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. Annals of Statistics 3, 573-580.

Hartigan, J. A. (1990) Perturbed periodogram estimates of variance. International Statistical Review 58, 1-7.

Hastie, T. J. and Loader, C. (1993) Local regression: automatic kernel carpentry (with Discussion). Statistical Science 8, 120-143.

Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. London: Chapman & Hall.

Hayes, K. G., Perl, M. L. and Efron, B. (1989) Application of the bootstrap statistical method to the tau-decay-mode problem. Physical Review Series D 39, 274-279.

Heller, G. and Venkatraman, E. S. (1996) Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics 52, 1204-1213.

Hesterberg, T. C. (1988) Advances in importance sampling. Ph.D. thesis, Department of Statistics, Stanford University, California.

Hesterberg, T. C. (1995a) Tail-specific linear approximations for efficient bootstrap simulations. Journal of Computational and Graphical Statistics 4, 113-133.

Hesterberg, T. C. (1995b) Weighted average importance sampling and defensive mixture distributions. Technometrics 37, 185-194.

Hinkley, D. V. (1977) Jackknifing in unbalanced situations. Technometrics 19, 285-292.

Hinkley, D. V. and Schechtman, E. (1987) Conditional bootstrap methods in the mean-shift model. Biometrika 74, 85-93.

Hinkley, D. V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika 76, 435-446.

Hinkley, D. V. and Wang, S. (1991) Efficiency of robust standard errors for regression coefficients. Communications in Statistics — Theory and Methods 20, 1-11.

Hinkley, D. V. and Wei, B. C. (1984) Improvements of jackknife confidence limit methods. Biometrika 71, 331-339.

Hirose, H. (1993) Estimation of threshold stress in accelerated life-testing. IEEE Transactions on Reliability 42, 650-657.

Hjort, N. L. (1985) Bootstrapping Cox’s regression model. Technical Report NSF-241, Department of Statistics, Stanford University.

Hjort, N. L. (1992) On inference in parametric survival data models. International Statistical Review 60, 355-387.

Horváth, L. and Yandell, B. S. (1987) Convergence rates for the bootstrapped product-limit process. Annals of Statistics 15, 1155-1173.

Hosmer, D. W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley.

Hu, F. and Zidek, J. V. (1995) A bootstrap based on the estimating equations of the linear model. Biometrika 82, 263-275.

Huet, S., Jolivet, E. and Messean, A. (1990) Some simulations results about confidence intervals and bootstrap methods in nonlinear regression. Statistics 3, 369-432.

Hyde, J. (1980) Survival analysis with incomplete observations. In Biostatistics Casebook, eds R. G. Miller,B. Efron, B. W. Brown and L. E. Moses, pp. 31-46. New York: Wiley.

Janas, D. (1993) Bootstrap Procedures for Time Series. Aachen: Verlag Shaker.

Jennison, C. (1992) Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 89-97. New York: Springer.

Jensen, J. L. (1992) The modified signed likelihood statistic and saddlepoint approximations. Biometrika 79, 693-703.

Jensen, J. L. (1995) Saddlepoint Approximations. Oxford: Clarendon Press.

Jeong, J. and Maddala, G. S. (1993) A perspective on application of bootstrap methods in econometrics. In Handbook of Statistics, volume 11: Econometrics, eds G. S. Maddala, C. R. Rao and H. D. Vinod, pp. 573-610. Amsterdam: North-Holland.

Jing, B.-Y. and Robinson, J. (1994) Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Annals of Statistics 22, 1115-1132.


Page 575: Bootstrap Methods and Their Application

Bibliography 563

Jing, B.-Y. and Wood, A. T. A. (1996) Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics 24, 365-369.

Jöckel, K.-H. (1986) Finite sample properties and asymptotic efficiency of Monte Carlo tests. Annals of Statistics 14, 336-347.

Johns, M. V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association 83, 709-714.

Journel, A. G. (1994) Resampling from stochastic simulations (with Discussion). Environmental and Ecological Statistics 1, 63-91.

Kabaila, P. (1993a) Some properties of profile bootstrap confidence intervals. Australian Journal of Statistics 35, 205-214.

Kabaila, P. (1993b) On bootstrap predictive inference for autoregressive processes. Journal of Time Series Analysis 14, 473-484.

Kalbfleisch, J. D. and Prentice, R. L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.

Kaplan, E. L. and Meier, P. (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481.

Karr, A. F. (1991) Point Processes and their Statistical Inference. Second edition. New York: Marcel Dekker.

Katz, R. (1995) Spatial analysis of pore images. Ph.D. thesis, Department of Statistics, University of Oxford.

Kendall, D. G. and Kendall, W. S. (1980) Alignments in two-dimensional random sets of points. Advances in Applied Probability 12, 380-424.

Kim, J.-H. (1990) Conditional bootstrap methods for censored data. Ph.D. thesis, Department of Statistics, Florida State University.

Künsch, H. R. (1989) The jackknife and bootstrap for general stationary observations. Annals of Statistics 17, 1217-1241.

Lahiri, S. N. (1991) Second-order optimality of stationary bootstrap. Statistics and Probability Letters 11, 335-341.

Lahiri, S. N. (1995) On the asymptotic behaviour of the moving block bootstrap for normalized sums of heavy-tail random variables. Annals of Statistics 23, 1331-1349.

Laird, N. M. (1978) Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805-811.

Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals based on bootstrap samples (with Discussion). Journal of the American Statistical Association 82, 739-757.

Lawson, A. B. (1993) On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society series A 156, 363-377.

Lee, S. M. S. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. Annals of Statistics 23, 1301-1330.

Léger, C., Politis, D. N. and Romano, J. P. (1992) Bootstrap technology and applications. Technometrics 34, 378-398.

Léger, C. and Romano, J. P. (1990a) Bootstrap choice of tuning parameters. Annals of the Institute of Statistical Mathematics 42, 709-735.

Léger, C. and Romano, J. P. (1990b) Bootstrap adaptive estimation: the trimmed mean example. Canadian Journal of Statistics 18, 297-314.

Lehmann, E. L. (1986) Testing Statistical Hypotheses. Second edition. New York: Wiley.

Li, G. (1995) Nonparametric likelihood ratio estimation of probabilities for truncated data. Journal of the American Statistical Association 90, 997-1003.

Li, H. and Maddala, G. S. (1996) Bootstrapping time series models (with Discussion). Econometric Reviews 15, 115-195.

Li, K.-C. (1987) Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics 15, 958-975.

Liu, R. Y. and Singh, K. (1992a) Moving blocks jackknife and bootstrap capture weak dependence. In Exploring the Limits of Bootstrap, eds R. LePage and L. Billard, pp. 225-248. New York: Wiley.

Liu, R. Y. and Singh, K. (1992b) Efficiency and robustness in resampling. Annals of Statistics 20, 370-384.

Lloyd, C. J. (1994) Approximate pivots from M-estimators. Statistica Sinica 4, 701-714.

Lo, S.-H. and Singh, K. (1986) The product-limit estimator and the bootstrap: some asymptotic representations. Probability Theory and Related Fields 71, 455-465.

Loh, W.-Y. (1987) Calibrating confidence coefficients. Journal of the American Statistical Association 82, 155-162.

Mallows, C. L. (1973) Some comments on C_p. Technometrics 15, 661-675.

Mammen, E. (1989) Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Annals of Statistics 17, 382-400.

Mammen, E. (1992) When Does Bootstrap Work? Asymptotic Results and Simulations. Volume 77 of Lecture Notes in Statistics. New York: Springer.

Page 576: Bootstrap Methods and Their Application

564 Bibliography

Mammen, E. (1993) Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics 21, 255-285.

Manly, B. F. J. (1991) Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall.

Marriott, F. H. C. (1979) Barnard’s Monte Carlo tests: how many simulations? Applied Statistics 28, 75-77.

McCarthy, P. J. (1969) Pseudo-replication: half samples. Review of the International Statistical Institute 37, 239-264.

McCarthy, P. J. and Snowden, C. B. (1985) The Bootstrap and Finite Population Sampling. Vital and Health Statistics (Ser. 2, No. 95), Public Health Service Publication 85-1369. Washington, DC: United States Government Printing Office.

McCullagh, P. (1987) Tensor Methods in Statistics. London: Chapman & Hall.

McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. Second edition. London: Chapman & Hall.

McKay, M. D., Beckman, R. J. and Conover, W. J. (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239-245.

McKean, J. W., Sheather, S. J. and Hettmansperger, T. P. (1993) The use and interpretation of residuals based on robust estimation. Journal of the American Statistical Association 88, 1254-1263.

McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.

Milan, L. and Whittaker, J. (1995) Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44, 31-49.

Miller, R. G. (1974) The jackknife — a review. Biometrika 61, 1-15.

Miller, R. G. (1981) Survival Analysis. New York: Wiley.

Monti, A. C. (1997) Empirical likelihood confidence regions in time series models. Biometrika 84, 395-405.

Morgenthaler, S. and Tukey, J. W. (eds) (1991) Configural Polysampling: A Route to Practical Robustness. New York: Wiley.

Moulton, L. H. and Zeger, S. L. (1989) Analyzing repeated measures on generalized linear models via the bootstrap. Biometrics 45, 381-394.

Moulton, L. H. and Zeger, S. L. (1991) Bootstrapping generalized linear models. Computational Statistics and Data Analysis 11, 53-63.

Muirhead, C. R. and Darby, S. C. (1989) Royal Statistical Society meeting on cancer near nuclear installations. Journal of the Royal Statistical Society series A 152, 305-384.

Murphy, S. A. (1995) Likelihood-based confidence intervals in survival analysis. Journal of the American Statistical Association 90, 1399-1405.

Mykland, P. A. (1995) Dual likelihood. Annals of Statistics 23, 396-421.

Nelder, J. A. and Pregibon, D. (1987) An extended quasi-likelihood function. Biometrika 74, 221-232.

Newton, M. A. and Geyer, C. J. (1994) Bootstrap recycling: a Monte Carlo alternative to the nested bootstrap. Journal of the American Statistical Association 89, 905-912.

Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap (with Discussion). Journal of the Royal Statistical Society series B 56, 3-48.

Niederreiter, H. (1992) Random Number Generation and Quasi-Monte Carlo Methods. Number 63 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.

Nordgaard, A. (1990) On the resampling of stochastic processes using a bootstrap approach. Ph.D. thesis, Department of Mathematics, Linköping University, Sweden.

Noreen, E. W. (1989) Computer Intensive Methods for Testing Hypotheses: An Introduction. New York: Wiley.

Ogbonmwan, S.-M. (1985) Accelerated resampling codes with application to likelihood. Ph.D. thesis, Department of Mathematics, Imperial College, London.

Ogbonmwan, S.-M. and Wynn, H. P. (1986) Accelerated resampling codes with low discrepancy. Preprint, Department of Statistics and Actuarial Science, The City University.

Olshen, R. A., Biden, E. N., Wyatt, M. P. and Sutherland, D. H. (1989) Gait analysis and the bootstrap. Annals of Statistics 17, 1419-1440.

Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.

Owen, A. B. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90-120.

Owen, A. B. (1991) Empirical likelihood for linear models. Annals of Statistics 19, 1725-1747.

Owen, A. B. (1992a) Empirical likelihood and small samples. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 79-88. New York: Springer.

Owen, A. B. (1992b) A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society series B 54, 541-551.

Page 577: Bootstrap Methods and Their Application

Bibliography 565

Parzen, M. I., Wei, L. J. and Ying, Z. (1994) A resampling method based on pivotal estimating functions. Biometrika 81, 341-350.

Paulsen, O. and Heggelund, P. (1994) The quantal size at retinogeniculate synapses determined from spontaneous and evoked EPSCs in guinea-pig thalamic slices. Journal of Physiology 480, 505-511.

Percival, D. B. and Walden, A. T. (1993) Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge: Cambridge University Press.

Pitman, E. J. G. (1937a) Significance tests which may be applied to samples from any populations. Journal of the Royal Statistical Society, Supplement 4, 119-130.

Pitman, E. J. G. (1937b) Significance tests which may be applied to samples from any populations: II. The correlation coefficient test. Journal of the Royal Statistical Society, Supplement 4, 225-232.

Pitman, E. J. G. (1937c) Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322-335.

Plackett, R. L. and Burman, J. P. (1946) The design of optimum multifactorial experiments. Biometrika 33, 305-325.

Politis, D. N. and Romano, J. P. (1993) Nonparametric resampling for homogeneous strong mixing random fields. Journal of Multivariate Analysis 47, 301-328.

Politis, D. N. and Romano, J. P. (1994a) The stationary bootstrap. Journal of the American Statistical Association 89, 1303-1313.

Politis, D. N. and Romano, J. P. (1994b) Large sample confidence regions based on subsamples under minimal assumptions. Annals of Statistics 22, 2031-2050.

Possolo, A. (1986) Subsampling a random field. Technical Report 78, Department of Statistics, University of Washington, Seattle.

Presnell, B. and Booth, J. G. (1994) Resampling methods for sample surveys. Technical Report 470, Department of Statistics, University of Florida, Gainesville.

Priestley, M. B. (1981) Spectral Analysis and Time Series. London: Academic Press.

Proschan, F. (1963) Theoretical explanation of observed decreasing failure rate. Technometrics 5, 375-383.

Qin, J. (1993) Empirical likelihood in biased sample problems. Annals of Statistics 21, 1182-1196.

Qin, J. and Lawless, J. (1994) Empirical likelihood and general estimating equations. Annals of Statistics 22, 300-325.

Quenouille, M. H. (1949) Approximate tests of correlation in time-series. Journal of the Royal Statistical Society series B 11, 68-84.

Rao, J. N. K. and Wu, C. F. J. (1988) Resampling inference with complex survey data. Journal of the American Statistical Association 83, 231-241.

Rawlings, J. O. (1988) Applied Regression Analysis: A Research Tool. Pacific Grove, California: Wadsworth & Brooks/Cole.

Reid, N. (1981) Estimating the median survival time. Biometrika 68, 601-608.

Reid, N. (1988) Saddlepoint methods and statistical inference (with Discussion). Statistical Science 3, 213-238.

Reynolds, P. S. (1994) Time-series analyses of beaver body temperatures. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 211-228. New York: Wiley.

Ripley, B. D. (1977) Modelling spatial patterns (with Discussion). Journal of the Royal Statistical Society series B 39, 172-212.

Ripley, B. D. (1981) Spatial Statistics. New York: Wiley.

Ripley, B. D. (1987) Stochastic Simulation. New York: Wiley.

Ripley, B. D. (1988) Statistical Inference for Spatial Processes. Cambridge: Cambridge University Press.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Robinson, J. (1982) Saddlepoint approximations for permutation tests and confidence intervals. Journal of the Royal Statistical Society series B 44, 91-101.

Romano, J. P. (1988) Bootstrapping the mode. Annals of the Institute of Statistical Mathematics 40, 565-586.

Romano, J. P. (1989) Bootstrap and randomization tests of some nonparametric hypotheses. Annals of Statistics 17, 141-159.

Romano, J. P. (1990) On the behaviour of randomization tests without a group invariance assumption. Journal of the American Statistical Association 85, 686-692.

Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression and Outlier Detection. New York: Wiley.

Royall, R. M. (1986) Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54, 221-226.

Rubin, D. B. (1981) The Bayesian bootstrap. Annals of Statistics 9, 130-134.

Rubin, D. B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Page 578: Bootstrap Methods and Their Application

566 Bibliography

Rubin, D. B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 81, 366-374.

Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.

Samawi, H. M. (1994) Power estimation for two-sample tests using importance and antithetic resampling. Ph.D. thesis, Department of Statistics and Actuarial Science, University of Iowa, Ames.

Sauerbrei, W. and Schumacher, M. (1992) A bootstrap resampling procedure for model building: application to the Cox regression model. Statistics in Medicine 11, 2093-2109.

Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association 80, 360-361.

Seber, G. A. F. (1977) Linear Regression Analysis. New York: Wiley.

Shao, J. (1988) On resampling methods for variance and bias estimation in linear models. Annals of Statistics 16, 986-1008.

Shao, J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486-494.

Shao, J. (1996) Bootstrap model selection. Journal of the American Statistical Association 91, 655-665.

Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.

Shao, J. and Wu, C. F. J. (1989) A general theory for jackknife variance estimation. Annals of Statistics 17, 1176-1197.

Shorack, G. (1982) Bootstrapping robust regression. Communications in Statistics — Theory and Methods 11, 961-972.

Silverman, B. W. (1981) Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society series B 43, 97-99.

Silverman, B. W. (1985) Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with Discussion). Journal of the Royal Statistical Society series B 47, 1-52.

Silverman, B. W. and Young, G. A. (1987) The bootstrap: to smooth or not to smooth? Biometrika 74, 469-479.

Simonoff, J. S. and Tsai, C.-L. (1994) Use of modified profile likelihood for improved tests of constancy of variance in regression. Applied Statistics 43, 357-370.

Singh, K. (1981) On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics 9, 1187-1195.

Sitter, R. R. (1992) A resampling procedure for complex survey data. Journal of the American Statistical Association 87, 755-765.

Smith, P. W. F., Forster, J. J. and McDonald, J. W. (1996) Monte Carlo exact tests for square contingency tables. Journal of the Royal Statistical Society series A 159, 309-321.

Spady, R. H. (1991) Saddlepoint approximations for regression models. Biometrika 78, 879-889.

St. Laurent, R. T. and Cook, R. D. (1993) Leverage, local influence, and curvature in nonlinear regression. Biometrika 80, 99-106.

Stangenhaus, G. (1987) Bootstrap and inference procedures for L1 regression. In Statistical Data Analysis Based on the L1-Norm and Related Methods, ed. Y. Dodge, pp. 323-332. Amsterdam: North-Holland.

Stein, C. M. (1985) On the coverage probability of confidence sets based on a prior distribution. Volume 16 of Banach Centre Publications. Warsaw: PWN — Polish Scientific Publishers.

Stein, M. (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151.

Sternberg, H. O’R. (1987) Aggravation of floods in the Amazon River as a consequence of deforestation? Geografiska Annaler 69A, 201-219.

Sternberg, H. O'R. (1995) Water and wetlands of Brazilian Amazonia: an uncertain future. In The Fragile Tropics of Latin America: Sustainable Management of Changing Environments, eds T. Nishizawa and J. I. Uitto, pp. 113-179. Tokyo: United Nations University Press.

Stine, R. A. (1985) Bootstrap prediction intervals for regression. Journal of the American Statistical Association 80, 1026-1031.

Stoffer, D. S. and Wall, K. D. (1991) Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. Journal of the American Statistical Association 86, 1024-1033.

Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society series B 36, 111-147.

Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society series B 39, 44-47.

Swanepoel, J. W. H. and van Wyk, J. W. J. (1986) The bootstrap applied to power spectral density function estimation. Biometrika 73, 135-141.


Page 579: Bootstrap Methods and Their Application

Bibliography 567

Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Third edition. New York: Springer.

Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior densities by data augmentation (with Discussion). Journal of the American Statistical Association 82, 528-550.

Theiler, J., Galdrikian, B., Longtin, A., Eubank, S. and Farmer, J. D. (1992) Using surrogate data to detect nonlinearity in time series. In Nonlinear Modeling and Forecasting, eds M. Casdagli and S. Eubank, number XII in Santa Fe Institute Studies in the Sciences of Complexity, pp. 163-188. New York: Addison-Wesley.

Therneau, T. (1983) Variance reduction techniques for the bootstrap. Ph.D. thesis, Department of Statistics, Stanford University, California.

Tibshirani, R. J. (1988) Variance stabilization and the bootstrap. Biometrika 75, 433-444.

Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford: Clarendon Press.

Tsay, R. S. (1992) Model checking via parametric bootstraps in time series. Applied Statistics 41, 1-15.

Tukey, J. W. (1958) Bias and confidence in not quite large samples (Abstract). Annals of Mathematical Statistics 29, 614.

Venables, W. N. and Ripley, B. D. (1994) Modern Applied Statistics with S-Plus. New York: Springer.

Ventura, V. (1997) Likelihood inference by Monte Carlo methods and efficient nested bootstrapping. D.Phil. thesis, Department of Statistics, University of Oxford.

Ventura, V., Davison, A. C. and Boniface, S. J. (1997) Statistical inference for the effect of magnetic brain stimulation on a motoneurone. Applied Statistics 46, to appear.

Wahrendorf, J., Becher, H. and Brown, C. C. (1987) Bootstrap comparison of non-nested generalized linear models: applications in survival analysis and epidemiology. Applied Statistics 36, 72-81.

Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. London: Chapman & Hall.

Wang, S. (1990) Saddlepoint approximations in resampling analysis. Annals of the Institute of Statistical Mathematics 42, 115-131.

Wang, S. (1992) General saddlepoint approximations in the bootstrap. Statistics and Probability Letters 13, 61-66.

Wang, S. (1993a) Saddlepoint expansions in finite population problems. Biometrika 80, 583-590.

Wang, S. (1993b) Saddlepoint methods for bootstrap confidence bands in nonparametric regression. Australian Journal of Statistics 35, 93-101.

Wang, S. (1995) Optimizing the smoothed bootstrap. Annals of the Institute of Statistical Mathematics 47, 65-80.

Weisberg, S. (1985) Applied Linear Regression. Second edition. New York: Wiley.

Welch, B. L. and Peers, H. W. (1963) On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society series B 25, 318-329.

Welch, W. J. (1990) Construction of permutation tests. Journal of the American Statistical Association 85, 693-698.

Welch, W. J. and Fahey, T. J. (1994) Correcting for covariates in permutation tests. Technical Report STAT-94-12, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario.

Westfall, P. H. and Young, S. S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. New York: Wiley.

Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial and Engineering Chemistry 24, 1207-1214.

Wu, C. J. F. (1986) Jackknife, bootstrap and other resampling methods in regression analysis (with Discussion). Annals of Statistics 14, 1261-1350.

Wu, C. J. F. (1990) On the asymptotic properties of the jackknife histogram. Annals of Statistics 18, 1438-1452.

Wu, C. J. F. (1991) Balanced repeated replications based on mixed orthogonal arrays. Biometrika 78, 181-188.

Young, G. A. (1986) Conditioned data-based simulations: Some examples from geometrical statistics. International Statistical Review 54, 1-13.

Young, G. A. (1990) Alternative smoothed bootstraps. Journal of the Royal Statistical Society series B 52, 477-484.

Young, G. A. and Daniels, H. E. (1990) Bootstrap bias. Biometrika 77, 179-185.


Name Index

Abelson, R. P. 403

Akaike, H. 316

Akritas, M. G. 124

Altman, D. G. 375

Amis, G. 253

Andersen, P. K. 124, 128, 353, 375

Andrews, D. F. 360

Appleyard, S. T. 417

Athreya, K. B. 60

Atkinson, A. C. 183, 315, 325

Bai, C. 315

Bai, Z. D. 427

Bailer, A. J. 384

Banks, D. L. 515

Barbe, P. 60, 516

Barnard, G. A. 183

Barndorff-Nielsen, O. E. 183, 246, 486, 514

Becher, H. 379

Beckman, R. J. 486

Benes, F. M. 428

Beran, J. 426

Beran, R. J. 125, 183, 184, 187, 246, 250, 315

Berger, J. O. 515

Bernardo, J. M. 515

Bertail, P. 60, 516

Besag, J. E. 183, 184, 185

Bickel, P. J. 60, 123, 125, 129, 315, 487, 494

Biden, E. N. 316

Bissell, A. F. 253, 383, 497

Bithell, J. F. 428

Bloomfield, P. 426

Boniface, S. J. 418, 428

Boos, D. D. 515

Booth, J. G. 125, 129, 246, 247, 251, 374, 486, 487, 488, 491, 493

Borgan, Ø. 124, 128, 353

Bose, A. S. 427

Bowman, A. W. 375

Box, G. E. P. 323

Bratley, P. 486

Braun, W. J. 427, 430

Breiman, L. 316

Breslow, N. 378

Bretagnolle, J. 60

Brillinger, D. R. x, 388, 426, 427

Brockwell, P. J. 426, 427

Brown, B. W. 382

Brown, C. C. 379

Buckland, S. T. 246

Bühlmann, P. 427

Bunke, O. 316

Burman, J. P. 60

Burman, P. 316, 321

Burr, D. 124, 133, 374

Burns, E. 300

Butler, R. W. 125, 129, 487, 493

Canty, A. J. x, 135, 246

Carlstein, E. 427

Carpenter, J. R. 246, 250

Carroll, R. J. 310, 325

Chambers, J. M. 374, 375

Chao, M.-T. 125

Chapman, P. 60, 125

Chen, C. 427

Chen, C.-H. 375

Chen, S. X. 169, 514, 515

Chen, Z. 487

Claridge, G. 157

Clifford, P. 183, 184, 185

Cobb, G. W. 241

Cochran, W. G. 7, 125

Collings, B. J. 184

Conover, W. J. 486

Cook, R. D. 125, 315, 316, 375

Corcoran, S. A. 515

Cowling, A. 428, 432, 436

Cox, D. R. 124, 128, 183, 246, 287, 323, 324, 428, 486, 514

Cressie, N. A. C. 72, 428

Dahlhaus, R. 427, 431

Daley, D. J. 428

Daly, F. 68, 182, 436, 520

Daniels, H. E. 59, 486, 492

Darby, S. C. 428

Davis, R. A. 426, 427

Davison, A. C. 66, 135, 246, 316, 374, 427, 428, 486, 487, 492, 493, 515, 517, 518

De Angelis, D. 2, 60, 124, 316, 343

Demétrio, C. G. B. 338

Dempster, A. P. 124

Diaconis, P. 60, 486

DiCiccio, T. J. 68, 124, 246, 252, 253, 487, 493, 515, 516

Diggle, P. J. 183, 392, 423, 426, 428

Do, K.-A. 486, 487

Dobson, A. J. 374

Donegani, M. 184, 187

Doss, H. 124, 374

Draper, N. R. 315

Droge, B. 316

Dubowitz, V. 417

Ducharme, G. R. 126

Easton, G. S. 487

Efron, B. ix, 59, 60, 61, 66, 68, 123, 124, 125, 128, 130, 132, 133, 134, 183, 186, 246, 249, 252, 253, 308, 315, 316, 375, 427, 486, 488, 515

Embleton, B. J. J. 236, 506

Eubank, S. 427, 430, 435

Fahey, T. J. 185


Fang, K.-T. 486

Faraway, J. J. 125, 375

Farmer, J. D. 427, 430, 435

Feigl, P. 328

Feller, W. 320

Fernholtz, L. T. 60

Ferretti, N. 427

Field, C. 486

Firth, D. 374, 377, 383

Fisher, N. I. 236, 506, 515, 517

Fisher, R. A. 183, 186, 322

Fleming, T. R. 124

Forster, J. J. 183, 184

Fox, B. L. 486

Franke, J. 427

Freedman, D. A. 60, 125, 129, 315, 427

Freeman, D. H. 378

Frets, G. P. 115

Friedman, J. H. 316

Galdrikian, B. 427, 430, 435

Garcia-Soidan, P. H. 428

Garthwaite, P. H. 246

Gatto, R. x, 487

Geisser, S. 247, 316

George, S. L. 375

Geyer, C. J. 178, 183, 372, 486

Gigli, A. 486

Gilks, W. R. 2, 183, 343

Gill, R. D. 124, 128, 353

Gleason, J. R. 486, 488

Glosup, J. 383

Gong, G. 375

Götze, F. 60, 427

Graham, R. L. 486, 489

Gray, H. L. 59

Green, P. J. 375

Gross, S. 125

Haldane, J. B. S. 487

Hall, P. ix, x, 59, 60, 62, 124, 125, 129, 183, 246, 247, 248, 251, 315, 316, 321, 375, 378, 379, 427, 428, 429, 432, 436, 486, 487, 488, 491, 493, 514, 515, 516, 517

Halloran, M. E. 246

Hamilton, M. A. 184

Hammersley, J. M. 486

Hampel, F. R. 60

Hand, D. J. 68, 182, 436, 520

Handscomb, D. C. 486

Härdle, W. 316, 375, 427

Harrington, D. P. 124

Hartigan, J. A. 59, 60, 427, 430

Hastie, T. J. 374, 375

Hawkins, D. M. 316

Hayes, K. G. 123

Heggelund, P. 189

Heller, G. 374

Herzberg, A. M. 360

Hesterberg, T. C. 60, 66, 486, 490, 491

Hettmansperger, T. P. 316

Hinkley, D. V. 60, 63, 66, 125, 135, 183, 246, 247, 250, 318, 383, 486, 487, 489, 490, 492, 493, 515, 517, 518

Hirose, H. 347, 381

Hjort, N. L. 124, 374

Holmes, S. 60, 246, 486

Horowitz, J. L. 427, 429

Horváth, L. 374

Hosmer, D. W. 361

Hu, F. 318

Huet, S. 375

Hyde, J. 131

Isham, V. 428

Janas, D. 427, 431

Jennison, C. 183, 184, 246

Jensen, J. L. 486

Jeong, J. 315

Jhun, M. 126

Jing, B.-Y. 427, 429, 487, 515, 517

Jöckel, K.-H. 183

John, P. W. M. 486, 489

Johns, M. V. 486, 490

Jolivet, E. 375

Jones, M. C. x, 128

Journel, A. G. 428

Kabaila, P. 246, 250, 427

Kalbfleisch, J. D. 124

Kaplan, E. L. 124

Karr, A. F. 428

Katz, R. 282

Keenan, D. M. 428

Keiding, N. 124, 128, 353

Kendall, D. G. 124

Kendall, W. S. 124

Kim, J.-H. 124

Klaassen, C. A. J. 123

Kulperger, R. J. 427, 430

Künsch, H. R. 427

Lahiri, S. N. 427

Laird, N. M. 124, 125

Lange, N. 428

La Scala, B. 514

Lawless, J. 514

Lawson, A. B. 428

Lee, S. M. S. 246

Leger, C. 125

Lehmann, E. L. 183

Lemeshow, S. 361

Leroy, A. M. 315

Lewis, P. A. W. 428

Lewis, T. 236, 506

Li, G. 514

Li, H. 427

Li, K.-C. 316

Liu, R. Y. 315, 427

Lloyd, C. J. 515

Lo, S.-H. 125, 374

Loader, C. 375

Loh, W.-Y. 183, 246

Longtin, A. 427, 430, 435

Louis, T. A. 125

Lunn, A. D. 68, 182, 436, 520

Maddala, G. S. 315, 427

Mallows, C. L. 316

Mammen, E. 60, 315, 316

Manly, B. F. J. 183

Marriott, F. H. C. 183

Marron, J. S. 375

Martin, M. A. 125, 183, 246, 251, 487, 493

McCarthy, P. J. 59, 60, 125

McConway, K. J. 68, 182, 436, 520

McCullagh, P. 66, 374, 553

McDonald, J. W. 183, 184


McKay, M. D. 486

McKean, J. W. 316

McLachlan, G. J. 375

Meier, P. 124

Messean, A. 375

Milan, L. 125

Miller, R. G. 59, 84

Monahan, J. F. 515

Monti, A. C. 514

Morgenthaler, S. 486

Moulton, L. H. 374, 376, 377

Muirhead, C. R. 428

Murphy, S. A. 515

Mykland, P. A. 515

Nelder, J. A. 374

Newton, M. A. 178, 183, 486, 515

Niederreiter, H. 486

Nordgaard, A. 427

Noreen, E. W. 184

Oakes, D. 124, 128

Ogbonmwan, S.-M. 486, 515

Olshen, R. A. 315, 316

Oris, J. T. 384

Ostrowski, E. 68, 182, 436, 520

Owen, A. B. 486, 514, 515, 550

Parzen, M. I. 250

Paulsen, O. 189

Peers, H. W. 515

Percival, D. B. 426

Perl, M. L. 123

Peters, S. C. 315, 427

Phillips, M. J. 428, 432, 436

Pitman, E. J. G. 183

Plackett, R. L. 60

Politis, D. N. 60, 125, 427, 429

Possolo, A. 428

Pregibon, D. 374

Prentice, R. L. 124

Presnell, B. 125, 129

Priestley, M. B. 426

Proschan, F. 4, 218

Qin, J. 514

Quenouille, M. H. 59

Raftery, A. E. 515

Rao, J. N. K. 125, 130

Rawlings, J. O. 356

Reid, N. 124, 486

Reynolds, P. S. 435

Richardson, S. 183

Ripley, B. D. x, 183, 282, 315, 316, 361, 374, 375, 417, 428, 486

Ritov, Y. 123

Robinson, J. 486, 487

Romano, J. P. 60, 124, 125, 126, 183, 246, 427, 429, 515, 516

Romo, J. 427

Ronchetti, E. M. 60, 486, 487

Rousseeuw, P. J. 60, 315

Rowlingson, B. S. 428

Royall, R. M. 63

Rubin, D. B. 124, 125, 515

Ruppert, D. 310, 325

St. Laurent, R. T. 375

Samawi, H. M. 184

Sauerbrei, W. 375

Schechtman, E. 66, 247, 486, 487

Schenker, N. 246, 515

Schrage, L. E. 486

Schucany, W. R. 59

Schumacher, M. 375

Seber, G. A. F. 315

Shao, J. 60, 125, 246, 315, 316, 375, 376

Sheather, S. J. 316

Shi, S. 183, 246, 250, 486, 489, 490

Shorack, G. 316

Shotton, D. M. 417

Silverman, B. W. 124, 128, 189, 363, 375

Simonoff, J. S. 269

Singh, K. 246, 315, 374, 427

Sitter, R. R. 125, 129

Smith, H. 315

Smith, P. W. F. 183, 184

Snell, E. J. 287, 324, 374

Snowden, C. B. 125

Spady, R. H. 487, 515

Spiegelhalter, D. J. 183

Stahel, W. A. 60

Stangenhaus, G. 316

Starke, H. R. 277

Stein, C. M. 60, 515

Stein, M. 486

Steinour, H. H. 277

Sternberg, H. O'R. x, 388, 389, 427

Stine, R. A. 315

Stoffer, D. S. 427

Stone, C. J. 316

Stone, M. 316

Stone, R. A. 428

Sutherland, D. H. 316

Swanepoel, J. W. H. 427

Tanner, M. A. 124, 125

Theiler, J. 427, 430, 435

Therneau, T. 486

Tibshirani, R. J. ix, 60, 125, 246, 316, 375, 427, 515

Titterington, D. M. 183

Tong, H. 394, 426

Truong, K. N. 126

Tsai, C.-L. 269, 375

Tsay, R. S. 427

Tu, D. 60, 125, 246, 376

Tukey, J. W. 59, 403, 486

van Wyk, J. W. J. 427

van Zwet, W. R. 60

Venables, W. N. 282, 315, 361, 374, 375

Venkatraman, E. S. 374

Ventura, V. x, 428, 486, 492

Vere-Jones, D. 428

Wahrendorf, J. 379

Walden, A. T. 426

Wall, K. D. 427

Wand, M. P. 128

Wang, S. 124, 318, 486, 487

Wang, Y. 486

Wei, B. C. 63, 375, 487

Wei, L. J. 250

Weisberg, S. 125, 257, 315, 316

Welch, B. L. 515

Welch, W. J. 183, 185

Wellner, J. A. 123

Westfall, P. H. 184

Whittaker, J. 125

Wilson, S. R. 378, 379


Witkowski, J. A. 417

Wong, W. H. 124

Wood, A. T. A. 247, 251, 486, 488, 491, 515, 517

Woods, H. 277

Worton, B. J. 486, 487, 493, 515, 517, 518

Wu, C. J. F. 60, 125, 130, 315, 316

Wyatt, M. P. 316

Wynn, H. P. 515

Yahav, J. A. 487, 494

Yandell, B. S. 374

Ying, Z. 250

Young, G. A. x, 59, 60, 124, 128, 246, 316, 428, 486, 487, 493

Young, S. S. 184

Zeger, S. L. 374, 376, 377

Zelen, M. 328

Zidek, J. V. 318


Example index

accelerated life test, 346, 379

adaptive test, 187, 188

AIDS data, 1, 342, 369

air-conditioning data, 4, 15, 17, 19, 25, 27, 30, 33, 36, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508, 513, 520

Amis data, 253

AML data, 83, 86, 146, 160, 187

antithetic bootstrap, 493

association, 421, 422

autoregression, 388, 391, 393, 398, 432, 434

average, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 92, 94, 98, 128, 501, 508, 513, 516

axial data, 234, 505

balanced bootstrap, 440, 441, 442, 445, 487, 488, 489, 494

Bayesian bootstrap, 513, 518, 520

beaver data, 434

bias estimation, 106, 440, 464, 466, 488, 492, 495

binomial data, 338, 359, 361

bivariate missing data, 90, 128

block bootstraps, 398, 401, 403, 432

bootstrap likelihood, 508, 517, 518

bootstrap recycling, 464, 466, 492, 496

brambles data, 422

Breslow data, 378

calcium uptake data, 355, 441, 442

capability index, 248, 253, 497

carbon monoxide data, 67

cats data, 321

caveolae data, 416, 425

CD4 data, 68, 134, 190, 251, 252, 254

cement data, 277

changepoint estimation, 241

Channing House data, 131

circular data, 126, 517, 520

city population data, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 439, 440, 447, 464, 473, 490, 492, 513

Claridge data, 157, 158, 496

cloth data, 382

coal-mining data, 435

comparison of means, 159, 162, 163, 166, 171, 172, 176, 181, 186, 454, 457, 519

comparison of variable selection methods, 306

convex regression, 371

correlation coefficient, 48, 61, 63, 68, 80, 90, 108, 115, 157, 158, 187, 247, 251, 254, 475, 493, 496, 518

correlogram, 388

Darwin data, 186, 188, 471, 481, 498

difference of means, 71, 75

dogs data, 187

Downs’ syndrome data, 371

double bootstrap, 176, 177, 224, 226, 254, 464, 466, 469

ducks data, 134

eigenvalue, 64, 134, 252, 277, 445, 447

empirical likelihood, 501, 516, 519, 520

empirical exponential family likelihood, 505, 516, 520

equal marginal distributions, 78

exponential mean, 15, 17, 19, 30, 61, 176, 224, 250, 510

exponential model, 188, 328, 334, 367

factorial experiment, 320, 322

fir seedlings data, 142

Frets’ heads data, 115, 447

gamma model, 5, 25, 36, 148, 207, 233, 247, 376

generalized additive model, 367, 369, 371, 382, 383

generalized linear model, 328, 334, 338, 342, 367, 376, 378, 381, 383

gravity data, 72, 121, 131, 454, 457, 494, 519

handedness data, 157, 496

hazard ratio, 221

head size data, 115

heart disease data, 378

hypergeometric distribution, 487

importance sampling, 454, 457, 461, 464, 466, 489, 490, 491, 495

imputation, 88, 90

independence, 177

influence function, 48, 53

intensity estimate, 418

Islay data, 520

isotonic regression, 371

jackknife, 51, 64, 65, 317

jackknife-after-bootstrap, 115, 130, 134, 313, 325

K-function, 416, 422

kernel density estimate, 226, 413, 469

kernel intensity estimate, 418, 431

laterite data, 234, 505

leukaemia data, 328, 334, 367

likelihood ratio statistic, 62, 148, 247, 346, 501

linear approximation, 118, 468, 490

logistic regression, 141, 146, 338, 359, 361, 371, 376


log-linear model, 342, 369

lognormal model, 66, 148

low birth weights data, 361

lynx data, 432

maize data, 181

mammals data, 257, 262, 265, 324

MCMC, 146, 184, 185

matched pairs, 186, 187, 188, 492

mean, see average

mean polar axis, 234, 505

median, see sample median

median survival time, 86

melanoma data, 352

misclassification error, 359, 361, 381

missing data, 88, 90, 128

mixed continuous-discrete distributions, 78

model selection, 304, 306, 393, 432

motorcycle impact data, 363, 365

multinomial distribution, 66, 487

multiple regression, 276, 277, 281, 286, 287, 298, 300, 304, 306, 309, 313

neurophysiological point process data, 418

neurotransmission data, 189

Nile data, 241

nitrofen data, 383

nodal involvement data, 381

nonlinear regression, 355, 441, 442

nonlinear time series, 393, 401

nonparametric regression, 365

normal plot, 150, 152, 154

normal prediction limit, 244

normal variance, 208

nuclear power data, 286, 298, 304, 323

one-way model, 276, 319, 320

overdispersion, 142, 338, 342

paired comparison, 471, 481, 498

partial correlation, 115

Paulsen data, 189

periodogram, 388

periodogram resampling, 413, 430

PET film data, 346, 379

phase scrambling, 410, 430, 435

point process data, 416, 418, 421

poisons data, 322

Poisson process, 416, 418, 422, 425, 431, 435

Poisson regression, 342, 369, 378, 382, 383

prediction, 244, 286, 287, 323, 324, 342

prediction error, 298, 300, 320, 321, 359, 361, 369, 381, 393, 401

product-limit estimator, 86, 128

proportional hazards, 146, 160, 221, 352

quantile, 48, 253, 352

quartzite data, 520

ratio, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 98, 108, 110, 113, 118, 126, 127, 165, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513

regression, see convex regression, generalized additive model, generalized linear model, logistic regression, log-linear model, multiple regression, nonlinear regression, nonparametric regression, robust regression, straight-line regression

regression prediction, 286, 287, 298, 300, 320, 321, 323, 324, 342, 359, 361, 369, 381

reliability data, 346, 379

remission data, 378

returns data, 269, 272, 449, 461

Richardson extrapolation, 494

Rio Negro data, 388, 398, 403, 410

robust M-estimate, 318, 471, 483

robust regression, 308, 309, 313, 318, 324

robust variance, 265, 318, 376

rock data, 281, 287

saddlepoint approximation, 468, 469, 471, 473, 475, 477, 481, 483, 492, 493, 497

salinity data, 309, 313, 324

sample maximum, 39, 56, 247

sample median, 41, 61, 65, 80

sample variance, 61, 62, 64, 104, 481

separate families test, 148

several samples, 72, 126, 131, 133, 519

simulated data, 306

smoothed bootstrap, 80, 127, 168, 169, 418, 431

spatial association, 421, 422

spatial clustering, 416

spatial epidemiology, 421

spectral density estimation, 413

spherical data, 126, 234, 505

spline model, 365

stationary bootstrap, 398, 403, 428, 429

straight-line regression, 257, 262, 265, 269, 272, 308, 317, 321, 322, 449, 461

stratified ratio, 98

Strauss process, 416, 425

studentized statistic, 477, 481, 483

sugar cane data, 338

sunspot data, 393, 401, 435

survival probability, 86, 131

survival proportion data, 308, 322

survival time data, 328, 334, 346, 352, 367

survivor functions, 83

symmetric distribution, 78, 251, 471, 483

tau particle data, 133, 495

test of correlation, 157

test for overdispersion, 142, 184

test for regression coefficient, 269, 281, 313

test of interaction, 322

tile resampling, 425, 432

times on delivery suite data, 300

traffic data, 253

transformation, 33, 108, 118, 169, 226, 322, 355, 418

trend test in time series, 403, 410

trimmed average, 64, 121, 130, 133

tuna data, 169, 228, 469

two-sample problem, see comparison of means

two-way model, 177, 184, 338


unimodality, 168, 169, 189

unit root test, 391

urine data, 359

variable selection, 304, 306

variance estimation, 208, 446, 464, 488, 495

Weibull model, 346, 379

weighted average, 72, 126, 131

weird bootstrap, 128

Wilcoxon test, 181

wild bootstrap, 272, 319

wool prices data, 391


Subject index

abc.ci, 536

ABC method, see confidence interval

Abelson-Tukey coefficients, 403

accelerated life test example, 346, 379

adaptive estimation, 120-123, 125, 133

adaptive test, 173-174, 184, 187, 188

aggregate prediction error, 290-301, 316, 320-321, 358-362

AIDS data example, 1, 342, 369

air conditioning data example, 4, 15, 17, 19, 25, 27, 30, 33, 36, 149, 188, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508-512

Akaike’s information criterion, 316, 394, 432

algorithms

K-fold adjusted cross-validation, 295

balanced bootstrap, 439, 488

balanced importance resampling, 460, 491

Bayesian bootstrap, 513

case resampling in regression, 264

comparison of generalized linear and generalized additive models, 367

conditional bootstrap for censored data, 84

conditional resampling for censored survival data, 351

double bootstrap for bias adjustment, 104

inhomogeneous Poisson process, 431

model-based resampling in linear regression, 262

phase scrambling, 408

prediction in generalized linear models, 341

prediction in linear regression, 285

resampling errors with unequal variances, 271

resampling for censored survival data, 351

stationary bootstrap, 428

superpopulation bootstrap, 94

all-subsamples method, 57

AML data example, 83, 86, 146, 160, 187, 221

analysis of deviance, 330-331, 367-369

ancillary statistics, 43, 238, 241

antithetic bootstrap, 493

apparent error, 292

assessment set, 292

autocorrelation, 386, 431

autoregressive process, 386, 388, 389, 392, 395, 398, 399, 400, 401, 410, 414, 432, 433, 434

simulation, 390-391

autoregressive-moving average process, 386, 408

average, 4, 8, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 90, 92, 94, 98, 129, 130, 197, 199, 203, 205, 207, 209, 216, 251, 501, 508, 512, 513, 516, 518, 520

comparison of several, 163

comparison of two, 159, 162, 166, 171, 172, 186, 454, 457, 519

finite population, 94, 98, 129

Bayesian bootstrap, 512-514, 515, 518, 520

BCa method, see confidence interval

beaver data example, 434

bias correction, 103-107

bias estimator, 16-18

adjusted, 104, 106-107, 130, 442, 464, 466, 492

balanced resampling, 440

bias of, 103

post-simulation balance, 488

sensitivity analysis for, 114, 117

binary data, 78, 359-362, 376, 377, 378

binomial data, 338

binomial process, 416

bivariate distribution, 78, 90-92, 128

block resampling, 396-408, 427, 428, 432

boot, 525

..., 528, 538, 548

"balanced", 545

m, 538

mle, 528, 538, 540, 543

"parametric", 534

ran.gen, 528, 538, 540, 543

sim, 529, 534

statistic, 527, 528

strata, 531

stype, 527

weights, 527, 536, 546

boot.array, 526

boot.ci, 536

bootstrap

adjustment, 103-107, 125, 130, 175-180, 223-230

antithetic, 487, 493

asymptotic accuracy, 39-41, 211-214

balanced, 438-446, 486, 494-499

algorithm, 439, 488

bias estimate, 438-440, 488

conditions for success, 445

efficiency, 445, 460, 461, 495

experimental design, 441, 486, 489

first-order, 439, 486, 487-488

575

Page 588: Bootstrap Methods and Their Application

576 Subject index

higher-order, 441, 486, 489

theory, 443-445, 487

balanced importance resampling, 460-463, 486, 496

Bayesian, 512-514, 515, 518, 520

block, 396-408, 427, 428, 433

calibration, 246

case resampling, 84

consistency, 37-39

conditional, 84, 124, 132, 351, 374, 474

discreteness, 27, 61

double, 103-113, 122, 125, 130, 175-180, 223-230, 254, 373, 463-466, 469, 486, 497, 507-509

theory for, 105-107, 125

generalized, 56

hierarchical, 100-102, 125, 130, 288

imputation, 89-92, 124-125

jittered, 124

mirror-match, 93, 125, 129

model-based, 349, 433, 434

nested, see double

nonparametric, 22

parametric, 15-21, 261, 333, 334, 339, 344, 347, 373, 378, 379, 416, 528, 534

population, 94, 125, 129

post-blackened, 397, 399, 433

post-simulation balance, 441-445, 486, 488, 495

quantile, 18-21, 36, 69, 441, 442, 448-450, 453-456, 457-463, 468, 490

recycling, 463-466, 486, 492, 496

robustness, 264

shrunk smoothed, 79, 81, 127

simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248

smoothed, 79-81, 124, 127, 168,169, 310, 418, 431, 531

spectral, 412-415, 427, 430

stationary, 398-408, 427, 428-429, 433

stratified, 89, 90, 306, 340, 344, 365, 371, 457, 494, 531

superpopulation, 94, 125, 129

symmetrized, 78, 122, 169, 471, 485

tile, 424-426, 428, 432

tilted, 166-167, 452-456, 459, 462, 546-547

weighted, 60, 514, 516

weird, 86-87, 124, 128, 132

wild, 272-273, 316, 319, 538

bootstrap diagnostics, 113-120, 125

bias function, 108, 110, 464-465

jackknife-after-bootstrap, 113-118, 532

linearity, 118-120

variance function, 107-111, 464-465

bootstrap frequencies, 22-23, 66, 76, 110-111, 438-445, 464, 526, 527

bootstrap likelihood, 507-509, 515, 517, 518

bootstrap recycling, 463-466, 487, 492, 497, 508

bootstrap test, see significance test

Box-Cox transformation, 118

brambles data example, 422

Breslow estimator, 350

calcium uptake data example, 355, 441, 442

capability index example, 248, 253, 497

carbon monoxide data, 67

cats data example, 321

caveolae data example, 416, 425

CD4 data example, 68, 134, 190, 251, 252, 254

cement data example, 277

censboot, 532, 541

censored data, 82-87, 124, 128, 131, 160, 346-353, 514, 532, 541

changepoint model, 241

Channing House data example, 131

choice of estimator, 120-123, 125, 134

choice of predictor, 301-305

choice of test statistic, 173, 180, 184, 187

circular data, 126, 517, 520

city population data example, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 249, 440, 447, 464, 473, 490, 492, 513

Claridge data example, 157, 158, 496

cloth data example, 382

coal-mining data example, 435

collinearity, 276-278

complementary set partitions, 552, 554

complete enumeration, 27, 60, 438, 440, 486

conditional inference, 43, 138, 145, 238-243, 247, 251

confidence band, 375, 417, 420, 435

confidence interval

ABC, 214-220, 231, 246, 511, 536

BCa, 203-213, 246, 249, 336-337, 383, 536

basic bootstrap, 28-29, 194-195, 199, 213-214, 337, 365, 374, 383, 435

coefficient, 191

comparison of methods, 211-214, 230-233, 246, 336-338

conditional, 238-243, 247, 251

double bootstrap, 223-230, 250, 254, 374, 469

normal approximation, 14, 194, 198, 337, 374, 383, 435

percentile method, 202-203, 213-214, 336-337, 352, 383

profile likelihood, 196, 346

studentized bootstrap, 29-31, 95, 125, 194-196, 199, 212, 227-228, 231, 233, 246, 248, 250, 336-337, 391, 449, 454, 483-485

test inversion, 220-223, 246

confidence limits, 193

confidence region, 192, 231-237, 504-506

consistency, 13

contingency table, 177, 183, 184, 342

control, 545

control methods, 446-450, 486

bias estimation, 446-448, 496

efficiency, 447, 448, 450, 462

importance resampling weight, 456


linear approximation, 446, 486, 495

quantile estimation, 446-450, 461-463, 486, 495-496

saddlepoint approximation, 449

variance estimation, 446-448, 495

Cornish-Fisher expansion, 40, 211, 449

correlation estimate, 48, 61, 63, 68, 69, 80, 90-92, 108, 115-116, 134, 138, 157, 158, 247, 251, 254, 266, 475, 493

correlogram, 386, 389

partial, 386, 389

coverage process, 428

cross-validation, 153, 292-295, 296-301, 303, 305-307, 316, 320, 321, 324, 360-361, 365, 377, 381

K-fold, 294-295, 316, 320, 324, 360-361, 381

cumulant-generating function, 66, 466, 467, 472, 479, 551-553

approximate, 476-478, 482, 492

paired comparison test, 492

cumulants, 551-553

approximate, 476

generalized, 552

cumulative hazard function, 82, 83, 86, 350

Darwin data example, 186, 188, 471, 481, 498

defensive mixture distribution, 457-459, 462, 464, 486, 496

delivery suite data example, 300

delta method, 45-46, 195, 227, 233, 419, 432, see also nonparametric delta method

density estimate, see kernel density estimate

deviance, 330-331, 332, 335, 367-369, 370, 373, 378, 382

deviance residuals, see regression residuals

diagnostics, see bootstrap diagnostics

difference of means, see average, comparison of two

directional data, 126, 234, 505, 515, 517, 520

dirty data, 44

discreteness effects, 26-27, 61

dispersion parameter, 327, 328, 331, 339, see also overdispersion

distribution

F, 331, 368

t, 81, 331, 484

Bernoulli, 376, 378, 381, 474, 475

beta, 187, 248, 377

beta-binomial, 338, 377

binomial, 86, 128, 327, 333, 338, 377

bivariate normal, 63, 80, 91, 108, 128

Cauchy, 42, 81

chi-squared, 139, 142, 163, 233, 234, 237, 303, 330, 335, 368, 373, 378, 382, 484, 500, 501, 503, 504, 505, 506

defensive mixture, see defensive mixture distribution

Dirichlet, 513, 518

double exponential, 516

empirical, see empirical distribution function, empirical exponential family

exponential, 4, 81, 82, 130, 132, 176, 188, 197, 203, 205, 224, 249, 328, 334, 336, 430, 491, 503, 521

exponential family, 504-507, 516

gamma, 5, 131, 149, 207, 216, 230, 233, 247, 328, 332, 334, 376, 503, 512, 513, 521

geometric, 398, 428

hypergeometric, 444, 487

least-favourable, 206, 209

lognormal, 66, 149, 336

multinomial, 66, 111, 129, 443, 446, 452, 468, 473, 491, 492, 493, 501, 502, 517, 519

multivariate normal, 445, 552

negative binomial, 337, 344, 345, 371

normal, 10, 150, 152, 154, 208, 244, 327, 485, 488, 489, 518, 551

Poisson, 327, 332, 333, 337, 342, 344, 345, 370, 378, 382, 383, 416, 419, 431, 473, 474, 493, 516

posterior, see posterior distribution

prior, see prior distribution

slash, 485

tilted, see exponential tilting

Weibull, 346, 379

dogs data example, 187

Downs’ syndrome data example, 371

ducks data example, 134

Edgeworth expansion, 39-41, 60, 408, 476-477

EEF.profile, 550

eigenvalue example, 64, 134, 252, 278, 445, 447

eigenvector, 505

EL.profile, 550

empinf, 530

empirical Bayes, 125

empirical distribution function, 11-12, 60-61, 128, 501, 508

as model, 108

marginal, 267

missing data, 89-91

residuals, 77, 181, 261

several, 71, 75

smoothed, 79-81, 127, 169, 227, 228

symmetrized, 78, 122, 165, 169, 228, 251

tilted, 166-167, 183, 209-210, 452-456, 459, 504

empirical exponential family likelihood, 504-506, 515, 516, 520

empirical influence values, 46-47, 49, 51-53, 54, 63, 64, 65, 75, 209, 210, 452, 461, 462, 476, 517

generalized linear models, 376

linear regression, 260, 275, 317

numerical approximation of, 47, 51-53, 76

several samples, 75, 127, 210

see also influence values

empirical likelihood, 500-504, 509, 512, 514-515, 516, 517, 519, 520

empirical likelihood ratio statistic, 501, 503, 506, 515

envelope test, see graphical test


equal marginal distributions example, 78

error rate, 137, 153, 174, 175

estimating function, 50, 63, 105, 250, 318, 329, 470-471, 478, 483, 504, 505, 514, 516

excess error, 292, 296

exchangeability, 143, 145

expansion

Cornish-Fisher, 40, 211, 449

cubic, 475-478

Edgeworth, 39-41, 60, 411, 476-478, 487

linear, 47, 51, 69, 75, 76, 118, 443, 446, 468

notation, 39

quadratic, 50, 66, 76, 443

Taylor series, 45, 46

experimental design

relation to resampling, 58, 439, 486

exponential mean example, 15, 17, 19, 30, 61, 176, 250, 510

exponential quantile plot, 5, 188

exponential tilting, 166-167, 183, 209-210, 452-454, 456-458, 461-463, 492, 495, 504, 517, 535, 546, 547

exp.tilt, 535

factorial experiment, 320, 322

finite population sampling, 92-100, 125, 128, 129, 130, 474

fir seedlings data, 142

Fisher information, 193, 206, 349, 516

Fourier frequencies, 387

Fourier transform, 387

empirical, 388, 408, 430

fast, 388

inverse, 387

frequency array, 23, 52, 443

frequency smoothing, 110, 456, 462, 463, 464-465, 496, 508

Frets’ heads data example, 115, 447

gamma model, 5, 25, 62, 131, 149, 207, 216, 233, 247, 376

generalized additive model, 366-371, 375, 382, 383

generalized likelihood ratio, 139

generalized linear model, 327-346, 368, 369, 374, 376-377, 378, 381-384, 516

comparison of resampling schemes for, 336-338

graphical test, 150-154, 183, 188, 416, 422

gravity data example, 72, 121, 131, 150, 152, 154, 162, 163, 166, 171, 172, 454, 457, 494, 519

Greenwood’s formula, 83, 128

half-sample methods, 57-59, 125

handedness data example, 157, 158, 496

hat matrix, 258, 275, 278, 318, 330

hazard function, 82, 146-147, 221-222, 350

heads data example, see Frets’ heads data example

heart disease data example, 378

heteroscedasticity, 259-260, 264, 269, 270-271, 307, 318, 319, 323, 341, 363, 365

hierarchical data, 100-102, 125, 130, 251-253, 287-289, 374

Huber M-estimate, see robust M-estimate example

hypergeometric distribution, 487

hypothesis test, see significance test

implied likelihood, 511-512, 515, 518

imp.moments, 546

importance resampling, 450-466, 486, 491, 497

balanced, 460-463

algorithm, 491

efficiency, 461, 462

efficiency, 452, 458, 461, 462, 486

improved estimators, 456-460

iterated bootstrap confidence intervals, 486

quantile estimation, 453-456, 457, 495

ratio estimator, 456, 459, 464, 486, 490

raw estimator, 459, 464, 486

regression, 486

regression estimator, 457, 459, 464, 486, 491

tail probability estimate, 452, 455

time series, 486

weights, 451, 455, 456-457, 458, 464

importance sampling, 450-452, 489

efficiency, 452, 456, 459, 460, 462, 489

identity, 116, 451, 463

misapplication, 453

quantile estimate, 489, 490

ratio estimator, 490

raw estimator, 451

regression estimator, 491

tail probability estimate, 453

weight, 451

imp.prob, 546

imp.quantile, 546

imputation, 88, 90

imp.weights, 546

incomplete data, 43-44, 88-92

index notation, 551-553

infinitesimal jackknife, see nonparametric delta method

influence functions, 46-50, 60, 63-64

chain rule, 48

correlation, 48

covariance, 316, 319

eigenvalue, 64

estimating equation, 50, 63

least squares estimates, 260, 317

M-estimation, 318

mean, 47, 316

moments, 48, 63

multiple samples, 74-76, 126

quantile, 48

ratio of means, 49, 65, 126

regression, 260, 317, 319

studentized statistic, 63

trimmed mean, 64

two-sample t statistic, 454

variance, 64

weighted mean, 126

information distance, 165-166


integration

number-theoretic methods, 486

interaction example, 322

interpolation of quantiles, 195

Islay data example, 520

isotonic regression example, 371

iterative weighted least squares, 329

jackknife, 50-51, 59, 64, 65, 76, 115

delete-m, 56, 60, 493

for least squares estimates, 317

for sample surveys, 125

infinitesimal, see nonparametric delta method

multi-sample, 76

jack.after.boot, 532

jackknife-after-bootstrap, 113-118, 125, 133, 134, 135, 308, 313, 322, 325, 369

parametric model, 116-118, 130

Jacobian, 470, 479

K-function, 416, 424

Kaplan-Meier estimate, see product-limit estimate

kernel density estimate, 19-20, 79, 124, 127, 168-170, 189, 226-230, 251, 413, 469, 507, 511, 514

kernel intensity estimate, 419-421, 431, 435

kernel smoothing, 110, 363, 364, 375

kriging, 428

Kronecker delta symbol, 412, 443, 553

Lagrange multiplier, 165-166, 502, 504, 515, 516

Laplace’s method, 479, 481

laterite data example, 234, 505

Latin hypercube sampling, 486

Latin square design, 489

least squares estimates, 258, 275, 392

penalized, 364

weighted, 271, 278, 329

length-biased data, 514

leukaemia data example, 328, 334, 367

leverage, 258, 271, 275, 278, 330, 370,377

likelihood, 137

adjusted, 500, 512, 515

based on confidence sets, 509-512

bootstrap, 507-509

combination of, 500, 519

definition, 499

dual, 515

empirical, 500-506

implied, 511-512

multinomial-based, 165-166, 186, 500-509

parametric, 347, 499-500

partial, 350, 507

pivot-based, 510-511, 512, 515

profile, 62, 206, 248, 347, 501, 515, 519

quasi, 332, 344

likelihood ratio statistic, 62, 137, 138, 139, 148, 196, 234, 247, 330, 347, 368, 373, 380, 499-501

signed, 196

linear.approx, 530

linear approximation, see nonparametric delta method

linear predictor, 327, 366

residuals, 331

linearity diagnostics, 118-120, 125

link function, 327, 332, 367

location-scale model, 77, 126

logistic regression example, 141, 146, 338, 359, 361, 371, 376, 378, 381, 474

logit, 338, 372

loglinear model, 177, 184, 342, 369

lognormal model, 66, 149

log rank test, 160

long-range dependence, 408, 410, 426

low birth weights data example, 361

lowess, 363, 369

lunch

nonexistence of free, 437

lynx data example, 432

M-estimate, 311-313, 316, 318, 471, 483, 515

maize data example, 181

mammals data example, 257, 262, 265, 324

Markov chain, 144, 429

Markov chain Monte Carlo, 143-147, 183, 184, 385, 428

matched-pair data example, 186, 187, 188, 471, 481, 492, 498

maximum likelihood estimate

bias-corrected, 377

generalized linear model, 329

nonparametric, 165-166, 186, 209, 501, 516

mean, see average

mean polar axis example, 234, 505

median, see sample median

median absolute deviation, 311

median survival time example, 86, 124

melanoma data example, 352

misclassification error, 358-362, 375, 378, 381

misclassification rate, 359

missing data, 88-92, 125, 128

mixed continuous-discrete distribution example, 78

mode estimator, 124

model selection, 301-307, 316, 375, 393, 427, 432

model-based resampling, 389-396, 427, 433

modified sample size, 93

moment-generating function, 551, 552

Monte Carlo test, 140-147, 151-154, 183, 184

motoneurone firing, 418

motorcycle impact data example, 363, 365

moving average process, 386

multiple imputation, 89-91, 125, 128

multiplicative bias, 62

multiplicative model, 77, 126, 328, 335

negative binomial model, 337, 344

Nelson-Aalen estimator, 83, 86, 128

nested bootstrap, see double bootstrap

nested data, see hierarchical data


neurophysiological data example, 418, 428

Nile data example, 241

nitrofen data example, 383

nodal involvement data example, 381

nonlinear regression, 353-358

nonlinear time series, 393-396, 401, 410, 426

nonparametric delta method, 46-50, 75

balanced bootstrap, 443-444

cubic approximation, 475-478

linear approximation, 47, 51, 52, 60, 69, 76, 118, 126, 127, 205, 261, 443, 454, 468, 487, 488, 490, 492

control variate, 446

importance resampling, 452

tilted, 490

quadratic approximation, 50, 79, 212, 215, 443, 487, 490

variance approximation, 47, 50, 63, 64, 75, 76, 108, 120, 199, 260, 261, 265, 275, 312, 318, 319, 376, 477, 478, 483

nonparametric maximum likelihood, 165-166, 186, 209, 501

nonparametric regression, 362-373, 375, 382, 383

normal prediction limit, 244

normal quantile plot test, 150

notation, 9-10

nuclear power data example, 286, 298, 304, 323

null distribution, 137

null hypothesis, 136

one-way model example, 208, 276, 319, 320

outliers, 27, 307-308, 363

overdispersion, 327, 332, 338-339, 343-344, 370, 382

test for, 142

paired comparison, see matched-pair data

parameter transformation, see transformation of statistic

partial autocorrelation, 386

partial correlation example, 115

periodogram, 387-389, 408, 430

resampling, 412-415, 427, 430

permutation test, 141, 146, 156-160, 173, 183, 185-186, 266, 279, 422, 486, 492

for regression slope, 266, 378

saddlepoint approximation, 475, 487

PET film data example, 346, 379

phase scrambling, 408-412, 427, 430, 435

pivot, 29, 31, 33, 510-511, see also studentized statistic

point process, 415-426, 427-428

poisons data example, 322

Poisson process, 416-422, 425, 428, 431-432, 435

Poisson regression example, 342, 369, 378, 382, 383

posterior distribution, 499, 510, 513, 515, 520

power notation, 551-553

prediction error, 244, 375, 378

K-fold cross-validation estimate, 293-295, 298-301, 316, 320, 324, 358-362, 381

0.632 estimator, 298, 316, 324, 358-362, 381

adjusted cross-validation estimate, 295, 298-301, 316, 324, 358-362

aggregate, 290-301, 320, 321, 324, 358-362

apparent, 292, 298-301, 320, 324, 381

bootstrap estimate, 295-301, 316, 324, 358-362, 381

comparison of estimators, 300-301

cross-validation estimate, 292-293, 298-301, 320, 324, 358-362, 381

generalized linear model, 340-346

leave-one-out bootstrap estimate, 297, 321

time series, 393-396, 401, 427

prediction limits, 243-245, 251, 284-289, 340-346, 369-371

prediction rule, 290, 358, 359

prior distribution, 499, 510, 513

product factorial moments, 487

product-limit estimator, 82-83, 87, 124, 128, 350, 351, 352, 515

profile likelihood, 62, 206, 248, 347, 501, 515, 519

proportional hazards model, 146, 160, 221, 350-353, 374

P-value, 137, 138, 141, 148, 158, 161, 437

adjusted, 175-180, 183, 187

importance sampling, 452, 454, 459

quadratic approximation, see nonparametric delta method

quantile estimator, 18-21, 48, 80, 86, 124, 253, 352

quartzite data example, 520

quasi-likelihood, 332, 344

random effects model, see hierarchical data

random walk model, 391

randomization test, 183, 492, 498

randomized block design, 489

ratio

in finite population sampling, 98

stratified sampling for, 98

ratio estimate

in finite population sampling, 95

ratio example, 6, 13, 22, 30, 49, 52, 53, 54, 62, 66, 98, 108, 110, 113, 118, 126, 127, 165, 178, 186, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513

recycling, see bootstrap recycling

regression

L1, 124, 311, 312, 316, 325

case deletion, 317, 377

case resampling, 264-266, 269, 275, 277, 279, 312, 333, 355, 364

convex, 372

design, 260, 261, 263, 264, 276, 277, 305

generalized additive, 366-371, 375, 382, 383


generalized linear, 327-346, 374, 376, 377, 378, 381, 382, 383

isotonic, 371

least trimmed squares, 308, 311, 313, 325

linear, 256-325, 434

local, 363, 367, 375

logistic, 141, 146, 338, 371, 376, 378, 381, 474

loglinear, 342, 369, 383

M-estimation, 311-313, 316, 318

many covariates with, 275-277

model-based resampling, 261-264, 267, 270-272, 275, 276, 279, 280, 312, 333-335, 346-351, 364-365

multiple, 273-307

no intercept, 263, 317

nonconstant variance, 270-273

nonlinear, 353-358, 375, 441, 442

nonparametric, 362-373, 375, 427

Poisson, 337, 342, 378, 382, 383, 473, 504, 516

prediction, 284-301, 315, 316, 323, 324, 340-346, 369

repeated design points in, 263

resampling moments, 262

residuals, see residuals

resistant, 308

robust, 307-314, 315, 316, 318, 325

significance tests, 266-270, 279-284, 322, 325, 367, 371, 382, 383

straight-line, 257-273, 308, 317, 322, 391, 449, 461, 489

survival data, 346-353

weighted least squares, 271-272, 278-279, 329

regression estimate

in finite population sampling, 95

remission data example, 378

repeated measures, see hierarchical data

resampling, see bootstrap

residuals

deviance, 332, 333, 334, 345, 376

in multiple imputations, 89-91

inhomogeneous, 338-340, 344

linear predictor, 331, 333, 376

modified, 77, 259, 270, 272, 275, 279, 312, 318, 331, 355, 365

nonlinear regression, 355, 375

nonstandard, 349

raw, 258, 275, 278, 317, 319

Pearson, 331, 333, 334, 342, 370, 376, 382

standardized, 259, 331, 332, 333, 376

time series, 390, 392

returns data example, 269, 272, 449, 461

Richardson extrapolation, 487, 494

Rio Negro data example, 388, 398, 403, 410, 427

robustness, 3, 14, 264, 318

robust M-estimate example, 471, 483

robust regression example, 308, 309, 313, 318, 325

rock data example, 281, 287

rough statistics, 41-43

saddle, 547

saddle.distn, 547

saddlepoint approximation, 466-485, 486, 487, 492, 493, 498, 508, 509, 517, 547

accuracy, 467, 477, 487

conditional, 472-475, 487, 493

density function, 467, 470

distribution function, 467, 468, 470, 486-487

double, 473-475

equation, 467, 473, 479

estimating function, 470-472

integration approach, 478-485

linear statistic for, 468-469, 517

Lugannani-Rice formula, 467

marginal, 473, 475-485, 487, 493

permutation distribution, 475, 486, 487

quantile estimate, 449, 468, 480, 483

randomization distribution, 492, 498

salinity data example, 309, 311, 324

sample average, see average

sample maximum example, 39, 56, 247

sample median, 41, 61, 65, 80, 121, 181, 518

sample variance, 61, 62, 64, 104, 208, 432, 488

sampling

stratified, see stratified sampling

without replacement, 92

sampling fraction, 92-93

sandwich variance estimate, 63, 275, 318, 376

second-order accuracy, 39-41, 211-214, 246

semiparametric model, 77-78, 123

sensitivity analysis, 113

separate families example, 148

sequential spatial inhibition process, 425

several samples, 71-76, 123, 126, 127, 130, 131, 133, 163, 208, 210-211, 217-220, 253

shrinkage estimate, 102, 130

significance probability, see P-value

significance test

adaptive, 173-174, 184, 187, 188

conditional, 138, 173-174

confidence interval, 220-223

critical region, 137

double bootstrap, 175-180, 183, 186, 187

error rate, 137, 175-176

generalized linear regression, 330-331, 367-369, 378, 382

graphical, 150-154, 188, 416, 422, 428

linear regression, 266-270, 279-284, 317, 322, 392

Monte Carlo, 140-147, 151-154

multiple, 174-175, 184

nonparametric bootstrap, 161-175, 267-270

nonparametric regression, 367, 371, 382, 383

parametric bootstrap, 148-149

permutation, 141, 146, 156-160, 173, 183, 185, 188, 266, 317, 378, 475, 486


pivot, 138-139, 268-269, 280, 284, 392, 454, 486

power, 155-156, 180-184

P-value, 137, 138, 141, 148, 158, 161, 175-176

randomization, 183, 185, 186, 492, 498

separate families, 148, 378

sequential, 182

spatial data, 416, 421, 422, 428

studentized, see pivot

time series, 392, 396, 403, 410

simulated data example, 306

simulation error, 34-37, 62

simulation outlier, 73

simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248

size of test, 137

smooth.f, 533

smooth estimates of F, 79-81

spatial association example, 421, 428

spatial clustering example, 416

spatial data, 124, 416, 421-426, 428

spatial epidemiology, 421, 428

species abundance example, 169, 228

spectral density estimation example, 413

spectral resampling, see periodogram resampling

spectrum, 387, 408

spherical data example, 126, 234, 505

spline smoother, 352, 364, 365, 367, 368, 371, 468

standardized residuals, see residuals, standardized

stationarity, 385-387, 391, 398, 416

statistical error, 31-34

statistical function, 12-14, 46, 60, 75

Stirling’s approximation, 61, 155

straight-line regression, see regression, straight-line

stratified resampling, 71, 89, 90, 306, 340, 344, 365, 371, 457, 494

stratified sampling, 97-100

Strauss process, 417, 425

studentized statistic, 29, 53, 119, 139, 171-173, 223, 249, 268, 280-281, 284, 286, 313, 315, 324, 325, 326, 330, 477, 481, 483, 513

subsampling, 55-59

balanced, 125

in model selection, 303-304

spatial, 424-426

sugar cane data example, 338

summation convention, 552

sunspot data example, 393, 401, 435

survival data

nonparametric, 82-87, 124, 128, 131, 132, 350-353, 374-375

parametric, 346-350, 379

survival probability, 86, 132, 160, 352, 515

survival proportion data example, 308, 322

survivor function, 82, 160, 350, 351, 352, 455

symmetric distribution example, 78, 169, 228, 251, 470, 471, 485

tau particle data example, 133, 495

test, see significance test

tile resampling, 424-426, 427, 428, 432

tilt.boot, 547

tilted distribution, see exponential tilting

time series, 385-415, 426-427, 428-431, 514

econometric, 427

nonlinear, 396, 410, 426

toroidal shifts, 423

traffic data, 253

training set, 292

transformation of statistic

empirical, 112, 113, 118, 125, 201

for confidence interval, 195, 200, 233

linearizing, 118-120

variance stabilizing, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 252, 394, 419, 432

trend test in time series example, 403, 410

trimmed average example, 64, 121, 130, 133, 189

tsboot, 544

tuna data example, 169, 228, 469

two-way model example, 338, 342, 369

two-way table, 177, 184

unimodality test, 168, 169, 189

unit root test, 391, 427

urine data example, 359

variable selection, 301-307, 316, 375

variance approximations, see nonparametric delta method

variance estimate, see sample variance

variance function, 327-330, 332, 336, 337, 338, 339, 344, 367

estimation of, 107-113, 465

variance stabilization, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 419, 432

variation of properties of T, 107-113

var.linear, 530

weighted average example, 72, 126, 131

weighted least squares, 270-272, 278-279, 329-330, 377

white noise, 386

Wilcoxon test, 181

wool prices data example, 391
