Robust Statistical Procedures


  • 8/12/2019 Robust Statistical Procedure PAPER

    1/80


CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems, with Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes

continued on inside back cover


Robust Statistical Procedures


Peter J. Huber
Universität Bayreuth
Bayreuth, Germany

Robust Statistical Procedures
Second Edition

SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS
PHILADELPHIA


Copyright © 1977, 1996 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Huber, Peter J.
  Robust statistical procedures / Peter J. Huber. -- 2nd ed.
  p. cm. -- (CBMS-NSF regional conference series in applied mathematics ; 68)
  Includes bibliographical references.
  ISBN 0-89871-379-X (pbk.)
  1. Robust statistics. 2. Distribution (Probability theory)
  I. Title. II. Series.
  QA276.H78 1996
  519.5--dc20 96-36142

Chapter VIII was adapted with permission from a talk with the same title, published in Student, 1 (1995), pp. 75-86, Presses Académiques Neuchâtel.


Contents

Preface to the Second Edition vii
Preface to the First Edition ix

CHAPTER I. BACKGROUND
1. Why robust procedures? 1

CHAPTER II. QUALITATIVE AND QUANTITATIVE ROBUSTNESS
2. Qualitative robustness 5
3. Quantitative robustness, breakdown 8
4. Infinitesimal robustness, influence function 9

CHAPTER III. M-, L-, AND R-ESTIMATES
5. M-estimates 13
6. L-estimates 16
7. R-estimates 19
8. Asymptotic properties of M-estimates 20
9. Asymptotically efficient M-, L-, R-estimates 22
10. Scaling question 25

CHAPTER IV. ASYMPTOTIC MINIMAX THEORY
11. Minimax asymptotic bias 29
12. Minimax asymptotic variance 30

CHAPTER V. MULTIPARAMETER PROBLEMS
13. Generalities 35
14. Regression 35
15. Robust covariances: the affinely invariant case 41
16. Robust covariances: the coordinate-dependent case 45

CHAPTER VI. FINITE SAMPLE MINIMAX THEORY
17. Robust tests and capacities 49
18. Finite sample minimax estimation 51

CHAPTER VII. ADAPTIVE ESTIMATES
19. Adaptive estimates 53

CHAPTER VIII. ROBUSTNESS: WHERE ARE WE NOW?
20. The first ten years 55
21. Influence functions and pseudovalues 56
22. Breakdown and outlier detection 56
23. Studentizing 56
24. Shrinking neighborhoods 57
25. Design 58
26. Regression 58
27. Multivariate problems 61
28. Some persistent misunderstandings 61
29. Future directions 62

References 6


Preface to the Second Edition

When SIAM contacted me about the preparation of a second edition of this booklet, it became clear almost immediately that any attempt to rewrite the booklet would invite the danger of spoiling its purpose. The booklet had served me well as a skeleton and draft outline for my subsequent book Robust Statistics (Wiley, 1981). I had been surprised that the precursor continued to sell in the presence of a more complete and more learned follow-up text. In fact, even now it still is selling about equally as well as the latter. The reason is clear in retrospect: the slim SIAM booklet provides a brief, well-organized, and easy-to-follow introduction and overview. I have used it to teach courses for which the Wiley book was too advanced.

It has been almost 20 years since its initial publication, however, and updating the text and adding new references was necessary. To achieve this without spoiling the flow of the original exposition, the text has been reprinted without changes and an additional chapter, "Robustness: Where Are We Now?", has been added. It is based on a talk I gave at a meeting in Neuchâtel, and I gratefully acknowledge the Presses Académiques Neuchâtel for granting me permission to use that material. I also thank my many colleagues, in particular Peter J. Bickel, for their stimulating comments.

PETER J. HUBER
Bayreuth, June 1996


Preface to the First Edition

At the NSF/CBMS Regional Conference at Iowa City (19-23 July 1976), I gave ten consecutive talks on robust statistical procedures. These lecture notes follow very closely the actual presentation; their preparation was greatly facilitated by the excellent notes taken by J. S. Street during the conference. There was neither time nor space to polish the text and fill out gaps; I hope to be able to do that elsewhere.

All participants of the conference will remember Bob Hogg's impeccable organization and the congenial atmosphere. Many thanks also go to Carla Blum who did overtime typing the manuscript.

PETER J. HUBER
Zurich, May 1977


CHAPTER I

Background

1. Why robust procedures? The word "robust" is loaded with many, sometimes inconsistent, connotations. We shall use it in a relatively narrow sense: for our purposes, robustness signifies insensitivity against small deviations from the assumptions.

Primarily, we shall be concerned with distributional robustness: the shape of the true underlying distribution deviates slightly from the assumed model (usually the Gaussian law). This is both the most important case (some of the classical statistical procedures show a dramatic lack of distributional robustness), and the best understood one. Much less is known about what happens if the other standard assumptions of statistics are not quite satisfied (e.g. independence, identical distributions, randomness, accuracy of the prior in Bayesian analysis, etc.) and about the appropriate safeguards in these other cases.

The traditional approach to theoretical statistics was and is to optimize at an idealized model and then to rely on a continuity principle: what is optimal at the model should be almost optimal nearby. Unfortunately, this reliance on continuity is unfounded: the classical optimized procedures tend to be discontinuous in the statistically meaningful topologies. An eye-opening example has been given by Tukey (1960):

Example. Assume that you have a large, randomly mixed batch of "good" observations which are normal N(μ, σ²) and "bad" ones which are normal N(μ, 9σ²), i.e. all observations have the same mean, but the errors of some are increased by a factor 3. Each single observation x_i is a good one with probability 1 − ε, a bad one with probability ε, where ε is a small number. Two time-honored measures of scatter are the mean absolute deviation

    d_n = (1/n) Σ |x_i − x̄|

and the mean square deviation

    s_n = ( (1/n) Σ (x_i − x̄)² )^{1/2}.

There had been a dispute between Eddington and Fisher, around 1920, about the relative merits of d_n and s_n. Fisher then pointed out that for exactly normal observations, s_n is 12% more efficient than d_n, and this seemed to settle the matter.


Of course, the two statistics measure different characteristics of the error distribution. For instance, if the errors are exactly normal, s_n converges to σ while d_n converges to √(2/π) σ ≈ 0.80σ. So we should make precise how to compare their performance on the basis of the asymptotic relative efficiency (ARE) of d_n relative to s_n (see Table 1):

TABLE 1

    ε       ARE(ε)
    0       0.876
    0.001   0.948
    0.002   1.016
    0.005   1.198
    0.01    1.439
    0.02    1.752
    0.05    2.035
    0.10    1.903
    0.15    1.689
    0.25    1.371
    0.5     1.017
    1.0     0.876
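The entries of Table 1 can be reproduced in closed form from the moments of the contaminated normal law (1 − ε)N(0, σ²) + εN(0, 9σ²). The following sketch (function name ours, not from the text) standardizes the asymptotic variances of d_n and s_n by the delta method and takes their ratio:

```python
import math

def are_contaminated(eps, tau=3.0):
    """ARE of the mean absolute deviation d_n relative to the root-mean-square
    deviation s_n under the contaminated normal law
    (1 - eps) N(0, 1) + eps N(0, tau**2).  Standardized asymptotic variances:
      s_n:  (mu4 - mu2**2) / (4 * mu2**2)
      d_n:  (mu2 - m1**2) / m1**2,   with m1 = E|X|."""
    mu2 = (1 - eps) + eps * tau**2                          # E X^2
    mu4 = 3 * ((1 - eps) + eps * tau**4)                    # E X^4
    m1 = math.sqrt(2 / math.pi) * ((1 - eps) + eps * tau)   # E |X|
    var_s = (mu4 - mu2**2) / (4 * mu2**2)
    var_d = (mu2 - m1**2) / m1**2
    return var_s / var_d

for eps in (0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.5, 1.0):
    print(f"eps = {eps:5.3f}   ARE = {are_contaminated(eps):.3f}")
```

At ε = 0 this gives 2/(4(π/2 − 1) · (π/2)) … = 0.876, Fisher's 12% advantage, and the ratio crosses 1 already near ε = 0.002.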

The result is disquieting: just two bad observations in 1000 suffice to offset the 12% advantage of the mean square error, and the ARE reaches a maximum value > 2 at about ε = 0.05.

Typical "good" data samples in the physical sciences appear to be well modeled by an error law of the form

    F(x) = (1 − ε) Φ(x) + ε Φ(x/3),

where Φ is the standard normal cumulative, with ε in the range between 0.01 and 0.1. (This does not necessarily imply that these samples contain between 1% and 10% gross errors, although this is often true; the above may just be a convenient description of a slightly longer-tailed than normal distribution.) In other words, the naturally occurring deviations from the idealized model are large enough to render meaningless the traditional asymptotic optimality theory.

To avoid misunderstandings, I should also emphasize what is not implied here. First, the above results do not imply that we advocate the use of the mean absolute deviation (there are still better estimates). Second, one might argue that the example is unrealistic insofar as the bad observations will stick out as outliers, so any conscientious statistician will do something about them before calculating the mean square error. This is beside the point: we are concerned here with the behavior of the unmodified classical estimates.

The example has to do with longtailedness: lengthening the tails explodes the variability of s_n (d_n is much less affected). Shortening the tails (moving a fractional mass from the tails to the central region), on the other hand, produces only negligible effects on the distributions of the estimators. (Though it may impair absolute efficiency by decreasing the asymptotic Cramér-Rao bound; but the latter is so unstable under small changes of the distribution that it is difficult to take this effect seriously.) Thus, for most practical purposes, "distributionally robust" and "outlier resistant" are interchangeable. Any reasonable, formal or informal, procedure for rejecting outliers will prevent the worst.

However, only the best among these rejection rules can more or less compete with other good robust estimators. Frank Hampel (1974a), (1976) has analyzed the performance of some rejection rules (followed by the mean, as an estimate of location). Rules based on the studentized range, for instance, are disastrously bad; the maximum studentized residual and Dixon's rule are only mediocre; the top group consists of a rule based on sample kurtosis, one based on Shapiro-Wilk, and a rule which simply removes all observations x_i for which

    | x_i − med_j x_j | / med_j { | x_j − med_j x_j | }

exceeds a predetermined constant c (e.g. c = 5.2, corresponding to 3.5 standard deviations). The main reason why some of the rules performed poorly is that they won't recognize outliers if two or more of them are bundled together on the same side of the sample.

Altogether, 5-10% wrong values in a data set seem to be the rule rather than the exception (Hampel (1973a)). The worst batch I have encountered so far (some 50 ancient astronomical observations) contained about 30% gross errors. I am inclined to agree with Daniel and Wood (1971, p. 84), who prefer technical expertise to any statistical criterion for straight outlier rejection. But even the thus cleaned data will not exactly correspond to the idealized model, and robust procedures should be used to process them further.

One further remark on terminology: although robust methods are often classified together with nonparametric and distribution-free ones, they rather belong together with classical parametric statistics. Just as there, one has an idealized parametric model, but in addition one would like to make sure that the methods work well not only at the model itself, but also in a neighborhood of it. Note that the sample mean is the nonparametric estimate of the population mean, but it is not robust. Distribution-free tests stabilize the level, but not necessarily the power. The performance of estimates derived from rank tests tends to be robust, but since it is a function of the power, not of the level of these tests, this is a fortunate accident, not intrinsically connected with distribution-freeness.

For historical notes on the subject of robustness, see the survey articles by Huber (1972), Hampel (1973a), and in particular Stigler (1973).
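The median/MAD rejection rule from the top group above is easy to state in code. A minimal sketch (function name, default cutoff, and test data are ours):

```python
import statistics

def mad_reject_then_mean(xs, c=5.2):
    """Rejection rule followed by the mean: drop every observation whose
    distance from the sample median exceeds c times the median absolute
    deviation (c = 5.2 MADs corresponds to roughly 3.5 standard deviations
    for normal data), then average what remains."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    if mad == 0:
        return med                      # degenerate sample: no spread at all
    kept = [x for x in xs if abs(x - med) <= c * mad]
    return statistics.fmean(kept)

# Two gross errors bundled on the same side of the sample:
print(mad_reject_then_mean([9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 26.0]))
```

Because the median and the MAD are themselves resistant, the two bundled outliers are still recognized, precisely the failure mode of the range-based rules criticized above.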


CHAPTER II

Qualitative and Quantitative Robustness

2. Qualitative robustness. We have already mentioned the stability or continuity principle fundamental to (distributional) robustness: a small change in the underlying distribution should cause only a small change in the performance of a statistical procedure. These notions are due to Hampel (1968), (1971).

Thus, if X₁, ..., X_n are i.i.d. random variables with common distribution F, and T_n = T_n(X₁, ..., X_n) is an estimate based on them, then this requirement can be interpreted: a sufficiently small change in F = ℒ(X) should result in an arbitrarily small change in ℒ(T_n). A little thought will show that, if we exclude extremely pathological functions T_n, all ordinary estimates satisfy this requirement for every fixed n. But for nonrobust statistics the modulus of continuity typically gets worse for increasing n. In other words, we should require that the continuity is uniform with respect to n. More precisely, for a suitable metric d* (see below) in the space of probability measures, we require that for all ε > 0 there is a δ > 0 such that for all n ≥ n₀,

    d*(F, G) ≤ δ  implies  d*(ℒ_F(T_n), ℒ_G(T_n)) ≤ ε.

We may require this either only at the idealized model F, or for all F, or even uniformly in F.

This situation is quite analogous to the stability of ordinary differential equations, where the solution should depend continuously (in the topology of uniform convergence) on the initial values. However, it is somewhat tricky to work with the above formalization of robustness, and we shall begin with a much simpler, but more or less equivalent, nonstochastic version of the continuity requirement.

Restrict attention to statistics which can be written as a functional T of the empirical distribution function F_n, or, perhaps better, of the empirical measure

    μ_n = (1/n) Σ δ_{x_i},

where (x₁, ..., x_n) is the sample and δ_x is the unit point mass at x. Then a small change in the sample should result in a small change in the value T_n = T(μ_n) of the statistics. Note that many of the classical statistical procedures are of this form, e.g. (i) all maximum likelihood estimates:

    Σ log f(x_i; T_n) = max!,


(ii) linear combinations of order statistics, e.g. the α-trimmed mean

    T_n = (1/(n − 2⌊nα⌋)) Σ_{i=⌊nα⌋+1}^{n−⌊nα⌋} x_(i),

or also (iii) estimates derived from rank tests, e.g. the Hodges-Lehmann estimate

    T_n = med { (x_i + x_j)/2 }

(the median of the pairwise means (x_i + x_j)/2, with (i, j) ranging over all n² pairs; the more conventional versions, which use only the pairs with i < j or i ≤ j, are asymptotically equivalent).


For any A ∈ 𝔅, define the closed δ-neighborhood of A as

    A^δ = { x ∈ Ω : d(x, A) ≤ δ }.

It is easy to show that A^δ is closed.

Let M be the set of all probability measures on (Ω, 𝔅), let G ∈ M and let δ, ε > 0. Then

    { F ∈ M : F{A} ≤ G{A^δ} + ε for all A ∈ 𝔅 }

shall be called a Prohorov neighborhood of G. These neighborhoods generate the weak topology in M. The Prohorov distance is defined as

    d_Pr(F, G) = inf { ε > 0 : F{A} ≤ G{A^ε} + ε for all A ∈ 𝔅 }.

It is straightforward to check that d_Pr is a metric.

THEOREM (Strassen (1965)). The following two statements are equivalent:
(i) For all A ∈ 𝔅, F{A} ≤ G{A^δ} + ε.
(ii) There exist (dependent) random variables X, Y with values in Ω, such that ℒ(X) = F, ℒ(Y) = G and P{d(X, Y) ≤ δ} ≥ 1 − ε.

In other words, if G is the idealized model and F is the true underlying distribution, such that d_Pr(F, G) ≤ ε, we can always assume that there is an ideal (but unobservable) random variable Y with ℒ(Y) = G, and an observable X with ℒ(X) = F, such that P{d(X, Y) ≤ ε} ≥ 1 − ε. That is, the model provides both for small errors occurring with large probability and large errors occurring with low probability, in a very explicit and quantitative fashion.

There are several other metrics also defining the weak topology. An interesting one is the so-called bounded Lipschitz metric d_BL. Assume that the metric on (Ω, 𝔅) is bounded by 1 (if necessary, replace the original metric by d(x, y)/(1 + d(x, y))). Then define

    d_BL(F, G) = sup_ψ | ∫ ψ dF − ∫ ψ dG |,

where ψ ranges over all functions satisfying the Lipschitz condition |ψ(x) − ψ(y)| ≤ d(x, y). Also for this metric an analogue of Strassen's theorem holds (first proved in a special case by Kantorovitch and Rubinstein (1958)): d_BL(F, G) ≤ ε iff there are two random variables X, Y such that ℒ(X) = F, ℒ(Y) = G, and E d(X, Y) ≤ ε.

Furthermore, on the real line also the Lévy metric d_L generates the weak topology; by definition, d_L(F, G) ≤ ε iff for all x

    F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε.

The Lévy metric is easier to handle than the two other ones, but unfortunately it does not possess an intuitive interpretation in the style of the Prohorov or bounded Lipschitz metric.
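On the real line the Lévy metric also lends itself to direct numerical evaluation. A small sketch (function names and grid parameters are ours): it searches for the smallest ε, on a grid, satisfying the sandwich inequality above, applied to the standard normal versus Tukey's contaminated normal of Chapter I:

```python
import math

def norm_cdf(x, sigma=1.0):
    """Cumulative of N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def levy_distance(F, G, lo=-10.0, hi=10.0, step=0.001):
    """Smallest eps (up to grid resolution) with
    F(x - eps) - eps <= G(x) <= F(x + eps) + eps for all grid points x."""
    xs = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    def ok(eps):
        return all(F(x - eps) - eps <= G(x) <= F(x + eps) + eps for x in xs)
    eps = 0.0
    while not ok(eps):
        eps += step
    return eps

F0 = norm_cdf                                              # idealized model
Fc = lambda x: 0.95 * norm_cdf(x) + 0.05 * norm_cdf(x, sigma=3.0)
print(levy_distance(F0, Fc))
```

The 5%-contaminated normal lies at Lévy distance of roughly 0.01 from the standard normal, a "small" perturbation in exactly the sense of this section.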


It is now fairly straightforward to show that the two definitions of qualitative robustness are essentially equivalent:

THEOREM (Hampel (1971)). Let T be defined everywhere in M and put T_n = T(F_n). We say that T_n is consistent at F if T_n tends to T(F) in probability, where F is the true underlying distribution.
(i) If T is weakly continuous at all F, then T_n is consistent at all F, and F → ℒ(T_n) is weakly continuous uniformly in n.
(ii) If T_n is consistent and F → ℒ(T_n) is weakly continuous uniformly in n at all F, then T is weakly continuous.

3. Quantitative robustness, breakdown. Consider a sequence of estimates generated by a statistical functional, T_n = T(F_n), where F_n is the empirical distribution. Assume that T(F₀) is the target value of these estimates (the value of the functional at the idealized model distribution F₀). Assume that the true underlying distribution F lies anywhere in some ε-neighborhood of F₀, say in

    𝒫_ε = { F : d*(F₀, F) ≤ ε }.

Ordinarily, our estimates will be consistent in the sense that

    T_n → T(F)  in probability,

and asymptotically normal,

    ℒ_F( √n (T_n − T(F)) ) → N(0, A(F, T)).

Thus, it will be convenient to discuss the quantitative asymptotic robustness properties of T in terms of the maximum bias

    b₁(ε) = sup_{F ∈ 𝒫_ε} | T(F) − T(F₀) |

and the maximum variance

    v₁(ε) = sup_{F ∈ 𝒫_ε} A(F, T).

However, this is, strictly speaking, inadequate: we should like to establish that for sufficiently large n our estimate T_n behaves well for all F ∈ 𝒫_ε. A description in terms of b₁ and v₁ would only allow us to show that for each F ∈ 𝒫_ε, T_n behaves well for sufficiently large n. The distinction is fundamental, but has been largely neglected in the literature.

A better approach would be as follows. Let M(F, T_n) be the median of ℒ_F(T_n − T(F₀)) and let Q_t(F, T_n) be a normalized t-quantile range of ℒ_F(√n T_n). For any distribution G, we define the normalized t-quantile range as

    Q_t(G) = ( G^{-1}(1 − t) − G^{-1}(t) ) / ( Φ^{-1}(1 − t) − Φ^{-1}(t) ).

The value of t is arbitrary, but fixed, say t = 0.25 (interquartile range) or t = 0.025 (95%-range, which is convenient in view of the traditional 95% confidence


intervals). For a normal distribution, Q_t coincides with the standard deviation; Q_t² shall also be called a pseudo-variance.

Then we define the maximum asymptotic bias and variance respectively as

    b(ε) = lim sup_n sup_{F ∈ 𝒫_ε} | M(F, T_n) |  ≥ b₁(ε),

    v(ε) = lim sup_n sup_{F ∈ 𝒫_ε} Q_t(F, T_n)²  ≥ v₁(ε).

The inequalities here are straightforward and easy to establish, assuming that b₁ and v₁ are well defined. Since b and v are awkward to handle, we shall work with b₁ and v₁, but we are then obliged to check whether, for the particular T under consideration, b₁ = b and v₁ = v. Fortunately, this is usually true.

We define the asymptotic breakdown point of T at F₀ as

    ε* = sup { ε : b₁(ε) < b₁(1) }.

Roughly speaking, the breakdown point gives the maximum fraction of bad outliers the estimator can cope with. In many cases, it does not depend on F₀, nor on the particular choice of 𝒫_ε (in terms of Lévy distance, Prohorov distance, ε-contamination, etc.).

Example. The breakdown point of the α-trimmed mean is ε* = α. (This is intuitively obvious; for a formal derivation see §6.)

4. Infinitesimal robustness, influence function. For the following we assume that d* is a metric in the space M of all probability measures, generating the weak topology, and which is also compatible with the affine structure of M in the sense that

    d*(F_s, F_t) = O(|s − t|),

where F_t = (1 − t)F + tG.

We say that a statistical functional T is Fréchet differentiable at F if it can be approximated by a linear functional L (depending on F) such that for all G

    | T(G) − T(F) − L(G − F) | = o( d*(F, G) ).

It is easy to see that L is uniquely determined: the difference L₁ of any two such functionals satisfies |L₁(G − F)| = o(d*(F, G)), and in particular, with F_t = (1 − t)F + tG, we obtain

    | L₁(F_t − F) | = t | L₁(G − F) | = o(t),

hence L₁(G − F) = 0 for all G.

Moreover, if T is weakly continuous, then L must be too. The only weakly continuous linear functionals are those of the form


    L(G − F) = ∫ ψ d(G − F)

for some bounded continuous function ψ. Evidently, ψ is determined only up to an additive constant, and we can standardize it such that ∫ ψ dF = 0, thus L(G − F) = ∫ ψ dG.

If d*(F, F_n) is of the stochastic order O_p(n^{-1/2}) (which holds for d_L, but not in general for d_Pr or d_BL), then we obtain an extremely simple proof of asymptotic normality:

    T(F_n) − T(F) = ∫ ψ dF_n + o_p(n^{-1/2}),

hence √n (T(F_n) − T(F)) is asymptotically normal with mean 0 and variance

    A(F, T) = ∫ ψ² dF.

Unfortunately, we rarely have Fréchet differentiability, but the assertions just made remain valid under weaker assumptions (and more complicated proofs).

A functional T is called Gâteaux differentiable¹ at F if there is a function ψ such that for all G ∈ M,

    lim_{t→0} ( T((1 − t)F + tG) − T(F) ) / t = ∫ ψ dG.

Whenever the Fréchet derivative exists, then also the Gâteaux derivative does, and the two agree. Differentiable statistical functionals were first considered by von Mises (1937), (1947).

Evidently, ψ(x) can be computed by inserting G = δ_x (point mass 1 at x) into the preceding formula, and in this last form it has a heuristically important interpretation, first pointed out by Hampel (1968):

    IC(x; F, T) = lim_{t→0} ( T((1 − t)F + t δ_x) − T(F) ) / t

gives the suitably scaled differential influence of one additional observation with value x as the sample size n → ∞. Therefore, Hampel has called it the influence curve (IC).

Note. There are moderately pathological cases where the influence curve exists, but not the Gâteaux derivative. For instance, the functional corresponding to the Bickel-Hodges estimate (Bickel and Hodges (1967)) has this property.

¹ Often, but erroneously, called "Volterra differentiable". See J. A. Reeds (1976).


    QUALITATIVE AND QUANTITATIVE ROBUSTNESS 11If w eapproximateth e in f luence curveasfo l lows:

    replace F byFn _ i,replace tby l/n,w e obtain the so-called sen sitivity cu rve Tukey 1970)):
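The sensitivity curve is directly computable from any estimator. A minimal sketch (names and data are ours): the sensitivity curve of the mean grows without bound in the added observation x, while that of the median is bounded and constant once x leaves the central region:

```python
import statistics

def sensitivity_curve(estimator, sample, x):
    """Tukey's sensitivity curve: scaled change in the estimate when one
    additional observation x is appended to a sample of size n - 1."""
    n = len(sample) + 1
    return n * (estimator(sample + [x]) - estimator(sample))

base = [-1.8, -0.7, -0.2, 0.1, 0.4, 0.9, 1.6]   # a fixed sample of size n - 1
for x in (0.0, 2.0, 1000.0):
    sc_mean = sensitivity_curve(statistics.fmean, base, x)
    sc_med = sensitivity_curve(statistics.median, base, x)
    print(f"x = {x:7.1f}   SC(mean) = {sc_mean:10.2f}   SC(median) = {sc_med:6.2f}")
```

This unboundedness of the mean's sensitivity curve is the finite-sample shadow of its unbounded influence curve.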

However, this does not always give a feasible approximation to the influence curve (the problem resides with the substitution of F_{n−1} for F).

If the Fréchet derivative of T at F₀ exists, then we have for the gross error model

    𝒫_ε = { F : F = (1 − ε)F₀ + εH, H ∈ M }

that

    b₁(ε) = ε γ* + o(ε),

with

    γ* = sup_x | IC(x; F₀, T) |.

γ* has been called the gross error sensitivity by Hampel. If we have only Gâteaux differentiability, some care is needed. We shall later give two examples where (i) γ* < ∞ but b₁(ε) = ∞ for ε > 0, and (ii) γ* = ∞ but lim b₁(ε) = 0 for ε → 0.


CHAPTER III

M-, L-, and R-estimates

5. M-estimates. Any estimate T_n defined by a minimum problem of the form

    Σ ρ(x_i; T_n) = min!                                          (5.1)

or by an implicit equation

    Σ ψ(x_i; T_n) = 0,                                            (5.2)

where ρ is an arbitrary function and ψ(x; θ) = (∂/∂θ) ρ(x; θ), shall be called an M-estimate (or maximum likelihood type estimate; note that ρ(x; θ) = −log f(x; θ) gives the ordinary M.L.-estimate).

We are particularly interested in location estimates,

    Σ ρ(x_i − T_n) = min!   or   Σ ψ(x_i − T_n) = 0.

If we write the last equation as

    Σ W(x_i − T_n) · (x_i − T_n) = 0,  with  W(x) = ψ(x)/x,

we obtain a representation of T_n as a weighted mean

    T_n = Σ w_i x_i / Σ w_i,   w_i = W(x_i − T_n),                (5.3)

with weights depending on the sample. Our favorite choices will be of the form

    ψ(x) = x            for |x| ≤ k,
    ψ(x) = k · sign(x)  for |x| > k,

leading to weights W(x) = min(1, k/|x|).


All three versions (5.1), (5.2), (5.3) are essentially equivalent. Note that the functional version of the first form,

    ∫ ρ(x − T(F)) dF(x) = min!,

may cause trouble. For instance, the median corresponds to ρ(x) = |x|, and ∫ |x − t| dF(x) = ∞ unless F has a finite first absolute moment. This is o.k. if we examine instead

    ∫ [ ρ(x − t) − ρ(x) ] dF(x) = min!.

Influence curve of an M-estimate. Put

    F_t = (1 − t)F₀ + tF₁;

then the influence curve IC(x; F₀, T) is the ordinary derivative

    (d/dt) T(F_t) |_{t=0},   evaluated with F₁ = δ_x.

In particular, for an M-estimate, i.e. for the functional T(F) defined by

    ∫ ψ(x; T(F)) dF(x) = 0,

we obtain by inserting F_t for F and taking the derivative, with ψ'(x; θ) = (∂/∂θ) ψ(x; θ):

    (d/dt) T(F_t) |_{t=0} · ∫ ψ'(x; T(F₀)) dF₀(x) + ∫ ψ(x; T(F₀)) d(F₁ − F₀)(x) = 0,

or, since ∫ ψ(x; T(F₀)) dF₀(x) = 0,

    (d/dt) T(F_t) |_{t=0} = ∫ ψ(x; T(F₀)) dF₁(x) / ( −∫ ψ'(x; T(F₀)) dF₀(x) ).

After putting F₁ = δ_x, we obtain

    IC(x; F₀, T) = ψ(x; T(F₀)) / ( −∫ ψ'(y; T(F₀)) dF₀(y) ).

So let us remember that the influence curve of an M-estimate is simply proportional to ψ.
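The weighted-mean representation (5.3) suggests a simple fixed-point iteration for computing a Huber-type location M-estimate. A minimal sketch (function name, tuning constant k = 1.5, convergence parameters, and data are all ours, not prescribed by the text):

```python
def huber_location(xs, k=1.5, tol=1e-9, max_iter=100):
    """Location M-estimate with psi(x) = max(-k, min(k, x)), computed by
    iterating the weighted-mean representation
        T = sum(w_i * x_i) / sum(w_i),  w_i = min(1, k / |x_i - T|)."""
    t = sorted(xs)[len(xs) // 2]          # start from (roughly) the median
    for _ in range(max_iter):
        w = [1.0 if abs(x - t) <= k else k / abs(x - t) for x in xs]
        t_new = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

data = [0.2, -0.4, 0.1, 0.5, -0.1, 0.3, 50.0]   # one gross outlier
print(huber_location(data))                      # stays with the bulk of the data
```

The gross observation enters with weight k/|x − T| instead of weight 1, which is exactly the bounded-influence behavior promised by the proportionality of IC to ψ.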


Breakdown and continuity properties of M-estimates. Take the location case, with T(F) defined by

    λ(t, F) = ∫ ψ(x − t) dF(x) = 0  at  t = T(F).

Assume that ψ is nondecreasing, but not necessarily continuous. Then λ(t, F) is decreasing in t and increasing in F (in the sense of stochastic ordering). T(F) is not necessarily unique; we have T* ≤ T(F) ≤ T**, with

    T*(F) = sup { t : λ(t, F) > 0 },   T**(F) = inf { t : λ(t, F) < 0 }.

Now let F range over all distributions with d_L(F₀, F) ≤ ε. The stochastically largest member of this set is the improper distribution F₁ (it puts mass ε at +∞),

    F₁(x) = max( F₀(x − ε) − ε, 0 ),

and the largest value the functional can then take is T**(F₁). Define

    b⁺(ε) = T**(F₁) − T(F₀),

and symmetrically b⁻(ε) for the stochastically smallest member; then the maximum asymptotic bias is

    b₁(ε) = max( b⁺(ε), b⁻(ε) ).


Breakdown. b⁺(ε) stays finite iff λ(t, F₁) becomes negative for t large enough, i.e. iff

    (1 − ε) ψ(−∞) + ε ψ(+∞) < 0,

and correspondingly for b⁻(ε). Hence the breakdown point of a monotone M-estimate is

    ε* = min( −ψ(−∞), ψ(+∞) ) / ( ψ(+∞) − ψ(−∞) );

in particular, ε* = 1/2 if ψ is bounded and symmetric.


at s = 0, with F_s = (1 − s)F₀ + sF₁, and obtain the influence curve of the L-estimate

    T(F) = ∫₀¹ F^{-1}(t) dM(t).

Example 1. For the median (t = 1/2) we have

    IC(x; F, med) = sign( x − F^{-1}(1/2) ) / ( 2 f(F^{-1}(1/2)) ).

The general case is now obtained by linear superposition.

Example 2. If M puts mass at t, then IC has jumps of size M({t}) / f(F^{-1}(t)) at the points x = F^{-1}(t). If M has a density m, then we may differentiate the expression (6.3) and obtain the more easily remembered formula

    IC(x; F, T) = −∫ ( 1{y ≥ x} − F(y) ) m(F(y)) dy.            (6.4)

Example 3. The α-trimmed mean,

    T_n = (1/(n − 2⌊nα⌋)) Σ_{i=⌊nα⌋+1}^{n−⌊nα⌋} x_(i),

has an influence curve of the form shown in Fig. 2 (bounded: linear in the middle, constant outside the central region).

[FIG. 2]
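A minimal sketch of the α-trimmed mean itself (names and data are ours). With α = 0.1, one gross error among ten observations is simply trimmed away, in line with the breakdown point ε* = α of §3:

```python
import math

def trimmed_mean(xs, alpha=0.1):
    """alpha-trimmed mean: discard the floor(n*alpha) smallest and the
    floor(n*alpha) largest observations, then average the rest."""
    xs = sorted(xs)
    g = math.floor(len(xs) * alpha)
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

clean = [9.9, 10.0, 10.1, 10.2, 9.8, 10.0, 9.7, 10.3, 10.1, 9.9]
dirty = clean[:-1] + [1000.0]            # replace one value by a gross error
print(trimmed_mean(clean, 0.1), trimmed_mean(dirty, 0.1))
```

The contaminated estimate moves only from 10.00 to 10.05, while the ordinary mean would jump by about 99.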


Example 4. The α-Winsorized mean. For g/n ≈ α,

    T_n = (1/n) ( g x_(g+1) + Σ_{i=g+1}^{n−g} x_(i) + g x_(n−g) ).

The corresponding functional has the influence curve shown in Fig. 3.

Breakdown and continuity properties of L-estimates. Assume that M is a positive measure with support contained in [α, 1 − α], where 0 < α < 1/2. Then the breakdown point is ε* = α, and b⁺(ε) → 0 as ε → 0 unless F₀^{-1} and the distribution function of M have common discontinuity points; and similarly for b⁻(ε).


It follows that T is continuous at all F₀ where it is well defined, i.e. at those F₀ for which F₀^{-1} and M do not have common discontinuity points.

7. R-estimates. Consider a two-sample rank test for shift: let x₁, ..., x_m and y₁, ..., y_n be two independent samples with distributions F(x) and G(x) = F(x − Δ) respectively. Merge the two samples into one of size m + n, and let R_i be the rank of x_i in the combined sample. Let a_i = a(i) be some given scores; then base a test of Δ = 0 against Δ > 0 on the test statistic

    S_n = (1/m) Σ_{i=1}^m a(R_i).

Usually, one assumes that the scores a_i are generated by some function J as follows:

    a_i = J( i/(m + n + 1) ).

But there are also other possibilities for deriving the a_i from J, and we shall prefer to work with

    a_i = (m + n) ∫_{(i−1)/(m+n)}^{i/(m+n)} J(s) ds.              (7.3)

Assume for simplicity m = n and put

    S(F, G) = ∫ J( ½ (F(x) + G(x)) ) dF(x).

Then S_n = S(F_n, G_n), where F_n, G_n are the sample distribution functions of (x₁, ..., x_n) and (y₁, ..., y_n) respectively, provided we define the a_i by (7.3).

One can derive estimates of shift Δ_n and location T_n from such tests:
(i) adjust Δ_n such that S_n = 0 when computed from (x₁, ..., x_n) and (y₁ − Δ_n, ..., y_n − Δ_n);
(ii) adjust T_n such that S_n = 0 when computed from (x₁, ..., x_n) and (2T_n − x₁, ..., 2T_n − x_n). In this case, a mirror image of the x-sample serves as a stand-in for the missing second sample.
(Note that it may not be possible to achieve an exact zero, S_n being a discontinuous function.)

Example. The Wilcoxon test corresponds to J(t) = t − ½ and leads to the Hodges-Lehmann estimate T_n = med{ (x_i + x_j)/2 }.

In terms of functionals, this means that our estimate of location derives from T(F), defined by the implicit equation

    ∫ J( ½ ( F(x) + 1 − F(2T(F) − x) ) ) dF(x) = 0.              (7.5)

From now on we shall assume that J(1 − t) = −J(t), 0 < t < 1.
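The Hodges-Lehmann estimate of the Example is a one-liner over the n² pairwise means. A minimal sketch (names and data are ours), with the ordinary mean shown for contrast:

```python
import statistics

def hodges_lehmann(xs):
    """Hodges-Lehmann location estimate: the median of all n^2 pairwise
    means (x_i + x_j)/2, with (i, j) ranging over all pairs."""
    return statistics.median((xi + xj) / 2 for xi in xs for xj in xs)

data = [0.3, -0.2, 0.1, 0.4, 0.0, -0.1, 0.2, 30.0]   # one gross outlier
print(hodges_lehmann(data), statistics.fmean(data))
```

One outlier among eight observations drags the sample mean to nearly 4, while the Hodges-Lehmann estimate stays with the bulk of the data near 0.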


Influence curve. As in the preceding sections, we find it by inserting F_t for F into (7.5) and taking the derivative with respect to t at t = 0. After some calculations this gives

    IC(x; F, T) = U(x) / ∫ U'(x) f(x) dx,

where f is the density of F, and U is defined by its derivative

    U'(x) = J'( ½ ( F(x) + 1 − F(2T(F) − x) ) ) f(2T(F) − x).

If the true underlying distribution F is symmetric, there is a considerable simplification: U(x) = J(F(x)), and thus

    IC(x; F, T) = J(F(x)) / ∫ J'(F(x)) f(x)² dx.

Example 1. Hodges-Lehmann estimate:

    IC(x; F, T) = ( F(x) − ½ ) / ∫ f(x)² dx.

Example 2. Normal scores estimate (J(t) = Φ^{-1}(t)):

    IC(x; F, T) = Φ^{-1}(F(x)) / ∫ ( f(x)² / φ(Φ^{-1}(F(x))) ) dx.

In particular, at F = Φ this reduces to IC(x; Φ, T) = x.

Breakdown. The maximum bias b₁(ε) and the breakdown point ε* can be worked out as in the preceding sections if J is monotone; ε* is that value of ε for which the defining equation first fails to have a finite solution.

Hodges-Lehmann: ε* = 1 − 1/√2 ≈ 0.29.

Normal scores: note that the normal scores estimate is nevertheless robust at the normal model, even though its influence curve there is unbounded (cf. example (ii) in §4).


T(F): it is usually true that √n (T(Fn) − T(F)) is asymptotically normal with mean 0 and variance

However, proofs via the influence function are rarely viable (or only under too restrictive regularity assumptions). We shall now sketch a rigorous treatment of the asymptotic properties of M-estimates.
Assume that ψ(x, t) is measurable in x and decreasing (= nonincreasing) in t, from strictly positive to strictly negative values. Put

    W e have -oo



where ct and at are chosen such that this is a probability density with expectation 0. Denote its variance by

    and

Then we get from the first two terms of the Edgeworth expansion (8.1), calculated at x = 0, that

From this we obtain gn and Gn by two numerical integrations and an exponentiation; the integration constant must be determined such that Gn has total mass 1. It turns out that the first integration can be done explicitly:

This variant of the saddle point method was suggested by F. Hampel (1973b), who realized that the principal approximation error was residing in the normalizing constant and that it could be avoided by expanding g'n/gn and then determining this constant by numerical integration. See also a forthcoming paper by H. E. Daniels (1976).
We now revert to the limit law. Put

Assume that λ(t) and E[ψ(X, t)²] are continuous in t and finite, and that λ(t0) = 0 for some t0. Then one shows easily that

If λ is differentiable at t0 and if we can interchange the order of integration and differentiation, we obtain from this

The last expression corresponds to what one formally gets from the influence function.

9. Asymptotically efficient M-, L-, R-estimates. Let (Fθ), θ ∈ Θ, be a parametric family of distributions. We shall first present a heuristic argument that a Fisher consistent estimate of θ, i.e. a functional T satisfying



Of course, one must check in each individual case whether these estimates are indeed efficient (the rather stringent regularity conditions, such as Fréchet differentiability, will rarely be satisfied). Examples of asymptotically efficient estimates for different distributions follow:
Example 1. Normal distribution f0(x) = (2π)^(-1/2) e^(-x²/2).

sample mean, nonrobust,
sample mean, nonrobust,
normal scores estimate, robust.

The normal scores estimate loses its high efficiency very quickly when only a small amount of far-out contamination is added, and is soon surpassed by the Hodges-Lehmann estimate. (Thus, although it is both robust and fully efficient at the normal model, I would hesitate to recommend the normal scores estimate for practical use.)
Example 2. Logistic distribution F0(x) = 1/(1 + e^(-x)).
robust,
nonrobust,
robust

(Hodges-Lehmann estimate). Remember that an L-estimate is robust iff support(M) = [α, 1 − α] for some 0 < α < 1/2.


Then

The asymptotically efficient estimates are: the M-estimate with ψ(x) = max(−c, min(x, c)), an α-trimmed mean (α = F0(−c)), and a somewhat complicated R-estimate.

10. Scaling questions. Scaling problems arise in two conceptually unrelated contexts: first, because M-estimates of location as hitherto defined are not scale

invariant; second, when one is estimating the error of a location estimate. In order to make location estimates scale invariant, one must combine them with an (equivariant) scale estimate Sn:

For straight location, the median absolute deviation (MAD) appears to be the best ancillary scale estimate:

MADn = med{|xi − Mn|},
    where

The somewhat poor efficiency of MADn is more than counter-balanced by its high breakdown point (ε* = 1/2); also Tn retains that high breakdown point. In more complicated situations, e.g. regression (see below), the median absolute deviation loses many of its advantages, and in particular, it is no longer an easily and independently of Tn computable statistic.
The second scaling problem is the estimation of the statistical error of Tn. The asymptotic variance of √n Tn is

The true underlying F is unknown; we may estimate A(F, T) by substituting either Fn or F0 (supposedly F is close to the model distribution F0), or a combination of them, on the right hand side, provided IC depends in a sufficiently nice way on F. Otherwise we may have to resort to more complicated smoothed estimates of IC(x; F, T).
For complicated estimates IC(x; F, T) may not readily be computable, so we might replace it by the sensitivity curve SCn (see §4). If we intend to integrate with respect to Fn(dx), we only need the values of SCn at the observations xi. Then, instead of doubling the observations at xi, we might just as well leave it out when



computing the difference quotient, i.e., we would approximate IC(xi) by

This is related to the so-called jackknife (Miller (1964), (1974)).
Jackknife. Consider an estimate Tn(x1, …, xn) which is essentially the same across different sample sizes. Then the i-th jackknifed pseudo-value is, by definition,

T*n,i = n Tn(x1, …, xn) − (n − 1) Tn−1(x1, …, xi−1, xi+1, …, xn).
If Tn is the sample mean, then T*n,i = xi, for example.
In terms of the jackknife, our previous approximation (10.2) to the influence curve is T*n,i − Tn. If Tn is a consistent estimate of θ whose bias has the expansion

    then

has a smaller bias:

see Quenouille (1956).
Example. If

    then

    and

Tukey (1958) pointed out that

usually is a good estimator of the variance of Tn. (It can also be used as an estimate of the variance of the mean pseudo-value, but it is better matched to Tn.)
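A minimal sketch of the jackknife just described (our own illustration, function names ours): the pseudo-values T*n,i = n Tn − (n − 1) Tn−1,i, their mean as a bias-corrected estimate, and Tukey's variance estimate.

```python
from statistics import mean

def jackknife(data, estimator):
    n = len(data)
    t_full = estimator(data)
    # i-th pseudo-value: n*T_n - (n-1)*T_{n-1,i}, with observation i left out
    pseudo = [n * t_full - (n - 1) * estimator(data[:i] + data[i + 1:])
              for i in range(n)]
    t_bar = mean(pseudo)  # bias-corrected estimate
    # Tukey's jackknife variance estimate
    var = sum((p - t_bar) ** 2 for p in pseudo) / (n * (n - 1))
    return t_bar, var
```

For the sample mean the pseudo-values reduce to the observations themselves, and the variance estimate is the usual s²/n.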


Warning. If the influence function IC(x; F, T) does not depend smoothly on F, for instance in the case of the sample median, the jackknife is in trouble and may yield a variance estimate which is worse than useless.
Example. The α-trimmed mean x̄α. Assume for simplicity that g = (n − 1)α is an integer, and that x1 < …




CHAPTER IV
Asymptotic Minimax Theory

11. Minimax asymptotic bias. To fix the idea, assume that the true distribution F lies in the set

The median incurs its maximal positive bias x0 when all contamination lies to the right of x0, where x0 is determined from

i.e., for the median we obtain

On the other hand, 𝒫ε contains the following distribution F+, defined by its density

where φ is the standard normal density. Note that F+ is symmetric around x0, and that

also belongs to 𝒫ε. Thus, we must have

for any translation invariant functional T. It is obvious from this that none can have a smaller absolute bias than x0 at F+ and F− simultaneously.
For the median, we have (rather trivially) b1(ε) = b(ε), and thus we have shown that the sample median minimizes the maximal asymptotic bias.
We did not use any particular property of 𝒫ε, and the same argument carries through with little change for other distributions than the normal and other types of neighborhoods. It appears, not surprisingly, that the sample median is the estimate of choice for extremely large sample sizes, where the possible bias becomes more important than the standard deviation of the estimate, which is of the order 1/√n.
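Consistent with the determination of x0 above, under ε-contamination of the normal with all contaminating mass to the right, the median of F solves (1 − ε)Φ(x0) = 1/2, so x0 = Φ^(-1)(1/(2(1 − ε))). A numeric sketch (ours, using the standard library's NormalDist):

```python
from statistics import NormalDist

def median_max_bias(eps):
    # All contamination to the right of x0: (1 - eps) * Phi(x0) = 1/2.
    return NormalDist().inv_cdf(0.5 / (1.0 - eps))
```

The bias is 0 at ε = 0 and grows with the contamination fraction, as the text's argument requires.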




12. Minimax asymptotic variance. In the following, 𝒫 will be some neighborhood of the normal distribution Φ, consisting of symmetric distributions only, e.g.

    or

It is convenient to assume that 𝒫 be convex and compact in a suitable topology (the vague topology: the weakest topology such that F ↦ ∫ f dF is continuous for all continuous f with compact support). We allow 𝒫 to contain substochastic measures (i.e. probability measures putting mass at ±∞); these may be thought to formalize the possibility of infinitely bad outliers. The problem is to estimate location θ in the family F(x − θ), F ∈ 𝒫.
The theory is described in some detail in Huber (1964); I only sketch the salient points here.
First, we have to minimize Fisher information over 𝒫.
1. Define Fisher information as

(where C¹_K is the set of continuously differentiable functions with compact support).
2. THEOREM. The following two assertions are equivalent:
(i) I(F) < ∞


3. I(·) is lower semi-continuous (being the supremum of a set of continuous functions); hence it attains its minimum on the compact set 𝒫, say at F0.
4. I(·) is convex.
5. If I(F0) > 0, then F0 is unique.
6. The formal expression for the inverse of the asymptotic variance of an M-estimate of location,

8. Let Ft = (1 − t)F0 + tF1.
Example. For ε-contamination, we obtain

    with

where c = c(ε). The L- and R-estimates which are efficient at F0 do not necessarily yield minimax solutions, since convexity fails (point 6 in the above sketch of the proof). There are in fact counter-examples (Sacks and Ylvisaker (1972)). However, in the important case of symmetric ε-contamination, the conclusion remains true for both L- and R-estimates (Jaeckel (1971a)).
Variants. Note that the least informative distribution F0 has exponential tails, i.e. they might be slimmer(!) than what one would expect in practice. So it might be worthwhile to increase the maximum risk a little bit beyond the minimax value in order to gain a better performance at long-tailed distributions.
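The constant c = c(ε) of the clipped score can be computed numerically; for ε-contamination it solves 2φ(c)/c − 2Φ(−c) = ε/(1 − ε) (Huber (1964)). A bisection sketch (ours; the left side is decreasing in c):

```python
from statistics import NormalDist

def huber_c(eps, lo=1e-6, hi=10.0):
    nd = NormalDist()
    def g(c):
        # 2*phi(c)/c - 2*Phi(-c) - eps/(1-eps); decreasing in c
        return 2 * nd.pdf(c) / c - 2 * nd.cdf(-c) - eps / (1 - eps)
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid   # root lies above mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For ε = 0.05 this gives c near 1.4, a commonly quoted clipping constant.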

is convex in F.
7. Take ψ0 = −f0'/f0. Then


This can be done as follows. Consider M-estimates, and minimize the maximal asymptotic variance subject to the side condition

The solution for contaminated normal distributions is of the form (Collins (1976)) (see Fig. 4):

The values of c and b of course depend on ε.
The actual performance does not depend very much on exactly how ψ redescends to 0; only, one should make sure that it does not do so too steeply; in particular |ψ'| should be small when |ψ| is large.
Hampel's extremal problem (Hampel (1968)). Assume that the model is a

general one-parameter family of densities f(x, θ), and estimate θ by an M-estimate based on some function ψ(x, θ), i.e.

Assume that T is Fisher consistent at the model, i.e.

FIG. 4


Then

    and the asymptotic variance at the model is

Hampel's extremal problem now is to put a bound on the gross error sensitivity:

with some appropriately chosen function kθ and, subject to this side condition, to minimize the asymptotic variance A(Fθ, ψ) at the model.
The solution is of the form

    where we have used the notation

The functions aθ, bθ are somewhat difficult to determine, and if kθ is too small, there is no solution at all. It might therefore be preferable to start with choosing bθ. A reasonable choice might be the following, with c between 1 and 2, and where

is the Fisher information. Then one determines aθ (so that the estimate is Fisher consistent), and finally, one finds kθ.




CHAPTER V
Multiparameter Problems

13. Generalities. As far as M-estimates are concerned, most concepts of the preceding chapters generalize to vector valued parameters. Asymptotic normality was treated by Huber (1967). The Fisher information matrix and the inverse of the asymptotic covariance matrix of the estimate are convex functions of the true underlying distribution, matrices being ordered by positive definiteness. Since this is not a lattice ordering, it is not in general possible to find a distribution minimizing Fisher information. But if there is one, the corresponding maximum likelihood estimate possesses an asymptotic minimax property: it minimizes the maximal asymptotic variance among all distributions for which it is Fisher consistent.

14. Regression. Assume that p unknown parameters (θ1, …, θp) = θᵀ are to be estimated from n observations (y1, …, yn) = yᵀ, to which they are related by

The fi are known functions, often assumed to be linear in θ, and the errors are independent random variables with approximately identical distributions. One wants to estimate the unknown true θ by a value θ̂ such that the residuals

are made as small as possible. Classically, this is interpreted (Gauss, Legendre) as

or, almost equivalently, by taking derivatives:

Unfortunately, this classical approach is highly sensitive to occasional gross errors. As a remedy, one may replace the square in (14.3) by a less rapidly increasing function ρ:



or, instead of (14.4), to solve

for θ with ψ = ρ'.
There are also other possibilities to robustify (14.3). For instance, Jurečková (1971) and Jaeckel (1972) have proposed to replace the residuals Δi in (14.4) by their ranks (or, more generally, by a function of their ranks). Possibly, it might be good to safeguard against errors in the fi by modifying also the second factor in (14.4), e.g. by replacing also ∂fi/∂θj by its rank in (∂f1/∂θj, …, ∂fn/∂θj), but the consequences of these ideas have only begun to be investigated (Hill (1977)).
In any case, the empirical evidence available to date suggests that the M-estimate approach (14.5), (14.6) is easier to handle and more flexible, and even has slightly better statistical properties than the approaches based on R- and L-estimates. There is only one minor disadvantage: one must simultaneously estimate a scale parameter S in order to make it scale invariant, e.g.

where S is determined simultaneously from

In the regression case I would prefer this S to, say, the median absolute value of the residuals, since it is more easily tractable in theory (convergence proofs) and since it fits better into the established flow of calculation of θ̂ in large least squares problems.
In order that robust regression works, the observation yi should not have an overriding influence on the fitted value

To clarify the issues, take the classical least squares case and assume the fi to be linear. Then, if var(yi) = σ², we obtain

where γii is the i-th diagonal element of P.


Note that tr(P) = p, so max_i γii ≥ ave_i γii = p/n; in some sense, 1/γii is the effective number of observations entering into the determination of ŷi. If γii is

close to 1, ŷi is essentially determined by yi alone, yi may have an undue leverage on the determination of certain parameters, and it may well be impossible to decide whether yi contains a gross error or not.
The asymptotic theory of robust regression works if ε = max_i γii goes to 0 sufficiently fast when p and n tend to infinity; "sufficiently fast" may be taken to mean εp² → 0 or (with slightly weaker results) εp → 0.
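For simple linear regression with an intercept (p = 2), the diagonal elements γii of the projection matrix have the well-known closed form γii = 1/n + (xi − x̄)²/Σ(xj − x̄)², so tr(P) = p can be checked directly. A sketch (ours, illustration only):

```python
def leverages(x):
    # gamma_ii for the model y = a + b*x (projection matrix diagonal)
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
```

An isolated design point drives its γii toward 1: that single yi then essentially determines its own fitted value, as described above.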

If only ε → 0, there may be trouble if the observational errors have an asymmetric distribution and p is extremely large (above 100). This effect has been experimentally verified in a specifically designed Monte Carlo study (Huber (1973a)), but for all practical purposes, ε → 0 seems to be o.k. Note that ε → 0 implies p/n → 0.
As already mentioned, we propose to enforce scale invariance by estimating a scale parameter σ simultaneously with θ. This can be done elegantly by minimizing an expression of the form

(The more natural looking expression derived from the simultaneous M.L. problem, which contains log σ, would not allow us to push through a simple convergence proof for the numerical calculations.)
In the above, ρ is a convex function, ρ(0) = 0, which should satisfy

If c < ∞, Q can be extended by continuity

One easily checks that Q is a convex function of (θ, σ) if the fi are linear. Unless the minimum (θ̂, σ̂) occurs on the boundary σ = 0, it can equivalently be characterized by the p + 1 equations

    with



If σ̂ is to be asymptotically unbiased for normal errors, we should choose

where the expectation is taken for a standard normal argument U.
Examples.
(i) With ρ(x) = x²/2 we obtain the standard least squares estimates: θ̂ minimizes (14.3) and satisfies

    we have

and we obtain the "proposal 2" estimates of Huber (1964), (1973).
Algorithms. I know essentially three successful algorithms for minimizing Q. The simultaneous estimation of σ introduces some (inessential) complications, and if we disregard it for the moment, we may describe the salient ideas as follows.
ALGORITHM S. Apply Newton's method to Σi ψ(Δi) ∂fi/∂θj = 0, j = 1, …, p. If ψ is piecewise linear, if the fi are linear, and if the trial value θ(m) is so close to the final value θ̂ that both induce the same classification of the residuals Δi according to the linear pieces of ψ they lie in, then this procedure reaches the exact, final value θ̂ in one single step. If the iteration step for σ(m) is arranged properly, this also holds for the scale invariant version of the algorithm.
ALGORITHM H. Apply the standard iterative nonlinear least squares algorithm (even if the fi are linear), but with metrical Winsorization of the residuals: in each step we replace yi by

ALGORITHM W. Apply the ordinary weighted least squares algorithm, with weights

determined from the current values of the residuals.
R. Dutter (1975a), (1975b), (1976) has investigated these algorithms and some of their variants.
If very accurate (numerically exact) solutions are wanted, one should use Algorithm S. On the average, it reached the exact values within about 10

    (ii) With


iterations in our Monte Carlo experiments. However, these iterations are relatively slow, since elaborate safeguards are needed to prevent potentially catastrophic oscillations in the nonfinal iteration steps, where θ(m) may still be far from the truth.
Algorithms H and W are much simpler to program, converge as they stand, albeit slowly, but reach a statistically satisfactory accuracy also within about 10 (now faster) iterative steps.
The overall performance of H and W is almost equally good; for linear problems, H may have a slight advantage, since its normal equations matrix CᵀC can be calculated once and for all, whereas for W it must be recalculated (or at least updated) at each iteration. By the way, the actual calculation of CᵀC can be circumvented just as in the classical least squares case, cf. Lawson and Hanson (1974).
In detail, Algorithm H can be defined as follows. We need starting values θ(0), σ(0) for the parameters and a tolerance value ε > 0. Now perform the following steps.
1. Put m = 0.
2. Compute residuals Δi(m) = yi − fi(θ(m)).
3. Compute a new value for σ by

    4. Winsorize the residuals:

5. Compute the partial derivatives

    6. Solve

for τ.
7. Put

where 0 < q < 2 is an arbitrary relaxation factor.


In the case of "proposal 2", ψ(x) = max(−c, min(x, c)), these expressions simplify: μ then is the fraction of non-Winsorized residuals, and

Under mild assumptions, this algorithm converges (see Huber and Dutter (1974)): assume that ρ(x)/x is convex for x > 0, that 0 < ρ'' ≤ 1, and that the fi are linear. Then one can show that

and that these inequalities are strict, unless σ(m) = σ(m+1) or θ(m) = θ(m+1) respectively. The sequence (θ(m), σ(m)) has at least one accumulation point (this follows from a standard compactness argument), and every accumulation point minimizes Q(θ, σ). If the minimum is unique, this of course implies convergence.
Remark. The selection of starting values (θ(0), σ(0)) presents some problems. In the general case with nonlinear fi's, no blanket rules can be given, except that values of σ(0) which are too small should better be avoided. So we might just as well determine σ(0) from θ(0) as

In the one-dimensional location case we know that the sample mean is a very poor start if there are wild observations, but that the sample median is such a good one that one step of iteration suffices for all practical purposes (θ(1) and θ̂ are asymptotically equivalent), cf. Andrews et al. (1972).
Unfortunately, in the general regression case the analogue to the sample median (the L1-estimate) is harder to compute than the estimate θ̂ we are interested in, so we shall, despite its poor properties, use the analogue to the sample mean (the L2 or least squares estimate) to start the iteration. Most customers will want to see the least squares result anyway.
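For the one-dimensional location case the loop of Algorithm H collapses to: Winsorize the residuals at ±c·S and re-average. A sketch with a fixed MAD-based ancillary scale and a least squares (sample mean) start, as suggested above; the clipping constant and function names are ours, and the full algorithm would also iterate σ:

```python
from statistics import mean, median

def huber_location(x, c=1.5, iters=100):
    # ancillary MAD scale, computed once (0.6745 makes it consistent at the normal)
    s = median(abs(xi - median(x)) for xi in x) / 0.6745
    t = mean(x)  # least squares start, as in the text
    for _ in range(iters):
        # metrically Winsorize the residuals at +/- c*s, then re-average
        t = mean(min(max(xi, t - c * s), t + c * s) for xi in x)
    return t
```

A gross error is clipped to the edge of the window around the current fit, so it contributes only boundedly to each averaging step.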


9. Estimate θ by θ(m+1),

Here K is a correction factor (Huber (1973, p. 812 ff)):

    where



15. Robust covariances: the affinely invariant case. Covariance matrices and the associated ellipsoids are often used to describe the overall shape of point clouds in p-dimensional Euclidean space (principal component and factor analysis, discriminant analysis, etc.). But because of their high outlier sensitivity they are not particularly well suited for this purpose. Let us look first at affinely invariant robust alternatives.
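The outlier sensitivity is easy to demonstrate numerically (our own illustration): a single wild point inflates the classical second moment without bound, while a MAD-type scale barely moves.

```python
from statistics import variance, median

data = [float(i) for i in range(10)]   # 0, 1, ..., 9
spoiled = data + [1000.0]              # one gross error

# classical variance explodes under the single outlier
ratio = variance(spoiled) / variance(data)

# a median-absolute-deviation scale is nearly unaffected
mad = lambda xs: median(abs(x - median(xs)) for x in xs)
```

The same phenomenon, in p dimensions, is what motivates the robust alternatives developed below.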

Take a fixed spherically symmetric probability density f in Rᵖ. We apply arbitrary nondegenerate affine transformations x → V(x − ξ) and obtain a family of elliptical densities

Let us assume that our data obey an underlying model of this type; the problem is to estimate the p-vector ξ and the p × p matrix V from n observations x1, …, xn, xi ∈ Rᵖ.
Evidently V is not uniquely identifiable (it can be multiplied by an arbitrary orthogonal matrix from the left), but VᵀV is. We can also enforce uniqueness of V by requiring that, e.g., V is positive definite symmetric, or lower triangular with a positive diagonal. Usually, we shall adopt the latter convention. The matrix

shall be called (pseudo-)covariance matrix; it transforms like an ordinary covariance matrix under affine transformations.
The maximum likelihood estimate of ξ, V is obtained by maximizing

where ave{·} denotes the average taken over the sample.
By taking derivatives, we obtain the following system of equations for ξ, V:

    with

    an d

(The reason for introducing the function v shall be explained later.) I is the p × p identity matrix.


Note that (15.2) can also be written as

i.e. as a weighted mean, with weights depending on the sample, and (15.3) similarly can be reworked into a weighted covariance

(Note that for the multivariate normal density all the weights in (15.3) are identically 1, so we get the ordinary covariance in this case.)
Equation (15.3), with an arbitrary v, is in a certain sense the most general form for an affinely invariant M-estimate of covariance. I shall briefly sketch why this is so. Forget about location for the moment and assume ξ = 0. We need p(p + 1)/2 equations for the unique components of V, so assume our M-estimate to be determined from

where Ψ is an essentially arbitrary function from Rᵖ into the space of symmetric p × p matrices.

If all solutions V of (15.5) give the same (VᵀV)⁻¹, then the latter automatically transforms in the proper way under linear transformations of the x. Moreover, if S is any orthogonal matrix, then one shows easily that

when substituted for Ψ in (15.5), determines the same (VᵀV)⁻¹.
Now average over the orthogonal group:

then every solution V of (15.5) also solves

Note that the averaged function is invariant under orthogonal transformations

    and this implies that

for some functions u, v. (This last result was found independently also by W. Stahel.)
Influence function. Assume that the true underlying distribution F is spherically symmetric, so that the true values of the estimates are ξ = 0, V = I. The influence functions for the estimators ξ, V computed from equations of the form


(15.2), (15.3) (where u, v, w need not be related to f) are then vector and matrix valued, of course, but otherwise they can be determined as usual. One obtains

    with

For the pseudo-covariance (VᵀV)⁻¹ the influence function is, of course, (S + Sᵀ).

The asymptotic variances and covariances of the estimates (normalized by multiplication with √n) coincide with those of their influence functions and thus can be calculated easily. For proofs (in a slightly more restricted framework) see Maronna (1976).
Least informative distributions. Let

where φ is the standard p-variate normal density.
(i) Location. Assume that ξ depends differentiably on a real valued parameter t. Then minimizing Fisher information over 𝒫ε with respect to t amounts to minimizing

For p = 1, this extremal problem was solved in §12. For p > 1, the solutions are much more complicated (they can be expressed in terms of Bessel and Neumann functions); somewhat surprisingly, −log f(|x|) is no longer convex (which leads to

for some constants α, β, γ:


complications with consistency proofs). In fact, the qualitative behavior of the solutions for large p and ε is not known (it seems that in general there is a fixed sphere around the origin not containing any contamination, even if ε → 1).

For our present purposes, however, the exact choice of the location estimate does not really matter, provided it is robust: w(r) should be 1 in some neighborhood of the origin and then decrease so that w(r)·r stays bounded.
(ii) Scale. Assume that V depends differentiably on a real valued parameter t. Then minimizing Fisher information over 𝒫ε with respect to t amounts to minimizing

where, for some

FIG. 5

The corresponding least informative density has a singularity at the origin if a > 0:

    where


Thus, it is rather unlikely that Nature will play its minimax strategy against the statistician, and so the statistician's minimax strategy is too pessimistic (it safeguards against a quite unlikely contingency). Nevertheless, these estimates

have some good points in their favor. For instance, in dimension 1 the limiting case ε → 1 gives a = b = 1 and leads to the median absolute deviation, which is distinguished by its high breakdown point ε* = 1/2.
In higher dimensions, these estimates can be described as follows: adjust a transformation matrix V (and a location parameter ξ, which we shall disregard) until the transformed sample (y1, …, yn), with yi = V(xi − ξ), has the following property: if the y-sample is metrically Winsorized by moving all points outside of the spherical shell a ≤ r ≤ b radially to the nearest surface point of the shell, then the modified y-sample has unit covariance matrix (see Fig. 6).
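The metrical Winsorization step just described can be sketched for p = 2 (our illustration, plain Python): points outside the shell a ≤ r ≤ b are pulled radially onto it, points inside are left alone.

```python
import math

def winsorize_shell(points, a, b):
    # Move each point outside the shell a <= r <= b radially to its surface.
    out = []
    for x, y in points:
        r = math.hypot(x, y)
        if r == 0.0:
            out.append((x, y))
        else:
            rc = min(max(r, a), b)  # clip the radius into [a, b]
            out.append((x * rc / r, y * rc / r))
    return out
```

In the full estimate, V (and ξ) are adjusted until the sample so modified has unit covariance matrix.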

FIG. 6

Breakdown properties. Unfortunately, the situation does not look good in high dimensions p. It appears that ε* ≤ 1/p for all affinely invariant M-estimates of covariance. This has been proved under the assumption that u ≥ 0, but in all

likelihood is generally true (Huber (1976)).
A note on calculations. Formulas (15.2) and (15.3) can be used to calculate ξ and V iteratively: calculate weights from the current approximations to ξ and V respectively, then use (15.2) and (15.3) to obtain better values for ξ, V. Convergence is not really fast (compare also Maronna (1976, p. 66)), but adequate in low dimensions; a convergence proof is still outstanding. Other computational schemes are under investigation.

16. Robust covariances: the coordinate dependent case. Full affine invariance does not always make sense. Time series problems may be a prime example: one would not want to destroy the natural chronological ordering of the observations. Moreover, the high dimensions involved here would lead to ridiculously low breakdown points.
Several coordinate dependent approaches have been proposed and explored by Gnanadesikan and Kettenring (1972) and by Devlin et al. (1975).


The following is a simple idea which furnishes a unifying treatment of some coordinate dependent approaches.
Let X = (X1, …, Xn), Y = (Y1, …, Yn) be two independent vectors, such that

the law ℒ(Y) is invariant under permutations of the components; nothing is said about the distribution of X. Then the following theorem holds.
THEOREM. The sample correlation

Proof. Calculate these expectations conditionally, given X, and given Y up to a random permutation.
Despite this distribution-free result, r obviously is not robust: one single, sufficiently bad outlying pair (Xi, Yi) can shift r to any value in (−1, 1).
The following is a remedy. Replace r(x, y) by r(u, v), where u, v are computed from x, y according to certain, very general rules. The first two of the following five requirements are essential, the others are merely sometimes convenient.
1. u is computed from x, v from y:

2. The maps commute with permutations.
3. They are monotone increasing.
4. E…
5. ∀a > 0, ∀β, ∃a1 > 0, ∃β1, …
Of these, parts 1 and 2 ensure that u, v still satisfy the assumptions of the theorem, if x, y did. If part 3 holds, then perfect rank correlations are preserved. Finally, parts 4 and 5 together imply that correlations ±1 are preserved. In the following two examples, all five requirements are satisfied.
Example (i). ui = a(Ri), where Ri is the rank of xi and a is a monotone scores function.
Example (ii). ui = ψ((xi − T)/S), where ψ is monotone and T, S are any estimates of location and scale.
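Example (ii) with ψ(x) = sign(x), T = median, S arbitrary gives the quadrant correlation; a sketch (ours; helper names are ours):

```python
from statistics import median

def corr(u, v):
    # ordinary sample correlation coefficient
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    num = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    den = (sum((a - ub) ** 2 for a in u)
           * sum((b - vb) ** 2 for b in v)) ** 0.5
    return num / den

def quadrant_corr(x, y):
    # replace observations by signs of deviations from the medians
    sign = lambda t: 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)
    mx, my = median(x), median(y)
    return corr([sign(a - mx) for a in x], [sign(b - my) for b in y])
```

A single outlying pair can drag the plain correlation far away, while the sign transform bounds its influence.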

Some properties of such modified correlations:
(1) Let ℒ(X, Y) = F = (1 − ε)G + εH be a centrosymmetric probability in R². Then the correlation satisfies

    satisfies


The bounds are sharp, with

where k = sup|ψ|. So γ is smallest for ψ(x) = sign(x), i.e. the quadrant correlation. This should be compared to the minimax bias property of the sample median (§11).
(2) Tests for independence. Take the following test problem. Hypothesis:

where X*, Y*, Z, Z1 are independent symmetric random variables with E(X*) = 0. … (as n → ∞) one obtains that the minimum power is maximized using the sample correlation coefficient r(ψ(x), ψ(y)), where ψ corresponds to the minimax M-estimate of location (§12) for distributions in 𝒫ε.
(3) Particular choice for ψ. Let

where Φ is the standard normal cumulative. If

    then

So, for this choice of ψ, at the normal model, we have a particularly simple transformation from the covariance of ψ(X), ψ(Y) to that of X, Y. (But note: if this transformation is applied to a covariance matrix, it may destroy positive definiteness.)




CHAPTER VI
Finite Sample Minimax Theory

17. Robust tests and capacities. Do asymptotic robustness results carry over to small samples? This is not at all evident: 1% contamination means something entirely different when the sample size is 1000 (and there are about 10 outliers per sample) than when it is 5 (and 19 out of 20 samples are outlier-free).
Let us begin with testing, where the situation is still fairly simple. The Neyman-Pearson lemma is clearly nonrobust, since a single sour observation can determine whether

if log p1(x)/p0(x) happens to be unbounded. The simplest possible robustification is achieved by censoring the summands in (17.1):

    and basing the decision on whether

It turns out that this leads to exact, finite sample minimax tests for quite general neighborhoods of P0, P1:

where d is either total variation distance, ε-contamination, Kolmogorov distance, or Lévy distance.
Intuitively speaking, we blow the simple hypotheses Pj up to composite hypotheses 𝒫j, and we are seeking a pair Qj ∈ 𝒫j of closest neighbors, making the testing problem hardest (see Fig. 7).
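The censoring in (17.2) can be sketched as follows (our illustration: normal location alternatives, clipping constants c0, c1 chosen by us): each summand log p1(xi)/p0(xi) is clipped into [c0, c1], so no single observation can move the test statistic by more than c1 − c0.

```python
def censored_llr(xs, mu0, mu1, c0, c1):
    # Censored log-likelihood-ratio statistic for N(mu1,1) vs N(mu0,1).
    total = 0.0
    for x in xs:
        llr = 0.5 * ((x - mu0) ** 2 - (x - mu1) ** 2)  # log p1(x)/p0(x)
        total += min(max(llr, c0), c1)  # censor the summand
    return total
```

The decision is then based on whether this statistic exceeds a critical value, exactly as with the uncensored Neyman-Pearson statistic.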

    FIG. 7
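A minimal Python sketch of this censoring idea (the bounds lo, hi are illustrative constants, not the minimax censoring points, which depend on the neighborhood): each summand log p1(x_i)/p0(x_i) is clipped before summing, so a single observation can shift the test statistic by at most a bounded amount.

```python
def clip(t, lo, hi):
    """Censor t to the interval [lo, hi]."""
    return max(lo, min(hi, t))

def censored_llr(xs, log_lr, lo=-2.0, hi=2.0):
    """Censored Neyman-Pearson statistic: sum of clipped log-likelihood
    ratio summands (lo, hi are illustrative, not the minimax values)."""
    return sum(clip(log_lr(x), lo, hi) for x in xs)

def log_lr_normal(x):
    # log p1(x)/p0(x) for N(1,1) against N(0,1) is x - 1/2, unbounded in x
    return x - 0.5

clean = [0.2, -0.1, 0.4]
dirty = clean + [1000.0]          # one gross outlier
# the uncensored statistic is dominated by the outlier; the censored one is not
print(sum(log_lr_normal(x) for x in dirty))   # huge
print(censored_llr(dirty, log_lr_normal))     # bounded shift
```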


    If the likelihood ratio π(x) = q1(x)/q0(x) between a certain pair Q_j ∈ P*_j satisfies, for all t,

    then, clearly, the one-sample Neyman-Pearson tests between Q0 and Q1 are minimax tests between P*_0 and P*_1. One easily proves that this property carries over to any sample size. Note that it is equivalent to: for P ∈ P*_0, log π is stochastically largest when P = Q0; hence Σ log π(x_i), with ℒ(x_i) ∈ P*_0, becomes stochastically largest when ℒ(x_i) = Q0 (see, e.g., Lehmann (1959, Lem. 1, p. 73)).

    The existence of such a least favorable pair is not self-evident, and it was in fact a great surprise that the usual sets P*_j all possessed it, and that the likelihood ratio π(x) even had a simple structure (17.1) (Huber (1965)).

    This has to do with the following: the usual neighborhoods P* can be described in terms of a two-alternating capacity v, that is,

    (17.5)  P* = { P ∈ M | P(A) ≤ v(A) for all A }

    where v is a set function satisfying (Ω being a complete, separable metrizable space, A, B being Borel subsets of Ω):

    The crucial property is the last one (the definition of a 2-alternating set function), and it is essentially equivalent to the following: if A1 ⊆ A2 ⊆ ... ⊆ An is any increasing sequence of sets, then there is a P ∈ P* such that, for all i, P(A_i) = v(A_i). This simultaneous maximizing over a monotone family of sets occurs in (17.3) and is needed for the minimax property to hold.

    Examples. Assume that Ω is a finite set, for simplicity, and let P0 be a fixed probability.

    (i) v(A) = (1 - ε)P0(A) + ε for A ≠ ∅ gives ε-contamination neighborhoods:

    (ii) v(A) = min(1, P0(A^ε) + ε) for A ≠ ∅, where A^ε denotes the ε-neighborhood of A, gives Prohorov neighborhoods.
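Example (i) can be checked numerically. The following Python sketch (with a hypothetical three-point Ω and illustrative values of P0 and ε) builds the ε-contamination capacity v(A) = (1 - ε)P0(A) + ε and verifies the 2-alternating inequality v(A ∪ B) + v(A ∩ B) ≤ v(A) + v(B) over all pairs of subsets.

```python
from itertools import chain, combinations

def contamination_capacity(p0, eps):
    """Capacity of example (i): v(A) = (1-eps)*P0(A) + eps for nonempty A."""
    def v(A):
        A = frozenset(A)
        if not A:
            return 0.0
        return (1 - eps) * sum(p0[w] for w in A) + eps
    return v

omega = ['a', 'b', 'c']
p0 = {'a': 0.5, 'b': 0.3, 'c': 0.2}     # illustrative base probability
v = contamination_capacity(p0, eps=0.1)

subsets = [frozenset(s) for s in chain.from_iterable(
    combinations(omega, k) for k in range(len(omega) + 1))]

# 2-alternating: v(A | B) + v(A & B) <= v(A) + v(B) for all A, B
ok = all(v(A | B) + v(A & B) <= v(A) + v(B) + 1e-12
         for A in subsets for B in subsets)
print(ok)
```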


    The theory may have some interesting applications to Bayesian statistics (imprecise priors and imprecise conditional distributions P(x; θ_j)); compare Huber (1973b).

    18. Finite sample minimax estimation. As a particular case of a finite sample minimax robust test, consider testing between N(-μ, 1) and N(+μ, 1), when these two simple hypotheses are blown up, say, by ε-contamination or Prohorov distance ε. Then the minimax robust test will be based on a test statistic of the form

    with

    This can be used to derive exact finite sample minimax estimates. For instance, assume that you are estimating θ from a sample x1, ..., xn, where the x_i are independent random variables, with

    We want to find an estimator T = T(x1, ..., xn) such that, for a given a > 0, the following quantity is minimized:

    Then the solution is found by first deriving a minimax robust test between θ = -a and θ = +a, and then transforming this test into an estimate: find that shift of the original sample for which the test is least able to decide between the two hypotheses. The resulting estimate can be described as follows: let T*, T** be the smallest and the largest solution of

    respectively, with ψ(x) = max(-c, min(x, c)), i.e. x censored at ±c, and c = c(a, ε) not depending on n. Then put T = T* or T = T** at random with equal probability (the more familiar resolution of the dilemma, T = (T* + T**)/2, is not minimax). Formally, this is the same kind of estimate as the minimax asymptotic variance M-estimate under symmetric contamination. But note that in the present case we do not assume symmetry; on the contrary, the finite sample minimax solution for symmetric distributions is unknown.

    For details, see Huber (1968). The question whether this finite sample minimax theory also has an exact, scale invariant counterpart is still open.
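To make the construction concrete: since ψ is monotone, Σ ψ(x_i - t) is nonincreasing in t, and its zero set is an interval [T*, T**] whose endpoints can be found by bisection. The following Python sketch returns T* or T** with equal probability; the value c = 1.5 and the data are illustrative stand-ins (the minimax c = c(a, ε) comes from the robust test).

```python
import random

def psi(x, c):
    """x censored at +/-c."""
    return max(-c, min(c, x))

def msum(xs, t, c):
    """Sum psi(x_i - t); nonincreasing and continuous in t."""
    return sum(psi(x - t, c) for x in xs)

def minimax_location(xs, c=1.5, tol=1e-9, seed=0):
    """Randomized estimate: T* or T** with equal probability, where
    [T*, T**] is the solution interval of sum psi(x_i - t) = 0."""
    lo, hi = min(xs) - c, max(xs) + c      # msum > 0 at lo, < 0 at hi
    a, b = lo, hi                          # T*: smallest t with msum <= 0
    while b - a > tol:
        m = (a + b) / 2
        if msum(xs, m, c) <= 0:
            b = m
        else:
            a = m
    t_star = (a + b) / 2
    a, b = lo, hi                          # T**: largest t with msum >= 0
    while b - a > tol:
        m = (a + b) / 2
        if msum(xs, m, c) >= 0:
            a = m
        else:
            b = m
    t_2star = (a + b) / 2
    rng = random.Random(seed)
    return t_star if rng.random() < 0.5 else t_2star

xs = [0.1, -0.4, 0.3, 0.2, 8.0]           # one gross outlier
print(minimax_location(xs))
```

For these data the solution interval degenerates to a point: the outlier's contribution is censored at c, so the estimate stays near the bulk of the sample rather than being dragged toward 8.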



    CHAPTER VII

    Adaptive Estimates

    19. Adaptive estimates. Jaeckel (1971b) proposed to estimate location with a trimmed mean whose trimming rate α is estimated from the sample itself, namely such that the estimated variance of the trimmed mean (see (10.3)) becomes least possible. He showed that for symmetric distributions this estimate is asymptotically equivalent to the trimmed mean with the smallest asymptotic variance among all trimmed means with fixed trimming rates, provided the underlying distribution F is symmetric and α is restricted to some range 0 < α0 ≤ α ≤ α1 < 1/2.
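Jaeckel's proposal can be sketched in Python as follows. The variance estimate here uses a Winsorized sum of squares about the trimmed mean, scaled by (1 - 2α)^2, which is one common form of such an estimate; the exact expression (10.3) is in the text, and the data, names, and cap on the trimming fraction are illustrative assumptions.

```python
def trimmed_mean(xs, g):
    """Mean after discarding the g smallest and g largest observations."""
    s = sorted(xs)
    n = len(s)
    return sum(s[g:n - g]) / (n - 2 * g)

def est_variance(xs, g):
    """Estimated variance of the (g/n)-trimmed mean: Winsorized sum of
    squares about the trimmed mean, scaled by (1 - 2*g/n)^2."""
    s = sorted(xs)
    n = len(s)
    tm = trimmed_mean(xs, g)
    w = [s[g]] * g + s[g:n - g] + [s[n - g - 1]] * g   # Winsorized sample
    ss = sum((x - tm) ** 2 for x in w)
    alpha = g / n
    return ss / (n * (n * (1 - 2 * alpha) ** 2))

def jaeckel_trimmed_mean(xs, max_frac=0.25):
    """Adaptive trimmed mean: pick the trimming count g that minimizes
    the estimated variance (a sketch of Jaeckel's 1971 proposal)."""
    n = len(xs)
    best_g = min(range(int(max_frac * n) + 1), key=lambda g: est_variance(xs, g))
    return trimmed_mean(xs, best_g)

data = [0.2, -0.1, 0.3, 0.1, 0.0, -0.2, 0.4, -0.3, 50.0, -50.0]
print(jaeckel_trimmed_mean(data))
```

On these data the untrimmed mean has a huge estimated variance because of the two outliers, so the procedure automatically selects a positive trimming rate.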


    Although, for each fixed F, the estimate would seem to be asymptotically efficient, I conjecture that for each fixed n the estimate is poor for sufficiently long-tailed F.

    More recently, Beran (1974), Stone (1975) and Sacks (1975) have described fully efficient versions of R-, M- and L-estimates, respectively; for some of these estimates one has indeed

    for every symmetric distribution F. Essentially all these fully adaptive procedures first estimate a smoothed version of ψ = -f'/f, and then use a location estimate based on the estimated ψ, e.g. an M-estimate

    Possibly, these estimates may be quite poor for asymmetric F, because then the variability of the estimated ψ might contribute a large variance component to the variance of T, possibly of the same order of magnitude as the variance already present in the estimate based on the fixed, true ψ. (For symmetric F, and provided the estimated ψ is forced to be skew symmetric, the variance component in question is asymptotically negligible.) This question should be investigated quantitatively.

    Also, it is not clear whether these estimates are robust; cf. the remark on Takeuchi's estimate. But see now a most recent paper by R. Beran (1976).

    For a comprehensive review of adaptive estimates, see Hogg (1974).


    CHAPTER VIII

    Robustness: Where Are We Now?

    He looked into the water and saw that it was made up of a thousand thousand thousand and one different currents, each one a different colour, weaving in and out of one another like a liquid tapestry of breathtaking complexity; and Iff explained that these were the Streams of Story, that each coloured strand represented and contained a single tale.

    Salman Rushdie, Haroun and the Sea of Stories (1990)

    20. The first ten years. A good case can be made that modern robustness begins in 1960 with the papers by J. W. Tukey on sampling from contaminated distributions and F. J. Anscombe on the rejection of outliers. Tukey's paper drew attention to the dramatic effects of seemingly negligible deviations from the model, and it made effective use of asymptotics in combination with the gross error model. Anscombe introduced a seminal insurance idea: sacrifice some performance at the model in order to insure against ill effects caused by deviations from it. Most of the basic ideas, concepts, and methods of robustness were invented in quick succession during the following years and were in place by about 1970.

    In 1964 I recast Tukey's general setup into an asymptotic minimax framework and was able to solve it. Important points were the insistence on finite but small deviations, the formal recognition that a large error occurring with low probability is a small deviation, and the switch from the then-prevalent criterion of relative efficiency to absolute performance. At the same time, I introduced the notion of maximum likelihood type or M-estimators.

    Hampel (1968) added the formal definition of qualitative robustness (continuity under a suitable weak topology), infinitesimal robustness in form of the influence function (von Mises derivative of a statistical functional), and the notion of breakdown point.

    A nagging early worry was the possible irrelevancy of asymptotic approaches: conceptually, a 1% gross error rate in samples of 1000 is entirely different from the same error rate in samples of 5. This worry was laid to rest by Huber (1965), (1968): both for tests and for estimates, there are qualitatively identical and even quantitatively similar exact finite sample minimax robustness results. At the same time, this finite sample approach did away with the annoying assumption that contamination should preserve symmetry, and it showed that one could deal with the thorny identifiability issue through a kind of interval arithmetic.

    The end of the decade saw the first extension of asymptotic robustness theory beyond location to more general parametric models, namely the introduction of shrinking neighborhoods by C. Huber-Carol (1970), as well as the first attempt at studentizing (Huber, 1970).



    The basic methodology for Monte Carlo studies of robust estimators was established in Princeton during 1970 to 1971; see Andrews et al. (1972). That study basically settled the problem of robustly estimating a single location parameter in samples of size 10 or larger, opening the way for more general multiparameter problems.

    In this chapter I've chosen several of the more interesting robustness ideas and followed the strands of their stories to the present. What has happened to the early ideas since their invention? What important new ideas came about in the 1970s and 1980s? I have avoided technicalities and instead have given a reference to a recent survey or monograph where feasible. Completeness is not intended; for a far from complete survey of research directions in robust statistics, with more than 500 references, see Stahel's article in Stahel and Weisberg (1991, Part II, p. 243).

    21. Influence functions and pseudovalues. Influence functions have become a standard tool among the basic robustness concepts, especially after the comprehensive treatment by Hampel et al. (1986). The technique used to robustize classical procedures through the use of pseudovalues is also becoming widely known, even though it has received only scant coverage in the literature. The procedure is to calculate robust fitted values ŷ_j by iteratively applying the classical procedure to the pseudovalues y*_j = ŷ_j + r*_j instead of y_j. Here, the pseudoresidual r*_j = ψ(r_j) is obtained by cutting down the current residual r_j = y_j - ŷ_j with the help of a function ψ proportional to the desired influence function (i.e., with the ψ-function defining an M-estimate). For examples see in particular Bickel (1976, p. 167), Huber (1979), and Kleiner, Martin, and Thomson (1979). If ψ is chosen equal to, rather than merely proportional to, the influence function, the classical formulas, when applied to the pseudoresiduals r*_j instead of the residuals, yield asymptotically correct error estimates for ANOVA and other purposes (Huber 1981, p. 197).

    There have been some interesting extensions of influence function ideas to time series (Künsch, 1984).
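A minimal Python sketch of the pseudovalue iteration for the simplest case, location estimation, where the "classical procedure" is the sample mean; scale is held fixed at 1 for simplicity, and the tuning constant c and the data are illustrative assumptions. Iterating the classical fit on the pseudovalues converges to the corresponding M-estimate.

```python
import statistics

def huber_psi(r, c):
    """Cut the residual down to the interval [-c, c]."""
    return max(-c, min(c, r))

def robust_fit_pseudovalues(ys, classical_fit, c=1.345, iters=50):
    """Robustize a classical fit by iterating it on the pseudovalues
    y*_j = yhat_j + psi(y_j - yhat_j)."""
    yhat = classical_fit(ys)
    for _ in range(iters):
        pseudo = [yhat + huber_psi(y - yhat, c) for y in ys]
        yhat = classical_fit(pseudo)
    return yhat

ys = [2.1, 1.9, 2.0, 2.2, 1.8, 12.0]      # one gross outlier
print(robust_fit_pseudovalues(ys, statistics.fmean))
```

The plain mean of these data is about 3.67; the pseudovalue iteration pulls the fit back toward the five clean observations because the outlier's residual is censored at c on every pass.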

    22. Breakdown and outlier detection. For years the concept of breakdown point had been a neglected stepchild in the robustness literature. The paper by Donoho and Huber (1983) was specifically written to give it more visibility. Recently I have begun to wonder whether it has given it too much; the suddenly fashionable emphasis on high breakdown point procedures has become counterproductive. One of the most striking examples of the usefulness of the concept can be found in Hampel (1985): the combined performance of outlier rejection followed by the sample mean as an estimate of location is essentially determined by the breakdown of the outlier detection procedure.
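A toy Python illustration of the breakdown point concept (not Hampel's computation): the sample mean is ruined by a single gross error, while the sample median tolerates just under half the sample being bad. The data are hypothetical.

```python
import statistics

def contaminated(n, m, bad=1e6):
    """A sample of n points, m of which are gross errors (toy data)."""
    return [0.0] * (n - m) + [bad] * m

# the mean breaks down under a single gross error ...
print(statistics.fmean(contaminated(20, 1)))    # 50000.0
# ... while the median stays put with 9 of 20 observations bad
print(statistics.median(contaminated(20, 9)))   # 0.0
```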

    23. Studentizing. Whenever we have an estimate, we should provide an indication of its statistical accuracy, for example, by giving a 95% confidence interval. This is not particularly difficult if the number of observations is very large, so that the estimate is asymptotically normal with an accurately estimable standard error, or in one-parameter problems without a nuisance parameter, where the finite sample theory of Huber (1968) can be applied.


    If neither of these apply, we end up with a tricky problem of studentization. To my knowledge, there has not been much progress beyond the admittedly unsatisfactory initial paper of Huber (1970). There are many open questions with regard to this crucially important problem; in fact, one is what questions one should ask. A sketch of the principal issues follows.

    In the classical normal case, it follows from sufficiency of (x̄, s) and an invariance argument that such a confidence interval must take the form

    with k_n depending on the sample size but not on the sample itself. In a well-behaved robust version, a confidence interval might take the analogous form

    where T is an asymptotically normal robust location estimate and S is the location invariant, scale equivariant, Fisher-consistent functional estimating the asymptotic standard deviation of √n T, applied to the empirical distribution. In the case of M-estimates, for example, we might use

    where the argument of ψ, ψ' is y = (x - T)/S0, i.e., a robustly centered and scaled version of the observations, say with S0 = MAD. If we are interested in 95% confidence intervals, K_n must approach Φ^(-1)(0.975) ≈ 1.96 for large n. But K_n might depend on the sample configuration in a nontrivial, translation and scale invariant way: since we do not have a sufficient statistic, we might want to condition on the configuration of the sample in an as-yet undetermined way.

    Although the distribution of √n T typically approaches the normal, it will do so much faster in the central region than in the tails, and the extreme tails will depend rather uncontrollably on details of the unknown distribution of the observations. The distribution of S suffers from similar problems, but here it is the low end that matters. The question is, what confidence levels make sense and are reasonably stable for what sample sizes? For example, given a particular level of contamination and a particular estimate, is n = 10 good enough to derive accurate and robust 99% confidence intervals, or do we have to be content with 95% or 90%? I anticipate that such questions can (and will) be settled with the help of small sample asymptotics, assisted perhaps by configural polysampling (see below).
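A sketch of such an interval T ± K_n S/√n in Python, for a clipped-identity ψ with S0 = MAD, using the fixed constant 1.96 in place of the possibly configuration-dependent K_n discussed above. The tuning constant, the fixed-point iteration for T, and the data are illustrative assumptions, not prescriptions from the text.

```python
import statistics

def huber_psi(y, c=1.345):
    return max(-c, min(c, y))

def huber_dpsi(y, c=1.345):
    """Derivative of the clipped identity: 1 inside [-c, c], else 0."""
    return 1.0 if abs(y) < c else 0.0

def robust_ci(xs, c=1.345, k=1.96, iters=50):
    """Interval T +/- k*S/sqrt(n): T is an M-estimate of location with
    fixed scale S0 = MAD; S estimates the asymptotic std. dev. of
    sqrt(n)*T via avg(psi^2) / avg(psi')^2."""
    n = len(xs)
    med = statistics.median(xs)
    s0 = statistics.median([abs(x - med) for x in xs]) or 1.0
    t = med
    for _ in range(iters):                 # fixed-point iteration for T
        t = t + s0 * statistics.fmean(huber_psi((x - t) / s0, c) for x in xs)
    ys = [(x - t) / s0 for x in xs]
    num = statistics.fmean(huber_psi(y, c) ** 2 for y in ys)
    den = statistics.fmean(huber_dpsi(y, c) for y in ys) ** 2
    s = s0 * (num / den) ** 0.5
    h = k * s / n ** 0.5
    return t - h, t + h

xs = [0.3, -0.2, 0.1, 0.0, 0.4, -0.1, 0.2, 9.0]   # one gross outlier
print(robust_ci(xs))
```

Despite the gross outlier, both endpoints remain near the bulk of the sample; a classical x̄ ± 1.96 s/√n interval on the same data would be dragged far to the right and inflated by the outlier.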

    24. Shrinking neighborhoods. Direct theoretical attacks on finite neighborhoods work only if the problem is location or scale invariant. For large samples, however, most point estimation problems begin to resemble location problems, so it is possible to derive quite general asymptotically robust tests and estimates by letting those neighborhoods shrink at a rate n^(-1/2) with increasing sample size. This idea was first exploited by


    C. Huber-Carol (1970), followed by Rieder, Beran, Millar and others. The final word on this approach is contained in Rieder's book (1994).

    The principal motivation clearly is technical: shrinking leads to a manageable asymptotic theory, with reasonably simple limiting procedures. But there is also a philosophical justification. Since the standard goodness-of-fit tests are just able to detect deviations of the order O(n^(-1/2)), it makes sense to put the border zone between small and large deviations at O(n^(-1/2)). Larger deviations should be taken care of by diagnostics and modeling; smaller ones are difficult to detect and should be covered (in the insurance sense) by robustness.

    This does not mean that our data samples are supposed to get cleaner if they get larger. On the contrary, the theory may be taken to indicate what sample sizes make sense for what degree of cleanliness. Thus, the shrinkage of neighborhoods presents us with a dilemma, namely a choice between the alternatives:

    improve the model; or
    improve the data; or
    stop sampling.

    Note that adaptive estimation is not among the viable alternatives. The problem is not one of reducing statistical variability, but one of avoiding bias, and the ancient Emperor-of-China paradox applies. (You can get a fantastically accurate estimate of the height of the emperor by averaging the guesses of 600 million Chinese, most of whom never saw the emperor.)

    If used with circumspection, shrinking neighborhood theory thus can give valuable qualitative and even quantitative hints on the kind of procedures to be used in a practical situation. If the neighborhood is small in terms of O(n^(-1/2)), we should take something close to the classical MLE; if it is large, an analogue to the median or to the sign test is indicated; if it is even larger, statistical variability will be dominated by biases beyond control.

    The asymptotic theory of shrinking neighborhoods is, in essence, a theory of infinitesimal robustness and suffers from the same conceptual drawback as approaches based on the influence function: infinitesimal robustness (bounded influence) does not automatically imply robustness. The crucial point is that in any practical application we have a fixed, finite sample size, and we need to know whether we are inside the range of n and ε for which asymptotic theory yields a decent approximation. This range may be difficult to determine, but the breakdown point often is computable and may be a useful indicator.

    25. Design. Robustness casts a shadow on the theory of optimal designs: they lose their theoretical optimality very quickly under minor violations of linearity (Huber, 1975) or independence assumptions (Bickel and Herzberg, 1979). I am not aware of much current activity in this area, but the lesson is clear: naive designs usually are more robust and better than "optimal" designs.

    26. Regression. Back in 1975, the discussants of Bickel (1976) raised interesting criticisms. In particular, there were complaints about the multiplicity of robust procedures and about their computational and conceptual complexity. Bickel fended them off skillfully and convincingly.

    There may have been reasons for concern then, but the situation has become worse. Most of the action in the 1980s has occurred on the regression front. Here is an incomplete list of robust regression estimators: L1 (going back at least to Laplace); M (Huber, 1973); GM (Mallows, 1975), with variants by Hampel, Krasker, and Welsch; RM (Siegel, 1982); LMS and LTS (Rousseeuw, 1984); S (Rousseeuw and Yohai, 1984); MM (Yohai, 1987); τ (Yohai and Zamar, 1988); SRC (Simpson, Ruppert, and Carroll, 1992); and no end is in sight. For an up-to-date review see Davies (1993).

    Bickel would not have an easy job today; much of the Nordic criticism, unsubstantiated in 1975, seems to be justified now. In any engineering product, an overly rapid sequence of updates is sometimes a sign of vigorous progress, but it can also be a sign of shoddy workmanship, and often it is both. In any case, it confuses the customers and hence is counterproductive.

    Robustness has been defined as insensitivity to small deviations from an idealized model. What is this model in the regression case? The classical model goes back to Gauss and assumes that the carrier X (the matrix of the independent variables) is error free. X may be systematic (as in designed experiments) or haphazard (as in most undesigned experiments), but its rows only rarely can be modeled as a random sample from a specified multivariate model distribution. Statisticians tend to forget that the elements of X often are not observed quantities, but are derived from some model (cf. the classical nonlinear problems of astronomy and geodesy giving rise to the method of least squares in the first place). In essence, each individual X corresponds to a somewhat different situation and might have to be dealt with differently. Thus, multiplicity of procedures may lie in the nature of robust regression. Curiously, most of the action seems to have been focused through tunnel vision on just one aspect: safeguard at any cost against problems caused by gross errors in a random carrier.

    Over the years, I too have had to defend the minimax approach to distributional robustness on many occasions. The salient points of my defense were that the least favorable situation one is safeguarding against, far from being unrealistically pessimistic, is more similar to actually observed error distributions than the normal model; that the performance loss at a true normal model is relatively small; that on the other hand the classically optimal procedures may experience heavy losses if the normal model is just slightly violated; and that the hardest problems are not with extreme outliers (which are easy to detect and eliminate), but with what happens on the shoulders of the distributions. Moreover, the computation of robust M-estimates is easy and fast (see the last paragraph of this section). Not a single one of these lines of defense can be used with the modern "high breakdown point" regression estimates.

    A typical cause for breakdown in regression is the presence of gross outliers in X. While individually such outliers are trivially easy to spot (with the help of the diagonal of the hat matrix), efficient identification of collaborative leverage groups is an open, perhaps unsolvable, diagnostic problem. I would advise against treating leverage groups blindly through robustness, however; they may hide serious design or modeling problems, and there are similar problems even with single leverage points.
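The diagnostic mentioned here is easy to compute. A self-contained Python sketch (no linear algebra library; the data are an illustrative straight-line carrier with one gross outlier in x) forms the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ; a row with h_ii near 1 is a leverage point.

```python
def hat_diagonal(X):
    """Diagonal of H = X (X^T X)^{-1} X^T, via Gauss-Jordan inversion."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    A = [row[:] for row in XtX]
    inv = [[float(i == j) for j in range(p)] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))  # partial pivot
        A[col], A[piv] = A[piv], A[col]
        inv[col], inv[piv] = inv[piv], inv[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        inv[col] = [v / d for v in inv[col]]
        for r in range(p):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
                inv[r] = [a - f * b for a, b in zip(inv[r], inv[col])]
    def h_ii(i):
        return sum(X[i][a] * inv[a][b] * X[i][b]
                   for a in range(p) for b in range(p))
    return [h_ii(i) for i in range(n)]

# straight-line carrier with one gross outlier in x
X = [[1.0, x] for x in [1, 2, 3, 4, 5, 100]]
print([round(v, 3) for v in hat_diagonal(X)])
```

The trace of H always equals the number of parameters p, so the h_ii sum to p; the outlying row hogs nearly a full degree of freedom by itself, which is exactly what makes single leverage points easy to flag and collaborative groups (which share the leverage) hard.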


    The reasons for an outlier among the X (leverage point) might include:
    a misplaced decimal point;
    a unique, priceless observation dating back to antiquity;
    an accurate but useless observation, outside of the range of validity of the model.
    If the value at this leverage point disagrees with the evidence extrapolated from the

    other observations, this may be because:
    the outlying observation is affected by a gross error (in X or in y);
    the other observations are affected by small systematic errors (this is more often the case than one might think);
    the model is inaccurate, so the extrapolation fails.

    The existence of several phenomenologically indistinguishable but conceptually different situations with different consequences calls for a diagnostic approach (identification of leverage points or groups), followed by alternative "what if" analyses. This contrasts sharply with simple location estimation, where the observations are exchangeable and a minimax approach is quite adequate (although one may want to follow it up with an investigation of the causes of grosser errors).

    At the root of the current confusion is that hardly anybody bothers to state all of the issues clearly. Not only must the estimator and a procedure for computing it be specified, but also the situations for which it is supposed to be appropriate or inappropriate, and criteria for judging estimators and procedures, must be given. There has been a tendency to rush into print with rash claims and procedures. In particular, what is meant by the word "breakdown"? For many of the newer estimates there are unqualified claims that their breakdown point approaches 0.5 in large samples. But such claims tacitly exclude designed situations: if the observations are equally partitioned among the corners of a simplex in d-space, no estimate whatsoe