[wiley series in probability and statistics] robust regression and outlier detection...

12
Robust Regression and Outlier Detection

Upload: annick-m

Post on 28-Mar-2017

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Robust Regression and Outlier Detection

Page 2: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Robust Regression and Outlier Detection

PETER J. ROUSSEEUW

Dept. of Mathematics and Computing Universitaire Instelling Antwerpen Universiteitsplein 1 B-2610 Antwerp, Belgium [email protected]

ANNICK M. LEROY

Bristol-Myers-Squibb B-1170 Brussels, Belgium

JOHN WILEY & SONS

New York 0 Chichester 0 Brisbane 0 Toronto 0 Singapore

Page 3: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

A NOTE TO THE READER: Tlus book has been electronically reproduced from digital information stored at John Wdey & Sons, Inc. We are pleased that the use of this new technology will enable us to keep works of enduring scholarly value in print as long as there is a reasonable demand for them. The content of t h i s book is identical to previous printings.

Copyright @ 1987 by John Wiley & Sons, Inc.

All rights reserved. Published simultaneously in Canada.

Reproduction or translation of any part of this work beyond that permitted by Section 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permissions Department, John Wiley & Sons, Inc.

Librmp of Congress Ccr*Joging in Pub&&n Du&:

Rousseeuw, Peter J. Robust regression and outlier detection.

(Wiley series in probability and mathematical statistics. Applied probability and statistics, ISSN 0271-6356)

Bibliography: p. Includes index. 1. Regression analysis. 2. Outliers (Statistics)

3. Least squares. I. Leroy, Annick M. 11. Title. 111. Series. QA278.2.R68 1987 519.5'36 87-8234 ISBN 0471-85233-3

Printed in the United States of America

10 9

Page 4: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

If among these errors are some which appear too large to be admissible, t h n those observations which produced these errors will be rejected, as coming from too faulty experiments, and the unknowns will be determined by means of the other observations, which will then give much smaller errors.

Legendre in 1805, in the first publication on least squares

This idea, however, from its nature, involves something vague. . . and clearly innumerable different principles can be proposed. . . . But of all these principles ours is the most simple; by the others we shall be led into the most complicated calculations.

Gauss in 1809, on the least squares criterion

The method of Least Squares is seen to be our best course when we have thrown overboard a certain portion of our data- sort of sacrifice which has often to be made by those who sail upon the stormy sea of Prob- ability.

Edgeworth in 1887

(Citations collected by Plackett 1972 and Stigler 1973, 1977.)

Page 5: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Preface

Regression analysis is an important statistical tool that is routinely applied in most sciences. Out of many possible regression techniques, the least squares (LS) method has been generally adopted because of tradition and ease of computation. However, there is presently a widespread awareness of the dangers posed by the occurrence of outliers, which may be a result of keypunch errors, misplaced decimal points, recording or transmission errors, exceptional phenomena such as earthquakes or strikes, or mem- bers of a different population slipping into the sample. Outliers occur very frequently in real data, and they often go unnoticed because now- days much data is processed by computers, without careful inspection or screening. Not only the response variable can be outlying, but also the explanatory part, leading to so-called leverage points. Both types of outliers may totally spoil an ordinary LS analysis. Often, such influential points remain hidden to the user, because they do not always show up in the usual LS residual plots. To remedy this problem, new statistical techniques have been de-

veloped that are not so easily affected by outliers. These are the robust (or resistant) methods, the results of which remain trustworthy even if a certain amount of data is contaminated. Some people think that robust regression techniques hide the outliers, but the opposite is true because the outliers are far away from the robust fit and hence can be detected by their large residuals from it, whereas the standardized residuals from ordinary LS may not expose outliers at all. The main message of this book is that robust regression is extremely useful in identifying outliers, and many examples are given where all the outliers are detected in a single blow by simply running a robust estimator.

An alternative approach to dealing with outliers in regression analysis is to construct outlier diagnostics. These are quantities computed from

vii

Page 6: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

viii PREFACE

the data with the purpose of pinpointing influential observations, which can then be studied and corrected or deleted, followed by an LS analysis on the remaining cases. Diagnostics and robust regression have the same goal, but they proceed in the opposite order: In a diagnostic setting, one first wants to identify the outliers and then fit the good data in the classical way, whereas the robust approach first fits a regression that does justice to the majority of the data and then discovers the outliers as those points having large residuals from the robust equation. In some applica- tions, both approaches yield exactly the same result, and then the dif- ference is mostly subjective. Indeed, some people feel happy when switching to a more robust criterion, but they cannot accept the deletion of “true” observations (although many robust methods will, in effect, give the outliers zero influence), whereas others feel that it is all right to delete outliers, but they maintain that robust regression is “arbitrary” (although the combination of deleting outliers and then applying LS is itself a robust method). We are not sure whether this philosophical debate serves a useful purpose. Fortunately, some positive interaction between followers of both schools is emerging, and we hope that the gap will close. Personally we do not take an “ideological” stand, but we propose to judge each particular technique on the basis of its reliability by counting how many outliers it can deal with. For instance, we note that certain robust methods can withstand leverage points, whereas others cannot, and that some diagnostics allow us to detect multiple outliers, whereas others are easily masked.

In this book we consider methods with high breakdown point, which are able to cope with a large fraction of outliers. The “high breakdown” objective could be considered a kind of third generation in robustness theory, coming after minimax variance (Huber 1964) and the influence function (Hampel 1974). Naturally, the emphasis is on the methods we have worked on ourselves, although many other estimators are also discussed. We advocate the least median of squares method (Rousseeuw 1984) because it appeals to the intuition and is easy to use. No back- ground knowledge or choice of tuning constants are needed: You just enter the data and interpret the results. It is hoped that robust methods of this type will be incorporated into major statistical packages, which would make them easily accessible. As long as this is not yet the case, you may contact the first author (PJR) to obtain an updated version of the program PROGRESS (Program for Robust reGRESSion) used in this book. PROGRESS runs on IBM-PC and compatible machines, but the structured source code is also available to enable you to include it in your own software. (Recently, it was integrated in the ROBETH library in Lausanne and in the workstation package S-PLUS of Statistical Sciences,

Page 7: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

PREFACE ix

Inc., P.O. Box 85625, Seattle WA 98145-1625.) The computation time is substantially higher than that of ordinary LS, but this is compensated by a much more important gain of the statistician’s time, because he or she receives the outliers on a “silver platter.” And anyway, the computation time is no greater than that of other multivariate techniques that are commonly used, such as cluster analysis or multidimensional scaling.

The primary aim of our work is to make robust regression available for everyday statistical practice. The book has been written from an applied perspective, and the technical material is concentrated in a few sections (marked with *), which may be skipped without loss of understanding. No specific prerequisites are assumed. The material has been organized for use as a textbook and has been tried out as such. Chapter 1 introduces outliers and robustness in regression. Chapter 2 is confined to simple regression for didactic reasons and to make it possible to include robust- ness considerations in an introductory statistics course not going beyond the simple regression model. Chapter 3 deals with robust multiple regression, Chapter 4 covers the special case of one-dimensional location, and Chapter 5 discusses the algorithms used. Outlier diagnostics are described in Chapter 6, and Chapter 7 is about robustness in related fields such as time series analysis and the estimation of multivariate location and covariance matrices. Chapters 1-3 and 6 could be easily incorporated in a modem course on applied regression, together with any other sections one would like to cover. It is also quite feasible to use parts of the book in courses on multivariate data analysis or time series. Every chapter contains exercises, ranging from simple questions to small data sets with clues to their analysis,

PETER J. ROUSSEEUW ANNICK M. LEROY

October 1986

Page 8: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Acknowledgments

We are grateful to David Donoho, Frank Hampel, Werner Stahel, and many other colleagues for helpful suggestions and stimulating discussions on topics covered in this book. Parts of the manuscript were read by Leon Kaufman, Doug Martin, Elvezio Ronchetti, Jos van Soest, and Bert van Zomeren. We would also like to thank Beatrice Shube and the editorial staff at John Wiley & Sons for their assistance.

We thank the authors and publishers of figures, tables, and data for permission to use the following material:

Table 1, Chapter 1; Tables 6 and 13, Chapter 2: Rousseeuw et al. (1984a), @ by North-Holland Publishing Company, Amsterdam.

Tables 1, 7, 9, and 11, Chapter 2; Tables 1, 10, 13, 16, 23, Chapter 3; Table 8, Chapter 6: @ Reprinted by permission of John Wiley & Sons, New York.

Tables 2 and 3, Chapter 2; Tables 19 and 20, Chapter 3: Rousseeuw and Yohai (1984), by permission of Springer-Verlag, New York.

Tables 4 and 10, Chapter 2: @ 1967, 1979, by permission of Academic Press, Orlando, Florida.

Table 6, Chapter 2; data of (3.1), Chapter 4: 0 Reprinted by permission of John Wiley & Sons, Chichester.

Figure 11, Chapter 2; Tables 5, 9, 22, 24, Chapter 3; data of (2.2) and Exercise 11, Chapter 4; Table 8, Chapter 6; first citation in Epigraph: By permission of the American Statistical Association, Washington, DC.

Figure 12 and Tables 5 and 12, Chapter 2: Rousseeuw et al. (1984b), 0 by D. Reidel Publishing Company, Dordrecht, Holland.

Table 2, Chapter 3: 0 1977, Addison-Wesley Publishing Company, Inc., Reading, MA.

xi

Page 9: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

xii ACKNOWLEDGMENTS

Table 6, Chapter 3: 0 1983, Wadsworth Publishing Co., Belmont, CA Table 22, Chapter 3: Office of Naval Research, Arlington, VA. Data of Exercise 10, Chapter 4; second citation in Epigraph: 0 1908,

Third citation in Epigraph: Stigler (1977), by permission of the Institute

Figure 3, Chapter 5: By permission of B. van Zomeren (1986). Table 1, Chapter 7: By permission of Prof. Arnold Zellner, Bureau of the

Data in Exercise 15, Chapter 7: Collett (1980), by permission of the

1972, by permission of the Biometrika Trustees.

of Mathematical Statistics.

Census.

Royal Statistical Society.

Page 10: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Contents

1. Introduction

1. Outliers in Regression Analysis 2. The Breakdown Point and Robust Estimators Exercises and Problems

2. Simple Regression

1. Motivation 2. Computation of the Least Median of Squares Line 3. Interpretation of the Results 4. Examples 5. An Illustration of the Exact Fit Property 6. Simple Regression Through the Origin *7. Other Robust Techniques for Simple Regression Exercises and Problems

3. Multiple Regression

1. Introduction 2. Computation of Least Median of Squares Multiple

Regression 3. Examples

*4. Properties of the LMS, the LTS, and S-Estimators 5 . Relation with Projection Pursuit

*6. Other Approaches to Robust Multiple Regression Exercises and Problems

1

1 9

18

21

21 29 39 46 60 62 65 71

75

75

84 92

112 143 145 154

xiii

Page 11: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

XiV CONTENTS

4. The Special Case of One-Dimensional Location

1. Location as a Special Case of Regression 2. The LMS and the LTS in One Dimension 3. Use of the Program PROGRESS

*4. Asymptotic Properties *5. Breakdown Points and Averaged Sensitivity Curves Exercises and Problems

5. Algorithms

1. Structure of the Algorithm Used in PROGRESS *2. Special Algorithms for Simple Regression *3. Other High-Breakdown Estimators *4. Some Simulation Results Exercises and Problems

6. Outlier Diagnostics

1. Introduction 2. The Hat Matrix and LS Residuals 3. Single-Case Diagnostics 4. Multiple-Case Diagnostics 5. Recent Developments 6. High-Breakdown Diagnostics Exercises and Problems

7. Related Statistical Techniques

1. Robust Estimation of Multivariate Location and Covariance Matrices, Including the Detection of Leverage Points

2. Robust Time Series Analysis 3. Other Techniques Exercises and Problems

References

Table of Data Sets

Index

158

158 164 174 178 183 194

197

197 204 206 208 214

216

216 217 227 234 235 237 245

248

248 273 284 288

292

311

313

Page 12: [Wiley Series in Probability and Statistics] Robust Regression and Outlier Detection (Rousseeuw/Robust Regression & Outlier Detection) || Frontmatter

Robust Regression and Outlier Detection