ebook img

Outliers in Statistical Data PDF

376 Pages·1978·23.577 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Outliers in Statistical Data

Outliers in Statistica[ ata VIC BARNETT University of Sheffield and TOBY LEWIS University of Hull John Wiley & Sons Chichester · New York · Brisbane · Toronto Outliers in Statistica[ ata VIC BARNETT University of Sheffield and TOBY LEWIS University of Hull John Wiley & Sons Chichester · New York · Brisbane · Toronto Preface The concept of an outlier has fascinated experimentalists since the earliest attempts to interpret data. Even before the formai development of statistica! method, argument raged over whether, and on what basis, we should discard observations from a set of data on the grounds that they are 'unrepresenta tive', 'spurious', or 'mavericks' or 'rogues'. The early emphasis stressed the contamination of the data by unanticipated and unwelcome errors or mis takes affecting some of the observations. Attitudes varied from one extreme to another: from the view that we should never sully the sanctity of the data by daring to adjudge its propriety, to an ultimate pragmatism expressing 'if Copyright© 1978 by John Wiley & Sons Ltd. in doubt, throw it out'. Reprinted February 1979 The present views are more sophisticated. A wider variety of aims are Reprinted June 1980 recognized in the handling of outliers, outlier-generating models bave been Ali rights reserved. proposed, and there is now available a vast array of specific statistica! techniques for processing outliers. The work is scattered throughout the No part of this book may be reproduced by any means, nor literature of the prese n t century, shows no sign of any abatement, but has transmitted, nor translated into a machine language with out the written permission of the publisher. no t previously bee n drawn together in a comprehensive review. Our purpose in writing this book is to attempt to provide such a review, at two levels. On Library of Congress Cataloging in Publication Data: the one band we seek to survey the existing state of knowledge in the outlier field and to present the details of selected procedures for different situations. Barnett, Vie. On the other band we attempt to categorize differences in attitude, aim, and Outliers in statistica} data. Barnett!Lewis (Wiley series in probability and mathematical model in the study of outliers, and to follow the implications of such statistics) distinctions for the development of new research approaches. In offering Bibliography: p. such a comprehensive overview of the principles and methods associated lncludes index. with outliers we hope that we may help the practitioner in the analysis of 1. Outliers (Statistics) I. Lewis, Tobias, joint author. Il. Title. data and the researcher in opening up possible new avenues of enquiry. QA276.B2849 519.5 77-21024 Early work on outliers was (inevitably) characterized by lack of attention ISBN O 471 99599 1 to the modelling of the outlier-generating mechanism, by informality of technique with no hacking in terms of a study of the statistica! properties of Typeset by The Universities Press, Belfast, Northern Ireland proposed procedures, and by a leaning towards the hardline view that Printed and bound in Great Britain outliers should be either rejected or retained with full import. Even today at The Pitman Press, Bath v Preface The concept of an outlier has fascinated experimentalists since the earliest attempts to interpret data. Even before the formai development of statistica! method, argument raged over whether, and on what basis, we should discard observations from a set of data on the grounds that they are 'unrepresenta tive', 'spurious', or 'mavericks' or 'rogues'. The early emphasis stressed the contamination of the data by unanticipated and unwelcome errors or mis takes affecting some of the observations. Attitudes varied from one extreme to another: from the view that we should never sully the sanctity of the data by daring to adjudge its propriety, to an ultimate pragmatism expressing 'if Copyright© 1978 by John Wiley & Sons Ltd. in doubt, throw it out'. Reprinted February 1979 The present views are more sophisticated. A wider variety of aims are Reprinted June 1980 recognized in the handling of outliers, outlier-generating models bave been Ali rights reserved. proposed, and there is now available a vast array of specific statistica! techniques for processing outliers. The work is scattered throughout the No part of this book may be reproduced by any means, nor literature of the prese n t century, shows no sign of any abatement, but has transmitted, nor translated into a machine language with out the written permission of the publisher. no t previously bee n drawn together in a comprehensive review. Our purpose in writing this book is to attempt to provide such a review, at two levels. On Library of Congress Cataloging in Publication Data: the one band we seek to survey the existing state of knowledge in the outlier field and to present the details of selected procedures for different situations. Barnett, Vie. On the other band we attempt to categorize differences in attitude, aim, and Outliers in statistica} data. Barnett!Lewis (Wiley series in probability and mathematical model in the study of outliers, and to follow the implications of such statistics) distinctions for the development of new research approaches. In offering Bibliography: p. such a comprehensive overview of the principles and methods associated lncludes index. with outliers we hope that we may help the practitioner in the analysis of 1. Outliers (Statistics) I. Lewis, Tobias, joint author. Il. Title. data and the researcher in opening up possible new avenues of enquiry. QA276.B2849 519.5 77-21024 Early work on outliers was (inevitably) characterized by lack of attention ISBN O 471 99599 1 to the modelling of the outlier-generating mechanism, by informality of technique with no hacking in terms of a study of the statistica! properties of Typeset by The Universities Press, Belfast, Northern Ireland proposed procedures, and by a leaning towards the hardline view that Printed and bound in Great Britain outliers should be either rejected or retained with full import. Even today at The Pitman Press, Bath v vi Preface Preface vii sufficient attention is not always paid to the form of the outlier model, or to Iease of life over the las t decade or so. It is directed to more than one kind the practical purpose of investigating outliers, in the presentation of of reader: to the student (to inform him of the range of ideas and techni methods for processing outliers. Many procedures have an ad hoc, intui ques), to the experimentalist (to assist him in the judicious choice of tively justified, basis with little external reference in the sense of the relative methods for handling outliers), and to the professional statistician (as a statistica! merits of different possibilities. In reviewing such techniques we guide to the present state of knowledge and a springboard for further will attempt to set them, as far as possible, within a wider framework of research). model, statistica! principle, and practical aim, and we shall also consider the The level of treatment assumes a knowledge of elementary probability extent to which such basic considerations have begun to formally permeate theory and statistica! method such as would be acquired in an introductory outlier study over recent years. university-level course. The metbodological exposition leans on an under Such an emphasis is reflected in the structure of the book. The opening standing of tbe principles and practical implications of testing and estima two chapters are designed respectively to motivate examination of outliers tion. Where basic modelling and demonstration of statistica! propriety are and to pose basic questions about the nature of an outlier. Chapter l gives a discussed, a more mathematical appreciation of basic principles is assumed, generai survey of the field. In Chapter 2 we consider the various ways in including some familiarity with optimality properties of methods of con which we can model the presence of outliers in a set of data. We examine structing tests and estimators and some knowledge of the properties of order the different interests (from rejection of unacceptable contamination, statistics. Proofs of results are formally presented wbere appropriate, but at through the accommodation of outliers with reduced influence in robust a heuristic ratber tban bighly matbematical level. procedures applied to the whole set of data, to specific identification of Extensive tables of appropriate statistica! functions are presented in an outliers as the facets of principal interest in the data). We discuss the Appendix, to aid the practical worker in tbe use of tbe different procedures. statistica! respectability of distinct methods of study, and the special prob Many of these tables are extracted from existing publisbed tables; we are lems that arise from the dimensionality of the data set or from the purpose grateful to ali tbe authors and publisbers concerned, and bave made indi of its analysis (single-sample estimation or testing, regression, analysis of viduai acknowledgement at tbe appropriate places in our text. Otber tables data from designed experiments, examination of slippage in multisample bave been specially produced by us. The whole set of tables bas been data, and so on). presented in as compact and consistent a style as possible. Tbis has involved Chapter 3 examines at length the assessment of discordancy of outliers in a good deal of selection and re-ordering of tbe previously publisbed mater single univariate samples. It discusses basic considerations and also presents ia!; we bave aimed as far as possible to standardize the ranges of tabulated a battery of techniques for practical use with comment on the circumstances values of sample size, percentage point, etc. supporting one method rather than another. Copious references are given throughout the text to source materia! and Chapter 4, on the accommodation of outliers in single univariate samples, to further work on the various topics. Tbey are gathered togetber in the deals with inference procedures which are robust in the sense of providing section entitled 'References and Bibliography' with appropriate page refer protection against the effect of outliers. Chapter 5 is concerned with ences to places of principal relevance in the text. Additional references processing severa! univariate samples both with regard to the relative augment those wbicb have been discussed in tbe text. These will of course slippage of the distributions from which they arise and (to a lesser extent) in appear without any page reference, but will carry an indication of the main relation to the accommodation of outliers in robust analysis of the whole set area to wbicb tbey are relevant. of data. It is, of course, a privilege and pleasure to acknowledge tbe help of others. Chapters 6 and 7 extend the ideas and methods (in relation to the three We tbank Dave Collett, Nick Fieller, Agnes Herzberg, and David Kendall interests: rejection, accommodation, identification) to single multivariate for helpful comments on early drafts of some of tbe materia!. We are samples and to the analysis of data in regression, designed experiments, or particularly grateful to Kim Malafant who carried out the extensive calcula time-series situations. Chapter 8 gives fuller and more specific attention to tions of tbe new statistica! tables in Cbapter 3. Our grateful tbanks co also to the implications of adopting a Bayesian, or a non-parametric, approach to Hazel Howard wbo coped nobly with tbe typing of a difficult manuscript. the study of outliers. The concluding Chapter 9 poses a few issues for W e are solely responsible for any imperfections in the book and should be further consideration or investigation. glad to be informed of them. The book aims to bring together in a logica! framework the vast amount of work on outliers which has been scattered over the years in the various July, 1977 VIe BARNETT professional journals and texts, and which appears to have acquired a new ToBY LEwis vi Preface Preface vii sufficient attention is not always paid to the form of the outlier model, or to Iease of life over the las t decade or so. It is directed to more than one kind the practical purpose of investigating outliers, in the presentation of of reader: to the student (to inform him of the range of ideas and techni methods for processing outliers. Many procedures have an ad hoc, intui ques), to the experimentalist (to assist him in the judicious choice of tively justified, basis with little external reference in the sense of the relative methods for handling outliers), and to the professional statistician (as a statistica! merits of different possibilities. In reviewing such techniques we guide to the present state of knowledge and a springboard for further will attempt to set them, as far as possible, within a wider framework of research). model, statistica! principle, and practical aim, and we shall also consider the The level of treatment assumes a knowledge of elementary probability extent to which such basic considerations have begun to formally permeate theory and statistica! method such as would be acquired in an introductory outlier study over recent years. university-level course. The metbodological exposition leans on an under Such an emphasis is reflected in the structure of the book. The opening standing of tbe principles and practical implications of testing and estima two chapters are designed respectively to motivate examination of outliers tion. Where basic modelling and demonstration of statistica! propriety are and to pose basic questions about the nature of an outlier. Chapter l gives a discussed, a more mathematical appreciation of basic principles is assumed, generai survey of the field. In Chapter 2 we consider the various ways in including some familiarity with optimality properties of methods of con which we can model the presence of outliers in a set of data. We examine structing tests and estimators and some knowledge of the properties of order the different interests (from rejection of unacceptable contamination, statistics. Proofs of results are formally presented wbere appropriate, but at through the accommodation of outliers with reduced influence in robust a heuristic ratber tban bighly matbematical level. procedures applied to the whole set of data, to specific identification of Extensive tables of appropriate statistica! functions are presented in an outliers as the facets of principal interest in the data). We discuss the Appendix, to aid the practical worker in tbe use of tbe different procedures. statistica! respectability of distinct methods of study, and the special prob Many of these tables are extracted from existing publisbed tables; we are lems that arise from the dimensionality of the data set or from the purpose grateful to ali tbe authors and publisbers concerned, and bave made indi of its analysis (single-sample estimation or testing, regression, analysis of viduai acknowledgement at tbe appropriate places in our text. Otber tables data from designed experiments, examination of slippage in multisample bave been specially produced by us. The whole set of tables bas been data, and so on). presented in as compact and consistent a style as possible. Tbis has involved Chapter 3 examines at length the assessment of discordancy of outliers in a good deal of selection and re-ordering of tbe previously publisbed mater single univariate samples. It discusses basic considerations and also presents ia!; we bave aimed as far as possible to standardize the ranges of tabulated a battery of techniques for practical use with comment on the circumstances values of sample size, percentage point, etc. supporting one method rather than another. Copious references are given throughout the text to source materia! and Chapter 4, on the accommodation of outliers in single univariate samples, to further work on the various topics. Tbey are gathered togetber in the deals with inference procedures which are robust in the sense of providing section entitled 'References and Bibliography' with appropriate page refer protection against the effect of outliers. Chapter 5 is concerned with ences to places of principal relevance in the text. Additional references processing severa! univariate samples both with regard to the relative augment those wbicb have been discussed in tbe text. These will of course slippage of the distributions from which they arise and (to a lesser extent) in appear without any page reference, but will carry an indication of the main relation to the accommodation of outliers in robust analysis of the whole set area to wbicb tbey are relevant. of data. It is, of course, a privilege and pleasure to acknowledge tbe help of others. Chapters 6 and 7 extend the ideas and methods (in relation to the three We tbank Dave Collett, Nick Fieller, Agnes Herzberg, and David Kendall interests: rejection, accommodation, identification) to single multivariate for helpful comments on early drafts of some of tbe materia!. We are samples and to the analysis of data in regression, designed experiments, or particularly grateful to Kim Malafant who carried out the extensive calcula time-series situations. Chapter 8 gives fuller and more specific attention to tions of tbe new statistica! tables in Cbapter 3. Our grateful tbanks co also to the implications of adopting a Bayesian, or a non-parametric, approach to Hazel Howard wbo coped nobly with tbe typing of a difficult manuscript. the study of outliers. The concluding Chapter 9 poses a few issues for W e are solely responsible for any imperfections in the book and should be further consideration or investigation. glad to be informed of them. The book aims to bring together in a logica! framework the vast amount of work on outliers which has been scattered over the years in the various July, 1977 VIe BARNETT professional journals and texts, and which appears to have acquired a new ToBY LEwis Contents CHAPTER l INTRODUCTION l 1.1 Human error and ignorance 6 1.2 Outliers in relation to probability models 7 1.3 Outliers in more structured situations 10 1.4 Bayesian and non-parametric methods 15 1.5 Survey of outlier problems 16 CHAPTER 2 WHAT SHOULD ONE DO ABOUT OUTLYING OBSER VATIONS? 18 2.1 Early informai approaches 18 2.2 Various aims 22 2.3 Models for discordancy 28 2.4 Test statistics 38 2.5 Statistica/ principles underlying tests of discordancy 41 2.6 Accommodation of outliers: robust estimation and testing 46 CHAPTER 3 DISCORDANCY TESTS FOR OUTLIERS IN UNI VARIATE SAMPLES 52 3.1 Statistica/ bases for construction of tests 56 3 .1.1 Inclusive and exclusive measures, and a recursive algorithm for the null distribution of a test statistic 61 3.2 Performance criteria of tests 64 3.3 The multiple outlier problem 68 3.3.1 Block procedures for multiple outliers in univariate samples 71 3.3.2. Consecutive procedures for multiple outliers in univariate sam ples 73 3.4 Discordancy tests for practical use 7 5 3.4.1 Guide to use of the tests 75 ix Contents CHAPTER l INTRODUCTION l 1.1 Human error and ignorance 6 1.2 Outliers in relation to probability models 7 1.3 Outliers in more structured situations 10 1.4 Bayesian and non-parametric methods 15 1.5 Survey of outlier problems 16 CHAPTER 2 WHAT SHOULD ONE DO ABOUT OUTLYING OBSER VATIONS? 18 2.1 Early informai approaches 18 2.2 Various aims 22 2.3 Models for discordancy 28 2.4 Test statistics 38 2.5 Statistica/ principles underlying tests of discordancy 41 2.6 Accommodation of outliers: robust estimation and testing 46 CHAPTER 3 DISCORDANCY TESTS FOR OUTLIERS IN UNI VARIATE SAMPLES 52 3.1 Statistica/ bases for construction of tests 56 3 .1.1 Inclusive and exclusive measures, and a recursive algorithm for the null distribution of a test statistic 61 3.2 Performance criteria of tests 64 3.3 The multiple outlier problem 68 3.3.1 Block procedures for multiple outliers in univariate samples 71 3.3.2. Consecutive procedures for multiple outliers in univariate sam ples 73 3.4 Discordancy tests for practical use 7 5 3.4.1 Guide to use of the tests 75 ix x Contents Contents xi 3.4.2 Discordancy tests for gamma (including exponential) 6.2.7 Correlation methods 227 samples 76 6.2.8 A 'gap test' for multivariate outliers 229 3.4.3 Discordancy tests for normal samples 89 6.3 Accommodation of multivariate outliers 231 3.4.4 Discordancy tests for samples from other distributions 115 CHAPTER 7 OUTLIERS IN DESIGNED EXPERIMENTS, REGRES CHAPTER 4 ACCOMMODATION OF OUTLIERS IN UNIVARIATE SION ANO IN TIME-SERIES 234 SAMPLES: ROBUST ESTIMATION ANO TESTING 126 7 .l Outliers in designed experiments 238 7 .1.1 Discordancy tests based on residuals 238 4.1 Performance criteria 130 7 .1.2 Residual-based accommodation procedures 246 4.1.1 Efficiency measures for estimatoFs 130 7 .1.3 Graphical methods 247 4.1.2 The qualitative approach: influence curves 136 7 .1.4 Non-residual-based methods 249 4.1.3 Robustness of confidence intervals 141 7 .1.5 Non-parametric, and Bayesian, methods 251 4.1.4 Robustness of significance tests 142 4.2 Generai methods of accommodation 144 7.2 Outliers in regression 252 7 .2.1 Outliers in linear regression 252 4.2.1 Estimation of location 144 7 .2.2 Multiple regression 256 4.2.2 Performance characteristics of locatio11 estimators 155 7.3 Outliers with generai linear models 257 4.2.3 Estimation of scale or dispersion 158 7.3.1 Residual-based methods 257 4.2.4 Studentized location estimates, tests, and confidence inter 7 .3.2 Non-residual-based methods 264 vals 160 7.4 Outliers in time-series 266 4.3 Accommodation of outliers in univariate normal samples 163 4.4 Accommodation of outliers in exponential samples 171 CHAPTER 8 BAYESIAN ANO NON-PARAMETRIC APPRO- ACHES 269 CHAPTER 5 OUTLYING SUB-SAMPLES: SLIPPAGE TESTS 174 8.1 Bayesian methods 269 5.1. Non-parametric slippage tests 176 8.1.1 Bayesian 'tests of discordancy' 269 5.1.1 Non-parametric tests for slippage of a single population 176 8.1.2 Bayesian accommodation of outliers 277 5.1.2 Non-parametric tests for slippage of several populations: multi 8.2 Non-parametric methods 282 ple comparisons 183 5.2 The slippage model 186 CHAPTER 9 PERSPECTIVE 286 5.3 Parametric slippage tests 187 5.3.1 Norma/ samples 188 APPENDIX: STATISTICAL TABLES 289 5.3.2 Generai slippage tests 197 5.3.3 Non-norma/ samples 201 REFERENCES ANO BIBLIOGRAPHY 337 5.3.4 Group parametric slippage tests 204 5.4 Other slippage work 205 INDEX 357 CHAPTER 6 OUTLIERS IN MULTIVARIATE DATA 208 6 .l Outliers in multivariate norma l samples 209 6.2 Informai detection of multivariate outliers 219 6.2.1 Marginai outliers 220 6.2.2 Linear constraints 221 6.2.3 Graphical methods 221 6.2.4 Principal component analysis method 223 6.2.5 Use of generalized distances 224 6.2.6 Fourier-type representation 227 x Contents Contents xi 3.4.2 Discordancy tests for gamma (including exponential) 6.2.7 Correlation methods 227 samples 76 6.2.8 A 'gap test' for multivariate outliers 229 3.4.3 Discordancy tests for normal samples 89 6.3 Accommodation of multivariate outliers 231 3.4.4 Discordancy tests for samples from other distributions 115 CHAPTER 7 OUTLIERS IN DESIGNED EXPERIMENTS, REGRES CHAPTER 4 ACCOMMODATION OF OUTLIERS IN UNIVARIATE SION ANO IN TIME-SERIES 234 SAMPLES: ROBUST ESTIMATION ANO TESTING 126 7 .l Outliers in designed experiments 238 7 .1.1 Discordancy tests based on residuals 238 4.1 Performance criteria 130 7 .1.2 Residual-based accommodation procedures 246 4.1.1 Efficiency measures for estimatoFs 130 7 .1.3 Graphical methods 247 4.1.2 The qualitative approach: influence curves 136 7 .1.4 Non-residual-based methods 249 4.1.3 Robustness of confidence intervals 141 7 .1.5 Non-parametric, and Bayesian, methods 251 4.1.4 Robustness of significance tests 142 4.2 Generai methods of accommodation 144 7.2 Outliers in regression 252 7 .2.1 Outliers in linear regression 252 4.2.1 Estimation of location 144 7 .2.2 Multiple regression 256 4.2.2 Performance characteristics of locatio11 estimators 155 7.3 Outliers with generai linear models 257 4.2.3 Estimation of scale or dispersion 158 7.3.1 Residual-based methods 257 4.2.4 Studentized location estimates, tests, and confidence inter 7 .3.2 Non-residual-based methods 264 vals 160 7.4 Outliers in time-series 266 4.3 Accommodation of outliers in univariate normal samples 163 4.4 Accommodation of outliers in exponential samples 171 CHAPTER 8 BAYESIAN ANO NON-PARAMETRIC APPRO- ACHES 269 CHAPTER 5 OUTLYING SUB-SAMPLES: SLIPPAGE TESTS 174 8.1 Bayesian methods 269 5.1. Non-parametric slippage tests 176 8.1.1 Bayesian 'tests of discordancy' 269 5.1.1 Non-parametric tests for slippage of a single population 176 8.1.2 Bayesian accommodation of outliers 277 5.1.2 Non-parametric tests for slippage of several populations: multi 8.2 Non-parametric methods 282 ple comparisons 183 5.2 The slippage model 186 CHAPTER 9 PERSPECTIVE 286 5.3 Parametric slippage tests 187 5.3.1 Norma/ samples 188 APPENDIX: STATISTICAL TABLES 289 5.3.2 Generai slippage tests 197 5.3.3 Non-norma/ samples 201 REFERENCES ANO BIBLIOGRAPHY 337 5.3.4 Group parametric slippage tests 204 5.4 Other slippage work 205 INDEX 357 CHAPTER 6 OUTLIERS IN MULTIVARIATE DATA 208 6 .l Outliers in multivariate norma l samples 209 6.2 Informai detection of multivariate outliers 219 6.2.1 Marginai outliers 220 6.2.2 Linear constraints 221 6.2.3 Graphical methods 221 6.2.4 Principal component analysis method 223 6.2.5 Use of generalized distances 224 6.2.6 Fourier-type representation 227

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.