
A Handbook of Small Data Sets PDF

470 Pages·1994·9.198 MB·English

Preview A Handbook of Small Data Sets

A Handbook of Small Data Sets

Edited by D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway and E. Ostrowski

Springer-Science+Business Media, B.V.

First edition 1994

© 1994 D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway and E. Ostrowski

Originally published by Chapman & Hall in 1994

Softcover reprint of the hardcover 1st edition 1994

The editors and publisher would like to acknowledge the kind permission to reproduce each data set given by the individual copyright holders. We have made every effort to contact the copyright holder for each data set and would be grateful if any errors were brought to the attention of the publisher for correction at a later printing.

Additional material to this book can be downloaded from http://extras.springer.com

ISBN 978-0-412-39920-6
ISBN 978-1-4899-7266-8 (eBook)
DOI 10.1007/978-1-4899-7266-8

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the UK Copyright Designs and Patents Act, 1988, this publication may not be reproduced, stored, or transmitted, in any form or by any means, without the prior permission in writing of the publishers, or in the case of reprographic reproduction only in accordance with the terms of the licences issued by the Copyright Licensing Agency in the UK, or in accordance with the terms of licences issued by the appropriate Reproduction Rights Organization outside the UK. Enquiries concerning reproduction outside the terms stated here should be sent to the publishers at the London address printed on this page.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
A catalogue record for this book is available from the British Library

Printed on permanent acid-free text paper, manufactured in accordance with ANSI/NISO Z39.48-1992 and ANSI/NISO Z39.48-1984 (Permanence of Paper).

CONTENTS

Introduction
How to use the disk
The data sets
Data structure index
Subject index

INTRODUCTION

During our work as teachers of statistical methodology we have often been in the position of trying to find suitable data sets to illustrate techniques or phenomena or to use in examination questions. In common with many other teachers, we have often fabricated numbers to fill the role. But this is far from ideal for several reasons:

• Obviously unreal data sets ('In a country called Randomania, the Grand Vizier wanted to know the average number of sheep per household') do not convey to the students the importance and relevance of the discipline of statistics. If the technique being taught is as important as the teacher claims, how is it that he/she has been unable to find a real example?
• If data purporting to come from some real domain are invented ('Suppose a researcher wanted to find out if women scored higher than men on the WAIS-R test') there is the risk of misleading - it is in fact quite difficult to create realistic artificial data sets unless one is very familiar with the application area. One needs to be sure that the means are in the right range, that the dispersion is realistic, and so on.
• Inventing data serves to reinforce the misconception that statistics is a science of calculation, instead of a science of problem solving. To avoid this risk it is necessary to present real problems along with the statistical solutions.

Since artificial data sets have a number of drawbacks, real ones must be found. And this is often not easy. Many subject matter journals do not give the raw data, but only the results of statistical analyses, typically insufficient to allow reconstruction of the data.
One can spend hours browsing through books and journals to locate a suitable set. We have spent many hours so doing, and we are certain that many other teachers of statistics share our experience. For this reason we decided that a source book, a volume containing a large number of small data sets suitable for teaching, would be valuable. This book is the result. In what follows, about 500 real small data sets, with brief descriptions and details of their sources, are given.

Of course, a book such as this will only realize its potential if users can locate a data set to illustrate the sort of technique that they wish to use. This obviously must be achieved through an index, but it is perhaps not as easy as it might appear. Data sets can be analysed in many different ways. A contingency table can be used to illustrate chi-squared tests, for log-linear modelling, and for correspondence analysis, but it might also be used for less obvious purposes. It might be used to illustrate methods of outlier detection, the dangers of collapsing tables, Simpson's paradox, methods for estimating small probabilities, problems with structural zeros, sampling inadequacies, or doubtless a whole host of other things we have not thought of. So indexing the data sets by possible statistical technique, while in a sense ideal, seemed impracticable. An alternative was to index the data sets by their properties, so that users could find a data set which had the sort of structure they needed to demonstrate whatever it was that they were interested in. This was the strategy we finally adopted.

The book has two indexes: (i) a data structure index, (ii) a subject index. The second of these is straightforward. It simply contains keywords describing the application domain - the technical area and the problem from which the data arose. The first is more difficult.
There are various theories of data which we could have used to produce a taxonomy through which to classify the data sets in this volume (for example, Coombs, C.H. (1964) A theory of data. New York: John Wiley & Sons). However, none of them seemed to provide the right mix of simplicity and power for our purposes. We needed an approach which could handle most of the data sets, but which was not excessively complicated and difficult to grasp. It was not critical if particularly unorthodox data sets had to be handled by supplementary comments, provided this did not occur too frequently. The approach we adopted was to describe the data sets in terms of:

(a) two numbers, the first representing the number of independent units described in the set and the second the number of measurements taken on each unit;
(b) a categorization of the variables measured;
(c) an optional supplementary word or phrase describing the structure in familiar terms.

Of course, such descriptions are not always unambiguous. For example, they tell us nothing about any grouping structure beyond that contained in terms such as nominal, categorical, and binary in (b) above. However, to have included such descriptions would have led to substantially greater complexity of description. Also, there is often more than one way of describing any given data set. In particular, the description of a data set will depend on the objective of the analysis. As a simple example, consider responses given as percentages of items correct in each of six tests taken by two people. If the aim is to compare people, then one could describe this as two units, each with six scores. Alternatively, if the aim is to compare tests, then one could describe it as six units, each with two scores. Such disadvantages have to be weighed against the merits of keeping the descriptive scheme short and this is the spirit in which our data structure index was treated.
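The two-people, six-tests example can be made concrete with a short sketch. The scores below are made up purely for illustration and do not come from the book; the helper names `shape` and `transpose` are likewise hypothetical. The point is only that the same numbers can be read as 2 units with 6 measurements or as 6 units with 2 measurements, depending on what is being compared.

```python
# Made-up percentages of items correct on six tests for two people.
# These values are illustrative only; they are not data from the book.
scores_by_person = {
    "person_A": [72, 65, 80, 58, 91, 77],
    "person_B": [68, 70, 75, 62, 88, 81],
}


def shape(table):
    """Return (number of units, measurements per unit) for a dict of rows."""
    lengths = {len(row) for row in table.values()}
    if len(lengths) != 1:
        raise ValueError("all units must have the same number of measurements")
    return len(table), lengths.pop()


def transpose(table, labels):
    """Swap the roles of units and measurements: each column becomes a unit."""
    columns = zip(*table.values())
    return {label: list(column) for label, column in zip(labels, columns)}


test_labels = [f"test_{i}" for i in range(1, 7)]
scores_by_test = transpose(scores_by_person, test_labels)

print(shape(scores_by_person))  # comparing people: (2, 6)
print(shape(scores_by_test))    # comparing tests:  (6, 2)
```

Neither layout is more correct than the other; which one is appropriate depends on the objective of the analysis, as the text says.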
The index is not intended to lead the user to the single data set which will do the job but to several which might be suitable and from which a choice can be made. And, a point to which we return below, it is not intended to take the place of casual browsing through the data sets.

The terms used in (i) above are as follows:

• A grouping of subjects has been represented by a nominal, binary (if two groups), or categorical variable, though the table might show the groups separately rather than give an explicit grouping variable.
• Nominal represents a variable with unordered response categories, and categorical represents a variable with ordered response categories.
• The term numeric has been used to indicate measurement on an interval or ratio scale.
• Values expressed as 'parts per million' could sensibly be regarded as proportions, ratios, counts, rates, or simply numeric. We adopted whatever seemed most sensible to us in the context of the example (which, of course, need not seem sensible to you, though we hope it does).
• Other terms have been used occasionally, where we thought them desirable, such as maxima, if the values represent the maximum values observed in some process, count, if the values are counts, and so on.
• The description of the numbers and types of the variables is sometimes followed by an overall description of the data set. These occur in square brackets. Examples are:

[survival] to indicate that the data show survival times (and will often be censored). Censoring is indicated by a binary variable in the description of the data set, though again it may not appear explicitly in the data (it may appear as asterisks against appropriate values - the descriptions of each data set will make things clear).
[spatial] to indicate that there is a spatial component to the data.
(Such data would normally require a complex descriptive scheme to describe them adequately, which would be contrary to our aim of simplicity and could not be justified for a mere handful of data sets.)
[time series]: the fact that the data are a time series will be apparent from the descriptions of the variables - having the form n m rep(r), with rep(r) signifying repeated r times. Nevertheless, we thought it helpful to flag such data sets explicitly.
[latin square].
[transition matrix].
[dissimilarity matrix].
[correlation matrix].

Some examples of data descriptions are:

40 3 numeric(2), binary [survival]
which means that there are 40 cases, each measured on three variables, two of which are numeric and the third binary (if there is censoring in the survival data, this is indicated by this binary variable). The final term indicates that it is survival data.

1 26 rep(26)count [time series]
which indicates that a single object is measured 26 times, producing a single count on each occasion.

9 x 9 [correlation matrix]
is a correlation matrix of size 9.

13 22 rep(11)(numeric, rate)
shows that 13 objects each produce a numeric score and a rate on each of 11 occasions.

We stress again that this does not remove all ambiguity. For example, if two counts are given, with one necessarily being part of the other (e.g. number of children in a family and number of female children in a family) then it may be described as count(2) or count, proportion. Similarly, very large counts might arguably be treated as numeric. It is thus possible that our way of describing a data set may not be the way you would have chosen. While it may be possible to define a formal language, free from ambiguity, to describe all conceivable data structures, the complexity of such a language would be out of place here. We hope and expect that users of this volume will browse through it.
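As a rough illustration of how the four example descriptions above could be read mechanically, here is a minimal Python sketch. The grammar is inferred only from those four examples, so it is an assumption rather than the editors' own specification, and the names `Description`, `parse` and `is_consistent` are hypothetical. The consistency rule (variables per occasion times number of occasions equals the second number) is likewise a guess that happens to hold for the examples quoted; descriptions elsewhere in the index may not fit it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Description:
    units: int                         # first number: independent units
    measurements: int                  # second number: measurements per unit
    variables: List[Tuple[str, int]]   # (variable type, how many), in order
    repeats: int = 1                   # r in rep(r); 1 when rep(...) is absent
    label: Optional[str] = None        # text inside [...], if any

    def is_consistent(self) -> bool:
        """Assumed rule: variables per occasion * occasions == measurements."""
        if not self.variables:         # e.g. "9 x 9 [correlation matrix]"
            return True
        per_occasion = sum(count for _, count in self.variables)
        return per_occasion * self.repeats == self.measurements


def parse(text: str) -> Description:
    text = text.strip()

    # Optional trailing overall description in square brackets.
    label = None
    m = re.search(r"\[(.+?)\]\s*$", text)
    if m:
        label = m.group(1)
        text = text[: m.start()].strip()

    # Leading sizes, written either "n m" or "n x m".
    m = re.match(r"(\d+)\s+(?:x\s+)?(\d+)\b", text)
    if not m:
        raise ValueError(f"no leading pair of numbers in {text!r}")
    units, measurements = int(m.group(1)), int(m.group(2))
    rest = text[m.end():].strip()

    # Optional rep(r) prefix: the variables that follow are repeated r times.
    repeats = 1
    m = re.match(r"rep\((\d+)\)\s*", rest)
    if m:
        repeats = int(m.group(1))
        rest = rest[m.end():].strip()

    # A parenthesised group such as "(numeric, rate)" is unwrapped.
    if rest.startswith("(") and rest.endswith(")"):
        rest = rest[1:-1]

    variables = []
    for item in (part.strip() for part in rest.split(",")):
        if not item:
            continue
        vm = re.fullmatch(r"([a-z ]+?)(?:\((\d+)\))?", item)
        if not vm:
            raise ValueError(f"unrecognised variable term {item!r}")
        variables.append((vm.group(1).strip(), int(vm.group(2) or 1)))

    return Description(units, measurements, variables, repeats, label)


examples = [
    "40 3 numeric(2), binary [survival]",
    "1 26 rep(26)count [time series]",
    "9 x 9 [correlation matrix]",
    "13 22 rep(11)(numeric, rate)",
]
for example in examples:
    print(parse(example))
```

A parser like this deliberately stays as loose as the notation itself: it records what a description says without trying to resolve the ambiguities the text goes on to discuss, such as count(2) versus count, proportion.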
In any case, it is worth noting that many of the data sets have intrinsic interest in their own right and are informative, educational, or even amusing.

The data sets have been drawn from a very wide range of sources and application domains and we have also tried to provide material which can be used to illustrate a correspondingly wide range of statistical methodology. However, we are all too aware of the enormous size of the discipline of statistics. If you feel that our coverage of some subdomain of statistics is too weak, then please let us know - we can try to rectify the inadequacy in any future edition that may be produced.

Similarly, while we have made every effort to ensure the accuracy of the figures, given the number of digits reproduced it is likely that some inaccuracies exist. We apologise in advance should this prove to be the case and would appreciate being informed of any inaccuracies that you spot. At least the presence of the data disk will remove the risk of further data entry errors beyond those we may have introduced! In this connection, the filename of each data set, as used on the data disk, is indicated in the data structure index.

We hope that the data sets collected here will be of value to both teachers and students of statistics. And that both teachers and students will enjoy analysing them.

David J. Hand
Fergus Daly
A. Dan Lunn
Kevin J. McConway
Elizabeth Ostrowski

The Open University, July 1993
