STATISTICAL COMPUTATION

Proceedings of a Conference Held at The University of Wisconsin, Madison, Wisconsin, April 28-30, 1969

Edited by

Roy C. Milton
Computing Center, The University of Wisconsin, Madison, Wisconsin

and

John A. Nelder
Rothamsted Experimental Station, Harpenden, Hertfordshire, England

Academic Press, New York and London, 1969

COPYRIGHT © 1969, BY ACADEMIC PRESS, INC.
ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM, BY PHOTOSTAT, MICROFILM, RETRIEVAL SYSTEM, OR ANY OTHER MEANS, WITHOUT WRITTEN PERMISSION FROM THE PUBLISHERS.

ACADEMIC PRESS, INC., 111 Fifth Avenue, New York, New York 10003

United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD., Berkeley Square House, London W1X 6BA

LIBRARY OF CONGRESS CATALOG CARD NUMBER: 71-84248

PRINTED IN THE UNITED STATES OF AMERICA

SPEAKERS

Albert E. Beaton, George E. P. Box, John M. Chambers, Peter J. Claringbold, William W. Cooley, Brian E. Cooper, Wilfrid J. Dixon, John D. Gabbe, Gene H. Golub, John C. Gower, Herman O. Hartley, William J. Hemmerle, David F. Hendry, Kenneth E. Iverson, Joseph B. Kruskal, Frank J. Massey, Mervin E. Muller, John A. Nelder, Theodor D. Sterling, David L. Wallace, Graham N. Wilkinson

PREFACE

The inspiration for the Conference on Statistical Computation occurred at the meeting of the International Statistical Institute in Sydney, 1967. As a result of formal presentations and informal discussions, especially involving John Nelder and Graham Wilkinson, George Box returned to Madison to suggest to Mervin Muller and Ben Rosen that the University of Wisconsin organize a conference that would present and evaluate the current status of some basic aspects of the organization of statistical data processing and computing, and suggest directions for future research and development.
It was felt that international discussion, evaluation, and cooperation were needed to begin to cope with such problems as duplication of effort, communicability between statistical programs, definition and specification of data structures, and statistical processing languages. It seemed appropriate for the University of Wisconsin to be host for such a conference because, among other reasons, the sharing of facilities in the Computer Sciences-Statistics Center by the Computing Center, the Department of Statistics, and the Department of Computer Sciences was itself symbolic of the important relationships to be emphasized by the conference.

The organizing committee consisted of three representatives from Wisconsin, Roy Milton of the Computing Center (Chairman), Grace Wahba (Statistics), and John Halton (Computer Sciences), together with John Nelder (Rothamsted Experimental Station, England) and Graham Wilkinson (Division of Mathematical Statistics, C.S.I.R.O., Australia). Together with their advisors (Box, Muller, and Rosen) they established the program and invited contributors, completing this preliminary work by May 1968. At the same time a proposal was made to the National Science Foundation for support of the conference.

The program included five sessions:

(1) Statistical data screening with computers
(2) Specifications for statistical data structures
(3) Statistical systems and languages
(4) Teaching of statistics with computers
(5) Current techniques in numerical analysis related to statistical computation.

It is clear that contributors to a conference such as this cannot exhaust the subject matter, nor will the selection of speakers include all the competent and productive persons in the various areas of interest. We hope, however, that this volume contains a reasonable cross section of topics and contributors and, further, that it will serve to stimulate thought about where the subject now stands and where we go from here.
There was also a panel discussion on "Collaboration in developing statistical program systems." This discussion consisted of remarks by Paul Meier (Chairman), Joseph Cameron, John Chambers, Wilfrid Dixon, Mervin Muller, John Nelder, and Graham Wilkinson, plus participation from the floor. The variety of interests and points of view that were expressed brought out the need for further discussion and planning to discover areas where collaboration may be both possible and fruitful, and the most appropriate way of organizing such collaboration. An attempt will be made to follow up the suggestions made during the discussion.

Attendance at the conference was over 300 persons and included about 30 visitors from England, Scotland, Canada, and Australia. This response far surpassed our early estimates of the current interest in statistical computation. The interests of those attending were various: statistical theory, applied statistics (data analysis), numerical analysis, computer management, programming, language design, and computer science. The papers and discussion made it clear that all these interests were relevant to our subject. Statistical theory underlies the algorithms that we write to act on our data, applied statistics contributes its pattern-exposing techniques for looking at data, while numerical analysis validates the accuracy of algorithms for the inexact arithmetic of the computer. The computer manager shows us our jobs as elements in a queuing process of great complexity, itself susceptible to statistical analysis, while the programmer forces us to define exactly what we want to do with our data. Finally, the language designer offers us the hope of expressing our particular ideas more easily to the computer, and the computer scientist embeds our activities in the general one of storing and retrieving information of all kinds.
We believe the conference established useful links between these disciplines, and we shall be content if, as the result of it, there emerge better programs, well modularized, using better algorithms, being more machine-independent, and better documented.

Our thanks go to the conference speakers who contributed to this volume, and to the session chairmen: Francis J. Anscombe, Åke Björck, Joseph Cameron, Paul Meier, and Michael Godfrey. Grateful acknowledgment is made of the financial support by the National Science Foundation. Art work was done by Martha Fritz. Most of the papers were typed for photo-offset by Diana Webster, whose capable assistance and perseverance demand recognition.

By publishing the volume in this form we hope to meet the often-merited complaint by reviewers and others that the published proceedings of conferences appear much too late. At the same time we are very much aware of deficiencies arising from authors' variation in mathematical notation and referencing, and from the lack of an index. We hope that readers will accept these deficiencies in the interests of rapid production.

Roy C. Milton
John A. Nelder
Madison, Wisconsin
June 1969

THE CHALLENGE OF STATISTICAL COMPUTATION

Keynote Address

George E. P. Box
Department of Statistics, University of Wisconsin, Madison, Wisconsin

"When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind."

This famous remark of Lord Kelvin reminds us how very important to scientific progress is the proper handling of numbers. And so it's not surprising that we should find this gathering today with statisticians and computer scientists and various hybrids meeting together.
We are here to discuss the business of efficient scientific investigation, particularly as it involves, on the one hand, data gathering and generation (as exemplified by the design of experiments and sample surveys), and on the other, data analysis. But more than that, we must consider the iterative interplay of data generation and analysis on the efficient production of information and how this may be facilitated by the discerning use of computers.

I think sometimes people wonder why statistics, with its emphasis on probability and the theory of errors, is so important in science. In fact, statisticians sometimes are criticized on the grounds that they are so busy looking at the part of the observations that is due to error that they fail to pay enough attention to the other part which contains the essential information. One answer is that the only way to know that we have an adequate model of a given system is to study the errors. It's rather like a chemist who is doing a filtration — he can discover whether his filtration is fully effective by testing the filtrate and seeing if it is pure water.

And that's the sort of thing we do. An adequate statistical model is a transformation of the data that provides random noise — random noise that is uncorrelated with any possible input variable that we can think of. To know we have a model which fully accounts for some physical phenomenon we must be sure that what is left, after the effect of the model is allowed for, is informationless; and information must be discussed in terms of probability.

The business of model building is an interesting iterative process. It seems to consist of three stages, used in alternation, which may be called Model Identification, Model Fitting, and Diagnostic Checking.

Model identification is an informal technique which statisticians have been regrettably loath to own up to, and to discuss.
Here one is trying to get some idea of what model or class of models (or of hypotheses or of conjectures) is worthy to be tentatively entertained. This will obviously include such questions as what variables should be considered. We cannot, of course, use efficient statistical methods at this stage because we don't know yet what the model is.

Model fitting or estimation is a much more popular field of study because at first sight at least it seems to be associated with the purely mathematical question "If A were true would B follow?", which is a sensible mathematical question even if A is patently false.

Diagnostic checking is partly involved with what have been called tests of goodness of fit. However, merely testing fit is not enough. It is insufficient to say to the experimenter "It doesn't fit, good afternoon." He wants to know how it doesn't fit, and when he knows this he can begin to consider how he should modify the model and thus to commence a second iterative cycle.

All of these procedures can benefit enormously from the use of the computer and, in particular, from imaginative choice of the form and display of output in such a way as is likely to interact with the human mind and so allow the creative process to proceed. In some problems we may be dealing with very simple models but the sheer amount of the data may make the computer invaluable. In other problems the data may not be numerous, but the power of the computer is essential in coping with the complexity of the models that the scientist and statistician may originate or be led to.

Now of course human beings are supposed to be differentiated from other animals largely because they discovered how to use tools. Also, it is clear that there is enormous interaction between the things that humans do, and the tools they have, the one producing development of the other.
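The diagnostic-checking step described in the model-building cycle above (examine whether what remains after fitting is indistinguishable from uncorrelated noise) can be sketched as a small program. This is an editorial illustration in a modern language, not part of the address: the straight-line model, the simulated data, and the lag-1 autocorrelation check are all illustrative assumptions, and a real diagnostic analysis would examine residuals in several further ways.

```python
import random
import statistics

def lag1_autocorrelation(xs):
    """Lag-1 sample autocorrelation of a sequence."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

def fit_line(x, y):
    """Ordinary least-squares fit of the tentative model y = b0 + b1*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

# Model identification has (we assume) suggested a straight line in x.
# Simulate data from such a line plus independent Gaussian noise.
random.seed(1)
x = [float(i) for i in range(100)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Model fitting: estimate the parameters.
b0, b1 = fit_line(x, y)

# Diagnostic checking: for an adequate model the residuals should look
# like uncorrelated noise; under that hypothesis the lag-1 autocorrelation
# of n residuals lies roughly within +/- 2/sqrt(n) of zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
r1 = lag1_autocorrelation(residuals)
print(f"lag-1 autocorrelation of residuals: {r1:.3f}")
```

If the check fails, the analyst learns *how* the model fails (here, serially correlated residuals) and can begin the second iterative cycle the address describes.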
In particular, the nature and direction of enquiries which humans have undertaken have often been functions of the development of suitable tools and vice versa. Major quantum effects in the sciences have followed the development of suitable tools — the theory and practice of astronomy made little progress before the development of adequate telescopes, and giant strides have been made since the introduction of radio telescopes.

And so it is with the computer. The existence of the new tool has created a revolution not only in the kinds of things that scientists do, but in their thinking, in their theorizing and in their demands for new tools to elaborate these thoughts. This same revolution is also influencing the kinds of things that statisticians do. However, there is less of a revolution here than there should be, perhaps because there just aren't enough doers among the statisticians. This may even apply to some computer scientists — but I get ahead of myself.

We are fortunate indeed to be living in a time when exciting developments can take place in the theory of efficient data generation and data analysis. For example, one class of problems in data generation and analysis, which the computer has made it possible to tackle and which has interested us here for some time, arises in the building of mechanistic models. Thus in the study of a chemical reaction we may wish to choose experimental conditions which will best discriminate between a group of possible physical models defined in terms of sets of differential equations, each set of equations being appropriate on some specific view of the nature and ordering of the component molecular combinations. Another problem of this kind occurs when we believe we have the right model and may wish to plan experiments which will estimate its parameters with greatest precision.
Although the necessary numerical and statistical theory was certainly available, such problems were not considered until recently, chiefly, one supposes, because the computational aspects