Springer Texts in Statistics

Advisors:
George Casella  Stephen Fienberg  Ingram Olkin

Springer Science+Business Media, LLC

Alfred: Elements of Statistics for the Life and Social Sciences
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lehmann: Elements of Large-Sample Theory
Lehmann: Testing Statistical Hypotheses, Second Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models

(continued after index)

Jeffrey S. Simonoff

Analyzing Categorical Data

With 64 Figures

Springer

Jeffrey S. Simonoff
Leonard N. Stern School of Business
New York University
New York, NY 10012-0258
USA
[email protected]

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Cover illustration: The Poisson regression model (Figure 5.1).

Library of Congress Cataloging-in-Publication Data
Simonoff, Jeffrey S.
Analyzing categorical data / Jeffrey S. Simonoff.
p. cm. - (Springer texts in statistics)
Includes bibliographical references and index.
ISBN 978-1-4419-1837-6  ISBN 978-0-387-21727-7 (eBook)
DOI 10.1007/978-0-387-21727-7
1. Multivariate analysis. I. Title. II. Series.
QA278.S524 2003
519.5'35-dc21  2003044946

Printed on acid-free paper.

© 2003 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2003
Softcover reprint of the hardcover 1st edition 2003

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1    SPIN 10919460

Typesetting: Pages created by the author in LaTeX 2e using Springer's svsing2e.sty macro.

www.springer-ny.com

To my parents, Pearl and Morris Simonoff

Preface

This book grew out of notes that I prepared for a class in categorical data analysis that I gave at the Stern School of Business of New York University during the Fall 1998 semester. The class drew from a very diverse pool of students, including undergraduate statistics majors, M.B.A. students, M.S. and Ph.D. statistics students, and M.S. and Ph.D. students in other fields, including management, economics, and public administration. My task was to come up with a way of presenting the material in a way that such a heterogeneous group could grasp.

I immediately hit on the idea of using regression ideas to drive everything, since all of the students would have seen regression before, at some level. This is not a new idea; many books have in recent years exploited the generalized linear model when discussing categorical data analysis. I had in mind something a little different, however: a heavily data-analytic approach that covered a broad range of categorical data problems, from the count data models common in econometric modeling, to the loglinear models familiar to statisticians and social scientists, to binomial and multinomial regression models originally used in biological applications, with linear regression at the core.

This origin has several implications for the reader of this book. First, Chapters 2 and 3 contain a more detailed overview of least squares regression modeling than is typical in books of this type, since this material is continually drawn on when describing analogous techniques for categorical data.
There is also a good deal of detailed material on univariate discrete random variables (binomial, Poisson, negative binomial, multinomial) in Chapter 4. My hope is that these three chapters will make it possible for the book to stand alone more effectively, and make it useful for readers with a wide range of backgrounds. On the other hand, they make the book longer than it might have been; there is a lot to get through if a reader just sits down and attempts to read straight through.

The Poisson regression model, and its variants and extensions, is the engine for much of the material here. This is not unusual for books on econometric models for count data (for example, Long, 1997, or Cameron and Trivedi, 1998), but is not typical for categorical data analysis books written by statisticians, which tend to highlight the Poisson regression model as the basis of loglinear modeling, but do not focus very much on count data modeling problems directly (for example, Agresti, 1996, or Lloyd, 1999). On the other hand, this book also includes extensive discussion of loglinear models for contingency tables, including tables with special structure, which is common in statistical categorical data analysis books, but not count data modeling books. The close connection between these models and useful models for binomial and multinomial data makes it easy to then include material on logistic regression (and its variants and competitors) as well. The approach is classical; for a Bayesian approach to many of these problems, see Johnson and Albert (1999).

The target audience for this book is similar to the (student) audience for my original class, but extended to include working data analysts, whether they are statisticians, social scientists, engineers, or workers in any other area. My hope is that anyone who might be faced with categorical data to analyze would benefit from reading and using the book.
Some exposure to linear regression modeling would be helpful, although the material in Chapters 2-4 is designed to provide enough background for the later material.

The book has a strong focus on applying methods to real problems (which accounts for the active, rather than passive, nature of its title). For this reason, there is more detailed discussion of examples than is typical in books of this type, including more background material on the problem, more model checking and selection, and more discussion of implications from a contextual point of view. These discussions are set aside in the text with grey rules and titles, in order to emphasize their importance. Nothing can take the place of reading the original papers (or doing the analysis yourself), but my intention is to give the reader more of a flavor of the full data-analytic experience than is typical in a textbook. I hope that the readers will find the examples interesting on their own merits, not just as examples of categorical data analysis. Many are from recent papers in subject-area scientific journals. A more detailed description of the organization of the book is given in Section 1.2.

Many of the basic techniques for categorical data analysis are available in almost all statistical packages. A great deal (but not all) of categorical data modeling can be done using any statistical package that has a generalized linear model function. All of the statistical modeling and figures in the text are based on S-PLUS (Insightful Corporation, 2001), including functions and libraries written by myself and other people, with the following exceptions:

• ANOAS (Eliason, 1986) was used to fit the Goodman RC association model.
• Egret (Cytel Software Corporation, 2001a) was used for beta-binomial regression.
• LIMDEP (Greene, 2000a) was used for zero-inflated count regression and truncated Poisson regression.
• LogXact (Cytel Software Corporation, 1999) was used to conduct conditional analyses of logistic regression models.
• SAS (SAS Institute, 2000) was used to fit some ordinal regression models.
• SPSS (SPSS, Inc., 2001) was used to construct the Hosmer-Lemeshow statistic when fitting a binary logistic regression model.
• StatXact (Cytel Software Corporation, 2001b) was used for various conditional analyses on contingency tables.

I have set up a Web site related to the material in this book at the address http://www.stern.nyu.edu/~jsimonof/AnalCatData (a link also can be found at the Springer-Verlag Web site, http://www.springer-ny.com, under "Author Websites"). The site includes computer code, functions, and macros in S-PLUS (and the free package R, which is virtually identical to S-PLUS; see Ihaka and Gentleman, 1996), and SAS for the material in the book and the data sets used in the text and exercises (these data sets are identified by name in typewriter font in the book). Answers to selected exercises are available to instructors who adopt the book as a course text. For more information, see the book's Web site or the Springer-Verlag Web site.

I would like to thank some of the people who helped me in the preparation of this book. The students in my class on categorical data analysis helped me to focus my ideas on the subject in a systematic way. Many people provided me with interesting data sets; their names are given in the book where the data are introduced. David Firth, Mark Handcock, and Gary Simon read and commented on draft versions of the text. Yufeng Ding and Zheng Sun helped with software issues, and checked many of the computational results given in the text. John Kimmel was his usual patient and supportive self in guiding the book through the publication process. I would also like to thank my family for their support and encouragement during this long process.

East Meadow, New York    Jeffrey S. Simonoff
May 2003

Contents

Preface

1 Introduction
  1.1 The Nature of Categorical Data
  1.2 Organization of This Book

2 Gaussian-Based Data Analysis
  2.1 The Normal (Gaussian) Random Variable
    2.1.1 The Gaussian Density Function
    2.1.2 Large-Sample Inference for the Gaussian Random Variable
    2.1.3 Exact Inference for the Gaussian Random Variable
  2.2 Linear Regression and Least Squares
    2.2.1 The Linear Regression Model
    2.2.2 Least Squares Estimation
    2.2.3 Interpreting Regression Coefficients
    2.2.4 Assessing the Strength of a Regression Relationship
  2.3 Inference for the Least Squares Regression Model
    2.3.1 Hypothesis Tests and Confidence Intervals for β
    2.3.2 Interval Estimation for Predicted and Fitted Values
  2.4 Checking Assumptions
  2.5 An Example
  2.6 Background Material
  2.7 Exercises