Springer Texts in Statistics
Advisors:
George Casella Stephen Fienberg Ingram Olkin
Springer Science+Business Media, LLC
Springer Texts in Statistics
Alfred: Elements of Statistics for the Life and Social Sciences
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting,
Second Edition
Chow and Teicher: Probability Theory: Independence, Interchangeability,
Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and
Spatial Data; Nonparametric Regression and Response Surface
Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear
Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and
Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and
Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability,
Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference,
Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lehmann: Elements of Large-Sample Theory
Lehmann: Testing Statistical Hypotheses, Second Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models
(continued after index)
Jeffrey S. Simonoff
Analyzing Categorical Data
With 64 Figures
Springer
Jeffrey S. Simonoff
Leonard N. Stern School of Business
New York University
New York, NY 10012-0258
USA
jsimonof@stern.nyu.edu
Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA
Cover illustration: The Poisson regression model (Figure 5.1).
Library of Congress Cataloging-in-Publication Data
Simonoff, Jeffrey S.
Analyzing categorical data / Jeffrey S. Simonoff.
p. cm. - (Springer texts in statistics)
Includes bibliographical references and index.
ISBN 978-1-4419-1837-6 ISBN 978-0-387-21727-7 (eBook)
DOI 10.1007/978-0-387-21727-7
1. Multivariate analysis. I. Title. II. Series.
QA278.S524 2003
519.5'35-dc21 2003044946
Printed on acid-free paper.
© 2003 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 2003.
Softcover reprint of the hardcover 1st edition 2003
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York,
NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
9 8 7 6 5 4 3 2 1 SPIN 10919460
Typesetting: Pages created by the author in LaTeX2e using Springer's svsing2e.sty macro.
www.springer-ny.com
To my parents, Pearl and Morris Simonoff
Preface
This book grew out of notes that I prepared for a class in categorical data
analysis that I gave at the Stern School of Business of New York University
during the Fall 1998 semester. The class drew from a very diverse pool of
students, including undergraduate statistics majors, M.B.A. students, M.S.
and Ph.D. statistics students, and M.S. and Ph.D. students in other fields,
including management, economics, and public administration.
My task was to come up with a way of presenting the material in a
way that such a heterogeneous group could grasp. I immediately hit on the
idea of using regression ideas to drive everything, since all of the students
would have seen regression before, at some level. This is not a new idea;
many books have in recent years exploited the generalized linear model
when discussing categorical data analysis. I had in mind something a little
different, however: a heavily data-analytic approach that covered a broad
range of categorical data problems, from the count data models common in
econometric modeling, to the loglinear models familiar to statisticians and
social scientists, to binomial and multinomial regression models originally
used in biological applications, with linear regression at the core.
This origin has several implications for the reader of this book. First,
Chapters 2 and 3 contain a more detailed overview of least squares regression
modeling than is typical in books of this type, since this material is
continually drawn on when describing analogous techniques for categorical
data. There is also a good deal of detailed material on univariate discrete
random variables (binomial, Poisson, negative binomial, multinomial) in
Chapter 4. My hope is that these three chapters will make it possible for
the book to stand alone more effectively, and make it useful for readers
with a wide range of backgrounds. On the other hand, they make the book
longer than it might have been; there is a lot to get through if a reader just
sits down and attempts to read straight through.
The Poisson regression model, and its variants and extensions, is the engine
for much of the material here. This is not unusual for books on econometric
models for count data (for example, Long, 1997, or Cameron and
Trivedi, 1998), but is not typical for categorical data analysis books written
by statisticians, which tend to highlight the Poisson regression model as
the basis of loglinear modeling, but do not focus very much on count data
modeling problems directly (for example, Agresti, 1996, or Lloyd, 1999).
On the other hand, this book also includes extensive discussion of loglinear
models for contingency tables, including tables with special structure,
which is common in statistical categorical data analysis books, but not
count data modeling books. The close connection between these models
and useful models for binomial and multinomial data makes it easy to then
include material on logistic regression (and its variants and competitors) as
well. The approach is classical; for a Bayesian approach to many of these
problems, see Johnson and Albert (1999).
The target audience for this book is similar to the (student) audience for
my original class, but extended to include working data analysts, whether
they are statisticians, social scientists, engineers, or workers in any other
area. My hope is that anyone who might be faced with categorical data
to analyze would benefit from reading and using the book. Some exposure
to linear regression modeling would be helpful, although the material in
Chapters 2-4 is designed to provide enough background for the later material.
The book has a strong focus on applying methods to real problems
(which accounts for the active, rather than passive, nature of its title). For
this reason, there is more detailed discussion of examples than is typical
in books of this type, including more background material on the problem,
more model checking and selection, and more discussion of implications
from a contextual point of view. These discussions are set aside in the
text with grey rules and titles, in order to emphasize their importance.
Nothing can take the place of reading the original papers (or doing the
analysis yourself), but my intention is to give the reader more of a flavor
of the full data-analytic experience than is typical in a textbook. I hope
that the readers will find the examples interesting on their own merits, not
just as examples of categorical data analysis. Many are from recent papers
in subject-area scientific journals. A more detailed description of the
organization of the book is given in Section 1.2.
Many of the basic techniques for categorical data analysis are available
in almost all statistical packages. A great deal (but not all) of categorical
data modeling can be done using any statistical package that has a generalized
linear model function. All of the statistical modeling and figures in the
text are based on S-PLUS (Insightful Corporation, 2001), including functions
and libraries written by myself and other people, with the following
exceptions:
• ANOAS (Eliason, 1986) was used to fit the Goodman RC association
model.
• Egret (Cytel Software Corporation, 2001a) was used for beta-binomial
regression.
• LIMDEP (Greene, 2000a) was used for zero inflated count regression
and truncated Poisson regression.
• LogXact (Cytel Software Corporation, 1999) was used to conduct
conditional analyses of logistic regression models.
• SAS (SAS Institute, 2000) was used to fit some ordinal regression
models.
• SPSS (SPSS, Inc., 2001) was used to construct the Hosmer-Lemeshow
statistic when fitting a binary logistic regression model.
• StatXact (Cytel Software Corporation, 2001b) was used for various
conditional analyses on contingency tables.
I have set up a Web site related to the material in this book at the address
http://www.stern.nyu.edu/~jsimonof/AnalCatData (a link also can be
found at the Springer-Verlag Web site, http://www.springer-ny.com, under
"Author Websites"). The site includes computer code, functions, and
macros in S-PLUS (and the free package R, which is virtually identical to
S-PLUS; see Ihaka and Gentleman, 1996), and SAS for the material in the
book and the data sets used in the text and exercises (these data sets are
identified by name in typewriter font in the book). Answers to selected
exercises are available to instructors who adopt the book as a course text.
For more information, see the book's Web site or the Springer-Verlag Web
site.
I would like to thank some of the people who helped me in the preparation
of this book. The students in my class on categorical data analysis
helped me to focus my ideas on the subject in a systematic way. Many
people provided me with interesting data sets; their names are given in the
book where the data are introduced. David Firth, Mark Handcock, and
Gary Simon read and commented on draft versions of the text. Yufeng
Ding and Zheng Sun helped with software issues, and checked many of the
computational results given in the text. John Kimmel was his usual patient
and supportive self in guiding the book through the publication process. I
would also like to thank my family for their support and encouragement
during this long process.
East Meadow, New York Jeffrey S. Simonoff
May 2003
Contents
Preface vii
1 Introduction 1
1.1 The Nature of Categorical Data 1
1.2 Organization of This Book 3
2 Gaussian-Based Data Analysis 7
2.1 The Normal (Gaussian) Random Variable 7
2.1.1 The Gaussian Density Function 7
2.1.2 Large-Sample Inference for the Gaussian Random Variable 8
2.1.3 Exact Inference for the Gaussian Random Variable 11
2.2 Linear Regression and Least Squares 12
2.2.1 The Linear Regression Model 12
2.2.2 Least Squares Estimation 14
2.2.3 Interpreting Regression Coefficients 15
2.2.4 Assessing the Strength of a Regression Relationship 17
2.3 Inference for the Least Squares Regression Model 18
2.3.1 Hypothesis Tests and Confidence Intervals for β 18
2.3.2 Interval Estimation for Predicted and Fitted Values 19
2.4 Checking Assumptions 20
2.5 An Example 21
2.6 Background Material 25
2.7 Exercises 26