Models for Discrete Data
REVISED EDITION
DANIEL ZELTERMAN
Published in the United States
by Oxford University Press Inc., New York
© Daniel Zelterman, 2006
First published 2006
Printed in Great Britain
Biddles Ltd., King’s Lynn, Norfolk
ISBN 0-19-856701-4 978-0-19-856701-1
PREFACE
Discrete or count data arise in experiments where the outcome variables are the
number of individuals classified into unique, nonoverlapping categories. I hope
this volume will be used as a text to accompany a one-semester master’s graduate
level course. Courses similar to this one are taken in almost every statistics or
biostatistics department that offers a graduate degree. By the end of the course
the students should be comfortable with the language of logistic regression and
log-linear models. They should be able to run the major statistical packages and
then interpret the output.
Many of our students go on to work in the pharmaceutical industry after
graduating with a master’s degree. Consequently, the choice of examples has a
decidedly health/medical bias. We expect our students to be useful to their em-
ployers the day they leave our program so there is not a lot of time to spend on
advanced theory that is not directly applicable. On the other hand, the subject
matter is constantly changing so those in the field are increasingly expected to
keep abreast and come up with creative solutions to new problems. Similarly,
a second, but smaller prospective audience is the PhD practitioner who needs
an introduction to the coordinate-free material in Chapter 5 or a review of the
sample size estimation methodology. Chapter 6 ends with some recent devel-
opments in the area of sparse tables. This subject is ripe for further research
investigation.
Let me put this monograph in perspective with other books already on the
market. It is not my aim to produce an encyclopedia of topics with hundreds of
references such as the texts by Agresti (1990) or Bishop et al. (1975). These
two outstanding volumes cover the basics and also give a summary of the (then)
current research. I want the mathematical treatment of this book to be above
the level of the Freeman Jr. (1987) and Fienberg (1980) texts but not as high
as Plackett’s book (1981). The number and depth of topics is about the same
as Lindsay (1995) and the Fienberg, Freeman, and Plackett texts. The books by
Collett (1991), Cox and Snell (1989), and Hosmer and Lemeshow (1989) give in-depth
discussions of logistic regression, which is covered here in one chapter and,
consequently, in less detail. Similarly, the volume by Christensen (1990) examines
only log-linear models.
How is this book different? The most important difference is the inclusion and
emphasis of topics I feel are important and not covered elsewhere in sufficient
depth in relation to their usefulness. These topics include the negative multi-
nomial distribution and the many forms of the hypergeometric distribution.
Another area often neglected is the coordinate-free models described here in
Chapter 5. These models are part of the larger family of generalized linear models
which have been made popular by McCullagh and Nelder (1989). We show how
to implement coordinate-free models using the generalized linear model proce-
dure in SAS.
Another major difference of this book is a detailed treatment of the issues
of sample size and power. Probably the most common question asked of statis-
ticians is, ‘How large a sample size is needed?’ This question must be answered
before any epidemiological or clinical medical research can be initiated. Most
grant proposals to potential funding agencies and departments require that this
issue be addressed. This book covers power and sample size estimation from the
view of Fisher’s exact inference (Section 2.4.2) and as an asymptotic noncentral
chi-squared calculation (Section 4.4.2). These techniques are not obscure, but
they are not covered elsewhere in depth proportional to their usefulness or im-
portance. The table in Appendix A for calculating power is very useful in this
calculation.
A third difference of this book is the approach of integrating the software
into the text. The methodology in Chapters 3, 4, and 5 interweaves the theory,
the examples and the computer programs. The programs are largely written
in SAS, a popular software package that the practicing statistician should be
familiar with. The reader should have at least a rudimentary knowledge of the
SAS computing package. This package represents important working tools to
the applied statistician. A few remaining programs are written for Splus or in
FORTRAN.
The readers who are unfamiliar with SAS but have access to this software
could easily modify the examples in this book to suit their own individual needs.
My experience is that this is how much software is learned: not by reading
computer manuals but rather by copying and modifying existing programs. We
will describe the output from these programs.
The material in Section 4.5.3 is also unique. The subject of closed or explicit
form maximum likelihood estimates does not appear in other similar books on
categorical data. The methods are somewhat abstract and can be omitted on first
reading. Nevertheless, the topic has received considerable research attention and
the curious reader may be left wondering about this elementary, yet unresolved
issue. The problems associated with sparse data are outlined in Section 6.3.2.
This area is in need of further methodological development.
What do I ask of my readers? At a minimum, the reader should know about
such elementary statistical topics as sample means and variances, the Pearson
chi-squared, and statistical distributions such as the binomial and Poisson. The
computing examples will be understood by the reader with a basic knowledge of
SAS. In order to more fully appreciate all of the theory, the reader should have
had a one-year graduate level course in mathematical statistics at the level of
Hogg and Craig (1970) and a single semester course in linear models covering
topics such as orthogonal contrasts. This advanced reader should also be familiar
with matrix multiplication, maximum likelihood estimation, sufficient statistics,
moment generating functions, and hypothesis tests. The small amount of linear
algebra needed is reviewed in Section 5.2.
The reader is encouraged to attempt all of the exercises that appear at the end
of every chapter. The exercises range from abstract mathematical to the practical
how-to variety. The reader who does not have the mathematical background
to work out the details should make an attempt to understand the question
and appreciate the implications. These exercises are not ‘busy work’ but often
represent a deeper insight into the subject matter. Several exercises contain
references to more advanced theory in the published literature. Other exercises
contain examples of real data similar to what the applied statistician is likely to
encounter in practice.
Acknowledgments
We are all a product of our environments. There is no exception in this work.
Most of all, I must thank Beatrice Shube, without whose encouragement I would
never have attempted this work. In my first Yale career, I am indebted to I. Richard
Savage, Barry Margolin, and Jeff Simonoff. In Albany, I thank colleagues Lloyd
Lininger, Malcolm Sherman, and many former students. From the University of
Minnesota: Chap T. Le, Anjana Sengupta, Tom A. Louis, Lance A. Waller, Jim
Boen, Ivan S.-F. Chan, Brad Carlin, Joan Patterson, and many students. Back
at Yale, again, I thank Ted Holford, Chang Yu, Hongyu Zhao, and even more
students. Elsewhere, I am grateful to Paul Mielke, Alan Agresti, Noel Cressie,
Shelby Haberman, B. W. Brown, and the late Clifford Clogg, fondly remembered.
Finally I thank my parents who taught me to read and write, and my wife Linda.
DANIEL ZELTERMAN
New Haven, CT
October 1998
Preface to Revised Edition
My intended audience has changed since I drafted the first edition of the manu-
script as a member of a mathematics department. This revised edition includes
many more applied exercises requiring computer solution, usually in SAS. These
have been requested by my students and others who have taught using this book
in a classroom setting. A computer file containing the programs and data sets
is available by contacting the author.
A new section on Poisson regression has been added in Section 2.3. Michael
Friendly’s mosaic plots are included as a helpful diagnostic for two-way frequency
tables. These appear in Section 4.1.4.
I thank Heping Zhang and Paul Cislo for their help in proofreading this
manuscript. Any errors remaining, however, are the responsibility of the author.
July 2005
CONTENTS
1 Introduction
1.1 Issues
1.2 Some examples
2 Sampling distributions
2.1 Binomial and multinomial distributions
2.1.1 Binomial distribution
2.1.2 Multinomial distribution
2.2 The Poisson distribution
2.3 Poisson regression
2.4 The hypergeometric distribution
2.4.1 The extended hypergeometric distribution
2.4.2 Exact inference in a 2 × 2 table
2.4.3 Exact power and sample size estimation
2.4.4 The multivariate hypergeometric distribution
2.4.5 Distributions in higher dimensions
2.5 The negative binomial distribution
2.6 The negative multinomial distribution
Applied exercises
Theory exercises
3 Logistic regression
3.1 Three simple examples
3.2 Two complex examples
3.2.1 Nodal involvement
3.2.2 Job satisfaction
3.3 Diagnostics for logistic regression
Applied exercises
Theory exercises
4 Log-linear models
4.1 Models in two dimensions
4.1.1 Independence in a 2 × 2 table
4.1.2 Interaction in a 2 × 2 table
4.1.3 Log-linear models for I × J tables
4.1.4 Mosaic plots for I × J tables
4.2 Models in three dimensions
4.3 Models in higher dimensions
4.4 Goodness-of-fit tests
4.4.1 When the model fits
4.4.2 Sample size estimation
4.5 Maximum likelihood estimates
4.5.1 The Birch criteria
4.5.2 Model fitting algorithms
4.5.3 Explicit maximum likelihood estimates
Applied exercises
Theory exercises
5 Coordinate-free models
5.1 Motivating examples
5.2 Some linear algebra
5.3 The likelihood function
5.4 Maximum likelihood estimation
5.5 Examples
Applied exercises
Theory exercises
6 Additional topics
6.1 Longitudinal data
6.1.1 Marginal models
6.1.2 Transitional models
6.1.3 Missing values
6.2 Case–control studies
6.2.1 Stratified analyses
6.2.2 Pair-matched case–control designs
6.3 Advanced goodness of fit
6.3.1 Cressie and Read statistics
6.3.2 Sparse data problems
6.3.3 High-dimensional data
Applied exercises
Theory exercises
Appendix A: Power for chi-squared tests
Appendix B: A FORTRAN program for exact tests in tables
Appendix C: Splus programs for the extended hypergeometric distribution
References
Selected solutions and hints
Index
1
INTRODUCTION
In this chapter we briefly describe a set of numerical examples. Each example is
used to provoke some questions related to its possible analysis. In every case we
are trying to get the reader thinking about the issues involved. The questions
are resolved in subsequent chapters, and references to the appropriate sections
are cited. In many cases, the SAS program is given along with the solution.
1.1 Issues
Every curriculum in statistics includes courses in areas such as linear regression,
linear models, or design of experiments. These are, simply put, in-depth studies
of the normal distribution. All of these courses contain a discussion of core topics
such as least-squares estimation and the analysis of variance. In the analysis of
discrete data, however, there is no such common ground that must always be
covered. Several different courses on the analysis of discrete or categorical data
having nothing in common with each other can be taught. There is no one topic
that must absolutely be covered. This lack of a cohesive backbone is a result
of several different approaches to the analysis of discrete data, each of which
has its adherents and detractors. With such a wide variety of topics available,
it is tempting to include too many, for fear of omitting something important.
Generalized linear models, popularized by McCullagh and Nelder (1989) and
Dobson (1990), are an important step toward developing a single unified theory
that covers many of these topics.
To add to this apparent disarray, many of these methods have found their
way into various disciplines such as economics and political and social sciences.
Each of these disciplines has evolved its own unique vocabulary and names for
these techniques. Sometimes methods are developed simultaneously in different
fields, each claiming priority and each claiming the correct nomenclature.
1.2 Some examples
To stimulate further research and to get the reader thinking about the types of
data and the problems involved, this section presents a number of examples that
will appear again later in this volume. Each of these examples is described here
and accompanied by a series of questions that will get the reader thinking about
the sort of problems that need to be addressed and to motivate the methodology
that will be developed.
The data given in Table 1.1 were reported by Innes et al. (1969). Avadex is the
trade name of diallate, a fungicide and a herbicide marketed by Monsanto, that
has been in use for many years. It is often used to treat a variety of common
crop seeds such as soybeans, beets, corn, and alfalfa, before and after planting.
Table 1.1 summarizes the results of an experiment that evaluated
the effects of Avadex on lung cancer in mice. Sixteen male mice were
continuously fed small amounts of Avadex and 79 unexposed ‘control’ mice were
kept separately. After 85 weeks all mice were sacrificed and their lungs examined
by a pathologist for the appearance of tumors.
Out of the 95 mice in the study, 4 exposed mice and 5 control mice exhibited
tumors. Table 1.1 gives a convenient way to represent the data as a 2 × 2 table
of discrete counts. Over the past century a tremendous amount of intellectual
energy has been spent on the analysis of 2 × 2 tables of counts such as those
given in Table 1.1. The reader should already be able to analyze these data using
Pearson’s chi-squared statistic. The chi-squared statistic dates back to the year
1900 and is taught in most elementary statistics courses.
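As a reminder of that calculation, the following is a minimal sketch in Python (not the SAS used elsewhere in this book) of Pearson's chi-squared statistic for the counts in Table 1.1. The function name `pearson_chi2` is an illustrative choice, no continuity correction is applied, and the p-value assumes the usual asymptotic chi-squared reference distribution with 1 degree of freedom.

```python
import math

# Counts from Table 1.1.
# Rows: mice with tumors, no tumors; columns: exposed, control.
table = [[4, 5],
         [12, 74]]

def pearson_chi2(counts):
    """Pearson's chi-squared statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

chi2 = pearson_chi2(table)
# Upper tail of a chi-squared variate with 1 df:
# P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
```

The expected count for exposed mice with tumors is only about 1.5, so the asymptotic p-value should be read with some caution for a table this sparse.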
Are there other useful methods for analyzing these data? In Chapter 4 we
describe the likelihood ratio statistic, G², that behaves very much like the
chi-squared statistic. Over the years many other statistics have been developed that
are similar in behavior to the Pearson chi-squared statistic. Rather than in-
troduce and describe them individually, Section 6.3.1 describes a continuum of
statistics containing many of these chi-squared equivalent statistics as special
cases. This large family was first described by Cressie and Read (1984).
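To make the idea of a continuum concrete, here is a hedged Python sketch of the Cressie and Read power-divergence family applied to Table 1.1. It assumes the standard form 2/(λ(λ+1)) Σ O[(O/E)^λ − 1], in which λ = 1 gives Pearson's chi-squared and the limit λ → 0 gives G². The function name `power_divergence` and the particular λ values shown are illustrative choices, not taken from the text.

```python
import math

# Cells of Table 1.1, row by row: tumors (exposed, control),
# then no tumors (exposed, control).
observed = [4, 5, 12, 74]
row_totals = [9, 9, 86, 86]      # row total belonging to each cell
col_totals = [16, 79, 16, 79]    # column total belonging to each cell
n = 95
# Expected counts under independence of rows and columns.
expected = [r * c / n for r, c in zip(row_totals, col_totals)]

def power_divergence(obs, exp, lam):
    """Cressie-Read statistic with index lam (assumed standard form)."""
    if lam == 0:
        # Limit lam -> 0: the likelihood ratio statistic G^2.
        return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    if lam == -1:
        # Limit lam -> -1; requires every observed count to be positive.
        return 2 * sum(e * math.log(e / o) for o, e in zip(obs, exp))
    total = sum(o * ((o / e) ** lam - 1) for o, e in zip(obs, exp))
    return 2 * total / (lam * (lam + 1))

for lam in (1, 2 / 3, 0):
    print(f"lambda = {lam:.3f}: {power_divergence(observed, expected, lam):.3f}")
```

Setting `lam = 1` reproduces Pearson's statistic, and `lam = 0` gives G²; the intermediate value 2/3 is the one suggested by Cressie and Read.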
There are also exact methods available that do not rely on the asymptotic
chi-squared distribution of these statistics. These are largely computer-intensive
methods that consist of enumerating every possible discrete outcome. For the
fixed marginal totals of Table 1.1, notice that there are only 10 possible outcomes.
The upper left cell could contain any count from 0 through 9 without changing
the marginal totals. This one cell determines the contents of the other three,
hence the expression ‘1 degree of freedom’. In Section 2.4 we examine this table
using exact methods.
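The enumeration just described can be sketched in a few lines of Python. This is an illustration only: it assumes the null hypothesis of no association between exposure and tumors, under which the upper left cell follows a hypergeometric distribution given the margins, and the variable names are hypothetical.

```python
from math import comb

row_totals = (9, 86)     # mice with tumors, no tumors
col_totals = (16, 79)    # exposed, control
n = sum(row_totals)

# Range of the upper left cell that keeps all four counts non-negative.
lowest = max(0, row_totals[0] - col_totals[1])
highest = min(row_totals[0], col_totals[0])

outcomes = []
for x in range(lowest, highest + 1):
    # The one free cell x determines the other three.
    cells = [[x, row_totals[0] - x],
             [col_totals[0] - x, col_totals[1] - (row_totals[0] - x)]]
    # Hypergeometric probability of this table given the margins.
    prob = (comb(col_totals[0], x) * comb(col_totals[1], row_totals[0] - x)
            / comb(n, row_totals[0]))
    outcomes.append((cells, prob))

for cells, prob in outcomes:
    print(cells, f"{prob:.4f}")

# One-sided tail: tables with at least the 4 exposed tumors observed.
p_upper = sum(prob for cells, prob in outcomes if cells[0][0] >= 4)
print(f"P(upper left cell >= 4) = {p_upper:.4f}")
```

Because only one cell is free, the loop visits exactly the 10 tables counted above, and the probabilities sum to one.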
Another candidate for exact methods is given in Table 1.2, which summarizes
the experiences of 46 subjects with each of three different drugs, labeled A,
B, and C. The reactions to each drug were described as being favorable or
unfavorable. This 2³ table of discrete counts arises by considering every possible
combination of favorable and unfavorable reactions to each of the three drugs.
The generally small counts in each of the 2³ = 8 cells in this table lead us
Table 1.1 Incidence of tumors in mice exposed to Avadex.
Source: Innes et al. 1969

                    Exposed   Control   Totals
Mice with tumors          4         5        9
No tumors                12        74       86
Totals                   16        79       95