
Models for Discrete Data PDF

292 Pages·2006·1.527 MB·English

Preview Models for Discrete Data

Models for Discrete Data
REVISED EDITION

DANIEL ZELTERMAN

Published in the United States by Oxford University Press Inc., New York
© Daniel Zelterman, 2006
First published 2006
Printed in Great Britain
Biddles Ltd., King’s Lynn, Norfolk
ISBN 0-19-856701-4
978-0-19-856701-1

PREFACE

Discrete or count data arise in experiments where the outcome variables are the number of individuals classified into unique, nonoverlapping categories. I hope this volume will be used as a text to accompany a one-semester, master’s-level graduate course. Courses similar to this one are taken in almost every statistics or biostatistics department that offers a graduate degree. By the end of the course the students should be comfortable with the language of logistic regression and log-linear models. They should be able to run the major statistical packages and then interpret the output.

Many of our students go on to work in the pharmaceutical industry after graduating with a master’s degree. Consequently, the choice of examples has a decidedly health/medical bias. We expect our students to be useful to their employers the day they leave our program, so there is not a lot of time to spend on advanced theory that is not directly applicable. On the other hand, the subject matter is constantly changing, so those in the field are increasingly expected to keep abreast and come up with creative solutions to new problems. Similarly, a second but smaller prospective audience is the PhD practitioner who needs an introduction to the coordinate-free material in Chapter 5 or a review of the sample size estimation methodology. Chapter 6 ends with some recent developments in the area of sparse tables. This subject is ripe for further research investigation.

Let me put this monograph in perspective with other books already on the market. It is not my aim to produce an encyclopedia of topics with hundreds of references such as the texts by Agresti (1990) or Bishop et al. (1975).
These two outstanding volumes cover the basics and also give a summary of the (then) current research. I want the mathematical treatment of this book to be above the level of the Freeman Jr. (1987) and Fienberg (1980) texts but not as high as Plackett’s book (1981). The number and depth of topics is about the same as Lindsay (1995) and the Fienberg, Freeman, and Plackett texts. The books by Collett (1991), Cox and Snell (1989), and Hosmer and Lemeshow (1989) give in-depth discussions of logistic regression, which is covered here in one chapter and consequently in less detail. Similarly, the volume by Christensen (1990) examines only log-linear models.

How is this book different? The most important difference is the inclusion and emphasis of topics I feel are important and not covered elsewhere in sufficient depth in relation to their usefulness. These topics include the negative multinomial distribution and the many forms of the hypergeometric distribution. Another area often neglected is the coordinate-free models described here in Chapter 5. These models are part of the larger family of generalized linear models which have been made popular by McCullagh and Nelder (1989). We show how to implement coordinate-free models using the generalized linear model procedure in SAS.

Another major difference of this book is a detailed treatment of the issues of sample size and power. Probably the most common question asked of statisticians is, ‘How large a sample size is needed?’ This question must be answered before any epidemiological or clinical medical research can be initiated. Most grant proposals to potential funding agencies and departments require that this issue be addressed. This book covers power and sample size estimation from the view of Fisher’s exact inference (Section 2.4.2) and as an asymptotic noncentral chi-squared calculation (Section 4.4.2). These techniques are not obscure, but they are not covered elsewhere in depth proportional to their usefulness or importance.
The table in Appendix A for calculating power is very useful in this calculation.

A third difference of this book is the approach of integrating the software into the text. The methodology in Chapters 3, 4, and 5 interweaves the theory, the examples, and the computer programs. The programs are largely written in SAS, a popular software package that the practicing statistician should be familiar with. The reader should have at least a rudimentary knowledge of the SAS computing package. This package is an important working tool for the applied statistician. A few remaining programs are written for Splus or in FORTRAN. Readers who are unfamiliar with SAS but have access to this software could easily modify the examples in this book to suit their own individual needs. My experience is that this is how much software is learned: not by reading computer manuals but rather by copying and modifying existing programs. We will describe the output from these programs.

The material in Section 4.5.3 is also unique. The subject of closed or explicit form maximum likelihood estimates does not appear in other similar books on categorical data. The methods are somewhat abstract and can be omitted on first reading. Nevertheless, the topic has received considerable research attention and the curious reader may be left wondering about this elementary, yet unresolved issue. The problems associated with sparse data are outlined in Section 6.3.2. This area is in need of further methodological development.

What do I ask of my readers? At a minimum, the reader should know about such elementary statistical topics as sample means and variances, the Pearson chi-squared, and statistical distributions such as the binomial and Poisson. The computing examples will be understood by the reader with a basic knowledge of SAS.
In order to more fully appreciate all of the theory, the reader should have had a one-year graduate level course in mathematical statistics at the level of Hogg and Craig (1970) and a single semester course in linear models covering topics such as orthogonal contrasts. This advanced reader should also be familiar with matrix multiplication, maximum likelihood estimation, sufficient statistics, moment generating functions, and hypothesis tests. The small amount of linear algebra needed is reviewed in Section 5.2.

The reader is encouraged to attempt all of the exercises that appear at the end of every chapter. The exercises range from the abstract mathematical to the practical how-to variety. The reader who does not have the mathematical background to work out the details should make an attempt to understand the question and appreciate the implications. These exercises are not ‘busy work’ but often represent a deeper insight into the subject matter. Several exercises contain references to more advanced theory in the published literature. Other exercises contain examples of real data similar to what the applied statistician is likely to encounter in practice.

Acknowledgments

We are all a product of our environments. There is no exception in this work. Most of all, I must thank Beatrice Shube, without whose encouragement I would never have attempted this work. In my first Yale career, I am indebted to I. Richard Savage, Barry Margolin, and Jeff Simonoff. In Albany, I thank colleagues Lloyd Lininger, Malcolm Sherman, and many former students. From the University of Minnesota: Chap T. Le, Anjana Sengupta, Tom A. Louis, Lance A. Waller, Jim Boen, Ivan S.-F. Chan, Brad Carlin, Joan Patterson, and many students. Back at Yale, again, I thank Ted Holford, Chang Yu, Hongyu Zhao, and even more students. Elsewhere, I am grateful to Paul Mielke, Alan Agresti, Noel Cressie, Shelby Haberman, B. W. Brown, and the late Clifford Clogg, fondly remembered. Finally I thank my parents who taught me to read and write, and my wife Linda.
DANIEL ZELTERMAN
New Haven, CT
October 1998

Preface to Revised Edition

My intended audience has changed since I drafted the first edition of the manuscript as a member of a mathematics department. This revised edition includes many more applied exercises requiring computer solution, usually in SAS. These have been requested by my students and others who have taught using this book in a classroom setting. A computer file containing the programs and data sets is available by contacting the author.

A new section on Poisson regression has been added in Section 2.3. Michael Friendly’s mosaic plots are included as a helpful diagnostic for two-way frequency tables. These appear in Section 4.1.4.

I thank Heping Zhang and Paul Cislo for their help in proofreading this manuscript. Any errors remaining, however, are the responsibility of the author.

July 2005

CONTENTS

1 Introduction
    1.1 Issues
    1.2 Some examples
2 Sampling distributions
    2.1 Binomial and multinomial distributions
        2.1.1 Binomial distribution
        2.1.2 Multinomial distribution
    2.2 The Poisson distribution
    2.3 Poisson regression
    2.4 The hypergeometric distribution
        2.4.1 The extended hypergeometric distribution
        2.4.2 Exact inference in a 2 × 2 table
        2.4.3 Exact power and sample size estimation
        2.4.4 The multivariate hypergeometric distribution
        2.4.5 Distributions in higher dimensions
    2.5 The negative binomial distribution
    2.6 The negative multinomial distribution
    Applied exercises
    Theory exercises
3 Logistic regression
    3.1 Three simple examples
    3.2 Two complex examples
        3.2.1 Nodal involvement
        3.2.2 Job satisfaction
    3.3 Diagnostics for logistic regression
    Applied exercises
    Theory exercises
4 Log-linear models
    4.1 Models in two dimensions
        4.1.1 Independence in a 2 × 2 table
        4.1.2 Interaction in a 2 × 2 table
        4.1.3 Log-linear models for I × J tables
        4.1.4 Mosaic plots for I × J tables
    4.2 Models in three dimensions
    4.3 Models in higher dimensions
    4.4 Goodness-of-fit tests
        4.4.1 When the model fits
        4.4.2 Sample size estimation
    4.5 Maximum likelihood estimates
        4.5.1 The Birch criteria
        4.5.2 Model fitting algorithms
        4.5.3 Explicit maximum likelihood estimates
    Applied exercises
    Theory exercises
5 Coordinate-free models
    5.1 Motivating examples
    5.2 Some linear algebra
    5.3 The likelihood function
    5.4 Maximum likelihood estimation
    5.5 Examples
    Applied exercises
    Theory exercises
6 Additional topics
    6.1 Longitudinal data
        6.1.1 Marginal models
        6.1.2 Transitional models
        6.1.3 Missing values
    6.2 Case–control studies
        6.2.1 Stratified analyses
        6.2.2 Pair-matched case–control designs
    6.3 Advanced goodness of fit
        6.3.1 Cressie and Read statistics
        6.3.2 Sparse data problems
        6.3.3 High-dimensional data
    Applied exercises
    Theory exercises
Appendix A: Power for chi-squared tests
Appendix B: A FORTRAN program for exact tests in tables
Appendix C: Splus programs for the extended hypergeometric distribution
References
Selected solutions and hints
Index

1 INTRODUCTION

In this chapter we briefly describe a set of numerical examples. Each example is used to provoke some questions related to its possible analysis. In every case we are trying to get the reader thinking about the issues involved. The questions are resolved in subsequent chapters, and references to the appropriate sections are cited. In many cases, the SAS program is given along with the solution.

1.1 Issues

Every curriculum in statistics includes courses in areas such as linear regression, linear models, or design of experiments. These are, simply put, in-depth studies of the normal distribution. All of these courses contain a discussion of core topics such as least-squares estimation and the analysis of variance. In the analysis of discrete data, however, there is no such common ground that must always be covered.
Several different courses on the analysis of discrete or categorical data having nothing in common with each other can be taught. There is no one topic that must absolutely be covered.

This lack of a cohesive backbone is a result of several different approaches to the analysis of discrete data, each of which has its adherents and detractors. With such a wide variety of topics available, it is tempting to include too many, for fear of omitting something important. Generalized linear models, popularized by McCullagh and Nelder (1989) and Dobson (1990), are an important step toward developing a single unified theory that covers many of these topics.

To add to this apparent disarray, many of these methods have found their way into various disciplines such as economics and political and social sciences. Each of these disciplines has evolved its own unique vocabulary and names for these techniques. Sometimes methods are developed simultaneously in different fields, each claiming priority and each claiming the correct nomenclature.

1.2 Some examples

To stimulate further research and to get the reader thinking about the types of data and the problems involved, this section presents a number of examples that will appear again later in this volume. Each of these examples is described here and accompanied by a series of questions that will get the reader thinking about the sort of problems that need to be addressed and motivate the methodology that will be developed.

The data in Table 1.1 were given by Innes et al. (1969). Avadex is the trade name of diallate, a fungicide and herbicide marketed by Monsanto that has been in use for many years. It is often used to treat a variety of common crop seeds such as soybeans, beets, corn, and alfalfa, before and after planting.

Table 1.1 summarizes the results of an experiment that evaluated the effects of Avadex on lung cancer in mice.
Sixteen male mice were continuously fed small amounts of Avadex and 79 unexposed ‘control’ mice were kept separately. After 85 weeks all mice were sacrificed and their lungs examined by a pathologist for the appearance of tumors.

Out of the 95 mice in the study, 4 exposed mice and 5 control mice exhibited tumors. Table 1.1 gives a convenient way to represent the data as a 2 × 2 table of discrete counts. Over the past century a tremendous amount of intellectual energy has been spent on the analysis of 2 × 2 tables of counts such as those given in Table 1.1. The reader should already be able to analyze these data using Pearson’s chi-squared statistic. The chi-squared statistic dates back to the year 1900 and is taught in most elementary statistics courses.

Are there other useful methods for analyzing these data? In Chapter 4 we describe the likelihood ratio statistic, G², that behaves very much like the chi-squared statistic. Over the years many other statistics have been developed that are similar in behavior to the Pearson chi-squared statistic. Rather than introduce and describe them individually, Section 6.3.1 describes a continuum of statistics containing many of these chi-squared equivalent statistics as special cases. This large family was first described by Cressie and Read (1984).

There are also exact methods available that do not rely on the asymptotic chi-squared distribution of these statistics. These are largely computer-intensive methods that consist of enumerating every possible discrete outcome. For the fixed marginal totals of Table 1.1, notice that there are only 10 possible outcomes. The upper left cell could contain any count from 0 through 9 without changing the marginal totals. This one cell determines the contents of the other three, hence the expression ‘1 degree of freedom’. In Section 2.4 we examine this table using exact methods.

Another candidate for exact methods is given in Table 1.2, which summarizes the experiences of 46 subjects with each of three different drugs, labeled A, B, and C.
The reactions to each drug were described as being favorable or unfavorable. This 2³ table of discrete counts arises by considering every possible combination of favorable and unfavorable reactions to each of the three drugs. The generally small counts in each of the 2³ = 8 cells in this table leads us

Table 1.1 Incidence of tumors in mice exposed to Avadex. Source: Innes et al. 1969

                    Exposed   Control   Totals
Mice with tumors       4         5         9
No tumors             12        74        86
Totals                16        79        95
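
The three approaches mentioned for Table 1.1 (Pearson's chi-squared, the likelihood ratio statistic G², and exact enumeration over the fixed margins) can be tried out in a few lines. What follows is a minimal Python sketch, not one of the book's SAS programs. All function and variable names are illustrative assumptions, and the exact probabilities assume the standard hypergeometric distribution for a 2 × 2 table with fixed marginal totals under the hypothesis of no association.

```python
from math import comb, log

# Table 1.1 (Innes et al. 1969): rows are tumor status, columns are exposure.
table = [[4, 5],     # mice with tumors: exposed, control
         [12, 74]]   # no tumors:        exposed, control

row_totals = [sum(row) for row in table]          # [9, 86]
col_totals = [sum(col) for col in zip(*table)]    # [16, 79]
n = sum(row_totals)                               # 95


def expected_counts(rows, cols, total):
    """Expected counts under independence: row total * column total / N."""
    return [[r * c / total for c in cols] for r in rows]


def pearson_chi2(obs, exp):
    """Pearson's statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(obs, exp)
               for o, e in zip(o_row, e_row))


def likelihood_ratio_g2(obs, exp):
    """Likelihood ratio statistic: 2 * sum of O * log(O / E); empty cells add 0."""
    return 2 * sum(o * log(o / e)
                   for o_row, e_row in zip(obs, exp)
                   for o, e in zip(o_row, e_row) if o > 0)


def exact_distribution(rows, cols):
    """Hypergeometric probabilities for the upper-left cell, margins held fixed."""
    total = sum(rows)
    lowest = max(0, rows[0] + cols[0] - total)
    highest = min(rows[0], cols[0])
    denom = comb(total, rows[0])
    return {x: comb(cols[0], x) * comb(total - cols[0], rows[0] - x) / denom
            for x in range(lowest, highest + 1)}


expected = expected_counts(row_totals, col_totals, n)
chi2 = pearson_chi2(table, expected)
g2 = likelihood_ratio_g2(table, expected)

exact = exact_distribution(row_totals, col_totals)
# One-sided exact p-value: probability of 4 or more exposed mice with tumors.
p_upper = sum(p for x, p in exact.items() if x >= table[0][0])

print(f"Pearson chi-squared = {chi2:.3f}, G2 = {g2:.3f}")
print(f"possible outcomes = {len(exact)}, exact upper-tail p = {p_upper:.4f}")
```

The dictionary returned by `exact_distribution` has one entry for each value the upper left cell can take, so its length reproduces the count of 10 possible outcomes noted in the text. Both chi-squared statistics are compared against a chi-squared distribution with 1 degree of freedom, while the exact p-value is summed directly from the enumerated probabilities.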
