Statistics for Biology and Health

Series Editors: M. Gail, K. Krickeberg, J. Samet, A. Tsiatis, W. Wong

For other titles published in this series, go to http://www.springer.com/series/2848

Per Kragh Andersen • Lene Theil Skovgaard

Regression with Linear Predictors

With 171 illustrations by Therese Graversen

Per Kragh Andersen
University of Copenhagen
Dept. Biostatistics
Øster Farimagsgade 5
DK-1014 Copenhagen K
Denmark

Lene Theil Skovgaard
University of Copenhagen
Dept. Biostatistics
Øster Farimagsgade 5
DK-1014 Copenhagen K
Denmark

Series Editors

M. Gail
National Cancer Institute
Bethesda, MD 20892, USA

K. Krickeberg
Le Chatelet
F-63270 Manglieu, France

J. Samet
Department of Preventive Medicine
Keck School of Medicine
University of Southern California
1441 Eastlake Ave. Room 4436, MC 9175
Los Angeles, CA 90089

A. Tsiatis
Department of Statistics
North Carolina State University
Raleigh, NC 27695, USA

W. Wong
Department of Statistics
Stanford University
Stanford, CA 94305-4065, USA

ISSN 1431-8776
ISBN 978-1-4419-7169-2
e-ISBN 978-1-4419-7170-8
DOI 10.1007/978-1-4419-7170-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010931468

© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

This is a book about regression analysis, that is, the situation in statistics where the distribution of a response (or outcome) variable is related to explanatory variables (or covariates). This is an extremely common situation in the application of statistical methods in many fields, and linear regression, logistic regression, and Cox proportional hazards regression are frequently used for quantitative, binary, and survival time outcome variables, respectively.

Several books on these topics have appeared and for that reason one may well ask why we embark on writing still another book on regression. We have two main reasons for doing this:

1. First, we want to highlight similarities among linear, logistic, proportional hazards, and other regression models that include a linear predictor. These models are often treated entirely separately in texts in spite of the fact that all operations on the models dealing with the linear predictor are precisely the same, including handling of categorical and quantitative covariates, testing for linearity and studying interactions.

2. Second, we want to emphasize that, for any type of outcome variable, multiple regression models are composed of simple building blocks that are added together in the linear predictor: that is, t-tests, one-way analyses of variance and simple linear regressions for quantitative outcomes, 2×2, 2×(k+1) tables and simple logistic regressions for binary outcomes, and 2- and (k+1)-sample logrank tests and simple Cox regressions for survival data. This has two consequences. All these simple and well known methods can be considered as special cases of the regression models. On the other hand, the effect of a single explanatory variable in a multiple regression model can be interpreted in a way similar to that obtained in the simple analysis, however, now valid only for the other explanatory variables in the model “held fixed”.
Note the important point that addition of simple terms in the linear predictor will imply an assumption of no interaction; that is, the effect of an explanatory variable is the same for all values of other explanatory variables in the model. This is an assumption that often needs careful consideration as part of the analysis.

In Chapter 1 the basic ideas are set up and the examples to be used throughout the book are introduced. Chapter 2 presents a review of background material on probability distributions and the principles of statistical inference. In Chapter 3 the simple building blocks for categorical explanatory variables are introduced for the three main types of outcome variables. Chapter 4 deals with one quantitative explanatory variable, first when a linear effect can be assumed and, next, certain models with a nonlinear effect of the covariate (still described by a linear predictor, however) are introduced. A very common example of such a nonlinear effect is a polynomial.

Having presented the building blocks, Chapter 5 discusses multiple regression models (i.e., models with several explanatory variables). We focus on models with two covariates and introduce the concepts of confounding and interaction (“effect modification”). In Chapter 6 we discuss model building strategies, in particular selection of explanatory variables for answering a specific research question and illustrate the strategies by thorough analyses of three specific examples.

Whereas Chapters 3 through 6 primarily deal with examples of linear models for quantitative outcomes, logistic models for binary data, and Cox models for lifetimes, Chapter 7 presents a number of other regression models with a linear predictor. These include the logistic models for ordinal and multinomial data, Poisson-type models for counts as well as alternative models for quantitative, binary, and lifetime data. Chapter 8 briefly mentions a number of extended models all involving a linear predictor.
These include multivariate models with more than one response variable per individual, for example repeated measurements and other types of correlated outcomes, and models with covariate measurement errors. The multivariate models include random effect models and marginal models. The treatment of these topics is by no means meant to be exhaustive but to serve mainly as a warning that models more complicated than the ones treated earlier in this book occur quite frequently and will produce erroneous conclusions if not analyzed properly. The book is concluded with four appendices summarizing notation, the use of logarithms, some recommendations, and simple programming, respectively.

It is important to notice that some sections of the book are more difficult to read than others, simply because of the varying level of complexity for the different methods we wish to cover.

The book is based on our personal experience as teachers, consultants, and researchers in biostatistics for more than three decades and all data examples are based on this.

Our intended readers are primarily researchers from scientific areas where statistics is being applied to analyze numerical data, for example, fields such as medicine, public health, dentistry, agriculture, and so on. For that reason we do not expect readers to have a strong background in mathematics. We limit the amount of mathematical formulas used, and we avoid the use of Greek letters in formulas altogether. We do, however, expect readers to have some familiarity with basic statistical terms but, to set a common level, Chapter 2 gives an overview of the necessary concepts. Although we have mainly written the book for applied scientists it is our hope that readers with a more mathematical background who wish to enter the field of biostatistics will also benefit from studying the book. We intend to provide a book that can both serve as a reference source and a course textbook. For the last purpose, chapters conclude with a series of exercises all dealing with data analysis.
The mainstream of our text presents aspects of the methods that are important for building regression models, yet we have in some places found it natural to add brief discussions of related concepts that are less important for regression analysis. Such digressions are marked in the text as Digression and also include some historical and more technical remarks.

An applied statistical text such as the present book obviously includes a large number of practical examples, and computational aspects are crucial. We have chosen not to base the presentation of examples in the book on a single piece of software. Instead, the book is accompanied by Web pages documenting examples by including computer code for the computations in R and SAS. It is our intention to supplement the Web pages with code in STATA as well but we have chosen not to include SPSS code because it is our impression that most users of this program use the menu interface rather than writing program code. In addition, a brief appendix (Appendix D) includes very simple and “raw” commands for fitting the basic types of regression models in R, SAS, and STATA.

It has been our ambition that the Web pages should be user-friendly with facilities for making an entry based on both book chapters and on the different examples. The pages can be found at

www.biostat.ku.dk/~linearpredictors

We would like to thank our medical colleagues who have granted us permission to use their data as illustrations. We also wish to thank colleagues at the Department of Biostatistics, University of Copenhagen, Denmark, for creating a friendly and productive working environment. In particular, we thank Ørnulf Borgan, Bendix Carstensen, Saskia le Cessie, Thomas Gerds, Niels Keiding, Kajsa Kvist, Maja Olsbjerg Larsen, Henrik Ravn and Willi Sauerbrei for comments, advice, assistance or encouragement.
Part of the manuscript was written during a week-long workshop in Oberwolfach, Germany, and we wish to thank Mathematisches Forschungsinstitut, Oberwolfach, for hospitality during that week.

Last, but definitely not least, our most sincere thanks go to Therese Graversen. Not only did she design the accompanying Web pages and produce all the graphics in the manuscript but she also gave valuable and thoughtful comments to both contents and lay-out. Without her skillful input this book would never have become what it is now.

Copenhagen, June 2010
Per Kragh Andersen, Lene Theil Skovgaard

Contents

1 Introduction
  1.1 Introductory examples and types of outcome
    1.1.1 Introductory examples
    1.1.2 Types of outcome
  1.2 Covariates
    1.2.1 Categorical covariates
    1.2.2 Quantitative covariates
  1.3 Link functions
  1.4 Building a regression model
    1.4.1 The linear predictor and the link function
    1.4.2 Regression models and their interpretation
  1.5 Further examples
  1.6 The scope of this book and how to read it
2 Statistical models
  2.1 Random variables and probability
    2.1.1 The Bernoulli distribution
    2.1.2 The Binomial distribution
    2.1.3 The Poisson distribution
    2.1.4 The Normal distribution
    2.1.5 Other common distributions
    2.1.6 Conditional probability
  2.2 Descriptive statistics
    2.2.1 Binary outcome
    2.2.2 Quantitative outcome
    2.2.3 Survival time outcome
  2.3 Statistical inference
    2.3.1 Estimation
    2.3.2 Model checking
    2.3.3 Hypothesis testing
    2.3.4 The likelihood function
  2.4 Exercises
3 One categorical covariate
  3.1 Binary covariate
    3.1.1 Quantitative outcome: t-tests
    3.1.2 Binary outcome: (2×2)-tables and the chi-square test
    3.1.3 Survival time outcome: the 2-sample logrank test
  3.2 Categorical covariate with more than two levels
    3.2.1 Quantitative outcome: One-way analysis of variance
    3.2.2 Binary outcome: The 2×(k+1)-table
    3.2.3 Survival time outcome: The (k+1)-sample logrank test
  3.3 Exercises
4 One quantitative covariate
  4.1 Linear effect
    4.1.1 Quantitative outcome: Simple linear regression
    4.1.2 Binary outcome: Simple logistic regression
    4.1.3 Survival time outcome: Simple Cox regression
  4.2 Nonlinear effect
    4.2.1 Dividing the covariate range into intervals
    4.2.2 Polynomials
    4.2.3 Other nonlinear models with a linear predictor
  4.3 Exercises
5 Multiple regression, the linear predictor
  5.1 Two covariates: Models without interaction
    5.1.1 Two categorical covariates
    5.1.2 One categorical and one quantitative covariate
    5.1.3 Two quantitative covariates
  5.2 Two covariates: Models with interaction
    5.2.1 Two categorical covariates
    5.2.2 One categorical and one quantitative covariate
    5.2.3 Two quantitative covariates
    5.2.4 Saving degrees-of-freedom
  5.3 Several covariates
    5.3.1 Models without higher-order interactions
    5.3.2 Models with higher-order interactions
  5.4 Matched studies
  5.5 Exercises
6 Model building: From purpose to conclusion
  6.1 General principles for model selection
    6.1.1 Identification of covariates
    6.1.2 Model diagrams
    6.1.3 Initial model building
    6.1.4 Strategy of analysis
    6.1.5 Model checks and diagnostics
    6.1.6 Collinearity
    6.1.7 Interactions
  6.2 Examples
    6.2.1 The vitamin D example
    6.2.2 The surgery example
    6.2.3 The PBC-3 trial
  6.3 Sample size determination
  6.4 Exercises
7 Alternative outcome types and link functions
  7.1 Multinomial outcome
    7.1.1 Ordinal outcome
    7.1.2 Nominal outcome
  7.2 Count outcome
  7.3 Quantitative outcome
  7.4 Binary outcome
    7.4.1 Alternatives to the logit link
    7.4.2 Case-control studies
  7.5 Survival time outcome
    7.5.1 Multiplicative hazard models
    7.5.2 Additive hazard models
    7.5.3 Accelerated failure time models
  7.6 Exercises
8 Further topics
  8.1 Multivariate outcome
    8.1.1 Random effects models
    8.1.2 Marginal models
    8.1.3 Longitudinal and life history data
  8.2 Errors in covariates
    8.2.1 Regression dilution
    8.2.2 Correction for measurement error in covariates
A Appendix: Notation
B Appendix: Use of logarithms
C Appendix: Some recommendations
D Programming in R, SAS and STATA
References
Index

1 Introduction

Suppose we are studying blood pressure in humans based on a random sample from a specific population, say, inhabitants of some larger city. The very first step in such a study may be to get a summary of the level and variation of blood pressure, subject to criteria such as ethnicity, gender or age.
The purpose of studying blood pressure may be to establish normal references to serve as future guidelines for when to start treatment for either too high or too low a blood pressure.

In order to illustrate the distribution of the blood pressure measurements from this sample, we usually calculate average and standard deviation and possibly produce a histogram or some other graphical illustration. These quantities are examples of descriptive statistics.

A small part of the variation in the blood pressure measurements can probably be ascribed to measurement error, however, a larger part of the variation is more likely due to individual fluctuations over time and to population variation, the true differences between subjects. Part of this population variation may be due to characteristics of the subjects that are easily recognized. For instance, men frequently have a somewhat higher blood pressure than women, and older people tend to have a higher blood pressure than younger people.

The example illustrates that, in many fields of research, information may be collected on various features in a number of experimental units. Other examples may be the serum bilirubin level in male and female patients with liver cirrhosis, whether persons in different job categories working in a given company have suffered from severe headaches in some specified period, the survival time from diagnosis of cancer patients in different stages of disease, the rise in blood glucose in experimental animals after feeding with different diets, the number of claims in a year for insurance policies of different types, the yield of some crop in differently treated field plots in an agricultural experiment, or the annual cancer rates in successive years in some country.

In the last three examples the experimental units are the insurance policy, the field plot, and the country. However, in this book we are mainly using examples from medical and public health research and we denote the experi-