
Analysis of Variance, Design, and Regression: Linear Modeling for Unbalanced Data (PDF)

606 Pages·2017·4.988 MB·english

Preview Analysis of Variance Design and Regression Linear Modeling for Unbalanced Data

Analysis of Variance, Design, and Regression
Linear Modeling for Unbalanced Data
Second Edition

Ronald Christensen
University of New Mexico
Albuquerque, USA

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20151221
International Standard Book Number-13: 978-1-4987-7405-5 (eBook - PDF)

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
Edited Preface to First Edition
Computing

1 Introduction
  1.1 Probability
  1.2 Random variables and expectations
    1.2.1 Expected values and variances
    1.2.2 Chebyshev’s inequality
    1.2.3 Covariances and correlations
    1.2.4 Rules for expected values and variances
  1.3 Continuous distributions
  1.4 The binomial distribution
    1.4.1 Poisson sampling
  1.5 The multinomial distribution
    1.5.1 Independent Poissons and multinomials
  1.6 Exercises

2 One Sample
  2.1 Example and introduction
  2.2 Parametric inference about μ
    2.2.1 Significance tests
    2.2.2 Confidence intervals
    2.2.3 P values
  2.3 Prediction intervals
  2.4 Model testing
  2.5 Checking normality
  2.6 Transformations
  2.7 Inference about σ²
    2.7.1 Theory
  2.8 Exercises

3 General Statistical Inference
  3.1 Model-based testing
    3.1.1 An alternative F test
  3.2 Inference on single parameters: assumptions
  3.3 Parametric tests
  3.4 Confidence intervals
  3.5 P values
  3.6 Validity of tests and confidence intervals
  3.7 Theory of prediction intervals
  3.8 Sample size determination and power
  3.9 The shape of things to come
  3.10 Exercises

4 Two Samples
  4.1 Two correlated samples: Paired comparisons
  4.2 Two independent samples with equal variances
    4.2.1 Model testing
  4.3 Two independent samples with unequal variances
  4.4 Testing equality of the variances
  4.5 Exercises

5 Contingency Tables
  5.1 One binomial sample
    5.1.1 The sign test
  5.2 Two independent binomial samples
  5.3 One multinomial sample
  5.4 Two independent multinomial samples
  5.5 Several independent multinomial samples
  5.6 Lancaster–Irwin partitioning
  5.7 Exercises

6 Simple Linear Regression
  6.1 An example
    6.1.1 Computer commands
  6.2 The simple linear regression model
  6.3 The analysis of variance table
  6.4 Model-based inference
  6.5 Parametric inferential procedures
  6.6 An alternative model
  6.7 Correlation
  6.8 Two-sample problems
  6.9 A multiple regression
  6.10 Estimation formulae for simple linear regression
  6.11 Exercises

7 Model Checking
  7.1 Recognizing randomness: Simulated data with zero correlation
  7.2 Checking assumptions: Residual analysis
    7.2.1 Another example
    7.2.2 Outliers
    7.2.3 Effects of high leverage
  7.3 Transformations
    7.3.1 Circle of transformations
    7.3.2 Box–Cox transformations
    7.3.3 Constructed variables
  7.4 Exercises

8 Lack of Fit and Nonparametric Regression
  8.1 Polynomial regression
    8.1.1 Picking a polynomial
    8.1.2 Exploring the chosen model
  8.2 Polynomial regression and leverages
  8.3 Other basis functions
    8.3.1 High-order models
  8.4 Partitioning methods
    8.4.1 Fitting the partitioned model
    8.4.2 Output for categorical predictors*
    8.4.3 Utts’ method
  8.5 Splines
  8.6 Fisher’s lack-of-fit test
  8.7 Exercises

9 Multiple Regression: Introduction
  9.1 Example of inferential procedures
    9.1.1 Computing commands
    9.1.2 General statement of the multiple regression model
  9.2 Regression surfaces and prediction
  9.3 Comparing regression models
    9.3.1 General discussion
  9.4 Sequential fitting
  9.5 Reduced models and prediction
  9.6 Partial correlation coefficients and added variable plots
  9.7 Collinearity
  9.8 More on model testing
  9.9 Additive effects and interaction
  9.10 Generalized additive models
  9.11 Final comment
  9.12 Exercises

10 Diagnostics and Variable Selection
  10.1 Diagnostics
  10.2 Best subset model selection
    10.2.1 R² statistic
    10.2.2 Adjusted R² statistic
    10.2.3 Mallows’s C_p statistic
    10.2.4 A combined subset selection table
  10.3 Stepwise model selection
    10.3.1 Backwards elimination
    10.3.2 Forward selection
    10.3.3 Stepwise methods
  10.4 Model selection and case deletion
  10.5 Lasso regression
  10.6 Exercises

11 Multiple Regression: Matrix Formulation
  11.1 Random vectors
  11.2 Matrix formulation of regression models
    11.2.1 Simple linear regression in matrix form
    11.2.2 The general linear model
  11.3 Least squares estimation of regression parameters
  11.4 Inferential procedures
  11.5 Residuals, standardized residuals, and leverage
  11.6 Principal components regression
  11.7 Exercises

12 One-Way ANOVA
  12.1 Example
    12.1.1 Inferences on a single group mean
    12.1.2 Inference on pairs of means
    12.1.3 Inference on linear functions of means
    12.1.4 Testing μ1 = μ2 = μ3
  12.2 Theory
    12.2.1 Analysis of variance tables
  12.3 Regression analysis of ANOVA data
    12.3.1 Testing a pair of means
    12.3.2 Model testing
    12.3.3 Another choice
  12.4 Modeling contrasts
    12.4.1 A hierarchical approach
    12.4.2 Evaluating the hierarchy
    12.4.3 Regression analysis
    12.4.4 Relation to orthogonal contrasts
    12.4.5 Theory: Difficulties in general unbalanced analyses
  12.5 Polynomial regression and one-way ANOVA
    12.5.1 Fisher’s lack-of-fit test
    12.5.2 More on R²
  12.6 Weighted least squares
    12.6.1 Theory
  12.7 Exercises

13 Multiple Comparison Methods
  13.1 “Fisher’s” least significant difference method
  13.2 Bonferroni adjustments
  13.3 Scheffé’s method
  13.4 Studentized range methods
    13.4.1 Tukey’s honest significant difference
  13.5 Summary of multiple comparison procedures
  13.6 Exercises

14 Two-Way ANOVA
  14.1 Unbalanced two-way analysis of variance
    14.1.1 Initial analysis
    14.1.2 Hierarchy of models
    14.1.3 Computing issues
    14.1.4 Discussion of model fitting
    14.1.5 Diagnostics
    14.1.6 Outlier deleted analysis
  14.2 Modeling contrasts
    14.2.1 Nonequivalence of tests
  14.3 Regression modeling
  14.4 Homologous factors
    14.4.1 Symmetric additive effects
    14.4.2 Skew symmetric additive effects
    14.4.3 Symmetry
    14.4.4 Hierarchy of models
  14.5 Exercises

15 ACOVA and Interactions
  15.1 One covariate example
    15.1.1 Additive regression effects
    15.1.2 Interaction models
    15.1.3 Multiple covariates
  15.2 Regression modeling
    15.2.1 Using overparameterized models
  15.3 ACOVA and two-way ANOVA
    15.3.1 Additive effects
  15.4 Near replicate lack-of-fit tests
  15.5 Exercises

16 Multifactor Structures
  16.1 Unbalanced three-factor analysis of variance
    16.1.1 Computing
    16.1.2 Regression fitting
  16.2 Balanced three-factors
  16.3 Higher-order structures
  16.4 Exercises

17 Basic Experimental Designs
  17.1 Experiments and causation
  17.2 Technical design considerations
  17.3 Completely randomized designs
  17.4 Randomized complete block designs
    17.4.1 Paired comparisons
  17.5 Latin square designs
    17.5.1 Latin square models
    17.5.2 Discussion of Latin squares
  17.6 Balanced incomplete block designs
    17.6.1 Special cases
  17.7 Youden squares
    17.7.1 Balanced lattice squares
  17.8 Analysis of covariance in designed experiments
  17.9 Discussion of experimental design
  17.10 Exercises

18 Factorial Treatments
  18.1 Factorial treatment structures
  18.2 Analysis
  18.3 Modeling factorials
  18.4 Interaction in a Latin square
  18.5 A balanced incomplete block design
  18.6 Extensions of Latin squares
  18.7 Exercises

19 Dependent Data
  19.1 The analysis of split-plot designs
    19.1.1 Modeling with interaction
  19.2 A four-factor example
    19.2.1 Unbalanced subplot analysis
    19.2.2 Whole-plot analysis
    19.2.3 Fixing effect levels
    19.2.4 Final models and estimates
  19.3 Multivariate analysis of variance
  19.4 Random effects models
    19.4.1 Subsampling
    19.4.2 Random effects
  19.5 Exercises

20 Logistic Regression: Predicting Counts
  20.1 Models for binomial data
  20.2 Simple linear logistic regression
    20.2.1 Goodness-of-fit tests
    20.2.2 Assessing predictive ability
    20.2.3 Case diagnostics
  20.3 Model testing
  20.4 Fitting logistic models
  20.5 Binary data
    20.5.1 Goodness-of-fit tests
    20.5.2 Case diagnostics
    20.5.3 Assessing predictive ability
  20.6 Multiple logistic regression
  20.7 ANOVA type logit models
  20.8 Ordered categories
  20.9 Exercises

21 Log-Linear Models: Describing Count Data
  21.1 Models for two-factor tables
    21.1.1 Lancaster–Irwin partitioning
  21.2 Models for three-factor tables
    21.2.1 Testing models
  21.3 Estimation and odds ratios
  21.4 Higher-dimensional tables
  21.5 Ordered categories
  21.6 Offsets
  21.7 Relation to logistic models
  21.8 Multinomial responses
  21.9 Logistic discrimination and allocation
  21.10 Exercises

22 Exponential and Gamma Regression: Time-to-Event Data
  22.1 Exponential regression
    22.1.1 Computing issues
  22.2 Gamma regression
    22.2.1 Computing issues
  22.3 Exercises

23 Nonlinear Regression
  23.1 Introduction and examples
  23.2 Estimation
    23.2.1 The Gauss–Newton algorithm
    23.2.2 Maximum likelihood estimation
  23.3 Statistical inference
  23.4 Linearizable models
  23.5 Exercises

Appendix A: Matrices and Vectors
  A.1 Matrix addition and subtraction
  A.2 Scalar multiplication
  A.3 Matrix multiplication
  A.4 Special matrices
  A.5 Linear dependence and rank
  A.6 Inverse matrices
  A.7 A list of useful properties
  A.8 Eigenvalues and eigenvectors

Appendix B: Tables
  B.1 Tables of the t distribution
  B.2 Tables of the χ² distribution
  B.3 Tables of the W′ statistic
  B.4 Tables of the Studentized range
  B.5 The Greek alphabet
  B.6 Tables of the F distribution

References
Author Index
Subject Index

Preface

Background
Big Data are the future of Statistics. The electronic revolution has increased exponentially our ability to measure things. A century ago, data were hard to come by. Statisticians put a premium on extracting every bit of information that the data contained. Now data are easy to collect; the problem is sorting through them to find meaning. To a large extent, this happens in two ways: doing a crude analysis on a massive amount of data or doing a careful analysis on the moderate amount of data that were isolated from the massive data as being meaningful. It is quite literally impossible to analyze a million data points as carefully as one can analyze a hundred data points, so “crude” is not a pejorative term but rather a fact of life.

The fundamental tools used in analyzing data have been around a long time. It is the emphases and the opportunities that have changed. With thousands of observations, we don’t need a perfect statistical analysis to detect a large effect. But with thousands of observations, we might look for subtle effects that we never bothered looking for before, and such an analysis must be done carefully, as must any analysis in which only a small part of the massive data are relevant to the problem at hand. The electronic revolution has also provided us with the opportunity to perform data analysis procedures that were not practical before, but in my experience, the new procedures (often called machine learning), are sophisticated applications of fundamental tools.

This book explains some of the fundamental tools and the ideas needed to adapt them to big data. It is not a book that analyzes big data. The book analyzes small data sets carefully but by using tools that 1) can easily be scaled to large data sets or 2) apply to the haphazard way in which small relevant data sets are now constructed. Personally, I believe that it is not safe to apply models to large data sets until you understand their implications for small data. There is also a major emphasis on tools that look for subtle effects (interactions, homologous effects) that are hard to identify.

The fundamental tools examined here are linear structures for modeling data; specifically, how to incorporate specific ideas about the structure of the data into the model for the data. Most of the book is devoted to adapting linear structures (regression, analysis of variance, analysis of covariance) to examine measurement (continuous) data. But the exact same methods apply to either-or (Yes/No, binomial) data, count (Poisson, multinomial) data, and time-to-event (survival analysis, reliability) data. The book also places strong emphasis on foundational issues, e.g., the meaning of significance tests and the interval estimates associated with them; the difference between prediction and causation; and the role of randomization.

The platform for this presentation is the revision of a book I published in 1996, Analysis of Variance, Design, and Regression: Applied Statistical Methods. Within a year, I knew that the book was not what I thought needed to be taught in the 21st century, cf., Christensen (2000). This book, Analysis of Variance, Design, and Regression: Linear Modeling of Unbalanced Data, shares with the earlier book lots of the title, much of the data, and even some of the text, but the book is radically different. The original book focused greatly on balanced analysis of variance. This book focuses on modeling unbalanced data. As such, it generalizes much of the work in the previous book. The more general methods presented here agree with the earlier methods for balanced data. Another advantage of taking a modeling approach to unbalanced data is that by making the effort to treat unbalanced analysis of variance, one can easily handle a wide range of models for nonnormal data, because the same fundamental methods apply. To that end, I have included new chapters on logistic regression,
