Practical Regression and Anova using R

Julian J. Faraway

July 2002

Copyright © 1999, 2000, 2002 Julian J. Faraway

Permission to reproduce individual copies of this book for personal use is granted. Multiple copies may be created for nonprofit academic purposes; a nominal charge to cover the expense of reproduction may be made. Reproduction for profit is prohibited without permission.

Preface

There are many books on regression and analysis of variance. These books expect different levels of preparedness and place different emphases on the material. This book is not introductory. It presumes some knowledge of basic statistical theory and practice. Students are expected to know the essentials of statistical inference like estimation, hypothesis testing and confidence intervals. A basic knowledge of data analysis is presumed. Some linear algebra and calculus is also required.

The emphasis of this text is on the practice of regression and analysis of variance. The objective is to learn what methods are available and, more importantly, when they should be applied. Many examples are presented to clarify the use of the techniques and to demonstrate what conclusions can be made. There is relatively less emphasis on mathematical theory, partly because some prior knowledge is assumed and partly because the issues are better tackled elsewhere. Theory is important because it guides the approach we take. I take a wider view of statistical theory. It is not just the formal theorems. Qualitative statistical concepts are just as important in Statistics because these enable us to actually do it rather than just talk about it. These qualitative principles are harder to learn because they are difficult to state precisely but they guide the successful experienced Statistician.

Data analysis cannot be learnt without actually doing it. This means using a statistical computing package. There is a wide choice of such packages. They are designed for different audiences and have different strengths and weaknesses. I have chosen to use R (ref. Ihaka and Gentleman (1996)). Why do I use R? There are several reasons.

1.
Versatility. R is also a programming language, so I am not limited by the procedures that are preprogrammed by a package. It is relatively easy to program new methods in R.

2. Interactivity. Data analysis is inherently interactive. Some older statistical packages were designed when computing was more expensive and batch processing of computations was the norm. Despite improvements in hardware, the old batch processing paradigm lives on in their use. R does one thing at a time, allowing us to make changes on the basis of what we see during the analysis.

3. R is based on S, from which the commercial package S-plus is derived. R itself is open-source software and may be freely redistributed. Linux, Macintosh, Windows and other UNIX versions are maintained and can be obtained from the R-project at www.r-project.org. R is mostly compatible with S-plus, meaning that S-plus could easily be used for the examples given in this book.

4. Popularity. SAS is the most common statistics package in general but R or S is most popular with researchers in Statistics. A look at common Statistical journals confirms this popularity. R is also popular for quantitative applications in Finance.

The greatest disadvantage of R is that it is not so easy to learn. Some investment of effort is required before productivity gains will be realized. This book is not an introduction to R. There is a short introduction in the Appendix but readers are referred to the R-project web site at www.r-project.org where you can find introductory documentation and information about books on R. I have intentionally included in the text all the commands used to produce the output seen in this book. This means that you can reproduce these analyses and experiment with changes and variations before fully understanding R. The reader may choose to start working through this text before learning R and pick it up as you go.

The web site for this book is at www.stat.lsa.umich.edu/˜faraway/book where data described in this book appears. Updates will appear there also.
Thanks to the builders of R without whom this book would not have been possible.

Contents

1 Introduction
  1.1 Before you start
    1.1.1 Formulation
    1.1.2 Data Collection
    1.1.3 Initial Data Analysis
  1.2 When to use Regression Analysis
  1.3 History
2 Estimation
  2.1 Example
  2.2 Linear Model
  2.3 Matrix Representation
  2.4 Estimating β
  2.5 Least squares estimation
  2.6 Examples of calculating β̂
  2.7 Why is β̂ a good estimate?
  2.8 Gauss-Markov Theorem
  2.9 Mean and Variance of β̂
  2.10 Estimating σ²
  2.11 Goodness of Fit
  2.12 Example
3 Inference
  3.1 Hypothesis tests to compare models
  3.2 Some Examples
    3.2.1 Test of all predictors
    3.2.2 Testing just one predictor
    3.2.3 Testing a pair of predictors
    3.2.4 Testing a subspace
  3.3 Concerns about Hypothesis Testing
  3.4 Confidence Intervals for β
  3.5 Confidence intervals for predictions
  3.6 Orthogonality
  3.7 Identifiability
  3.8 Summary
  3.9 What can go wrong?
    3.9.1 Source and quality of the data
    3.9.2 Error component
    3.9.3 Structural Component
  3.10 Interpreting Parameter Estimates
4 Errors in Predictors
5 Generalized Least Squares
  5.1 The general case
  5.2 Weighted Least Squares
  5.3 Iteratively Reweighted Least Squares
6 Testing for Lack of Fit
  6.1 σ² known
  6.2 σ² unknown
7 Diagnostics
  7.1 Residuals and Leverage
  7.2 Studentized Residuals
  7.3 An outlier test
  7.4 Influential Observations
  7.5 Residual Plots
  7.6 Non-Constant Variance
  7.7 Non-Linearity
  7.8 Assessing Normality
  7.9 Half-normal plots
  7.10 Correlated Errors
8 Transformation
  8.1 Transforming the response
  8.2 Transforming the predictors
    8.2.1 Broken Stick Regression
    8.2.2 Polynomials
  8.3 Regression Splines
  8.4 Modern Methods
9 Scale Changes, Principal Components and Collinearity
  9.1 Changes of Scale
  9.2 Principal Components
  9.3 Partial Least Squares
  9.4 Collinearity
  9.5 Ridge Regression
10 Variable Selection
  10.1 Hierarchical Models
  10.2 Stepwise Procedures
    10.2.1 Forward Selection
    10.2.2 Stepwise Regression
  10.3 Criterion-based procedures
  10.4 Summary
11 Statistical Strategy and Model Uncertainty
  11.1 Strategy
  11.2 Experiment
  11.3 Discussion
12 Chicago Insurance Redlining - a complete example
13 Robust and Resistant Regression
14 Missing Data
15 Analysis of Covariance
  15.1 A two-level example
  15.2 Coding qualitative predictors
  15.3 A three-level example
16 ANOVA
  16.1 One-Way Anova
    16.1.1 The model
    16.1.2 Estimation and testing
    16.1.3 An example
    16.1.4 Diagnostics
    16.1.5 Multiple Comparisons
    16.1.6 Contrasts
    16.1.7 Scheffé's theorem for multiple comparisons
    16.1.8 Testing for homogeneity of variance
  16.2 Two-Way Anova
    16.2.1 One observation per cell
    16.2.2 More than one observation per cell
    16.2.3 Interpreting the interaction effect
    16.2.4 Replication
  16.3 Blocking designs
    16.3.1 Randomized Block design
    16.3.2 Relative advantage of RCBD over CRD
  16.4 Latin Squares
  16.5 Balanced Incomplete Block design
  16.6 Factorial experiments
A Recommended Books
  A.1 Books on R
  A.2 Books on Regression and Anova
B R functions and data
C Quick introduction to R
  C.1 Reading the data in
  C.2 Numerical Summaries
  C.3 Graphical Summaries
  C.4 Selecting subsets of the data
  C.5 Learning more about R

Chapter 1

Introduction

1.1 Before you start

Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake of inexperienced Statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis. Look before you leap!

1.1.1 Formulation

The formulation of a problem is often more essential than its solution which may be merely a matter of mathematical or experimental skill.

Albert Einstein

To formulate the problem correctly, you must

1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.

2. Understand the objective. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of "fishing expeditions" - if you look hard enough, you'll almost always find something but that something may just be a coincidence.

3. Make sure you know what the client wants.
Sometimes Statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.

4. Put the problem into statistical terms. This is a challenging step and where irreparable errors are sometimes made. Once the problem is translated into the language of Statistics, the solution is often routine. Difficulties with this step explain why Artificial Intelligence techniques have yet to make much impact in application to Statistics. Defining the problem is hard to program.

That a statistical method can read in and process the data is not enough. The results may be totally meaningless.

1.1.2 Data Collection

It's important to understand how the data was collected.

- Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.
- Is there non-response? The data you don't see may be just as important as the data you do see.
- Are there missing values? This is a common problem that is troublesome and time consuming to deal with.
- How are the data coded? In particular, how are the qualitative variables represented?
- What are the units of measurement? Sometimes data is collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs.
- Beware of data entry errors. This problem is all too common - almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.

1.1.3 Initial Data Analysis

This is a critical step that should always be performed. It looks simple but it is vital.

- Numerical summaries - means, sds, five-number summaries, correlations.
- Graphical summaries
  - One variable - boxplots, histograms etc.
  - Two variables - scatterplots.
  - Many variables - interactive graphics.

Look for outliers, data-entry errors and skewed or unusual distributions. Are the data distributed as you expect?
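As a minimal sketch of what these initial summaries and sanity checks might look like in R - the data frame `mydata` and its variables are purely illustrative here, standing in for whatever data you have read in:

```r
# Hypothetical data frame standing in for a real dataset
mydata <- data.frame(age = c(23, 35, 48, 199, 31),
                     weight = c(61, 72, 0, 80, 68))

summary(mydata)              # numerical summaries: min, quartiles, mean, max
sapply(mydata, sd)           # standard deviations of each variable
boxplot(mydata$age)          # one variable: boxplot (note the outlier, 199)
hist(mydata$weight)          # one variable: histogram
plot(weight ~ age, mydata)   # two variables: scatterplot

# Sanity check: a weight of 0 is physically impossible and probably
# codes a missing value, so recode it as NA
mydata$weight[mydata$weight == 0] <- NA
```

Even this much is often enough to expose data-entry errors (an age of 199) and disguised missing values (a weight of 0) before any modelling begins.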
Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. In this course, all the data will be ready to analyze but you should realize that in practice this is rarely the case.

Let's look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg/(height in m²)), diabetes pedigree function, age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). The data may be obtained from the UCI Repository of machine learning databases at http://www.ics.uci.edu/˜mlearn/MLRepository.html.

Of course, before doing anything else, one should find out what the purpose of the study was and more about how the data was collected. But let's skip ahead to a look at the data:
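A sketch of how one might load and take a first look at these data in R. The file name `pima.data`, the `header = TRUE` setting and the column names are assumptions - adjust them to match the copy you download from the UCI repository, which may be a headerless comma-separated file:

```r
# Assumed local file name and layout; the UCI download may differ
pima <- read.table("pima.data", header = TRUE)

pima[1:5, ]     # print the first few rows to check the read
dim(pima)       # should be 768 rows by 9 columns
summary(pima)   # quartiles and means for each variable
```

A `summary()` at this point would show, for instance, minimum values of 0 for blood pressure and triceps skin fold - impossible values that, as in the sanity checks above, almost certainly code missing data.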
