Statistics and Computing
Series Editors: J. Chambers, D. Hand, W. Härdle

Brusco/Stahl: Branch and Bound Applications in Combinatorial Data Analysis
Chambers: Software for Data Analysis: Programming with R
Dalgaard: Introductory Statistics with R, 2nd ed.
Gentle: Elements of Computational Statistics
Gentle: Numerical Linear Algebra for Applications in Statistics
Gentle: Random Number Generation and Monte Carlo Methods, 2nd ed.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical Computing Environment
Hörmann/Leydold/Derflinger: Automatic Nonuniform Random Variate Generation
Krause/Olson: The Basics of S-PLUS, 4th ed.
Lange: Numerical Analysis for Statisticians
Lemmon/Schafer: Developing Statistical Software in Fortran 95
Loader: Local Regression and Likelihood
Marasinghe/Kennedy: SAS for Data Analysis: Intermediate Statistical Methods
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to Signal Processing
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS
Unwin/Theus/Hofmann: Graphics of Large Datasets: Visualizing a Million
Venables/Ripley: Modern Applied Statistics with S, 4th ed.
Venables/Ripley: S Programming
Wilkinson: The Grammar of Graphics, 2nd ed.

Peter Dalgaard

Introductory Statistics with R
Second Edition

Peter Dalgaard
Department of Biostatistics
University of Copenhagen
Denmark
[email protected]

ISSN: 1431-8784
ISBN: 978-0-387-79053-4
e-ISBN: 978-0-387-79054-1
DOI: 10.1007/978-0-387-79054-1
Library of Congress Control Number: 2008932040

© 2008 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

springer.com

To Grete, for putting up with me for so long

Preface

R is a statistical computer program made available through the Internet under the General Public License (GPL). That is, it is supplied with a license that allows you to use it freely, distribute it, or even sell it, as long as the receiver has the same rights and the source code is freely available. It exists for Microsoft Windows XP or later, for a variety of Unix and Linux platforms, and for Apple Macintosh OS X.

R provides an environment in which you can perform statistical analysis and produce graphics. It is actually a complete programming language, although that is only marginally described in this book. Here we content ourselves with learning the elementary concepts and seeing a number of cookbook examples.

R is designed in such a way that it is always possible to do further computations on the results of a statistical procedure. Furthermore, the design for graphical presentation of data allows both no-nonsense methods, for example plot(x,y), and the possibility of fine-grained control of the output’s appearance. The fact that R is based on a formal computer language gives it tremendous flexibility. Other systems present simpler interfaces in terms of menus and forms, but often the apparent user-friendliness turns into a hindrance in the longer run. Although elementary statistics is often presented as a collection of fixed procedures, analysis of moderately complex data requires ad hoc statistical model building, which makes the added flexibility of R highly desirable.

R owes its name to typical Internet humour. You may be familiar with the programming language C (whose name is a story in itself).
Inspired by this, Becker and Chambers chose in the early 1980s to call their newly developed statistical programming language S. This language was further developed into the commercial product S-PLUS, which by the end of the decade was in widespread use among statisticians of all kinds. Ross Ihaka and Robert Gentleman from the University of Auckland, New Zealand, chose to write a reduced version of S for teaching purposes, and what was more natural than choosing the immediately preceding letter? Ross’ and Robert’s initials may also have played a role.

In 1995, Martin Maechler persuaded Ross and Robert to release the source code for R under the GPL. This coincided with the upsurge in Open Source software spurred by the Linux system. R soon turned out to fill a gap for people like me who intended to use Linux for statistical computing but had no statistical package available at the time. A mailing list was set up for the communication of bug reports and discussions of the development of R.

In August 1997, I was invited to join an extended international core team whose members collaborate via the Internet and that has controlled the development of R since then. The core team was subsequently expanded several times and currently includes 19 members. On February 29, 2000, version 1.0.0 was released. As of this writing, the current version is 2.6.2.

This book was originally based upon a set of notes developed for the course in Basic Statistics for Health Researchers at the Faculty of Health Sciences of the University of Copenhagen. The course had a primary target of students for the Ph.D. degree in medicine. However, the material has been substantially revised, and I hope that it will be useful for a larger audience, although some biostatistical bias remains, particularly in the choice of examples.

In later years, the course in Statistical Practice in Epidemiology, which has been held yearly in Tartu, Estonia, has been a major source of inspiration and experience in introducing young statisticians and epidemiologists to R.

This book is not a manual for R. The idea is to introduce a number of basic concepts and techniques that should allow the reader to get started with practical statistics.
In terms of the practical methods, the book covers a reasonable curriculum for first-year students of theoretical statistics as well as for engineering students. These groups will eventually need to go further and study more complex models as well as general techniques involving actual programming in the R language.

For fields where elementary statistics is taught mainly as a tool, the book goes somewhat further than what is commonly taught at the undergraduate level. Multiple regression methods or analysis of multifactorial experiments are rarely taught at that level but may quickly become essential for practical research. I have collected the simpler methods near the beginning to make the book readable also at the elementary level. However, in order to keep technical material together, Chapters 1 and 2 do include material that some readers will want to skip.

The book is thus intended to be useful for several groups, but I will not pretend that it can stand alone for any of them. I have included brief theoretical sections in connection with the various methods, but more than as teaching material, these should serve as reminders or perhaps as appetizers for readers who are new to the world of statistics.

Notes on the 2nd edition

The original first chapter was expanded and broken into two chapters, and a chapter on more advanced data handling tasks was inserted after the coverage of simpler statistical methods. There are also two new chapters on statistical methodology, covering Poisson regression and nonlinear curve fitting, and a few items have been added to the section on descriptive statistics. The original methodological chapters have been quite minimally revised, mainly to ensure that the text matches the actual output of the current version of R. The exercises have been revised, and solution sketches now appear in Appendix D.

Acknowledgements

Obviously, this book would not have been possible without the efforts of my friends and colleagues on the R Core Team, the authors of contributed packages, and many of the correspondents of the e-mail discussion lists.
I am deeply grateful for the support of my colleagues and co-teachers Lene Theil Skovgaard, Bendix Carstensen, Birthe Lykke Thomsen, Helle Rootzen, Claus Ekstrøm, Thomas Scheike, and from the Tartu course Krista Fischer, Esa Läära, Martyn Plummer, Mark Myatt, and Michael Hills, as well as the feedback from several students. In addition, several people, including Bill Venables, Brian Ripley, and David James, gave valuable advice on early drafts of the book.

Finally, profound thanks are due to the free software community at large. The R project would not have been possible without their effort. For the typesetting of this book, TeX, LaTeX, and the consolidating efforts of the LaTeX2e project have been indispensable.

Peter Dalgaard
Copenhagen
April 2008