A Handbook of Statistical Analyses Using R
SECOND EDITION

Brian S. Everitt and Torsten Hothorn

CRC Press
Taylor & Francis Group
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number: 978-1-4200-7933-3 (Paperback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Everitt, Brian.
A handbook of statistical analyses using R / Brian S. Everitt and Torsten Hothorn. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-7933-3 (pbk. : alk. paper)
1. Mathematical statistics--Data processing--Handbooks, manuals, etc. 2. R (Computer program language)--Handbooks, manuals, etc. I. Hothorn, Torsten. II. Title.
QA276.45.R3E94 2010
519.50285’5133--dc22
2009018062

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Dedication

To our wives, Mary-Elizabeth and Carolin, for their constant support and encouragement

Preface to Second Edition

Like the first edition this book is intended as a guide to data analysis with the R system for statistical computing. New chapters on graphical displays, generalised additive models and simultaneous inference have been added to this second edition and a section on generalised linear mixed models completes the chapter that discusses the analysis of longitudinal data where the response variable does not have a normal distribution. In addition, new examples and additional exercises have been added to several chapters.
We have also taken the opportunity to correct a number of errors that were present in the first edition. Most of these errors were kindly pointed out to us by a variety of people to whom we are very grateful, especially Guido Schwarzer, Mike Cheung, Tobias Verbeke, Yihui Xie, Lothar Häberle, and Radoslav Harman.

We learnt that many instructors use our book successfully for introductory courses in applied statistics. We have had the pleasure to give some courses based on the first edition of the book ourselves and we are happy to share slides covering many sections of particular chapters with our readers. LaTeX sources and PDF versions of slides covering several chapters are available from the second author upon request.

A new version of the HSAUR package, now called HSAUR2 for obvious reasons, is available from CRAN. Basically the package vignettes have been updated to cover the new and modified material as well. Otherwise, the technical infrastructure remains as described in the preface to the first edition, with two small exceptions: names of R add-on packages are now printed in bold font and we refrain from showing significance stars in model summaries.

Lastly we would like to thank Thomas Kneib and Achim Zeileis for commenting on the newly added material and again the CRC Press staff, in particular Rob Calver, for their support during the preparation of this second edition.

Brian S. Everitt and Torsten Hothorn
London and München, April 2009

Preface to First Edition

This book is intended as a guide to data analysis with the R system for statistical computing. R is an environment incorporating an implementation of the S programming language, which is powerful and flexible and has excellent graphical facilities (R Development Core Team, 2009b). In the Handbook we aim to give relatively brief and straightforward descriptions of how to conduct a range of statistical analyses using R.
Each chapter deals with the analysis appropriate for one or several data sets. A brief account of the relevant statistical background is included in each chapter along with appropriate references, but our prime focus is on how to use R and how to interpret results. We hope the book will provide students and researchers in many disciplines with a self-contained means of using R to analyse their data.

R is an open-source project developed by dozens of volunteers for more than ten years now and is available from the Internet under the General Public Licence. R has become the lingua franca of statistical computing. Increasingly, implementations of new statistical methodology first appear as R add-on packages. In some communities, such as in bioinformatics, R already is the primary workhorse for statistical analyses. Because the sources of the R system are open and available to everyone without restrictions and because of its powerful language and graphical capabilities, R has started to become the main computing engine for reproducible statistical research (Leisch, 2002a,b, 2003, Leisch and Rossini, 2003, Gentleman, 2005). For a reproducible piece of research, the original observations, all data preprocessing steps, the statistical analysis as well as the scientific report form a unity and all need to be available for inspection, reproduction and modification by the readers.

Reproducibility is a natural requirement for textbooks such as the Handbook of Statistical Analyses Using R and therefore this book is fully reproducible using an R version greater or equal to 2.2.1. All analyses and results, including figures and tables, can be reproduced by the reader without having to retype a single line of R code. The data sets presented in this book are collected in a dedicated add-on package called HSAUR accompanying this book.
The package can be installed from the Comprehensive R Archive Network (CRAN) via

R> install.packages("HSAUR")

and its functionality is attached by

R> library("HSAUR")

The relevant parts of each chapter are available as a vignette, basically a document including both the R sources and the rendered output of every analysis contained in the book. For example, the first chapter can be inspected by

R> vignette("Ch_introduction_to_R", package = "HSAUR")

and the R sources are available for reproducing our analyses by

R> edit(vignette("Ch_introduction_to_R", package = "HSAUR"))

An overview on all chapter vignettes included in the package can be obtained from

R> vignette(package = "HSAUR")

We welcome comments on the R package HSAUR, and where we think these add to or improve our analysis of a data set we will incorporate them into the package and, hopefully at a later stage, into a revised or second edition of the book.

Plots and tables of results obtained from R are all labelled as ‘Figures’ in the text. For the graphical material, the corresponding figure also contains the ‘essence’ of the R code used to produce the figure, although this code may differ a little from that given in the HSAUR package, since the latter may include some features, for example thicker line widths, designed to make a basic plot more suitable for publication.

We would like to thank the R Development Core Team for the R system, and authors of contributed add-on packages, particularly Uwe Ligges and Vince Carey for helpful advice on scatterplot3d and gee. Kurt Hornik, Ludwig A. Hothorn, Fritz Leisch and Rafael Weißbach provided good advice with some statistical and technical problems. We are also very grateful to Achim Zeileis for reading the entire manuscript, pointing out inconsistencies or even bugs and for making many suggestions which have led to improvements. Lastly we would like to thank the CRC Press staff, in particular Rob Calver, for their support during the preparation of the book.
Any errors in the book are, of course, the joint responsibility of the two authors.

Brian S. Everitt and Torsten Hothorn
London and Erlangen, December 2005

List of Figures

1.1 Histograms of the market value and the logarithm of the market value for the companies contained in the Forbes 2000 list.
1.2 Raw scatterplot of the logarithms of market value and sales.
1.3 Scatterplot with transparent shading of points of the logarithms of market value and sales.
1.4 Boxplots of the logarithms of the market value for four selected countries, the width of the boxes is proportional to the square roots of the number of companies.
2.1 Histogram (top) and boxplot (bottom) of malignant melanoma mortality rates.
2.2 Parallel boxplots of malignant melanoma mortality rates by contiguity to an ocean.
2.3 Estimated densities of malignant melanoma mortality rates by contiguity to an ocean.
2.4 Scatterplot of malignant melanoma mortality rates by geographical location.
2.5 Scatterplot of malignant melanoma mortality rates against latitude.
2.6 Bar chart of happiness.
2.7 Spineplot of health status and happiness.
2.8 Spinogram (left) and conditional density plot (right) of happiness depending on log-income.
3.1 Boxplots of estimates of room width in feet and metres (after conversion to feet) and normal probability plots of estimates of room width made in feet and in metres.
3.2 R output of the independent samples t-test for the roomwidth data.
3.3 R output of the independent samples Welch test for the roomwidth data.
3.4 R output of the Wilcoxon rank sum test for the roomwidth data.
3.5 Boxplot and normal probability plot for differences between the two mooring methods.
3.6 R output of the paired t-test for the waves data.
3.7 R output of the Wilcoxon signed rank test for the waves data.
3.8 Enhanced scatterplot of water hardness and mortality, showing both the joint and the marginal distributions and, in addition, the location of the city by different plotting symbols.
3.9 R output of Pearson’s correlation coefficient for the water data.
3.10 R output of the chi-squared test for the pistonrings data.
3.11 Association plot of the residuals for the pistonrings data.
3.12 R output of McNemar’s test for the rearrests data.
3.13 R output of an exact version of McNemar’s test for the rearrests data computed via a binomial test.
4.1 An approximation for the conditional distribution of the difference of mean roomwidth estimates in the feet and metres group under the null hypothesis. The vertical lines show the negative and positive absolute value of the test statistic T obtained from the original data.
4.2 R output of the exact permutation test applied to the roomwidth data.
4.3 R output of the exact conditional Wilcoxon rank sum test applied to the roomwidth data.
4.4 R output of Fisher’s exact test for the suicides data.
5.1 Plot of mean weight gain for each level of the two factors.
5.2 R output of the ANOVA fit for the weightgain data.
5.3 Interaction plot of type and source.
5.4 Plot of mean litter weight for each level of the two factors for the foster data.
5.5 Graphical presentation of multiple comparison results for the foster feeding data.
5.6 Scatterplot matrix of epoch means for Egyptian skulls data.
6.1 Scatterplot of velocity and distance.
6.2 Scatterplot of velocity and distance with estimated regression line (left) and plot of residuals against fitted values (right).
6.3 Boxplots of rainfall.
6.4 Scatterplots of rainfall against the continuous covariates.
6.5 R output of the linear model fit for the clouds data.
6.6 Regression relationship between S-Ne criterion and rainfall with and without seeding.
6.7 Plot of residuals against fitted values for clouds seeding data.