Table Of ContentSPRINGER BRIEFS IN STATISTICS
Thomas W. MacFarland
Introduction to
Data Analysis
and Graphical
Presentation in
Biostatistics with R
Statistics in the Large
SpringerBriefs in Statistics
Forfurthervolumes:
http://www.springer.com/series/8921
Thomas W. MacFarland
Introduction to Data Analysis
and Graphical Presentation
in Biostatistics with R
Statistics in the Large
123
ThomasW.MacFarland
OfficeforInstitutional
Effectiveness
NovaSoutheasternUniversity
FortLauderdale,FL,USA
ISSN2191-544X ISSN2191-5458(electronic)
ISBN978-3-319-02531-5 ISBN978-3-319-02532-2(eBook)
DOI10.1007/978-3-319-02532-2
SpringerChamHeidelbergNewYorkDordrechtLondon
LibraryofCongressControlNumber:2013953880
©TheAuthor(s)2014
Thisworkissubjecttocopyright.AllrightsarereservedbythePublisher,whetherthewholeorpartof
thematerialisconcerned,specificallytherightsoftranslation,reprinting,reuseofillustrations,recitation,
broadcasting,reproductiononmicrofilmsorinanyotherphysicalway,andtransmissionorinformation
storageandretrieval,electronicadaptation,computersoftware,orbysimilarordissimilarmethodology
nowknownorhereafterdeveloped.Exemptedfromthislegalreservationarebriefexcerptsinconnection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’slocation,initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer.
PermissionsforusemaybeobtainedthroughRightsLinkattheCopyrightClearanceCenter.Violations
areliabletoprosecutionundertherespectiveCopyrightLaw.
Theuseofgeneraldescriptivenames,registerednames,trademarks,servicemarks,etc.inthispublication
doesnotimply,evenintheabsenceofaspecificstatement,thatsuchnamesareexemptfromtherelevant
protectivelawsandregulationsandthereforefreeforgeneraluse.
While the advice and information in this book are believed to be true and accurate at the date of
publication,neithertheauthorsnortheeditorsnorthepublishercanacceptanylegalresponsibilityfor
anyerrorsoromissionsthatmaybemade.Thepublishermakesnowarranty,expressorimplied,with
respecttothematerialcontainedherein.
Printedonacid-freepaper
SpringerispartofSpringerScience+BusinessMedia(www.springer.com)
Contents
1 Introduction:BiostatisticsandR........................................... 1
1.1 PurposeofThisText................................................... 1
1.2 DevelopmentofBiostatistics.......................................... 2
1.3 DevelopmentofR ..................................................... 3
1.4 HowRisUsedinThisText........................................... 4
2 DataExploration,DescriptiveStatistics,andMeasures
ofCentralTendency ......................................................... 5
2.1 BackgroundonThisLesson........................................... 5
2.1.1 DescriptionoftheData........................................ 5
2.1.2 NullHypothesis(Ho) ......................................... 7
2.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 7
2.3 OrganizetheDataandDisplaytheCodeBook ...................... 9
2.4 ConductaVisualDataCheck......................................... 9
2.5 DescriptiveAnalysisoftheData...................................... 10
2.6 Summary............................................................... 13
2.7 Addendum:SpecializedExternalPackagesandFunctions.......... 13
2.8 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 15
3 Student’st-TestforIndependentSamples................................. 17
3.1 BackgroundonThisLesson........................................... 17
3.1.1 DescriptionoftheData........................................ 17
3.1.2 NullHypothesis(Ho) ......................................... 18
3.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 19
3.3 OrganizetheDataandDisplaytheCodeBook ...................... 20
3.4 ConductaVisualDataCheck......................................... 23
3.5 DescriptiveAnalysisoftheData...................................... 34
3.6 ConducttheStatisticalAnalysis ...................................... 40
3.7 Summary............................................................... 42
3.8 Addendum:t-Statisticvz-Statistic ................................... 43
3.8.1 CreatetheEnumeratedDataset............................... 44
v
vi Contents
3.8.2 Calculatethet-Statistic........................................ 44
3.8.3 Calculatethez-Statistic....................................... 45
3.9 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 45
4 Student’st-TestforMatchedPairs......................................... 47
4.1 BackgroundonThisLesson........................................... 47
4.1.1 DescriptionoftheData........................................ 47
4.1.2 NullHypothesis(Ho) ......................................... 49
4.1.3 UnstackedDataandStackedData............................ 49
4.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 51
4.3 OrganizetheDataandDisplaytheCodeBook ...................... 52
4.4 ConductaVisualDataCheck......................................... 54
4.5 DescriptiveAnalysisoftheData...................................... 60
4.6 ConducttheStatisticalAnalysis ...................................... 63
4.7 Summary............................................................... 65
4.8 Addendum1: Stacked Data and Student’s t-Test
forMatchedPairs...................................................... 66
4.9 Addendum2:TheImpactofNonStudent’st-Test.................. 70
4.10 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 72
5 OnewayAnalysisofVariance(ANOVA)................................... 73
5.1 BackgroundonThisLesson........................................... 73
5.1.1 DescriptionoftheData........................................ 73
5.1.2 NullHypothesis(Ho) ......................................... 75
5.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 75
5.3 OrganizetheDataandDisplaytheCodeBook ...................... 77
5.4 ConductaVisualDataCheck......................................... 82
5.5 DescriptiveAnalysisoftheData...................................... 87
5.6 ConducttheStatisticalAnalysis ...................................... 89
5.6.1 ExploratoryOnewayANOVA ................................ 90
5.6.2 OnewayANOVAMethod1:lm()andanova()Functions... 91
5.6.3 Oneway ANOVA Method 2: aov() and
TukeyHSD()Functions........................................ 92
5.7 Summary............................................................... 93
5.8 Addendum:OtherPackagesforDisplayofOnewayANOVA ...... 96
5.9 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 97
6 TwowayAnalysisofVariance(ANOVA)................................... 99
6.1 BackgroundonThisLesson........................................... 99
6.1.1 DescriptionoftheData........................................ 99
6.1.2 NullHypothesis(Ho) ......................................... 100
6.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 100
6.3 OrganizetheDataandDisplaytheCodeBook ...................... 101
6.4 ConductaVisualDataCheck......................................... 104
6.5 DescriptiveAnalysisoftheData...................................... 111
6.6 ConducttheStatisticalAnalysis ...................................... 117
Contents vii
6.7 Summary............................................................... 122
6.8 Addendum:OtherPackagesforDisplayofTwowayANOVA ...... 124
6.9 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 126
7 CorrelationandLinearRegression........................................ 129
7.1 BackgroundonThisLesson........................................... 129
7.1.1 DescriptionoftheData........................................ 129
7.1.2 NullHypothesis(Ho) ......................................... 130
7.2 DataImportofa.csvSpreadsheet-TypeDataFileintoR ........... 131
7.3 OrganizetheDataandDisplaytheCodeBook ...................... 132
7.4 ConductaVisualDataCheck......................................... 135
7.5 DescriptiveAnalysisoftheData...................................... 140
7.6 ConducttheStatisticalAnalysis ...................................... 142
7.6.1 CorrelationUsingPearson’sr................................. 142
7.6.2 LinearRegression ............................................. 150
7.7 Summary............................................................... 154
7.8 Addendum:MultipleRegression ..................................... 155
7.8.1 Hand-CalculateMultipleRegression......................... 156
7.8.2 MinimalAdequateModel(MAM)forRegression .......... 158
7.8.3 StepwiseRegression .......................................... 160
7.9 PreparetoExit,Save,andLaterRetrieveThisRSession ........... 163
8 FutureActionsandNextSteps ............................................. 165
8.1 UseofThisText ....................................................... 165
8.2 FutureUseofRforBiostatistics...................................... 166
8.3 ExternalResources .................................................... 167
8.4 ContacttheAuthor..................................................... 167
Chapter 1
Introduction: Biostatistics and R
Abstract The purpose of this lesson is to provide context for the science of
biostatistics and to highlight a few of the major contributors. Emphasis is given
to the role of data analysis for the various disciplines in the biological sciences
(e.g., agriculture, biology, clinical trials, ecology, environmental health, epidemi-
ology, genetics, health sciences, nutrition, public health, etc.). The practice of
biostatistics is then linked to the use of R, a free and open source software
environment. As explained, each problem in this text is associated with a .csv
(comma-separated values) ASCII file, a Code Book detailing data organization,
qualityassurancethroughgraphicalpresentationsanddescriptivestatistics,selected
statisticalanalyses,summaryofoutcomes,andanaddendumofferingideasonhow
Rcanbeusedforadditionalinsightintobiostatistics.
Keywords Agriculture • Biology • Biostatistics • Census • Clinical trials •
Code Book • Comma-separated values ASCII file • Command Line Interface
(CLI) • ComprehensiveRArchiveNetwork(CRAN) • CRAN ContributedPack-
ages • Data analysis • Descriptive statistics • Ecology • Environmental health
• Epidemiology • Genetics • GraphicalUserInterface(GUI) • Healthsciences •
Nutrition • Opensourcesoftware • Publichealth • R • S • Scheme
1.1 PurposeofThisText
Scientists use empiricism to guide and validate decisions. Precision, orderliness,
analysis,andasoundbackgroundinstatisticsaredirectlyassociatedwithinformed
judgment,decision-making,andthesubsequentallocationofhuman,physical,and
fiscalresources–alltoimprovethehumancondition.Thepurposeofthistextisto
provideanintroductiontotheuseofRsoftwareasaplatformforproblemsrelated
to biostatistics. Data identification, data organization, graphical and descriptive
portrayalof phenomena,and statistical tests through the use of R are all inherent
tothistext.
T.W.MacFarland,IntroductiontoDataAnalysisandGraphicalPresentation 1
inBiostatisticswithR,SpringerBriefsinStatistics,DOI10.1007/978-3-319-02532-2__1,
©TheAuthor(s)2014
2 1 Introduction:BiostatisticsandR
RsupportsaGraphicalUserInterface(GUI),theRCommander.Thisresourceis
availableasanexternalRpackage,Rcmdr.Rcmdrisfairlyeasytousebuteventually
therearelimitsontheuseofRCommander.
R also supports a far more robust and useful syntax-based Command Line
Interface (CLI) approach to statistics. This text is focused on the use of R-based
syntax, working at the command line, to address data organization, statistical
analyses, and graphical presentations as they relate to biostatistics. A series of
smallconfidence-buildingactivitiesarepresentedatthebeginningofthistext,with
moredetailgraduallyintroducedasthetextisfollowedfrombeginningtoend.All
examples are for biostatistics. The many examples presented in this text can be
easilyappliedtoallareasofbiostatistics,regardlessofmajorareaofstudy.
1.2 Development of Biostatistics
Thetermstatisticsisderivedfromstatus,theLatintermforstate.Thus,thescience
and practice of statistics, as we think of it today, was first associated with data
relatingtothestate,suchascensuscountsandhealthrecords.Giventheimportance
of statistics as a part of state governance, there are more than a few accounts of
census-takingandhealthrecordsfromtheearliestdaysofrecordedhistory.
Going beyond mere record-keeping, an interest in the mathematics of chance
(e.g., probability) began to develop in the 1500s and 1600s, especially among
those who engaged in European court life. The early interest in probability may
nothavebeenaltruisticbutwasinsteadfocusedongainingadvantageincardgames
andotherformsofgambling.Theuseofprobabilitytosolveproblemsforsocietal
gain may not have been the first interest but instead attention was focused on the
question, Given that there a X cards in the deck, if I discard the Y card from my
hand,whatisthechancethatIwilldrawtheZcardfromthedeckandimprovemy
chanceofwinningthisgameofcards?
Thisearlyinterestinprobabilityandeventuallytheevolvingscienceofstatistics
as a vehicle for social improvement eventually grew into what we think of as
biostatistics. It is far beyond the purpose of this introductory text on the use of
Rinbiostatisticstogointotoomuchdetail,butataminimumitwouldbehelpful
tolookintothebiographyandcontributionsofthefollowingfoundersofwhatwe
nowconsiderbiostatistics:
• BlaisePascal(1623–1662),preparedearlywritingsonprobabilityanddeveloped
thePascaline(e.g.,mechanicalcalculator).
• JohnGraunt(1620–1674),publishedNaturalandPoliticalObservationsMade
UpontheBillsofMortality,perhapsthefirstwidely-readtextondemographics,
publichealth,andepidemiology.
• JohnSnow(1813–1858),advocatedforepidemiologyandthe1854BroadStreet
(London)CholeraOutbreak.
Description:Through real-world datasets, this book shows the reader how to work with material in biostatistics using the open source software R. These include tools that are critical to dealing with missing data, which is a pressing scientific issue for those engaged in biostatistics. Readers will be equipped t