Statistical Data Analysis Explained: Applied Environmental Statistics with R

Clemens Reimann, Geological Survey of Norway
Peter Filzmoser, Vienna University of Technology
Robert G. Garrett, Geological Survey of Canada
Rudolf Dutter, Vienna University of Technology

ISBN: 978-0-470-98581-6

Copyright © 2008 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, [email protected], or faxed to (+44) 1243 770620.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, L5R 4J3

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-470-98581-6

Typeset in 10/12 pt Times by Thomson Digital, Noida, India
Printed and bound in Great Britain by Antony Rowe Ltd., Chippenham, Wilts
This book is printed on acid-free paper

Contents

Preface
Acknowledgements
About the authors

1 Introduction
1.1 The Kola Ecogeochemistry Project
1.1.1 Short description of the Kola Project survey area
1.1.2 Sampling and characteristics of the different sample materials
1.1.3 Sample preparation and chemical analysis

2 Preparing the Data for Use in R and DAS+R
2.1 Required data format for import into R and DAS+R
2.2 The detection limit problem
2.3 Missing values
2.4 Some "typical" problems encountered when editing a laboratory data report file to a DAS+R file
2.4.1 Sample identification
2.4.2 Reporting units
2.4.3 Variable names
2.4.4 Results below the detection limit
2.4.5 Handling of missing values
2.4.6 File structure
2.4.7 Quality control samples
2.4.8 Geographical coordinates, further editing and some unpleasant limitations of spreadsheet programs
2.5 Appending and linking data files
2.6 Requirements for a geochemical database
2.7 Summary

3 Graphics to Display the Data Distribution
3.1 The one-dimensional scatterplot
3.2 The histogram
3.3 The density trace
3.4 Plots of the distribution function
3.4.1 Plot of the cumulative distribution function (CDF-plot)
3.4.2 Plot of the empirical cumulative distribution function (ECDF-plot)
3.4.3 The quantile-quantile plot (QQ-plot)
3.4.4 The cumulative probability plot (CP-plot)
3.4.5 The probability-probability plot (PP-plot)
3.4.6 Discussion of the distribution function plots
3.5 Boxplots
3.5.1 The Tukey boxplot
3.5.2 The log-boxplot
3.5.3 The percentile-based boxplot and the box-and-whisker plot
3.5.4 The notched boxplot
3.6 Combination of histogram, density trace, one-dimensional scatterplot, boxplot, and ECDF-plot
3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot, and CP-plot
3.8 Summary

4 Statistical Distribution Measures
4.1 Central value
4.1.1 The arithmetic mean
4.1.2 The geometric mean
4.1.3 The mode
4.1.4 The median
4.1.5 Trimmed mean and other robust measures of the central value
4.1.6 Influence of the shape of the data distribution
4.2 Measures of spread
4.2.1 The range
4.2.2 The interquartile range (IQR)
4.2.3 The standard deviation
4.2.4 The median absolute deviation (MAD)
4.2.5 Variance
4.2.6 The coefficient of variation (CV)
4.2.7 The robust coefficient of variation (CVR)
4.3 Quartiles, quantiles and percentiles
4.4 Skewness
4.5 Kurtosis
4.6 Summary table of statistical distribution measures
4.7 Summary

5 Mapping Spatial Data
5.1 Map coordinate systems (map projection)
5.2 Map scale
5.3 Choice of the base map for geochemical mapping
5.4 Mapping geochemical data with proportional dots
5.5 Mapping geochemical data using classes
5.5.1 Choice of symbols for geochemical mapping
5.5.2 Percentile classes
5.5.3 Boxplot classes
5.5.4 Use of ECDF- and CP-plot to select classes for mapping
5.6 Surface maps constructed with smoothing techniques
5.7 Surface maps constructed with kriging
5.7.1 Construction of the (semi)variogram
5.7.2 Quality criteria for semivariograms
5.7.3 Mapping based on the semivariogram (kriging)
5.7.4 Possible problems with semivariogram estimation and kriging
5.8 Colour maps
5.9 Some common mistakes in geochemical mapping
5.9.1 Map scale
5.9.2 Base map
5.9.3 Symbol set
5.9.4 Scaling of symbol size
5.9.5 Class selection
5.10 Summary

6 Further Graphics for Exploratory Data Analysis
6.1 Scatterplots (xy-plots)
6.1.1 Scatterplots with user-defined lines or fields
6.2 Linear regression lines
6.3 Time trends
6.4 Spatial trends
6.5 Spatial distance plot
6.6 Spiderplots (normalised multi-element diagrams)
6.7 Scatterplot matrix
6.8 Ternary plots
6.9 Summary

7 Defining Background and Threshold, Identification of Data Outliers and Element Sources
7.1 Statistical methods to identify extreme values and data outliers
7.1.1 Classical statistics
7.1.2 The boxplot
7.1.3 Robust statistics
7.1.4 Percentiles
7.1.5 Can the range of background be calculated?
7.2 Detecting outliers and extreme values in the ECDF- or CP-plot
7.3 Including the spatial distribution in the definition of background
7.3.1 Using geochemical maps to identify a reasonable threshold
7.3.2 The concentration-area plot
7.3.3 Spatial trend analysis
7.3.4 Multiple background populations in one data set
7.4 Methods to distinguish geogenic from anthropogenic element sources
7.4.1 The TOP/BOT-ratio
7.4.2 Enrichment factors (EFs)
7.4.3 Mineralogical versus chemical methods
7.5 Summary

8 Comparing Data in Tables and Graphics
8.1 Comparing data in tables
8.2 Graphical comparison of the data distributions of several data sets
8.3 Comparing the spatial data structure
8.4 Subset creation – a mighty tool in graphical data analysis
8.5 Data subsets in scatterplots
8.6 Data subsets in time and spatial trend diagrams
8.7 Data subsets in ternary plots
8.8 Data subsets in the scatterplot matrix
8.9 Data subsets in maps
8.10 Summary

9 Comparing Data Using Statistical Tests
9.1 Tests for distribution (Kolmogorov–Smirnov and Shapiro–Wilk tests)
9.1.1 The Kola data set and the normal or lognormal distribution
9.2 The one-sample t-test (test for the central value)
9.3 Wilcoxon signed-rank test
9.4 Comparing two central values of the distributions of independent data groups
9.4.1 The two-sample t-test
9.4.2 The Wilcoxon rank sum test
9.5 Comparing two central values of matched pairs of data
9.5.1 The paired t-test
9.5.2 The Wilcoxon test
9.6 Comparing the variance of two data sets
9.6.1 The F-test
9.6.2 The Ansari–Bradley test
9.7 Comparing several central values
9.7.1 One-way analysis of variance (ANOVA)
9.7.2 Kruskal-Wallis test
9.8 Comparing the variance of several data groups
9.8.1 Bartlett test
9.8.2 Levene test
9.8.3 Fligner test
9.9 Comparing several central values of dependent groups
9.9.1 ANOVA with blocking (two-way)
9.9.2 Friedman test
9.10 Summary

10 Improving Data Behaviour for Statistical Analysis: Ranking and Transformations
10.1 Ranking/sorting
10.2 Non-linear transformations
10.2.1 Square root transformation
10.2.2 Power transformation
10.2.3 Log(arithmic)-transformation
10.2.4 Box–Cox transformation
10.2.5 Logit transformation
10.3 Linear transformations
10.3.1 Addition/subtraction
10.3.2 Multiplication/division
10.3.3 Range transformation
10.4 Preparing a data set for multivariate data analysis
10.4.1 Centring
10.4.2 Scaling
10.5 Transformations for closed number systems
10.5.1 Additive logratio transformation
10.5.2 Centred logratio transformation
10.5.3 Isometric logratio transformation
10.6 Summary

11 Correlation
11.1 Pearson correlation
11.2 Spearman rank correlation
11.3 Kendall-tau correlation
11.4 Robust correlation coefficients
11.5 When is a correlation coefficient significant?
11.6 Working with many variables
11.7 Correlation analysis and inhomogeneous data
11.8 Correlation results following additive logratio or centred logratio transformations
11.9 Summary

12 Multivariate Graphics
12.1 Profiles
12.2 Stars
12.3 Segments
12.4 Boxes
12.5 Castles and trees
12.6 Parallel coordinates plot
12.7 Summary

13 Multivariate Outlier Detection
13.1 Univariate versus multivariate outlier detection
13.2 Robust versus non-robust outlier detection
13.3 The chi-square plot
13.4 Automated multivariate outlier detection and visualisation
13.5 Other graphical approaches for identifying outliers and groups
13.6 Summary

14 Principal Component Analysis (PCA) and Factor Analysis (FA)
14.1 Conditioning the data for PCA and FA
14.1.1 Different data ranges and variability, skewness
14.1.2 Normal distribution
14.1.3 Data outliers
14.1.4 Closed data
14.1.5 Censored data
14.1.6 Inhomogeneous data sets
14.1.7 Spatial dependence
14.1.8 Dimensionality
14.2 Principal component analysis (PCA)
14.2.1 The scree plot
14.2.2 The biplot
14.2.3 Mapping the principal components
14.2.4 Robust versus classical PCA
14.3 Factor analysis
14.3.1 Choice of factor analysis method
14.3.2 Choice of rotation method
14.3.3 Number of factors extracted
14.3.4 Selection of elements for factor analysis
14.3.5 Graphical representation of the results of factor analysis