Statistical Data Analysis Explained: Applied Environmental Statistics with R. C. Reimann, P. Filzmoser, R. G. Garrett,
R. Dutter © 2008 John Wiley & Sons, Ltd. ISBN: 978-0-470-98581-6
Statistical Data Analysis Explained
Applied Environmental Statistics with R
Clemens Reimann
Geological Survey of Norway
Peter Filzmoser
Vienna University of Technology
Robert G. Garrett
Geological Survey of Canada
Rudolf Dutter
Vienna University of Technology
Copyright © 2008 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on www.wileyeurope.com or www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in
any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under
the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright
Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the
Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The
Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed
to (+44) 1243 770620.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and
product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective
owners. The Publisher is not associated with any product or vendor mentioned in this book.
This publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional
advice or other expert assistance is required, the services of a competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, L5R 4J3
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-470-98581-6
Typeset in 10/12pt Times by Thomson Digital, Noida, India
Printed and bound in Great Britain by Antony Rowe Ltd., Chippenham, Wilts
This book is printed on acid-free paper
Contents
Preface xiii
Acknowledgements xv
About the authors xvii
1 Introduction 1
1.1 The Kola Ecogeochemistry Project 5
1.1.1 Short description of the Kola Project survey area 6
1.1.2 Sampling and characteristics of the different sample materials 9
1.1.3 Sample preparation and chemical analysis 11
2 Preparing the Data for Use in R and DAS+R 13
2.1 Required data format for import into R and DAS+R 14
2.2 The detection limit problem 17
2.3 Missing values 20
2.4 Some "typical" problems encountered when editing a laboratory data report file to a DAS+R file 21
2.4.1 Sample identification 22
2.4.2 Reporting units 22
2.4.3 Variable names 23
2.4.4 Results below the detection limit 23
2.4.5 Handling of missing values 24
2.4.6 File structure 24
2.4.7 Quality control samples 25
2.4.8 Geographical coordinates, further editing and some unpleasant limitations of spreadsheet programs 25
2.5 Appending and linking data files 25
2.6 Requirements for a geochemical database 27
2.7 Summary 28
3 Graphics to Display the Data Distribution 29
3.1 The one-dimensional scatterplot 29
3.2 The histogram 31
3.3 The density trace 34
3.4 Plots of the distribution function 35
3.4.1 Plot of the cumulative distribution function (CDF-plot) 35
3.4.2 Plot of the empirical cumulative distribution function (ECDF-plot) 36
3.4.3 The quantile-quantile plot (QQ-plot) 36
3.4.4 The cumulative probability plot (CP-plot) 39
3.4.5 The probability-probability plot (PP-plot) 40
3.4.6 Discussion of the distribution function plots 41
3.5 Boxplots 41
3.5.1 The Tukey boxplot 42
3.5.2 The log-boxplot 44
3.5.3 The percentile-based boxplot and the box-and-whisker plot 46
3.5.4 The notched boxplot 47
3.6 Combination of histogram, density trace, one-dimensional scatterplot, boxplot, and ECDF-plot 48
3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot, and CP-plot 49
3.8 Summary 50
4 Statistical Distribution Measures 51
4.1 Central value 51
4.1.1 The arithmetic mean 51
4.1.2 The geometric mean 52
4.1.3 The mode 52
4.1.4 The median 52
4.1.5 Trimmed mean and other robust measures of the central value 53
4.1.6 Influence of the shape of the data distribution 53
4.2 Measures of spread 56
4.2.1 The range 56
4.2.2 The interquartile range (IQR) 56
4.2.3 The standard deviation 57
4.2.4 The median absolute deviation (MAD) 57
4.2.5 Variance 58
4.2.6 The coefficient of variation (CV) 58
4.2.7 The robust coefficient of variation (CVR) 59
4.3 Quartiles, quantiles and percentiles 59
4.4 Skewness 59
4.5 Kurtosis 59
4.6 Summary table of statistical distribution measures 60
4.7 Summary 60
5 Mapping Spatial Data 63
5.1 Map coordinate systems (map projection) 64
5.2 Map scale 65
5.3 Choice of the base map for geochemical mapping 66
5.4 Mapping geochemical data with proportional dots 68
5.5 Mapping geochemical data using classes 69
5.5.1 Choice of symbols for geochemical mapping 70
5.5.2 Percentile classes 71
5.5.3 Boxplot classes 71
5.5.4 Use of ECDF- and CP-plot to select classes for mapping 74
5.6 Surface maps constructed with smoothing techniques 74
5.7 Surface maps constructed with kriging 76
5.7.1 Construction of the (semi)variogram 76
5.7.2 Quality criteria for semivariograms 79
5.7.3 Mapping based on the semivariogram (kriging) 79
5.7.4 Possible problems with semivariogram estimation and kriging 80
5.8 Colour maps 82
5.9 Some common mistakes in geochemical mapping 84
5.9.1 Map scale 84
5.9.2 Base map 84
5.9.3 Symbol set 84
5.9.4 Scaling of symbol size 84
5.9.5 Class selection 86
5.10 Summary 88
6 Further Graphics for Exploratory Data Analysis 91
6.1 Scatterplots (xy-plots) 91
6.1.1 Scatterplots with user-defined lines or fields 92
6.2 Linear regression lines 93
6.3 Time trends 95
6.4 Spatial trends 97
6.5 Spatial distance plot 99
6.6 Spider plots (normalised multi-element diagrams) 101
6.7 Scatterplot matrix 102
6.8 Ternary plots 103
6.9 Summary 106
7 Defining Background and Threshold, Identification of Data Outliers and Element Sources 107
7.1 Statistical methods to identify extreme values and data outliers 108
7.1.1 Classical statistics 108
7.1.2 The boxplot 109
7.1.3 Robust statistics 110
7.1.4 Percentiles 111
7.1.5 Can the range of background be calculated? 112
7.2 Detecting outliers and extreme values in the ECDF- or CP-plot 112
7.3 Including the spatial distribution in the definition of background 114
7.3.1 Using geochemical maps to identify a reasonable threshold 114
7.3.2 The concentration-area plot 115
7.3.3 Spatial trend analysis 118
7.3.4 Multiple background populations in one data set 119
7.4 Methods to distinguish geogenic from anthropogenic element sources 120
7.4.1 The TOP/BOT-ratio 120
7.4.2 Enrichment factors (EFs) 121
7.4.3 Mineralogical versus chemical methods 128
7.5 Summary 128
8 Comparing Data in Tables and Graphics 129
8.1 Comparing data in tables 129
8.2 Graphical comparison of the data distributions of several data sets 133
8.3 Comparing the spatial data structure 136
8.4 Subset creation – a mighty tool in graphical data analysis 138
8.5 Data subsets in scatterplots 141
8.6 Data subsets in time and spatial trend diagrams 142
8.7 Data subsets in ternary plots 144
8.8 Data subsets in the scatterplot matrix 146
8.9 Data subsets in maps 147
8.10 Summary 148
9 Comparing Data Using Statistical Tests 149
9.1 Tests for distribution (Kolmogorov–Smirnov and Shapiro–Wilk tests) 150
9.1.1 The Kola data set and the normal or lognormal distribution 151
9.2 The one-sample t-test (test for the central value) 154
9.3 Wilcoxon signed-rank test 156
9.4 Comparing two central values of the distributions of independent data groups 157
9.4.1 The two-sample t-test 157
9.4.2 The Wilcoxon rank sum test 158
9.5 Comparing two central values of matched pairs of data 158
9.5.1 The paired t-test 158
9.5.2 The Wilcoxon test 160
9.6 Comparing the variance of two data sets 160
9.6.1 The F-test 160
9.6.2 The Ansari–Bradley test 160
9.7 Comparing several central values 161
9.7.1 One-way analysis of variance (ANOVA) 161
9.7.2 Kruskal-Wallis test 161
9.8 Comparing the variance of several data groups 161
9.8.1 Bartlett test 161
9.8.2 Levene test 162
9.8.3 Fligner test 162
9.9 Comparing several central values of dependent groups 163
9.9.1 ANOVA with blocking (two-way) 163
9.9.2 Friedman test 163
9.10 Summary 164
10 Improving Data Behaviour for Statistical Analysis: Ranking and Transformations 167
10.1 Ranking/sorting 168
10.2 Non-linear transformations 169
10.2.1 Square root transformation 169
10.2.2 Power transformation 169
10.2.3 Log(arithmic)-transformation 169
10.2.4 Box–Cox transformation 171
10.2.5 Logit transformation 171
10.3 Linear transformations 172
10.3.1 Addition/subtraction 172
10.3.2 Multiplication/division 173
10.3.3 Range transformation 174
10.4 Preparing a data set for multivariate data analysis 174
10.4.1 Centring 174
10.4.2 Scaling 174
10.5 Transformations for closed number systems 176
10.5.1 Additive logratio transformation 177
10.5.2 Centred logratio transformation 178
10.5.3 Isometric logratio transformation 178
10.6 Summary 179
11 Correlation 181
11.1 Pearson correlation 182
11.2 Spearman rank correlation 183
11.3 Kendall-tau correlation 184
11.4 Robust correlation coefficients 184
11.5 When is a correlation coefficient significant? 185
11.6 Working with many variables 185
11.7 Correlation analysis and inhomogeneous data 187
11.8 Correlation results following additive logratio or centred logratio transformations 189
11.9 Summary 191
12 Multivariate Graphics 193
12.1 Profiles 193
12.2 Stars 194
12.3 Segments 196
12.4 Boxes 197
12.5 Castles and trees 198
12.6 Parallel coordinates plot 198
12.7 Summary 200
13 Multivariate Outlier Detection 201
13.1 Univariate versus multivariate outlier detection 201
13.2 Robust versus non-robust outlier detection 204
13.3 The chi-square plot 205
13.4 Automated multivariate outlier detection and visualisation 205
13.5 Other graphical approaches for identifying outliers and groups 208
13.6 Summary 210
14 Principal Component Analysis (PCA) and Factor Analysis (FA) 211
14.1 Conditioning the data for PCA and FA 212
14.1.1 Different data ranges and variability, skewness 212
14.1.2 Normal distribution 213
14.1.3 Data outliers 213
14.1.4 Closed data 214
14.1.5 Censored data 215
14.1.6 Inhomogeneous data sets 215
14.1.7 Spatial dependence 215
14.1.8 Dimensionality 216
14.2 Principal component analysis (PCA) 216
14.2.1 The scree plot 217
14.2.2 The biplot 219
14.2.3 Mapping the principal components 220
14.2.4 Robust versus classical PCA 221
14.3 Factor analysis 222
14.3.1 Choice of factor analysis method 224
14.3.2 Choice of rotation method 224
14.3.3 Number of factors extracted 224
14.3.4 Selection of elements for factor analysis 225
14.3.5 Graphical representation of the results of factor analysis 225