Basic Statistical Graphics for Archaeology with R: Life Beyond Excel

Mike Baxter* and Hilary Cool**

* Nottingham Trent University, School of Science and Technology (Emeritus)
** Barbican Research Associates, 16 Lady Bay Road, Nottingham NG2 5BJ, UK

Barbican Research Associates, Nottingham
October 2016

Contents

Preface viii
1 Introduction 1
  1.1 Why statistics? 1
  1.2 Why R? 2
  1.3 The need for graphics, or not 2
  1.4 How this book was written 4
  1.5 Structure of the text 5
  1.6 Additional resources – data sets and R code 6
2 R – an introduction 8
  2.1 Introduction 8
  2.2 R basics 10
    2.2.1 R – getting started 10
    2.2.2 Data import 10
    2.2.3 Data structures 11
  2.3 Functions and packages 13
    2.3.1 Functions 13
    2.3.2 Packages – general 14
    2.3.3 Packages – graphics 14
    2.3.4 Plot construction – ggplot2 16
    2.3.5 Colour 19
  2.4 Don’t panic 20
3 Descriptive statistics 22
  3.1 Introduction 22
  3.2 Types of data 22
  3.3 Summary statistics 23
    3.3.1 Definitions 23
    3.3.2 Usage (and abusage) 26
  3.4 Obtaining summary statistics in R 28
4 Histograms 31
  4.1 Introduction 31
  4.2 Histograms in base R 33
  4.3 Histograms using ggplot2 35
  4.4 Histograms from count data 36
  4.5 Histograms with unequal bin-widths 37
  4.6 Using histograms for comparison 39
  4.7 Frequency polygons 43
5 User-defined functions 45
  5.1 User-defined functions 45
    5.1.1 Introduction 45
    5.1.2 Writing functions – a very basic introduction 46
    5.1.3 More realistic illustrations 47
  5.2 Summary 50
6 Kernel density estimates 51
  6.1 Introduction 51
  6.2 Example – loomweights from Pompeii 53
  6.3 Comparing KDEs 58
  6.4 Two-dimensional (2D) KDEs 59
7 Boxplots and related graphics 64
  7.1 Introduction 64
  7.2 Construction and interpretation 64
  7.3 Using plots for comparing groups 65
  7.4 Discussion 70
8 Scatterplots 73
  8.1 Introduction 73
  8.2 Presentational aspects 74
    8.2.1 Plotting symbols, axis labels, axis limits 74
    8.2.2 Adding lines to plots 76
    8.2.3 Points 79
    8.2.4 Scatterplot matrices 82
9 Pie charts 85
  9.1 Introduction 85
  9.2 Examples 86
    9.2.1 Pointless pie charts 86
    9.2.2 Dishonest pie charts 89
    9.2.3 Could do better 90
    9.2.4 From bad to worse? 94
10 Barplots 98
  10.1 Introduction 98
  10.2 One-way data 99
  10.3 Two-way data 102
    10.3.1 Preliminary considerations 102
    10.3.2 Barplots for two-way data in base R 103
    10.3.3 Barplots in ggplot2 106
  10.4 Examples, including things to avoid 109
    10.4.1 Glass ‘counters’ and their function 109
    10.4.2 Chartjunk I 113
    10.4.3 Chartjunk II 115
11 Ternary diagrams 119
  11.1 Introduction 119
  11.2 Examples 120
    11.2.1 Palaeolithic assemblages 120
    11.2.2 Romano-British civilian faunal assemblages 122
    11.2.3 Staffordshire Hoard gold compositions 127
12 Correspondence analysis 130
  12.1 Introduction 130
  12.2 Correspondence analysis – basics 131
  12.3 Some data 132
  12.4 R basics 135
    12.4.1 ‘Default’ analyses 135
    12.4.2 Customising plots 136
    12.4.3 Interpretation 140
  12.5 Glass ‘counters’ from Pompeii 141
  12.6 Anglo-Saxon male graves and seriation 144
  12.7 Flavian drinking-vessels 148
References 153
Index 165

List of Figures

3.1 Some possible distributional ‘shapes’ for continuous data. 26
3.2 A histogram and KDE for late Romano-British hairpin lengths. 29
4.1 Romano-British hairpins. 32
4.2 A Roman pillar moulded bowl. 32
4.3 Histograms for the Romano-British hairpin lengths I. 33
4.4 Histograms for the Romano-British hairpin lengths II. 34
4.5 Histograms for the Romano-British hairpin lengths using ggplot2. 35
4.6 Different ways of representing the data from Table 4.3. 38
4.7 Using stacked histograms for comparison. 39
4.8 Using superimposed histograms for comparison. 40
4.9 Frequency polygons for Romano-British hairpin lengths. 43
6.1 Construction of a kernel density estimate. 51
6.2 The effect of bandwidth on KDE construction. 52
6.3 KDEs of the weights of loomweights I. 53
6.4 A view of Insula VI.1, Pompeii. 55
6.5 Loomweights from Insula VI.1, Pompeii. 55
6.6 KDEs of the weights of loomweights II. 56
6.7 KDEs and histograms of the weights of loomweights. 57
6.8 KDEs comparing lengths of early and late Romano-British hairpins. 58
6.9 Two-dimensional KDEs using ggplot2. 59
6.10 Two-dimensional KDEs using base R. 61
6.11 Contour plots based on 2D KDEs using base R. 62
7.1 Basic boxplots, violin plots and beanplots. 65
7.2 Boxplots and beanplots used to compare weights of loomweights. 66
7.3 Kernel density estimates of weights of phased loomweights. 67
7.4 Boxplots and violin plots used to compare weights of loomweights. 68
7.5 Boxplots and violin plots for Romano-British hairpin lengths. 69
7.6 Boxplots for skewed data. 72
8.1 Default scatterplots for loomweight heights and weights. 74
8.2 Enhanced scatterplots for loomweight heights and weights. 74
8.3 Basic code for a base R scatterplot. 75
8.4 Available symbols for plotting points in R. 75
8.5 Basic code for a ggplot2 scatterplot. 76
8.6 Adding lines to plots I. 76
8.7 Available lines for plotting in R. 77
8.8 Adding lines to plots II. 78
8.9 Adding points to plots. 79
8.10 Plotting multiple groups I. 80
8.11 Plotting multiple groups II. 81
8.12 A scatterplot matrix for loomweight dimensions using base R. 82
8.13 An enhanced scatterplot matrix for loomweight dimensions. 83
8.14 A ggpairs scatterplot for loomweight dimensions. 84
9.1 Pie charts of the distribution of contexts by cauldron group. 88
9.2 Pie charts showing the distribution of amphora types in the Lower Danube region. 89
9.3 Pie charts for faunal remains from Wroxeter. 90
9.4 Soay sheep and Dexter cattle. 91
9.5 A clustered barplot for faunal remains from Wroxeter. 92
9.6 A ternary diagram for faunal remains from Wroxeter. 93
9.7 Pie charts for counter colours by phase from Pompeii, Insula VI.1. 96
10.1 Default and enhanced barplots for amphora presence on Romano-British sites. 100
10.2 Default barplots for species presence on Romano-British sites. 104
10.3 Enhanced barplots for species presence on Romano-British sites using base R. 104
10.4 ‘Default’ barplots for species presence on Romano-British sites using ggplot2. 107
10.5 A modified clustered barplot for species presence on Romano-British sites using ggplot2. 108
10.6 Glass counters from Gloucester and Pompeii. 110
10.7 A clustered barplot for glass ‘counter’ dimensions from different contexts. 111
10.8 An alternative (and poorly) ‘enhanced’ barplot for amphora presence on Romano-British sites. 113
10.9 A version of Figure 10.3a showing the visual impact of grid lines. 114
10.10 Monthly distribution of sacrifices to Saturn. 115
10.11 Examples of unnecessary three-dimensional barplots. 116
11.1 Ternary diagrams for the assemblage data of Table 11.1. 121
11.2 ‘Default’ and modified ternary diagrams for Romano-British civilian faunal assemblages. 123
11.3 A ‘modified’ ternary diagram for Romano-British civilian faunal assemblages II. 125
11.4 A ternary diagram for Romano-British civilian faunal assemblages overlaid with a two-dimensional KDE. 126
11.5 Ternary diagrams for Romano-British civilian faunal assemblages for different site types overlaid with two-dimensional KDEs. 126
11.6 Objects from the Staffordshire Hoard. 128
11.7 A ternary diagram for Au/Ag/Cu compositions of artefacts from the Staffordshire Hoard. 129
12.1 A selection of Romano-British bow brooches of 1st to 4th century date. 134
12.2 Basic correspondence analyses for the Romano-British brooch data set. 135
12.3 An enhanced correspondence analyses for the Romano-British brooch data set using the ca function and base R. 137
12.4 An enhanced correspondence analyses for the Romano-British brooch data set using the ca function and ggplot2. 138
12.5 A correspondence analysis of glass counter colours by phase from Insula VI.1, Pompeii, Italy. 142
12.7 Examples of Flavian glass vessels. 148
12.8 Correspondence analysis, Flavian drinking-vessel glass assemblages. 149

List of Tables

3.1 Lengths of late Romano-British copper alloy hairpins. 28
4.1 Lengths of Romano-British copper alloy hairpins. 31
4.2 The Romano-British hairpin data as entered into R. 33
4.3 Chronology of Roman glass pillar-moulded bowls. 37
6.1 Loomweight dimensions. 54
7.1 Phased loomweight data with six variables. 66
9.1 Counts of Iron Age and early Roman cauldrons by context. 87
9.2 Percentages of major domesticates from Wroxeter by phase. 90
9.3 Colours of glass ‘counters’ by phase, from Insula VI.1, Pompeii. 95
10.1 Amphora as a proportion of total pottery assemblages by site. 100
10.2 Later first- to mid-second century animal bone assemblages from Romano-British sites. 102
10.3 The data of Table 10.2 recast for use with the barplot function in base R. 103
10.4 The data of Table 10.2 recast for use with ggplot2. 107
10.5 Glass ‘counter’ sizes from Insula VI.1 Pompeii and other contexts. 109
10.6 Individuals commemorated with and setting up funerary altars. 117
11.1 Counts of cores, blanks and tools from middle levels of the palaeolithic site at Ksar Akil (Lebanon). 121
11.2 Faunal remains from Romano-British civilian sites. 122
11.3 Au/Ag/Cu compositions for artefacts from the Staffordshire Hoard. 127
12.1 Regional counts of late Iron Age and Roman brooches. 133
12.2 Assemblage profiles for Flavian drinking-vessel glass. 149

Preface

As the late, great Terry Pratchett put it, in his Discworld novel Hogfather, “Everything starts somewhere, though many physicists disagree. But people have always been dimly aware of the problem with the start of things.”

The genesis of this text is multiple. But the idea of producing it, to the best of our recollection, crystallized during a conversation in a pub – as many good and bad ideas do. The first author is an academic statistician with reasonably extensive youthful experience of the practicalities of archaeological excavation and its aftermath.
The second author is an archaeologist who has used statistical methodology in her work since her university days. We were working on separate projects with the finishing line in sight and vaguely wondering what to do next.

We’ve published jointly since the 1990s, latterly, when it comes to statistics, including graphical presentation using the open-source software R, about which more is said in Chapter 2. Some of the simpler methodology that we’ve used, with respect to graphical presentation of the kind that is the staple of archaeological publications that touch this kind of thing, should be straightforward. The archaeological literature is littered with statistical graphics that should have been strangled at birth or, in some cases, not conceived at all. We resort to Excel from time to time and take the trouble to customise the graphs.

We thought that a text that discussed how to produce the most common types of graph that appear in archaeological publications, with detailed instructions on how to construct them in R, might be useful. The immediate thought at the time was that it would enable the second author to produce her own graphs in R without having to trouble the first author.

This would be a solipsistic reason for inflicting a text on an unsuspecting world, though this has never stopped anyone. There is a more serious purpose – of the kind that occurs to you on your second glass of wine (or beer or cup of tea or whatever). You might also conceive of our pub conversation as a ‘conception’ arising, in a less precisely datable fashion, from mutual long-standing dissatisfaction with some of the statistical graphics to be found in the archaeological literature.

With apologies for the bad language used, this is best conveyed by some of the kinds of things we’ve said. ‘If I see another bl**dy three-dimensional exploded pie-chart I’ll scream’.
‘The editor(s) who allowed this monstrous, three-dimensional, optically disturbing bar chart to sully their publication should be