Probability, Statistics, and
Data
A Fresh Approach Using R
Darrin Speegle
Bryan Clair
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
© 2022 Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, LLC
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
ISBN: 978-0-367-43667-4 (hbk)
ISBN: 978-1-032-15441-1 (pbk)
ISBN: 978-1-003-00489-9 (ebk)
DOI: 10.1201/9781003004899
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents
Preface
Software Installation
1 Data in R
1.1 Arithmetic and variable assignment
1.2 Help
1.3 Vectors
1.4 Indexing vectors
1.5 Data types
1.6 Data frames
1.7 Reading data from files
1.8 Packages
1.9 Errors and warnings
1.10 Useful idioms
Vignette: Data science communities
Vignette: An R Markdown primer
Exercises
2 Probability
2.1 Probability basics
2.2 Simulations
2.3 Conditional probability and independence
2.4 Counting arguments
Vignette: Negative surveys
Exercises
3 Discrete Random Variables
3.1 Probability mass functions
3.2 Expected value
3.3 Binomial and geometric random variables
3.4 Functions of a random variable
3.5 Variance, standard deviation, and independence
3.6 Poisson, negative binomial, and hypergeometric
Vignette: Loops in R
Exercises
4 Continuous Random Variables
4.1 Probability density functions
4.2 Expected value
4.3 Variance and standard deviation
4.4 Normal random variables
4.5 Uniform and exponential random variables
4.6 Summary
Exercises
5 Simulation of Random Variables
5.1 Estimating probabilities
5.2 Estimating discrete distributions
5.3 Estimating continuous distributions
5.4 Central Limit Theorem
5.5 Sampling distributions
5.6 Point estimators
Vignette: Stein’s paradox
Exercises
6 Data Manipulation
6.1 Data frames and tibbles
6.2 dplyr verbs
6.3 dplyr pipelines
6.4 The power of dplyr
6.5 Working with character strings
6.6 Structure of data
6.7 The apply family
Vignette: dplyr murder mystery
Vignette: Data and gender
Exercises
7 Data Visualization with ggplot
7.1 ggplot fundamentals
7.2 Visualizing a single variable
7.3 Visualizing two or more variables
7.4 Customizing
Vignette: Choropleth maps
Vignette: COVID-19
Exercises
8 Inference on the Mean
8.1 Sampling distribution of the sample mean
8.2 Confidence intervals for the mean
8.3 Hypothesis tests of the mean
8.4 One-sided confidence intervals and hypothesis tests
8.5 Assessing robustness via simulation
8.6 Two sample hypothesis tests
8.7 Type II errors and power
8.8 Resampling
Exercises
9 Rank Based Tests
9.1 One sample Wilcoxon signed rank test
9.2 Two sample Wilcoxon tests
9.3 Power and sample size
9.4 Effect size and consistency
9.5 Summary
Vignette: ROC curves and the Wilcoxon rank sum statistic
Exercises
10 Tabular Data
10.1 Tables and plots
10.2 Inference on a proportion
10.3 χ² tests
10.4 χ² goodness of fit
10.5 χ² tests on cross tables
10.6 Exact and Monte Carlo methods
Vignette: Tables
Exercises
11 Simple Linear Regression
11.1 Least squares regression line
11.2 Correlation
11.3 Geometry of regression
11.4 Residual analysis
11.5 Inference
11.6 Simulations for simple linear regression
11.7 Cross validation
11.8 Bias-variance tradeoff
Vignette: Simple logistic regression
Exercises
12 Analysis of Variance and Comparison of Multiple Groups
12.1 ANOVA
12.2 The ANOVA test
12.3 Unequal variance
12.4 Pairwise t-tests
Vignette: Reproducibility
Exercises
13 Multiple Regression
13.1 Two explanatory variables
13.2 Categorical variables
13.3 Variable selection
Vignette: External data formats
Exercises
Image Credits
Index
Index of Data Sets and Packages
Preface
This book represents a fundamental rethinking of a calculus-based first course in probability
and statistics. We offer a breadth-first approach, where the essentials of probability and
statistics can be taught in one semester. The statistical programming language R plays a
central role throughout the text through simulations, data wrangling, visualizations, and
statistical procedures. Data sets from a variety of sources, including many from recent, open
source scientific articles, are used in examples and exercises. Demonstrations of important
facts are given through simulations, with some formal mathematical proofs as well.
This book is an excellent choice for students studying data science, statistics, engineering,
computer science, mathematics, science, business, or for any student wanting a practical
course grounded in simulations.
The book assumes a mathematical background of one semester of calculus along with some
infinite series in Chapter 3. Integrals and infinite series are used for notation and exposition
in Chapters 3 and 4, but in other chapters the use of calculus is minimal. Since an emphasis
is placed on understanding results (and robustness to departures from assumptions) via
simulation, most if not all parts of the book can be understood without calculus. Proofs of
many results are provided, and justifications via simulations for many more, but this text is
not intended to support a proof-based course. Readers are encouraged to follow the proofs,
but often one wants to understand a proof only after first understanding the result and why
it is important.
Our philosophy in this book is to not shy away from messy data sets. The book contains
extensive sections and many exercises that require data cleaning and manipulation. This is
an essential part of the text.
A one-semester course using this book could reasonably cover most material in Chapters 1-8
in order and then select two or three additional chapters. Sections 2.4, 3.6, 5.6, 8.7 and 8.8
may be omitted or given light coverage. The descriptive statistics in Chapters 6 and 7 are
frequently the first part of a statistics course, but we recommend leaving them in the middle
as they provide students with a welcome change of pace during the semester. Chapter 9
(Rank Based Tests) is particularly important because it uses simulation techniques developed
throughout the text to help students understand power and the effects of assumptions on
testing.
There is enough material for a more leisurely and thorough two-semester sequence that
would delve deeper into probability theory, spend more time on data wrangling, and cover
all of the inference chapters.
Most chapters in the book contain at least one vignette. These short sections are not part
of the development of the base material. We imagine these vignettes as starting points for
further study for some students, or as interesting additions to the main material. Examples
include choropleth maps, data and gender, Stein’s paradox, and a treatment of COVID-19
data.
Base R and tidyverse tools are interspersed, depending on which is better for a particular