Probability, Statistics, and Data
A Fresh Approach Using R

Darrin Speegle
Bryan Clair

First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-0-367-43667-4 (hbk)
ISBN: 978-1-032-15441-1 (pbk)
ISBN: 978-1-003-00489-9 (ebk)

DOI: 10.1201/9781003004899

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents

Preface
    Software Installation

1 Data in R
    1.1 Arithmetic and variable assignment
    1.2 Help
    1.3 Vectors
    1.4 Indexing vectors
    1.5 Data types
    1.6 Data frames
    1.7 Reading data from files
    1.8 Packages
    1.9 Errors and warnings
    1.10 Useful idioms
    Vignette: Data science communities
    Vignette: An R Markdown primer
    Exercises

2 Probability
    2.1 Probability basics
    2.2 Simulations
    2.3 Conditional probability and independence
    2.4 Counting arguments
    Vignette: Negative surveys
    Exercises

3 Discrete Random Variables
    3.1 Probability mass functions
    3.2 Expected value
    3.3 Binomial and geometric random variables
    3.4 Functions of a random variable
    3.5 Variance, standard deviation, and independence
    3.6 Poisson, negative binomial, and hypergeometric
    Vignette: Loops in R
    Exercises

4 Continuous Random Variables
    4.1 Probability density functions
    4.2 Expected value
    4.3 Variance and standard deviation
    4.4 Normal random variables
    4.5 Uniform and exponential random variables
    4.6 Summary
    Exercises

5 Simulation of Random Variables
    5.1 Estimating probabilities
    5.2 Estimating discrete distributions
    5.3 Estimating continuous distributions
    5.4 Central Limit Theorem
    5.5 Sampling distributions
    5.6 Point estimators
    Vignette: Stein’s paradox
    Exercises

6 Data Manipulation
    6.1 Data frames and tibbles
    6.2 dplyr verbs
    6.3 dplyr pipelines
    6.4 The power of dplyr
    6.5 Working with character strings
    6.6 Structure of data
    6.7 The apply family
    Vignette: dplyr murder mystery
    Vignette: Data and gender
    Exercises

7 Data Visualization with ggplot
    7.1 ggplot fundamentals
    7.2 Visualizing a single variable
    7.3 Visualizing two or more variables
    7.4 Customizing
    Vignette: Choropleth maps
    Vignette: COVID-19
    Exercises

8 Inference on the Mean
    8.1 Sampling distribution of the sample mean
    8.2 Confidence intervals for the mean
    8.3 Hypothesis tests of the mean
    8.4 One-sided confidence intervals and hypothesis tests
    8.5 Assessing robustness via simulation
    8.6 Two sample hypothesis tests
    8.7 Type II errors and power
    8.8 Resampling
    Exercises

9 Rank Based Tests
    9.1 One sample Wilcoxon signed rank test
    9.2 Two sample Wilcoxon tests
    9.3 Power and sample size
    9.4 Effect size and consistency
    9.5 Summary
    Vignette: ROC curves and the Wilcoxon rank sum statistic
    Exercises

10 Tabular Data
    10.1 Tables and plots
    10.2 Inference on a proportion
    10.3 χ² tests
    10.4 χ² goodness of fit
    10.5 χ² tests on cross tables
    10.6 Exact and Monte Carlo methods
    Vignette: Tables
    Exercises

11 Simple Linear Regression
    11.1 Least squares regression line
    11.2 Correlation
    11.3 Geometry of regression
    11.4 Residual analysis
    11.5 Inference
    11.6 Simulations for simple linear regression
    11.7 Cross validation
    11.8 Bias-variance tradeoff
    Vignette: Simple logistic regression
    Exercises

12 Analysis of Variance and Comparison of Multiple Groups
    12.1 ANOVA
    12.2 The ANOVA test
    12.3 Unequal variance
    12.4 Pairwise t-tests
    Vignette: Reproducibility
    Exercises

13 Multiple Regression
    13.1 Two explanatory variables
    13.2 Categorical variables
    13.3 Variable selection
    Vignette: External data formats
    Exercises

Image Credits
Index
Index of Data Sets and Packages

Preface

This book represents a fundamental rethinking of a calculus based first course in probability and statistics. We offer a breadth first approach, where the essentials of probability and statistics can be taught in one semester. The statistical programming language R plays a central role throughout the text through simulations, data wrangling, visualizations, and statistical procedures. Data sets from a variety of sources, including many from recent, open source scientific articles, are used in examples and exercises. Demonstrations of important facts are given through simulations, with some formal mathematical proofs as well. This book is an excellent choice for students studying data science, statistics, engineering, computer science, mathematics, science, business, or for any student wanting a practical course grounded in simulations.

The book assumes a mathematical background of one semester of calculus along with some infinite series in Chapter 3. Integrals and infinite series are used for notation and exposition in Chapters 3 and 4, but in other chapters the use of calculus is minimal. Since an emphasis is placed on understanding results (and robustness to departures from assumptions) via simulation, most if not all parts of the book can be understood without calculus.

Proofs of many results are provided, and justifications via simulations for many more, but this text is not intended to support a proof based course. Readers are encouraged to follow the proofs, but often one wants to understand a proof only after first understanding the result and why it is important.

Our philosophy in this book is to not shy away from messy data sets. The book contains extensive sections and many exercises that require data cleaning and manipulation. This is an essential part of the text.
A one-semester course using this book could reasonably cover most material in Chapters 1-8 in order and then select two or three additional chapters. Sections 2.4, 3.6, 5.6, 8.7 and 8.8 may be omitted or given light coverage. The descriptive statistics in Chapters 6 and 7 are frequently the first part of a statistics course, but we recommend leaving them in the middle as they provide students with a welcome change of pace during the semester. Chapter 9 (Rank Based Tests) is particularly important because it uses simulation techniques developed throughout the text to help students understand power and the effects of assumptions on testing.

There is enough material for a more leisurely and thorough two-semester sequence that would delve deeper into probability theory, spend more time on data wrangling, and cover all of the inference chapters.

Most chapters in the book contain at least one vignette. These short sections are not part of the development of the base material. We imagine these vignettes as starting points for further study for some students, or as interesting additions to the main material. Examples include choropleth maps, data and gender, Stein’s paradox, and a treatment of COVID-19 data.

Base R and tidyverse tools are interspersed, depending on which is better for a particular