Data Science with R for Psychologists and Healthcare Professionals

Christian Ryan
Senior Lecturer in Clinical Psychology and Chartered Clinical Psychologist
University College Cork
Cork, Ireland

A SCIENCE PUBLISHERS BOOK

First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Ryan, Christian, 1968- author.
Title: Data science with R for psychologists and healthcare professionals / Christian Ryan.
Description: First edition. | Boca Raton : CRC Press, 2021. | Includes bibliographical references and index. | Summary: “Data science - the integration of computer technologies with traditional statistical knowledge - is bringing sweeping changes across industry and academia. The ability to process, visualise and model data are vital skills for students of psychology and other health sciences. This book demonstrates the application of some of these latest approaches to the world of psychological research. Providing a thorough grounding in the use of R for data science, with many carefully crafted analyses, using open datasets, this book will enable beginners and emerging researchers to learn to harness the power of modern analytic techniques and apply them to their own research projects”-- Provided by publisher.
Identifiers: LCCN 2021017092 | ISBN 9780367618452 (hardcover)
Subjects: LCSH: Psychology--Statistical methods. | Psychology--Research--Data processing. | R (Computer program language)
Classification: LCC BF39 .R93 2021 | DDC 150.1/5195--dc23
LC record available at https://lccn.loc.gov/2021017092

ISBN: 978-0-367-61845-2 (hbk)
ISBN: 978-0-367-61856-8 (pbk)
ISBN: 978-1-003-10684-5 (ebk)

DOI: 10.1201/9781003106845

Preface

This book is intended for healthcare professionals and students of healthcare subjects at university who wish to begin using R (Team, 2017) to configure, analyse and visualise datasets. Some readers may be acquainted with SPSS and Excel, and the book draws comparisons with these programmes and explains how working methods differ. Though this is an introductory text, primarily aimed at beginners in R, it takes a contemporary approach, drawing heavily on the functionality of the tidyverse packages.
As much as possible, the strategies used to import, clean and process data prioritise readability for a non-technical audience and scaffold the learning of R in the context of real-world datasets. No previous experience with R is necessary, as the book begins with the rudiments of using the programme and offers many suggestions for additional structured learning. The most productive way to use the book is to follow along, chapter by chapter, completing the analysis of each dataset on your own computer. Data analysis is a form of procedural knowledge that is best gained through practical experience.

I maintain a personal website at https://drchristianryan.com where I occasionally blog about R. Updates to r4psych, the companion package for this book, will be posted on the website and will be available on my GitHub page at https://github.com/Christian-Ryan. More details about installing the package are in Chapters 2 and 4.

Acknowledgements

I would like to thank the R community, R developers and package writers for their unerring generosity in continuing to develop and expand the enormous power and potential of R software. It is inspiring to be part of a community that values open-source software, open science and the free exchange of help and ideas. I am grateful to my colleagues at University College Cork for support and encouragement since I joined the school. Particular thanks to Shane Galvin and Brendan Palmer; your teaching kickstarted my journey to becoming an enthusiastic R user. I would like to thank CRC Press for the opportunity to add to the extensive literature they have published on R-related topics. Finally, I want to express my deepest gratitude to my wife Sandra, daughter Emer, my parents, family and friends, for their unwavering encouragement and support. Special thanks to my son Fintan Ryan for insightful comments and feedback on an early draft of this book.

Contents

Preface
Acknowledgements

1. Introduction
1.1 Conventions used in this book
1.2 How this book is organised
1.3 Why learn R?
1.4 FAIR and data repositories
1.5 Data science
1.6 Avoiding complexity
1.7 Learning through real datasets
1.8 R as a language
1.9 Where to find help
1.10 Internal help system
1.11 Websites
1.12 Blogs
1.13 Books
1.14 Cheatsheets

2. The R Environment
2.1 RStudio
2.2 Packages
2.3 Where to find packages?
2.4 How to learn about package functions and datasets?
2.5 Installing packages
2.6 Examining installed packages
2.7 Learn R with swirl and other tools

3. The Basics
3.1 Overview
3.2 Functions and arguments
3.3 Creating vectors and dataframes
3.4 Adding new variables
3.5 Warning: quotation marks come in many styles
3.6 Simple plots
3.7 Selecting parts of a dataframe
3.8 Saving—write.csv()
3.9 Loading data—read.csv()
3.10 Data types
3.11 Saving objects as Rdata
3.12 Installing the tidyverse
3.13 Function conflict
3.14 Importing datasets
3.15 Functions used in this chapter

4. Working Practices
4.1 Default settings
4.2 Projects
4.3 Scripts
4.4 R Markdown
4.5 r4psych and datasets for this book

5. Dataset Excel
5.1 Downloading data from figshare
5.2 Loading dataset from multi-sheet Excel files
5.3 Renaming variables
5.4 The pipe %>%
5.5 Factors—adding labels
5.6 Reading new sheets from Excel file
5.7 Renaming multiple variables
5.8 Joining datasheets
5.9 Counting cases, calculating means, sd and proportions
5.10 Saving dataframes
5.11 Automatically renaming variables
5.12 Functions used in this chapter

6. Dataset csv
6.1 Loading comma-separated value (csv) files
6.2 Female psychosis dataset
6.3 Checking the data types
6.4 Coercion
6.5 Counting missing values
6.6 Converting multiple variables to numeric types
6.7 Factors
6.8 Save as Rdata
6.9 Functions used in this chapter

7. Dataset SPSS
7.1 Loading SPSS files—.SAV
7.2 Examining the data
7.3 The structure of "labelled" variables
7.4 Factors, adding levels and labels
7.5 Labelled attributes
7.6 Removing attributes from multiple variables
7.7 Save the file
7.8 Functions used in this chapter

8. Coding New Variables and Scale Reliability
8.1 Principles
8.2 Dataset: Branjerdporn et al. (2019)
8.3 Adding values with mutate()
8.4 Using sum() in mutate()
8.5 Numeric scales with reverse scoring—scoreItems()
8.6 Psychometric properties—scoreItems()
8.7 Converting text responses to numeric values
8.8 Factor levels to recode variables
8.9 mutate() without naming variables (anonymous functions)
8.10 mutate() with function() on real data
8.11 Calculate subscale and total scores
8.12 Creating categorical variables from continuous scales
8.13 Cronbach's alpha
8.14 Dropping items
8.15 Impact of item deletion on item-whole scale correlation
8.16 Inter-Item Correlation Matrix
8.17 Functions used in this chapter

9. Normality
9.1 Introduction
9.2 The importance of a normal distribution
9.3 Creating a normal distribution
9.4 Density plot of a normal distribution
9.5 qqplot
9.6 Skewness and Kurtosis
9.7 Normality tests
9.8 Empirical distributions—checking normality
9.9 Taking small samples of data
9.10 Histogram, qqplot, skewness and kurtosis with real data
9.11 Sub-samples and distributions
9.12 Sidebar: objects in R
9.13 Severe deviations from normality
9.14 Summary
9.15 Functions used in this chapter

10. Outliers
10.1 Reload data—Larson et al. (2015)
10.2 Outliers—Boxplot
10.3 Outlier—numeric methods
10.4 Replacing outliers
10.5 Functions used in this chapter

11. Descriptive Statistics
11.1 Summarise by group
11.2 Multiple grouping variables
11.3 Contingency tables
11.4 Chi-Squared test
11.5 t-test—using indexing
11.6 t-test using formula
11.7 Boxplots using formula
11.8 Boxplot with two IVs
11.9 Functions used in this chapter

12. Graphs with ggplot2
12.1 Introduction to graphing
12.2 Structure of a ggplot() call
12.3 Barplot
12.4 Axis labels
12.5 Colour and fill
12.6 Themes
12.7 Combining multiple layers
12.8 Scatterplot
12.9 Saving data objects, then plotting
12.10 Facetting
12.11 Boxplot
12.12 Jitter plot
12.13 Density plot
12.14 Functions used in this chapter

13. Correlation—Bivariate
13.1 Background
13.2 Scatterplot—base R
13.3 Scatterplot—ggplot2
13.4 Correlation coefficient
13.5 APA correlation
13.6 Coefficient of determination
13.7 Correlation Matrix—association between multiple variables
13.8 Plotting multiple correlations—corrplot
13.9 Plotting multiple correlations—GGally
13.10 Statistical significance in a correlation matrix
13.11 Assumptions of correlation
13.12 Functions used in this chapter

14. Correlation—Partial
14.1 Spurious correlation
14.2 Mediation
14.3 Partial correlation with correlation()
14.4 Functions used in this chapter

15. One-Way ANOVA—Model Data
15.1 ANOVA overview
15.2 ANOVA—organising the data
15.3 ANOVA formula—Base R-aov()
15.4 Effect size—eta-squared
15.5 APA output
15.6 Plotting the data
15.7 Post hoc tests—Tukey HSD
15.8 Planned comparisons
15.9 Checking the assumptions
15.10 car::Anova()
15.11 Functions used in this chapter

16. One-Way ANOVA—Real Data
16.1 Loading data
16.2 Visualising group differences
16.3 aov()
16.4 Post hoc tests: TukeyHSD()
16.5 Functions used in this chapter

17. Factorial ANOVA
17.1 Introduction
17.2 Dataset—Reilly et al. 2016
17.3 Distribution of participants across factor categories
17.4 Unbalanced factorial ANOVA
17.5 Anova()
17.6 Planned comparisons with emmeans()
17.7 Functions used in this chapter

18. ANCOVA
18.1 Introduction
18.2 Assumptions
18.3 Covariates
18.4 Preliminary ANOVA
18.5 Checking assumptions in van der Velde et al. (2015)
18.6 Setting contrast types
18.7 ANCOVA calculation