Efficient R Programming A PRACTICAL GUIDE TO SMARTER PROGRAMMING Colin Gillespie & Robin Lovelace Efficient R Programming A Practical Guide to Smarter Programming Colin Gillespie and Robin Lovelace BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo Efficient R Programming by Colin Gillespie and Robin Lovelace Copyright © 2017 Colin Gillespie, Robin Lovelace. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐ tutional sales department: 800-998-9938 or [email protected]. Editor: Nicole Tache Indexer: WordCo Indexing Services Production Editor: Nicholas Adams Interior Designer: David Futato Copyeditor: Gillian McGarvey Cover Designer: Randy Comer Proofreader: Christina Edwards Illustrator: Rebecca Demarest December 2016: First Edition Revision History for the First Edition 2016-11-29: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491950784 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Efficient R Programming, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-95078-4 [LSI] Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Prerequisites 2 Who This Book Is for and How to Use It 2 What Is Efficiency? 4 What Is Efficient R Programming? 4 Why Efficiency? 6 Cross-Transferable Skills for Efficiency 7 Touch Typing 7 Consistent Style and Code Conventions 8 Benchmarking and Profiling 9 Benchmarking 9 Benchmarking Example 10 Profiling 11 Book Resources 14 R Package 14 Online Version 14 References 14 2. Efficient Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Prerequisites 18 Top Five Tips for an Efficient R Setup 18 Operating System 18 Operating System and Resource Monitoring 19 R Version 21 Installing R 21 Updating R 23 iii Installing R Packages 23 Installing R Packages with Dependencies 24 Updating R Packages 24 R Startup 25 R Startup Arguments 25 An Overview of R’s Startup Files 26 The Location of Startup Files 26 The .Rprofile File 28 Example .Rprofile File 28 The .Renviron File 32 RStudio 34 Installing and Updating RStudio 35 Window Pane Layout 35 RStudio Options 38 Autocompletion 38 Keyboard Shortcuts 40 Object Display and Output Table 40 Project Management 41 BLAS and Alternative R Interpreters 43 Testing Performance Gains from BLAS 43 Other Interpreters 44 Useful BLAS/Benchmarking Resources 45 References 45 3. Efficient Programming. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Prerequisites 47 Top Five Tips for Efficient Programming 47 General Advice 48 Memory Allocation 49 Vectorized Code 50 Communicating with the User 53 Fatal Errors: stop() 53 Warnings: warning() 54 Informative Output: message() and cat() 55 Invisible Returns 55 Factors 56 Inherent Order 56 Fixed Set of Categories 57 The Apply Family 57 Example: Movies Dataset 59 Type Consistency 60 Caching Variables 61 iv | Table of Contents Function Closures 63 The Byte Compiler 64 Example: The Mean Function 65 Compiling Code 66 References 67 4. Efficient Workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Prerequisites 69 Top Five Tips for Efficient Workflow 70 A Project Planning Typology 70 Project Planning and Management 72 Chunking Your Work 73 Making Your Workflow SMART 74 Visualizing Plans with R 75 Package Selection 76 Searching for R Packages 78 How to Select a Package 78 Publication 80 Dynamic Documents with R Markdown 81 R Packages 83 Reference 84 5. Efficient Input/Output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Prerequisites 86 Top Five Tips for Efficient Data I/O 86 Versatile Data Import with rio 86 Plain-Text Formats 88 Differences Between fread() and read_csv() 90 Preprocessing Text Outside R 92 Binary File Formats 93 Native Binary Formats: Rdata or Rds? 94 The Feather File Format 94 Benchmarking Binary File Formats 94 Protocol Buffers 96 Getting Data from the Internet 96 Accessing Data Stored in Packages 97 References 98 6. Efficient Data Carpentry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Prerequisites 100 Top Five Tips for Efficient Data Carpentry 100 Efficient Data Frames with tibble 100 Table of Contents | v Tidying Data with tidyr and Regular Expressions 102 Make Wide Tables Long with gather() 103 Split Joint Variables with separate() 104 Other tidyr Functions 105 Regular Expressions 106 Efficient Data Processing with dplyr 108 Renaming Columns 110 Changing Column Classes 110 Filtering Rows 111 Chaining Operations 112 Data Aggregation 114 Nonstandard Evaluation 117 Combining Datasets 118 Working with Databases 119 Databases and dplyr 121 Data Processing with data.table 123 References 125 7. Efficient Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Prerequisites 128 Top Five Tips for Efficient Optimization 128 Code Profiling 128 Getting Started with profvis 129 Example: Monopoly Simulation 130 Efficient Base R 131 The if() Versus ifelse() Functions 131 Sorting and Ordering 132 Reversing Elements 133 Which Indices are TRUE? 133 Converting Factors to Numerics 133 Logical AND and OR 134 Row and Column Operations 134 is.na() and anyNA() 135 Matrices 135 Example: Optimizing the move_square() Function 138 Parallel Computing 139 Parallel Versions of Apply Functions 140 Example: Snakes and Ladders 140 Exit Functions with Care 141 Parallel Code under Linux and OS X 141 Rcpp 142 A Simple C++ Function 143 vi | Table of Contents The cppFunction() Command 144 C++ Data Types 145 The sourceCpp() Function 145 Vectors and Loops 146 Matrices 149 C++ with Sugar on Top 149 Rcpp Resources 150 References 151 8. Efficient Hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Prerequisites 153 Top Five Tips for Efficient Hardware 153 Background: What Is a Byte? 154 Random Access Memory 155 Hard Drives: HDD Versus SSD 158 Operating Systems: 32-Bit or 64-Bit 159 Central Processing Unit 160 Cloud Computing 162 Amazon EC2 162 9. Efficient Collaboration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Prerequisites 164 Top Five Tips for Efficient Collaboration 164 Coding Style 164 Reformatting Code with RStudio 165 Filenames 165 Loading Packages 166 Commenting 166 Object Names 167 Example Package 167 Assignment 168 Spacing 168 Indentation 168 Curly Braces 169 Version Control 169 Commits 170 Git Integration in RStudio 170 GitHub 171 Branches, Forks, Pulls, and Clones 172 Code Review 173 References 174 Table of Contents | vii 10. Efficient Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Prerequisties 175 Top Five Tips for Efficient Learning 175 Using R’s Internal Help 176 Searching R for Topics 177 Finding and Using Vignettes 178 Getting Help on Functions 179 Reading R Source Code 181 swirl 182 Online Resources 182 Stack Overflow 183 Mailing Lists and Groups 183 Asking a Question 184 Minimal Dataset 184 Minimal Example 185 Learning In Depth 185 Spread the Knowledge 187 References 187 A. Package Dependencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 B. References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 viii | Table of Contents Preface Efficient R Programming is about increasing the amount of work you can do with R in a given amount of time. It’s about both computational and programmer efficiency. There are many excellent R resources about topics such as visualization (e.g., Chang 2012), data science (e.g., Grolemund and Wickham 2016), and package development (e.g., Wickham 2015). There are even more resources on how to use R in particular domains, including Bayesian statistics, machine learning, and geographic information systems. However, there are very few unified resources on how to simply make R work effectively. Hints, tips, and decades of community knowledge on the subject are scattered across hundreds of internet pages, email threads, and discussion forums, making it challenging for R users to understand how to write efficient code. In our teaching we have found that this issue applies to beginners and experienced users alike. Whether it’s a question of understanding how to use R’s vector objects to avoid for loops, knowing how to set up your .Rprofile and .Renviron files, or the abil‐ ity to harness R’s excellent C++ interface to do the heavy lifting, the concept of effi‐ ciency is key. The book aims to distill tips, warnings, and tricks of the trade into a single, cohesive whole that provides a useful resource to R programmers of all stripes for years to come. The content of the book reflects the questions that our students from a range of disci‐ plines, skill levels, and industries have asked over the years to make their R work faster. How to set up my system optimally for R programming work? How can one apply general principles from computer science (such as do not repeat yourself, aka DRY) to the specifics of an R script? How can R code be incorporated into an efficient workflow, including project inception, collaboration, and write-up? And how can one quickly learn how to use new packages and functions? The book answers these questions and more in 10 self-contained chapters. Each chapter starts with the basics and gets progressively more advanced, so there is some‐ thing for everyone in each one. While more advanced topics such as parallel pro‐ gramming and C++ may not be immediately relevant to R beginners, the book helps ix