ebook img

Tidy Modeling with R: A Framework for Modeling in the Tidyverse PDF

384 Pages·2022·21.39 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Tidy Modeling with R: A Framework for Modeling in the Tidyverse

Tidy Modeling with R A Framework for Modeling in the Tidyverse K u h n & S ilg e Max Kuhn & Julia Silge Tidy Modeling with R Get going with tidymodels, a collection of R packages for modeling and machine learning. Whether you’re just starting “The tidymodels out or have years of experience with modeling, this practical framework combines introduction shows data analysts, business analysts, and data human-centric design scientists how the tidymodels framework offers a consistent, with statistical best flexible approach for your work. practices, and I can’t RStudio engineers Max Kuhn and Julia Silge demonstrate think of a better way to ways to create models by focusing on an R dialect called the learn it than from Max tidyverse. Software that adopts tidyverse principles shares and Julia.” both a high-level design philosophy and low-level grammar — Hadley Wickham and data structures, so learning one piece of the ecosystem Chief Scientist, RStudio makes it easier to learn the next. You’ll understand why the tidymodels framework has been built to be used by a broad “This book offers a unified range of people. and systematic approach to building, analyzing, With this book, you will: and evaluating statistical • Learn the steps necessary to build a model from beginning models in R.” to end — Balasubramanian Narasimhan • Understand how to use different modeling and feature Senior Research Scientist, engineering approaches fluently Stanford University • Examine the options for avoiding common pitfalls of modeling, such as overfitting Max Kuhn is a software engineer at RStudio, where he works to improve • Learn practical methods to prepare your data for modeling R’s modeling capabilities. He’s applied • Tune models for optimal performance models in the pharmaceutical and diagnostic industries for more than • Use good statistical practices to compare, evaluate, and 18 years. choose among models K Julia Silge is a software engineer at u h RStudio working on open source n modeling tools. With a PhD in & astrophysics, she’s worked as a data S scientist in tech and the nonprofit sector. ilg e DATA SCIENCE Twitter: @oreillymedia linkedin.com/company/oreilly-media US $59.99 CAN $74.99 youtube.com/oreillymedia ISBN: 978-1-492-09648-1 Tidy Modeling with R A Framework for Modeling in the Tidyverse Max Kuhn and Julia Silge BBeeiijjiinngg BBoossttoonn FFaarrnnhhaamm SSeebbaassttooppooll TTookkyyoo Tidy Modeling with R by Max Kuhn and Julia Silge Copyright © 2022 Max Kuhn and Julia Silge. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Michelle Smith Indexer: Potomac Indexing, LLC Development Editor: Rita Fernando Interior Designer: David Futato Production Editor: Beth Kelly Cover Designer: Karen Montgomery Copyeditor: Piper Editorial Consulting, LLC Illustrator: Kate Dullea Proofreader: Tom Sullivan July 2022: First Edition Revision History for the First Edition 2022-07-12: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492096481 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Tidy Modeling with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-492-09648-1 [LSI] To Amy: When you read this, know that I love you more today than every day before. —M.K. To Robert: Happy 20 years of choosing each other. —J.S. Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii Part I. Introduction 1. Software for Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Fundamentals for Modeling Software 4 Types of Models 5 Descriptive Models 6 Inferential Models 8 Predictive Models 9 Connections Between Types of Models 10 Some Terminology 11 How Does Modeling Fit into the Data Analysis Process? 12 Chapter Summary 15 2. A Tidyverse Primer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Tidyverse Principles 17 Design for Humans 18 Reuse Existing Data Structures 19 Design for the Pipe and Functional Programming 20 Examples of Tidyverse Syntax 23 Chapter Summary 25 3. A Review of R Modeling Fundamentals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 An Example 27 What Does the R Formula Do? 33 Why Tidiness Is Important for Modeling 34 vi Combining Base R Models and the Tidyverse 38 The tidymodels Metapackage 39 Chapter Summary 41 Part II. Modeling Basics 4. The Ames Housing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Exploring Features of Homes in Ames 47 Chapter Summary 52 5. Spending Our Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Common Methods for Splitting Data 56 What About a Validation Set? 59 Multilevel Data 59 Other Considerations for a Data Budget 60 Chapter Summary 61 6. Fitting Models with parsnip. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Create a Model 63 Use the Model Results 69 Make Predictions 71 parsnip-Extension Packages 73 Creating Model Specifications 73 Chapter Summary 74 7. A Model Workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Where Does the Model Begin and End? 75 Workflow Basics 78 Adding Raw Variables to the workflow() 80 How Does a workflow() Use the Formula? 81 Tree-Based Models 82 Special Formulas and Inline Functions 82 Creating Multiple Workflows at Once 84 Evaluating the Test Set 86 Chapter Summary 87 8. Feature Engineering with Recipes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 A Simple recipe() for the Ames Housing Data 90 Using Recipes 92 How Data Are Used by the recipe() 94 Examples of Steps 94 vii | Table of Contents Encoding Qualitative Data in a Numeric Format 95 Interaction Terms 97 Spline Functions 99 Feature Extraction 101 Row Sampling Steps 102 General Transformations 103 Natural Language Processing 103 Skipping Steps for New Data 103 Tidy a recipe() 104 Column Roles 106 Chapter Summary 107 9. Judging Model Effectiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Performance Metrics and Inference 110 Regression Metrics 112 Binary Classification Metrics 114 Multiclass Classification Metrics 117 Chapter Summary 122 Part III. Tools for Creating Effective Models 10. Resampling for Evaluating Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 The Resubstitution Approach 125 Resampling Methods 128 Cross-Validation 129 Repeated Cross-Validation 132 Leave-One-Out Cross-Validation 133 Monte Carlo Cross-Validation 133 Validation Sets 134 Bootstrapping 135 Rolling Forecasting Origin Resampling 136 Estimating Performance 138 Parallel Processing 143 Saving the Resampled Objects 144 Chapter Summary 147 11. Comparing Models with Resampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Creating Multiple Models with Workflow Sets 149 Comparing Resampled Performance Statistics 152 Simple Hypothesis Testing Methods 155 Bayesian Methods 157 Table of Contents | viii A Random Intercept Model 158 The Effect of the Amount of Resampling 163 Chapter Summary 164 12. Model Tuning and the Dangers of Overfitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Model Parameters 165 Tuning Parameters for Different Types of Models 166 What Do We Optimize? 168 The Consequences of Poor Parameter Estimates 173 Two General Strategies for Optimization 176 Tuning Parameters in tidymodels 177 Chapter Summary 183 13. Grid Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Regular and Nonregular Grids 185 Regular Grids 186 Nonregular Grids 188 Evaluating the Grid 191 Finalizing the Model 196 Tools for Creating Tuning Specifications 198 Tools for Efficient Grid Search 199 Submodel Optimization 199 Parallel Processing 200 Benchmarking Boosted Trees 204 Access to Global Variables 206 Racing Methods 208 Chapter Summary 210 14. Iterative Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 A Support Vector Machine Model 213 Bayesian Optimization 216 A Gaussian Process Model 216 Acquisition Functions 218 The tune_bayes() Function 222 Simulated Annealing 227 Simulated Annealing Search Process 227 The tune_sim_anneal() Function 230 Chapter Summary 234 15. Screening Many Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Modeling Concrete Mixture Strength 235 Creating the Workflow Set 238 ix | Table of Contents

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.