Introduction to Machine Learning with R: Rigorous Mathematical Analysis
by Scott V. Burger

Copyright © 2018 Scott Burger. Printed in the United States of America.
March 2018: First Edition
Revision History for the First Edition: 2018-03-08, First Release
Errata: http://oreilly.com/catalog/errata.csp?isbn=9781491976449
ISBN: 978-1-491-97644-9 [LSI]

Contents

Preface

1. What Is a Model?
    Algorithms Versus Models: What's the Difference?
    A Note on Terminology
    Modeling Limitations
    Statistics and Computation in Modeling
    Data Training
    Cross-Validation
    Why Use R?
    The Good
    R and Machine Learning
    The Bad
    Summary

2. Supervised and Unsupervised Machine Learning
    Supervised Models
    Regression
    Training and Testing of Data
    Classification
    Logistic Regression
    Supervised Clustering Methods
    Mixed Methods
    Tree-Based Models
    Random Forests
    Neural Networks
    Support Vector Machines
    Unsupervised Learning
    Unsupervised Clustering Methods
    Summary

3. Sampling Statistics and Model Training in R
    Bias
    Sampling in R
    Training and Testing
    Roles of Training and Test Sets
    Why Make a Test Set?
    Training and Test Sets: Regression Modeling
    Training and Test Sets: Classification Modeling
    Cross-Validation
    k-Fold Cross-Validation
    Summary

4. Regression in a Nutshell
    Linear Regression
    Multivariate Regression
    Regularization
    Polynomial Regression
    Goodness of Fit with Data—The Perils of Overfitting
    Root-Mean-Square Error
    Model Simplicity and Goodness of Fit
    Logistic Regression
    The Motivation for Classification
    The Decision Boundary
    The Sigmoid Function
    Binary Classification
    Multiclass Classification
    Logistic Regression with Caret
    Summary
    Linear Regression
    Logistic Regression

5. Neural Networks in a Nutshell
    Single-Layer Neural Networks
    Building a Simple Neural Network by Using R
    Multiple Compute Outputs
    Hidden Compute Nodes
    Multilayer Neural Networks
    Neural Networks for Regression
    Neural Networks for Classification
    Neural Networks with caret
    Regression
    Classification
    Summary

6. Tree-Based Methods
    A Simple Tree Model
    Deciding How to Split Trees
    Tree Entropy and Information Gain
    Pros and Cons of Decision Trees
    Tree Overfitting
    Pruning Trees
    Decision Trees for Regression
    Decision Trees for Classification
    Conditional Inference Trees
    Conditional Inference Tree Regression
    Conditional Inference Tree Classification
    Random Forests
    Random Forest Regression
    Random Forest Classification
    Summary

7. Other Advanced Methods
    Naive Bayes Classification
    Bayesian Statistics in a Nutshell
    Application of Naive Bayes
    Principal Component Analysis
    Linear Discriminant Analysis
    Support Vector Machines
    k-Nearest Neighbors
    Regression Using kNN
    Classification Using kNN
    Summary

8. Machine Learning with the caret Package
    The Titanic Dataset
    Data Wrangling
    caret Unleashed
    Imputation
    Data Splitting
    caret Under the Hood
    Model Training
    Comparing Multiple caret Models
    Summary

A. Encyclopedia of Machine Learning Models in caret

Index

Preface

In this short introduction, I tackle a few key points.

Who Should Read This Book?

This book is ideally suited for people who have some working knowledge of the R programming language. If you don't have any knowledge of R, it's an easy enough language to pick up, and the code is readable enough that you can pretty much get the gist of the examples herein.

Scope of the Book

This book is an introductory text, so we don't dive deeply into the mathematical underpinnings of every algorithm covered. Presented here are enough of the details for you to discern the difference between a neural network and, say, a random forest at a high level.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

CHAPTER 1
What Is a Model?

There was a time in my undergraduate physics studies when I was excited to learn what a model was. I remember the scene pretty well. We were in a Stars and Galaxies class, getting ready to learn about atmospheric models that could be applied not only to the Earth, but to other planets in the solar system as well. I knew enough about climate models to know they were complicated, so I braced myself for an onslaught of math that would take me weeks to parse. When we finally got to the meat of the subject, I was somewhat let down: I had already dealt with data models in the past and hadn't even realized it!

Because models are a fundamental aspect of machine learning, perhaps it's not surprising that this story mirrors how I learned to understand the field of machine learning. During my graduate studies, I was on the fence about going into the financial industry. I had heard that machine learning was being used extensively in that world, and, as a lowly physics major, I felt I would need to be more of a computational engineer to compete. I came to a similar realization: not only was machine learning not as scary a subject as I had originally thought, but I had indeed been using it before. Since before high school, even!
Models are helpful because, unlike dashboards, which offer a static picture of what the data shows currently (or at a particular slice in time), models can go further and help you understand the future. For example, someone who is working on a sales team might only be familiar with reports that show a static picture. Maybe their screen is always up to date with the daily sales figures. There have been countless dashboards that I've seen and built that simply say "this is how many assets are in right now" or "this is what our key performance indicator is for today." A report is a static entity that doesn't offer any intuition as to how it evolves over time. Figure 1-1 shows what a report might look like:

    op <- par(mar = c(10, 4, 4, 2) + 0.1)  # margin formatting
    barplot(mtcars$mpg,
            names.arg = row.names(mtcars),
            las = 2,
            ylab = "Fuel Efficiency in Miles per Gallon")

Figure 1-1. A distribution of vehicle fuel efficiency based on the built-in mtcars dataset found in R

Figure 1-1 depicts a plot of the mtcars dataset that comes prebuilt with R. The figure shows a number of cars plotted by their fuel efficiency in miles per gallon. This report isn't very interesting: it doesn't give us any predictive power. Seeing how the efficiency of the cars is distributed is nice, but how can we relate that to other things in the data and, moreover, make predictions from it?

A model is any sort of function that has predictive power. So how do we turn this boring report into something more useful? How do we bridge the gap between reporting and machine learning? Often the correct answer is "more data!" That can come in the form of more observations of the same data or by collecting new types of data that we can then use for comparison. Let's take a look at the built-in mtcars dataset that comes with R in more detail:
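As a minimal sketch of where that closer look can lead (this example is illustrative rather than taken from the book's own listings, and the choice of wt, vehicle weight in thousands of pounds, as a predictor is an assumption made for illustration), we can inspect the dataset's structure and then fit a one-variable linear model:

    # Inspect the dataset: 32 cars, 11 numeric variables
    str(mtcars)
    head(mtcars)

    # Fit a minimal model: predict fuel efficiency from vehicle weight
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)

    # Unlike the static bar chart, the model can estimate mpg for a car
    # it has never seen, e.g., one weighing 3,000 pounds (wt = 3.0)
    predict(fit, newdata = data.frame(wt = 3.0))

Even this tiny model answers a question that the bar chart in Figure 1-1 cannot: given a new car's weight, what fuel efficiency should we expect?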
