Practical Machine Learning with H2O
Powerful, Scalable Techniques for Deep Learning and AI

Darren Cook

Beijing · Boston · Farnham · Sebastopol · Tokyo

Practical Machine Learning with H2O
by Darren Cook

Copyright © 2017 Darren Cook. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Kim Cofer
Proofreader: Charles Roumeliotis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-12-01: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491964606 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Machine Learning with H2O, the cover image of a crayfish, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-491-96460-6

Table of Contents

Preface

1. Installation and Quick-Start
   Preparing to Install
   Installing R
   Installing Python
   Privacy
   Installing Java
   Install H2O with R (CRAN)
   Install H2O with Python (pip)
   Our First Learning
   Training and Predictions, with Python
   Training and Predictions, with R
   Performance Versus Predictions
   On Being Unlucky
   Flow
   Data
   Models
   Predictions
   Other Things in Flow
   Summary

2. Data Import, Data Export
   Memory Requirements
   Preparing the Data
   Getting Data into H2O
   Load CSV Files
   Load Other File Formats
   Load Directly from R
   Load Directly from Python
   Data Manipulation
   Laziness, Naming, Deleting
   Data Summaries
   Operations on Columns
   Aggregating Rows
   Indexing
   Split Data Already in H2O
   Rows and Columns
   Getting Data Out of H2O
   Exporting Data Frames
   POJOs
   Model Files
   Save All Models
   Summary

3. The Data Sets
   Data Set: Building Energy Efficiency
   Setup and Load
   The Data Columns
   Splitting the Data
   Let’s Take a Look!
   About the Data Set
   Data Set: Handwritten Digits
   Setup and Load
   Taking a Look
   Helping the Models
   About the Data Set
   Data Set: Football Scores
   Correlations
   Missing Data… And Yet More Columns
   How to Train and Test?
   Setup and Load
   The Other Third
   Missing Data (Again)
   Setup and Load (Again)
   About the Data Set
   Summary

4. Common Model Parameters
   Supported Metrics
   Regression Metrics
   Classification Metrics
   Binomial Classification
   The Essentials
   Effort
   Scoring and Validation
   Early Stopping
   Checkpoints
   Cross-Validation (aka k-folds)
   Data Weighting
   Sampling, Generalizing
   Regression
   Output Control
   Summary

5. Random Forest
   Decision Trees
   Random Forest
   Parameters
   Building Energy Efficiency: Default Random Forest
   Grid Search
   Cartesian
   RandomDiscrete
   High-Level Strategy
   Building Energy Efficiency: Tuned Random Forest
   MNIST: Default Random Forest
   MNIST: Tuned Random Forest
   Enhanced Data
   Football: Default Random Forest
   Football: Tuned Random Forest
   Summary

6. Gradient Boosting Machines
   Boosting
   The Good, the Bad, and… the Mysterious
   Parameters
   Building Energy Efficiency: Default GBM
   Building Energy Efficiency: Tuned GBM
   MNIST: Default GBM
   MNIST: Tuned GBM
   Football: Default GBM
   Football: Tuned GBM
   Summary

7. Linear Models
   GLM Parameters
   Building Energy Efficiency: Default GLM
   Building Energy Efficiency: Tuned GLM
   MNIST: Default GLM
   MNIST: Tuned GLM
   Football: Default GLM
   Football: Tuned GLM
   Summary

8. Deep Learning (Neural Nets)
   What Are Neural Nets?
   Numbers Versus Categories
   Network Layers
   Activation Functions
   Parameters
   Deep Learning Regularization
   Deep Learning Scoring
   Building Energy Efficiency: Default Deep Learning
   Building Energy Efficiency: Tuned Deep Learning
   MNIST: Default Deep Learning
   MNIST: Tuned Deep Learning
   Football: Default Deep Learning
   Football: Tuned Deep Learning
   Summary
   Appendix: More Deep Learning Parameters

9. Unsupervised Learning
   K-Means Clustering
   Deep Learning Auto-Encoder
   Stacked Auto-Encoder
   Principal Component Analysis
   GLRM
   Missing Data
   GLRM
   Lose the R!
   Summary

10. Everything Else
   Staying on Top of and Poking into Things
   Installing the Latest Version
   Building from Source
   Running from the Command Line
   Clusters
   EC2
   Other Cloud Providers
   Hadoop
   Spark / Sparkling Water
   Naive Bayes
   Ensembles
   Stacking: h2o.ensemble
   Categorical Ensembles
   Summary

11. Epilogue: Didn’t They All Do Well!
   Building Energy Results
   MNIST Results
   Football Data
   How Low Can You Go?
   The More the Merrier
   Still Desperate for More
   Filtering for Hardness
   Auto-Encoder
   Convolute and Shrink
   Ensembles
   That Was as Low as I Go…
   Summary

Index
Description: