Machine Learning at Scale with H2O: A practical guide to building and deploying machine learning models on enterprise systems

396 Pages·2022·14.352 MB·English


Machine Learning at Scale with H2O

A practical guide to building and deploying machine learning models on enterprise systems

Gregory Keys
David Whiting

BIRMINGHAM - MUMBAI

Machine Learning at Scale with H2O

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Aditi Gour
Senior Editor: David Sugarman
Content Development Editor: Manikandan Kurup
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Alishon Mendonca
Marketing Coordinator: Abeer Riyaz Dawe

First published: July 2022
Production reference: 1290622

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-80056-601-9
www.packt.com

My deepest love and warmth to Mary, Julia and Alexa for their support and understanding while husband and dad disappeared to the basement for significant chunks of nights and weekends as the seasons progressed.
- Gregory

To my wife Kathy, and son Ben, who endured too many late nights and weekends of dad locked away in his study working; the book has been a family effort and its culmination a family success.
- David

Acknowledgments

This book would not have been possible without the approval and support of our respective leaders at H2O.ai at the time of its writing, Dmitry Baev and Eyal Kaldes. In addition, we pay our great appreciation to the deep expertise of the many Makers at H2O.ai. Their day-to-day collaboration, education, and machine learning expertise are diffused throughout the pages of this book. One name needs to be called out in particular: massive thanks to Eric Gudgeon for his infinite and unrelenting technical teachings, and for defining and developing a vast landscape of H2O model deployment implementations.

This book took longer to pull together than either of us expected. Working at a hyper-focused and highly energized company certainly was a contributing factor. Against this backdrop, we appreciate the world-class patience, encouragement, guidance, and professionalism of the Packt team in collaborating on this book from start to finish.

And most importantly there is family, who unfairly signed up for book writing without fully knowing it.

Contributors

About the authors

Gregory Keys is a master principal cloud architect for Data and AI at Oracle. Formerly a senior solutions architect at H2O.ai, he has over 20 years of experience designing and implementing software and data systems. He specializes in AI/ML solutions and has multiple software patents. Gregory has a PhD in evolutionary biology, which has greatly influenced him as a systems thinker.

David Whiting is a data science director and head of training at H2O.ai. He has a PhD in statistics from Texas A&M University and over 25 years of professional experience in academia, consulting, and industry. He has built and led data science teams in financial services and other regulated enterprises.

About the reviewers

Jan Gamec is a lead software engineer at H2O.ai and one of the top contributors to a state-of-the-art AutoML platform called Driverless AI. In the past decade, he has contributed to various projects, focusing on machine learning, cryptography, and web technologies, either in the public or academic sector. Jan holds a master's degree in machine learning and computer science from CTU, Czech Republic, with the main focus of interest being genetic programming, neural networks, and reinforcement learning.

Jagadeesh Rajarajan has over 10 years of experience in building scalable data science systems. He has rich domain knowledge in the following areas: search relevance (information retrieval), recommender systems, AI for customer engagement (acquisition, activation, and retention), MLOps, and interpretable machine learning systems.

Eric Gudgeon has worked on many large complex systems, built nationwide networks, and helped customers deploy highly scalable low-latency solutions. He has a passion for technology and finding creative solutions to problems.

Ondrej Bilek is a lead software engineer at H2O.ai and has rich experience designing and implementing machine learning platforms for Hadoop and Kubernetes. He led the development of Enterprise Steam and is currently working on the H2O AI Cloud.

Table of Contents

Preface

Section 1 – Introduction to the H2O Machine Learning Platform for Data at Scale

1 Opportunities and Challenges
  ML at scale
  The ML life cycle and three challenge areas for ML at scale
  A simplified ML life cycle
  The model building challenge – state-of-the-art models at scale
  The business challenge – getting your models into enterprise production systems
  The navigation challenge – navigating the enterprise stakeholder landscape
  H2O.ai's answer to these challenges
  Summary

2 Platform Components and Key Concepts
  Technical requirements
  Hello World – the H2O machine learning code
  Code example
  Some issues of scale
  The components of H2O machine learning at scale
  H2O Core – in-memory distributed model building
  H2O Enterprise Steam – a managed, self-provisioning portal
  The H2O MOJO – a flexible, low-latency scoring artifact
  The workflow using H2O components
  H2O key concepts
  The data scientist's experience
  The H2O cluster
  Enterprise Steam as an H2O gateway
  Enterprise Steam and the H2O Core high-level architecture
  Sparkling Water allows users to code in H2O and Spark seamlessly
  MOJOs export as DevOps-friendly artifacts
  Summary

3 Fundamental Workflow – Data to Deployable Model
  Technical requirements
  Use case and data overview
  The fundamental workflow
  Step 1 – launching the H2O cluster
  Step 2 – connecting to the H2O cluster
  Step 3 – building the model
  Step 4 – evaluating and explaining the model
  Step 5 – exporting the model's scoring artifact
  Step 6 – shutting down the cluster
  Variation points – alternatives and extensions to the fundamental workflow
  Launching an H2O cluster using the Enterprise Steam API versus the UI (step 1)
  Launching an H2O-3 versus Sparkling Water cluster (step 1)
  Implementing Enterprise Steam or not (steps 1–2)
  Using a personal access token to log in to Enterprise Steam (step 2)
  Building the model (step 3)
  Evaluating and explaining the model (step 4)
  Exporting the model's scoring artifact (step 5)
  Shutting down the cluster (step 6)
  Summary

Section 2 – Building State-of-the-Art Models on Large Data Volumes Using H2O

4 H2O Model Building at Scale – Capability Articulation
  H2O data capabilities during model building
  Ingesting data from the source to the H2O cluster
  Manipulating data in the H2O cluster
  Exporting data out of the H2O cluster
  Additional data capabilities provided by Sparkling Water
  H2O machine learning algorithms
  H2O unsupervised learning algorithms
  H2O supervised learning algorithms
  Parameters and hyperparameters
  H2O extensions of supervised learning
  Miscellaneous
  H2O modeling capabilities
  H2O model training capabilities
  H2O model evaluation capabilities
  H2O model explainability capabilities
  H2O trained model artifacts
  Summary

5 Advanced Model Building – Part I
  Technical requirements
  Splitting data for validation or cross-validation and testing
  Train, validate, and test set splits
  Train and test splits for k-fold cross-validation
  Algorithm considerations
  An introduction to decision trees
  Random forests
  Gradient boosting
  Baseline model training
  Model optimization with grid search
  Step 1 – a Cartesian grid search to focus on the best tree depth
  Step 2 – a random grid search to tune other parameters
  H2O AutoML
  The AutoML leaderboard
  Feature engineering options
  Target encoding
  Other feature engineering options
  Leveraging H2O Flow to enhance your IDE workflow
  Monitoring with Flow
  Interactive investigations with Flow
  Putting it all together – algorithms, feature engineering, grid search, and AutoML
  An enhanced AutoML procedure
  Summary

6 Advanced Model Building – Part II
  Technical requirements
  Modeling in Sparkling Water
  Introducing Sparkling Water pipelines
  Implementing a sentiment analysis pipeline
  Importing the raw Amazon data
  Defining Spark pipeline stages
  Creating a Sparkling Water pipeline
  Looking ahead – a production preview
  UL methods in H2O
  What is anomaly detection?
