Reliable Machine Learning
Applying SRE Principles to ML in Production

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood

Reliable Machine Learning
by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood

Copyright © 2022 Capriole Consulting Inc., Kranti Parisa, Niall Murphy, and Todd Underwood. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: John Devins
Development Editor: Angela Rufino
Production Editor: Ashley Stussy
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2022: First Edition

Revision History for the Early Release
2021-10-12: First Release
2021-11-22: Second Release
2022-01-12: Third Release
2022-02-04: Fourth Release
2022-03-29: Fifth Release
2022-04-05: Sixth Release
2022-05-09: Seventh Release
2022-07-06: Eighth Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098106225 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Reliable Machine Learning, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10615-7
[LSI]

Prospective Table of Contents (Subject to Change)

Preface
Chapter 1: Introduction
Chapter 2: Data Management Principles
Chapter 3: Basic Introduction to Models
Chapter 4: Feature and Training Data
Chapter 5: Evaluating Model Validity and Quality
Chapter 6: Fairness, Privacy, and Ethical Machine Learning Systems
Chapter 7: Training Systems
Chapter 8: Serving
Chapter 9: Monitoring and Observability for Models
Chapter 10: Continuous ML
Chapter 11: Incident Response
Chapter 12: How Product and ML Interact
Chapter 13: Integrating ML Into Your Organization
Chapter 14: Practical ML Org Implementation Examples
Chapter 15: Case Studies: ML Ops in Practice

Preface

This is not a book about how machine learning works. This is a book about how to make machine learning work—for you.

The way that machine learning (ML) works is fascinating. The math, algorithms, and statistical insights that surround and support ML are themselves of interest, and what they can achieve when applied to the right data can be nothing short of magical.
But we do something a little different in this book. We are not algorithm-oriented—we are whole-system-oriented. In short, we talk about everything other than the algorithms. There are plenty of other works covering the algorithmic component of ML in great detail, but this one is deliberately focused on the whole lifecycle of machine learning, giving it the time and attention it doesn't really get elsewhere. This means we talk about the messy, complicated, and occasionally frustrating work of shepherding data correctly and responsibly, building models reliably, ensuring a smooth (and reversible) path to production, and updating safely, as well as concerns about cost, performance, business goals, and organizational structure. We attempt to cover everything involved in making machine learning happen reliably in your organization.

Why We Wrote This Book

We firmly believe at least some of the hype: ML and AI techniques are currently reshaping computing and society, and at an accelerating rate. In some respects, the public hype has not yet caught up with the private reality.1 But we are also grounded and experienced enough to understand just how laughably unreliable and problematic many real-world ML systems actually are. The technology press writes about space flight while most organizations still have trouble staying upright on their bicycles; these are still the early days.

Now is the perfect time to pay active attention to what ML can do and how your organization might benefit from it. Having said this, though, we recognize that many organizations are worried about 'missing out' on machine learning and everything it could do for (and to) their organization. The good news is that there's no need to panic—it is possible to get started now and to be sensible and disciplined about how you work with ML, in a way that successfully balances obligation and reward. The bad news, and the reason many organizations are worried, is that the curve of complexity is quite steep. Once you get past the simpler aspects, many of the techniques and technologies are just being invented, and it's hard to find a solid paved path.

This book should help you navigate that complexity. We believe that, despite the immaturity of the industry, there is much to be gained by focusing on simplicity and standardization, an approach that has the beneficial side effect of making it easier to get started. Ultimately, organizations that deeply integrate ML into their business will benefit—some substantially2—but they will of course need a degree of sophistication about how that is done. A simpler, standardized foundation will develop that capability better than ad hoc experiments or, even worse, a system that works but that no one knows how, or why.

SRE as the lens on ML

A plethora of machine learning books exist already, many of which promise to make your ML journey better in some way, often by making it easier, faster, or more productive. Few, however, talk about how to make ML more reliable, an attribute often overlooked or undervalued.3 That is what we focus on, because looking at how to do ML well through that lens has specific benefits you don't get in other ways. The reality is that current development best practices don't map straightforwardly onto the challenges of doing ML well end to end.
Instead, seeing these questions through a site reliability engineering (SRE) lens—holistically, sustainably, and with the customer experience in mind—is a much better framework for understanding how to meet those challenges. You can find a parallel argument in O'Reilly's Building Secure and Reliable Systems: an unreliable system can often be parlayed into system access for an attacker, because security and reliability are intimately connected, and doing one well is not easily separated from doing the other. Similarly, ML systems, with their surprising behaviors and indirect yet profound interconnections, motivate a more holistic approach to deciding how to integrate development, deployment, production operations, and long-term care. In short, we believe that reliability captures the essence of what customers, business owners, and staff really want from ML systems.

Intended Audience

We are writing for anyone who wants to take ML into the real world and make a difference in their organization. Accordingly, this book is for data scientists and ML engineers, for software engineers and site reliability engineers, and for organizational decision makers—even non-technical ones, although parts of the book are quite technical.

Data scientists and ML engineers: we'll explore how the data, features, and model architecture you use change how your model works and how manageable it is in the long run, all with an eye to model velocity.

Software engineers building infrastructure systems for ML, or integrating ML into existing products: we address both how to integrate ML into systems and how to write ML infrastructure. An improved understanding of how the ML lifecycle works helps with developing functionality, designing APIs, and supporting customers.

Site reliability engineers: we'll show how ML systems typically break and how best to build (and manage) them to avoid those failure modes. We'll also explore the implications of the fact that ML model quality is not something a reliability engineer can entirely ignore.

Organizational leaders who want to add ML to existing products or services: we will help you understand how best to integrate ML into your existing products and services, and what structures and organizational patterns are required. Having a sensible way of assessing risks and advantages when making ML-related decisions is important.

And finally, everyone who is rightfully concerned about the ethical, legal, and privacy implications of developing and deploying ML: we will lay out the issues clearly and point to practical steps you can take to address these concerns before they cause damage to your users or your organization.

One perhaps counterintuitive thing to note: many of the chapters are potentially most valuable to the people whose work is not the topic of that chapter. For example, Chapter 2, Data Management Principles, can certainly be read by data scientists and ML engineers, but it's potentially even more useful to infrastructure/production engineers and organizational leaders. For the former group, fine-tuning what you're already working on is of course useful, but for the latter it can provide a fresh and complete introduction to a topic area that may be entirely new.

How this book is organized

Before we talk about the structure of the book in detail, let's provide some broader context about how we selected the structure and topics. It might not be what you were expecting.
Our Approach

There are specific approaches and techniques that engineers need to employ to make ML systems work well. But each of these approaches is subject to an enormous number of decisions once put into place in a particular organization and for a particular purpose. It is not feasible for this book to cover all, or even most, of the implementation choices that readers will face. Similarly, we de-emphasize concrete recommendations for specific pieces of software. We hope that this separation from the day-to-day will allow us to express ideas more clearly, and for this kind of book, remaining platform-agnostic is beneficial in and of itself.

Let's Knit!

Though from time to time we use other examples, our main method of illustrating what we're talking about is a hypothetical online store—a purveyor of textile supplies called yarnit.ai. This concept is worked through in some detail throughout the book to demonstrate how choices made at one stage, say data acquisition or normalization, ripple through to consequences for the rest of the stack, the business, and so on. It is a single, relatively simple business: buy knitting and crocheting products, put them on a website, and market and sell them to customers. This explicitly does not capture the full complexity of the sectors using ML in the real world, such as manufacturing, self-driving cars, real estate marketing, and medical technology, but it provides enough insight (and keeps the scope manageable) that we believe the implementation complexities we deal with here offer lessons applicable to other domains. (In other words, the limitations of our example are, we believe, worth it.)