Implementing Service Level Objectives
A Practical Guide to SLIs, SLOs & Error Budgets
Alex Hidalgo

Praise for Implementing Service Level Objectives

“SLIs and SLOs are core practices of the discipline of SRE, but they’re trickier than they look. Alex and his merry band of SRE luminaries have a metric ton of experience and are here to help.”
—David N. Blank-Edelman, Curator/Editor of Seeking SRE and Cofounder of SREcon

“Practical examples of software reliability are hard to come by, but this book has done it... A must-read for ensuring that your end users are happy and successful.”
—Robert Ross, CEO at FireHydrant

“An approachable, clear guide that enables ‘normal’ companies to achieve Google SRE quality monitoring. I can’t recommend this book enough!”
—Thomas A. Limoncelli, SRE Manager, Stack Overflow, Inc.

Beijing Boston Farnham Sebastopol Tokyo

Implementing Service Level Objectives
by Alex Hidalgo

Copyright © 2020 Alex Hidalgo. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: John Devins
Development Editor: Corbin Collins
Production Editor: Deborah Baker
Copyeditor: Rachel Head
Proofreader: Piper Editorial, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: O’Reilly Media, Inc.

September 2020: First Edition

Revision History for the First Edition
2020-08-04: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492076810 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Implementing Service Level Objectives, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-07681-0
[GP]

Table of Contents

Foreword
Preface

Part I. SLO Development

1. The Reliability Stack
    Service Truths
    The Reliability Stack
    Service Level Indicators
    Service Level Objectives
    Error Budgets
    What Is a Service?
    Example Services
    Things to Keep in Mind
    SLOs Are Just Data
    SLOs Are a Process, Not a Project
    Iterate Over Everything
    The World Will Change
    It’s All About Humans
    Summary

2. How to Think About Reliability
    Reliability Engineering
    Past Performance and Your Users
    Implied Agreements
    Making Agreements
    A Worked Example of Reliability
    How Reliable Should You Be?
    100% Isn’t Necessary
    Reliability Is Expensive
    How to Think About Reliability
    Summary

3. Developing Meaningful Service Level Indicators
    What Meaningful SLIs Provide
    Happier Users
    Happier Engineers
    A Happier Business
    Caring About Many Things
    A Request and Response Service
    Measuring Many Things by Measuring Only a Few
    A Written Example
    Something More Complex
    Measuring Complex Service User Reliability
    Another Written Example
    Business Alignment and SLIs
    Summary

4. Choosing Good Service Level Objectives
    Reliability Targets
    User Happiness
    The Problem of Being Too Reliable
    The Problem with the Number Nine
    The Problem with Too Many SLOs
    Service Dependencies and Components
    Service Dependencies
    Service Components
    Reliability for Things You Don’t Own
    Open Source or Hosted Services
    Measuring Hardware
    Choosing Targets
    Past Performance
    Basic Statistics
    Metric Attributes
    Percentile Thresholds
    What to Do Without a History
    Summary

5. How to Use Error Budgets
    Error Budgets in Practice
    To Release New Features or Not?
    Project Focus
    Examining Risk Factors
    Experimentation and Chaos Engineering
    Load and Stress Tests
    Blackhole Exercises
    Purposely Burning Budget
    Error Budgets for Humans
    Error Budget Measurement
    Establishing Error Budgets
    Decision Making
    Error Budget Policies
    Summary

Part II. SLO Implementation

6. Getting Buy-In
    Engineering Is More than Code
    Key Stakeholders
    Engineering
    Product
    Operations
    QA
    Legal
    Executive Leadership
    Making It So
    Order of Operation
    Common Objections and How to Overcome Them
    Your First Error Budget Policy (and Your First Critical Test)
    Lessons Learned the Hard Way
    Summary

7. Measuring SLIs and SLOs
    Design Goals
    Flexible Targets
    Testable Targets
    Freshness
    Cost
    Reliability
    Organizational Constraints
    Common Machinery
    Centralized Time Series Statistics (Metrics)
    Structured Event Databases (Logging)
    Common Cases
    Latency-Sensitive Request Processing
    Low-Lag, High-Throughput Batch Processing
    Mobile and Web Clients
    The General Case
    Other Considerations
    Integration with Distributed Tracing
    SLI and SLO Discoverability
    Summary

8. SLO Monitoring and Alerting
    Motivation: What Is SLO Alerting, and Why Should You Do It?
    The Shortcomings of Simple Threshold Alerting
    A Better Way
    How to Do SLO Alerting
    Choosing a Target
    Error Budgets and Response Time
    Error Budget Burn Rate
    Rolling Windows
    Putting It Together
    Troubleshooting with SLO Alerting
    Corner Cases
    SLO Alerting in a Brownfield Setup
    Parting Recommendations
    Summary

9. Probability and Statistics for SLIs and SLOs
    On Probability
    SLI Example: Availability
    SLI Example: Low QPS
    On Statistics
    Maximum Likelihood Estimation
    Maximum a Posteriori
    Bayesian Inference
    SLI Example: Queueing Latency
    Batch Latency
    SLI Example: Durability
    Further Reading
    Summary

10. Architecting for Reliability
    Example System: Image-Serving Service
    Architectural Considerations: Hardware
    Architectural Considerations: Monolith or Microservices
    Architectural Considerations: Anticipating Failure Modes
    Architectural Considerations: Three Types of Requests
    Systems and Building Blocks
    Quantitative Analysis of Systems
    Instrumentation! The System Also Needs Instrumentation!
    Architectural Considerations: Hardware, Revisited
    SLOs as a Result of System SLIs
    The Importance of Identifying and Understanding Dependencies
    Summary

11. Data Reliability
    Data Services
    Designing Data Applications
    Users of Data Services
    Setting Measurable Data Objectives
    Data and Data Application Reliability
    Data Properties
    Data Application Properties
    System Design Concerns
    Data Application Failures
    Other Qualities
    Data Lineage
    Summary

12. A Worked Example
    Dogs Deserve Clothes
    How a Service Grows
    The Design of a Service
    SLIs and SLOs as User Journeys
    Customers: Finding and Browsing Products
    Other Services as Users: Buying Products
    Internal Users
    Platforms as Services
    Summary