Praise for Observability Engineering

Finally, a principled alternative to the guess-and-test approach to answering operational questions like “Why is the system slow?” This book marks a watershed moment for how engineers will think about interrogating system behavior.
—Lorin Hochstein, senior software engineer and O’Reilly author

This book does not shirk away from shedding light on the challenges one might face when bootstrapping a culture of observability on a team, and provides valuable guidance on how to go about it in a sustainable manner that should stand observability practitioners in good stead for long-term success.
—Cindy Sridharan, infrastructure engineer

As your systems get more complicated and distributed, monitoring doesn’t really help you work out what has gone wrong. You need to be able to solve problems you haven’t seen before, and that’s where observability comes in. I’ve learned so much about observability from these authors over the last five years, and I’m delighted that they’ve now written this book that covers both the technical and cultural aspects of introducing and benefiting from observability of your production systems.
—Sarah Wells, former technical director at the Financial Times and O’Reilly author

This excellent book is the perfect companion for any engineer or manager wanting to get the most out of their observability efforts. It strikes the perfect balance between being concise and comprehensive: It lays a solid foundation by defining observability, explains how to use it to debug and keep your services reliable, guides you on how to build a strong business case for it, and finally provides the means to assess your efforts to help with future improvements.
—Mads Hartmann, SRE at Gitpod

Observability Engineering
Achieving Production Excellence

Charity Majors, Liz Fong-Jones, and George Miranda

Beijing Boston Farnham Sebastopol Tokyo

Observability Engineering
by Charity Majors, Liz Fong-Jones, and George Miranda

Copyright © 2022 Hound Technology Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: John Devins
Development Editor: Virginia Wilson
Production Editor: Kate Galloway
Copyeditor: Sharon Wilkey
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

May 2022: First Edition

Revision History for the First Edition
2022-05-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492076445 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Observability Engineering, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Honeycomb. See our statement of editorial independence.

978-1-492-07644-5
[LSI]

Table of Contents

Foreword
Preface

Part I. The Path to Observability

1. What Is Observability?
  The Mathematical Definition of Observability
  Applying Observability to Software Systems
  Mischaracterizations About Observability for Software
  Why Observability Matters Now
  Is This Really the Best Way?
  Why Are Metrics and Monitoring Not Enough?
  Debugging with Metrics Versus Observability
  The Role of Cardinality
  The Role of Dimensionality
  Debugging with Observability
  Observability Is for Modern Systems
  Conclusion

2. How Debugging Practices Differ Between Observability and Monitoring
  How Monitoring Data Is Used for Debugging
  Troubleshooting Behaviors When Using Dashboards
  The Limitations of Troubleshooting by Intuition
  Traditional Monitoring Is Fundamentally Reactive
  How Observability Enables Better Debugging
  Conclusion

3. Lessons from Scaling Without Observability
  An Introduction to Parse
  Scaling at Parse
  The Evolution Toward Modern Systems
  The Evolution Toward Modern Practices
  Shifting Practices at Parse
  Conclusion

4. How Observability Relates to DevOps, SRE, and Cloud Native
  Cloud Native, DevOps, and SRE in a Nutshell
  Observability: Debugging Then Versus Now
  Observability Empowers DevOps and SRE Practices
  Conclusion

Part II. Fundamentals of Observability

5. Structured Events Are the Building Blocks of Observability
  Debugging with Structured Events
  The Limitations of Metrics as a Building Block
  The Limitations of Traditional Logs as a Building Block
  Unstructured Logs
  Structured Logs
  Properties of Events That Are Useful in Debugging
  Conclusion

6. Stitching Events into Traces
  Distributed Tracing and Why It Matters Now
  The Components of Tracing
  Instrumenting a Trace the Hard Way
  Adding Custom Fields into Trace Spans
  Stitching Events into Traces
  Conclusion

7. Instrumentation with OpenTelemetry
  A Brief Introduction to Instrumentation
  Open Instrumentation Standards
  Instrumentation Using Code-Based Examples
  Start with Automatic Instrumentation
  Add Custom Instrumentation
  Send Instrumentation Data to a Backend System
  Conclusion

8. Analyzing Events to Achieve Observability
  Debugging from Known Conditions
  Debugging from First Principles
  Using the Core Analysis Loop
  Automating the Brute-Force Portion of the Core Analysis Loop
  This Misleading Promise of AIOps
  Conclusion

9. How Observability and Monitoring Come Together
  Where Monitoring Fits
  Where Observability Fits
  System Versus Software Considerations
  Assessing Your Organizational Needs
  Exceptions: Infrastructure Monitoring That Can’t Be Ignored
  Real-World Examples
  Conclusion

Part III. Observability for Teams

10. Applying Observability Practices in Your Team
  Join a Community Group
  Start with the Biggest Pain Points
  Buy Instead of Build
  Flesh Out Your Instrumentation Iteratively
  Look for Opportunities to Leverage Existing Efforts
  Prepare for the Hardest Last Push
  Conclusion

11. Observability-Driven Development
  Test-Driven Development
  Observability in the Development Cycle
  Determining Where to Debug
  Debugging in the Time of Microservices
  How Instrumentation Drives Observability
  Shifting Observability Left
  Using Observability to Speed Up Software Delivery
  Conclusion

12. Using Service-Level Objectives for Reliability
  Traditional Monitoring Approaches Create Dangerous Alert Fatigue
  Threshold Alerting Is for Known-Unknowns Only
  User Experience Is a North Star
  What Is a Service-Level Objective?
  Reliable Alerting with SLOs
  Changing Culture Toward SLO-Based Alerts: A Case Study
  Conclusion

13. Acting on and Debugging SLO-Based Alerts
  Alerting Before Your Error Budget Is Empty
  Framing Time as a Sliding Window
  Forecasting to Create a Predictive Burn Alert
  The Lookahead Window
  The Baseline Window
  Acting on SLO Burn Alerts
  Using Observability Data for SLOs Versus Time-Series Data
  Conclusion

14. Observability and the Software Supply Chain
  Why Slack Needed Observability
  Instrumentation: Shared Client Libraries and Dimensions
  Case Studies: Operationalizing the Supply Chain
  Understanding Context Through Tooling
  Embedding Actionable Alerting
  Understanding What Changed
  Conclusion

Part IV. Observability at Scale

15. Build Versus Buy and Return on Investment
  How to Analyze the ROI of Observability
  The Real Costs of Building Your Own
  The Hidden Costs of Using “Free” Software
  The Benefits of Building Your Own
  The Risks of Building Your Own
  The Real Costs of Buying Software
  The Hidden Financial Costs of Commercial Software
  The Hidden Nonfinancial Costs of Commercial Software
  The Benefits of Buying Commercial Software
  The Risks of Buying Commercial Software
  Buy Versus Build Is Not a Binary Choice
  Conclusion