Data Quality Fundamentals: A Practitioner's Guide to Building Trustworthy Data Pipelines PDF

311 Pages·2022·9.546 MB·English

Preview Data Quality Fundamentals: A Practitioner's Guide to Building Trustworthy Data Pipelines

Data Quality Fundamentals
A Practitioner’s Guide to Building Trustworthy Data Pipelines
Barr Moses, Lior Gavish & Molly Vorwerck

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you’re using broken or just plain wrong? These problems affect almost every team, yet they’re usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to these questions, this book is for you.

Many data engineering teams today face the “good pipelines, bad data” problem. It doesn’t matter how advanced your data infrastructure is if the data you’re piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.

• Build more trustworthy and reliable data pipelines
• Write scripts to make data checks and identify broken pipelines with data observability
• Learn how to set and maintain data SLAs, SLIs, and SLOs
• Develop and lead data quality initiatives at your company
• Learn how to treat data services and systems with the diligence of production software
• Automate data lineage graphs across your data ecosystem
• Build anomaly detectors for your critical data assets

“A must-read for anyone who cares about data quality.”
—Debashis Saha, Data Leader, AppZen, Intuit, and eBay

Barr Moses is CEO and cofounder of Monte Carlo, creator of the data observability category. During her decade-long career in data, she served as commander of a data intelligence unit in the Israeli Air Force, a consultant at Bain & Company, and vice president of operations at Gainsight. She led O’Reilly’s first course on data quality.

Lior Gavish, CTO and cofounder of Monte Carlo, previously cofounded cybersecurity startup Sookasa, acquired by Barracuda in 2016. At Barracuda, he was senior vice president of engineering, launching award-winning ML products for fraud prevention. Lior holds an MBA from Stanford and an MSc in computer science from Tel Aviv University.

Molly Vorwerck, head of content at Monte Carlo, also served as editor-in-chief of the Uber Engineering blog and lead program manager for Uber’s technical brand team. She also led internal communications for Uber’s chief technology officer and strategy for Uber AI Labs’ research review program.

DATA
US $59.99 CAN $74.99
ISBN: 978-1-098-11204-2
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Praise for Data Quality Fundamentals

Data engineers, ETL programmers, and entire data pipeline teams need a reference and testing guide like this! As I did, they will learn the building blocks, processes, and tooling that help ensure the quality of data-intensive applications. This book adds fresh perspectives and practical test scenarios that expand the wisdom to test modern data pipelines.
—Wayne Yaddow, Data and ETL Quality Analyst

Your data investments, infrastructure, and insights don’t matter at all if you can’t trust your data. Barr, Lior, and Molly have done a tremendous job in breaking down the fundamentals of what trusting your data means and have created a very practical framework to implement data quality in enterprises. A must-read for anyone who cares about data quality.
—Debashis Saha, Data Leader, AppZen, Intuit, and eBay

As data architecture becomes increasingly distributed and the accountability for data increasingly decentralized, the focus on data quality will continue to grow.
Data Quality Fundamentals provides an important resource for engineering teams that are serious about improving the accuracy, reliability, and trust of their data through some of today’s most significant technologies and processes.
—Mammad Zadeh, Data Leader and Former VP of Engineering at Intuit

Data Quality Fundamentals
A Practitioner’s Guide to Building Trustworthy Data Pipelines
Barr Moses, Lior Gavish, and Molly Vorwerck

Beijing · Boston · Farnham · Sebastopol · Tokyo

Data Quality Fundamentals
by Barr Moses, Lior Gavish, and Molly Vorwerck

Copyright © 2022 Monte Carlo Data, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Aaron Black
Development Editor: Jill Leonard
Production Editor: Gregory Hyman
Copyeditor: Charles Roumeliotis
Proofreader: Piper Editorial Consulting, LLC
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2022: First Edition

Revision History for the First Edition
2022-09-01: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098112042 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Quality Fundamentals, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Monte Carlo Data. See our statement of editorial independence.

978-1-098-11204-2
[LSI]

To Rae and Robert, who keep things in perspective, no matter where we look.

To the Monte Carlo jellyfish and the data reliability pioneers—you know who you are. So grateful to be on this journey with you.

Table of Contents

Preface

1. Why Data Quality Deserves Attention—Now
    What Is Data Quality?
    Framing the Current Moment
    Understanding the “Rise of Data Downtime”
    Other Industry Trends Contributing to the Current Moment
    Summary

2. Assembling the Building Blocks of a Reliable Data System
    Understanding the Difference Between Operational and Analytical Data
    What Makes Them Different?
    Data Warehouses Versus Data Lakes
    Data Warehouses: Table Types at the Schema Level
    Data Lakes: Manipulations at the File Level
    What About the Data Lakehouse?
    Syncing Data Between Warehouses and Lakes
    Collecting Data Quality Metrics
    What Are Data Quality Metrics?
    How to Pull Data Quality Metrics
    Using Query Logs to Understand Data Quality in the Warehouse
    Using Query Logs to Understand Data Quality in the Lake
    Designing a Data Catalog
    Building a Data Catalog
    Summary

3. Collecting, Cleaning, Transforming, and Testing Data
    Collecting Data
    Application Log Data
    API Responses
    Sensor Data
    Cleaning Data
    Batch Versus Stream Processing
    Data Quality for Stream Processing
    Normalizing Data
    Handling Heterogeneous Data Sources
    Schema Checking and Type Coercion
    Syntactic Versus Semantic Ambiguity in Data
    Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka
    Running Analytical Data Transformations
    Ensuring Data Quality During ETL
    Ensuring Data Quality During Transformation
    Alerting and Testing
    dbt Unit Testing
    Great Expectations Unit Testing
    Deequ Unit Testing
    Managing Data Quality with Apache Airflow
    Scheduler SLAs
    Installing Circuit Breakers with Apache Airflow
    SQL Check Operators
    Summary

4. Monitoring and Anomaly Detection for Your Data Pipelines
    Knowing Your Known Unknowns and Unknown Unknowns
    Building an Anomaly Detection Algorithm
    Monitoring for Freshness
    Understanding Distribution
    Building Monitors for Schema and Lineage
    Anomaly Detection for Schema Changes and Lineage
    Visualizing Lineage
    Investigating a Data Anomaly
    Scaling Anomaly Detection with Python and Machine Learning
    Improving Data Monitoring Alerting with Machine Learning
    Accounting for False Positives and False Negatives
    Improving Precision and Recall
    Detecting Freshness Incidents with Data Monitoring
