Practical Python Data Wrangling and Data Quality
Susan E. McGregor

Copyright © 2022 Susan E. McGregor. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jessica Haberman
Development Editor: Jeff Bleiel
Production Editor: Daniel Elfanbaum
Copyeditor: Sonia Saruba
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Jose Marzan Jr.
Illustrator: Kate Dullea

December 2021: First Edition

Revision History for the First Edition
2021-12-02: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492091509 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Python Data Wrangling and Data Quality, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-09150-9
[LSI]

Preface

Welcome! If you’ve picked up this book, you’re likely one of the many millions of people intrigued by the processes and possibilities surrounding “data”: that incredible, elusive new “currency” that’s transforming the way we live, work, and even connect with one another. Most of us, for example, are vaguely aware of the fact that data, collected by our electronic devices and other activities, is being used to shape what advertisements we see, what media is recommended to us, and which search results populate first when we look for something online.

What many people may not appreciate is that the tools and skills for accessing, transforming, and generating insight from data are readily available to them. This book aims to help those people (you, if you like) do just that.

Data is not something that is only available or useful to big companies or governmental number crunchers. Being able to access, understand, and gather insight from data is a valuable skill whether you’re a data scientist or a day care worker. The tools needed to use data effectively are more accessible than ever before. Not only can you do significant data work using only free software and programming languages, you don’t even need an expensive computer. All of the exercises in this book, for example, were designed and run on a Chromebook that cost less than $500. You can even just use free online platforms through the internet connection at your local library.

The goal of this book is to provide the guidance and confidence that data novices need to begin exploring the world of data, first by accessing it, then evaluating its quality.
With those foundations in place, we’ll move on to some of the basic methods of analyzing and presenting data to generate meaningful insight. While these latter sections will be far from comprehensive (both data analysis and visualization are robust fields unto themselves), they will give you the core skills needed to generate accurate, informative analyses and visualizations using your newly cleaned and acquired data.

Who Should Read This Book?

This book is intended for true beginners; all you need are a basic understanding of how to use computers (e.g., how to download a file, open a program, copy and paste, etc.), an open mind, and a willingness to experiment. I especially encourage you to take a chance on this book if you are someone who feels intimidated by data or programming, if you’re “bad at math,” or imagine that working with data or learning to program is too hard for you.

I have spent nearly a decade teaching hundreds of people who didn’t think of themselves as “technical” the exact skills contained in this book, and I have never once had a student who was genuinely unable to get through this material. In my experience, the most challenging part of programming and working with data is not the difficulty of the material but the quality of the instruction.1 I am grateful both to the many students over the years whose questions have helped me immeasurably in finding ways to convey this material better, and for the opportunity to share what I learned from them with so many others through this book. While a book cannot truly replace the kind of support provided by a human teacher, I hope it will at least give you the tools you need to master the basics, and perhaps the inspiration to take those skills to the next level.
Folks who have some experience with data wrangling but have reached the limits of spreadsheet tools or want to expand the range of data formats they can easily access and manipulate will also find this book useful, as will those with frontend programming skills (in JavaScript or PHP, for example) who are looking for a way to get started with Python.

WHERE WOULD YOU LIKE TO GO?

In the preface to media theorist Douglas Rushkoff’s book Program or Be Programmed (OR Books), he compares the act of programming to that of driving a car. Unless you learn to program, Rushkoff writes, you are a perpetual passenger in the digital world, one who “is getting driven from place to place. Only the car has no windows and if the driver tells you there is only one supermarket in the county, you have to believe him.”

“You can relegate your programming to others,” Rushkoff continues, “but then you have to trust them that their programs are really doing what you’re asking, and in a way that is in your best interests.” More and more these days, the latter assertion is being thrown into question.

Over the years, I’ve asked several hundred students if they believe anyone can learn to drive, and the answer has always been yes. At the same time, I have met few people, apart from myself, who truly believe that anyone can program. Yet driving a motor vehicle is, in reality, vastly more complex than programming a computer. Why, then, do so many of us imagine that programming will be “too hard” for us?

For me, this is where the real strength of Rushkoff’s analogy shows, because his “windowless car” doesn’t just hide the outside world from the passenger; it also hides the “driver” from passersby. It’s easy to believe that anyone can drive a car because we actually see all kinds of people driving cars, every day.
When it comes to programming, though, we rarely get to see who is “behind the wheel,” which means our ideas about who can and should program are largely defined by media that portray programmers as typically white and overwhelmingly male. As a result, those characteristics have come to dominate who does program, but there’s no reason why it should. Because if you can drive a car, or even punctuate a sentence, I promise you can program a computer, too.

Who Shouldn’t Read This Book?

As noted previously, this book is intended for beginners. So while you may find some sections useful if you are new to data analysis or visualization, this volume is not designed to serve those with prior experience in Python or another data-focused programming language (like R). Fortunately, O’Reilly has many specialized volumes that deal with advanced Python topics and libraries, such as Wes McKinney’s Python for Data Analysis (O’Reilly) or the Python Data Science Handbook by Jake VanderPlas (O’Reilly).

What to Expect from This Volume

The content of this book is designed to be followed in the order presented, as the concepts and exercises in each chapter build on those explored previously. Throughout, however, you will find that exercises are presented in two ways: as code “notebooks” and as “standalone” programming files. The purpose of this is twofold. First, it allows you, the reader, to use whichever approach you prefer or find more accessible; second, it provides a way to compare these two methods of interacting with data-driven Python code.

In my experience, Python “notebooks” are extremely useful for getting up and running quickly but can become tedious if you develop a reliable piece of code that you wish to run repeatedly. Since the code from one format often cannot simply be copied and pasted to the other, both are provided in the accompanying GitHub repo. Data files, too, are available via Google Drive.
As you follow along with the exercises, you will be able to use the format you prefer and will also have the option of seeing the differences in the code for each format firsthand.

Although Python is the primary tool used in this book, effective data wrangling and analysis are made easier through the smart use of a range of tools, from text editors (the programs in which you will actually write your code) to spreadsheet programs. Because of this, there are occasional exercises in this book that rely on other free and/or open source tools besides Python. Wherever these are introduced, I will offer some context as to why that tool has been chosen, along with sufficient instructions to complete the example task.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Monospaced
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Monospaced bold
Shows commands or other text that should be typed literally by the user.

Monospaced italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/PracticalPythonDataWranglingAndQuality.

If you have a technical question or a problem using the code examples, please send email to [email protected].

The code in this book is here to help you develop your skills. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code.
For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Practical Python Data Wrangling and Data Quality by Susan E. McGregor (O’Reilly). Copyright 2022 Susan E. McGregor, 978-1-492-09150-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Online Learning