Data Wrangling with Python
TIPS AND TOOLS TO MAKE YOUR LIFE EASIER
Jacqueline Kazil & Katharine Jarmul

Praise for Data Wrangling with Python

“This should be required reading for any new data scientist, data engineer or other technical data professional. This hands-on, step-by-step guide is exactly what the field needs and what I wish I had when I first starting manipulating data in Python. If you are a data geek that likes to get their hands dirty and that needs a good definitive source, this is your book.”
—Dr. Tyrone Grandison, CEO, Proficiency Labs Intl.

“There’s a lot more to data wrangling than just writing code, and this well-written book tells you everything you need to know. This will be an invaluable step-by-step resource at a time when journalism needs more data experts.”
—Randy Picht, Executive Director of the Donald W. Reynolds Journalism Institute at the Missouri School of Journalism

“Few resources are as comprehensive and as approachable as this book. It not only explains what you need to know, but why and how. Whether you are new to data journalism, or looking to expand your capabilities, Katharine and Jacqueline’s book is a must-have resource.”
—Joshua Hatch, Senior Editor, Data and Interactives, The Chronicle of Higher Education and The Chronicle of Philanthropy

“A great survey course on everything—literally everything—that we do to tell stories with data, covering the basics and the state of the art. Highly recommended.”
—Brian Boyer, Visuals Editor, NPR

“Data Wrangling with Python is a practical, approachable guide to learning some of the most common tasks you’ll ever have to do with code: find, extract, tidy and examine data.”
—Chrys Wu, technologist

“This book is a useful response to a question I often get from journalists: ‘I’m pretty good using spreadsheets, but what should I learn next?’ Although not aimed solely at a journalism readership, Data Wrangling with Python provides a clear path for anyone who is using spreadsheets and wondering how to improve her skills to obtain, clean, and analyze data. It covers everything from how to load and examine text files to automated screen-scraping to new command-line tools for performing data analysis and visualizing the results.

“I followed a well-worn path to analyzing data and finding meaning in it: I started with spreadsheets, followed by relational databases and mapping programs. They are still useful tools, but they don’t take full advantage of automation, which enables users to process more data and to replicate their work. Nor do they connect seamlessly to the wide range of data available on the Internet. Next to these pillars we need to add another: a programming language. While I’ve been working with Python and other languages for a while now, that use has been haphazard rather than methodical.

“Both the case for working with data and the sophistication of tools has advanced during the past 20 years, which makes it more important to think about a common set of techniques. The increased availability of data (both structured and unstructured) and the sheer volume of it that can be stored and analyzed has changed the possibilities for data analysis: many difficult questions are now easier to answer, and some previously impossible ones are within reach. We need a glue that helps to tie together the various parts of the data ecosystem, from JSON APIs to filtering and cleaning data to creating charts to help tell a story.

“In this book, that glue is Python and its robust suite of tools and libraries for working with data. If you’ve been feeling like spreadsheets (and even relational databases) aren’t up to answering the kinds of questions you’d like to ask, or if you’re ready to grow beyond these tools, this is a book for you. I know I’ve been waiting for it.”
—Derek Willis, News Applications Developer at ProPublica and Cofounder of OpenElections

Data Wrangling with Python
Jacqueline Kazil and Katharine Jarmul
Boston

Data Wrangling with Python
by Jacqueline Kazil and Katharine Jarmul

Copyright © 2016 Jacqueline Kazil and Kjamistan, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Meghan Blanchette
Editor: Dawn Schanafelt
Production Editor: Matthew Hacker
Copyeditor: Rachel Head
Proofreader: Jasmine Kwityn
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

February 2016: First Edition

Revision History for the First Edition
2016-02-02 First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491948811 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Wrangling with Python, the cover image of a blue-lipped tree lizard, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-4919-4881-1
[LSI]

Table of Contents

Preface

1. Introduction to Python
   Why Python
   Getting Started with Python
   Which Python Version
   Setting Up Python on Your Machine
   Test Driving Python
   Install pip
   Install a Code Editor
   Optional: Install IPython
   Summary

2. Python Basics
   Basic Data Types
   Strings
   Integers and Floats
   Data Containers
   Variables
   Lists
   Dictionaries
   What Can the Various Data Types Do?
   String Methods: Things Strings Can Do
   Numerical Methods: Things Numbers Can Do
   List Methods: Things Lists Can Do
   Dictionary Methods: Things Dictionaries Can Do
   Helpful Tools: type, dir, and help
   type
   dir
   help
   Putting It All Together
   What Does It All Mean?
   Summary

3. Data Meant to Be Read by Machines
   CSV Data
   How to Import CSV Data
   Saving the Code to a File; Running from Command Line
   JSON Data
   How to Import JSON Data
   XML Data
   How to Import XML Data
   Summary

4. Working with Excel Files
   Installing Python Packages
   Parsing Excel Files
   Getting Started with Parsing
   Summary

5. PDFs and Problem Solving in Python
   Avoid Using PDFs!
   Programmatic Approaches to PDF Parsing
   Opening and Reading Using slate
   Converting PDF to Text
   Parsing PDFs Using pdfminer
   Learning How to Solve Problems
   Exercise: Use Table Extraction, Try a Different Library
   Exercise: Clean the Data Manually
   Exercise: Try Another Tool
   Uncommon File Types
   Summary

6. Acquiring and Storing Data
   Not All Data Is Created Equal
   Fact Checking
   Readability, Cleanliness, and Longevity
   Where to Find Data
   Using a Telephone
   US Government Data
   Government and Civic Open Data Worldwide
   Organization and Non-Government Organization (NGO) Data
   Education and University Data
   Medical and Scientific Data
   Crowdsourced Data and APIs
   Case Studies: Example Data Investigation
   Ebola Crisis
   Train Safety
   Football Salaries
   Child Labor
   Storing Your Data: When, Why, and How?
   Databases: A Brief Introduction
   Relational Databases: MySQL and PostgreSQL
   Non-Relational Databases: NoSQL
   Setting Up Your Local Database with Python
   When to Use a Simple File
   Cloud-Storage and Python
   Local Storage and Python
   Alternative Data Storage
   Summary

7. Data Cleanup: Investigation, Matching, and Formatting
   Why Clean Data?
   Data Cleanup Basics
   Identifying Values for Data Cleanup
   Formatting Data
   Finding Outliers and Bad Data
   Finding Duplicates
   Fuzzy Matching
   RegEx Matching
   What to Do with Duplicate Records
   Summary

8. Data Cleanup: Standardizing and Scripting
   Normalizing and Standardizing Your Data
   Saving Your Data
   Determining What Data Cleanup Is Right for Your Project
   Scripting Your Cleanup
   Testing with New Data
   Summary

9. Data Exploration and Analysis
   Exploring Your Data
   Importing Data
   Exploring Table Functions
   Joining Numerous Datasets
   Identifying Correlations
   Identifying Outliers
   Creating Groupings
   Further Exploration
   Analyzing Your Data
   Separating and Focusing Your Data
   What Is Your Data Saying?
   Drawing Conclusions
   Documenting Your Conclusions
   Summary

10. Presenting Your Data
   Avoiding Storytelling Pitfalls
   How Will You Tell the Story?
   Know Your Audience
   Visualizing Your Data
   Charts
   Time-Related Data
   Maps
   Interactives
   Words
   Images, Video, and Illustrations
   Presentation Tools
   Publishing Your Data
   Using Available Sites
   Open Source Platforms: Starting a New Site
   Jupyter (Formerly Known as IPython Notebooks)
   Summary

11. Web Scraping: Acquiring and Storing Data from the Web
   What to Scrape and How
   Analyzing a Web Page
   Inspection: Markup Structure
   Network/Timeline: How the Page Loads
   Console: Interacting with JavaScript
   In-Depth Analysis of a Page
   Getting Pages: How to Request on the Internet