Practical Data Science Cookbook

89 hands-on recipes to help you complete real-world data science projects in R and Python

Tony Ojeda
Sean Patrick Murphy
Benjamin Bengfort
Abhijit Dasgupta

BIRMINGHAM - MUMBAI

Practical Data Science Cookbook

Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2014

Production reference: 1180914

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-024-6

www.packtpub.com

Cover image by Pratyush Mohanta ([email protected])

Credits

Authors: Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Reviewers: Richard Heimann, Sarah Kelley, Liang Shi, Will Voorhees
Commissioning Editor: James Jones
Acquisition Editor: James Jones
Content Development Editor: Arvind Koul
Technical Editors: Pankaj Kadam, Sebastian Rodrigues
Copy Editors: Insiya Morbiwala, Sayanee Mukherjee, Stuti Srivastava
Project Coordinator: Priyanka Goel
Proofreaders: Simran Bhogal, Maria Gould, Ameesha Green, Paul Hindle, Kevin McGowan, Lucy Rowland
Indexers: Rekha Nair, Priya Sane
Graphics: Abhinash Sahu
Production Coordinator: Adonia Jones
Cover Work: Adonia Jones

About the Authors

Tony Ojeda is an accomplished data scientist and entrepreneur, with expertise in business process optimization and over a decade of experience creating and implementing innovative data products and solutions. He has a Master's degree in Finance from Florida International University and an MBA with concentrations in Strategy and Entrepreneurship from DePaul University. He is the founder of District Data Labs, a cofounder of Data Community DC, and is actively involved in promoting data science education through both organizations.

First and foremost, I'd like to thank my coauthors for the tireless work they put in to make this book something we can all be proud to say we wrote together. I hope to work on many more projects and achieve many great things with you in the future. I'd like to thank our reviewers, specifically Will Voorhees and Sarah Kelley, for reading every single chapter of the book and providing excellent feedback on each one. This book owes much of its quality to their great advice and suggestions. I'd also like to thank my family and friends for their support and encouragement in just about everything I do.
Last, but certainly not least, I'd like to thank my fiancée and partner in life, Nikki, for her patience, understanding, and willingness to stick with me throughout all my ambitious undertakings, this book being just one of them. I wouldn't dare take risks and experiment with nearly as many things professionally if my personal life were not the stable, loving, supportive environment she provides.

Sean Patrick Murphy spent 15 years as a senior scientist at The Johns Hopkins University Applied Physics Laboratory, where he focused on machine learning, modeling and simulation, signal processing, and high-performance computing in the cloud. Now, he acts as an advisor and data consultant for companies in SF, NY, and DC. He graduated from The Johns Hopkins University and earned his MBA from the University of Oxford. He currently co-organizes the Data Innovation DC meetup and cofounded the Data Science MD meetup. He is also a board member and cofounder of Data Community DC.

Benjamin Bengfort is an experienced data scientist and Python developer who has worked in the military, industry, and academia for the past 8 years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, doing research in Metacognition and Natural Language Processing. He holds a Master's degree in Computer Science from North Dakota State University, where he taught undergraduate Computer Science courses. He is also an adjunct faculty member at Georgetown University, where he teaches Data Science and Analytics. Benjamin has been involved in two data science start-ups in the DC region, leveraging large-scale machine learning and Big Data techniques across a variety of applications. He has a deep appreciation for the combination of models and data for entrepreneurial effect, and he is currently building one of these start-ups into a more mature organization.
I'd like to thank Will Voorhees for his tireless support in everything I've been doing, even agreeing to review my technical writing. He made my chapters understandable, and I'm thankful that he reads what I write. It's been essential to my career and sanity to have a classmate, a colleague, and a friend like him. I'd also like to thank my coauthors, Tony and Sean, for working their butts off to make this book happen; it was a spectacular effort on their part. I'd also like to thank Sarah Kelley for her input and fresh take on the material; so far, she's gone on many adventures with us, and I'm looking forward to the time when I get to review her books! Finally, I'd especially like to thank my wife, Jaci, who puts up with a lot, especially when I bite off more than I can chew and end up working late into the night. Without her, I wouldn't be writing anything at all. She is an inspiration, and of the writers in my family, she is the one whom students will be reading, even a hundred years from now.

Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years of experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting. He has a PhD in Biostatistics from the University of Washington and over 40 collaborative peer-reviewed manuscripts, with strong interests in bridging the statistics/machine-learning divide. He is always on the lookout for interesting and challenging projects, and is an enthusiastic speaker and discussant on new and better ways to look at and analyze data. He is a member of Data Community DC and a founding member and co-organizer of Statistical Programming DC (formerly, R Users DC).

About the Reviewers

Richard Heimann is a technical fellow and Chief Data Scientist at L-3 National Security Solutions (NSS) (NYSE:LLL), and is also an EMC-certified data scientist with concentrations in spatial statistics, data mining, and Big Data.
Richard also leads the data science team at the L-3 Data Tactics Business Unit. L-3 NSS and L-3 Data Tactics are both premier Big Data and analytics service providers based in Washington DC and serve customers globally. Richard is an adjunct professor at the University of Maryland, Baltimore County, where he teaches Spatial Analysis and Statistical Reasoning. Additionally, he is an instructor at George Mason University, teaching Human Terrain Analysis; he is also a selection committee member for the 2014-2015 AAAS Big Data and Analytics Fellowship Program and a member of the WashingtonExec Big Data Council. Richard has recently published a book titled Social Media Mining with R, Packt Publishing. He recently provided analytical support to DARPA, DHS, the US Army, and the Pentagon.

Sarah Kelley is a junior Python developer and aspiring data scientist. She currently works at a start-up in Bethesda, Maryland, where she spends most of her time on data ingestion and wrangling. Sarah holds a Master's degree in Education from Seattle University. She is a self-taught programmer who became interested in the field through her desire to inspire her students to pursue careers in mathematics, science, and technology.

Liang Shi received his PhD in Computer Science and a Master's degree in Statistics from the University of Georgia in 2008 and 2006, respectively. His PhD research was on machine learning and AI, mainly on solving surrogate model-assisted optimization problems. After graduation, he joined the Data Mining Research team at McAfee, where his job was to detect network threats through machine-learning approaches based on Big Data and cloud computing platforms. He later joined Microsoft as a software engineer and continued his security research and development using machine-learning algorithms, mainly for online advertisement fraud detection at very large, real-time data scales.
In 2012, he rejoined McAfee (Intel) as a senior researcher, conducting network threat research, again with the help of machine-learning and cloud computing techniques. Early this year, he joined Pivotal as a senior data scientist; his work is mainly on data science projects with clients from popular companies, primarily for IT and security data analytics. He is very familiar with statistical and machine-learning modeling and theories, and he is proficient with many programming languages and analytical tools. He has several journal and conference-proceedings publications, and he has also published a book chapter.

Will Voorhees is a software developer with experience in all sorts of interesting things, from mobile app development and natural language processing to infrastructure security. After teaching English in Austria and bootstrapping an education technology start-up, he moved to the West Coast, joined a big tech company, and is now happily working on infrastructure security software used by thousands of developers. In his free time, Will enjoys reviewing technical books, watching movies, and convincing his dog that she's a good girl, yes she is.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library.
Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Preparing Your Data Science Environment
    Introduction
    Understanding the data science pipeline
    Installing R on Windows, Mac OS X, and Linux
    Installing libraries in R and RStudio
    Installing Python on Linux and Mac OS X
    Installing Python on Windows
    Installing the Python data stack on Mac OS X and Linux
    Installing extra Python packages
    Installing and using virtualenv

Chapter 2: Driving Visual Analysis with Automobile Data (R)
    Introduction
    Acquiring automobile fuel efficiency data
    Preparing R for your first project
    Importing automobile fuel efficiency data into R
    Exploring and describing fuel efficiency data
    Analyzing automobile fuel efficiency over time
    Investigating the makes and models of automobiles

Chapter 3: Simulating American Football Data (R)
    Introduction
    Acquiring and cleaning football data
    Analyzing and understanding football data
    Constructing indexes to measure offensive and defensive strength
    Simulating a single game with outcomes decided by calculations
    Simulating multiple games with outcomes decided by calculations